About ten days ago I moved a site - mostly a Joomla discussion board - to a new server at a different IP address. During a brief scheduled downtime I replicated the content over and completed the DNS switchover (via Cloudflare) as usual, and most traffic has followed it: all real users are able to access the site at its new location, and so are what seem to be the majority of web crawler requests.

However, I still have web crawlers attempting to access my site at the old IP. And I do mean, specifically by IP address - though they're attempting to crawl valid paths which now exist on the new server. It's primarily GoogleBot though I see a sporadic BingBot or Yahoo Slurp entry as well. Apache logs show 1-2 accesses per minute on the old server.

All three of these bots do most of their crawling on the new server, however.

I've removed the content from the old server, so these requests are met with 404s. Is there a convention for crawlers to somehow index by server IP?

Is there a way to kick them into looking at the new site? Should I actively be trying to redirect them with custom HTTP error codes?

Ryan
  • Custom HTTP error codes? What's wrong with the standard HTTP moved permanently response code? – TessellatingHeckler Nov 18 '15 at 20:42
  • Shouldn't the crawlers be accessing only by domain name, which hasn't changed? My Apache config on the old server doesn't even reflect the old name-based virtual host. – Ryan Nov 18 '15 at 20:45
  • If they originally crawled it by IP address and you served them a page, how and why would they ever know the domain name? Shouldn't you have been redirecting IP address GET requests previously to instruct the crawler to use the domain name, and only serving the site for domain name queries? – TessellatingHeckler Nov 18 '15 at 20:46
  • Are you logging the `Host` header sent by these crawlers? If they are sending an IP address in the `Host` header, it indicates that they somehow found URLs with an IP address in the URL. Publishing the URLs with an IP address in the first place was a mistake (we can't know whether that mistake was made by you or somebody else). Responding to those requests with a 200 code was probably a mistake too. The suggestion by @TessellatingHeckler would make sense for any access using an IP address or an unexpected name in the `Host` header. – kasperd Nov 18 '15 at 21:35
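
To check what the crawlers actually send in `Host`, something like the following could tally the values from the old server's access log. This is only a rough sketch: it assumes the log uses Apache's "combined" layout with `%{Host}i` appended as a final quoted field (that is not the default, so the `LogFormat` would have to be changed first), and the sample lines and addresses are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" format plus "%{Host}i" as a last
# quoted field. Adjust the pattern to whatever LogFormat is really in use.
LINE_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'"(?P<host>[^"]*)"$'
)

def tally_hosts(lines):
    """Count requests per Host header value; lines that don't match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("host")] += 1
    return counts

# Made-up sample lines using documentation-range IP addresses.
SAMPLE = [
    '192.0.2.1 - - [18/Nov/2015:20:40:01 +0000] "GET /forum/topic-1 HTTP/1.1" '
    '404 209 "-" "Googlebot/2.1" "203.0.113.10"',
    '192.0.2.2 - - [18/Nov/2015:20:41:12 +0000] "GET /forum/topic-2 HTTP/1.1" '
    '404 209 "-" "bingbot/2.0" "203.0.113.10"',
    '192.0.2.3 - - [18/Nov/2015:20:42:30 +0000] "GET / HTTP/1.1" '
    '404 209 "-" "Googlebot/2.1" "www.example.com"',
]

if __name__ == "__main__":
    for host, count in tally_hosts(SAMPLE).most_common():
        print(host, count)
```

If most of the stray requests carry the old IP address in `Host`, that points to IP-based URLs having been published somewhere; if they carry the domain name, stale DNS on the crawler side is the more likely cause.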

1 Answer

DNS cache refresh on these crawlers can take ridiculous amounts of time, but 10 days seems like a stretch to me. OTOH you're saying they hit your site by IP, which is certainly erroneous. TBH it sounds more like a bad link somewhere, combined with the fact that your webserver doesn't redirect IP addresses to actual FQDN URLs, so the crawler keeps browsing through the site's own relative links (but that's just an assumption).
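
For illustration, the "redirect IP addresses to FQDN URLs" rule boils down to something like this. It's a minimal sketch of the decision only, not a server config: `CANONICAL_HOST` and the function names are hypothetical, and in practice you'd express the same rule in your Apache vhost setup.

```python
import ipaddress

# Hypothetical canonical name; substitute the site's real FQDN.
CANONICAL_HOST = "www.example.com"

def is_ip_literal(host):
    """Return True if a Host header value is a bare IP address (port ignored)."""
    name = host.strip()
    if name.startswith("["):
        # Bracketed IPv6 literal, e.g. "[2001:db8::1]:80"
        name = name[1:].split("]", 1)[0]
    elif name.count(":") == 1:
        # "address:port" form, e.g. "203.0.113.10:80"
        name = name.rsplit(":", 1)[0]
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False

def redirect_for(host, path):
    """Return (301, location) for IP-based requests, or None to serve normally."""
    if is_ip_literal(host):
        return 301, f"http://{CANONICAL_HOST}{path}"
    return None

if __name__ == "__main__":
    print(redirect_for("203.0.113.10", "/forum/topic-1"))
    print(redirect_for("www.example.com", "/forum/topic-1"))
```

The path is kept in the `Location` so a crawler following a stale IP-based link lands on the same page under the domain name.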

I wouldn't bother with redirects, unless you intend to run this old server for a looong time to do just that. For example, we enforced SSL a whole year ago (with a 301 redirect), yet we still get a lot of requests over plain HTTP. And they're direct links to specific assets (like downloads), so it's not that people are typing the main address without specifying https://. As long as you keep serving it, they'll keep using it.

If these same bots crawl your new server as well, I really wouldn't worry about it. My 2 cents.

bviktor