
Recently, we have been hammered by Googlebot and all kinds of other bots (on average, 60% of our website traffic comes from bots). We are trying to segment the Googlebot traffic out to a different, lower-cost server. However, the databases would need to be either replicated or mirrored. Is one solution better than the other if we want something close to real time? Our production servers' data currently lives on a SAN. We could replicate that, but it works more like snapshot replication.

sqlbuzz

3 Answers

Don't "segment out" the spiders.

Trying to "segment out" WWW spiders is fighting against the WWW spider owners, who want, as far as possible, their spiders to see what everyone else sees. Go down that route, and you'll find yourself in a constant arms race with the spider owners.

Check your site design.

High spider traffic is sometimes symptomatic of bad site design. For example: Hyperlinks whose URLs contain session IDs will cause spiders to see and to crawl single pages multiple times. Check your content HTTP server logs for what the spider traffic actually is. If things are being crawled over and over, varying only by such things as session IDs, then adjust your site not to have this problem. See Google's technical guidelines for more errors in this vein to check for and to fix.
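
If you want to quantify this, a rough sketch along these lines counts how often Googlebot fetched what is really the same page once session-style query parameters are stripped. It assumes a combined-format access log; the log path and the session parameter names are assumptions you'd adjust for your own setup:

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl, urlencode

LOG_FILE = "/var/log/apache2/access.log"                          # assumed log path
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}  # assumed names

def canonical(url):
    """Strip session-style query parameters so duplicates line up."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    return parts.path + ("?" + urlencode(query) if query else "")

counts = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        try:
            request = line.split('"')[1]   # e.g. 'GET /page?sid=abc HTTP/1.1'
            url = request.split()[1]
        except IndexError:
            continue                        # malformed line; skip it
        counts[canonical(url)] += 1

for url, hits in counts.most_common(20):
    if hits > 1:
        print(f"{hits:6d}  {url}")
```

Any URL that shows up here with a large hit count is a candidate for the session-ID problem described above.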

Use the tools provided to you as a last resort.

Google provides an adjuster knob for its crawl rate in its Webmaster Tools. If you've checked that your site adheres to the technical guidelines and your site design isn't the root cause of the high crawl traffic, use it. But note that if you have to keep doing this every 90 days to hold the crawl rate down on static content, then there's most likely something wrong with your site design that you haven't yet found and fixed.

JdeBP

Does your data really change that much? Could you offer the bots a less frequently updated version of your website on the proposed lower-cost server? You might then be able to refresh that data overnight.
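
As a minimal sketch of what serving that could look like, assuming a WSGI stack: bots get a pre-generated static copy at the same URLs, while humans fall through to the live application. The bot list and directory path are illustrative assumptions, and bear in mind the caveats about segmenting spiders in the answer above:

```python
import os

STATIC_ROOT = "/var/www/bot-snapshot"            # assumed snapshot directory
BOT_TOKENS = ("googlebot", "bingbot", "yandex")  # assumed bot signatures

def stale_copy_middleware(app):
    """Serve bots a pre-generated static copy; humans get the live app."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in BOT_TOKENS):
            path = environ.get("PATH_INFO", "/")
            if path.endswith("/"):
                path += "index.html"
            full = os.path.normpath(
                os.path.join(STATIC_ROOT, path.lstrip("/")))
            # The prefix check stops '..' traversal out of the snapshot dir.
            if full.startswith(STATIC_ROOT) and os.path.isfile(full):
                with open(full, "rb") as f:
                    body = f.read()
                start_response("200 OK", [
                    ("Content-Type", "text/html; charset=utf-8"),
                    ("Content-Length", str(len(body))),
                ])
                return [body]
        return app(environ, start_response)
    return middleware
```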

Database mirroring for SQL Server doesn't really allow you to use the mirror for querying, unless you use database snapshots for read-only access, and that is an Enterprise edition feature. Things change with the next release of SQL Server, but that's still some time off.
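
For illustration, creating a queryable snapshot on the mirror looks roughly like this; a minimal sketch using pyodbc, where the server, database, logical file name, and file path are all assumptions. Because a snapshot is a static point-in-time view, a scheduled job would typically drop and recreate it:

```python
import pyodbc

# Connect to the mirror instance. CREATE DATABASE cannot run inside a
# transaction, so autocommit is required. All names here are assumptions.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=mirror-server;DATABASE=master;"
    "Trusted_Connection=yes;",
    autocommit=True,
)
conn.execute("""
    CREATE DATABASE WebData_Snapshot_AM ON
        (NAME = WebData_Data,                        -- logical data file name
         FILENAME = 'D:\\Snapshots\\WebData_AM.ss')  -- sparse snapshot file
    AS SNAPSHOT OF WebData;
""")
conn.close()
```

Read-only bot queries would then point at WebData_Snapshot_AM rather than the mirror database itself.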

Database mirroring is also per database, so if several databases make up the solution, you need to mirror them all.

Replication is more about moving a subset of the data, though many might disagree with this. The more data you shift with any technology, the more bandwidth you need, or it'll start to fall behind.
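
For a sense of what "a subset" means in practice, here is a minimal sketch of a filtered transactional publication, run on the publisher via pyodbc. The publication, table, and filter are hypothetical, a configured distributor is assumed, and real setups involve more steps (subscriptions, agents) than shown here:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=prod-server;DATABASE=WebData;"
    "Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Enable the database for publishing.
cur.execute("EXEC sp_replicationdboption @dbname = N'WebData', "
            "@optname = N'publish', @value = N'true'")

# A transactional publication holding just the bot-facing content.
cur.execute("EXEC sp_addpublication @publication = N'BotContentPub', "
            "@status = N'active'")

# Publish one table as an article...
cur.execute("EXEC sp_addarticle @publication = N'BotContentPub', "
            "@article = N'Pages', @source_owner = N'dbo', "
            "@source_object = N'Pages'")

# ...and row-filter it so only publicly served rows are shipped.
cur.execute("EXEC sp_articlefilter @publication = N'BotContentPub', "
            "@article = N'Pages', @filter_name = N'FLT_Pages_Public', "
            "@filter_clause = N'IsPublic = 1'")
conn.close()
```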

Perhaps one solution would be to offer the bots a more static version of your website, which gets refreshed periodically by a scheduled process.
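
That process could be as simple as a nightly job that pulls pages from the live site and writes them into the static directory the bots are served from. A minimal sketch, where the host, paths, and URL list are assumptions (in practice you would read the URL list from a sitemap):

```python
import os
import urllib.request

LIVE_HOST = "http://www.example.com"    # assumed live site
STATIC_ROOT = "/var/www/bot-snapshot"   # directory the bot copy is served from
URLS = ["/", "/products/", "/news/"]    # assumed; read from a sitemap in practice

for url in URLS:
    rel = url.lstrip("/")
    if url.endswith("/"):
        rel = os.path.join(rel, "index.html")
    target = os.path.join(STATIC_ROOT, rel)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with urllib.request.urlopen(LIVE_HOST + url) as resp:
        with open(target, "wb") as out:
            out.write(resp.read())
```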

Peter Schofield

Thanks for the responses. I am guessing that I will give replication a shot and see how it goes; I will only have the replication running at night.

@JdeBP I have already tried doing that, and even tried setting the crawl rate to the minimum; it did not help in my case. Also, this is for almost 4,000 websites.

sqlbuzz