
Though it seems like it should be pretty straightforward, I have been unable to configure Apache so that Googlebot's requests are not stored in the access log. I've tried the following lines:

SetEnvIfNoCase User-Agent googlebot dontlog
BrowserMatchNoCase googlebot dontlog
CustomLog "/foo/bar/access_log" combined env=!dontlog

and I restarted Apache after adding them, but the log is still recording all of Googlebot's requests. My understanding is that SetEnvIf User-Agent and BrowserMatch do the same thing. I tried each of them, but neither works.

  • This may be silly, but I think you just need to put googlebot in quotes, like `SetEnvIfNoCase User-Agent "googlebot" dontlog`. Also make sure that is the exact case and spelling of the User-Agent in the logs (I don't remember and don't see any entries in my log currently) – Joe Apr 07 '15 at 16:55
  • I checked the Apache manuals and you are definitely correct. Unfortunately, even with the quotation marks added, it's still not working. I'm at a loss. – Jonathan Basile Apr 07 '15 at 18:21
  • On a second glance, at various places in the documentation I see that the regex is sometimes within quotation marks and sometimes not. Also, I see both User-Agent and User_Agent. I've tried every combination, though, and none of them work. – Jonathan Basile Apr 07 '15 at 19:28
  • Can you post an example of a log entry that you want removed? – Joe Apr 07 '15 at 19:45
  • Sure, they all contain this at the end: (compatible; Googlebot/2.1; +h ttp://www.google.com/bot.html)" (I added that space between h and ttp because the link was truncated when I posted this originally) – Jonathan Basile Apr 07 '15 at 19:51
  • Based on that, I understand you would want your config to be `SetEnvIfNoCase User-Agent "Googlebot" dontlog` or `SetEnvIfNoCase User-Agent "Googlebot/2.1" dontlog`. It is case sensitive – Joe Apr 07 '15 at 20:28
  • Comparing the user-agent string is not a 100% accurate way to check whether a request comes from Googlebot. Malicious bots can impersonate Googlebot. – pmagunia Jan 07 '20 at 23:28

1 Answer


Find a log entry that you suspect is the Googlebot and make a note of the IP address.

Next do a lookup on that IP address with the following command:

host 66.249.64.156

Substitute the IP address you recorded earlier for the one in this command.

If the result looks something like this, then you know it's the Googlebot. Make sure the name ends in googlebot.com:

156.64.249.66.in-addr.arpa domain name pointer crawl-66-249-64-156.googlebot.com.
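
The PTR record is controlled by whoever owns the address block, so Google's verification page (linked at the end of this answer) also recommends a forward lookup on the returned name to confirm that it resolves back to the same IP. The following is a minimal shell sketch of both steps, not a definitive implementation. The function names and the domain list are assumptions made for illustration, and the parsing relies on the `host` output format shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a PTR hostname belongs to a crawler domain.
# The domain list is an assumption; adjust it for the crawlers you care about.
is_crawler_host() {
    # ${1%.} strips the trailing dot that PTR names carry.
    case "${1%.}" in
        *.googlebot.com|*.google.com|*.search.msn.com) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: forward-confirmed reverse DNS for one IP address.
verify_crawler_ip() {
    ip="$1"
    # The last field of "... domain name pointer <name>." is the PTR hostname.
    name=$(host "$ip" | awk '/domain name pointer/ {print $NF; exit}')
    [ -n "$name" ] || return 1
    is_crawler_host "$name" || return 1
    # The forward lookup must list the original address.
    host "$name" | awk -v ip="$ip" \
        'NF >= 2 && $(NF-1) == "address" && $NF == ip {ok=1} END {exit !ok}'
}
```

For example, `verify_crawler_ip 66.249.64.156 && echo verified` prints `verified` only when the reverse and forward lookups agree.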

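Next, go to your Apache virtual host configuration and add these directives, adapted for your site. SetEnvIf tests a single attribute per directive, so the combined address and user-agent check uses SetEnvIfExpr, which requires Apache 2.4 or later:

SetEnvIfExpr "%{REMOTE_ADDR} == '66.249.64.156' && %{HTTP_USER_AGENT} =~ /Googlebot/" do_not_log
CustomLog ${APACHE_LOG_DIR}/access.log combined env=!do_not_log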
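
To check whether the directives took effect, one option is to count the crawler's lines in the log before and after restarting Apache. This is a minimal sketch: `count_bot_hits` is a hypothetical helper, and the log path is whatever your CustomLog line points at.

```shell
#!/bin/sh
# Hypothetical helper: count access-log lines that mention a crawler.
# Usage: count_bot_hits /path/to/access.log Googlebot
count_bot_hits() {
    # -c prints the number of matching lines, -i ignores case,
    # -F treats the search term as a fixed string rather than a regex.
    grep -ciF -- "$2" "$1"
}
```

Run `apachectl configtest` to catch syntax errors, restart Apache, and note the count. It should stop growing. If it keeps growing, look for another CustomLog directive that writes to the same file without the `env=` condition.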

You can repeat this process for the bingbot:

host 157.55.39.247

The result should contain a name that ends in search.msn.com, like this:

247.39.55.157.in-addr.arpa domain name pointer msnbot-157-55-39-247.search.msn.com.

So you would add this additional line to the virtual host file after the Googlebot line (SetEnvIfExpr requires Apache 2.4 or later):

SetEnvIfExpr "%{REMOTE_ADDR} == '157.55.39.247' && %{HTTP_USER_AGENT} =~ /bing/" do_not_log

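A crawler often comes back from the same IP address, but if it does not, you will need to add further entries. Matching on an address prefix is more convenient, but "^66" alone matches every address that begins with 66, so use a narrower prefix taken from the addresses you verified, such as "^66\.249\.64\.".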

https://support.google.com/webmasters/answer/80553

https://blogs.bing.com/webmaster/2012/08/31/how-to-verify-that-bingbot-is-bingbot/

pmagunia