For quite a while (over a month now) I have been seeing lines like the following in the Apache access and error logs:
180.76.15.138 - - [24/Jun/2015:16:13:34 -0400] "GET /manual/de/mod/module-dict.html HTTP/1.1" 403 396 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.159 - - [24/Jun/2015:16:28:34 -0400] "GET /manual/es/mod/mod_cache_disk.html HTTP/1.1" 403 399 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
66.249.75.86 - - [24/Jun/2015:16:18:01 -0400] "GET /manual/es/programs/apachectl.html HTTP/1.1" 403 436 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[Wed Jun 24 16:13:34.430884 2015] [access_compat:error] [pid 5059] [client 180.76.15.138:58811] AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/de/mod/module-dict.html
[Wed Jun 24 16:18:01.037146 2015] [access_compat:error] [pid 2791] [client 66.249.75.86:56362] AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/es/programs/apachectl.html
[Wed Jun 24 16:28:34.461298 2015] [access_compat:error] [pid 2791] [client 180.76.15.159:25833] AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/es/mod/mod_cache_disk.html
The requests really do seem to come from Baiduspider and Googlebot (checked using reverse DNS as explained here):
user@server:~$ host 66.249.75.86
86.75.249.66.in-addr.arpa domain name pointer crawl-66-249-75-86.googlebot.com.
user@server:~$ host crawl-66-249-75-86.googlebot.com
crawl-66-249-75-86.googlebot.com has address 66.249.75.86
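For reference, the same forward-confirmed reverse DNS check can be sketched in Python. This is only a minimal sketch: the function name `is_verified_crawler`, the `reverse`/`forward` resolver hooks, and the suffix list are my own assumptions for illustration, not something taken from the crawler operators' documentation.

```python
import socket

# Hypothetical allow-list of crawler hostname suffixes (an assumption;
# check each crawler operator's documentation for the authoritative list).
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".baidu.com", ".baidu.jp")


def is_verified_crawler(ip, reverse=None, forward=None, suffixes=CRAWLER_SUFFIXES):
    """Return True if `ip` passes a forward-confirmed reverse DNS check."""
    # The resolver hooks are injectable so the logic can be exercised offline.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda name: socket.gethostbyname_ex(name)[2])

    # Step 1: the PTR lookup must succeed and yield a crawler hostname.
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    if not hostname.endswith(suffixes):
        return False

    # Step 2: the forward lookup of that hostname must contain the same IP.
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

Calling `is_verified_crawler("66.249.75.86")` with the default hooks performs live DNS lookups, so the result depends on the resolver in use at the time.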
I have read similar questions about this topic, like this and this, but in those cases the errors actually prevent the site from working correctly. In my case, the HTML pages that the bots try to access do not exist, so a denial is the expected behaviour of Apache. The only annoyance is that Google seems slow at indexing my site, although Google Webmaster Tools does not show any errors.
I am using Apache version 2.4.7 with the following vhost configuration:
<VirtualHost *:80>
    ServerName example.com
    ServerAlias www.example.com
    DocumentRoot "/var/www/example.com/public"
    <Directory />
        Options None
        AllowOverride None
        Order Deny,Allow
        Deny from all
        Require all denied
    </Directory>
    <Directory "/var/www/example.com/public">
        Options None
        AllowOverride FileInfo Limit Options=FollowSymLinks
        Order Allow,Deny
        Allow from all
        Require all granted
    </Directory>
    ErrorLog /var/log/apache2/example.com/error.log
    CustomLog /var/log/apache2/example.com/access.log combined
</VirtualHost>
My questions are therefore:
- Why are Baiduspider and Googlebot repeatedly trying to access content on my site that is not there and is not referred to by any links on the site?
- How do requests like GET /manual/de/mod/... get mapped to /usr/share/doc/apache2-doc/manual/de/mod/... when, to my understanding, they should go to /var/www/example.com/public/manual/de/mod/...?
- In general: should I worry about those lines as a sign of misconfiguration, or is there an explanation for them?