For quite a while (over a month now) I have been seeing lines like the following in the Apache logs:

180.76.15.138 - - [24/Jun/2015:16:13:34 -0400] "GET /manual/de/mod/module-dict.html HTTP/1.1" 403 396 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.159 - - [24/Jun/2015:16:28:34 -0400] "GET /manual/es/mod/mod_cache_disk.html HTTP/1.1" 403 399 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
66.249.75.86 - - [24/Jun/2015:16:18:01 -0400] "GET /manual/es/programs/apachectl.html HTTP/1.1" 403 436 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[Wed Jun 24 16:13:34.430884 2015] [access_compat:error] [pid 5059] [client 180.76.15.138:58811] AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/de/mod/module-dict.html
[Wed Jun 24 16:18:01.037146 2015] [access_compat:error] [pid 2791] [client 66.249.75.86:56362] AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/es/programs/apachectl.html
[Wed Jun 24 16:28:34.461298 2015] [access_compat:error] [pid 2791] [client 180.76.15.159:25833] AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/es/mod/mod_cache_disk.html

The requests seem to really come from Baiduspider and Googlebot (checked using reverse DNS as explained here):

user@server:~$ host 66.249.75.86
86.75.249.66.in-addr.arpa domain name pointer crawl-66-249-75-86.googlebot.com.
user@server:~$ host crawl-66-249-75-86.googlebot.com
crawl-66-249-75-86.googlebot.com has address 66.249.75.86
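
The same forward-confirmed reverse DNS check can be scripted. Below is a minimal Python sketch, not an official verification tool: the function names and the TRUSTED_SUFFIXES tuple are hypothetical, and the single suffix is taken from the host output above. Other crawlers' domains would have to be added after checking each operator's documentation.

```python
import socket
from typing import Callable, List, Sequence

# Assumption: suffix taken from the `host` output above; extend as needed.
TRUSTED_SUFFIXES = (".googlebot.com",)


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(
    ip: str,
    suffixes: Sequence[str] = TRUSTED_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Check that ip -> hostname -> ip round-trips within a trusted domain."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname.endswith(tuple(suffixes)):
        return False  # PTR points outside the trusted domains
    try:
        return ip in forward(hostname)  # forward lookup must confirm the IP
    except OSError:
        return False
```

Calling is_verified_crawler("66.249.75.86") performs real DNS lookups. The reverse and forward parameters exist only so that the resolvers can be swapped out, for example in tests.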

I have read similar questions on this topic, like this and this, but in those cases the errors were actually preventing the site from working correctly. In my case, the HTML pages that the bots try to access do not exist, so this is the expected behaviour of Apache. The only annoyance is that Google seems slow at indexing my site, although Google Webmaster Tools does not show any errors.

I am using Apache version 2.4.7 with the following vhost configuration:

<VirtualHost *:80>
    ServerName example.com
    ServerAlias www.example.com

    DocumentRoot "/var/www/example.com/public"
    <Directory />
        Options None
        AllowOverride None
        Order Deny,Allow
        Deny from all
        Require all denied
    </Directory>
    <Directory "/var/www/example.com/public">
        Options None
        AllowOverride FileInfo Limit Options=FollowSymLinks 
        Order Allow,Deny
        Allow from all
        Require all granted
    </Directory>

    ErrorLog /var/log/apache2/example.com/error.log
    CustomLog /var/log/apache2/example.com/access.log combined
</VirtualHost>

My questions are therefore:

  1. Why are Baiduspider and Googlebot repeatedly trying to access content that does not exist on my site and is not referred to by any links on the site?
  2. How do requests like GET /manual/de/mod/... get mapped to /usr/share/doc/apache2-doc/manual/de/mod/... when, to my understanding, they should go to /var/www/example.com/public/manual/de/mod/...?
  3. In general: should I worry about those lines as a sign of misconfiguration, or is there an explanation for them?
matpen

3 Answers

In 2.2, access control based on client hostname, IP address, and other characteristics of client requests was done using the directives Order, Allow, Deny, and Satisfy.

In 2.4, such access control is done in the same way as other authorization checks, using the new module mod_authz_host. The old access control idioms should be replaced by the new authentication mechanisms, although for compatibility with old configurations, the new module mod_access_compat is provided.

It looks like you've already set the new Require directive, so just remove the deprecated access directives (Order, Allow, Deny) and run sudo service apache2 reload.
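
As a sketch of what that could look like, here are the two Directory blocks from the question with the deprecated directives dropped and everything else left as posted. This only illustrates the suggestion and has not been tested on your server:

```apache
<Directory />
    Options None
    AllowOverride None
    # 2.4-style authorization only; Order/Deny removed
    Require all denied
</Directory>
<Directory "/var/www/example.com/public">
    Options None
    AllowOverride FileInfo Limit Options=FollowSymLinks
    # 2.4-style authorization only; Order/Allow removed
    Require all granted
</Directory>
```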

Cees Timmerman
  • Thank you for pointing that out. In fact, the configuration as it is now is redundant. I will remove the Order, Deny and Allow directives. – matpen Jul 19 '15 at 08:57

Since some time has passed without any answer, I decided to (partially) answer my own question according to my research so far.

  1. Unfortunately, the question of why Googlebot and Baiduspider are trying to access the Apache documentation through my server remains unanswered.
  2. The /manual/... URLs get mapped to /usr/share/doc/apache2-doc/manual/... by an Alias that comes pre-installed on Ubuntu, presumably to make the documentation convenient to access. I do not need it, so I removed the Alias by running a2disconf apache2-doc followed by service apache2 reload.
  3. There is no reason to regard the log entries as signs of misconfiguration, as they are rather the desired behaviour. Before removing the Alias, access to the documentation was blocked by the vhost config, which returned a 403 "Forbidden" status code. After removing the Alias, the server correctly returns a 404 "Not Found" status code.
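
To picture the mapping from point 2, here is a small Python sketch of prefix-based alias translation. It is a simplified illustration and not Apache's actual implementation: ALIASES, DOCUMENT_ROOT and translate are hypothetical names, and the paths are the ones from the log lines and the vhost in the question.

```python
# Assumed values, taken from the log lines and the vhost in the question.
DOCUMENT_ROOT = "/var/www/example.com/public"
ALIASES = {"/manual": "/usr/share/doc/apache2-doc/manual"}


def translate(url_path: str) -> str:
    """Map a URL path to a filesystem path, checking aliases before the document root."""
    for prefix, target in ALIASES.items():
        # Match whole path segments only: /manual and /manual/..., but not /manuals.
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    # No alias matched, so the request is served relative to the document root.
    return DOCUMENT_ROOT + url_path


print(translate("/manual/de/mod/module-dict.html"))
print(translate("/index.html"))
```

With the alias in place, the first request lands outside the document root, which is why the error log shows the /usr/share/doc/apache2-doc/... path. With an empty ALIASES table, the same request would fall through to the document root instead.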
matpen
  • For question 1, there are DNS errors every day, and it's quite common for other sites to be resolved to your IP by mistake. – Shiji.J Jul 10 '15 at 10:22
  • @Shiji.Jiang this might be an explanation. I am starting to see other similar log entries, with different URLs. Thank you for pointing that out. – matpen Jul 19 '15 at 08:55

Make sure the options below have proper values and are not commented out in global.conf:

  • ServerName localhost (or your server name, not FQDN)
  • DirectoryIndex proper_value (in my case it was Login.php