Every day my access log contains entries like this:
66.249.78.140 - - [21/Oct/2013:14:37:00 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.78.140 - - [21/Oct/2013:14:37:01 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.78.140 - - [21/Oct/2013:14:37:01 +0200] "GET /vuqffxiyupdh.html HTTP/1.1" 404 1189 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
or this:
66.249.78.140 - - [20/Oct/2013:09:25:29 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.62 - - [20/Oct/2013:09:25:30 +0200] "GET /robots.txt HTTP/1.1" 200 112 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.78.140 - - [20/Oct/2013:09:25:30 +0200] "GET /zjtrtxnsh.html HTTP/1.1" 404 1186 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
The bot requests robots.txt twice and then tries to access a file (zjtrtxnsh.html, vuqffxiyupdh.html, ...) that cannot exist and must return a 404 error. The same procedure happens every day; only the nonexistent HTML filename changes.
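To see how regularly these probes occur, the log can be filtered for Googlebot requests that ended in a 404. A rough Python sketch; the regex assumes the combined log format shown in the excerpts above, and `googlebot_404_paths` is just an illustrative helper name:

```python
import re
from collections import Counter

# One combined-log-format line, as in the excerpts above.
# Captured groups: client IP, request path, status code, user agent.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_404_paths(lines):
    """Count request paths that a Googlebot user agent hit with a 404."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("status") == "404" and "Googlebot" in m.group("agent"):
            counts[m.group("path")] += 1
    return counts
```

Run over a few days of logs, this shows whether the random filename really changes daily while everything else stays constant.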
The content of my robots.txt:
User-agent: *
Disallow: /backend
Sitemap: http://mysitesname.de/sitemap.xml
The sitemap.xml is readable and valid, so there seems to be no reason why the bot should want to force a 404 error.
How should I interpret this behaviour? Does it point to a mistake I've made, or should I ignore it?
UPDATE
@malware I scanned my website with several online tools; nothing was found.
I have none of the standard apps like WordPress or phpMyAdmin on the server.
I receive a logwatch report every day, and there was no unauthorized SSH access or anything like that.
I have fail2ban set up.
I have restricted SSH access to public keys; root login is not allowed.
Every sudo command that logwatch reported was one I recognized as something I had done that day.
There is no file in my web directory that is new, that I did not create, or that looks weird (okay, I cannot guarantee that 100%, but everything looks okay).
I've run a full clamscan on the server without any findings.
The software packages are up to date.
What else can I do?
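One further check worth running: Google documents that a genuine Googlebot IP can be verified with forward-confirmed reverse DNS, i.e. the IP's PTR record should end in googlebot.com or google.com, and that hostname should resolve back to the same IP. A rough Python sketch of that check (the function names are my own; the lookups hit the network, so results depend on your resolver):

```python
import socket

# Domains Google says its crawler hostnames fall under.
GOOGLE_DOMAINS = (".googlebot.com", ".google.com")

def is_google_hostname(hostname: str) -> bool:
    """True if a reverse-DNS name falls under Google's crawler domains."""
    return hostname.rstrip(".").endswith(GOOGLE_DOMAINS)

def verify_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP:
    the PTR record must point into googlebot.com/google.com, and that
    hostname's A records must include the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # reverse (PTR) lookup
    except OSError:
        return False
    if not is_google_hostname(hostname):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(hostname)  # forward (A) lookup
    except OSError:
        return False
    return ip in addrs
```

If `verify_googlebot("66.249.78.140")` comes back True, the requests really are from Google and the robots.txt-plus-random-404 pattern is nothing to worry about; if it comes back False, the user agent is being spoofed and blocking the IP would be reasonable.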