
This morning I noticed a single IP address more or less crawling my website: it was querying the same page many times within a few minutes. Then I noticed that it was doing so with different user agents.

I decided to check what was going on by analyzing the Apache httpd logs:

  cut -d' ' -f1 /var/log/apache2/*access.log | # Extract all IP addresses from the server logs
  sort -u |                                    # List every IP address only once
  while read -r ip; do                         # Cycle through the list of IP addresses
    printf '%s\t' "$ip";                       # Print the IP address (data goes in an argument, not the format string)
    grep "^$ip " /var/log/apache2/*access.log | # Select log entries for this IP address (trailing space avoids prefix matches)
    sed 's/^.*\("[^"]*"\)$/\1/' |              # Extract the user-agent (the last quoted field)
    sort -u |                                  # Create a list of unique user-agents
    wc -l;                                     # Count them
  done |
  tee >( cat >&2; echo '=== SORTED ===' >&2 ) | # Suspense is killing me, I want to see progress on stderr while the script runs...
  sort -nk2 |                                  # Sort the list by number of different user-agents
  cat -n                                       # Add line numbers

Which results in a long list:

  line  IP address      number of different user-agents used
...
  1285  176.213.0.34    15
  1286  176.213.0.59    15
  1287  5.158.236.154   15
  1288  5.158.238.157   15
  1289  5.166.204.48    15
  1290  5.166.212.42    15
  1291  176.213.28.54   16
  1292  5.166.212.10    16
  1293  176.213.28.32   17
  1294  5.164.236.40    17
  1295  5.158.238.6     18
  1296  5.158.239.1     18
  1297  5.166.208.39    18
  1298  176.213.20.0    19
  1299  5.164.220.43    19
  1300  5.166.208.35    19
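
As a side note: re-reading the whole log once per address makes the loop above quadratic; a single awk pass produces the same counts much faster. A minimal sketch, assuming the default combined log format:

  # One pass over the logs: count distinct user-agents per client IP.
  # Assumes the combined log format, where the user-agent is the last quoted field.
  awk -F'"' '{
    split($1, head, " ")                 # The first space-separated token is the client IP
    key = head[1] SUBSEP $(NF-1)         # Pair up IP and user-agent
    if (!(key in seen)) { seen[key] = 1; n[head[1]]++ }
  }
  END { for (ip in n) printf "%s\t%d\n", ip, n[ip] }' /var/log/apache2/*access.log |
  sort -nk2                              # Same ordering as before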

So there are tens of IP addresses fiddling with the user agent over a span of a couple of minutes. I checked the top 50 IP addresses against my private little log of known bots, but no matches there.
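
The cross-check itself is a one-liner. A sketch, with known-bots.txt standing in for my private list and results.txt for the saved output of the pipeline above:

  # Take the 50 IPs that used the most user-agents and look them up in the
  # bot list (known-bots.txt and results.txt are placeholder file names)
  tail -n 50 results.txt |    # Last 50 lines hold the highest counts
  awk '{print $2}' |          # The second column is the IP address
  grep -Ff - known-bots.txt   # Fixed-string match against the bot list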

This is what the access log looks like for a single IP address (vertically and horizontally truncated for readability):

"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.99 Safari/537.36" 
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0"
"GET / HTTP/1.0" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.99 Safari/537.36"

Are other people seeing this? Does anyone have a clue what is happening here?

jippie
  • Which page did it access? These IP addresses are mostly Russian and probably related to spam; just a few CIDRs / networks. And there are current entries at SANS ISC: https://isc.sans.edu/api/ip/5.158.239.1 – Daniel Ruf Dec 25 '15 at 11:15
  • @DanielRuf as shown in the log, mostly / and some on a single blog post. I turned off comments on my blog a while ago because of spam in the review queue. – jippie Dec 25 '15 at 11:27
  • they are almost all from the same ASN, so to me these are clearly (spam) bots: http://bgp.he.net/AS42682#_prefixes – Daniel Ruf Dec 25 '15 at 11:30
  • the 5.158.x.x range is covered by http://bgp.he.net/AS34892#_prefixes – Daniel Ruf Dec 25 '15 at 11:32
  • @DanielRuf so why query a single page on the website with tens of different user agents? – jippie Dec 25 '15 at 11:32
  • Not sure about this, but I think they try different ones to bypass tools like fail2ban. All of them use HTTP 1.0 and could easily be blocked with the CIDRs (see the sketch after these comments). But definitely bots, maybe compromised clients, as these user-agent strings are fairly recent and look legitimate. – Daniel Ruf Dec 25 '15 at 11:37
  • This could also simply be recon or enumeration activity. Many sites respond differently to different browsers in order to display the page correctly (e.g. different CSS for IE compatibility issues). It could be trying to see if more info or weaknesses are exposed to different browsers. – schroeder Dec 25 '15 at 21:04
  • But Safari and Firefox user-agent strings? I see no specific advantage or difference between these user-agent strings. As a web designer and web developer I treat them all the same (no changes needed). Safari (WebKit) is almost identical to Chrome (Chromium, which forked WebKit into Blink), and Firefox (Gecko) mostly supports the same CSS rules. These modern user-agent strings (Windows 10) are normally not used by (most spam) bots. Probably compromised machines that are part of some sort of botnet (these are dynamic ISP addresses, not a hosting provider's). – Daniel Ruf Dec 25 '15 at 23:15
  • As schroeder mentioned this sounds like an automated scan. The scanning tools have a stack of user-agents and cycle through them all. Take a look at automated web-pentesting tools such as Uniscan, Arachni, Golismero, etc... You could try running one of these against your site to see if you get the same results. – 16b7195abb140a3929bbc322d1c6f1 Dec 27 '15 at 02:54
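
For reference, Daniel Ruf's suggestion of blocking the CIDRs takes only a couple of rules. A sketch; the two prefixes below are examples that merely cover part of the sampled addresses above, so verify the real ranges against the linked AS42682 / AS34892 prefix lists before deploying:

  # Drop everything from two of the suspect ranges (example prefixes derived
  # from the sampled addresses; check the AS prefix lists for the full set)
  iptables -I INPUT -s 5.158.236.0/22 -j DROP
  iptables -I INPUT -s 176.213.0.0/19 -j DROP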

3 Answers


This could be simple spidering, penetration testing, browser randomization, or a mix of those.


Web Spiders

Many web spiders allow you to randomize the user-agent while siphoning the contents of a website. This is rather trivial to implement, and some of my own web spiders do the same thing. However, randomizing user-agents while spidering is bad design.
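
To show just how trivial: a minimal sketch in shell, cycling through a fixed user-agent list against a placeholder URL (the strings are taken from the question's log):

  #!/bin/sh
  # Fetch the same page repeatedly, presenting a different user-agent on each
  # request (http://www.example.com/ is a placeholder target)
  ua1='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
  ua2='Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0'
  ua3='Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0'
  for agent in "$ua1" "$ua2" "$ua3"; do    # A real spider would shuffle a longer list
    curl -s -A "$agent" -o /dev/null 'http://www.example.com/'  # -A sets the User-Agent header
  done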


Browser Randomization

There are browser plugins such as Secret Agent which allow you to randomize your browser fingerprint values to avoid detection.

Since you are only seeing upwards of 19 attempts, it's also possible they've viewed around 15-19 pages each, but it seems odd that they'd do this consistently. It could even be one person switching their VPN and browser settings for each page load, which would indicate next-level tinfoil hattery.


Penetration Testing

Automated penetration testing tools also randomize their user agents when visiting a page.


Conclusion

Without seeing more of what's going on, we can't really tell you what's happening beyond making a few guesses. Do you have any packet capture data? That would help tremendously.
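
If not, a capture is quick to set up; a minimal sketch with tcpdump (eth0 is a placeholder for your server's interface, and the address is one from your list):

  # Record all traffic to and from one suspect address for later analysis
  tcpdump -i eth0 -s 0 -w suspect.pcap host 5.158.239.1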

Mark Buffalo

As schroeder mentioned, this sounds like an automated scan. The scanning tools have a stack of user-agents and cycle through them all. Take a look at automated web-pentesting tools such as Uniscan, Arachni, Golismero, etc. You could try running one of these against your site to see if you get the same results.

16b7195abb140a3929bbc322d1c6f1


Just a wild guess, but it might be some service testing whether your server is delivering drive-by downloads. That said, a (misbehaving?) crawler seems the more likely explanation.

SleepProgger