Over the course of about 2 hours, a logged-in user on my website accessed roughly 1,600 pages in a way that looks suspiciously like a bot. I am concerned because users must purchase access to the site to see our protected content, so I have reason to believe this person was scraping it.

I know I should have had mitigation factors in place to prevent this type of activity from occurring in the first place. I'm working on that now.

Based on the Apache access and error logs, I have pretty strong circumstantial evidence that the user was using some sort of a crawler or bot. I'm wondering if there is any way to get direct evidence, i.e. based on the crawling pattern, can I 100% say that it is a script?

Here's a sampling of the access log:

###.###.###.### - - [06/Apr/2016:19:32:59 -0500] "GET /article/id/slug-slug-slug-slug HTTP/1.1" 200 15002 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:00 -0500] "GET /article/id/slug-slug-slug-slug HTTP/1.1" 200 15002 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:04 -0500] "GET /article/id/wordmark-icon.png HTTP/1.1" 404 5026 "mywebsite.com/article/id/slug-slug-slug-slug" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:05 -0500] "GET /article/id/60559332d74832ae81f6ea69f98e24cc.png HTTP/1.1" 404 5191 "mywebsite.com/article/id/slug-slug-slug-slug" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:05 -0500] "GET /article/id/9e8d61bdd8acf3735a02ef90192eefa8.png HTTP/1.1" 404 5189 "mywebsite.com/article/id/slug-slug-slug-slug" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:05 -0500] "GET /article/id/b75384c9aa61c22fa768cdfbafaf5351.png HTTP/1.1" 404 5190 "mywebsite.com/article/id/slug-slug-slug-slug" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:06 -0500] "HEAD /article/id2/slug-slug-slug-slug HTTP/1.1" 200 604 "mywebsite.com/article" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:07 -0500] "HEAD /article/id3/slug-slug-slug-slug HTTP/1.1" 200 604 "mywebsite.com/article" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:08 -0500] "GET /article/id3/slug-slug-slug-slug HTTP/1.1" 200 9983 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"

...and so on and so forth.

Here are some observations I've found in the above:

  1. There were two GET requests to the same URL in about one millisecond. I don't believe this is possible for a human to do, but I could be wrong.
  2. I am not familiar with seeing HEAD requests in typical user activity. Is that common, or evidence of a bot?
  3. After the first two GET requests above, there are additional requests to GET the images found in the article. However, in reality, those images are located on a CDN with an entirely different URL scheme. This person/bot/whatever is using the URI (/article/id/) and adding the actual image filename, resulting in a 404 error. This occurred in every single instance.

Is it safe to say this is a bot, beyond a shadow of a doubt? If so, is there any possible way to find out the specific script, or is that a long shot? At the very least, are there symptoms of a certain type of bot, web scraper, or script?

Thank you for your input.

Nick S.
  • "*There were two GET requests to the same URL in about one millisecond*" where do you see this? Your logs only have single-second resolution. – MadHatter Apr 07 '16 at 15:00
  • Oops. I misread that then. I guess that makes it more possible for a human, but the fact that this happened over a thousand times, each within a second or two of each other makes it seem like a stretch to me. – Nick S. Apr 07 '16 at 15:05

1 Answer


Is it safe to say this is a bot, beyond a shadow of a doubt?

No. Someone could have many tabs of your site open, have the browser crash, and then reopen the browser window with all tabs restored, which would produce the same DoS-attack-like fingerprint.

If so, is there any possible way to find out the specific script, or is that a long shot?

I don't see any data that would precisely allow you to fingerprint such a script.

At the very least, are there symptoms of a certain type of bot, web scraper, or script?

The broken image requests do make it look suspicious. So yes, there are symptoms of automated activity.

Rather than trying to find out exactly what this is, consider a behavioural/reputational monitoring tool like Repsheet. This allows you to first log activity and determine patterns you might want to mark as suspicious. Next, you can decide what to do with such suspicious activity.
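As a rough illustration of the "log activity and determine patterns" step, here is a minimal Python sketch that summarizes one client's lines from an Apache access log. It is not how Repsheet works; it only assumes the log uses the standard combined format shown in the question, and the names `parse_line` and `summarize` are made up for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict for one log line, or None if it does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


def summarize(lines):
    """Compute a few simple signals for the given log lines."""
    records = [r for r in map(parse_line, lines) if r is not None]
    records.sort(key=lambda r: r["time"])
    total = len(records)
    methods = Counter(r["method"] for r in records)
    not_found = sum(1 for r in records if r["status"] == 404)
    # Seconds between consecutive requests (log resolution is one second).
    gaps = [
        (later["time"] - earlier["time"]).total_seconds()
        for earlier, later in zip(records, records[1:])
    ]
    return {
        "requests": total,
        "head_share": methods["HEAD"] / total if total else 0.0,
        "not_found_share": not_found / total if total else 0.0,
        "max_gap_seconds": max(gaps) if gaps else None,
    }
```

A high share of HEAD requests or 404s, or a long run with no gap larger than a few seconds, is only a hint and not proof. What counts as "high" is something you would have to calibrate against your normal traffic.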

If you cannot be sure it is a bot and don't want to anger what could be a real user, you can simply present a challenge, such as a reCAPTCHA or a forced re-login. Or you can redirect this user to a secondary server so performance doesn't suffer for trusted people on the main server. Or you can even send them to a honeypot and do whatever you want there: show fake data, show cached data, etc.
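To make the graduated response concrete, here is a small Python sketch of a per-user sliding-window counter that returns "allow", "challenge" or "divert". The class name `RequestWindow` and the threshold values are assumptions chosen for illustration, not recommendations; you would tune them to your own traffic and wire the result into whatever challenge or redirect mechanism you use.

```python
from collections import deque


class RequestWindow:
    """Count one user's requests inside a sliding time window."""

    def __init__(self, window_seconds=60.0, challenge_at=30, divert_at=120):
        # Hypothetical thresholds: requests per window before each action.
        self.window_seconds = window_seconds
        self.challenge_at = challenge_at
        self.divert_at = divert_at
        self.times = deque()

    def record(self, timestamp):
        """Register a request at `timestamp` (seconds) and pick an action."""
        self.times.append(timestamp)
        # Drop requests that have fallen out of the window.
        while timestamp - self.times[0] > self.window_seconds:
            self.times.popleft()
        count = len(self.times)
        if count >= self.divert_at:
            return "divert"      # e.g. secondary server or honeypot
        if count >= self.challenge_at:
            return "challenge"   # e.g. CAPTCHA or re-login
        return "allow"
```

Keeping one `RequestWindow` per logged-in account rather than per IP address fits the situation in the question, since the content sits behind a paid login.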

JayMcTee
  • What about the HEAD requests? My understanding is that a person using a browser under normal circumstances (without dev tools, etc.) won't generally issue these types of requests. – Nick S. Apr 07 '16 at 15:46
  • When I look up the user agent at https://user-agents.me/useragent/mozilla50-windows-nt-61-win64-x64-rv450-gecko20100101-firefox450 it looks like a normal desktop environment, but of course this can be faked. However, Firefox does use HEAD requests for its Save Link As functionality. So perhaps it's a browser plugin that tries to save your pages/files, or even a Tampermonkey-like script. – JayMcTee Apr 07 '16 at 16:09