Over the course of about 2 hours, a logged-in user on my website accessed roughly 1,600 pages in a way that looks suspiciously like a bot. I am concerned because users must pay for full access to our protected content, so I have reason to believe this person was scraping it.
I know I should have had safeguards in place to prevent this type of activity in the first place. I'm working on that now.
The Apache access and error logs give me fairly strong circumstantial evidence that the user was running some sort of crawler or bot. Is there any way to get direct evidence? That is, based on the crawling pattern, can I say with 100% certainty that it is a script?
Here's a sampling of the access log:
###.###.###.### - - [06/Apr/2016:19:32:59 -0500] "GET /article/id/slug-slug-slug-slug HTTP/1.1" 200 15002 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:00 -0500] "GET /article/id/slug-slug-slug-slug HTTP/1.1" 200 15002 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:04 -0500] "GET /article/id/wordmark-icon.png HTTP/1.1" 404 5026 "mywebsite.com/article/id/slug-slug-slug-slug" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:05 -0500] "GET /article/id/60559332d74832ae81f6ea69f98e24cc.png HTTP/1.1" 404 5191 "mywebsite.com/article/id/slug-slug-slug-slug" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:05 -0500] "GET /article/id/9e8d61bdd8acf3735a02ef90192eefa8.png HTTP/1.1" 404 5189 "mywebsite.com/article/id/slug-slug-slug-slug" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:05 -0500] "GET /article/id/b75384c9aa61c22fa768cdfbafaf5351.png HTTP/1.1" 404 5190 "mywebsite.com/article/id/slug-slug-slug-slug" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:06 -0500] "HEAD /article/id2/slug-slug-slug-slug HTTP/1.1" 200 604 "mywebsite.com/article" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:07 -0500] "HEAD /article/id3/slug-slug-slug-slug HTTP/1.1" 200 604 "mywebsite.com/article" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
###.###.###.### - - [06/Apr/2016:19:33:08 -0500] "GET /article/id3/slug-slug-slug-slug HTTP/1.1" 200 9983 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0"
...and so on and so forth.
Here are some observations about the above:
- There were two GET requests to the same URL within about one second of each other (19:32:59 and 19:33:00; the access log only has one-second resolution). I don't believe a human would do this, but I could be wrong.
- I don't normally see HEAD requests in typical user activity. Is that common, or evidence of a bot?
- After the first two GET requests above, there are additional GET requests for the images found in the article. In reality, those images are hosted on a CDN with an entirely different URL scheme. The client is instead appending the actual image filename to the article's path (/article/id/), which results in a 404 error. This occurred in every single instance.
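To see how often each of these three patterns shows up across the full log rather than just the sample, a minimal Python sketch along these lines could tally them. It assumes the standard Apache "combined" log format shown above; the names `parse` and `summarize`, the one-second threshold, and the `.png` suffix check are illustrative choices of mine, not anything Apache provides. High counts would only strengthen the circumstantial case; they would not by themselves prove that a script was involved.

```python
import re
from collections import Counter
from datetime import datetime

# Apache "combined" format:
# host ident user [time] "method path protocol" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    return rec

def summarize(lines):
    """Count the three suspicious patterns described in the list above."""
    records = [r for r in map(parse, lines) if r is not None]
    counts = Counter()
    last_get = {}  # path -> timestamp of the previous GET for that path
    for r in records:
        # Pattern 2: HEAD requests
        if r["method"] == "HEAD":
            counts["head_requests"] += 1
        # Pattern 1: the same URL fetched again within one second (assumed threshold)
        if r["method"] == "GET":
            prev = last_get.get(r["path"])
            if prev is not None and (r["time"] - prev).total_seconds() <= 1:
                counts["repeat_get_within_1s"] += 1
            last_get[r["path"]] = r["time"]
        # Pattern 3: image requests resolved against the article path, ending in 404
        if (r["status"] == 404
                and r["path"].startswith("/article/")
                and r["path"].endswith(".png")):
            counts["image_404_under_article"] += 1
    counts["total"] = len(records)
    return counts
```

Calling `summarize(open("access.log"))` on the lines for the IP address in question (the filename here is a placeholder) returns a `Counter` with one entry per pattern plus the total number of parsed requests.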
Is it safe to say this is a bot, beyond a shadow of a doubt? If so, is there any possible way to find out the specific script, or is that a long shot? At the very least, are there symptoms of a certain type of bot, web scraper, or script?
Thank you for your input.