85

I own a popular website that allows people to enter a phone number and get information back about that phone number, such as the name of the phone carrier. It's a free service, but it costs us money for each query so we show ads on the site to help pay for it. To make sure people don't abuse it, we have a captcha and use the IP addresses to limit the number of queries to 30 per month.

But we've been seeing abuse anyway: we'll suddenly get bursts of queries (hundreds per minute) from many different IP addresses, all getting the captchas correct. So I keep changing the captcha; I've tried ones with words, math equations, reCAPTCHA, etc. When I do this, it stops the "attack" for 24 hours or so, and then it begins again.

I understand people can use OCR and other methods to get around captchas, but I don't understand why they are coming from lots of different and unrelated IP addresses.

Maybe they're spoofing the IP addresses? If so, they can't be getting the results from the queries, correct? In this case, maybe the goal is to try to hurt us financially, as opposed to them simply wanting the data?

If they are not spoofing the IP addresses, maybe they have hacked a huge number of different computers and are executing queries from them? This doesn't make sense to me because of the sheer number of IP addresses we're seeing (hundreds of transactions per minute with a maximum of 30 queries per IP address, for long periods of time), and the fact that this data really isn't that valuable.

So I'm trying to understand their motivation as well as how they are accomplishing this, in order to be able to fight back appropriately.

Anders
Marc

10 Answers

97

Interesting problem. I wonder if a solution may be to force your users' web browsers to solve a cryptographic problem (using JavaScript running in the browser) that is 'hard' to solve, but 'easy' for your site to verify. By 'hard' to solve, I mean a problem that would take ~10 seconds to solve with the resources of a typical desktop or laptop. It would be akin to the problem that Bitcoin miners solve when new blocks are mined, but of course on a much simpler scale.

Your legitimate users would not notice the difference, as the script churns away while they fill out the form on your site. But, it would slow down the abusers considerably, and force them to allocate a lot more resources, and force them to re-work whatever tool they are using to automate these posts to your site.
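As a rough illustration of the 'hard to solve, easy to verify' idea, here is a minimal hashcash-style sketch. It is written in Python only to show the logic; in practice the solver would run as JavaScript in the browser. The function names, the `challenge:nonce` format, and the difficulty value are assumptions for illustration, not something taken from an existing library.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random challenge string (should be single-use)."""
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    """Count how many leading bits of the digest are zero."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce. This is the 'hard' part."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the work. This is the 'easy' part."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty
```

Each extra bit of difficulty doubles the expected work for the client, while verification stays at a single hash, so the difficulty can be tuned until a typical machine needs the desired number of seconds.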

mti2935
  • 22
    This concept predates cryptocurrencies by several years, and was inspiration for Bitcoins. https://en.wikipedia.org/wiki/Hashcash – Ghedipunk Dec 18 '19 at 16:22
  • 1
    Interesting solution. This might be the only solution that would block both bots and captcha-solving humans in third world countries. It would become too expensive. – reed Dec 18 '19 at 21:17
  • 22
    This could be considered malware by browsers and be blocked accordingly. – Mast Dec 18 '19 at 22:33
  • 19
    @Mast, no, it wouldn't. I'm aware of a few reasonably-major sites that do exactly this without any problems (other than being inaccessible without Javascript). – Mark Dec 18 '19 at 22:59
  • 8
    Would this actually help if it's coming from a botnet? Yes it would be more computation work for the botnet, but since he's rate-limiting already this just increases the time to 300s per ip in the botnet instead 30*(whatever the captcha-cracking time is). –  Dec 19 '19 at 09:31
  • 10
    Cloudflare comes with a service like this. Highly recommended. – Anders Dec 19 '19 at 11:38
  • I don't understand how this actually blocks anyone. A distributed attack would just take 10 extra seconds per round. A captcha usually takes me like 5-10 seconds anyway if I get it wrong. Don't some auto-browsers do JS too like Selenium? – zero298 Dec 19 '19 at 14:10
  • 3
    @zero298, the thing is, you're rate limiting each request from the botnet, outside of any method a human could solve. You're also limiting the scalability of the distributed computing resources at their disposal. – Jarrod Christman Dec 19 '19 at 15:11
  • @Anders - There's a [ready to use python module](https://github.com/VeNoMouS/cloudscraper) to bypass that - It seems that all CF currently does is check if the client supports JavaScript. But in principle, this should work - one could even run one of the various client-side crypto mining scripts for a minute (one would need to enforce it server side to ensure the client actually does the computation work) or so as a challenge. Although you might need to ask users to turn off their adblocker, but maybe it's worthwhile even if a few of your users would leave. – Jonas Czech Dec 19 '19 at 16:51
  • Now that I think about it this reminds me of how TeamSpeak would create the public key for profiles. If you ban a profile then they'd need a new one that could take as short as 1 second to several days. (Time required was customized per server) –  Dec 19 '19 at 18:04
  • 1
    > Cloudflare comes with a service like this. Yes, and there is sample code. A friend of mine and I and did this for hidden services, he has friends at cloudflare who used it. https://github.com/jsavoie/proof-of-work-login –  Dec 19 '19 at 19:59
  • 5
    Do not do this if you ever want to support mobile, doing a hard crypto calculation is going to take double the time and absolutely thrash your user's battery. – Delioth Dec 20 '19 at 15:56
  • 1
    It doesn't seem unfair if OP's website "charged" users with a valid hash solution that could be submitted to some real cryptocurrency mining pool. Even if it is inexpensive to the attacker if every botnet device has to wait about 10 seconds per request, OP will still be paid for it. – lvella Dec 20 '19 at 16:16
  • @lvella I was thinking the same thing. Some message-boards do similar things to "stay in the black" more easily. – Seldom 'Where's Monica' Needy Dec 21 '19 at 01:40
  • 1
    lvella, SeldomNeedy - It's a great idea, but there's very little to be made from this. JavaScript running in my web browser (on a decent, fairly new laptop) can do (at best) about 20,000 hashes per second. In contrast, all of the bitcoin miners working together are doing ~100,000,000 tera-hashes per second. They all grind away for approximately 10 minutes to earn 12.5 bitcoin, which is worth about $90,000 USD. So, if you do the math, JavaScript in my browser hashing for 10 seconds at 20,000 hashes/second (200,000 hashes out of roughly 6 × 10^22 per block) is worth about $0.0000000000003 USD. – mti2935 Dec 21 '19 at 13:36
  • This isn't really such a great idea. For a discussion of why this is the case, see https://news.ycombinator.com/item?id=7944540 –  Dec 21 '19 at 16:22
34

How?

Rented botnet and captcha farms.

Why?

Someone wants your data. It's cheaper to steal it than buy it.

What to do?

Stealing it is cheaper, but not free. It costs "them" (whoever ultimately wants the data, not the botnet or captcha farm) money to do these attacks. Make it more expensive to attack you than the data is worth.

  1. Identify patterns to identify spammers.

  2. Return legit looking, but bogus data to the spammers.

After a certain number of valid responses, start interspersing bogus data with the valid data. Then they have to take extra steps to validate your data. Those extra steps cost extra money.

If they don't validate it, their data is less useful, i.e. worth less. They may still be able to use it or sell it, but it's less valuable so again the cost to attack you is higher than the value returned.

  • 1
    This would not really solve the issue. It would still take a lot of time for the botnet owner to recognize that you're playing with him and he would keep spamming your site. And if your bogus data is too obvious then the botter would just stop the bot for that IP once it detects bogus data. – Luc H Dec 20 '19 at 17:11
  • 1
    You're right, blocking the attack would be best. This is easy/cheap enough to implement *on top of* the other suggestions. All of the suggestions can be overcome, for a price. If you use additional methods, that adds to the price. But you're right, it does not immediately stop the costs to OP. – Zach Mierzejewski Dec 20 '19 at 19:17
  • 1
    How do you deal with the fact that you may give legitimate users bogus data? – DjangoBlockchain Dec 20 '19 at 20:46
22

You are doing CAPTCHA wrong.

The idea of CAPTCHA is to make it hard (read "next to impossible") for a computer to solve it, but easy for a human to do so. If you just use one static image, asking the user to type in 4 for instance, then a computer will have no trouble repeatedly entering 4 when instructed to do so.

Instead, consider using reCAPTCHA or similar technologies. These problems have been solved already, and there is no need to reinvent the wheel, as demonstrated below:

[Reinventing the Wheel]

CC-BY-NC 2.5, Randall Munroe, xkcd.com/2140/

Adam Porad
14

If you can put up a simple "type the number in this picture" CAPTCHA and have that stick for 24 hours, you know your enemy is an amateur. You know this sort of primitive device involving bespoke code will slow them down for 24 hours. This could be fun :)

I would make extensive use of stylesheets to hide information in the page code, in two senses: first to hide CAPTCHAs, and second to hide informational answers, with the aim of sadistically misleading scrapers.

I would write a bit of code on server-side to create fake answers that are believable at first glance, but phoney in ways not easily confirmed. Further, use random seeding or MD5s to make sure the same input always gives the same phoney answer.
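To make "the same input always gives the same phoney answer" concrete, here is a minimal sketch. It swaps the plain MD5 mentioned above for a keyed HMAC-SHA256, so that an attacker who guesses the scheme still cannot reproduce or detect the fakes without the key. The carrier list, `SECRET_KEY`, and the function name are placeholders assumed for illustration.

```python
import hashlib
import hmac

# Placeholder values (assumptions): use your real carrier names and a private random key.
CARRIERS = ["Carrier A", "Carrier B", "Carrier C", "Carrier D"]
SECRET_KEY = b"replace-with-a-private-random-key"


def fake_carrier(phone_number: str) -> str:
    """Return a believable but fake carrier that is stable for a given number."""
    # Normalize so "213-456-7890" and "(213) 456 7890" map to the same fake answer.
    digits = "".join(ch for ch in phone_number if ch.isdigit())
    digest = hmac.new(SECRET_KEY, digits.encode(), hashlib.sha256).digest()
    index = int.from_bytes(digest[:4], "big") % len(CARRIERS)
    return CARRIERS[index]
```

Because the answer depends only on the number and the key, repeated bot queries for the same number never contradict each other, which is what makes the poisoned data hard to spot.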

Misleading on CAPTCHAs:

For instance, leave the last CAPTCHA system you were using, but use stylesheets to hide it. Follow with a different CAPTCHA, obfuscated by Javascript; maybe even another reCaptcha with a different key.

Now, the scraper will not realize the first CAPTCHA is suppressed with stylesheets. It will cheerfully solve the CAPTCHA and return the answer with the wrong key. Gotcha. However, just like cracking Enigma, you can't make it obvious that you've busted the code; the scraper must continue to believe it is working.

Misleading on answers:

Present an answer as normal, with a stylesheet around it. The stylesheet hides this result from normal people; the scraper is oblivious that this sheet has the "hidden" property. The answer you present here is the fake. Present the true result afterwards. For bonus points, present the results in a graphic, which makes them unscrapeable. Try to conceal this, of course.

If you have telemetry (solved wrong CAPTCHA) that this is a scraped query, then don't even bother buying that query result from your service provider. Insert a sleep(t+random) for the typical range of time your service provider takes, then send back a fake answer.

Looks normal

The attacker will believe things are working normally, and will only be checking for query success, not quality of results. With luck, your attacker won't have logged when each query was made, and is simply dumping the answers into a database. It may take the attacker quite a long time to realize you have poisoned the data, by which time the entire database would be corrupted, with no way of knowing which entries are valid and which are poison (see the importance of making the fake data look legit?). Even if the attacker timestamped every entry, what a bug hunt! They would have to manually check several entries for each day to figure out when the data went bad.

And one more thing. Cache true answers, and if a botnet query is in cache, always give the correct answer from cache. So the scraper, troubleshooting, will hit your real website with a browser, and ask for a test number of 213-456-7890. The hiding will work and this will behave like a real query, so you'll compute the real answer and return it. Next, the scraper will tell the botnet to ask for 213-456-7890, to see if the bot gets a different result. You will detect the bot query. If you now give a fake answer, the scraper knows the jig is up, and will iterate on breaking your detection. So since you have the true answer in cache, give it, even in the hidden fields. Now, the scraper is perplexed: the botnet seems to work.
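The decision logic described above (serve cached truths to everyone, never pay for a suspected bot query, and delay the fake so it looks like a real lookup) could be sketched as follows. The function and parameter names, the in-memory cache, and the latency range are all assumptions for illustration.

```python
import random
import time

# In-memory cache of true answers (an assumption; a real site would likely use a database).
real_cache = {}


def answer_query(number, is_suspected_bot, lookup_real, make_fake,
                 typical_latency=(0.2, 0.8)):
    """Decide what to return for one query.

    lookup_real: callable that pays the upstream provider for a true answer.
    make_fake:   callable that produces a stable, believable fake answer.
    """
    # A cached truth is returned to everyone, so a bot re-checking a number
    # that was already looked up in a real browser sees a consistent answer.
    if number in real_cache:
        return real_cache[number]

    if is_suspected_bot:
        # Do not buy the query; imitate the provider's usual response time.
        time.sleep(random.uniform(*typical_latency))
        return make_fake(number)

    result = lookup_real(number)
    real_cache[number] = result
    return result
```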


Why and how

Obviously someone finds your data valuable. They would get it from your source, but they don't want to pay for it, so they are scraping you.

  • It is possible they are actually a competitor website which does the same thing you do, and they generate a query to you when they get one from their visitor. In essence this is a scheme to use your service but put up their ads. You yourself know the value of that. You can test that by making obscure and different queries on every competitor site and see which queries pop up in your logs.

There are a zillion ways to get CAPTCHAs solved. In the example of a competitor website pulling your data for their customer, they may simply be passing your CAPTCHA on to their customers. There are also ways to trick humans into doing CAPTCHAs for you, such as "solve CAPTCHAs, get free porn", or by offering an unrelated service that requires CAPTCHAs for some reason, like an anonymous bulletin board. Every time someone posts, it sends you a query and gets its poster to solve your CAPTCHA. There is also CAPTCHA solving done by what is essentially slave labor in the third world.

  • Excellent idea. Rather than just setting hidden on the fake answer, you might want to hide it indirectly, by randomly positioning it outside the viewport or putting it under another element, as that will be harder to detect. – barbecue Dec 21 '19 at 16:56
12

Why?
Data related to phone numbers, names and email addresses are extremely valuable, both in the legal and underground market.

How?
It sounds like someone is using a botnet to mine data from you. This could mean connections from anywhere between a few dozen and thousands of globally scattered IPs. Personally I have no idea how they are getting around reCAPTCHAs, other than using manual labour from sites offering captcha solving services. All of this costs them money in one way or another.

Solution?
Disclaimer: I'm not a security expert.
Some free services use a queuing system after a certain number of queries. Say you don't want to overload your system: you allow a maximum of 30 requests (or whatever number of concurrent requests your system can easily manage) at any single time. Requests put in while the queue is full either get a message explaining that the server is busy and that they have to try again later, or are automatically queued. This solution isn't without problems, as your legitimate clients will sometimes have to wait to be served, especially during peak times or during an attack.
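A minimal sketch of the "maximum of 30 requests at any single time" idea, using a semaphore so that excess requests are turned away with a busy signal instead of being processed. The class name and return convention are assumptions for illustration; a real deployment would more likely enforce this in the web server or a reverse proxy.

```python
import threading


class BoundedGate:
    """Allow at most max_concurrent handlers to run at once."""

    def __init__(self, max_concurrent=30):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_handle(self, handler, *args):
        """Return (True, result) if served, or (False, None) if the server is busy."""
        # Non-blocking acquire: a full gate rejects immediately rather than queuing.
        if not self._slots.acquire(blocking=False):
            return (False, None)
        try:
            return (True, handler(*args))
        finally:
            self._slots.release()
```

When `try_handle` returns `(False, None)`, the site would show the "server is busy, try again later" message.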

You mentioned that changing your captcha method curbs the attacks for a while. Perhaps there's a way to randomly alternate the captcha method for each visitor with every request? At the very least, the attacker would have to rewrite some of their methods. In the best case, their successful attacks are divided by the number of different methods you incorporate.

phLOx
  • On the same thread as detecting the abuse and changing your response, consider responding with junk data which looks plausible. You would need to be sure that the request is fishy. I'm curious if the timing, source IPs, or request doesn't hold some unique quality that you could switch on. – blackboxlogic Dec 19 '19 at 14:21
  • 8
    Unfortunately, without appropriate methods to separate bots from legitimate users, both this solution and the solution from 'pants' in the comments turn the botnet activity into a DOS attack. I'm not sure if turning one attack into another is the best way to go about it. – Jarrod Christman Dec 19 '19 at 15:06
  • Many information services like OPs deliver your data by e-mail. This might be something they are avoiding but would allow you to rate limit deliveries to a domain. – Nathan Goings Dec 19 '19 at 21:13
  • Not a security expert, but I'm pretty sure the three legs of the Security Triad are: Confidentiality, Integrity, and Availability. That last leg kinda gets blown away by the proposed solution. – Kevin Dec 20 '19 at 19:43
  • I like your reason for the attack. I could see if you are an offshore text/phone scammer, having information about the phone number you are calling could be very valuable. – inund8 Dec 20 '19 at 20:08
3

Their motivation may simply be that they are building a similar service themselves and need data. Your service could be one such data source that they've found and need to scrape.

Have you tried rate-limiting your requests? You say you're getting hundreds a minute (assuming from the same IP address/es); couldn't you log those requests, detect repeat visitors within a reasonable time period, and then temporarily ban those IPs?

You could also add "honeypot" form elements into your form. Honeypot form elements are hidden from genuine users, but are auto-filled by bots. Any request with data in those field(s) is automatically discarded, and the sender maybe even banned.
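On the server, the honeypot check itself is tiny. A minimal sketch, assuming the form arrives as a dictionary and that the hidden fields are called `website` and `fax_number` (both names are made up for illustration):

```python
# Hypothetical honeypot field names: present in the HTML, hidden from real users.
HONEYPOT_FIELDS = ("website", "fax_number")


def is_bot_submission(form: dict) -> bool:
    """A genuine user never sees these fields, so any value in them flags a bot."""
    return any(str(form.get(name, "")).strip() for name in HONEYPOT_FIELDS)
```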

  • 3
    In this case, it looks like the OP's website is being specifically targeted (since they're getting around captchas), and honeypot forms only really work with generic bots (e.g. ones that probe comment forms on many different websites). If the attacker is specifically targeting your website, they usually do it by manually looking at your website code and writing a script to submit forms or carry out actions - and will take the honeypot forms into account. – Jonas Czech Dec 19 '19 at 16:58
2

Don't use one captcha solution, use them all!

Since you already have multiple different ones lying around, why not rotate them (randomly) on either a 2 hour or even per-request basis? Even if the attackers theoretically have cracked them all, their needing to detect the type of captcha is in itself another captcha for computers to solve (while not affecting humans at all).

Also maybe include dumb questions as captchas, like "what phone number are you looking up again?" etc. The more random stuff there is, the harder it is for bots.
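Per-request rotation could be sketched like this: pick a captcha type at random, remember it in the visitor's session, and only accept an answer for that type once. The function names and the `expected_captcha` session key are assumptions; `session` stands in for whatever session store the site already uses.

```python
import secrets


def pick_captcha(captcha_types, session):
    """Choose a captcha type at random and remember it for this visitor."""
    choice = secrets.choice(captcha_types)
    session["expected_captcha"] = choice
    return choice


def captcha_type_matches(session, submitted_type):
    """Accept only the type that was actually served, and only once."""
    expected = session.pop("expected_captcha", None)
    return expected is not None and submitted_type == expected
```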

Especially if you use different disciplines (image recognition, reading numbers, math, general knowledge, etc) the botters would have a hard time following up.

And you don't need to outsmart them perfectly, you just need to make it not worth their time anymore.

Edit: this would also require throwing in new captcha types regularly.

Hobbamok
  • 8
    What is your basis for saying that detecting the type of captcha is harder than solving the captcha? It seems intuitively to me that it should be far easier as there will be all sorts of meta data that will vary between the types. – Jon Bentley Dec 19 '19 at 11:23
  • You assume that bots are hard-coded for a specific captcha type and only crack the one they are designed to. You are not assuming that bots inspect to see which captcha is displayed and apply the appropriate process. – schroeder Dec 19 '19 at 12:59
  • @schroeder if OP keeps screwing with the layout of the site, then it *would* require manual hardcoding to solve! To say nothing of style sheets to hide material. – Harper - Reinstate Monica Dec 19 '19 at 23:35
2

So I'm trying to understand their motivation as well as how they are accomplishing this, in order to be able to fight back appropriately.

It is also possible that proxies are used to access your service. Just googling for "open proxy list" returns several sites listing open proxies, which could be used to mask the client's IP address.

I suggest logging the HTTP headers X-Forwarded-For and Via on the server side for some time and then checking whether it is plausible that such proxies are being used to abuse your system. X-Forwarded-For usually contains the IP address of the originating client; Via lists the proxies in the chain (if any). Please be aware that using proxies in general is legit, but there might be some interesting patterns, e.g. if you see the same proxies being used over and over again within an attack period.
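A minimal sketch of the analysis step, assuming the logged request headers are available as a list of dictionaries (the function names are made up for illustration). Keep in mind that both headers are supplied by the client or by intermediaries, so they can be missing or forged.

```python
from collections import Counter


def proxy_chain(headers: dict) -> tuple:
    """Extract the hops listed in X-Forwarded-For and Via, if present."""
    lowered = {key.lower(): value for key, value in headers.items()}
    hops = []
    for name in ("x-forwarded-for", "via"):
        value = lowered.get(name, "")
        hops.extend(part.strip() for part in value.split(",") if part.strip())
    return tuple(hops)


def count_proxy_chains(logged_headers):
    """Count how often each proxy chain appears across the logged requests."""
    counts = Counter()
    for headers in logged_headers:
        chain = proxy_chain(headers)
        if chain:
            counts[chain] += 1
    return counts
```

Calling `count_proxy_chains(logs).most_common(10)` on the requests from an attack window would show whether the same few proxies keep reappearing.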

mottek
1

I don't consider this a complete answer. I'm just saying what I would do in a somewhat similar situation.

  1. Log the queries. Is there any pattern in their queries? For example, a specific country or a specific area. If they are really using the results, there must be a pattern. If not, I would consider number 2.

  2. You said that when you change the captcha type and technology, the attack stops for about 24 hours. I read this in this way:

    When I fight for 10 minutes of work time, I cost my opponent 24 hours of work.

    So all you need to do is keep costing them hours and persist in that. It makes whoever is doing this tired, and you can be sure they will be the first to stop fighting. By that ratio, your odds of winning are roughly 1 - (10/1440), about 0.993.

    That is not a real solution; rather, it is something I would consider before going to number 3.

    Remember, they may come back next month or six months later, but now they know you are persistently fighting back, and you are the one who loses very little.

    You could even make fighting a little harder for them. For example, more than 3 queries in one day requires the user to solve 2 types of captcha; after the 10th, the system hardens even more, in a way that your real visitors would not notice.

  3. Sad, but use authorization. You can optionally make the first n (n<10) queries anonymously available, but require logging in for more than that.
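The escalation in step 2 could be sketched as a simple mapping from the number of queries a visitor has already made today to the number of captchas to present. The function name is made up, and the third tier (3 captchas after the 10th query) is an assumption, since the answer only says the system should harden further.

```python
def captchas_required(queries_so_far_today: int) -> int:
    """Return how many captchas the visitor's next query must solve."""
    if queries_so_far_today >= 10:
        return 3  # assumed value for "hardens even more" after the 10th query
    if queries_so_far_today >= 3:
        return 2  # more than 3 queries in one day: two captcha types
    return 1
```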

FarhadGh
1

I developed a contact form that abusers have been trying to abuse for over a year now and have consistently failed.

My approach includes a combination of:

  1. After each required field validates, it triggers an ajax call that retrieves a new, randomly generated field name of 32-48 characters, which is temporarily stored in a form validation db table. Then, when the form is submitted, a field coming in with a name that was not generated by the server, or with the original field name, triggers logging of the remote IP address to the db, along with whatever was submitted. Once the field name is changed, any submission with the original field name is detected as abuse and treated accordingly.
  2. They have to be on the page the form is on for at least 1.3 seconds per required field, and all fields have to be validated before the submit button's disabled property is removed. At that point, at least the name of the submit input is changed, either with a new ajax call or with a name and/or value received from a previous ajax field-name call. The button name and value must match during form validation on the server, or abuse is detected and handled accordingly.
  3. I log all submissions to the db and flag the abuse with a DENY target. Once they have abused my form, they are permanently blocked from even visiting the form page and are redirected directly to a 403 response after the attempted visit is logged.
  4. During one of the ajax calls, I will sometimes randomly generate a new field and value that gets appended to the form before submission; it must be present or the submission won't validate and will be detected as abuse.
  5. You can include honeypot fields, but don't hide the field object itself; that will be detected. Hide a parent object if you are going to hide the field. You can also position it absolutely, far out of view. Any honeypot field that comes in to the server with a value of any kind is detected as abuse and dealt with as such.

Make sure to log all submissions so that you can monitor for new patterns meant to circumvent your security.
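The server-side half of item 1 could be sketched roughly as below. This is not the author's actual code: the class and method names are invented, and an in-memory dictionary stands in for the form validation db table.

```python
import secrets


class FieldNameRegistry:
    """Issue random one-time field names and map them back on submission."""

    def __init__(self):
        # random field name -> original field name (stand-in for the db table)
        self._issued = {}

    def issue(self, original_name: str) -> str:
        """Generate a random field name of 32 to 48 hex characters."""
        token = secrets.token_hex(secrets.choice(range(16, 25)))
        self._issued[token] = original_name
        return token

    def resolve(self, submitted: dict, original_names: set):
        """Return {original_name: value}, or None if the submission looks abusive."""
        for name in submitted:
            # An original field name, or a name the server never issued, is abuse.
            if name in original_names or name not in self._issued:
                return None
        # Names are single-use: remove each one as it is consumed.
        return {self._issued.pop(name): value for name, value in submitted.items()}
```

A `None` result is where the logging and DENY flagging from item 3 would take over.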

Philip Rowlands
  • 1
    All of that can be spoofed/duplicated by a targeted attacker who takes the time to understand your validation rules. In this case, the OP seems to be up against an attacker who is specifically targeting their systems, so this may end up being a lot of work for only very temporary benefit – Conor Mancone Dec 20 '19 at 14:09
  • I really did not submit or intend that this would be any kind of a permanent silver bullet. – Dan Stepaniak Dec 21 '19 at 00:24