This is a topic near and dear to my heart; I've battled bots for ages. My conclusion is that your best strategy, though not very gratifying, is always to simply absorb the traffic: detect that it is a bot, then neutralize any further action on its part without giving any programmatically obvious indication that you have done so. Obvious would be sending a large file in response to an HTML page request, or serving a page telling them they are bad and not allowed (a mistake I made for ages, not realizing that it's software reading the response, not a person). Programmatically non-obvious means that to automated bot spidering software, the response looks normal and does not raise red flags internally.
To understand what programmatically non-obvious pages look like, spend some time examining, for example, an SEO-constructed spam website that you reach through a Google search. Those pages are designed to work around Google's bot filters, that is, they do not trigger Google's internal spidering red flags, which is why they were offered to you as a SERP result. Bad bots are not very complicated programs (speed and efficiency matter more to bot operators than sophistication), so their ability to 'read' a page or response tends to be quite basic, and it's not very hard to give them something that looks like a normal response.
Some of the responses here point to a problem whose severity is easy to underestimate until you've personally experienced it: the resources, and the possible vindictiveness, of bot masters. If you engage in the type of tricks contemplated here, only a few things will happen:
- You use server resources for no reason and no gain.
- You aggravate the bot operators, who can then add you to globally distributed lists, essentially databases of sites to target heavily. Sites usually land on those lists because they are good targets, but operators can just as easily add yours out of spite. This has happened to me, and it took me months to program around the attacks, because while they resemble a DDoS, they aren't actually that; they are simply various bot operators working from those master lists.
I've at times tried sending large files to violators, and it's not just a waste of time, it's generally counterproductive.
There are also some fundamental misconceptions here about how bot software works: it's often quite simple and designed for rapid action, so the odds are quite good that as soon as a size limit is hit, it just stops and moves on.
To clear up some common misconceptions about bots and how they work, let's construct a simple one:
wget --spider -t 1 --read-timeout=5 -U "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0" [target url]
This is the initial URL-database-builder component of the bot. It's only requesting the HEAD, that is, checking whether the file exists, which generates a server response code. As some people here correctly indicated, one excellent strategy is to use this step to send a 404 once the bot has been detected. The bot is software, not a person, so it will merely shrug its virtual shoulders and think: fine, page doesn't exist, next URL. This is a very good strategy, and the one least likely to raise red-flag alerts. Note that the request is also telling the server it's Firefox on Windows 10, though it could be anything else the operator chose to put there; fake useragent lists that are switched or rotated are a common feature of decently designed bad bots.
However, serving the 404 still takes some server resources, particularly if your 404 page is constructed by a CMS. The default server 404 is light, but ugly for real users who might also hit it.
A 301 just sends the request elsewhere. Whatever you do, do not redirect it to a known anti-spam/anti-bot website; the operators know those sites as well as or better than you do. Just send it somewhere that doesn't exist.
Bots do not, however, respect 301s with any consistency; sometimes they follow them, sometimes they don't, depending on their programming.
So there are some decisions to make about how to handle a bot at the spidering stage. A 404 is quite a good solution, though make sure to test the actual server response codes so that you know you are sending a real 404, not a redirect to a 404 page, which most bots won't register as a failed request.
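To check what the bot actually sees, look at the raw status lines rather than the rendered page. A quick sketch from the command line; the URL is just a placeholder for one of your own blocked pages:

# print the raw status lines the spider receives (wget writes headers to stderr, hence 2>&1)
wget --spider --server-response -t 1 "http://www.example.com/blocked-test-page" 2>&1 | grep "HTTP/"
# same check with curl: print just the first status code and discard the body
curl -s -o /dev/null -w "%{http_code}\n" "http://www.example.com/blocked-test-page"

If the first status line is a 301/302 followed by a 200, the bot saw a successful page, not a dead end.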
Now we can grab the file:
wget -t 1 -Nc --read-timeout=5 -U "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0" [target url]
Note that we're again setting a short timeout, which defeats the big-file idea and makes it a total waste of your server resources, and again sending a fake useragent.
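Rotating useragents per request is just as trivial; a sketch, where useragents.txt and urls.txt are hypothetical files the bot operator maintains:

# pick a random useragent from a local list for every request (both files are hypothetical)
while read -r url; do
  ua=$(shuf -n 1 useragents.txt)
  wget -t 1 --read-timeout=5 -U "$ua" "$url"
done < urls.txt

Which is why useragent checks on their own prove very little.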
Obviously, real bots will generally be written to be far more efficient than firing off wget or curl requests, but the ease of working around most of the ideas here should be obvious, given that I'm writing my 'bot' with wget and its built-in options; any decent bot software will be more sophisticated than that, and easier to use.
Any self-respecting bot author will of course take the time to google threads like this one to see what people are currently doing, and simply add protections against those ideas to their spider/downloader, which is why you don't use them. Bot authors are often serious, competent professionals whose work focuses on successful bot activity; this is their job, so it's worth respecting them enough not to aggravate them. Doing so is like jaywalking in front of a policeman: a pointless provocation.
For the downloading phase, you can again use the detection to simply serve fake pages with no content, or anything else that puts very little load on the server. Whatever you do, you don't want to spend a lot of resources giving them fake content.
There are also a lot of fairly silly ideas floating around the internet that mostly show the lack of experience of the people floating them, for example, spending time blocking IP addresses. Most bots run from effectively random IP addresses: a botnet, datacenters, temporary AWS instances/IPs set up and then discarded. Whatever the source, there are few less productive ways to block bots than by blocking IPs. This will become even more true as IPv6 becomes standard, with its ability to reassign public IP addresses based on timers, schedules, or device configuration.
Analyzing IP data can still be useful for knowing what you are dealing with; maybe most requests come from Romania, or China, and that tells you something. But operators can, and do, switch from one country to another and from one datacenter/botnet to another, so IP analysis is more of a good backend tool that shows you what type of bot issue you have.
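For that kind of backend look, a one-liner over the access log is usually enough; a sketch assuming a standard combined-format log at /var/log/apache2/access.log (adjust the path to your setup):

# top 20 requesting IPs, busiest first; feed the worst offenders into a geo lookup tool if you want the country picture
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20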
It's always a good strategy to respect the intelligence of the people who write spidering tools (i.e., bots), not to assume they are stupid, and not to antagonize the users of that software needlessly. Just do your business: handle the bot, then let it move on to the next site without registering any issues on yours.
Also, it's important to understand that there's no such thing as 'bots' in general; what you usually face are single-purpose bots. A bot is a piece of software that in almost all cases automatically requests the URLs in its database; it may also build that database of URLs itself, or it may use commercially available ones. Some bots automatically scan entire IP ranges of the internet, just looking for things; others are targeted and, like search bots, follow links. Note that if you have not disallowed all admin-type pages/directories in robots.txt, you can't tell which bot is OK and which isn't, so that's the first step to take. You don't need the whole file path, just enough of the start of it to make it unique as a path, like: /admin. The usual varieties:
- search bots, legitimate
- SEO bots, usually pretending to be legitimate. Disallowing their useragents in robots.txt and then analyzing log files for subsequent requests will show you which are real and which are scum (see the sketch after this list).
- contact page autosend/fill bots
- blog autopost bots
- forum signup/post bots
- blackhat bots searching for common, easily exploitable application URLs belonging to often insecure and/or unupdated tools like WordPress, phpMyAdmin, Drupal, etc. These can usually be spotted in your stats by watching for requests for URLs that nothing on your site links to.
- site uptime bots; see the SEO bots item above for how to check whether they're real. Basically, if you didn't sign up for the service, it's gray- or blackhat.
- experimental bots, new search engines, web services, sometimes ok, sometimes badly written, usually not malicious but worth examining.
- indexing bots, which may for example be selling indexes of sites to various other bot operators.
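Here is the robots.txt-then-logfile check mentioned in the SEO bots item, as a sketch; the docroot, the log path, and the 'SomeSeoBot' useragent are all placeholders for your own setup:

# disallow one suspect useragent entirely, and hide admin paths from everyone
cat > /var/www/html/robots.txt <<'EOF'
User-agent: SomeSeoBot
Disallow: /

User-agent: *
Disallow: /admin
EOF
# wait a few days, then count how often that useragent still hit the disallowed path
grep "SomeSeoBot" /var/log/apache2/access.log | grep -c '"GET /admin'

A legitimate bot disappears from the disallowed paths; a fake one keeps requesting them.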
Then there are non-bot things like DDoS attacks, which usually use zombie-net PCs or hijacked servers (see the blackhat bots item above; this is one reason they spider: to find sites running software with known security issues or zero-days. You can usually tell when a security issue has reached the blackhat world simply by watching for requests for certain common software URLs). Hijacked servers are a premium product in the underworld because they tend to have much more bandwidth and CPU/RAM than a regular personal PC.
This is not of course an exhaustive list, just a sample of various bots and the kind of thing they look for.
Note that some legitimate search bots may act like bad bots if configured badly. Again, you can give directives in robots.txt and see whether they obey them; if they don't, you can block them, and if they don't respect the blocks, you have a variety of choices.
The reason there's such heavy bot activity out there is that most site owners are simply unequipped to be webmasters for their sites; they are basically sitting ducks, running badly written software with bad security, or code that fails to protect against common attack vectors like SQL injection, so there's a significant reward for the constant bot scanning. Obviously you can mitigate this a bit by always updating the software you use, or, better yet, by eliminating unnecessary and notoriously insecure tools like phpMyAdmin in the first place, which removes that entire attack surface. Or at least password protect the folder containing its files.
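A minimal sketch of that last option, assuming Apache with mod_auth_basic and AllowOverride enabled, and phpmyadmin living under /var/www/html/phpmyadmin (all paths and the username are placeholders):

# create a password file (prompts for the password)
htpasswd -c /etc/apache2/.phpmyadmin_passwd someadmin
# require that login for anything in the phpmyadmin directory
cat > /var/www/html/phpmyadmin/.htaccess <<'EOF'
AuthType Basic
AuthName "Restricted"
AuthUserFile /etc/apache2/.phpmyadmin_passwd
Require valid-user
EOF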
I had a client who had the terrible idea of running their own IIS webserver, which was at the time so radically insecure that, as in the OP's case, the second I put it online I saw instant bot probes for common IIS access points, even though there had never been a link to the IP. I told my client the next morning that he had to shut down the IIS instance and give up, since he was never going to be able to maintain a secure local system; luckily for me, he did.
I use grep, sed, and awk for most of my analysis; I'm sure there are GUI tools for some of this, but I've never needed them.
You're not examining tens of thousands of accesses in your logfiles by hand, by the way; you're searching for patterns with tools made for that job (awk/sed/grep), then seeing whether the patterns you find can be handled with programmed responses. There are also firewall tools that can be configured to block requests matching certain rules, but those are harder to configure correctly.
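A couple of the kinds of pattern searches I mean, again assuming a combined-format log at /var/log/apache2/access.log:

# requests for admin/tool URLs nothing on the site links to = probe traffic (ties back to the blackhat bots item)
grep -Ei 'wp-login|xmlrpc\.php|phpmyadmin' /var/log/apache2/access.log | awk '{print $1, $7}' | sort | uniq -c | sort -rn | head
# most common useragents, useful for spotting rotating fakes and badly configured spiders
awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20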