This is a topic near and dear to my heart; I've battled bots for ages. My conclusion is that your best strategy, though not very gratifying, is always to simply absorb the traffic: detect that it is a bot, then neutralize any further action on its part without giving any programmatically obvious indication that you have done so. Obvious would be sending a large file in response to an HTML page request, or serving a page telling them they are bad and not allowed (a mistake I made for ages, not realizing that it's software reading the response, not a person). Programmatically non-obvious means that to automated bot spidering software, the response looks normal and does not raise red flags internally.
To understand what programmatically non-obvious pages look like, spend some time examining, for example, an SEO-constructed spam website that you reach through a Google search. Those pages are designed to work around Google's bot filters, that is, they do not trigger Google's internal spidering red flags, which is why they were offered to you as a SERP result. Bad bots are not very complicated programs (speed and efficiency matter more to bot operators than sophistication), so their ability to 'read' a page or response tends to be quite basic, and it's not very hard to give them something that looks like a normal response.
Some of the responses here point to a problem whose severity is easy to underestimate until you've personally experienced it: the resources, and the possible vindictiveness, of bot masters. If you engage in the type of tricks contemplated here, only a few things will happen:
- You use server resources for no reason and no gain.
- You aggravate the bot operators, who can then add you to globally distributed lists, essentially databases of sites to target heavily. Sites usually land on those lists because they are good targets, but operators can just as easily add yours out of spite. This has happened to me, and it took me months to program around the attacks, because while they resemble a DDoS, they aren't actually that; they are simply various bot operators working from those master lists.
I've at times tried sending large files to violators, and it's not just a waste of time, it's generally counterproductive.
There are also some fundamental misconceptions here about how bot software works: it's often quite simple and designed for rapid action, so the odds are quite good that as soon as a size limit is hit, it just stops and moves on.
To clear up some common misconceptions about bots and how they work, let's construct a simple one:
wget --spider -t 1 --read-timeout=5 -U "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0" [target url]
This is the initial URL-database-builder component of the bot. It's only requesting the HEAD, that is, checking whether the file exists, which generates a server response code. As some people here correctly indicated, one excellent strategy is to use this step to send a 404 once the bot has been detected. The bot is software, not a person, so it will merely shrug its virtual shoulders and think: fine, page doesn't exist, next URL. This is a very good strategy, and the one least likely to raise red-flag alerts. Note that the request is also telling the server it's Firefox on Windows 10, though it could be anything else the operator chose to put there; fake useragent lists that are switched or rotated are a common feature of decently designed bad bots.
However, serving the 404 still takes some server resources, particularly if your 404 page is constructed by a CMS. The default server 404 is light, but ugly for real users who might also hit it.
A 301 just sends the request elsewhere. Whatever you do, do not redirect it to a known anti-spam/anti-bot website; the operators know those sites as well as or better than you do. Just send it somewhere that doesn't exist.
Bots do not, however, respect 301s with any consistency; sometimes they follow them, sometimes they don't, depending on their programming.
So there are some decisions to make about how to handle a bot at the spidering stage. A 404 is quite a good solution, though make sure to test the actual server response codes so that you know you are sending a real 404, not a redirect to a 404 page, which most bots won't register as a failed request.
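To check what the bot actually sees, look at the raw status lines rather than the rendered page. A quick sketch from the command line; the URL is just a placeholder for one of your own blocked pages:

# print the raw status lines the spider receives (wget writes headers to stderr, hence 2>&1)
wget --spider --server-response -t 1 "http://www.example.com/blocked-test-page" 2>&1 | grep "HTTP/"
# same check with curl: print just the first status code and discard the body
curl -s -o /dev/null -w "%{http_code}\n" "http://www.example.com/blocked-test-page"

If the first status line is a 301/302 followed by a 200, the bot saw a successful page, not a dead end.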
Now we can grab the file:
wget -t 1 -Nc --read-timeout=5 -U "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0" [target url]
Note that we're again setting a short timeout, which defeats the big-file idea and makes it a total waste of your server resources, and again sending a fake useragent.
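Rotating useragents per request is just as trivial; a sketch, where useragents.txt and urls.txt are hypothetical files the bot operator maintains:

# pick a random useragent from a local list for every request (both files are hypothetical)
while read -r url; do
  ua=$(shuf -n 1 useragents.txt)
  wget -t 1 --read-timeout=5 -U "$ua" "$url"
done < urls.txt

Which is why useragent checks on their own prove very little.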
Obviously, real bots will generally be written to be far more efficient than firing off wget or curl requests, but the ease of working around most of the ideas here should be obvious, given that I'm writing my 'bot' with wget and its built-in options; any decent bot software will be more sophisticated than that, and easier to use.
Any self-respecting bot author will of course take the time to google threads like this one to see what people are currently doing, and simply add protections against those ideas to their spider/downloader, which is why you don't use them. Bot authors are often serious, competent professionals whose work focuses on successful bot activity; this is their job, so it's worth respecting them enough not to aggravate them. Doing so is like jaywalking in front of a policeman: a pointless provocation.
For the downloading phase, you can again use the detection to simply serve fake pages with no content, or anything else that puts very little load on the server. Whatever you do, you don't want to spend a lot of resources giving them fake content.
There are also a lot of fairly silly ideas floating around the internet that mostly show the lack of experience of the people floating them, for example, spending time blocking IP addresses. Most bots run from effectively random IP addresses: a botnet, datacenters, temporary AWS instances/IPs set up and then discarded. Whatever the source, there are few less productive ways to block bots than by blocking IPs. This will become even more true as IPv6 becomes standard, with its ability to reassign public IP addresses based on timers, schedules, or device configuration.
Analyzing IP data can still be useful for knowing what you are dealing with; maybe most requests come from Romania, or China, and that tells you something. But operators can, and do, switch from one country to another and from one datacenter/botnet to another, so IP analysis is more of a good backend tool that shows you what type of bot issue you have.
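For that kind of backend look, a one-liner over the access log is usually enough; a sketch assuming a standard combined-format log at /var/log/apache2/access.log (adjust the path to your setup):

# top 20 requesting IPs, busiest first; feed the worst offenders into a geo lookup tool if you want the country picture
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20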
It's always a good strategy to respect the intelligence of the people who write spidering tools (i.e., bots), not to assume they are stupid, and not to antagonize the users of that software needlessly. Just do your business: handle the bot, then let it move on to the next site without registering any issues on yours.
Also, it's important to understand that there's no such thing as 'bots' in general; what you usually face are single-purpose bots. A bot is a piece of software that in almost all cases automatically requests the URLs in its database; it may also build that database of URLs itself, or it may use commercially available ones. Some bots automatically scan entire IP ranges of the internet, just looking for things; others are targeted and, like search bots, follow links. Note that if you have not disallowed all admin-type pages/directories in robots.txt, you can't tell which bot is OK and which isn't, so that's the first step to take. You don't need the whole file path, just enough of the start of it to make it unique as a path, like: /admin. The usual varieties:
- search bots, legitimate
- SEO bots, usually pretending to be legitimate. Disallowing their useragents in robots.txt and then analyzing log files for subsequent requests will show you which are real and which are scum (see the sketch after this list).
- contact page autosend/fill bots
- blog autopost bots
- forum signup/post bots
- blackhat bots searching for common, easily exploitable application URLs belonging to often insecure and/or unupdated tools like WordPress, phpMyAdmin, Drupal, etc. These can usually be spotted in your stats by watching for requests for URLs that nothing on your site links to.
- site uptime bots; see the SEO bots item above for how to check whether they're real. Basically, if you didn't sign up for the service, it's gray- or blackhat.
- experimental bots, new search engines, web services, sometimes ok, sometimes badly written, usually not malicious but worth examining.
- indexing bots, which may for example be selling indexes of sites to various other bot operators.
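Here is the robots.txt-then-logfile check mentioned in the SEO bots item, as a sketch; the docroot, the log path, and the 'SomeSeoBot' useragent are all placeholders for your own setup:

# disallow one suspect useragent entirely, and hide admin paths from everyone
cat > /var/www/html/robots.txt <<'EOF'
User-agent: SomeSeoBot
Disallow: /

User-agent: *
Disallow: /admin
EOF
# wait a few days, then count how often that useragent still hit the disallowed path
grep "SomeSeoBot" /var/log/apache2/access.log | grep -c '"GET /admin'

A legitimate bot disappears from the disallowed paths; a fake one keeps requesting them.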
Then there are non-bot things like DDoS attacks, which usually use zombie-net PCs or hijacked servers (see the blackhat bots item above; this is one reason they spider: to find sites running software with known security issues or zero-days. You can usually tell when a security issue has reached the blackhat world simply by watching for requests for certain common software URLs). Hijacked servers are a premium product in the underworld because they tend to have much more bandwidth and CPU/RAM than a regular personal PC.
This is not of course an exhaustive list, just a sample of various bots and the kind of thing they look for.
Note that some legitimate search bots may act like bad bots if configured badly. Again, you can give directives in robots.txt and see whether they obey them; if they don't, you can block them, and if they don't respect the blocks, you have a variety of choices.
The reason there's such heavy bot activity out there is that most site owners are simply unequipped to be webmasters for their sites; they are basically sitting ducks, running badly written software with bad security, or code that fails to protect against common attack vectors like SQL injection, so there's a significant reward for the constant bot scanning. Obviously you can mitigate this a bit by always updating the software you use, or, better yet, by eliminating unnecessary and notoriously insecure tools like phpMyAdmin in the first place, which removes that entire attack surface. Or at least password protect the folder containing its files.
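A minimal sketch of that last option, assuming Apache with mod_auth_basic and AllowOverride enabled, and phpmyadmin living under /var/www/html/phpmyadmin (all paths and the username are placeholders):

# create a password file (prompts for the password)
htpasswd -c /etc/apache2/.phpmyadmin_passwd someadmin
# require that login for anything in the phpmyadmin directory
cat > /var/www/html/phpmyadmin/.htaccess <<'EOF'
AuthType Basic
AuthName "Restricted"
AuthUserFile /etc/apache2/.phpmyadmin_passwd
Require valid-user
EOF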
I had a client who had the terrible idea of running their own IIS webserver, which was at the time so radically insecure that, as in the OP's case, the second I put it online I saw instant bot probes for common IIS access points, even though there had never been a link to the IP. I told my client the next morning that he had to shut down the IIS instance and give up, since he was never going to be able to maintain a secure local system; luckily for me, he did.
I use grep, sed, and awk for most of my analysis; I'm sure there are GUI tools for some of this, but I've never needed them.
You're not examining tens of thousands of accesses in your logfiles by hand, by the way; you're searching for patterns with tools made for that job (awk/sed/grep), then seeing whether the patterns you find can be handled with programmed responses. There are also firewall tools that can be configured to block requests matching certain rules, but those are harder to configure correctly.
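A couple of the kinds of pattern searches I mean, again assuming a combined-format log at /var/log/apache2/access.log:

# requests for admin/tool URLs nothing on the site links to = probe traffic (ties back to the blackhat bots item)
grep -Ei 'wp-login|xmlrpc\.php|phpmyadmin' /var/log/apache2/access.log | awk '{print $1, $7}' | sort | uniq -c | sort -rn | head
# most common useragents, useful for spotting rotating fakes and badly configured spiders
awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20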