There are a few angles to pursue here.
The user-agent string is one signal, but it can be trivially spoofed.
A reasonably useful heuristic I've found is to do a bit of pre-processing, then look at the traffic patterns:
Parse out your access logs, adding host, ASN, CIDR, and ASN-name information. Reduce URLs to their non-variant part (generally by stripping everything past '?', though YMMV). If you've got specific search or utility pages, focus on those (typically I've seen problems with bots hitting either search or some sort of user-verification service).
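To make that concrete, here's a minimal sketch of the pre-processing step in Python, assuming a combined-format access log and a local IP-to-ASN database built with pyasn's bundled utilities (the 'ipasn.dat' filename is a placeholder; adjust to your own setup):

```
# Sketch: parse combined-format access logs, keep only the non-variant part
# of the URL (everything before '?'), and annotate each hit with ASN/prefix.
# Assumes an offline IP-to-ASN database built with pyasn's utility scripts;
# 'ipasn.dat' is a placeholder name.
import re
import sys
import pyasn

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"'
)

asndb = pyasn.pyasn('ipasn.dat')   # placeholder path to the ASN database

def parse_line(line):
    m = LOG_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec['url'] = rec['url'].split('?', 1)[0]     # strip query string
    try:
        asn, prefix = asndb.lookup(rec['ip'])
    except ValueError:
        asn, prefix = None, None
    rec['asn'], rec['prefix'] = asn, prefix
    return rec

if __name__ == '__main__':
    for line in sys.stdin:
        rec = parse_line(line)
        if rec:
            print(rec['ip'], rec['asn'], rec['prefix'], rec['url'], sep='\t')
```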
Look for single IPs with high volumes of traffic.
Look for single CIDR blocks or ASNs with high volumes of traffic.
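Given the tab-separated output of the parsing step above, those two volume checks are just frequency counts; something along these lines (the field order is the one assumed in the earlier sketch):

```
# Sketch: rank request volume by single IP, by ASN, and by CIDR prefix,
# reading the tab-separated records (ip, asn, prefix, url) on stdin.
import sys
from collections import Counter

by_ip, by_asn, by_prefix = Counter(), Counter(), Counter()

for line in sys.stdin:
    ip, asn, prefix, url = line.rstrip('\n').split('\t')
    by_ip[ip] += 1
    by_asn[asn] += 1
    by_prefix[prefix] += 1

for label, counter in (('IP', by_ip), ('ASN', by_asn), ('CIDR prefix', by_prefix)):
    print('== top 20 by %s ==' % label)
    for key, count in counter.most_common(20):
        print('%8d  %s' % (count, key))
```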
Rule out legitimate search traffic (Google, Bing, Yahoo, Baidu, Facebook, and similar bots / network space). This is probably going to be one of your larger areas of ongoing maintenance, as this stuff changes all the time.
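For the big engines, Google and Bing both document a reverse-then-forward DNS check for verifying that a hit claiming to be their crawler really comes from their network. A rough sketch follows; the suffix list is illustrative, not exhaustive, and does drift over time:

```
# Sketch: reverse-then-forward DNS verification of search-engine crawlers.
# The hostname suffixes here are examples (Google and Bing publish theirs);
# keep your own list current.
import socket

BOT_SUFFIXES = ('.googlebot.com', '.google.com', '.search.msn.com')

def is_verified_crawler(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]        # reverse lookup
    except OSError:
        return False
    if not host.endswith(BOT_SUFFIXES):
        return False
    try:
        # Forward-confirm: the claimed hostname must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

print(is_verified_crawler('66.249.66.1'))   # a Googlebot address at the time of writing
```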
Rule out legitimate user traffic, especially from high-volume users of your site.
Identify what normal usage patterns look like, for both end users and search bots. If a typical user visits 1-3 pages per minute with a typical session of 5-10 minutes, Googlebot limits itself to, say, 10 searches per minute, and you suddenly see a single IP or CIDR block lighting up with hundreds or thousands of searches per minute, you may have found your problem.
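A quick way to surface those outliers is to bucket requests per IP per minute straight off the raw log; the threshold below is an arbitrary placeholder you'd tune against your own baseline:

```
# Sketch: per-IP requests-per-minute, to compare against "normal" rates.
# Reads the raw access log on stdin; flags any (IP, minute) bucket over a
# threshold you'd pick from your own traffic (60/min here is arbitrary).
import re
import sys
from collections import Counter

THRESHOLD = 60   # requests per IP per minute; placeholder value

line_re = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\]')
buckets = Counter()

for line in sys.stdin:
    m = line_re.match(line)
    if m:
        minute = m.group('ts')[:17]          # 'dd/Mon/yyyy:HH:MM'
        buckets[(m.group('ip'), minute)] += 1

for (ip, minute), count in buckets.most_common():
    if count < THRESHOLD:
        break
    print('%5d req/min  %-15s  %s' % (count, ip, minute))
```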
Investigate the origins of high-volume / high-impact (in a negative sense) traffic. Frequently a WHOIS query will reveal that this is some sort of hosting space -- not typically where you'll see a lot of legitimate user traffic. Patterns in user-agent strings, request URLs, referrer strings, etc., may also tip you off to further signatures worth filtering on.
A caching whois client can be a big help if you end up doing a lot of WHOIS lookups, both to speed the process and to avoid rate-limiting/throttling by registrars (for some reason, they don't take kindly to entities conducting thousands of repeat/automated lookups). You may be able to go to registrars directly for more information, though I haven't pursued this.
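A dirt-simple version of such a caching client just shells out to the system whois once per query and keeps the answer on disk; the cache location and layout here are arbitrary:

```
# Sketch of a minimal caching WHOIS wrapper: calls the system `whois` client
# once per query and stores the answer on disk, so repeat lookups don't
# hammer the registries. Cache directory name and layout are arbitrary.
import os
import hashlib
import subprocess

CACHE_DIR = os.path.expanduser('~/.whois-cache')

def cached_whois(query):
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha1(query.encode()).hexdigest())
    if os.path.exists(path):
        with open(path) as fh:
            return fh.read()
    out = subprocess.run(['whois', query], capture_output=True, text=True).stdout
    with open(path, 'w') as fh:
        fh.write(out)
    return out

print(cached_whois('192.0.2.1'))   # TEST-NET address, purely for illustration
```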
Checks against various reputation databases (spam lookups, SenderBase, and there's now some Google tooling along these lines) may also corroborate that you're looking at poorly-policed network space.
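For the spam side, a plain DNSBL query is easy to wire in; Spamhaus ZEN is shown below as one example (note that Spamhaus tends to refuse queries arriving via big public resolvers, so heavy use means your own resolver or their data feeds):

```
# Sketch: a basic DNSBL lookup against Spamhaus ZEN as one example of a
# reputation check (IPv4 only for brevity). An NXDOMAIN answer means
# "not listed"; a 127.0.0.x answer encodes the listing type.
import socket

def dnsbl_listed(ip, zone='zen.spamhaus.org'):
    query = '.'.join(reversed(ip.split('.'))) + '.' + zone
    try:
        return socket.gethostbyname(query)      # listed: returns a 127.0.0.x code
    except socket.gaierror:
        return None                             # NXDOMAIN: not listed

print(dnsbl_listed('127.0.0.2'))   # standard DNSBL test address, should always be listed
```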
I'd love to say I've got something to sell you along these lines, but what I'm working with is mostly some awk and other tools to pull this together. It'll parse a million lines of log in a minute or so (plus a bit of up-front overhead to build lookup hashes for IP and ASN/CIDR information) on a modest workstation. Not fully automated, but it'll give me a decent picture of an issue with a few minutes of work.