
I have one site that goes off the rails whenever it gets hit by a spider. Normally everything seems fine. We have a Nagios monitor that reports back when CPU goes over 80%.

When we get the warnings, I start watching the logs via sudo tail -f access_log. Most of the time, it's a spider.
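
For instance, to see which user agents are generating the most traffic while this is happening (this assumes the combined log format and a log at /var/log/httpd/access_log, so adjust the path and field for your setup):

awk -F'"' '{print $6}' /var/log/httpd/access_log | sort | uniq -c | sort -rn | head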

It seems to get stuck on one URL that the spider hits with an endless stream of query string values.

What I've tried:

I've since put Disallow: *?* in robots.txt.
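
For reference, a fuller robots.txt along those lines might look like the sketch below. Note that wildcard matching in Disallow and the Crawl-delay directive are extensions that only some crawlers honor, and badly behaved bots ignore robots.txt entirely:

User-agent: *
Crawl-delay: 10
Disallow: *?*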

Current top reads:

[screenshots of top output]

Question:

Are there other methods I could use to tell spiders to calm down on our site? For the high-memory httpd processes, can I tell which pages they are serving, so I can isolate the trouble spots on this site?

That is, how do I find and isolate the troublemaker?

Edit: We're running Apache 2.2.15 on RHEL 6.8 with memcache.

# apachectl -V
Server version: Apache/2.2.15 (Unix)
Server built:   Feb  4 2016 02:44:09
Server loaded:  APR 1.3.9, APR-Util 1.3.9
Compiled using: APR 1.3.9, APR-Util 1.3.9
Architecture:   64-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)
Rick

1 Answer


You can try using lsof to list the files held open by the apache process:

lsof -p PID
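
For example, to find the heaviest httpd processes first and then inspect one of them (the sort key and PID are placeholders; sort by pcpu instead if CPU is the concern):

ps -eo pid,pcpu,pmem,comm --sort=-pmem | grep httpd | head
lsof -p <PID>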

Checking the apache logs for errors that correspond to the timestamps of the spider crawl in your access logs is also a good idea.
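
For example, if the access log shows a crawl spike around a given minute, something like this (log paths and the timestamp are placeholders, and the two logs use different date formats) can line the two up:

grep '04/Feb/2016:02:44' /var/log/httpd/access_log | tail
grep 'Feb 04 02:44' /var/log/httpd/error_log | tail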

I also like using goaccess to help parse the log data and pull out useful information:

http://www.hackersgarage.com/goaccess-on-rhelcentos-6-linux-real-time-apache-log-analyzer.html
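
Once installed, a minimal run against the access log looks roughly like this (the path is an assumption, and depending on the goaccess version you may be asked to choose the log format interactively):

goaccess -f /var/log/httpd/access_log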

strace and ltrace are also excellent utilities you may want to consider using to help troubleshoot.
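
As a rough sketch, both can be attached to one of the busy httpd PIDs (the PID is a placeholder, and attaching adds overhead, so detach with Ctrl-C when done):

strace -p <PID> -s 128
ltrace -p <PID>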

wilbo