We have a high-traffic website: at peak it has 1000 concurrent users, and at minimum it has 100 users at the same time. On average it gets 40,000 to 100,000 visits a day. The problem is that sometimes it loads very slowly (we call this the disaster time :) ). During that time, when we try to load the website with Firefox, it just shows "waiting..." (I tried it from many providers around the world).

We monitor the server at disaster times: CPU load and memory usage are normal. The MySQL slow query log doesn't show any query taking more than 1 sec. Apache doesn't log any errors. iotop doesn't show anything that could cause this disaster.

It is very interesting that disaster times and peak times have no relation. Sometimes a disaster happens at 300 concurrent users, other times at a completely different load; I can't find any relation between them.

How can I trace the packets at disaster time? I want to know whether this disaster is our data center's fault (such as the upstream or firewall) or our server's fault (such as the Apache configuration, the web application, or anything else that I don't know about).

If you need additional data, just add a comment and I'll edit my question to provide it.

superuser
    To rule out the external network/firewall, set up a health check process (doing simple HTTP requests every few seconds) on the same box or on the same local network (or both) and see if it slows down or fails during the disaster times. If it does then the problem is local and you can probably fix it yourself. If it doesn't then the problem is outside of your control and you will have to ask your hosting provider for help. – Ladadadada Jul 31 '13 at 08:08
  • Make sure that it's not basic tuning parameters. On some systems the defaults for the number of processes that can be spawned are set very low. Check, for instance, the value of `ServerLimit` – Jensd Jul 31 '13 at 12:57
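
Ladadadada's health check could be sketched as a small shell script (the URL, interval, and log path here are assumptions; adjust to taste):

```shell
#!/bin/sh
# check_once fetches the given URL once and prints a timestamp plus the
# total time curl spent on the request.
check_once() {
    t=$(curl -s -o /dev/null -w '%{time_total}' --max-time 30 "$1")
    echo "$(date '+%Y-%m-%d %H:%M:%S') $t"
}

# Sample every 5 seconds into a log you can inspect after a disaster:
# while true; do check_once http://localhost/ >> /tmp/healthcheck.log; sleep 5; done
```

Run one copy on the web server itself and a second copy from another machine; if the local one stays fast while the remote one stalls during a disaster, the problem is in the network path rather than on the box.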

2 Answers


The number of concurrent users / visits has nothing to do with the capacity/performance of the system - it's all about concurrent connections and what those requests are doing.

Adding request response times to your server log would be a start - if these don't reflect the problem then the problem is likely on the network. I notice you make no reference to your webserver logs in your question - did you check them?

You say that you have high traffic volumes, yet your question implies you only have a single server. Why? (Multiple servers would add complications to this specific problem, such as load distribution, but would also simplify much of the diagnostics; in any case they are a no-brainer for performance and availability.)

Tracking the number of connections and their state also provides essential data in diagnosing the problem.
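
For example, a quick way to snapshot that (a sketch; `ss` ships with iproute2, substitute `netstat -ant` on older systems):

```shell
# Count TCP connections by state.  A pile-up of SYN-RECV or CLOSE-WAIT
# during a disaster tells a very different story than a high but
# healthy ESTAB count.
ss -ant | awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }'
```

Running this from cron every minute gives you a baseline to compare the disaster-time numbers against.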

How can I trace the packets at disaster time?

With a packet capture program - this can run anywhere from the client to the server. I use Wireshark (available on Linux, Windows and others).
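
Because the slowdowns are intermittent, a ring-buffer capture that you leave running is more practical than a one-off trace. A sketch (the interface name, file path, and sizes are assumptions):

```shell
# Keep a rolling window of 10 files x 100 MB so a long-running capture
# doesn't fill the disk; old files are overwritten automatically.
capture_ring() {
    tcpdump -i "${1:-eth0}" -s 0 -C 100 -W 10 -w /var/tmp/disaster.pcap port 80
}
```

Run it as root; when a disaster happens, note the time and open the capture files covering that window in Wireshark, looking for retransmissions, zero-window stalls, or SYNs that never get answered.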

It would have been useful if you'd mentioned which version/MPM your server is using and what OS it is running on.

symcbean

If you're using Linux, you could use tcpdump, e.g.:

$ tcpdump dst port 80

But I don't think that would help much. I would try to eliminate as many variables as possible. My first thought is that it may be a network issue.

Try creating an Apache log with response times, like so:

LogFormat "\"%{%Y-%m-%d %H:%M:%S}t\" %V %m \"%U\" \"%q\" %{Content-Type}o %s %B %O %D" responsetime
CustomLog "/var/log/apache2/responsetime.log" responsetime
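
Once that log has covered a disaster window, the `%D` field (response time in microseconds, the last field in the format above) lets you pull out the slow requests. A sketch (the default log path is an assumption):

```shell
# Print log lines whose last field (%D, microseconds) exceeds 1 second.
# The field position assumes the LogFormat above; adjust if you change it.
slow_requests() {
    awk '$NF > 1000000' "${1:-/var/log/apache2/responsetime.log}"
}
```

If the slow requests show up here, the server itself is the bottleneck; if the log looks fine while clients still see "waiting...", suspect the network in front of it.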

Then, try hitting the web server from a machine/server on the same switch.

If that seems normal, try something like `time wget http://localhost/index.html -q --output-document=/dev/null` to do it on the same box.

Belmin Fernandez