I'm a project manager working on a complex web application that is deployed on three clusters (servers) in different parts of the world. The code is the same on every cluster.

However, the page load time reported by Google Analytics (GA) differs widely from cluster to cluster: it ranges from 2.5 seconds on cluster A (which is fine) to 6 seconds on cluster B (which is far beyond the company's SLA).

We've set up Nagios HTTP checks as a second measurement, and their numbers are very close to the Google page load times.

Our admins troubleshoot this issue with regular tcptraceroute runs, which show times from 0.5 to 1.8 seconds, and on that basis they conclude that there are no issues with the network or the servers.

The questions are:

1) Is a tcptraceroute check relevant for troubleshooting this kind of issue?

2) Is there any other way to troubleshoot page load time from the admins' side?

3) My main argument for the admins to keep investigating is that page load time is 2.5 seconds on one cluster and 6 seconds on another (in both the GA and Nagios checks). Isn't that enough to justify further investigation?

Thanks, and sorry if such vague questions have offended anybody.

Pasha
  • It's not clear whether you are seeing the long page load on just one cluster or on all of the clusters. – user9517 Nov 22 '16 at 08:45
  • On cluster A it's 2.5 seconds (which is OK); on cluster B it's 6 seconds (which is not OK). – Pasha Nov 22 '16 at 08:48

1 Answer

Is a tcptraceroute check relevant for troubleshooting this kind of issue?

It may be useful for checking the network path between your monitoring point and the remote server.
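A network path check covers only part of a request, though. To see how much of the time is network and how much is the server generating the page, a single request can be split into phases. Below is a minimal sketch, assuming the site answers plain HTTP on port 80 (no TLS); the function names `phase_breakdown` and `time_http_request` are made up for illustration.

```python
import socket
import time


def phase_breakdown(t_start, t_connected, t_first_byte, t_done):
    """Turn four timestamps (in seconds) into per-phase durations."""
    return {
        "connect_s": t_connected - t_start,          # DNS lookup + TCP handshake
        "server_wait_s": t_first_byte - t_connected,  # request sent, waiting for first byte
        "download_s": t_done - t_first_byte,          # rest of the response body
        "total_s": t_done - t_start,
    }


def time_http_request(host, port=80, path="/", timeout=10.0):
    """Fetch one URL over plain HTTP and report where the time went."""
    t_start = time.perf_counter()
    # create_connection resolves the name and completes the TCP handshake.
    sock = socket.create_connection((host, port), timeout=timeout)
    t_connected = time.perf_counter()
    try:
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        sock.sendall(request.encode("ascii"))
        sock.recv(4096)  # blocks until the server starts answering
        t_first_byte = time.perf_counter()
        while sock.recv(65536):  # drain until the server closes the connection
            pass
        t_done = time.perf_counter()
    finally:
        sock.close()
    return phase_breakdown(t_start, t_connected, t_first_byte, t_done)
```

Running `time_http_request(...)` against each cluster from the same monitoring host gives comparable numbers. If `connect_s` is similar on both clusters but `server_wait_s` is much larger on the slow one, the network is unlikely to be the cause. Note that this times a single HTTP request, not a full browser page load with all its assets.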

Is there any other way to troubleshoot page load time from the admins' side?

The scientific method is your friend; I've written about it here.

I would start this investigation by first confirming that both clusters are exactly alike. If they are not, I would seek to understand how they differ, what effects those differences have, and whether they could affect the application in the manner observed.
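One way to make that comparison systematic is to collect an inventory from each cluster (package versions, configuration values and so on) and diff them. The sketch below assumes the inventories have already been gathered into dictionaries; the keys and values shown are hypothetical and only illustrate the shape of the data.

```python
def diff_inventories(a, b):
    """Return {key: (value_on_a, value_on_b)} for every key whose values differ."""
    missing = "<absent>"
    diffs = {}
    for key in sorted(set(a) | set(b)):
        left = a.get(key, missing)
        right = b.get(key, missing)
        if left != right:
            diffs[key] = (left, right)
    return diffs


# Hypothetical inventories, for illustration only.
cluster_a = {"kernel": "4.4.0", "php": "7.0.8", "db_pool_size": "50"}
cluster_b = {"kernel": "4.4.0", "php": "7.0.4", "opcache": "off"}

for key, (on_a, on_b) in diff_inventories(cluster_a, cluster_b).items():
    print(f"{key}: A={on_a}  B={on_b}")
```

Each line of output is then a candidate explanation to test, one difference at a time.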

I would check the logs to see whether the application, cluster, OS, etc. is logging anything interesting.

I would set up resource monitoring on the cluster servers and see whether anything looks untoward.

This is general, basic system administration and diagnostics, really.

I would also get the developers involved in this. Perhaps they need to instrument the application so that it can provide better diagnostic and logging information. For example, if the application talks to a database, maybe they need to be able to log how long each query took.
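As a rough idea of what such instrumentation could look like, here is a minimal sketch of a timing wrapper, assuming a Python application. The names `timed`, `slow_after_s` and `sink` are hypothetical; a real application would use whatever its framework or database driver already offers.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("app.timing")


@contextmanager
def timed(label, slow_after_s=0.5, sink=None):
    """Log how long the wrapped block took; warn if it exceeds slow_after_s."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        level = logging.WARNING if elapsed >= slow_after_s else logging.DEBUG
        log.log(level, "%s took %.3f s", label, elapsed)
        if sink is not None:
            # Optionally collect (label, seconds) pairs for later aggregation.
            sink.append((label, elapsed))


# Usage: wrap the call that runs the query (time.sleep stands in for it here).
timings = []
with timed("load_user_orders", sink=timings):
    time.sleep(0.01)
```

With the same labels logged on every cluster, the slow step on cluster B should stand out when the logs are compared side by side.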

user9517