We've got a pretty large MSMQ environment set up, which today decided to grind to a halt.
(Everything is a VM under vSphere 4.0 Update 1)
There are 8 web servers which receive data from clients on the net. These machines all have MSMQ installed and simply send each MSMQ message on to the main MSMQ server. Messages are currently piled up in the outbound queues. These machines are Windows 2008 Web Edition with 2 GB of RAM and 2 vCPUs.
We have a Clustered MSMQ server (Windows Cluster Server) which gets the messages from the 8 web servers. There is no limit on the amount of data that can be in the queues. The hard drive is 50 Gigs, and there is 46 Gigs of free space. These machines are Windows 2008 Enterprise Edition with 8 Gigs of RAM and 4 vCPUs. The cluster used to have 2 vCPUs but the CPU load was hitting 100%, so I increased both nodes of the Windows cluster to 4 vCPUs.
There are 4 app servers which read the messages from the queues and process them.
Normally this all works perfectly, but not today.
This morning everything is running very slowly. The 8 web servers are currently showing up to 300k messages sitting in the outbound queues. The clustered server currently shows over a million messages in the queues (some are as low as 200k).
If I look at perfmon on the 8 web servers, it shows that I'm averaging 2 messages sent per second. If I look at perfmon on the cluster, it shows ~7 messages per second coming into the cluster.
The machines which are doing the reading aren't getting many messages each. The fastest services are getting 10-12 messages per second, the slowest are showing 0 or 1.
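To put those rates in perspective, here is a rough back-of-the-envelope sketch of how long the backlogs would take to clear. It assumes no new messages arrive and that the rate stays constant, neither of which is true in practice, and the function name `drain_hours` is just something made up for illustration. Pairing the million-message queue with the fastest reader (12 messages per second) is also an assumption.

```python
def drain_hours(backlog_messages: int, rate_per_second: float) -> float:
    """Hours needed to empty a queue at a constant rate, assuming nothing new arrives."""
    if rate_per_second <= 0:
        raise ValueError("rate must be positive")
    return backlog_messages / rate_per_second / 3600


# Worst web server: 300k messages in the outbound queue, sent at 2 messages/sec.
web_server_hours = drain_hours(300_000, 2)

# Largest cluster queue: ~1 million messages, read at 12 messages/sec
# (assumption: the fastest reader is the one draining this queue).
cluster_hours = drain_hours(1_000_000, 12)

print(f"web server outbound queue: {web_server_hours:.1f} hours")  # 41.7 hours
print(f"largest cluster queue:     {cluster_hours:.1f} hours")      # 23.1 hours
```

So even in the best case, the backlog is the better part of a day or two deep at the current rates.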
The only recent change is that we increased the number of front-end web servers from 4 to 8. We did this about 2 weeks ago without issue. On Tuesday we powered the four newer machines down to see how the remaining 4 could handle the load. On Wednesday we turned the four newer machines back on.
The disk on the cluster shows very low IO and no queueing.
To be safe I've updated PowerPath to the newest version but that hasn't helped any.
The 8 web servers are on one VLAN, and the clustered servers and the app servers are on a second VLAN. There are no firewalls between the VLANs.
And there is nothing useful in the application or system logs on any of the machines.