8

We've got a pretty large MSMQ environment setup which today decided to grind to a halt.

(Everything is a VM under vSphere 4.0 Update 1)

There are 8 Web Servers which receive data from clients on the net. These machines all have MSMQ installed and simply send the MSMQ message to the main MSMQ server. Messages are currently piled up in the outbound queue. These machines are Windows 2008 Web Edition with 2 Gigs of RAM and 2 vCPUs.

We have a Clustered MSMQ server (Windows Cluster Server) which gets the messages from the 8 web servers. There is no limit on the amount of data that can be in the queues. The hard drive is 50 Gigs, and there is 46 Gigs of free space. These machines are Windows 2008 Enterprise Edition with 8 Gigs of RAM and 4 vCPUs. The cluster used to have 2 vCPUs but the CPU load was hitting 100%, so I increased both nodes of the Windows cluster to 4 vCPUs.

There are 4 app servers which read the messages from the queues and process them.

Normally this all works perfectly, but not today.

This morning everything is running very slowly. The 8 web servers are currently showing up to 300k messages sitting in the outbound queues. The clustered server currently shows over a million messages in the queues (some are as low as 200k).

If I look at perfmon on the 8 web servers it shows that I'm averaging 2 messages sent per second. If I look at perfmon on the cluster it shows ~7 messages per second are coming into the cluster.

The machines which are doing the reading aren't getting many messages each. The fastest services are getting 10-12 messages per second, the slowest are showing 0 or 1.

The only recent change is that we increased the number of front end web servers from 4 to 8. We did this about 2 weeks ago without issue. On Tuesday we powered the four newer machines down to see how the remaining 4 could handle the load. On Wednesday we turned the four newer machines back on.

The disk on the cluster shows very low IO and no queueing.

To be safe I've updated PowerPath to the newest version but that hasn't helped any.

The 8 web servers are on one vLAN, and the clustered servers and the app servers are on a second vLAN. There are no firewalls between the vLANs.

And there is nothing useful in the application or system logs on any of the machines.

mrdenny
  • It turns out that the cause of the slow MSMQ reading was actually an application problem. The services which read from the queue then go to files on a file share. The file share started taking longer and longer to respond, which caused the services to run slower, which caused the queues to back up, and now we have a mess. Apparently our user base has grown much faster than planned and we are maxing out one of the RAID groups on the SAN which hosts the file shares. Monday we'll be putting in a rush order for more SAN space with our vendor. – mrdenny Feb 07 '10 at 03:39
  • We didn't see this queue growth ahead of time because our monitoring server is a Windows 2003 server, and Windows 2003 machines can't monitor Clustered Windows 2008 MSMQ Queues remotely. The monitoring server is already scheduled for an upgrade in March. – mrdenny Feb 07 '10 at 03:42
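For anyone following along who wants a basic guard against this kind of unnoticed growth, here is a rough Python sketch of a check over periodic queue-depth samples. How the samples are collected (perfmon or otherwise) is not shown, and the function name, parameters, and thresholds are all hypothetical, so treat it as an illustration rather than what was actually deployed here.

```python
from typing import Sequence


def queue_is_backing_up(
    depths: Sequence[int],
    max_depth: int,
    min_growth_per_sample: float,
) -> bool:
    """Return True if the newest sample exceeds max_depth, or if the
    depth has grown by at least min_growth_per_sample on average.

    depths: queue depth readings, oldest first (hypothetical input).
    """
    if not depths:
        return False
    # Hard ceiling: alert as soon as the newest reading is too deep.
    if depths[-1] > max_depth:
        return True
    if len(depths) < 2:
        return False
    # Average change in depth between consecutive samples.
    avg_growth = (depths[-1] - depths[0]) / (len(depths) - 1)
    return avg_growth >= min_growth_per_sample


if __name__ == "__main__":
    # Illustrative readings only, not measurements from this incident.
    print(queue_is_backing_up([100, 5_000, 20_000], 1_000_000, 1_000))
```

Checking the trend as well as the absolute depth means a queue that normally sits near zero raises an alert long before it reaches hundreds of thousands of messages.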

3 Answers

4

Whenever someone says they have over a million messages the alarm klaxons go off! Messages require kernel (paged pool) memory to be managed. If you have such a vast number of messages, you may be exhausting what is available on the clustered server. The optimal number of messages in a queue is zero: basically, make sure you can normally process messages faster than they arrive.

I would recommend shutting down the web servers and completely processing the backlog of messages before bringing them back online again.
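To get a feel for how long draining a backlog takes, and why stopping the inflow matters, here is a back-of-the-envelope sketch in Python. The function name is made up for illustration, and the rates in the example are only loosely borrowed from the numbers in the question, so the results are rough estimates under the assumption of constant rates.

```python
import math


def drain_time_seconds(
    backlog: int,
    arrival_rate: float,
    processing_rate: float,
) -> float:
    """Seconds until the queue is empty, assuming constant rates.

    Rates are in messages per second. Returns math.inf when messages
    arrive at least as fast as they are processed.
    """
    if backlog <= 0:
        return 0.0
    net_rate = processing_rate - arrival_rate
    if net_rate <= 0:
        return math.inf  # the queue never drains
    return backlog / net_rate


if __name__ == "__main__":
    # Illustrative figures: 1,000,000 queued, readers at 10 msg/s.
    stopped = drain_time_seconds(1_000_000, 0, 10)   # inflow stopped
    flowing = drain_time_seconds(1_000_000, 7, 10)   # 7 msg/s still arriving
    print(f"inflow stopped: {stopped / 3600:.1f} hours")
    print(f"inflow running: {flowing / 3600:.1f} hours")
```

With these assumed figures the backlog takes roughly 28 hours to clear with the inflow stopped and over three times as long with it running, which is the arithmetic behind taking the web servers offline first.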

Reference Item 4 of this blog post: http://blogs.msdn.com/johnbreakwell/archive/2006/09/18/insufficient-resources-run-away-run-away.aspx

Cheers John Breakwell (MSFT)

  • I've got a call into PSS at this point, and I'm waiting for them to call me back now. I've stopped the messages from flowing into the queue on the web servers. The outbound queues on the web servers are all full at this point with 1 Gig of info each. The Clustered queues have a total of about 4.5 million messages each. Normally we keep a very low number of messages in the queues as we get the data processed very quickly. Something happened (not sure what) and it all went to hell. – mrdenny Feb 06 '10 at 19:02
  • John, thanks for taking a peek for me. Based on the output from tmq I'm guessing that's my problem: "Pools limitations (calculated approximately, in KB): Paged: limit 307,200, used for 397%; Nonpaged: limit 262,144, used for 49%". I've got the queues slowly draining while I wait for PSS to call me back. If you are in Redmond during the MVP Summit let me know, beers on me. – mrdenny Feb 06 '10 at 19:42
  • @user34024 we found the initial problem, which I've put in a comment above. Thanks for the help. – mrdenny Feb 07 '10 at 03:39
1

I asked one of our sysadmins, and he said our magic point was a maximum of 4 web servers hitting an MSMQ box on virtual machines; they then moved to a hardware box to solve it. Also try a packet capture to see what is going on. Is there much authentication traffic going to AD as well? With how chatty MSMQ is, you need to limit the network paths and possibly the authentication path.

HTH, Chuck.

SQLGuyChuck
  • Were they able to nail down what exactly caused the slowdown when you have more than 4 web servers talking to a single MSMQ server? The storage is direct SAN storage over iSCSI so it shouldn't be a storage problem per se. I'll try powering down 4 of the 8 web servers and see what I come up with. If I have to tell my boss to buy new hardware, I'm going to need a damn good reason. – mrdenny Feb 06 '10 at 03:20
  • Just the chattiness of the messages. They also found some authentication misconfigurations. – SQLGuyChuck Feb 06 '10 at 03:39
  • I guess I'll download wireshark and put it on the MSMQ server and see what it shows. Can't put it on the Web servers, it crashes after about 30 seconds because of the network traffic load. – mrdenny Feb 06 '10 at 03:44
  • So I've fired up Wireshark on the machine, and I'm seeing about 3 seconds between messages from the one web server that I'm monitoring. Needless to say, that doesn't look good. – mrdenny Feb 06 '10 at 04:50
  • we found the initial problem, which I've put in a comment above. Thanks for the help. – mrdenny Feb 07 '10 at 03:40
1

Referencing your comment about the lack of remote administration: yes, it's not a great story with MSMQ and perf counters. For anyone following the thread and wanting to know which combinations of OSes work, have a look at the Motley Queue blog:

MSMQ 4.0 Performance Counters and the NetNameForPerfCounters Registry Key http://blogs.msdn.com/motleyqueue/archive/2007/12/14/msmq-4-0-performance-counters-and-the-netnameforperfcounters-registry-key.aspx

Cheers John Breakwell (MSFT)