
So, we had an issue today where fewer than a dozen users were getting timeout errors in OWA: "! Server Busy The server is busy and respond to your request. Please try again later."

They're all external, so they come through our TMG and hit only one of the two CAS servers for the site where their mailboxes live. There are around 5,000 mailboxes at this site, but most users reach the CAS array internally and are therefore split evenly between the two servers.

Upon inspection, the IIS logs show >200 instances of "overbudget". Example below.

Looks like the problem is "Max Effective Time In CAS", and Perfmon did show this often creeping above 100% while I was watching it. We collect performance data via Solarwinds, but this isn't one of the counters, so I don't have a history. Our last 7 days of IIS logs show that we generally see only single-digit counts of "overbudget" per day, however.
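For anyone wanting to reproduce the per-day "overbudget" tally from IIS logs, here is a minimal Python sketch. It assumes the date is the first space-separated field on each line (as in the sample entry in this post; IIS field order is configurable) and that header lines start with `#`. The function name and the sample lines are made up for illustration.

```python
from collections import Counter


def count_overbudget_per_day(lines):
    """Count log lines mentioning 'overbudget', keyed by the leading date field."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip W3C header directives (#Fields, #Date, ...)
            continue
        if "overbudget" not in line.lower():
            continue
        day = line.split(" ", 1)[0]  # assumption: the date is the first field
        counts[day] += 1
    return counts


# Hypothetical, shortened log lines used only to exercise the function.
sample_lines = [
    "#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query",
    "2014-05-08 09:12:01 10.0.0.1 GET /owa/ &ex=E303&OverBudget(Normal/CAS)",
    "2014-05-09 15:41:54 10.0.0.1 GET /owa/ &ex=E303&OverBudget(Normal/CAS)",
    "2014-05-09 15:42:10 10.0.0.1 GET /owa/ &ex=E303&OverBudget(Normal/CAS)",
    "2014-05-09 15:43:00 10.0.0.1 GET /owa/ -",
]

for day, total in sorted(count_overbudget_per_day(sample_lines).items()):
    print(day, total)
```

In practice you would pass an open log file (or several, chained together) instead of `sample_lines`.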

Thing is, I see essentially nothing on what to do about reducing "effective time spent in CAS" - other than removing/changing my throttling policy. Nothing else really jumps out at me for performance on this server. CPU and RAM are fine - it's a dual-proc VM, averaging ~27% CPU on each proc. 12 GB RAM, 3 GB cached, 3 GB available, 50 MB free. Other than getting proper load-balancing, so as to spread out the load, what can I actually do to diagnose and fix this issue that is stemming from this counter?

2014-05-09 15:41:54 10.70.39.170 GET /owa/ &ex=E303&OverBudget(Normal/CAS),Owner:Sid~domain\username~OWA~false [Conn:2,HangingConn:0,AD:18000/18000/0%,CAS:90000/-2602/155%,AB:18000/18000/0%,RPC:90000/89768/1%,FC:1000/0,Policy:DefaultThrottlingPolicy_aaadc777-4ff8-4a5e-97cc-2ec1e383cb7b,Norm] &v=14.3.174.1&mbx=SERVER01.company.local&sessionId=e842baf430514576aabf3ef6f372494c&prfltncy=1&prfrpccnt=0&prfrpcltncy=0&prfldpcnt=0&prfldpltncy=0&prfavlcnt=0&prfavlltncy=0&End+Budget>> Conn:2,HangingConn:0,AD:18000/18000/0%,CAS:90000/-2602/155%,AB:18000/18000/0%,RPC:90000/89768/1%,FC:1000/0,Policy:DefaultThrottlingPolicy_aaadc777-4ff8-4a5e-97cc-2ec1e383cb7b, Norm 443 company.local\ Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_9_2)+AppleWebKit/537.75.14+(KHTML,+like+Gecko)+Version/7.0.3+Safari/537.75.14 200 0 0 202
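To compare budget components across many such entries, the snapshot can be pulled apart with a regular expression. This is a rough sketch in Python: the function names are hypothetical, and it deliberately keeps the two slash-separated numbers as opaque values, because what each one means is not confirmed here. Only the trailing percentage is used to flag a component.

```python
import re

# Matches components shaped like "CAS:90000/-2602/155%" in the budget snapshot.
BUDGET_RE = re.compile(r"\b(AD|CAS|AB|RPC):(-?\d+)/(-?\d+)/(\d+)%")


def parse_budget(line):
    """Return {component: (first, second, percent)} from a log line.

    The snapshot appears twice in each entry, so only the first
    occurrence of each component is kept.
    """
    budget = {}
    for name, first, second, pct in BUDGET_RE.findall(line):
        budget.setdefault(name, (int(first), int(second), int(pct)))
    return budget


def over_limit(budget, threshold=100):
    """List the components whose reported percentage exceeds the threshold."""
    return sorted(name for name, (_, _, pct) in budget.items() if pct > threshold)


# Budget portion copied from the log entry above.
snapshot = (
    "Conn:2,HangingConn:0,AD:18000/18000/0%,CAS:90000/-2602/155%,"
    "AB:18000/18000/0%,RPC:90000/89768/1%,FC:1000/0"
)
print(over_limit(parse_budget(snapshot)))
```

Run against the entry above, only CAS is flagged, while AD and RPC sit at 0% and 1%.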

mfinni

1 Answer


I'm adding my answer because it's what worked. If someone can explain why this worked, which would also explain what actually broke, I'll be happy to accept that as a better answer.

My fix: a reboot.

mfinni
  • 1
    LOL, very nice. With OWA, I'm curious if tweaking the IIS settings for the app pool OWA is in would help this issue. I've never seen that particular error myself. – TheCleaner May 12 '14 at 15:47
  • 1
    A friend of mine works on the Exchange team at MS; he thought this symptom was extremely weird. It's usually AD lookups or RPC access to the mailbox that drive up the CAS time, as it's a superset of those things plus others that aren't measured. But those are both rock-bottom. – mfinni May 12 '14 at 16:29