Two days ago our production server suffered a massive slowdown where the primary symptom was that an extraordinarily high number of requests were suffering SQLTimeouts. I will quickly describe our setup, what I investigated, our workaround, and will then follow with my question.
Our Setup
A pair of servers host this branch of our SAS application. One being an app server running multiple applications on IIS, and the other, the one who suffered the slowdown, is a Windows Server 2008 box running SQL server 2005. SQL is hosting somewhere between 100 and 200 databases.
The problem / investigation
Service grinds pretty much to a halt. Some requests pass through, but most suffer SQL timeouts. SQL machine CPU and RAM look fine, averaging around 25% CPU workload and 85% RAM. I didn't think to check disk activity at the time, as I went straight to 'EXEC sp_who2'
The result showed hundreds of tasks blocked by ID 123, which was itself and with a hundred others blocked by ID 456. Normal execution usually has no blocking tasks at all. When I re-ran sp_who2 after 15-20 secs, different blocking IDs popped up, but the amount of blocked/blocking tasks seemed to stay the same. (didn't count the groups because of emergency mode)
Most tasks were blocking with statements such as "SELECT INTO" or "CREATE INDEX on temptable".
The workaround
Kill the SQL process and restart it to restore service. The slowdown didn't reappear, but we know we are at risk.
My question
What can I do to fix this problem, preferably before it re-occurs?
Sub-questions:
- Is there another path I can investigate during normal activity?
- If/when the problem reoccurs, what information should I gather? (Needs to be quick to obtain, as that means we will be experiencing a service outage again)
What I did so far
From the symptoms, we suspected the problem to have been a contention of some kind on tempdb. (Another symptom was that right-clicking on tempdb to see properties during the problem generated an error after a short while)
No logs indicated that an auto-grow event occurred on tempdb, though as far as I know, auto-grow successes are not logged, only failures.
I have read a lot of different sources of info since then on tempdb contention, not limited to but including:
http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/ http://www.sqlservercentral.com/blogs/robert_davis/2010/03/05/Breaking-Down-TempDB-Contention/
From what I gather, it is best practice to have set-initial-size tempdb files and to have one per core, up to 8 files. It is our plan to put that in place (8 cores, so 8 files) as soon as possible since it is best practice. They would all be on the same hard disk (for now), but we believe that the worst case is no amelioration, and best case is we gain the difference between logical contention bottleneck and disk I/O bottleneck.
However, we can't be sure of the correlation with the problem we had. From what I understand, splitting to multiple temp files would help "PAGELATCH_XX" type of waits, but running Paul S. Randal's query (see 1st posted link) during normal activity, that type of wait is absent. Top 3 I see during normal activity are:
CXPACKET    68.63%
LATCH_EX    18.46%
PAGEIOLATCH_SH  4.35%
I have no way to know what type of blocking was occurring during the slowdown though, since we did not have all this information then.
