
Two days ago our production server suffered a massive slowdown; the primary symptom was that an extraordinarily high number of requests were hitting SQL timeouts. I will quickly describe our setup, what I investigated, and our workaround, and then follow with my question.

Our Setup

A pair of servers host this branch of our SaaS application: one is an app server running multiple applications on IIS, and the other, the one that suffered the slowdown, is a Windows Server 2008 box running SQL Server 2005. SQL Server hosts somewhere between 100 and 200 databases.

The problem / investigation

Service ground pretty much to a halt. Some requests passed through, but most suffered SQL timeouts. The SQL machine's CPU and RAM looked fine, averaging around 25% CPU and 85% RAM usage. I didn't think to check disk activity at the time, as I went straight to 'EXEC sp_who2'.

The result showed hundreds of tasks blocked by ID 123, which was itself, along with a hundred others, blocked by ID 456. Normal execution usually has no blocking tasks at all. When I re-ran sp_who2 after 15-20 seconds, different blocking IDs popped up, but the number of blocked/blocking tasks seemed to stay the same. (I didn't count the groups exactly, given that we were in emergency mode.)

Most of the tasks involved in the blocking were running statements such as "SELECT INTO" or "CREATE INDEX on temptable".
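For reference, the same picture can be pulled from the DMVs instead of sp_who2. This is only a minimal sketch using sys.dm_exec_requests / sys.dm_exec_sql_text, not what we actually ran at the time:

    -- Sketch: list blocked requests, their blocker, the wait type and the
    -- statement text (standard SQL Server 2005 DMVs).
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time,
           t.text AS sql_text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id <> 0
    ORDER BY r.wait_time DESC;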

The workaround

We killed the SQL Server process and restarted it to restore service. The slowdown didn't reappear, but we know we are still at risk.

My question

What can I do to fix this problem, preferably before it re-occurs?

Sub-questions:

  • Is there another path I can investigate during normal activity?
  • If/when the problem reoccurs, what information should I gather? (Needs to be quick to obtain, as that means we will be experiencing a service outage again)

What I did so far

From the symptoms, we suspected the problem was contention of some kind on tempdb. (Another symptom: right-clicking tempdb to view its properties during the incident produced an error after a short while.)

No logs indicated that an auto-grow event occurred on tempdb, though as far as I know, auto-grow successes are not logged, only failures.
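The default trace (enabled by default on SQL Server 2005) does capture auto-grow events, so checking it is another option; a minimal sketch, assuming the default trace hasn't been disabled:

    -- Sketch: look for tempdb auto-grow events recorded by the default trace.
    DECLARE @path NVARCHAR(260);
    SELECT @path = path FROM sys.traces WHERE is_default = 1;

    SELECT te.name AS event_name,
           t.DatabaseName,
           t.FileName,
           t.StartTime,
           t.Duration
    FROM fn_trace_gettable(@path, DEFAULT) AS t
    JOIN sys.trace_events AS te ON t.EventClass = te.trace_event_id
    WHERE te.name IN ('Data File Auto Grow', 'Log File Auto Grow')
      AND t.DatabaseName = 'tempdb'
    ORDER BY t.StartTime DESC;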

I have read a lot of different sources of info since then on tempdb contention, not limited to but including:

  • http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/
  • http://www.sqlservercentral.com/blogs/robert_davis/2010/03/05/Breaking-Down-TempDB-Contention/

From what I gather, it is best practice to pre-size the tempdb data files rather than rely on auto-grow, and to have one file per core, up to 8 files. It is our plan to put that in place (8 cores, so 8 files) as soon as possible, since it is best practice. They would all be on the same hard disk (for now), but we believe the worst case is no improvement, and the best case is that we gain the difference between the logical-contention bottleneck and the disk I/O bottleneck.
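Roughly, the change would look like the sketch below; file names, sizes, paths and growth increments are placeholders rather than our actual values:

    -- Sketch: pre-size the existing tempdb data file and add equally sized
    -- files (one per core, up to 8). Names/sizes/paths are placeholders.
    ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, SIZE = 1024MB, FILEGROWTH = 256MB);

    ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdb2.ndf',
              SIZE = 1024MB, FILEGROWTH = 256MB);
    -- ...repeat ADD FILE for tempdev3 through tempdev8, keeping all files
    -- the same size so allocations spread evenly across them.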

However, we can't be sure this correlates with the problem we had. From what I understand, splitting tempdb into multiple files helps with "PAGELATCH_XX" type waits, but when I run Paul S. Randal's query (see the first link above) during normal activity, that type of wait is absent. The top 3 waits I see during normal activity are:

CXPACKET 68.63%
LATCH_EX 18.46%
PAGEIOLATCH_SH 4.35%

I have no way to know what type of blocking was occurring during the slowdown though, since we did not have all this information then.
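If it reoccurs, a quick snapshot along the lines of the sketch below (the standard sys.dm_os_waiting_tasks DMV; the session_id filter just skips system sessions) should at least capture the wait types and contended resources before any restart:

    -- Sketch: snapshot of who is waiting, on what, and who blocks them.
    -- tempdb allocation contention typically shows PAGELATCH_xx waits on
    -- resources such as 2:1:1 (PFS), 2:1:2 (GAM) or 2:1:3 (SGAM).
    SELECT wt.session_id,
           wt.wait_type,
           wt.wait_duration_ms,
           wt.resource_description,
           wt.blocking_session_id
    FROM sys.dm_os_waiting_tasks AS wt
    WHERE wt.session_id > 50   -- skip system sessions
    ORDER BY wt.wait_duration_ms DESC;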

CWilliams
1 Answer


The problem eventually recurred the day after I posted this question.

Running Paul S. Randal's query, I quickly found a number of PAGELATCH_XX blocking waits going on, so with sp_who2 I was able to identify the culprit databases and restart only the relevant client application pools from the web server, a much less harsh workaround to restore service.
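For anyone in the same situation, here is a sketch of how blocked requests can be grouped per database with the standard DMVs to decide which application pools to recycle (not the exact query I used):

    -- Sketch: count blocked requests per database to decide which client
    -- application pools to recycle.
    SELECT DB_NAME(r.database_id) AS database_name,
           COUNT(*)               AS blocked_requests
    FROM sys.dm_exec_requests AS r
    WHERE r.blocking_session_id <> 0
    GROUP BY DB_NAME(r.database_id)
    ORDER BY blocked_requests DESC;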

We were also able to follow the trail to the actual operations that now do much more tempdb work than they did before, and we will look at fixing that as a separate angle on this problem.

The solution

We have gone forward with splitting tempdb into multiple files as best practice suggests, since the contention we saw appears to have been exactly the type this change addresses.

CWilliams
  • On top of that, may I suggest dumping the ridiculously slow HD and putting in an SSD? It can be argued for most things, but the serious amount of IOPS an SSD delivers is terrific for tempdb. – TomTom Sep 08 '14 at 13:51