0

My normal running SQL Server 2005 SP3 servers started experiencing random blocking issues over the past month. Each time it happens I run my blocking script to see whats blocking what (see code block below). I see SPIDs that are causing blocks, but they generally tend to be TEMPDB activities and each time I execute my blocking script the objects it finds causing blocks changes to something else. This makes it really hard to find out what is really causing the hold up. If I run some queries on the exec_requests DMV I see wait types of PAGELATCH for the suspended processes. I've taken all the best practices steps to give tempdb a performance boost; tempdb data and log files are own their own volumes, tempdb is recovery model simple, it has 8 data files, and I even turned on the trace flag TF1118. I have a script that I use to create new databases and I can use that script to reproduce the blocking on the server (most of the time). But I've used my create database script for over a year on this server with no issues till now... Please any advice on what to look for or how I can find out what is truly causing this?

    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; SET NOCOUNT ON

    DECLARE @Processes TABLE (SPID INT, Blocked INT, DBID INT, ProgramName VARCHAR(100), HostName VARCHAR(50), CMD VARCHAR(50), 
                                CPU INT, PhysicalIO INT, Status VARCHAR(50), ECID INT)
    INSERT @Processes (SPID, Blocked, DBID, ProgramName, HostName, CMD, CPU, PhysicalIO, Status, ECID)
    SELECT spid, blocked, dbid, [program_name], hostname, cmd, cpu, physical_io, status, ecid
    FROM sys.sysprocesses (NOLOCK)
    WHERE spid <> blocked

    DECLARE @BlockingIDs TABLE (ID INT)
    INSERT @BlockingIDs (ID)
    SELECT Blocked FROM @Processes WHERE Blocked IS NOT NULL AND Blocked <> 0

   -- If there are blocked processes...
    IF (SELECT COUNT(ID) FROM @BlockingIDs) > 0
        BEGIN
            DECLARE @BlockerData TABLE (RowID INT IDENTITY(1,1), BlockingSPID INT, SqlText NVARCHAR(4000), ObjectID INT, ObjectName VARCHAR(400), 
                                        DatabaseName VARCHAR(100), ProgramName VARCHAR(100), HostName VARCHAR(50), CMD VARCHAR(50))

            CREATE TABLE #ON (Name VARCHAR(400))
            INSERT @BlockerData (BlockingSPID, SqlText, ObjectID, ObjectName, DatabaseName, ProgramName, HostName, CMD)
            SELECT DISTINCT spid, master.dbo.DBA_GetSQLTextForSPID(spid), 
            master.dbo.DBA_GetSQLObjectIDForSPID(spid), '', DB_NAME([dbid]), ProgramName, HostName, CMD
            FROM @Processes
            WHERE SPID IN (SELECT ID FROM @BlockingIDs)
            ORDER BY SPID

            DECLARE @RowIndex INT,
                    @RowCount INT,
                    @ObjectID INT,
                    @ObjectName VARCHAR(200),
                    @DB VARCHAR(50),
                    @Sql NVARCHAR(300)

            SELECT @RowCount = COUNT(RowID) FROM @BlockerData
            SET @RowIndex = 1

            WHILE @RowIndex <= @RowCount
                BEGIN
                    SELECT @ObjectID = ObjectID, @DB = DatabaseName FROM @BlockerData WHERE RowID = @RowIndex
                    SET @Sql = 'SELECT Name FROM ' + @DB + '..sysObjects WHERE ID = ' + CONVERT(VARCHAR(50), @ObjectID)
                    DELETE #ON
                    INSERT #ON (Name) EXEC sp_ExecuteSql @Sql
                    UPDATE @BlockerData SET ObjectName = (SELECT Name FROM #ON) WHERE RowID = @RowIndex
                    SET @RowIndex = @RowIndex + 1
                END
            DROP TABLE #ON

            SELECT BlockingSPID, SqlText, ObjectID, ObjectName, DatabaseName, ProgramName, HostName, CMD FROM @BlockerData

            -- Identify the spids being blocked.
            SELECT t2.spid AS 'Blocked spid', t2.blocked AS 'Blocked By', 
            master.dbo.DBA_GetSQLTextForSPID(t2.spid) AS 'SQL Text', 
            t2.CPU, t2.PhysicalIO, DatabaseName = DB_NAME(t2.[dbid]), t2.ProgramName, t2.HostName, t2.Status, t2.CMD, t2.ECID
            FROM @Processes t1, @Processes t2 
            WHERE t1.spid = t2.blocked
            AND t1.ecid = t2.ecid
            AND t2.Blocked IN (SELECT ID FROM @BlockingIDs)
            ORDER BY t2.blocked, t2.spid, t2.ecid
        END
    ELSE -- No blocked processes.
        BEGIN
            SELECT 'No processes blocked.' 
        END
David George
  • 888
  • 1
  • 9
  • 21

1 Answers1

0

Go grab a copy of sp_whoisactive and use that. That should provide you with some good into. Also look at the wait_stats DMV to see what the cause of the waiting is on the blocker. If you are seeing PAGELATCH_IO then you've probably got some sort of storage issue going on. Use perfmon to look for slowly responding IO.

mrdenny
  • 27,074
  • 4
  • 40
  • 68
  • Thanks for the reply Denny! I do have Adam's procedure on that box, I'll have to try and run that next time this occurs. However I have been running my own scripts to check activity and waits using exec_requests and last wait types. The majority of the time all exec_requests get a status of 'suspended' and the wait is usually PAGELATCH_SH or PAGELATCH_EX; with shared being more common...any thoughts? – David George Jul 11 '11 at 12:17
  • I just had this issue today, the worst one so far.. I ran sp_whoisactive and I would get different results each time. Nothing really helpful, using perfmon I could see write/B/sec was the highest for tempdb's log file, but it wasn't as high as my other server which was operating just fine...of all the tasks that were showing up as suspended in exec_request I'd say about 80% were PAGEIOLATCH_EX, and 20% PAGEIOLATCH_SH... I have no idea what caused it this morning or why I couldn't fix it. This was the first time I actually had to restart the SQL server process. – David George Jul 11 '11 at 15:57
  • Also, I see blocking constantly throughout these nightmares, but everytime I run a check to see the SPID is something new causing the blocks, often some object executing against tempdb... – David George Jul 11 '11 at 16:28
  • OK, you are having IO problems. First thing to look at is the writes per second and reads per second. Also look at the seconds per read and seconds per write. If the seconds per read/write numbers are 0.050 or higher then the IO is responding to slow. Either you've got a problem at the IO layer (failing disk, controller, cable, etc) or you are simply pushing to much data to the disk. How many reads per sec and writes per sec do you see? What's the disk config behind the drive in question? – mrdenny Jul 11 '11 at 20:09
  • Thank you for the response @MrDenny A local RAID10 (8 - 146GB 15K drives) array make up volumes: C: (windows) H: (tempdb) I: (templog) K: (programs) All the rest of SQL is stored on separate LUNs (Data, Audit, Logs, and Indexes) H: (Read/sec = Average 7) (Writes/sec = Average 3.7) I: (Read/sec = Average 0) (Writes/sec = Average 15) This is an HP server and the array reports healthy via the ACU. However, this configuration has been deployed on this server for 1.5 years with no issues till recently though. – David George Jul 12 '11 at 18:22
  • How about the seconds per read and seconds per write. – mrdenny Jul 12 '11 at 19:34
  • TempDev H: (Sec/Read = Max 0.005) TempDev H: (Sec/Write = Max 0.003) TempLog I: (Sec/Read = Max 0.001) TempLog I: (Sec/Write = Max 0.007) – David George Jul 13 '11 at 17:58
  • All those numbers look ok. Are these the kinds of numbers that you get when you see the waits on pageIO? – mrdenny Jul 13 '11 at 18:32
  • no right now the servers appear fine and are under equivalent load to when I see the issue pop up...The issue can appear from nowhere – David George Jul 13 '11 at 18:43
  • Try setting up perfmon to gather the stats and save them to disk. Then when the problem comes up you can stop the perfmon collector and see what the data changes to. – mrdenny Jul 13 '11 at 19:03
  • I setup the data collectors to save those values to a SQL table for me (separate server we use for data warehousing). Also I forgot to mention earlier that when I see these waits its often the same resource thats waiting (dm_os_waiting_tasks) 2:1:103, and if I run that into DBCC PAGE (2,1,103) it says the page contains: sysmultiobjrefs... no idea what that is? :( Also I've always used Red-Gate's SQL MultiScript to do updates to these servers (~250 DB's/server) w/setting of 25 concurrent DBs. I execute an ALTER script from MultiScript and that will cause blocks it seems... – David George Jul 14 '11 at 13:44
  • That is a system table which tracks cross table references. How many physical files does your tempdb database have? – mrdenny Jul 18 '11 at 20:22
  • The server is 4 6-core CPU's (24 logical CPUs) with 128GB of RAM. We have had it at 8 tempdb files since the servers were put into operation last year, but I recently added 2 (10 total files) to see if that stopped the issue, but it still happens. – David George Jul 20 '11 at 12:05
  • Does your application have a lot of temp tables being used? With your contention being on that single page, adding more physical files won't help as all your system tables are contained within the first physical file of the database. While the page can't be moved, you can't spread it across multiple files. – mrdenny Jul 24 '11 at 03:29
  • Each server runs about 250~ databases, and the majority our "client" databases so they are each identical in terms of schema, procedures, etc., and differ only in terms of data. We do use temp tables very often in our software. (Beyond that I know some modules have cursors, table variables, and other temp objects that are used in the software). This issue really seems to have came from nowhere, but all the disk hardware appears okay. June-August is our busiest time of the year so I know that is a contributing factor, but other than that there wasn't any sudden change to start these issues. – David George Jul 25 '11 at 14:32
  • Have been watching active temp tables perfmon counter today just to see what our numbers were and it averages around 4,700 – David George Jul 25 '11 at 18:43
  • Try enabling trace flag 1118. While the system is running "DBCC TRACEON (1118, -1)" The flag won't be active the next time the SQL Server is restarted so you'll want to add the trace flag as a startup parameter by adding "-T1118" to the list of startup parameters. – mrdenny Jul 26 '11 at 06:43
  • I've already done so as noted in my original question post. I haven't noticed any difference with TF1118 enabled. These servers have 128GB of RAM and sp_configure 'max server memory' shows the run_value of 120832. After a monthly maintenance window these servers are rebooted (usually a weekend/Saturday night) yet come Monday they have consume 125GB of RAM... This seems abnormal? The problems seem to progress as the month does. TEMPDB contention gets worse after the server has been up for 3-4 weeks and queries that run in 1-2 hours the first week take several hours by 3-4 weeks. – David George Jul 26 '11 at 18:03
  • It's normal for the SQL Server to use up all the RAM pretty quickly. That just means that the SQL Server has loaded a lot of data into memory. You might want to lower the amount of memory allocated to SQL Server. With 125 Gigs in use that's only leaving 2-3 gigs for Windows. The max server memory only effects the buffer pool and procedure cache. The memory that the actually binaries, CLR, etc. use aren't included in that number. The only suggestion I can made would be to try a newer build, and/or reduce tempdb usage. It might be worth opening a PSS case if that doesn't help. – mrdenny Jul 27 '11 at 01:45
  • Sorry I wasn't able to find the solution. If PSS does please post it. – mrdenny Jul 27 '11 at 20:42