
We’ve got a performance problem in production.

QA and DEV environments are 2 instances on the same physical server: Windows 2003 Enterprise SP2, 32 GB RAM, 1 Quad 3.5 GHz Intel Xeon X5270 (4 cores x64), SQL 2005 SP3 (9.0.4262), SAN Drives

Prod: Windows 2003 Datacenter SP2, 64 GB RAM, 4 Dual Core 1.6 GHz Intel Family 80000002, Model 6 Itanium (8 cores IA64), SQL 2005 SP3 (9.0.4262), SAN Drives, Veritas Cluster

I am seeing excessive signal wait percentages (> 250%), and both Page Reads/sec (> 50) and Page Writes/sec (> 25) are occasionally high.
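
For reference, the signal wait figure can be cross-checked against the wait-stats DMV; a minimal sketch (SQL 2005), where a high ratio of signal waits to total waits points at CPU/scheduler pressure:

-- Signal wait time is time spent waiting for a scheduler after the
-- resource became available, so a high percentage suggests CPU pressure.
SELECT
    SUM(signal_wait_time_ms) AS signal_wait_ms,
    SUM(wait_time_ms)        AS total_wait_ms,
    CAST(100.0 * SUM(signal_wait_time_ms)
         / NULLIF(SUM(wait_time_ms), 0) AS DECIMAL(5, 2)) AS signal_wait_pct
FROM sys.dm_os_wait_stats;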

I tested this query on both QA and Prod, and it has the same execution plan and even the same stats:

SELECT TOP 40000000 *
INTO dbo.tmp_tbl
FROM dbo.tbl
GO

Scan count 1, logical reads 429564, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you can see it's just logical reads; however, the run times differ: QA: 0:48, Prod: 2:18.
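
For completeness, a sketch of one way to capture these numbers on both servers (assuming the same dbo.tbl / dbo.tmp_tbl objects; STATISTICS IO gives the read counts above and STATISTICS TIME gives the elapsed times):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Drop the target of the previous run so SELECT ... INTO can recreate it.
IF OBJECT_ID('dbo.tmp_tbl') IS NOT NULL
    DROP TABLE dbo.tmp_tbl;

SELECT TOP 40000000 *
INTO dbo.tmp_tbl
FROM dbo.tbl;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;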

So it seems like a processor-related issue, but I'm not sure where to go next. Any ideas?

Thanks,

Aaron

Vendoran

4 Answers


This was caused by two issues: the indexes were different between Prod and QA, and MAXDOP was improperly configured.
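
A rough sketch of the two checks involved (run on both servers and compare; the table name is the one from the question, and 'max degree of parallelism' is the instance-level MAXDOP setting):

-- 1) Compare index definitions on the table between Prod and QA.
SELECT i.name AS index_name, i.type_desc, c.name AS column_name, ic.key_ordinal
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
    ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns AS c
    ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('dbo.tbl')
ORDER BY i.name, ic.key_ordinal;

-- 2) Show the current MAXDOP setting.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism';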

Vendoran

I have a couple of suggestions for things that you might investigate on the SAN:

  • Are you seeing significant page I/O latch waits on the production SAN?

  • Are the DB logs on a busy shared volume?

In the former case, there could be a SAN configuration problem or another performance issue relating to the SAN controllers. I have seen this happen on IBM Shark hardware; moving to a DS8000 substantially mitigated the problem.

In the latter case you could be getting problems with random seeks disrupting the log-writing activity. Log writing is a mostly sequential process consisting of large numbers of small writes. On quiet disks this is fast because the access pattern stays mostly sequential; on busy disks the other traffic turns the sequential log writes into random writes, which are much slower. This can turn the log drives into a significant performance bottleneck.

Note that SQL Server requires log writes to be flushed to disk before a transaction will commit, and most SAN vendors have enrolled in a certification program where they guarantee that the controllers will honour this standard. This means that no amount of cache memory will mitigate this issue if your logs reside on a busy shared volume.
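
A minimal sketch of how both points might be checked from the SQL 2005 DMVs (the assumption that file_id = 2 is the log file holds for databases with a single log file):

-- 1) Page I/O latch waits accumulated since the counters were last cleared.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%'
ORDER BY wait_time_ms DESC;

-- 2) Per-file write stalls; a high average stall on the log file suggests
--    the log volume is being disrupted by other traffic.
SELECT DB_NAME(database_id) AS database_name,
       file_id,
       num_of_writes,
       io_stall_write_ms,
       io_stall_write_ms / NULLIF(num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL)
ORDER BY io_stall_write_ms DESC;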


"So It seems like a processor related issue, however I’m not sure where to go next, any ideas?"

It seems not unreasonable to me that a 3.5 GHz Core 2 architecture CPU is perhaps 100% faster than a 1.6 GHz Itanium (isn't Intel Family 80000002 the original Itanium 2 family, Fanwood or Madison?). If you need more speed, look at a CPU upgrade, perhaps to an x5600-series Xeon.

xenny

Was there anything else happening on the Prod server? It looks like the QA server had only this query to run, while the Prod system had to contend for CPU with other queries running at the same time. How do elapsed_time and worker_time compare in QA and Prod?

Also, make sure the plans are exactly identical, including DOP.
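
A rough way to pull that comparison from the plan cache on each server (a sketch; the LIKE filter on tmp_tbl is only an assumption to locate the test statement, and the times are in microseconds):

SELECT TOP 10
    qs.execution_count,
    qs.total_worker_time  / qs.execution_count AS avg_worker_time_us,
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_time_us,
    st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%tmp_tbl%'
ORDER BY qs.total_elapsed_time DESC;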

Remus Rusanu
  • Both servers had other things running at the same time; however, Prod is overall significantly slower. This was just an example query I ran for comparison's sake; the plans are the same, including DOP and even estimated rows, estimated IO, etc. With the high signal wait % I would agree that there's CPU pressure, however I'm not sure how to start troubleshooting it other than to say that 29% of waits are CXPACKET parallelism – Vendoran Dec 09 '09 at 19:13
  • DOP was the same on both, set at 2; however, I did change Prod to 0 to test, with no real effect that I saw – Vendoran Dec 09 '09 at 19:14
  • With 8 cores, set the maxdop to 4. Then test your query again. The cost threshold is a different animal - try 5, 25, 50, 75 and compare results. Another thing you could look at is the SAN. Can you do benchmarks to rule that out? –  Dec 09 '09 at 19:26
  • There are many places you can look. Your test is a write (SELECT ... INTO), so it could be that your HBA/fiber/SAN connectivity is overloaded on Prod. It could also be NUMA locality (maybe you need to affinitize clients). Vendors have tuning papers, like http://docs.hp.com/en/8875/WIE-SQLTuning-1006-00.pdf. – Remus Rusanu Dec 09 '09 at 19:34
  • I did some minimal SAN benchmarks and see both (QA and Prod) operating sufficiently. I did change the maxdop to 4 and retested with no improvement. The cost threshold is set to 5 on both servers; will investigate and test. – Vendoran Dec 09 '09 at 19:35
  • I would check the I/O subsystem first. A bandwidth problem with the SAN could result in apparent CPU pressure, because requests are staying longer in the system, thus creating scheduler pressure (many workers are occupied). Compare the 'Avg. sec/Transfer' on the Physical Disks on the two systems. – Remus Rusanu Dec 09 '09 at 19:36
  • I gave +1 on both comments. Glad to see we're on the same wavelength re SAN. –  Dec 09 '09 at 19:53
  • I haven't seen any SAN bandwidth problems, and looking at the Physical Disk object, Avg. Disk sec/Transfer counter, the values were showing 0. I have changed the maxdop to 4 and the cost threshold to 50, cleared the wait stats, and will continue monitoring. – Vendoran Dec 09 '09 at 20:31