We had an overnight air conditioning failure, and by the time we discovered it the temperature in the server room had reached about 110-115°F (43-46°C). We powered off everything that hadn't already shut down and had the A/C fixed.
Now that it's fixed, I'm concerned about damage from the extended exposure to high temperature. I'd like to run a series of tests on all of our machines to ensure that none are damaged before we return to relying on them. My plan is as follows:
- Run memtest86 to check if any DIMMs were damaged (have already done this and essentially found no issues)
- Run Prime95 to check if any CPUs are damaged (presumably this will come in the form of unexpected interrupts or hardware faults)
- Run `smartctl -a` and `badblocks` on all disks and check output for any anomalies
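
For the disk step, here is a rough sketch of how the SMART output could be scanned automatically instead of read by eye. It assumes the ATA-style attribute table printed by `smartctl -A` (the attribute-only subset of `smartctl -a`); NVMe and SAS drives print a different layout and would need their own parsing. The `WATCHED` list and the function names are my own illustrative choices, not anything defined by smartmontools.

```python
import subprocess

# Assumed watch list: attributes where a nonzero raw value commonly
# points at media trouble. Adjust for your drive models.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def flag_smart_anomalies(report):
    """Return {attribute: raw_value} for watched attributes with raw value > 0.

    `report` is the text printed by `smartctl -A` for an ATA drive.
    """
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


def check_disk(device):
    """Run `smartctl -A` on one device (needs root and smartmontools installed)."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses nonzero exit codes as status bit flags
    )
    return flag_smart_anomalies(result.stdout)
```

Calling `check_disk("/dev/sda")` for each disk (the device path is only an example) would return an empty dict for a drive with no flagged attributes. This only covers the SMART half of the step; `badblocks` would still need to be run separately.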
This list feels a little thin, and I'm not confident these tests will exercise the hardware thoroughly enough to ensure we won't run into heat-induced issues in the future.
Is this battery of tests sufficient? Are there any others I should consider?