The situation
- I recently upgraded the firmware from 2.2 to 3.1(1e).
- Since the upgrade, at 6:51am (UTC+1) every day, between zero and three of the ~60 B200-series blades in my installation fail.
- It's always the same three blades, all in different chassis.
- Each failure is a hard hang, with 'CPU predictive failure' and 'CATERR_N' messages in the SEL.
- Power-cycling the blade restores it to service (at least until the next failure).
- There are no one-time or recurring schedules in the UCSM that are anywhere near this time of day.
- Cisco TAC is investigating but hasn't shed any light on why the failures happen at the same time every day.
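One cross-check on the schedule point above is to see what 6:51am (UTC+1) looks like on a clock set to a different offset, in case something is scheduled against UTC or another zone rather than local time. This is only a minimal sketch: whether any component actually keeps its clock in a different zone is an assumption, and the offsets in the loop (and the helper name `failure_time_at`) are illustrative, not taken from the system.

```python
from datetime import datetime, timedelta, timezone

# Daily failure time reported above: 06:51 at UTC+1.
# The calendar date is an arbitrary placeholder; only the time of day matters.
LOCAL = timezone(timedelta(hours=1))
failure = datetime(2000, 1, 1, 6, 51, tzinfo=LOCAL)


def failure_time_at(offset_hours):
    """Return the failure time as HH:MM on a clock set to UTC+offset_hours."""
    tz = timezone(timedelta(hours=offset_hours))
    return failure.astimezone(tz).strftime("%H:%M")


# Hypothetical clock settings to compare against: UTC, UTC-8, and local (UTC+1).
for offset in (0, -8, 1):
    print(f"UTC{offset:+d}: {failure_time_at(offset)}")
```

If one of the converted times lines up with a round number or a known maintenance window, that would be a lead worth passing to TAC.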
My research and suspicions
- I have a working theory that these are real hardware problems which have somehow been exposed by the firmware upgrade.
- There's a brief mention of something called the 'sensor scanning manager' in the troubleshooting guide, but I can't find any detail on what it does or how to monitor it.
- I've all but ruled out an environmental cause. Our power and temperature monitors show nothing unusual at that time. We are not in an earthquake zone :-)
The question
Why are the failures happening at precisely the same time every day?