Cisco UCS CPU faults at the same time every day

Question

The situation

We recently upgraded our UCS firmware from 2.2 to 3.1(1e).
Since the upgrade, at 6:51am (UTC+1) every day I experience failures on between zero and three (out of ~60) of the B200-series blades in my installation.
It's always the same three blades, all in different chassis.
The failures manifest themselves as a hard hang with 'CPU predictive failure' and 'CATERR_N' messages in the SEL.
Power-cycling the blade restores it to service (at least until the next failure).
There are no one-time or recurring schedules in the UCSM that are anywhere near this time of day.
Cisco TAC is investigating but has not yet been able to explain why the failures happen at the same time every day.

My research and suspicions

I have a working theory that these are real hardware problems which have somehow been exposed by the firmware upgrade.
There's a brief mention of something called the 'sensor scanning manager' in the troubleshooting guide, but I can't find any detail as to what it does or how to monitor it.
I've all but ruled out an environmental cause. Our power and temperature monitors show nothing unusual at that time. We are not in an earthquake zone :-)

The question

Why are the failures happening at precisely the same time every day?

I see this is a relatively old question so I hope by now the Cisco TAC has solved this for you. If so, please consider answering your own question so that others might benefit. If not: you say `There are no one-time or recurring schedules in the UCSM that are anywhere near this time of day.` but you did not mention if there are any recurring schedules in the hypervisor/OS/applications running on those blades? — hertitu, Oct 06 '16 at 15:10
Thank you! One last request, if you could also mark it as resolved please, that prevents it from popping up in the unresolved questions :) — hertitu, Oct 06 '16 at 15:21

score 2 · Accepted Answer · answered Oct 06 '16 at 15:18

This turned out to be a bug in firmware version 3.1(1e) (Cisco account required for that link). It's described as a 'rare event' involving the VIC 1340 and a debug interrupt.

The reason this was happening at the same time every day is that it was being triggered by:

heavy memory usage, followed by
running lspci,

and this is exactly what Puppet was doing each morning (we only run it once per day).

It's unclear why only certain blades were affected by this bug, but upgrading to version 3.1(1h) solved the problem.
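
For anyone chasing a similar time-of-day pattern, here is a minimal Python sketch of how failure timestamps could be checked against the start times of scheduled jobs on the affected hosts. Everything in it is an assumption for illustration: the function name `jobs_near_failures`, the job names, the 06:45 and 03:00 start times, and the dates are made up and are not taken from our SEL or Puppet configuration. Only the 06:51 failure time comes from the question.

```python
from datetime import datetime, time, timedelta


def jobs_near_failures(failure_times, job_times, window_minutes=10):
    """Count, per job, the failures that occurred within `window_minutes`
    after that job's daily start time.

    failure_times: list of datetime objects (e.g. parsed from SEL entries)
    job_times:     dict mapping a job name to its daily start (datetime.time)
    Jobs that run across midnight are not handled in this sketch.
    """
    window = timedelta(minutes=window_minutes)
    hits = {}
    for name, job_start in job_times.items():
        count = 0
        for failure in failure_times:
            # Compare time of day only: place the job start on the failure's date.
            start = datetime.combine(failure.date(), job_start)
            if timedelta(0) <= failure - start <= window:
                count += 1
        if count:
            hits[name] = count
    return hits


# Hypothetical data: three failures at 06:51 on consecutive days.
failures = [
    datetime(2016, 9, 1, 6, 51),
    datetime(2016, 9, 2, 6, 51),
    datetime(2016, 9, 3, 6, 51),
]

# Hypothetical daily job schedule gathered from the OS on the blades.
jobs = {
    "puppet-agent": time(6, 45),
    "logrotate": time(3, 0),
}

print(jobs_near_failures(failures, jobs))
```

A job that shows up for every failure is a good candidate to inspect more closely, for example by looking at which commands it runs (in our case, lspci after a memory-heavy step).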
