My client's HP ProCurve 5412zl chassis switch reboots on occasion, despite being powered through four redundant power supplies and being under UPS protection.

These reboots usually happen during a real power outage or during a brown-out or low-voltage event. All of the equipment attached to the UPS stays up except for the switch.

The UPS for the rack is an APC SmartUPS SUA3000XL 208V with step-down transformer. This switch provides PoE for phones and access points throughout the facility. The battery cells are healthy, replaced recently and have a full charge.

These blips have the effect of rebooting all of the phones in the facility and disconnecting users from their sessions. It's disruptive.

In the switch logs:

 Keys:   W=Warning   I=Information
         M=Major     D=Debug E=Error
----  Event Log listing: Events Since Boot  ----
I 02/17/16 22:26:31 03802 chassis: System Self test started on  Master
I 02/17/16 22:26:31 03803 chassis: System Self test completed on  Master
I 02/17/16 22:26:35 00061 system: -----------------------------------------
I 02/17/16 22:26:35 00062 system: Mgmt Module 1 went down without saving crash
            information
M 02/17/16 22:26:35 03001 system: System reboot due to Power Failure

And version information:

valley-core# sh version
Image stamp:    /ws/swbuildm/rel_orlando_qaoff/code/build/btm(swbuildm_rel_orlando_qaoff_rel_orlando)
                Nov 19 2014 15:17:26
                K.15.16.0005
                335
Boot Image:     Secondary

For years, I didn't realize that you have to modify the power supply settings on this switch model, but this unit is configured properly to take advantage of the multiple PSUs.

valley-core# sh power-over-ethernet

 Status and Counters - System Power Status

  System Power Status    : Full redundancy
  PoE Power Status       : Full redundancy

 Chassis power-over-ethernet:

  Total Available Power  :  600 W
  Total Failover Power   :  600 W
  Total Redundancy Power :  600 W
  Total Used Power       :  359 W +/- 6W
  Total Remaining Power  :  241 W

 Internal Power

        Main Power
  PS    (Watts)       Status
  ----- ------------- ---------------------
  1     300           POE+ Connected
  2     300           POE+ Connected
  3     300           POE+ Connected
  4     300           POE+ Connected

 External Power
        EPS1   /Not Connected.
        EPS2   /Not Connected.

Additional PSU information:

valley-core# sh system power-consumption

 Slot Power Usage:
 Slot  Module Description                        Current Power
 ----- ----------------------------------------- ---------------
 A     HP J9534A 24p Gig-T PoE+ v2 zl Module     18 W
 B     HP J9536A 20p GT PoE+/2p SFP+ v2 zl Mod   23 W
 C     HP J9534A 24p Gig-T PoE+ v2 zl Module     18 W
 D     HP J9534A 24p Gig-T PoE+ v2 zl Module     19 W
 E     HP J9534A 24p Gig-T PoE+ v2 zl Module     17 W
 F     HP J9534A 24p Gig-T PoE+ v2 zl Module     18 W
 G     HP J9534A 24p Gig-T PoE+ v2 zl Module     18 W
 H     HP J9534A 24p Gig-T PoE+ v2 zl Module     18 W
 K     HP J9534A 24p Gig-T PoE+ v2 zl Module     18 W
 L     HP J9534A 24p Gig-T PoE+ v2 zl Module     19 W

valley-core# sh system power-supply

Power Supply Status:

 PS#    Model       State        AC/DC  + V      Wattage
 ---- --------- ------------- ----------------- ---------
   1   Unknwn    Powered         AC 120V           875
   2   Unknwn    Powered         AC 120V           875
   3   Unknwn    Powered         AC 120V           875
   4   Unknwn    Powered         AC 120V           875

   4 /  4 supply bays delivering power.
   Total power: 3500 W
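
As a rough cross-check of the PoE figures above, the budget arithmetic can be sketched in a few lines. This is only an illustration: the function name `poe_budget` is made up here, and the rule that full redundancy reserves half of the installed PoE capacity is an assumption inferred from the 600 W "Total Available Power" reported against 4 x 300 W of installed PoE power, not something taken from HP documentation.

```python
import math

# Figures copied from the `sh power-over-ethernet` output above.
PSU_POE_WATTS = [300, 300, 300, 300]  # PoE watts contributed by each PSU
USED_WATTS = 359                      # "Total Used Power" (reported as +/- 6 W)


def poe_budget(psu_poe_watts, used_watts):
    """Summarize the PoE budget (hypothetical helper, not an HP tool)."""
    installed = sum(psu_poe_watts)
    # Assumption: full redundancy holds half of the installed capacity in
    # reserve, which is what 600 W available out of 4 x 300 W suggests.
    available = installed // 2
    remaining = available - used_watts
    # Fewest PSUs whose combined PoE output covers the present draw,
    # assuming all PSUs are identical.
    min_psus = math.ceil(used_watts / max(psu_poe_watts))
    return {
        "installed": installed,
        "available": available,
        "remaining": remaining,
        "min_psus": min_psus,
    }


if __name__ == "__main__":
    print(poe_budget(PSU_POE_WATTS, USED_WATTS))
```

Under those assumptions the numbers line up with what the switch reports (600 W available, 241 W remaining), and the present PoE draw is more than a single 300 W PSU can cover.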

What's unique is that the switch is the only device losing power. None of the connected servers have power issues, despite being on the same battery or PDU.

I can admit that the power in this location is poor and suffers from voltage dips and the occasional spike. But the UPS didn't even log a fault during this recent warm-boot.

I have another 5412zl at an unrelated customer that has done the same thing multiple times in the past.

Any thoughts on what I can do about this? Should I try to move two of the PSUs to utility power instead of all being on the UPS?


Edit:

Boot history shows:

valley-core# sh boot-history

Mgmt Module 1 -- Saved Crash Information (most recent first):
=============================================================
ID: 29008d6a
Active system went down: 02/01/16 09:23:54 K.15.16.0005 335
Switch rebooting due to temporary loss of power or low voltage

ID: 994a405a
Active system went down: 12/14/15 11:31:15 K.15.16.0005 335
switch rebooting due to temporary loss of power or low voltage

An HP change note on a previous firmware revision says:

Power (CR_0000112424) - When the switch is exposed to AC power fluctuations and the voltage drops too low, the switch reboots and generates an incorrect error message saying the switch crashed. With this fix, the error message is changed to "Switch rebooting due to temporary loss of power or low voltage".

This is consistent with this tech note.

ewwhite
  • I'm wondering why the step down transformer. Won't the power supplies work at 208v? – user3528438 Feb 18 '16 at 17:05
  • I have other equipment in the rack, including some that won't run at 208V. For the switch, I use 120V because it needs special notched C13/C15 (_not C14_) cables, and I didn't have any :) - But yes, 208V is an option. – ewwhite Feb 18 '16 at 17:07
  • Well, if the switch is capable of 100V-240V, then skipping the transformer may be a solution, because then it has a much larger cushion at the bottom (it's much harder to make the UPS drop from 208 to 100 than from 110 to 100). Also take a look at the power factors (not efficiency) of all the devices and see if they are equipped with PFC. – user3528438 Feb 18 '16 at 17:16

4 Answers

4

According to this page, your UPS series is of the "line interactive" type. This designation means that it isn't constantly converting the utility power to DC and back to mains level again. Rather, it's just sitting there monitoring the power and keeping its batteries charged. Input power is passed straight through, although it may be passed through a few chokes and a surge protection device along the way for extra safety.

When the utility power goes down or has a voltage dip, the UPS needs to switch its inverter into the circuit to start supplying battery power to the connected equipment. Regardless of how this switching is done (it's going to be either a physical or a solid-state relay), you will always see a "gap" of a few milliseconds. Also, the UPS's inverter probably won't be in phase with the utility power, so the AC waveform jumps to the new phase.

Most equipment doesn't really care if the incoming power is lost for a few milliseconds. The capacitors in the power supply are often large enough to ride over small gaps without a problem. I've seen many servers and network equipment take a couple of complete missed cycles without so much as a glitch.
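
As a rough illustration of the ride-through point above, the energy stored in a supply's bulk capacitor gives an approximate hold-up time: t = C * (V_start^2 - V_min^2) / (2 * P). The sketch below is a simplification under stated assumptions: the function names are made up for this example, and the capacitance, voltages, load, and transfer gap are hypothetical placeholders rather than figures for the J9306A or the SUA3000XL.

```python
def holdup_time_ms(capacitance_f, v_start, v_min, load_w, efficiency=1.0):
    """Milliseconds a bulk capacitor can carry load_w while its voltage
    sags from v_start to v_min (idealized, no losses other than efficiency)."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    if v_start <= v_min:
        return 0.0
    # Usable energy is the difference in stored energy, E = 1/2 * C * V^2.
    usable_joules = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return 1000.0 * usable_joules * efficiency / load_w


def rides_through(holdup_ms, gap_ms):
    """True if the hold-up time covers the UPS transfer gap."""
    return holdup_ms >= gap_ms


if __name__ == "__main__":
    # Hypothetical values: 470 uF bulk capacitor, 380 V bus sagging to 300 V,
    # 400 W load, 10 ms transfer gap.
    t = holdup_time_ms(470e-6, 380.0, 300.0, 400.0)
    print(f"{t:.1f} ms hold-up; survives a 10 ms gap: {rides_through(t, 10.0)}")
```

The useful part is the shape of the relationship rather than the specific numbers: hold-up time shrinks in proportion to load, so a heavily loaded supply has less margin across the same gap than a lightly loaded one.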

My suspicion would be that this particular switch's PSUs are a bit more sensitive than most. I'd think your problem could be solved by getting another UPS (one which is continuously in the loop converting AC-DC-AC) to run the switch off of. This type of UPS is often referred to as "online", although you should check with your vendor to confirm you're getting the right type.

Mels
  • I'd love to buy an Online UPS... (SmartUPS RT). That's not an option for this customer at this location yet, but I do understand why they are a better option. – ewwhite Feb 18 '16 at 17:34
  • 1
    If cost is the major concern, you could consider adding a small second online UPS just for this switch. Just choose a relatively low-powered model (with small batteries) and connect only half the switch's PSUs to that new UPS. That way, you get the continuity from the online UPS and the long backup time from the existing big offline UPS. – Mels Feb 18 '16 at 20:20
  • @Mels Good idea. Should have mentioned that myself. Been there, done that. It is a cost-effective workaround. With this 4 PSU switch, running just one of the PSUs off a small buffer UPS instead of the big UPS could be enough. – Tonny Feb 18 '16 at 21:14
  • Looking at his usage figures, he needs the power of at least two PSUs to carry the normal load. – Mels Feb 19 '16 at 07:53
3

My initial and immediate thoughts are along the lines of what you're contemplating. If these blips are occurring independent of any self-test schedules you have set up on the UPS, I'd do exactly what you're suggesting: move a couple of the PSUs to a different feed, and see if the blips recur. (If the blips happen some percentage of the time during a self-test, then you have a UPS, transformer, or load problem.) If they do recur - and I'm not suggesting this lightly - open a case with HP. It may be a painful, tedious process. However, they can likely help provide guidance to get real debugging info out of the switch. I'd also take a moment to check the release notes/buglists for the current rev of firmware on the switch, too.

vigilem
  • The reboots occur during low-voltage situations and actual utility power outages. Everything else attached to the UPS stays up. – ewwhite Feb 18 '16 at 13:13
  • Okay, that helps clarify the situation. I'd wager the draw from the 5412 is much higher than the draw from the other devices that stay up. Move two of the power supplies to a different UPS or utility feed. If you lose power on the two supplies connected to utility and the switch stays up, then it's about the sensitivity/draw from those four power supplies. Do you have a rough number on the current load on the UPS? – vigilem Feb 18 '16 at 15:19
  • UPS load is 67%. The sensitivity is probably the issue, but I'm trying to find out if this is a toxic combination. I have 1200W of PSU power, 600W redundant and drawing 359W. – ewwhite Feb 18 '16 at 15:20
  • Are these the j9306A power supplies? – vigilem Feb 18 '16 at 15:52
  • Yes, I have 4 x J9306A installed. – ewwhite Feb 18 '16 at 15:53
  • 1
    Okay - so worst-case maximum on those is 1500w/13a @ 100-127 VAC. Obviously we'll never be close to that in normal operations. 600w of each of those PS are for chassis power, with 300w for PoE/PoE+. Your UPS is rated for 2700 watts. You've got four of these PS connected to the UPS, in addition to some other devices which load the UPS at 67% during standard ops - but that's still just a calculation the UPS software is making based on what's being pulled. A spike in draw when the battery is flying solo could lead to what you're describing, since the chassis itself restarts. – vigilem Feb 18 '16 at 16:06
  • 1
    So you think my UPS is possibly overloaded? Maybe distribute to another UPS or do the utility power for 1/2 of the PSUs? – ewwhite Feb 18 '16 at 16:08
  • @vigilem I agree too. See my own answer, 1st part. – Tonny Feb 18 '16 at 16:20
  • @ewwhite - Yes, I do. I would definitely separate two of the PS to another power source - preferably another UPS. – vigilem Feb 18 '16 at 16:28
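
The load arithmetic from the comments above can be laid out as a short sketch. It uses only the figures quoted in the thread (2700 W UPS rating, 67% reported load, four supplies at a 1500 W worst case each). The function name `ups_headroom` is made up for this example, and treating the reported percentage as a fraction of the watt rating is an assumption.

```python
# Figures quoted in the comments above.
UPS_RATED_W = 2700            # UPS rating given in the comments
REPORTED_LOAD_FRACTION = 0.67 # load reported by the UPS
PSU_WORST_CASE_W = 1500       # per-PSU worst case at 100-127 VAC, per the comments
PSU_COUNT = 4


def ups_headroom(rated_w, load_fraction):
    """Return (present load, remaining headroom) in watts, rounded to 0.1 W.

    Assumption: the reported load percentage is relative to the watt rating.
    """
    present = round(rated_w * load_fraction, 1)
    return present, round(rated_w - present, 1)


if __name__ == "__main__":
    present, headroom = ups_headroom(UPS_RATED_W, REPORTED_LOAD_FRACTION)
    worst_case = PSU_WORST_CASE_W * PSU_COUNT
    print(f"present load ~{present} W, headroom ~{headroom} W")
    print(f"theoretical worst case for the four PSUs: {worst_case} W")
```

The steady-state numbers leave headroom, which matches the observation that the UPS logged no fault. The sketch says nothing about short transients at the moment of transfer, which is where the comments suspect the trouble lies.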
2

With the info you just added in the edit it is pretty clear.

2 possible causes come to mind:

1) The UPS, when it actually needs to do the work, slightly drops its output voltage, and the rate of change is steep enough to make the switch think it has a low power condition.
I have seen that happen with UPS units before.
The only remedy is to take some load off the UPS or get a bigger UPS.
In some cases: If the UPS has multiple outgoing circuits, re-distributing the load on those may help. Ideally each circuit should more or less have the same load to it. This minimizes voltage-drop on the outputs.

2) Another possibility, though quite rare, also applies to UPS units with multiple outputs. It could be that the outputs are not exactly in sync with respect to the phase of the AC they provide.
If the PSUs of your switch hook up to several circuits with a phase difference, the power board inside the switch that combines the power of its PSUs may have trouble synchronizing and cause the same problem. In that case the solution is exactly the opposite: put everything on the same circuit.

Tonny
  • Also interesting. So I may be slightly undersized on the UPS side, and there's a potential that I've balanced the PSU distribution incorrectly. The fluctuations in voltage aren't too bad, but seeing [this HP technote](http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0104699) makes me wonder. – ewwhite Feb 18 '16 at 16:22
  • @ewwhite HP and powersupplies... Don't get me started... I've been bitten too many times... – Tonny Feb 18 '16 at 17:14
  • Point 2 is not likely: in all multi-PSU devices I've seen, the power distribution circuit is on the DC side of things and hence doesn't care about phase sync between PSUs. – Mels Feb 18 '16 at 20:38
  • @Mels It ought to be, but... About 10 years ago I saw a major example. Dual PSU HP Proliant server: out-of-phase incoming AC on the PSUs (and the PSU not entirely getting rid of ripple in its DC output) would cause an oscillation in the DC power. This oscillation would eventually lead to an out-of-control, self-amplifying feedback loop. As a result the server crashed due to unstable DC power. It was reproducible and not just on my 10 servers: a friend of mine had 1500 of them in a data center. Some 20% rebooted at least twice a week. It took HP 2 revisions of that board to get it fixed. – Tonny Feb 18 '16 at 21:09
1

The switch says there's a power outage. The overhead lights say there's a power outage. I'm guessing there's no power, even if just briefly. That has nothing to do with the switch and everything to do with the UPS.

I'd double check the power cabling between the switch and the UPS, make sure it's really plugged in where you think it is, perhaps put the switch on a different UPS for a while just to see. It may be this switch is just a little more sensitive to the battery-cutover than your other devices, especially considering it's supplying power to all your phones; that can add up quick.

Joel Coel
  • Yes, the switch is plugged into the right places. It is losing power. But the other devices connected to the same UPS and/or PDU are not losing power. The default for this model of switch is _zero_ PSU redundancy, but I believe I've configured the setting properly. – ewwhite Feb 18 '16 at 15:13
  • If you admit it really is losing power, you need to look at the UPS more than you do the switch, especially given that the switch is supplying power to all your phones... the UPS just might not be able to get enough power to the switch fast enough to keep the phones going. – Joel Coel Feb 18 '16 at 15:19
  • I don't agree. The UPS is either up or down. It functions properly and is powering the rest of the equipment in the rack without incident. I've lowered the UPS sensitivity and transfer points to accommodate the poor power in the region. – ewwhite Feb 18 '16 at 15:48
  • But the rest of the equipment doesn't have near the power draw of the switch. All those phone PoE ports add up quick. – Joel Coel Feb 18 '16 at 15:53
  • @ewwhite If an outage is sufficiently short then some PSUs may be able to ride over it on their internal capacitance. Others may not. If the switch has a universal power supply and you are currently running it off 120V, you might want to consider feeding it with 208V; this should increase the charge on the primary capacitors, which may help with hold-over time. – Peter Green Feb 18 '16 at 16:51