PDU management interface has low availability - product flaw or isolated issue?

Question

Our colocation provider has supplied us with APC AP7932 switched 0U PDUs as part of several cabinets they provide us. We have had a lot of trouble with the network management aspect of these PDUs, which I'll describe below. We are moving to cage space in the same datacenter, and will purchase our own PDUs for the cage. I'd like to determine which enterprise-grade PDUs have been reliable performers from a remote management perspective so that we don't end up buying something that looks good on paper but is a nightmare to use.

Our colo-provided PDUs are configured to support management via an SSL web UI and via telnet. We updated the firmware on all of them to the current version as of November 2011. They respond to pings reliably, and we have no reason to suspect a network layer issue. However, we experience frequent hangs, timeouts, disconnects, and general unavailability from the embedded management host in all of the PDUs. We occasionally have to restart the microcontroller on the PDU to recover from what appears to be a hard fault. The outlets stay powered (thankfully), but the management aspect is so unreliable that it has become an ops liability - we can't be confident that we could get into the PDU to power cycle a host if we needed to. We have 3 PDUs that all exhibit identical behavior.

There are many manufacturers of enterprise-grade 0U switched PDUs, all with comparable features. If I looked at the datasheet for our current PDUs, they would appear to be a good fit -- only with the benefit of suffering through using them do we know to avoid them. I'd like to avoid picking a PDU that looks fine on paper, but has similar reliability issues.

What has been others' experience with switched PDUs? Is this level of flakiness normal?

[Product recommendations are off topic on all Stack Exchange sites.](http://blog.stackoverflow.com/2010/11/qa-is-hard-lets-go-shopping/) — HopelessN00b, Aug 28 '12 at 20:49
@HopelessN00b: I was trying to keep this away from requesting a specific product recommendation. I've had ops issues with a particular component and am trying to leverage the expertise in the community to avoid a repeat scenario. I've edited the title to better reflect my intent. — HikeOnPast, Aug 28 '12 at 21:36
Seems on topic now. I can't cancel a close vote, so just ignore it, I suppose. — HopelessN00b, Aug 28 '12 at 21:46
Are the devices under warranty? Are they yours or just installed in your cabinet (owned by the data center)? — Andrew, Aug 29 '12 at 05:23
@Andrew: They're owned by our colo provider. With our move to cage space, we'll be purchasing our own with the express goal of avoiding buying lemons. — HikeOnPast, Aug 29 '12 at 16:40

score 2 · Accepted Answer · answered Aug 29 '12 at 17:17

What you describe is not normal, sorta. However, how are you determining availability? Do you have a monitoring solution constantly pinging/probing the device?

In the past, I had OpenNMS set to collect from my APC UPS and PDU devices. Some of the checks, specifically the HTTP, FTP and telnet probes, caused the management interface to time out, creating 30-60 second outages. Maybe that's what you're seeing.

I've never had issues with SNMP collection, however. So if this is the case, try to reduce the hits against the management interface and only focus on collecting what you need.
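One way to put a number on "availability" without adding much load is to probe the management ports slowly and record how often a connection opens. The following is only a minimal sketch, not something from my OpenNMS setup: the function names `probe` and `success_rate`, the default interval, and the attempt count are all assumptions for illustration. Note that a successful TCP connect only shows the port accepted a connection; it does not prove the web UI will finish loading.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def success_rate(host, port, attempts=10, interval=60.0, timeout=5.0):
    """Probe host:port `attempts` times, pausing `interval` seconds between tries.

    A long interval keeps the probing itself from loading the embedded
    management host. Returns the fraction of attempts that connected.
    """
    ok = 0
    for i in range(attempts):
        if probe(host, port, timeout):
            ok += 1
        if i < attempts - 1:
            time.sleep(interval)
    return ok / attempts
```

For example, `success_rate("192.0.2.10", 443)` and `success_rate("192.0.2.10", 23)` would compare HTTPS and telnet reachability over roughly ten minutes each; the address is a placeholder, not a real PDU.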

An excerpt from my OpenNMS availability chart on an APC interface (image not reproduced here).

answered Aug 29 '12 at 17:17 by ewwhite

I'm using availability in the loosest sense - basically my subjective sense of "can we remotely manage the PDU when we want to". We do have a monitoring solution (LogicMonitor) looking at several SNMP datapoints every 5 min and ping every 1 min. Strangely, the PDUs appear generally healthy from the monitoring system's perspective, yet when we want to log in and actually *do* something on them, they rarely work on the first few attempts, and occasionally lock up to the point of requiring a hard restart (after which, monitoring confirms the unhappiness). – HikeOnPast Aug 29 '12 at 17:57
We also had http and https monitoring configured. I've disabled it and will retest soon. – HikeOnPast Aug 29 '12 at 18:15
Disabling http and https monitoring did not improve availability using the web UI (https), though several tests via telnet were 100% reliable (small sample size), which is an improvement. The only remaining monitoring is via ping and snmp. Perhaps it's time to disable ping at this point. – HikeOnPast Aug 29 '12 at 19:10
Kill the ping... rely on SNMP (if you're pulling power data). – ewwhite Aug 29 '12 at 19:13
Killed the ping...no such luck fixing access via https. The PDUs are sometimes unavailable, and sometimes allow me to log in, then hang halfway through loading the home screen. I don't mind doing everything via telnet, but it just doesn't seem like things should be this unreliable. @ewwhite: Thanks for sharing your experience. I was really hopeful that reducing monitoring traffic would improve things. – HikeOnPast Aug 29 '12 at 20:09
