
We are currently working across our environment and disabling all ways that an HP server can automatically reboot. This is in response to a massive outage which caused our servers to begin flapping, causing a service outage for several million customers. The request from "on high" is to have the servers shut down, but not reboot until a human can manually guide them back online when the "coast is clear" (we have several geographically redundant sites).

So far, I have identified the following possible causes:

  1. HP ASR automatically reboots a host. This can be disabled by switching off the ASR timer.
  2. iLO automatic power-on restarts a host once power returns. I believe this is only triggered when power is removed and then re-applied to the host; it can be disabled in iLO.

However, I assume there is yet another configuration that is applied when one of the server sensors passes a critical threshold, for example if the ambient temperature sensor exceeds 40 degrees C. That should absolutely shut down a host, but I'm unsure where the configuration lies to disable the automatic reboot after the ambient temperature drops. Or is this also controlled by HP ASR?

I just want to ensure that there aren't any scenarios that I have forgotten that could bite us in the butt in production.

Any help would be appreciated.

Matthew
  • This seems like you're trying to minimize or mask the symptoms, which might be valuable if you're also addressing the root cause of the outage. Are you doing that as well? – joeqwerty Oct 14 '16 at 15:26
  • The root cause is already addressed (bad HVAC controllers). However, the opinion from both our engineering and operations teams is that we can no longer trust the environmental control team to know what they are doing. So, we have to engineer our servers to be resilient. We don't really gain any functionality by having automatic recovery in place, so it makes more sense to let our geographic redundancy take over. – Matthew Oct 14 '16 at 15:32
  • If you're as big an HP customer as you sound like you are, you should have no problem getting an HP resource to clue you in on what options are available on their platform to make these events trigger a shutdown rather than a reboot. Is there a reason you haven't done that? That would have been the first thing I'd have done. "Hey, vendor, we just had an outage for millions of customers. Tell me how to configure your platform to prevent that from happening again." – HopelessN00b Oct 14 '16 at 16:02
  • We've gone that route, but haven't gotten satisfactory answers. From a corporate standpoint, HP has guaranteed the business (every year they are put through a reverse auction with several other vendors, HP wins) so outside of good response when servers fail, we have had issues getting direct feedback when we need general questions answered. – Matthew Oct 14 '16 at 16:11
  • But there are only so many technical fixes available to you. I'd be curious if you expect that another vendor would have more acceptable options. But seriously, the HP server is doing everything it can to protect your data with three different options available to tweak the behavior. Would Dell or Supermicro have that? _Maybe_, but the fact that it's not clear may indicate that this is a facility and environment issue. – ewwhite Oct 14 '16 at 16:13
  • Take the decision away from the server. Monitor the environment settings externally and get the external system to issue a standard shutdown command for your OS. – Drifter104 Oct 14 '16 at 16:22

1 Answer


The cleanest approach to this is to control your environment.

The ambient temperature thresholds for these server platforms are well documented.
Focus on keeping your facility and environment within those thresholds. (repeating myself?)

If you have the number of customers described, this task falls on your facilities and/or datacenter team, right?

On the local server level, your only other parameter is the BIOS Thermal Shutdown option.

If you're experiencing this type of issue, it's rarely so sudden and unexpected that you don't have time to automate power-off of the affected systems via iLO.
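As a rough sketch of automating that power-off from outside the server: the Python below watches ambient readings, powers a host off after several consecutive readings above a threshold, and then refuses to act on it again until a human clears it. The class name `ThermalWatchdog`, the `power_off` callback, and the three-reading default are all hypothetical; the 40 °C figure is taken from the question. The callback is where an iLO or OS shutdown call of your choosing would go, which this sketch does not attempt to implement.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class ThermalWatchdog:
    """Hypothetical external monitor: shuts hosts down, never powers them on."""

    # Assumption: the caller supplies the real shutdown action (iLO, SSH, etc.).
    power_off: Callable[[str], None]
    # Threshold from the question; tune to your platform's documented limits.
    shutdown_above_c: float = 40.0
    # Assumption: require a few consecutive hot readings to ignore sensor glitches.
    consecutive_needed: int = 3
    # Hosts that were powered off and are waiting for a human to clear them.
    latched: Set[str] = field(default_factory=set, init=False)
    _strikes: Dict[str, int] = field(default_factory=dict, init=False)

    def observe(self, host: str, ambient_c: float) -> bool:
        """Record one reading. Return True only if this call powered the host off."""
        if host in self.latched:
            # Already shut down; stay hands-off even if the room has cooled.
            return False
        if ambient_c <= self.shutdown_above_c:
            # A normal reading resets the consecutive-hot counter.
            self._strikes[host] = 0
            return False
        self._strikes[host] = self._strikes.get(host, 0) + 1
        if self._strikes[host] < self.consecutive_needed:
            return False
        self.power_off(host)
        self.latched.add(host)
        return True

    def clear(self, host: str) -> None:
        """Called by an operator once the coast is clear; does not power anything on."""
        self.latched.discard(host)
        self._strikes[host] = 0
```

Feed `observe()` from whatever external sensor source you trust. Because the watchdog only ever powers hosts off and latches them, bringing a host back stays a manual step.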

ewwhite
  • Controlling the environment is great, but sometimes, the environment goes crazy. Like if a bad HVAC controller gets installed and goes nuts during a brown-out condition and closes all vents and shuts down all fans in the building (yes, this actually happened), sending the entire DC over 140 degrees F over the course of about an hour. The environmental conditions of the datacenter are owned by that team, however the servers in the DC are owned by our team. So, we have to attack from both angles. Our gear needs to assume that the environment isn't stable now. – Matthew Oct 14 '16 at 15:28
  • You're asking too much. I've outlined the options available to your hardware. How you deal with the failures of your environment is a matter for your team. – ewwhite Oct 14 '16 at 15:39