6

We have been experiencing a very strange problem that affects all the power outlets in our new office's server room.

Specifically, when all the equipment is up and running (i.e. the air conditioning system, 2x rack-mounted servers, 5x 48-port PoE switches and the door access system, which has its backup batteries and main control circuits inside the server room), we occasionally see the servers and the door access system spontaneously reboot while the PoE switches lurch into a non-functional state for 20 minutes or more at a time. All three systems fail at the same moment, and all three are on the same circuit.

The servers and switches are running on a UPS device and the card access system also has a backup battery of its own - so a simple momentary loss of power would not explain this as everything should just continue to run from the UPS without interruption. We've disconnected the UPS from the wall and have seen the servers continue to run, as expected - so the UPS seems to be working properly as far as power outages are concerned.

None of the circuit breakers have ever tripped or needed to be reset.

The air conditioning system is apparently on a separate circuit to the servers and network equipment; however, its power cables share a conduit with the power cables which run to wall outlets used by the servers etc. Could there be a risk of a voltage being induced from one circuit to the other when the AC switches on or off as they are parallel to each other for quite a few metres?

I talked to one of the electricians who was trying to work out what was happening and he said that, although the air conditioning unit is on a separate circuit to the servers and other systems, the two circuits actually share a common neutral - something he thought could potentially be causing problems. Is this a normal configuration, or would it be considered bad practice to have something like an AC unit share a neutral with sensitive equipment in a server room?

Currently, the problem has subsided of its own accord. The servers have stopped spontaneously rebooting and switches are back online but no real changes have been made, so the underlying problem is still there and likely to resurface sooner or later.

Given we are seeing multiple systems with separate battery backup units rebooting during these episodes, what possible explanations could there be besides power surges or spikes?

Austin ''Danger'' Powers
  • 1
    Do the UPS units have log files that you can inspect to see if they logged any power related events that can be correlated to these spontaneous reboots and shutdowns? – joeqwerty Dec 29 '14 at 06:29
  • Does this problem occur year round or is it only a recent thing? It could be possible that, being in the winter months, the heaters are kicking on and consuming more power than usual. If that is the case, the recent decrease in issues could be explained by the warmer weather we've been having recently - at least for me on the East coast of the US. – cutrightjm Dec 29 '14 at 06:45
  • @joeqwerty I'm contacting the access control system company to see if the system logs power events and to see if they've ever seen a power issue hard reboot the system before. I'll check the UPS logs. ekaj - we've only just moved into this building and the first day the users came into the office everything in the server room started rebooting without warning. This could have been because the air conditioning was not turned on until the same day but also there was more load on the PoE switches. Hard to say now as the problem has stopped happening and the HVAC is still running. – Austin ''Danger'' Powers Dec 29 '14 at 06:48
  • Can you check the outages over a span of a few days and see if there is any pattern? – cutrightjm Dec 29 '14 at 06:59
  • 1
    @ekaj sure. We had a day of around 50% downtime due to this last week but since then everything has worked perfectly. I'll wait a while and report back with any new info or clues I find in the UPS logs. – Austin ''Danger'' Powers Dec 29 '14 at 07:06
  • 4
    I'd suggest hiring someone to measure the power quality, they can probably bring equipment able to diagnose and detect the problem even when it isn't enough to reboot the room. – derobert Dec 29 '14 at 17:57
  • @derobert that's one of the things we're looking at trying now. There must be something happening that we can detect, even when it's not causing things to malfunction. We checked the UPS logs and it wasn't seeing any surges or spikes. This is strange because whatever happened went straight through the UPS and hit the servers the other side of it, so you'd expect it to have a record of *something*. – Austin ''Danger'' Powers Dec 30 '14 at 06:17
  • 1
    A [voltage logger](http://amazon.com/Supco-LCV-Current-Voltage-Logger/dp/B003CRKDPA) may be a great tool for diagnostics. I'd stick it on the supply side of the UPS. – tedder42 Dec 31 '14 at 23:01
  • You say that it stopped spontaneously ...is the AC unit always on? I have a hunch that that is causing the problem as HVAC systems are *very* electrically noisy as they consume a lot of electricity. In practice, those systems should be isolated in all ways. – Nathan C Jan 05 '15 at 13:59
  • What is the make and model of the UPS? – pauska Jan 05 '15 at 14:14

1 Answer

4

While not the direct "here's your issue" answer you were hoping for, here's my suggestion.

Your quest to find out what is wrong is noble, but it doesn't look like something you are going to solve quickly on your own.

You can do as others have suggested: log anything you can and hope for a pattern to emerge.
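If you do go the logging route, a rough way to look for a pattern is to line up the reboot times against some other logged event, such as the AC compressor starting. The sketch below is only an illustration: the timestamps are made up, the `events_near` helper is hypothetical, and the 30-second window is an arbitrary assumption you would tune to the resolution of your own logs.

```python
from datetime import datetime, timedelta

def events_near(reboots, triggers, window=timedelta(seconds=30)):
    """Return the reboots that happened within `window` of any trigger event."""
    return [r for r in reboots if any(abs(r - t) <= window for t in triggers)]

# Hypothetical timestamps for illustration only (not taken from real logs).
reboots = [
    datetime(2014, 12, 22, 9, 14, 5),
    datetime(2014, 12, 22, 11, 42, 30),
    datetime(2014, 12, 22, 15, 3, 0),
]
ac_starts = [
    datetime(2014, 12, 22, 9, 13, 50),
    datetime(2014, 12, 22, 11, 42, 20),
    datetime(2014, 12, 22, 13, 0, 0),
]

matched = events_near(reboots, ac_starts)
print(f"{len(matched)} of {len(reboots)} reboots fell within 30 s of an AC start")
```

A high match rate would not prove the AC is the cause, but it would give the electricians something concrete to chase.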

I like derobert's suggestion of hiring someone to measure the power quality...

HOWEVER, here's my actual suggestion which you've somewhat already done. Leave it to the electricians.

Seriously. A qualified electrician (even if you have to outsource it) should be able to tell you whether the root cause is electrical in nature and, if so, what it is. They can test each circuit to make sure it isn't overloaded (especially on spikes/startups), and they can make sure the wiring is adequate and the circuits are sized properly for what you are attaching to them.
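To give a feel for the kind of load check involved, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the wattages, the 120 V supply, the 20 A breaker and the `circuit_load` helper are hypothetical, and the 80% derating is a common rule of thumb for continuous loads rather than a statement about your local code. It also ignores power factor and startup inrush, which is exactly the sort of thing an electrician would measure properly.

```python
def circuit_load(watts_by_device, volts, breaker_amps, derate=0.8):
    """Estimate steady-state current and compare it with a derated breaker limit.

    Returns (estimated_amps, allowed_amps, within_limit).
    """
    amps = sum(watts_by_device.values()) / volts
    limit = breaker_amps * derate
    return amps, limit, amps <= limit

# Hypothetical nameplate figures, not measurements from the system in question.
devices = {"servers": 900, "poe_switches": 1500, "door_access": 60}

amps, limit, ok = circuit_load(devices, volts=120, breaker_amps=20)
print(f"estimated {amps:.1f} A against a {limit:.1f} A continuous limit: "
      f"{'OK' if ok else 'overloaded'}")
```

Nameplate wattage usually overstates real draw, so treat a result like this as a prompt to take measurements, not as a verdict.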

Most of the time, IT won't have their own qualified electrician and we often like to just "plug stuff in" and don't realize whether we are using the right circuits, balancing circuits, etc.

If your UPS supports log gathering, I would use it, if for nothing else than to help prove the issue. Your UPS might not be high-end enough to compensate for the spikes/valleys quickly enough, but that doesn't make it the root cause. It sounds like an electrical issue to me. If you are running a nice on-line UPS and it seems to be compensating the input voltage properly (based on its logs), then it would be weird for all of the IT equipment plugged into it and the card reader system to reboot at the same time.

Talk to your boss and explain the issue in terms of needing an expert electrician to diagnose it. It's not fair to expect an electrician to set up BGP routing, and conversely don't expect a sysadmin to be a qualified electrician.

TheCleaner
  • And to add, I've had "relevant similar experiences". Your electrician should be asking you for information such as the power requirements of your equipment and then determining the load on each circuit, checking for voltage/amperage spikes/valleys, etc. You can overwhelm them and your bosses with logs/data if you want to CYA, but it sounds like your electricians are likely in-house facilities guys who aren't delving too far into it beyond the obvious. – TheCleaner Jan 05 '15 at 14:00