467

The other day, we noticed a terrible burning smell coming out of the server room. Long story short, it turned out to be one of the battery modules burning up in the UPS unit, but it took a good couple of hours before we were able to figure that out. The main reason we figured it out at all is that the UPS display finally showed that the module needed to be replaced.

Here was the problem: the whole room was filled with the smell. Doing a sniff test was very difficult because the smell had infiltrated everything (not to mention it made us light headed). We almost mistakenly took our production database server down because that's where the smell was strongest. The vitals appeared to be OK (CPU temps showed 60 degrees C, and fan speeds were OK), but we weren't sure. It just so happened that the battery module that burnt up was at about the same height as the server on the rack and only 3 ft away. Had this been a real emergency, we would have failed miserably.

Realistically, actual server hardware burning up is fairly rare, and most of the time we'll be looking at the UPS as the culprit. But with several racks holding several pieces of equipment, it can quickly become a guessing game. How does one quickly and accurately determine what piece of equipment is actually burning up? I realize this question is highly dependent on environmental variables such as room size, ventilation, location, etc., but any input would be appreciated.

yoozer8
Chad Harrison
  • 34
    @DeerHunter Well, thank goodness it was the end of the day and there were very few people in the building. Thank you for your constructive criticism, and I'll be sure to let my supervisor know what lives she risked in deciding to keep the system up. – Chad Harrison Apr 04 '13 at 19:04
  • 12
    @hydroparadise - somebody has to have the guts to say "**STOP** We are not doing this thing right". If your supervisor doesn't understand safety rules, there's not really much that can be done, except growing some spine and not bowing to the urge to cut corners. – Deer Hunter Apr 04 '13 at 19:23
  • 114
    @DeerHunter: What would be the appropriate response when you smell something burning? There's no visible smoke, just a burnt smell. Do you turn off the entire datacenter, vent it out for a few hours, then turn on servers one by one until the smell returns? A small 25 rack datacenter could have 1,000 servers to check on, that's a lot of downtime for a "smell" -- the OP didn't report visible smoke or fire. – Johnny Apr 04 '13 at 19:43
  • 24
    @Johnny - Quoting the OP: "the whole room was filled with the smell. Doing a sniff test was very difficult because the smell had infiltrated everything (not to mention it made us light headed)" Answering your question - yes, you have to vent the room, and troubleshoot **systematically**. Anything else is irresponsible. – Deer Hunter Apr 04 '13 at 19:51
  • I am guessing you have at least one additional, redundant server room. So hit the kill switch, cycle the air in the room, check the sensor logs, remove and replace the defective equipment, and restart. – ctrl-alt-delor Apr 08 '13 at 11:15
  • 1
    Was it really this bad, or are you exaggerating a bit? Just asking because I have seen people overreact to simple problems like a blown capacitor. – Stefan Lasiewski Apr 10 '13 at 20:24
  • 15
    So, are those critical of the OP's handling of the smell suggesting that there is no difference in urgency between a smell and a fire/smoke? If you smell something burning in your house but see no smoke and hear no alarm, do you rush you and your family out of the house and call 911? – trpt4him Apr 11 '13 at 00:49
  • 8
    Servers don't explode. Throwing my hat in with the people who investigate first, overreact later. – DeeDee Apr 30 '13 at 14:04
  • 1
    For everyone recommending calling the fire department: keep in mind two hours after the EPO switch is hit, this poster would likely still have been searching for the smell. A subtle melted whatever can indeed take a while to locate. – Bryce May 02 '13 at 23:31
  • 8
    @trpt4him `If you smell something burning in your house but see no smoke and hear no alarm, do you rush you and your family out of the house and call 911?` Yes. And they haven't asked me to cook, since. As far as I'm concerned, this policy of mine is working wonderfully. Although it does mean I'm no longer allowed near the toaster. – Parthian Shot Jul 17 '14 at 17:29

7 Answers

390

The general consensus seems to be that the answer to your question comes in two parts:

How do we find the source of the funny burning smell?

You've got the "How" pretty well nailed down:

  • The "Sniff Test"
  • Look for visible smoke/haze
  • Walk the room with a thermal (IR) camera to find hot spots
  • Check monitoring and device panels for alerts

You can improve your chances of finding the problem quickly in a number of ways - improved monitoring is often the easiest. Some questions to ask:

  • Do you get temperature and other health alerts from your equipment?
  • Are your UPS systems reporting faults to your monitoring system?
  • Do you get current-draw alarms from your power distribution equipment?
  • Are the room smoke detectors reporting to the monitoring system? (and can they?)
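
As a rough illustration of the monitoring questions above, here is a minimal Python sketch that turns raw readings into alerts. The metric names, the threshold values, and the `check_readings` helper are all hypothetical; real readings would come from whatever your equipment and monitoring system actually expose, and the limits would need tuning for your room.

```python
# Hypothetical alert limits; tune these to your own equipment and room.
THRESHOLDS = {
    "inlet_temp_c": 35.0,        # rack inlet air temperature
    "pdu_current_a": 24.0,       # current draw on one PDU circuit
    "ups_battery_temp_c": 40.0,  # UPS battery module temperature
}


def check_readings(readings, thresholds=THRESHOLDS):
    """Return (device, metric, value, limit) for every reading over its limit.

    `readings` maps a device name to a dict of metric -> value.
    Metrics without a configured threshold are ignored.
    """
    alerts = []
    for device, metrics in sorted(readings.items()):
        for metric, value in sorted(metrics.items()):
            limit = thresholds.get(metric)
            if limit is not None and value > limit:
                alerts.append((device, metric, value, limit))
    return alerts


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = {
        "ups-1": {"ups_battery_temp_c": 58.0},
        "db-prod": {"inlet_temp_c": 27.5},
    }
    for device, metric, value, limit in check_readings(sample):
        print(f"ALERT {device}: {metric}={value} exceeds {limit}")
```

Even a check this simple narrows the search: a device that shows up here is a far better first suspect than whichever rack happens to smell the strongest.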

When should we troubleshoot versus hitting the Big Red Switch?

This is a more interesting question.
Hitting the big red switch can cost your company a huge amount of money in a hurry: clean agent releases can run into the tens of thousands of dollars, and the outage / recovery costs after an emergency power off (EPO, "dropping the room") can be devastating.
You do not want to drop a datacenter because a capacitor in a power supply popped and made the room smell.

Conversely, a fire in a server room can cost your company its data/equipment, and more importantly your staff's lives.
Troubleshooting "that funny burning smell" should never take precedence over safety, so it's important to have some clear rules about troubleshooting "pre-fire" conditions.

The guidelines that follow are my personal limitations that I apply in the absence of (or in addition to) any other clearly defined procedure/rules - they've served me well and they may help you, but they could just as easily get me killed or fired tomorrow, so apply them at your own risk.

  1. If you see smoke or fire, drop the room
    This should go without saying but let's say it anyway: If there is an active fire (or smoke indicating that there soon will be) you evacuate the room, cut the power, and discharge the fire suppression system.
    Exceptions may exist (exercise some common sense), but this is almost always the correct action.

  2. If you're proceeding to troubleshoot, always have at least one other person involved
    This is for two reasons. First, you do not want to be wandering around in a datacenter and all of a sudden have a rack go up in the row you're walking down and nobody knows you're there. Second, the other person is your sanity check on troubleshooting versus dropping the room, and should you make the call to hit the Big Red Switch you have the benefit of having a second person concur with the decision (helps to avoid the career-limiting aspects of such a decision if someone questions it later).

  3. Exercise prudent safety measures while troubleshooting
    Make sure you always have an escape path (an open end of a row and a clear path to an exit).
    Keep someone stationed at the EPO / fire suppression release.
    Carry a fire extinguisher with you (Halon or other clean-agent, please).
    Remember rule #1 above.
    When in doubt, leave the room. Take care with your breathing: use a respirator or an oxygen mask. This might save your health in case of a chemical fire.

  4. Set a limit and stick to it
    More accurately, set two limits:

    • Condition ("How much worse will I let this get?"), and
    • Time ("How long will I keep trying to find the problem before it's too risky?").

    The limits you set can also be used to let your team begin an orderly shutdown of the affected area, so when you DO pull power you're not crashing a bunch of active machines, and your recovery time will be much shorter, but remember that if the orderly shutdown is taking too long you may have to let a few systems crash in the name of safety.

  5. Trust your gut
    If you are concerned about safety at any time, call the troubleshooting off and clear the room.
    You may or may not drop the room based on a gut feeling, but regrouping outside the room in (relative) safety is prudent.
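
The two limits from rule 4 can be written down before anyone enters the room. A minimal Python sketch, assuming a hypothetical 0-3 condition scale and a 15-minute time budget (both are illustrative placeholders, not recommendations; `should_stop` is a made-up helper):

```python
def should_stop(elapsed_s, condition_level, max_elapsed_s=900, max_condition=2):
    """Return True once either agreed limit has been reached.

    condition_level uses a hypothetical scale agreed on beforehand:
    0 = faint smell, 1 = strong smell, 2 = haze, 3 = visible smoke or fire.
    """
    return elapsed_s >= max_elapsed_s or condition_level >= max_condition


if __name__ == "__main__":
    # Ten minutes in, strong smell but no haze: still inside both limits.
    print(should_stop(elapsed_s=600, condition_level=1))
    # Haze has appeared: the condition limit is hit regardless of time.
    print(should_stop(elapsed_s=120, condition_level=2))
```

The point of fixing the numbers up front is that nobody has to argue about them while standing in the fumes. Rules 1 and 5 still override this: smoke, fire, or a bad gut feeling ends the troubleshooting no matter what the function returns.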

If there isn't imminent danger you may elect to bring in the local fire department before taking any drastic actions like an EPO or clean-agent release. (They may tell you to do so anyway: their mandate is to protect people, then property, but they're obviously the experts in dealing with fires, so you should do what they say!)

We've addressed this in comments, but it may as well get summarized in an answer too -- @DeerHunter, @Chris, @Sirex, and many others contributed to the discussion

voretaq7
  • 30
    The university I went to installed a new data center. They implemented a highly sophisticated EPO/Fire Suppression system. The equipment it was protecting was in the millions of dollars and it was also being used for millions of dollars of research for the medical part of the school. Obviously, if it was needed the red button would be hit, but that being said, if the red button *was* hit, just resetting it was close to $200,000 US. ***Tax Payer Dollars***. You can sure as hell bet that if the switch was hit when it wasn't needed, the guy who hit it would no longer have a job. – Ryan Apr 04 '13 at 23:14
  • 28
    +1 for the buddy system. I think it's a little nuts that there are DCs out there that use the EPO to dump fire suppression as well. There are plenty of situations where you'd want to EPO without wanting to dump halotron all over the guy getting electrocuted. An EPO is a serious deal but isn't a "destroy everything in the DC" kinda deal, or at least shouldn't be. The guys in the DC should hopefully understand the big red button and the fire suppression system well enough to weigh the effect of hitting the button. An EPO may actually *stop* a fire and save the DC, for instance. – chris Apr 05 '13 at 03:00
  • 13
    An important note I haven't seen mentioned is that the majority of the time when something fails so as to give off a burning smell, whatever is burning will *extinguish itself before the odor is detected* and without burning anything outside the failed equipment. Sometimes a piece of equipment will continue to smoulder as long as it has power, but if one sees smoke it should be possible to identify the equipment, cut power just to it, and see whether the smoke soon clears or continually gets worse. – supercat Apr 05 '13 at 16:21
  • 1
    @ryan: If hitting the big red button costs so many tax payer dollars, the responsible person has hopefully worked out a plan to resolve minor incidents with the local fire department that doesn't involve endangering employees. – Christoph Apr 06 '13 at 08:59
  • 3
    @ryan That reminds me of a TV report about CERN that I saw recently: the camera team and reporter were taken right into the guts of the system, and at one moment one of the camera guys *almost* rammed a red emergency off button with his backpack - giving near heart attacks to the staff guy thinking about the reboot costs ... – Hagen von Eitzen Apr 03 '16 at 10:45
  • We practice fire suppression tests at our Video hub offices on a regular basis. Usually these tests are performed by the suppression system vendor. Anyway, we had an incident where, once the fire suppression systems test was complete, the contractor performing the work hit the WRONG "red button", and killed power for the entire floor. Keep in mind that the "Fire Test End" switch was out in the open, and the "Main Power" switch was under a spring-loaded cover you had to hold with one hand whilst pushing the scram switch with the other. Oops. – George Erhard Mar 03 '17 at 22:56
186

A thermal imaging camera could do the job and let you identify where the overheating is. A device like this would also let you identify the origin of a fire or burning in a smoke-filled room.

ddalcero
  • 31
    Thermal cameras go for under a grand nowadays, and if you are running a big server room they are a tool well worth having. – rackandboneman Apr 04 '13 at 15:18
  • 17
    A T.I.C. is not so expensive and is very useful in a datacenter or big server room. Not only in case of problems like overheated cables or equipment, but also for prevention or early detection of issues, refrigeration optimization, air flow, etc. – ddalcero Apr 04 '13 at 15:19
  • 42
    A laser temperature gun, like [this one](http://www.amazon.com/gp/product/B002YE3FS4/ref=as_li_qf_sp_asin_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=B002YE3FS4&linkCode=as2&tag=byte56-20), is a cheap alternative – House Apr 04 '13 at 16:40
  • 4
    @mfinni Electricians also often have thermal cameras. (A thermal imaging check of our power distribution panels every year, or after any major wiring work, was standard when I worked at a hosting company). – voretaq7 Apr 04 '13 at 19:58
  • 3
    A thermal camera has very large limitations: 1. The field of view may prevent the usage 2. Your environment may be very dense. [Big fires will be spotted but not small ones] 3. Averaging of temperatures will be needed to determine a threshold – monksy Apr 05 '13 at 15:43
  • 2
    While I agree about the usefulness of a thermal imaging camera in other areas, while using it to find the source of the "funny smell", you could be inhaling toxic fumes. To be safe, one would need to have breathing apparatus and people who are trained to use it on site. The question is why not just call the fire department anyway and have them charge you for the call? – Christoph Apr 06 '13 at 08:54
  • Infrared laser thermometer - less than $100. – Mary Dec 06 '14 at 09:40
  • @Christoph The fire department can't do much. You still have to find out which device is burning. – Navin Apr 04 '16 at 09:08
142

You do none of these things that have been said. You leave the hazardous environment because whatever is being pumped through the entire room is dangerous to your health and may really mess up your lungs. If there is an acrid smell of something burning in the room that you can't find, call (911|112|999|whatever emergency number fits your jurisdiction) and let the fire (company|department|brigade) sort it out while they're on bottled air.

Computer parts contain all sorts of interesting chemicals including mercury, cadmium, lead, and lots of plastics in casings. Notice that all the links I made explain how low level exposures can cause lasting damage or even quick death. This is an environment that can be immediately dangerous to life and health.

... so really, if something is burning, don't spend hours sniffing the fumes. If you can't identify it and immediately act to contain it, get out.

Jeff Ferland
  • 18
    It should be added that if this happened in a "real" datacenter with smoke detectors integrated with the air conditioning and an extinguishing system installed, the fire alarms would have gone off and the room would be sealed and flooded with Argon or CO2 automatically, so there could not even be a thought about running around and sniffing equipment. – the-wabbit Apr 05 '13 at 07:22
  • 8
    @syneticon-dj This depends on the *type* of detectors installed. Ionization detectors might have tripped the fire suppression, but I've worked in (and currently host equipment at) places that have optical smoke detectors - Those require visible smoke (or at least a good haze) before they trip. – voretaq7 Apr 05 '13 at 15:57
  • A hundred times yes @JeffFerland The number of toxic chemicals in computer parts that, even at mild exposure levels, can cause serious long term damage is not to be taken lightly!!! – NULLZ Apr 06 '13 at 02:12
  • 3
    I wish I could upvote this more. at the risk of being controversial, 'get a professional' firefighter is the only way forward. – user9517 Apr 06 '13 at 17:14
  • 22
    Yeah, as a former firefighter, I wouldn't stay there without my gear. Even when a fire is out, we are trained to stay packed up because of the poisonous gasses. If I would call the pros, you should too! – Jeff Ferland Apr 06 '13 at 17:50
  • A colleague of mine was a civilian part of a firefighter exercise. When they entered a burning container for just a few seconds, he asked if he could do it without gear (and just hold his breath). They told him he would be unconscious within seconds. So I would not want to mess with it. – mafu Sep 20 '15 at 21:25
  • 1
    Even if you call in the firemen, you still have to find out which device is burning. This doesn't answer the question. – Navin Apr 04 '16 at 09:09
  • 1
    @the-wabbit: In one of the companies I worked for we had a huge server farm, over 1000 racks, 25000 servers. They generated a lot of heat; naturally, we had a lot of cooled air pumped through. At the routine fire inspection the inspector told us that we were perfectly up to code, but with that amount of cooled air pumped through none of the sensors would respond to fire. Sure enough, when later one of the racks caught fire with visible smoke, heat and smoke dissipated so fast that none of the ceiling smoke and heat sensors was triggered. – Michael Oct 12 '16 at 20:41
  • 2
    @Michael the designs I've seen did not rely on ceiling smoke detectors but had photoelectric detectors in the return air flow. The only time I have seen it trigger was during a testing routine where the argonite system had been detached and a smoke source had been placed in one of the closets. It worked as I would expect it to work. Thankfully, I never had to deal with real fires. – the-wabbit Oct 13 '16 at 07:37
76

If you had proper monitoring on the UPS (usually via SNMP), the unit itself should have rung the bells on your monitoring system. If it didn't, talk to your vendor about that. It either malfunctioned or your monitoring system isn't properly configured.

If something active is actually burning, it should be complaining about it in some way, or simply be off the network, which should also cause an alarm.
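
The two signals described here (a device that complains, or a device that goes quiet) can be combined into a single shortlist. A minimal Python sketch, where the data shapes, the 120-second silence window, and the `devices_needing_attention` helper are all assumptions made for illustration; in practice the inputs would come from your monitoring system's SNMP polls or heartbeats:

```python
def devices_needing_attention(last_seen, faults, now, max_silence_s=120):
    """Return {device: reason} for devices that report a fault or have gone quiet.

    last_seen: device name -> timestamp (seconds) of its last successful poll
    faults:    device name -> fault message the device itself reported
    now:       current timestamp in the same units as last_seen
    """
    flagged = {}
    # A device that is actively complaining is the first suspect.
    for device, message in faults.items():
        flagged[device] = f"fault: {message}"
    # A device that has dropped off the network is the next one.
    for device, seen_at in last_seen.items():
        silence = now - seen_at
        if device not in flagged and silence > max_silence_s:
            flagged[device] = f"silent for {int(silence)}s"
    return flagged


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    shortlist = devices_needing_attention(
        last_seen={"ups-1": 1000, "sw-3": 700, "db-prod": 990},
        faults={"ups-1": "replace battery module"},
        now=1000,
    )
    for device, reason in shortlist.items():
        print(device, "->", reason)
```

A shortlist like this does not replace the sniff test, but it tells you where to start walking.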

If it's something like an actual power rail burning through insulation, and it's not on a smart PDU, then we're back to your original question, which is "how do I find a burning thing?" And I think the proper answer is "Hit the EPO and figure it out. Your production servers are probably not important enough to go risking lives."

mfinni
  • 1
    I think it's safe to assume that there are any number of possible "halt and catch fire" failure modes that could occur but be outside the "visibility" of a device's built-in monitoring system. I'm wondering what ideas are out there for detecting those kinds of failures. – Evan Anderson Apr 04 '13 at 14:29
  • 13
    What does EPO mean? – Midhat Apr 04 '13 at 15:10
  • 39
    Emergency Power Off... the big red button that cuts all power to the room. Mostly for when it's on fire. – Grant Apr 04 '13 at 15:15
  • 11
    An emphatic +1, would have voted +1,000. Hit the button, evacuate, wait, sort out things later. Doing business as usual with fire and smoke present (and trying to troubleshoot anything) is one of the worst mistakes an engineer can make. – Deer Hunter Apr 04 '13 at 18:20
  • 37
    @chris I have to respectfully disagree on "EPO, Leave, Wait" -- Activating the EPO and/or clean agent release for a room full of production gear can very often be what we like to call a *Career Limiting Move*. If there is not an *active, visible* fire or trail of smoke coming from some equipment, performing some initial investigation is usually the Right Thing. Of course you should absolutely be prepared to bolt from the room while hitting the appropriate red buttons at any point in your investigation. – voretaq7 Apr 04 '13 at 19:52
  • 13
    It's likely even a perfect monitoring system would not have caught this until the same moment the UPS panel said "Replace Module" -- that being said you certainly want your monitoring system to bring such things to your attention. Next time a module may fail at 19:30 on a Friday when nobody's around, and the monitoring alert will get you to come back in and deal with the problem before it develops into a fully-fledged emergency. If you can tie monitoring into your FACP your smoke and/or heat sensors may even warn you about insulation burning off power rails and the like. – voretaq7 Apr 04 '13 at 19:55
  • 1
    @EarthEngine Usually the EPO cuts power to the entire bus it services (for most operations that means "the whole room") - Most configurations I'm familiar with command a UPS "load disconnect" or equivalent, and there's a separate process to disconnect the UPS from utility/generator power. [APC has a whitepaper (#22 - "Understanding Emergency Power Off") that's worth a read to understand the basics](http://oksolar.com/pdfiles/power_apc_emergency_power.pdf), and you can pick up the rest from talking to a good DC manager & site electrician. – voretaq7 Apr 05 '13 at 16:19
  • 3
    I see both sides to this argument. On the one hand, human life is not to be toyed with; not your own, not your co-workers'. On the other hand, I work for a company whose datacenter is the nerve center for our "command center" of humans responding to alarm panel signals. We don't have a BRB, because for us, a BRB is itself a threat to human life safety (namely those of our customers, who expect us to be there when they hit *their* panic switch). The UL, which must OK everything we do WRT power/networking/etc, would laugh all the way out the door at an EPO on our DC. – KeithS Apr 05 '13 at 17:53
  • 7
    If we were to have a BRB that would actually be worth something in the case of an electrical fire, we'd have to have a complete carbon copy of our already triple-redundant DC in an offsite location to provide failover. That DC would have to be independently certified by the UL for redundancy as if it were our only one, it would have to be able to handle peak traffic by itself (meaning we couldn't load-balance with it), and we would have to be able to get to it with our primary DC dark. Now compare that to the cost of a mask and air bottle and a 15-lb Class C fire extinguisher. – KeithS Apr 05 '13 at 18:26
  • 3
    @KeithS: This was a very needed response. +1 – sjas Apr 07 '13 at 08:58
  • And another +1 for "if something active is actually burning, it should be complaining about it in some way" – Karma Fusebox Apr 09 '13 at 20:42
44

This is one of those situations where

[image: XKCD Die Hard sysadmin]

doesn't apply; you should call a professional.

[image: Firefighter in protective gear]

Anything else is just plain stupid.

Deer Hunter
user9517
  • @Navin No, _you_ don't; the guys in the fire department do that. – user9517 Apr 04 '16 at 09:32
  • Some comments were deleted here, but as stated, the answer is direct: if something is burning, you can be intoxicated by the fumes or suffer some other accident. Don't try to go in to find what is burning, as it's dangerous, especially in an isolated server room. – yagmoth555 Jan 06 '20 at 17:52
40

As someone whose former career was as an electronic tech, I have experience with "burning smells" that were not fires. This isn't uncommon.

I wouldn't shut down a data center for a smell. Smoke is another matter: something is really burning (usually, but a pea-sized tantalum capacitor can fill a room with smoke too). It's amazing how much smell a fried component in a power supply can make.

A TIC or IR thermometer (a useful tool and a lot cheaper than a TIC) would not necessarily show it, as the component doesn't generate much heat at all and it's inside a case. But check for devices not working, and use your monitoring tools. For a smell like that, 95% of the time it'll be a power supply, which affects the performance of the whole device.

Malcolm
  • 2
  • 3
    +1, blown power supplies are common. In most datacenters with high airflow rates the smoke is blown away quickly and it is difficult to locate the source of the smell. In a small room however, the smell can be pretty bad, and can quickly spread throughout the entire room. – Stefan Lasiewski Apr 10 '13 at 20:15
18

I like the IR imaging or thermometer answers but maybe what would also help is a real "odor detector". After all what triggered your caution was the smell. Smoke, heat, IR etc. are all surrogates.

Something like this one from Shinyei. I've personally never used them or even seen them used in a datacenter. But at least theoretically it should be a neat tool. If you have the money to spend on this gizmo, that is.

http://www.sca-shinyei.com/odormeter or http://www.intopsys.com/products/cyranose.html?gclid=CNXXzOrLs7YCFUws6wodViYApQ

It gives you an odor strength as well as classification. So homing in onto the odor should be possible. Devil's in the details of course. How sensitive it is, masking out spurious background odor etc.

One advantage over purely temperature based measurements is that often odor occurs at a far earlier point or threshold. Or if the overheated component is hidden by a body / concealed wiring etc. it is easier to detect molecules escaping than a line-of-sight hot spot.

Another situation is a non-heat related smell. We've had a cooling circuit leak before and the coolant smells were peculiar too. I won't even go into the now ancient case of a rodent dead in the ducts. :)

I was surprised how sensitive these sensors are. Apparently H2S / mercaptans etc. (usual culprits) are detectable at sub-ppm levels.


curious_cat