8

In line with the question about Best sysadmin accident, what's the worst accident you've been involved in? Unlike the previous question, I mean "worst" in the sense of most system damage or actual harm to people.

I'll start with mine:

We have two remote wiring closets that are at the end of a 100-foot corridor which has a metal grate for the floor. After we had Cat6 cable installed, the contractors cleaned up all the debris that dropped through the grating to the concrete 3 feet below. A co-worker and I entered the corridor to check on the progress one day but were distracted and didn't notice that a piece of grating had been moved aside. My buddy stepped into air and his chest slammed into the steel crossbar. He was winded and sore enough to take a couple days off, but luckily the steel beam had rounded edges and the size of the opening was such that he didn't smack his head into it or the floor below.

Obviously we learned that areas where the floor is partially removed need to be flagged.

Ward - Reinstate Monica
  • 12,788
  • 28
  • 44
  • 59

10 Answers

25

When I worked for Cisco, I used to get customers who had bought $30 wireless cards and were spitting chips when their driver wouldn't install, or people with the cheapest, most basic router Cisco had who would rant and rave over support issues.

This was all put in context one day, when I received a call from one of the world's largest card providers (think Amex, Mastercard, Visa, Diners... in fact it was one of those brands, I don't know if they would appreciate me mentioning it). I was front-line support, my only job was to assess the scenario, rate it, and put it through to the appropriate support division. This case was the only Priority One case I ever put through.

A man from the card company called up and stated that the link between their east-coast and west-coast US mainframes was down. If an account was created on one mainframe, transactions for it were always processed on that mainframe, which was fine as long as you were near that mainframe. But on this particular day, if your account was on the east-coast mainframe and you were on the west coast, the transaction would be denied because the link was down.

Standard question when assessing damage was "How much is this costing your business?" The reply, calm and collected, was "About a million dollars every 30 seconds".

Really puts it into context the next time you feel tempted to rant and rave at customer support over your $30 wireless card.

(It should be noted that Cisco had their link up and running within 5 minutes of the case being transferred.)

Mark Henderson
  • 68,316
  • 31
  • 175
  • 255
10

It's very common to alias commands like rm or mv to add the '-i' option to avoid mistakes. But this happened at my company a while ago. Someone put this line in root's .bashrc on one of the servers:

alias rm='rm -i'

Then he copied the line and substituted mv for rm... or so he thought:

alias rm='rm -i'
alias mv='rm -i'

The rest is history :)

Well, the thing is that whenever he mv'ed something, the 'are you sure?' prompt said 'remove' instead of 'move', and yet...
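For reference, a minimal sketch of what the .bashrc was presumably meant to contain (the second line is the intended one, not what actually went in):

alias rm='rm -i'   # prompt before every removal
alias mv='mv -i'   # prompt before overwriting an existing file

The copied line got its alias name changed but kept the old command, so every mv became an interactive rm.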

chmeee
  • 7,270
  • 3
  • 29
  • 43
  • lmao so sorry man... the history command would not even help you find the massive poison you put out for yourself. – ojblass Jun 22 '09 at 08:41
4

We were installing a massive Point of Sale system at a large retailer (over 1000 branches). The central polling server was all custom HP-UX code, and the test-to-production migration was handled by a single guy: the IT Director's son.

This guy spent 7.95 hours of his day reading fantasy novels, and the other few minutes running his batch job to migrate nightly builds to production. The system was 3 days from going live at 150 of the branches (our first "real" rollout). Everything was set, and my team had just finished testing the final pieces of code. We committed our changes and moved our images from development to test, to be picked up by the IT Director's son the next morning.

I get there at 8:00am and everything is in chaos. Turns out that the son had been instructed that after copying files to production, he was supposed to go into the ./changed folder and type "rm -rf *". Yes, someone actually told him this! Of course, he accidentally did this on the production root drive, which also housed our transactional polling database (which happened to be offline for backups at the time, just our luck).

Result: Our 16 pilot stores had to serve customers out of cigar boxes (in some cases, literally) for 2 days. The CIO's son was demoted to Server Watcher (he sat in the freezing cold server room and was supposed to watch for red lights ... but he wasn't allowed to touch anything ... they didn't even give him a computer and revoked all his logins/email). Our development team pulled an all-nighter rebuilding lost data from backups and retesting/resubmitting code.

We luckily made the 150-branch rollout, but it was the worst rollout experience EVER.
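A hedged sketch of the kind of guard that would have helped here; the path and folder name are assumptions based on the story, not the actual migration script:

#!/bin/sh
# Hypothetical cleanup step: only delete recursively if we really are in the staging folder.
STAGING=/prod/deploy/changed
cd "$STAGING" || { echo "cannot cd to $STAGING, aborting" >&2; exit 1; }
case "$(pwd)" in
    "$STAGING") rm -rf ./* ;;   # delete the contents, never the parent
    *) echo "refusing to rm -rf outside $STAGING" >&2; exit 1 ;;
esac

Even a check this small turns "typed it in the wrong directory" into an error message instead of a catastrophe.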

Beep beep
  • 1,843
  • 2
  • 18
  • 33
2

I learned to finish every command sentence before hitting the Enter Key.

In a slightly similar vein, when I'm not sure about a command, I press Home and type some junk characters so that the command is not a recognised one.

me@mypc:~$ sdkjfhdsudo mv --too-many --switches-to-be --comfortable --working-with --while-running --an-important-command /here/this /there/that

bash: sdkjfhdsudo: command not found

And then I check the options again, slowly if need be. Does anyone else do such a thing? Of course, you have to ensure that you type sufficient junk characters (5+) to prevent the line from becoming another valid command and doing more unpredictable damage.

(Is there a basic flaw in this that I haven't figured out, or a situation where, given 5+ junk characters, typically from the "asdfghjkl" keys, it does something unpredictable?)

2

While re-installing a laptop's operating system for a manager, someone made a copy of all its data over the network to a Linux station, in /tmp. There were some problems, and it took more than one day.

... the Linux station was shut down at the end of the day ...

The following day, when they went to look for the manager's data...
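The habit that avoids this is cheap: stage rescued data on persistent storage, never in /tmp, which many distributions wipe at shutdown or boot. A hedged sketch, with made-up host and paths:

# Copy the laptop's home directory to a persistent rescue area, not /tmp.
mkdir -p /srv/rescue/manager-laptop
rsync -a manager@laptop:/home/manager/ /srv/rescue/manager-laptop/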

chmeee
  • 7,270
  • 3
  • 29
  • 43
1

My colo facility had some downtime a while back.

They took down their primary network link to the internet to perform some software maintenance on the router, fair enough.

However, at the same time, the upstream provider of the secondary link switched it off to perform some testing (apparently they had been told, but it had been mislabelled in the datacentre).

So far so bad... however, customers had some difficulty getting through to the facility to bring the downtime to the provider's attention... the provider only had VoIP phones, which were connected through... well, you can guess.

I imagine you wouldn't believe me, but it's true, and a matter of record on the blogosphere :)

gbjbaanb
  • 3,852
  • 1
  • 22
  • 27
1

I've been working as a sysadmin for about 7 months. One of my first tasks was getting a Squid proxy server running, and I actually did get it working. About 2 weeks after that I was using BackTrack and messing with a lot of tools, "playing the hacker", and I actually hacked the server, which was kinda good. But after I got in, for some odd reason I did an rm -rf from / and, well, erased part of the OS (Debian Linux).

I learned to finish every command sentence before hitting the Enter Key.

Cheers.

1

Imagine, if you will, living in South Florida during Hurricane Andrew (slightly before the 24x7 craze). All of your servers are securely locked up in a building that requires you to badge into it, with a more secure area requiring an additional scan of your badge. Imagine a nitwit who did not account for needing actual handles on the doors. Imagine a four-million-dollar contract requiring a delivery, the closest electricity being 230 miles north, gas being in short supply, dangerous roads, and a generator that was designed to provide 48 hours of electricity. Laugh if you will at a collection of servers in the back of a truck, stuck on the Mickey Mouse turnpike, stalled for want of gas. Laugh if you will at the total lack of an excuse for how badly it all went from a logistical, sysadmin, and operational standpoint. The best part was listening to the hundreds of UPS units crying simultaneously for life-giving electricity.

ojblass
  • 636
  • 1
  • 9
  • 17
  • 17
    Uuuh please don't take this the wrong way, but I've no idea what actually happened in the story, because of all the "Laugh Ifs"... – Mark Henderson Jun 21 '09 at 05:09
  • 1
    That's funny, I like the 48-hour generator part. One place I checked out once had 48 hours of fuel on site and another 14 days at the utility yard, and they owned a fuel truck to refill the generator, so they didn't have to count on anyone else. They were also a hydro company. – SpaceManSpiff Jun 21 '09 at 05:56
  • While not being a narrative... the whole story is above. – ojblass Jun 21 '09 at 08:13
  • Fuel truck is a smart idea. Last year I toured a Seattle datacenter that had only a few days of diesel fuel on site. I was not impressed: only once in ~40 years has the Seattle bus system ever shut down for a day, and that was primarily due to fuel trucks not showing up at the bases to deliver diesel fuel during a major snow event. I can't imagine that a major earthquake, flood, or other regional disaster would cause fuel to be any *more* available than in a snowstorm. – Skyhawk Nov 07 '11 at 04:28
1

One of our customers hit a pretty uncommon XFS filesystem bug on December 24, 2005... Well, at the time I didn't know it was a Linux kernel bug of course; I thought it was just one of the usual suspects (13 TB RAID with 8 KB free, spurious drive failure in the array, etc.).

Finally, as the filesystem was unmountable, I asked the operator on the line to enter xfs_repair -n /dev/whatever. Hmm, it wants to clear the log (obviously, as the FS isn't mountable), but no particularly ominous message. So go for it: xfs_repair /dev/whatever.

15 minutes later, she calls back:

why can't I see most files?

Uh oh... Turns out that, to add insult to injury, the installed xfsprogs were of a version that would do severe harm in this exact case... Ouch. 8 TB of data were gone for real.
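With hindsight, a hedged sketch of a more defensive sequence (device name kept from the story; whether it would have saved anything against this particular bug is another question):

xfs_repair -n /dev/whatever                         # report-only pass, modifies nothing
xfs_metadump /dev/whatever /root/whatever.metadump  # snapshot the metadata before any real repair
xfs_repair /dev/whatever                            # only then let it rewrite the filesystem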

wazoox
  • 6,782
  • 4
  • 30
  • 62
1

I'm not sure this is an interesting answer, but I'm also a coder. I coded my last website entirely on a production environment, with no backups at all on my PC. One bad day, after 16 hours of continuous work, I had to empty a partition, and the fastest way to do it was to format it. I ran fdisk -l to check the name of the partition I had to format, unfortunately read the wrong line, and formatted the wrong one.

I lost like 6 months of work.

Fortunately, the second time you do the same thing you do it better and faster, since you already know how to do it. Now the website is live. And I have backups :=)
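A hedged sketch of the double-check that helps here (device names are placeholders, not the ones from that night):

lsblk -o NAME,SIZE,FSTYPE,LABEL,MOUNTPOINT   # list every partition with its size and label
blkid /dev/sdb1                              # confirm the UUID/label of the intended target
mkfs.ext4 /dev/sdb1                          # format only once the identity is beyond doubt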

cedivad
  • 680
  • 3
  • 13
  • 25