I'm looking for amusing stories of system administrator accidents you have had. Deleting the CEO's email, formatting the wrong hard drive, etc.
I'll add my own story as an answer.
I had fun discovering the difference between the linux "killall" command (kills all processes matching the specified name, useful for stopping zombies) and the solaris "killall" command (kills all processes and halts the system, useful for stopping the production server in the middle of peak hours and getting all your co-workers to laugh at you for a week).
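For the curious, the difference in a nutshell (a sketch; the process name is made up):
# Linux (psmisc) killall matches processes by name:
killall stuck-daemon
# Solaris's killall is part of the shutdown machinery and signals every
# process it can reach. The name-matching equivalent there is pkill:
pkill stuck-daemon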
I was in charge of our corporate web proxy, which at the time was Netscape's product. While playing around in the admin forms (it was a web-based interface), there was a big (and I swear it was red) button that said Delete User Database. No problem, I thought. Let's see what options it gives me when I hit that. Surely there will be a confirmation prompt if there are no options.
Yeah, no confirmation. No options. No more users.
So I went over to Mr. Solaris Sysadmin and said that I was in desperate need of a restore from tape, to which he replied, "I don't back that box up."
"Uh, come again," I retorted.
"I don't back that box up. It's on my list of things to add to the backup rotation but I haven't gotten around to it yet."
"This server's been in production for nearly 8 months!" I screamed.
He shrugged and replied, "Sorry."
Many years ago the company I worked for had a client which ran a nightly backup of their NT 4.0 Server to a Jaz drive (like a high capacity zip disk).
We set up a batch file, which ran as a scheduled job overnight. Every morning they'd collect last night's disk from the drive, and before they left in the evening they'd insert the next disk in the sequence.
Anyway, the batch file looked something like this (the Jaz drive was drive F:)...
@echo off
F:
deltree /y *.*
xcopy <important files> F:
Anyway, one night they forgot to put the disk in. The change to drive F: failed (no disk in drive), and the batch file continued to run. The default working directory for the batch file? C:. First time I've ever seen a backup routine destroy the server it was backing up.
I learned a little something about sysadminning (and exception handling) that day.
Jim.
PS: The fix? "deltree /y F:\*.*".
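The same trap bites Unix shell scripts; a minimal sketch of the guard, with placeholder paths:
#!/bin/sh
# If the cd fails (disk missing, mount gone), the script keeps running in
# whatever directory it started from - often the one you least want to empty.
cd /mnt/backup || exit 1
rm -rf ./*
cp -a /data/important/. /mnt/backup/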
root@dbhost# find / -name core -exec rm -f {} \;
Me: "You can't get in? OK. What's the DB name?"
Cu: "Core."
Me: "Oh."
I love the way everyone qualifies their story with "when I was young/green" as if they would never do it again. Accidents can happen to even the most seasoned pros.
My own worst moment is so bad I still get palpitations thinking about it...
We had a SAN with production data on it. Critical to the company. My "mentor" decided to extend a partition to free up some disk space. Can you see where this is heading? He said that the SAN software could do this live, in production hours and no-one would notice. Alarm bells should have started ringing, but were conspicuously silent. He said he'd done it "loads of times before" with no problems. But here's the thing - he got ME to click the button that said "are you sure?"! As I was new to the company I assumed this guy knew what he was talking about. Big mistake. The good news was that the LUN got extended. The bad news was...well I knew there was bad news when I started seeing disk write errors on the Windows box.
I'm glad I was wearing brown pants.
We had to explain why 1TB of data had disappeared at lunchtime. That was a really, really bad day.
It's a good principle actually - before you do something that you have doubts about, imagine having to explain to management if something goes wrong. If you can't think of a good answer to explain your actions then don't do it.
Nagios pinged us one morning when business hours started to say that it couldn't connect to a non-critical server. Ok, hike to the server room. It's an old server, a Dell 1650 purchased in '02, and we knew that the 1650s have been having hardware problems. The PFY stabs the power button. Nothing. Hit it again, and hold it for five seconds to 'force power on' ... which overrides the BMC's error protection, since without a DRAC there's no way to examine the BMC logs without having the power on to the chassis.
The machine starts POST, and then dies again. I'm standing above it and go, "I smell smoke." We pull the server out on its rails, and one of the power supplies feels warm, so the PFY pulls it and is about to close the box back up. I say, "No, that's not power supply smoke, that's motherboard smoke."
We open the case again and look for the source of the burning smell. Turns out an inductor coil and a capacitor had blown off the voltage regulator on the motherboard and sprayed molten copper and capacitor goop across everything, shorting a bunch of stuff and basically making a big mess.
The worst part for me was recognizing that I'd smoked enough hardware to recognize the difference between the smell of a burnt motherboard and a burnt power supply.
Three days ago (seriously) I was remotely logged in to a school server, installing Service Pack 2 on a Windows Server 2008 file server.
I decided to schedule the needed reboot for late at night, when teachers wouldn't be logged on finishing their end-of-year report cards. I typed something like:
at 23:59 "shutdown -r -t 0"
...which might have worked fine.
But then I second guessed myself. Was my 'shutdown' syntax correct? I tried to view the usage help by typing
shutdown /h
...and instantly lost my RDP connection. Panicking, I hit up Google for the syntax. A quick search revealed that the Server 2008 version of shutdown includes a /h switch, which (as you may have guessed) hibernates the machine.
Teachers started calling me within minutes to report that they could no longer open or save the report cards they had been working on. Since I was offsite and the server room was locked, I had to call the school principal directly and walk her through the process of powering the machine back on.
Today I brought homemade cookies to everyone as a form of apology.
In a previous job, we had a great homegrown system that logged and archived every single piece of mail that entered, left or stayed within the company.
Blew away your entire mailbox? No problem! Looking for a piece of mail that somebody sent you a week/month/year ago but you can't remember who sent it or what the subject was? No problem! We'll just redeliver everything from February for you to a special folder.
At some point, the need came for the CEO of the company to monitor mail going between a competitor and an internal salesperson under suspicion. So we set up a script that ran every night and delivered relevant mail from the previous day to the CEO. No problem!
Around a month later word of a double-plus urgent problem came down from on high. Seems that as the CEO was reading through the list of mails sent to $OTHERCOMPANY, he came across this one:
To: somebody@$OTHERCOMPANY
From: CEO
Subject: CEO has read your message (subject line here)
Naturally, the CEO being an important person and all, he was too busy to click on all those "Send Read Receipt" dialogs in Outlook and had configured his client to just send them all. One of the messages caught by the monitoring filter had a read-receipt request set. Guess what Outlook did? Certainly buggered up the 'clandestine' monitoring.
Our next task: adding rules to the mail filter to block outgoing read receipts from the CEO to that company. Yes, it was the easiest way. :)
Ahhh, mine was about 10 years ago, when I was still getting my feet wet. I had the joy of installing battery backups on all the programmers' computers. They also wanted the software loaded to warn of power outages and shut down properly.
So I set it up on my computer to test everything first of course and make sure it all worked. So I disconnect the power cord and the message comes up on my screen. "external power lost, beginning system shutdown".
So I thought, hey cool, it worked. But for some weird reason, I don't even remember why, it sent that message out as a network message, so all 200+ computers in the company got it, and 100+ of those users were programmers.
Yeah, talk about mass freak out!!
I kept my head low in that place for awhile!
I would often use the "sys-unconfig" command on Solaris machines to reset the machine's name service, IP address, and root password. I was on a user's system and I logged into the building's install server and looked something up (as root), then, forgetting that I had logged into another machine (non-descriptive "#" prompt), I ran the "sys-unconfig" command.
# sys-unconfig
WARNING
This program will unconfigure your system. It will cause it
to revert to a "blank" system - it will not have a name or know
about other systems or networks.
This program will also halt the system.
Do you want to continue (y/n) ? y
Connection closed
#
That "connection closed" message slowly turned to panic... what machine was I logged into when I ran that command.
The worst part of this was not the hard time my co-workers gave me, it was that I did the same thing a month later.
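One cheap mitigation for the non-descriptive "#" prompt, assuming a Bourne-style shell on the Solaris boxes:
# Put the hostname in root's prompt (e.g. in /etc/profile) so a bare "#"
# never hides which machine you're on:
PS1="`hostname`# "
export PS1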
I've got a pretty good one. Admittedly, it was prior to my time as a sysadmin, but still tech-related so I figured I'd add it.
Back in the day, I was working as a satcom/wideband tech for the USAF. Having recently graduated technical school, I found myself stationed in South Korea. Shortly after arriving on-station an opportunity arose to travel down south with the "big guys" who'd been there for a while and actually work on some real-world,(i.e. `production') equipment.
I went down with the crew and as an eager, young tech, was chomping at the bit, quite excited at the prospect of getting my hands on an actual piece of equipment that was passing LIVE military voice and data traffic.
To start me off slowly, they handed me a manual, turned to the preventative maintenance section and pointed me in the direction of four racks filled with several large digital multiplexers. The equipment was easy enough, we'd covered the same equipment in tech school.
First page of the manual read: "Apply power to the digital multiplexer. Turn both rear switches to the ON position and wait for the equipment to power up, then begin tests." I looked up, and there was already power APPLIED!
I was in a quandary for sure. Not knowing how to proceed, I shot my best, `Ummmm.. Kinda lost here' look at the senior.
He looked over at me and laughed, "No, no, it's ok. You can ignore that part of the checklist." Then, as he noticed the look on my face, (since we were taught in school to NEVER, EVER ignore any part of a checklist, and it was certain death and destruction if one was to do so) he put a serious look on his face and said, "Ignore ONLY that part! Follow the rest of it, to the letter!"
Dutifully, I ran through the multi-step PM instructions, happy as a clam and proud that they were letting such a low-ranking, (albeit smart) tech do this important work.
Somewhere between the fifth and sixth preventative maintenance checklist on these huge multiplexers I started noticing an increased level of activity around me. Phones were ringing, people were moving quickly. Quizzical looks were being exchanged.
Finally, a group of folks ran up to me, headed by one of the senior techs who had brought me down.
"Hey! We're seeing HUGE outages in data traffic, and we've isolated/traced the path back to the racks that you're working on! Are you seeing any weird.."
(At that point he was cut off by another one of the troubleshooters who'd made her way around to the first group of multiplexers that I had been performing the PMs on.)
"HOLY NUTS! THEY'RE TURNED OFF! HE'S BEEN TURNING THEM OFF!!!!"
In short order, I watched as they hurriedly ran through the first step in the manual, "Turn both rear switches to the ON position..." When the senior tech was done, he came over to me and incredulously asked what I was thinking of, by turning the critical pieces of equipment off.
Scared out of my wits, I handed him the checklist that I'd been following, swearing that I hadn't deviated at ALL. That I had followed it, `to the letter' as he'd instructed.
After a while he laughed and pointed out where the problem lay.
In the manual, the FINAL step in the preventative maintenance checklist was:
"Record final probe reading, wipe down front panel, removing all dust and particulate, then turn both rear power switches to the OFF position."
:)
I was reloading a system for someone, and during the manual backup process I asked him the question "Do you have any other programs you use?" and "Is there anything else important you do on the computer?"
He said "no" SEVERAL times.
I was convinced and formatted the drive.
About 30 minutes later he said "oh my god" and put both hands on his head.
Turns out he had been working on a book script for over 10 YEARS in a specialized program. This was back when programs used to save user data in its program files directory and I missed it.
Whhhhooooops.
He wasn't mad at me, but it was a sobering feeling.
It's kind of a sysadmin accident... in so far as sysadmins occasionally have to physically haul large numbers of machines from point A to point B (where A and B are seemingly always separated by several flights of stairs in a building with no lift). On the n'th trip of the day, I stopped for a breather three flights up from the basement loading level to chat with someone coming down, propped the full-size tower workstation I was schlepping on the inside handrail of the open stairwell and... well, you guessed... slightly lost my grip on it. It plunged unerringly straight down the well and when it reached the bottom, er... not so much with the functionality for that one! Total salvageable parts: two sticks of RAM, one floppy drive and one ISDN card (God bless the Hermstedt engineering folks!). Everything else was either cracked, rattling or smashed into tiny pieces.
By the grace of God, nobody was walking underneath, which, thankfully for me, was my boss's first thought, so I got to keep my job. Felt very sick for an hour or so though.
Moral: gravity always wins!
My personal favourite isn't actually mine, and I'm VERY glad of it. Take a look here.
This didn't happen to me, but…
I was working at a company that made software that ran on Linux machines provided by the client. We would essentially 'take over' the machines, completely configure them to our specs, and do all of the management and monitoring. Essentially, we were a team of 10-15 sysadmins, managing thousands of servers for hundreds of customers. Mistakes were bound to happen.
One of our team found some issues on a server (a backup, I believe), and decided that he should run fsck on it. He stopped all relevant services, made sure that the system had had backups taken recently, and then ran the fsck, but it complained that the filesystem was mounted. Since we were remote and had no out-of-band access (DRAC, iLO, etc.), he couldn't take the filesystem offline to run the fsck properly, but he was pretty sure that it was safe to do it with the filesystem mounted, if you were careful.
He decided to try it himself by running fsck on his root partition, with predictable results – he corrupted his root partition and couldn't boot anymore.
Confused, he went over and talked to our team lead. The lead said he was pretty sure that you couldn't do that, and the team member said 'Sure you can!', took the lead's keyboard, and showed him that you could – by running fsck on the lead's root partition. Which completely corrupted HIS root partition.
End result? No customer data lost, thanks to the team member's testing. Two days of employee productivity were lost, but that was worth far, far less than the data on the customer's machine. And for the record? You can run fsck on a mounted drive, but only to verify data. Not to repair it. That was the team member's mistake.
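For reference, the verify-only mode mentioned above looks something like this (the device name is just an example):
# -n answers "no" to every repair prompt, so the check reports problems
# without writing anything to the mounted filesystem:
fsck -n /dev/sda1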
--
To add my own story, I was working at the same company, and was trying to reset a user password. Our system refused to let me set it to the password he needed, because it tracked old password hashes and refused to let you duplicate the password. The mechanism was simple: it validated your password against the most recent hash in the database.
(And for the record, it needed to be the old password because it was a shared account, and making sure everyone knew the new password was impractical)
I decided to just go into the users database and delete the new records so that it would use the older one. It's all just SQL (running an ancient version of Sybase), so it's easy. First, I had to find the records:
SELECT * FROM users_passwords WHERE username='someuser';
I found the old record he wanted to keep; there were two more in front of it. I decided to be clever and just delete anything newer than the old record. Looking at the result set, I saw that the old password was ID #28 in the database, and the new ones were ID #several thousand (very busy system). That's simple, all the old rows were > 28, so:
DELETE FROM users_passwords WHERE id > 28;
There's nothing worse than doing some simple row pruning and seeing '212,500 rows affected'. Fortunately, we had two master database servers (with the user ID), but Sybase (at least, our version) didn't support automatic replication, so it didn't automatically wipe out the old records. It was a trivial matter to get a dump of the users_passwords table and re-import it. Still, a pretty big 'oh f**k!' moment.
DELETE statement without a WHERE clause, on the customers' live patron database.
Typed kill 1
as root. init
and all of her children died. And all of their children. etc, etc. Oops.
What I meant to type was kill %1
After I realised what I did I ran to the control panel of a BIG wool bale sorting machine and hit the emergency stop button. This stopped the machine ripping itself to bits, as I had just killed the software which controlled it.
Another of my favorites:
When setting up a computer and a local laser printer on a system, I had the bright idea to plug them both into the computer's UPS. Ever try to print to a local laser printer when it's plugged into a desktop UPS? Well, if you don't know, it tends to pull all the amps... Which restarts the computer... And the print job never finishes...!
Ever get the call: 'Whenever I print, it restarts my computer and doesn't print!!!'?
Ooops!
JFV
After a long day of performance tracing and tuning a huge mainframe (you know, the beasts that take a couple of hours before all the standby backup sites have agreed that it is indeed booted up again and fully synced), I stretched my fingers, contentedly typed shutdown -p now into what I thought was my laptop's prompt, closed the lid, and yanked the serial cable out of the mainframe, anticipating a nice cold glass of lager.
Suddenly I heard the deafening sound of a mainframe spinning down while my laptop was still happily displaying X.
While waiting for the machine to come fully online again, I decided I had time to get ACPI working on my laptop so that I would never again be tempted to shut it down from the command line.
We were in the middle of a power outage and saw that the UPS was running at 112% of its configured load. This wasn't much of an issue as we were running on the generator at the time.
So we went around pulling backup power cables to reduce the power usage on that UPS (we had two, one much larger than the other). We got to the network switch which ran the server room (this was the server room with all the internal servers for the company, with the customer facing servers in another server room). The switch was a large enterprise class switch with three power supplies in it. The supplies were N+1 so we only needed two in order to run the switch.
We picked a cable and pulled it out. Unfortunately for us the other two were plugged into a single power strip, which promptly blew as the load went up on the two power supplies which were plugged into it. The sysadmin then panicked and plugged the third cable in. The switch tried to fire up, putting the entire load of the switch unto the single power supply. Instead of the power supply shutting down, it exploded in a shower of sparks not 12 inches from my face sending me jumping back into the rack of servers.
Out of instinct I tried to jump to the side, but unfortunately on my left was a wall, and to my right was a very large 6'4" facilities guy. I somehow managed to jump over him, or possibly through him, bouncing off the Compaq racks (the ones with the thin mesh fronts) without putting a hole in the rack and without touching the facilities guy.
At some point in my career a legal investigation at the company I was working for placed a requirement on us that all email be kept from "this day" forward, until told otherwise. After about a year of storing daily full backups of our exchange environment (1TB nightly) we started to run out of space.
The Exchange admins suggested that we only keep every 8th copy of the email. To do this, we had them restore a day's worth of the Exchange databases, extract the email they needed (specific people flagged for investigation) and re-archive it. They did this for every 8th day of email for all of our backups. The 8th day was chosen because Exchange had a parameter set where "deleted items" are kept in the database for 8 days.
After they would finish each archive, I would go back through and delete any backups which were older than what they had archived.
TSM does not have an easy way to do this, so you have to manually delete objects from the backup database.
I wrote a script which would delete all backups older than some date, by way of a date calculation using the difference between today and the date in question. One day I had to delete about a month's worth of backups, except when I made the date calculation I made a typo and entered the date as 7/10/2007 instead of 6/10/2007, and ran the script. I had accidentally deleted an entire extra month's worth of data that was part of a very important lawsuit.
After that, I added some steps to the script to confirm that you wanted to delete the data, and show you what it was going to delete...
Luckily, they never even used any of the data we worked so hard to preserve, and I still have my job.
I deleted someone's account by mistake; got the names mixed up with the one I was supposed to delete. Oops.
The cool part is they never knew what happened. I got the call that they couldn't log in, and the penny dropped about which account I had deleted.
While on the phone with them, I quickly re-created their account, re-attached their old mailbox to it (thankfully Exchange doesn't delete mailboxes right away) and pointed it back to their old user files.
Then I blamed them for forgetting their password which I had just reset for them :)
This accident didn't happen... but it's worth mentioning:
I was sent to a heavily-used data center to conduct bandwidth tests on a new circuit. I got to the demarc room/IDF, found a spot on one of the racks for my test router, made my connections, and started the tests. Unfortunately, I completely failed to notice the in-production border router not only being exactly on the next rack (almost at the same level), but that it was also the same make and model as my testing router.
When the test was done, I began pressing the power switch to the off position (...imagine it in slow motion...) and, I swear, just as I was applying pressure it dawned on me that the router I was about to turn off was the one in production. My heart stopped and I almost... well, use your imagination.
I left the data center's MDF looking spooked and pale, but at the same time glad I still had a job!
Accidentally installed a tar.gz file on my Gentoo Linux box in the wrong place and it left files all over the place. This must've been around 1999; I was 19 at the time.
Being the geek that I am, I decided to try to script myself out of the work of going manually through each file.
So I tried:
tar --list -f evilevilpackage.tar.gz | xargs rm -rf
It didn't take me very long to notice that tar also listed all the directories the program was using; those included /usr, /var, /etc and a few others that I didn't really want gone.
CTRL-C! CTRL-C! CTRL-C! Too late! Everything gone, reinstall time. Fortunately the box didn't contain anything important.
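A safer version of the same idea, assuming the archive's paths are relative to / as in this case:
cd / || exit 1
# List the archive, drop directory entries (they end in "/"), and review
# the result before removing anything:
tar -tzf evilevilpackage.tar.gz | grep -v '/$' > /tmp/stray-files.txt
less /tmp/stray-files.txt
# Remove regular files only - note there is no -r here, so nothing recursive:
while IFS= read -r f; do rm -f "./$f"; done < /tmp/stray-files.txt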
As a smallish part of my former life I administered the company's file server, a NetWare 4.11 box. It hardly EVER needed any input at all, but if it did, you opened up a remote console window.
Used to using DOS all the time, when I was finished, I naturally would type "Exit". For Netware, "exit" is the command to shut down the OS. Luckily, it won't let you shut down unless you first "Down" the server.(Make it unavailable to the network/clients) So when you type "Exit" in the console, it helpfully says, "You must first type "Down" before you can exit"
Ask me how many times I 1: typed "exit" in the console session and 2: Obediently typed "Down" and then "Exit" so I could "finish what I was trying to do"
And then the phone starts ringing.....
LOL
The last place I worked, my co-worker had his kids with him in the server room (why? I have NO IDEA!).
He made sure that they were far away from the servers and explained to his 5-year-old that he shouldn't touch ANY of the servers and ESPECIALLY none of the power switches.
In fact, he had them right near the door... (can you see where this is going...?)
The boy didn't touch any of the server power buttons... No, that would be entirely too easy to explain. Instead he hit the BIG RED BUTTON that was near the door... The button that shuts down power to the ENTIRE SERVER ROOM!!!
Phone lines immediately started to light up wondering why Exchange, File Servers, etc. weren't available... Imagine trying to explain THAT to the CEO!
-JFV
Another story that didn't happen (phew):
We were doing incremental backups religiously every day to a tape drive.
We happened to write a tape containing data to ship to someone else. They said 'we can't read your tape'. In fact, neither could we. Nor any other tape, as it turned out.
We bought another tape drive and held our breath until we installed it.
Moral of the story. Always make sure you test your backups.
I once had a fight with the APC UPS monitoring software. Being a small company, we had a couple of small-ish UPSes and various servers were setup to monitor them. Most of the servers were Linux, but a few were running Windows and so they were the ones used because the APC software is Windows only.
However, the APC software at the time was hard-coded to assume that the UPS it was talking to was also powering the PC it was running on. This was not the case for this server, but I discovered that too late to stop it halting the machine. Also unfortunately, the lead programmer was demonstrating the company product to a partner - it was a web-based app, running on the same server I didn't want the APC software to shut down...
Tripping over a tower server that was wedged behind a rack and hitting my head on the back of the main Cisco router on my way down. Thus revealing how loosely the power cords were actually seated in the power supplies on the front of the Catalyst 6500.
Yeah. We've got a hardhat on a hook in the server room now. With my name on it.
I was giving a new sysadmin a tour of a Service Manager app. I said "if you ever needed to stop this service you would click this button, but you should never do it during the day." You would never believe how sensitive her mouse button was!
Two minutes later the service had started up again, and no-one seemed to notice.
I work for a wireless provider in North America, and had done some training for a person in my group to run through work orders. I had stayed up the first couple of nights (we do everything during the maintenance window), but he was doing fine and said he's got to learn it on his own, so I let him and left my cell phone and pager on. I logged in and checked the configuration when I got up at 8 a.m. the following morning.
The change was that we were adding a new pool of IP addresses for BlackBerrys, the pool we were adding was about 10000 addresses. To do this, we add routes on the router that point to the processor address on a blade that does all the call processing (essentially it works like a proxy). Also, we log into the processor and configure the IP pool, and link the IP pool to be used for our wireless users. However for testing, we normally configure this on one processor (actually boot up a phone and test all the features), and then just move the configuration to the actual processor we want it on.
Fast forward two weeks, and I get a call from our control center that there have been a lot of call-ins about some intermittent BlackBerry problems, and the few BlackBerrys they've looked at seem to be cycling through a common pool, but they weren't really sure what was happening. It only took me about 5 minutes to realize that this was the new pool my colleague had added two weeks before. It also didn't take long to see that the router had two routes in it, one going to the test processor, and one going to the proper call processor. This being what it was, he forgot to delete the route to the test processor, and it superseded the proper route.
Essentially a BlackBerry would connect to the network, connect to the proxy to get its IP address, the proxy would give it an address from the pool with the incorrect route, and the BlackBerry would try and talk to the RIM relay, and the response would be routed to the test proxy and never make it back to the user, essentially meaning no connectivity.
We got lucky though, since BlackBerrys have a behaviour where, if they can't contact the relay, they will disconnect and reconnect to the network, but nonetheless some RIM devices were without service for up to several hours until they were able to cycle onto a working pool. I thought back, and when I double-checked the work, I had only checked the proxy configuration, which was new to this guy; I never checked the routing configuration, since this guy was previously with the backbone team and routing was his thing. Oops!
I fixed it and called him up that afternoon. His day was going well, but I started with "I'm sorry, but I'm about to ruin your entire week." A year later the story still comes up around beers.
When I was first hired as sysadmin by the lead admin...within the first week we received a brand new Dell server...Windows Server 2003...it was his little baby until I was secretly called to the server room at midnight one Saturday night to clean numerous instances of malware from it because he was SURFING THE WEB with it before deployment WITHOUT ANTIVIRUS!!!
Malware cleaning is something that I have had much experience with, but since this was a server I did a format and reinstall to be extra safe.
I never said a word to him about it. He knew he had messed up royally.
More of a personal scripting thing than a system administration thing, but...
I was writing a Perl script to act like a macro that would retrieve now playing information from Banshee and enter it character by character as keyboard events using the program "xte". This way, I could have it work within programs without any special interaction, it would be just like I typed it.
Well, I coded the thing almost perfectly. I decided to test it out in some random game. The keypress to bring up the chat was shift + enter. Now in order to do this I needed to have it hold down shift, press enter, then release shift. Unfortunately in my haste I forgot "release shift". I ran the script and this led to the somewhat hilarious side effect of my shift key being locked down. I thought "no problem, I'll just go to the terminal and manually type in the line to release shift". Unfortunately, as everyone knows, Linux is case sensitive. It would not accept the command in all caps as I had to enter it. I couldn't "counter-shift" or anything like that.
This led to a five minute scavenger hunt of me visiting websites and using the mouse to copy+paste individual lowercase letters into the terminal to form the command I needed to turn it off.
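The missing line was presumably something like this (xte syntax, assuming the left Shift key was the one held down):
xte 'keyup Shift_L'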
Not a huge problem, but certainly an 'egg on my face' morning about 10 years ago. I had been going through the old hardware inventory and re-imaging the disks ready for the hardware to be offloaded. Trying to find the most efficient way possible to do this, I had built a CDRom with a copy of Norton Ghost and the image to apply. You powered on the machine, and while it was POSTing, put the CD in the drive. The machine would boot off the CD and re-image itself automatically. Worked well.
The problem came when I had been making copies of the CD so I could get more machines going in parallel. I finished burning the last CD, switched off my desktop computer and went home for the day. Well you can guess what happened the next morning. I came in, switched on my PC and went and made a coffee...
When I came back for some reason my machine was off the domain and not accepting my password...
I had just worked out what had happened and started swearing when the other guys arrived for the day. Yep, they didn't let me live that one down for a while.
Back in the day, when I was very green, I needed to install AV software on my users' PCs, as no-one seemed to have it. So I spent a bit of time figuring out how to do a remote install, rather than poking around 40 or 50 desktops. The remote installation ran perfectly and everything seemed fine, until various managers dropped by my office to complain that they couldn't log in.
It turned out that a few individuals had Symantec AV installed on their machines, and this did not coexist at all well with the McAfee software I was using and would lock up the machines after a login attempt.
Fortunately, it was possible to remotely disable the service if you got to the machine before they tried to log in, so I managed to get points for fixing it instead of having to rebuild all of senior management's PCs...
My aunt asked me to fix their computer. They said it wouldn't boot up and its been like that for 2 weeks. I suspected it was either the BIOS or the OS.
I sat down in front of their computer. I crouched down to push the power button. I look up.
The BIOS passed. That's good.
The OS booted. That's good.
I moved the mouse around thinking maybe there's a problem with the input devices. There was no problem with the input devices.
I opened up her word processor. It ran.
I printed a test page. It printed.
By this point, I stood up and told my aunt (who was watching me) that there is nothing wrong with the computer. She claimed that it wasn't like that before I sat down.
I can now claim to my family that I am so good, that I can fix any computer just by sitting in front of it.
Longer ago than I'd like to think, I was the company's technical person and worked with some consultants installing their application. The hardware was a DEC VAX and used an HSC50 storage server. The consultants took much of the day with their install, and after they left, I decided to back up the system disk to an empty disk using the HSC50's bit-for-bit copy utility. After the copy was done and I tried to reboot, I discovered that I had reversed the names of the source and target disk, and so had backed up the blank disc bit-for-bit to the system disk.
I was able to rebuild VMS on the system disk, and reinstall much of the application, but I think it never worked as well. Since then, if I was doing a copy/backup/etc., I would write-protect the source disk before continuing. (Now that write-protect switches are no more, I look at the command before I hit Return.)
Done by one of my employees... a perfect example of why you should clearly label your servers:
Sent my employee out to the colo to rebuild the secondary MSSQL database server (which had no current data on it). Primary one was actively in use. You can probably predict the rest of this story... Once there, he rebooted the server, started the install and reformatted the drives, only to have me call him and ask him why the primary database server was no longer responding. (doh)
Mine happened just 6 months ago. We had just switched to a new server for a PHP/MySQL web application. Since I got to choose the OS, I chose the one I'm most familiar/comfortable with: Ubuntu.
We had a number of backup scripts that would be run by cron hourly, daily, etc. The transition went perfectly. There were only about 2 minutes of down time while I transferred the MySQL DB from the old server to the new one and switched IPs.
A few weeks later however, I was working in MySQL at the command line and was deleting some old test records that were no longer needed. Since I'm a programmer first, sysadmin second, I've gotten into the habit of typing my semi-colon (;) first and then typing in the command. Well, as I was about to add the WHERE clause to my DELETE query, I accidentally hit the enter key. ...oops.
Query OK, 649 rows affected (0.00 sec)
"No big deal," I thought. "The hourly backup just finished 4 minutes ago. There might be 3 records lost in all. I quickly went to the backup directory and restored. Problem solved.
...Then I noticed the timestamp on the backup. It was 17 days old. There were no other backups. I had just wiped out everything in the system entered less than 17 days before.
It turns out that Ubuntu's cron (via run-parts) won't run a script file with a dot (.) anywhere in the name. It doesn't raise an error, so there's no evidence of a problem; it just refuses to run it. All of our backup scripts had dots in their names. They worked perfectly before, but not now.
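If you ever want to check this yourself, run-parts can tell you which scripts it would actually execute (a sketch, assuming the scripts live in /etc/cron.daily):
# --test prints the scripts run-parts would run without running them;
# anything with a dot in its name simply won't appear:
run-parts --test /etc/cron.daily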
Lessons I've learned:
I got called to investigate an alert coming from a Windows machine that was indicating that the monitoring system had no license file. I opened up the command prompt and started to investigate the problem and found that the basic windows commands were not even there.
A sysadmin had remotely run a script he had written, which used the del command to delete a folder built from a root and a subfolder, both specified in environment variables. If the environment variables were not set, it silently deleted the whole partition.
When told, the sysadmin was so surprised that they confirmed the action by running the said script on their own notebook, thus trashing it too.
The amazing thing was that Windows was running fine, until we rebooted the server. Only the stingy monitoring software complained.
It was the secondary Active Directory server for a political party. Oops.
Adding a bypass rule to a firewall in order to speed up some BitTorrent downloads. It turns out the system that the bypass rule used wasn't too stable, and it took down the firewall. This was a border firewall for every school's Internet connection in the city. To make matters worse, the reboot was just enough to cause the firewall's hard drive to die. Amusing? Not so much. Spectacular failure? Definitely.
Ok. To get &
on a US keyboard, press Shift-7. To get it on a Swedish keyboard, press Shift-6. So, what do you get when you press Shift-7 on a Swedish keyboard? You get /
.
Years ago Swedish layouts were not that common. My personal preference was to use the US layout. One day I wanted to delete a bunch of files and subdirs in a directory.
I hit:
rm -fr *
But it was too slow, so I quickly hit:
Ctrl-C rm -fr * &
Or did I? Well I did not. It took me a few seconds to realize I was on a Swedish keyboard. See above to decode what happened. And that disaster was a fact.
That was the day when I learned the command:
dd
I managed to get basically everything from the disk onto tape eventually, only it took all night. The next day I learned that the system was about to be reinstalled anyway.
I was lucky, but I learned a few things.
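The disk-to-tape copy in question would have been something along these lines (device names are guesses, not the originals):
# conv=noerror,sync keeps going past any unreadable blocks instead of aborting:
dd if=/dev/sda of=/dev/st0 bs=64k conv=noerror,sync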
VNC'd into a Win 2k Server 200 miles away, went to add an IP address, so... right click on the network icon in the system tray, clicked 'Disable' not 'Properties' - DOH!.... Solution.... Get in car. Not happy! If only they had an 'are you sure' on that menu option!
Mike
Mine was a tag team effort.
I was instructed by management to log one of our DBA's into a server so he could do some sort of cleanup. He ran his query and immediately both our pagers went off, which prompted expletives from both of us.
As it turns out, the cleanup was actually a drop of the database, and was supposed to be done on one of the development servers. However, the instructions that I received led me to believe this was a minor cleanup task that was supposed to occur in production.
Fortunately, we were able to restore from backup with minimal data loss.
Lesson learned: Make sure you ALWAYS know EXACTLY what you're supposed to be doing when messing with production servers. If there's uncertainty, it's best to get clarification.
When most of the server fleet was still Windows NT, the primary remote method in use was pcAnywhere. We had a "well-known" bug, that sometimes the servers would suddenly restart when using pcAnywhere, and end-users were told about this well-known bug.
The bug was that pcAnywhere (at least whichever version we were using) had a "reboot host" button next to the "disconnect from host" button. So every now and then... :D
Summer 2002.
I inadvertently deployed IE 6.0 with a forced reboot to 16,000 users in the middle of the day.
In truth I caught my mistake and typed the fastest ever odadmin shutdown all (Tivoli command to stop all deployment servers).
On Linux and FreeBSD, hostname -s will "Display the short host name. This is the host name cut at the first dot".
On Solaris 9, hostname -s will SET the hostname to be '-s'.
So, my fellow admin ran a script to audit all of our 120 systems, including 10 Mission Critical Oracle Database servers running on Solaris 9.
for HOST in `cat all-hosts`; do
ssh $HOST "hostname -s"
done
All of our Oracle servers failed instantly. The speed of this failure was really quite amazing, It took about 20 seconds for us to recover from this mistake, but it was already too late. Everything was down.
The irony is that our datacenter suffered from a major power failure just a few days earlier, and we were updating our "power down/power up" spreadsheet to ensure faster recovery for any future power failures.
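A portable way to get the short name without risking the Solaris behaviour, for what it's worth:
# uname -n prints the node name on every Unix; cut trims any domain part:
uname -n | cut -d. -f1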
I'm a programmer, so all of my mistakes belong on Stack Overflow. However, below are some of the system administrator errors I have witnessed.
Revoking logon permissions from ALL users on a Windows NT domain (other than the built-in administrator on the PDC; sadly, only the contractor who set the domain up knew that password, and they were long since gone). I don't actually know how this was achieved. I do know that I got to sit and chat with my fellow developers for a few hours.
Accidentally deleting the Member Servers OU. That was another few hours chatting while a restore from tape was done.
Our admin intended to give all domain admins permission to use the CD and floppy drives. (We used SecureNT to control access to removable media at the time.) Sadly he got the group membership backwards, and instead gave all users of removable media full domain administrator rights. I found this because some tables turned up in a production SQL database that had been created by a user who shouldn't have been able to. When I told the administrator in question, I enjoyed watching his face change from "no, that's the right way round" down to "oh ****". Thankfully there was no serious harm done.
Not me, but someone I work with. They created a policy on the AV server that contained a * in the process field. In layman's terms: do not allow read, write or execute to any process whose name contains *.
This policy was then replicated to 1,500 servers, which in turn shut down RDP and every other process. Fixing it meant mounting every server's hard drive one by one and removing the policy. 48 hours with a team of 15.
I had an employee complain that his laptop was slow, so I checked the hard drive fragmentation and it was (and is to this day) the worst I had ever seen. Attempts to defragment the drive were fruitless because there was not enough free space. I tried cleaning up temporary files (not sure why I didn't just move stuff to the server temporarily) and stupidly deleted his entire outlook.pst thinking that it was a backup of his e-mail and not his actual e-mail. He forgave me, but never let me forget it.
(This happened many years ago shortly after I graduated university. I'm much more competent now.)
Former employer story that's great. Some of the details are changed to protect the innocent. I had a problem employee, call him Fred, who had been having a lot of productivity issues, but seemed to have redeemed himself and had earned back some privileges. The only problem was, when his privileges were restored, a bug in a provisioning script gave him some extra privileges.
I was in the middle of a big project, so I asked Fred to package up a Windows hotfix that was needed for an application. (This was in the pre-Blaster days, when people didn't patch as religiously as they do today.) So Fred runs a test out in our lab and everything works fine.
Fred then asks a couple of questions:
"Who should I push it to?" (Mind you, this is a patch for some custom VB app)
"Everyone", I respond
"Ok, what time should it start?"
"How about 2AM?", I answer. (Figuring I'd have time to look over everything before I left for the day!)
So what happens next? He sets up a job with our software distribution app to push to everyone, and is even kind enough to check the boxes for every platform that the product supports. Then he sets the start time for 2AM, as in the 2AM which took place about 12 hours in the past.
The result? Everything reboots and tries to install some VB5 runtime patch. At about 2:45 PM on a Friday afternoon. Everything.
Everything? Like 40,000 PCs? Yes. 3,000 Windows servers? Yes. 300 HP, Sun and IBM Unix boxes? Yes. An AS/400 cluster? Yes.
The only thing that didn't reboot were the Windows DCs, because the AD guys disabled our application for some reason. Holy nightmare. After a week of mopping up, I couldn't believe that I was still employed.
The punchline? Fred got a huge promotion into a job where he couldn't hurt anything anymore.
I had one not so long ago. During an Oracle ODBC bridge deployment, I had to modify the PATH on about 500 user workstations.
It's quite a simple operation, really. Too bad I forgot about the quotes. People started ringing after they got some strange garbled messages (the ODBC install failing), and seemed to think rebooting the machine would be just what it needed.
Of course, some other previous installation PREPENDED (!!!) some program files path in the system variable (with spaces and all, without quotes), so the new path stopped just there, at c:\Program (of course, the existence of %ProgramFiles% remained completely ignored). No system, no system32, no shell. So no logon scripts either.
People who rebooted didn't have any network access anymore, and no automated script could repair the damage. Of course, as soon as I went to some complaining user, looked around and checked the path, I got that.. sinkin' feeling.
In about 30 minutes, I had another script, with the most standard path values, ready to be mailed to everyone (e-mail still worked). Users even phoned back to be sure the patch was real, as they are not used to being sent cryptic exes with strange reasons to apply them, and most of them weren't even aware of what was happening.
The first version was messy (a new semicolon at each execution), but it logged every possible path value available, so I quickly had data on the possible paths and just had to create something smart to check them all and get the PATH nicely back in place.
All in all, it lasted only about 45 minutes, and I was luckily the one who put everything back all right. But still, when a corrupted PATH pops up now, I'm still ready to take the blame ;)
Was adding RAM to an email server in a cabinet with about 8 machines. On the top row was a power strip with a "lightswitch" style on-off switch. As we reached in to pull out the machine, my arm flipped up the light switch, cutting power to all of the machines ( There was no UPS on the boxes ). All of the machines came back up except 1. Spent the next 4 hours ( 2am - 6am ) fixing that box.
Lesson --
1) UPS is good
2) Only use power strips and UPS's that protect their on-off switches from accidental tripping.
My best one came at a time when our backup server was in administrative limbo - my boss was "debating" whether or not it should remain in the office, off-site from our server room (and not doing backups for some reason) or whether it should be installed in the server room to save massive amounts of bandwidth. I seem to recall that this limbo state existed for several months.
Our web server had a RAID 5 array for storage of websites. It seems that it had been running in degraded mode (without informing me for reasons unknown or which I can't remember) for some time before the second of three drives failed. I got to pull an all-nighter putting the server back together. Our customers were Not Happy that their websites had disappeared and they needed to restore from their own backups. Especially the ones who didn't have their own backups.
Questions my boss asked me were "How could a RAID array fail like that? I thought they weren't supposed to!" and "Why didn't we have backups of our webserver?"
However, the lesson had not gone unheeded. My boss was cooperative when I suggested that the upgrades to our mail server should include a RAID 1 array with a hot spare (instead of arguing with me over the extra cost, which he would normally have done). And of course, the backup server was doing its job properly in short order.
How about learning the difference between Exchange Server 2007 "Remove Mailbox" and "Disable Mailbox" feature? Especially when I'm removing everyone's old mailbox to deal with a corrupt database?
...
Restore on an exchange server... not fun... Having to restore an exchange server AND active directory... double not fun.
Doing it at 11:00 AM Friday morning... Priceless.
I was new to RAID 5 and was still learning about how it worked. At the time I was the only IT guy in a very small company. All the files everyone accessed were stored on only one server. The server was getting low on space and had only 3 drives in the RAID array, so I thought adding in a 4th would increase space and responsiveness. I did this during business hours. I hadn't learned the concept of after-hours maintenance.
The array started rebuilding, and it said it would be done in 36 hours. I thought that was way too long. I found a slider that controlled rebuild priority, and it was set to the lowest setting. I set it to medium. The time went down to 8 hours. The hard drive lights were blinking a bit faster, but I still thought that was way too long for only 80GB of data. So I set the priority to high. The hard drive lights went solid, and I thought "that's more like it!" Then the GUI I was using stopped responding. It connected to the box remotely. I tried to bring it back up, but it couldn't find the server.
I started to hear people down the hall complaining that they couldn't get on the server. I went to the server to log in to see what was going on. It took 5 minutes for the blank screen to change to the background. It was another 5 minutes before the login prompt came up. Each key press took 5 minutes to register. I had set the priority so high that the server wouldn't respond to anything. It took 2 hours for the array to rebuild. Luckily it was an hour before lunch, so no one really cared that much. My manager at the time was a really cool lady and said it wasn't a big deal. The head design engineer did give me a mean look though. I was sweating bullets for 2 hours. Lesson learned.
I was trying to free up some space on the primary partition of the site's RedHat 5 web server. I was relatively new to Linux but had been using DOS for ages.
I managed to move the entire /bin folder to another partition, taking out the production website, and leaving myself without any accessible system commands. I freaked out, I couldn't rename, copy, move, anything because I'd moved all those helpful executables.
Thankfully I was able to use a boot disk and undo my handiwork.
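With hindsight, a still-open root shell might have been enough, since only /bin was moved and /lib was intact; a sketch, with the destination path invented for illustration:
# Shell builtins keep working in an already-running shell:
echo *                               # a stand-in for ls
# External binaries on the other partition still run, because the dynamic
# loader and libraries in /lib were never touched:
/mnt/spare/bin/cp -a /mnt/spare/bin /bin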
Ha, my first really big accident was when I was writing a small SVN admin panel on our development server, completely insecure software that was only to be used for updating the internal "development" website.
Sometimes the SVN repo would get corrupted, so I had written a button that would call a PHP file, which would clean out the entire SVN directory requested, and looked something like this...
<?php
$directory=$_GET['dir'];
$result = shell_exec("sudo rm -Rvf /".$direcory);
echo $result;
?>
For those who don't see it -- I misspelled "$directory" in the shell_exec, causing the system to run "sudo rm -Rvf /"... At first I thought the web page was just taking its time deleting all the files in the repo. After about 10-15 minutes I discovered I had destroyed over half the filesystem.
Oops.
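The same class of bug in a shell script, and the one-character guard against it (the base path here is made up):
# If $dir is empty or unset, "rm -rf /$dir" silently becomes "rm -rf /".
# The :? expansion makes the script abort with an error instead:
dir="$1"
rm -rf "/srv/svn/${dir:?no directory given}"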
Maybe more of a late night brain fart than anything else.
One of the developers was having trouble running a Java profiler on a Solaris box. The profiler was complaining that there were two copies of libc: one in /lib and one in /usr/lib. So after a few runs of ld we moved the one from /lib, as everything was pointing to /usr/lib, or so they said.
But suddenly nothing worked. No ls, no cd, no cp or mv. After about 20 minutes of 'oh crap, oh crap' we figured out that one of the developers had a currently running copy of Emacs on that box, and we were able to open the backed-up /lib copy of libc and write it back out with the original name. And voila! Everything worked. Lesson learned: leave libc where it wants to be, and don't make changes on developer requests at 2 A.M.!
Very stupid mistake.
I was writing a script on my Linux workstation that processed a number of files; it didn't matter what kind of files they were, as long as there were a lot of them. So I decided it was a good idea to copy /etc to a directory I was doing my tests in. When things went wrong, I deleted the copy and copied /etc to my test directory again. That went well for some time, and then I typed
rm -rf /etc
instead of
rm -rf etc/
OK, nothing to worry about, I could still do things on my workstation and thought I could revive it by copying it from another workstation, or something. Or reinstall at the end of the day. First, get something to drink, and because of the corporate policy, I locked my screen. Damn, I need my password to unlock, and that's in /etc/.....
Stupid mistakes:
1) typing /etc instead of etc/
2) using a copy of /etc for testing purposes

We built turnkey IVR systems for clients on Unix boxes. One time the developers had all their code in /devel. They asked me to remove the development directories, box up the servers, and take them to the airport on a Sunday afternoon (my day off!). In my hurry, I deleted /dev/*. I instantly saw my mistake, and sat and pondered for a minute. Not sure if the system would die if the kernel had no hooks to system devices, so I looked at the /dev directory on an identical machine and, in order, ran mknod <name> [c|b] <major> <minor> to restore the keyboard, tty, SCSI drives, fd0 and null, then made a floppy of the other machine's /dev and mounted and copied it locally to get the rest.
Still no idea what would have happened if I left things alone, but I'm pretty sure it would have been unhappy on reboot :)
Lesson learned - development directory doesn't get to be called /devel.
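For the record, recreating device nodes by hand looks roughly like this; the major/minor numbers below are common Linux values shown purely as an illustration, and the real ones were read off the identical machine as described:
mknod /dev/null    c 1 3
mknod /dev/console c 5 1
mknod /dev/tty     c 5 0
chmod 666 /dev/null /dev/tty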
This happened when I had just started my first support job out of uni, I was connected in to a customer's 2003 server trying to get on to one of the user's machines after they had complained about connectivity problems.
Talked her through some basic troubleshooting and noticed she had a static IP so started talking her through setting this to DHCP. I opened up the properties on the LAN connection on the server to use while I talked her through what to do. After getting her to try and set it back to DHCP it still had a static IP so asked her to disable the connection and re-enable it.
Now by this point I was doing everything I was telling her to on the server without actually changing any settings, right up until the point I asked her to right click on the LAN connection and hit disable which I then proceeded to do too.
Took me maybe half a second to realise what I'd just done.
Took maybe 10 minutes for the other engineers to stop laughing at me before one of them had to go drive for an hour to re-enable the NIC at the customer's site.
I used to look after a bunch of database servers, each with a well defined development and testing cycle. Our role was to roll the changes the developers supplied, using their documentation from their test environment into the customer's test environment for customer testing before going live. As part of that the customer test environment was built from the most recent backup of the live environment.
This was all neatly documented, along with the process for rolling the change into the live environment after the customer had signed off on the change.
We had a new start in our team and after he'd been with us for a couple of months we let him sit in on a number of change cycles until one fateful night we let him do it himself. The customer testing went smoothly and the customer happily signed off on the change.
The new start then did exactly what he'd done every time he'd rolled the change into the test environment, confident he didn't need to follow the documentation the rest of us did. Step (1), rebuild from previous backup...
The next morning the customer noticed that the previous day's work was missing and it didn't take us long to find out what had happened. Fortunately the databases had change logging enabled so we were able to recover all the activity. The new start did at least learn to value the documentation and follow it in future.
I had a good new one happen to me last week.
I had one of my guys build a temporary DNS server for a test platform we're building, and I asked our DNS guys to update a particular test domain to point at this new temp DNS server, but the guy updated the live record, not the test one.
Suddenly this one server (fortunately a new box, so a reasonable spec) was serving just about every DNS request for nearly 5 million users: 400 million requests on the first day! Fortunately the TTL was only 24 hours, so it has mostly drained away now.
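The TTL is what decides how long a mistake like that lingers in resolver caches; you can see it, in seconds, in the second column of a dig answer. A quick check (the domain and address here are just placeholders):
dig example.com A +noall +answer
# typical answer line:  example.com.  86400  IN  A  93.184.216.34
# 86400 seconds = 24 hours, i.e. how long caches may keep serving the old record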
End of week, everyone almost out of the building, I go into the server room to load new tapes into the autochanger, for the weekend-long full backup. The AC is too cold I think, and turn it off (the server room was just a room with a wall mounted AC - no funds for anything serious). So I load up the tapes, make sure the TBU read the barcodes OK, and head out.
The next day, I wake up in the morning, with a hangover (hey, it's weekend!), look at my phone and see a bunch of SMS messages "$server going down". Then another one "main UPS going down".
I grab the keys, drive to the offices, and open the server room to find it's around 60°C in there and all the equipment is off.
Ended up dragging in a few fans to drive the hot air out before I could even get the AC working, never mind the UPS and the 40+ servers and comms equipment. And spending the weekend in the office, of course. And thanking all deities for smart UPS units that can pull everything down nicely if the ambient temperature is too high. I've kept a hoodie around ever since, and I never turn the AC off.
Totally different dimension, but it is still a system administrator accident.
Sorry: you need to understand some Italian slang to get this. It can't be translated; you just have to know it.
I was asked to fix something on a Solaris server in Napoli, Italy. I needed the root password, and I didn't speak much Italian at the time. The guys did seem reluctant to tell me what it was. Finally one of them half-whispered:
- sticazzi
I said: Aha, 'sticazzi'. How do you spell that?, and gave him a piece of paper + pen.
A year later I met M.*o B.* again (hi, if you read this!). By that time my Italian was far better, and I told him I now knew some more Italian.
That got a hard laugh.
The moral of the story: if you need to ask for the root password in a language you do not know, then once it's given to you, you'd better laugh, blush and look insulted all at the same time.
There was the time I accidentally deleted the "bin" user on a Unix box. Of course, deleting a user causes its home directory to be removed, as well.
Can you guess what bin's home directory is?
/bin
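Worth checking before you ever remove a system account: where its home directory actually points. On most modern Linux systems only userdel -r removes the home directory along with the account, but either way a quick look first is cheap. A sketch:
getent passwd bin
# typically prints something like:  bin:x:2:2:bin:/bin:/usr/sbin/nologin
# the sixth field is the home directory, so removing the account "with its home" takes /bin with it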
This didn't happen to me, but I guess it's a really nice story.
These guys were working with one of those old Solaris full-tower servers which, as far as I'm aware, held the several Informix databases this company ran. This was a public utility company, so you can imagine how much data that means.
At one point, various configuration files were being copied onto a floppy disk and passed from server to server. After working on a server, they would just eject the floppy disk and move on to the next one.
Accompanied by another person from the sysadmin group, this guy was working through these configurations while they talked about random stuff. He finished his step, so he pushed the button to eject the floppy.
-"WAIT! Don't release the button!"
When he looked again, he saw he had hit the reset button by mistake, not the eject button. The moment he released that button, the whole database system for the company would immediately go down. (I thought those buttons acted instantly... but this is how the story goes.)
So every sysadmin stopped what he was doing to call department managers and "tell everyone to log off the system. Now." while this guy watched it all happen, attached to a server by his finger.
While setting up a static IP address in /etc/network/interfaces on a Debian box, somebody accidentally swapped the values on the IP address line and the gateway line.
Guess what happens when you "steal" the IP of the core switch?
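For reference, a static stanza in /etc/network/interfaces looks roughly like this (interface name and addresses made up); swapping the address and gateway values is all it takes to hand the server the switch's IP:
auto eth0
iface eth0 inet static
    address 192.168.1.50
    netmask 255.255.255.0
    gateway 192.168.1.1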
Ten-plus years ago I was working on a project that required a SOCKS proxy. I had been using a program called WinGate which, in addition to a SOCKS proxy, provided a nice little Internet gateway with NAT, DHCP and a few other niceties. This was before Windows had Internet Connection Sharing, so WinGate let you share your dial-up modem with your Ethernet network.
I installed the software and started work on the SOCKS client functionality. Later that day, we lost internet connectivity. All of a sudden, it just stopped and nobody could access outside the company. We called our ISP and everything looked fine on the connection. The router was working fine. We just couldn't figure out what went wrong. I pitched in at one point as I had some knowledge of TCP/IP, but I didn't make any headway.
The next day our IT guy figured out that the DHCP server had given the address of the router out to someone's machine, and everyone was using it for the default gateway which didn't go anywhere. Later that day our IT guy came into my office and I asked, "So did you figure out who gave out the wrong IP address?" He said, "Yeah, it's you!"
WinGate had defaulted to running a DHCP server and had given out the router address to the first client whose previous address had expired. I was pretty red-faced for a while.
A couple of companies ago we had a Windows NT 4 box as the main server running everything, as a backup it had a mirrored hard drive.
I accidentally deleted a few important files, no problem just restart the box, select disk 2 from the SCSI menu and we are back up and running on the copy in under a minute.
Then I started the command to rebuild the mirror drive. It turns out that although Windows now had new C: and D: drives the clever mirroring software wasn't going to be fooled by that. It used the SCSI ID numbers for the source and target, and happily copied 1->2.
Thank you Adaptec!
Everyone 'rm -rf /'s at some point accidentally. Mine was trying to delete some of the extra files in my home directory 2 days before my last data structures assignment was due.
Professionally I've been capable enough to not have any catastrophic screw ups so far.
Early on when I was a young one, I was trying to be 'helpful' and tried to copy 250 MB of data over a 128 kbit/s line to 86 different sites at the same time... during business hours. While I was doing this, I overheard people asking why everything was taking so long.
Needless to say, I killed the transfers, and (luckily) no one knew it was me!
On my first installation task (many years ago, in the DOS age) I accidentally deleted almost all of the system files and half of the application files on a computer belonging to the director of a public institution. But it wasn't really my fault. I tried to delete unimportant files in the C:\TEMP folder to free some space. The delete began... after a few moments I saw some familiar names from the root and DOS folders scrolling up the screen... hit Ctrl+Break hard... but too late...
That was the hard way to learn what the cross-linked files problem on a FAT file system is: two directory entries pointing at the same clusters, so deleting the "junk" copy takes the data out from under the real files too.
During maintenance at a co-location I pulled our primary DNS power cable. I was replacing the secondary at the time and must have yanked the cable before I closed the rack. All of our sites started dropping fast and I had to go back to the co-location to plug the stupid thing back in.
We have a cold-testing facility for our engineers in northern Minnesota. About 10 years ago the T1 we had up there went dead. We had moved servers down from that facility to our main datacenter because we had installed the faster line so just about everything was useless up there. Come to find out that some farmer in central Minnesota had run through the fiber with some piece of farm equipment. We were none too happy that the fiber was even accessible to that piece of equipment and not buried much deeper...
Oh, one day I deleted a PostgreSQL database inadvertently and recovered it from log files ;)
Thankfully I was able to easily recover from what I am about to share with you. So you have heard of the infamous rm -rf / under Linux, right? Did you know that Windows has the same command? From the C:\ prompt, this is it:
deltree /y /s/b \.
My problem was that I typed this in and knew it was wrong, so I went to hit the backspace key, but fat-fingered it and hit the enter key instead! It took me literally only 2 seconds to realize what I had done, so I furiously started pressing Ctrl-C repeatedly to abort the operation. By the time I had stopped it, half of the file system was gone.
Backups to the rescue, my friends! Other than a reboot, there was no other downtime. In one sense, I was really lucky that day, because I had great backups in place.
We had a bit of a mess up a few years back. Mid-morning, the users started reporting loads of errors about locking when accessing our SQL Server-hosted app. The app grinds completely to a halt - nobody can do anything. Rather than take the time to find out what's causing it, we do an emergency reboot and everything starts working again. Then I start nosing through the various logs to see what might have triggered it, and just before everything went belly-up I find an open named transaction against the main table without a corresponding COMMIT.
Turned out my colleague had written some SQL in Query Analyzer to correct some erroneous data in the main table, and he'd placed it inside a transaction. But, instead of just hitting F5 to run it, he'd highlighted the whole thing and then hit F5. Except he hadn't quite highlighted everything...he'd missed out the end where it actually COMMITTED the transaction...leaving the table locked.
In the early days of the Internet I ran everything on SGI Challenge S servers. At one point, without my knowledge, the "art department" ordered a demo rendering print server from IKON. I walked in one morning to find the Challenge acting funny; the admin calls into the server room, we go through routine diagnostics, and finally I say it HAS TO BE the power supply. Of course we have no spare. I walk back into the main office, see the loaner machine and realize it's also an SGI. Open it, unscrew the power supply, reboot the server: bingo! We order a spare overnight, the rep shows up in the morning to ask how we like the demo, and we have to hummada hummada for 30 minutes until FedEx shows up and we can re-swap power supplies and roll the demo box out the door. All in a day's work.
I'm a bit of a novice/hobbyist sysadmin with only 30-40 sites hosted on my server, so this wasn't too bad. I was removing execute permissions on all the files in the directory /bin/xxx, and they all started with a dot.
So taking the obvious action, I ran
chmod -R a-x .*
Wow. When you remove execute permissions on your bin directory, it's quite a pain to clean up. The data centre techs had to boot into a live CD to fix it. The best part was I had to walk them through how to fix it. The worst part is they still knew enough to laugh at me :P
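The trap is that the shell glob .* also matches .. (the parent directory), so the recursive chmod happily climbed up into /bin itself. A safer way to hit only the dot-entries in the current directory, as a sketch assuming GNU find:
find . -mindepth 1 -maxdepth 1 -name '.*' -exec chmod a-x {} +
# -mindepth 1 skips '.', and find never lists '..'; add -R to the chmod only if you
# really do want to recurse into dot-directories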
Picture a cup of coffee. It's a full cup, with sugar. Picture it seriously misplaced on a rack's retractable keyboard tray. A rack full of servers. The tray gets somehow pushed into the rack. The cup enters the rack and then topples.
That was my fault, and I was a seasoned admin by then, so I have no excuses. There was a bathroom nearby and I was able to mop up most of the mess with paper towels. Luckily not enough coffee got inside the servers, so I shut them down and cleaned them good. Only 400 users affected. Phew!
Then there was another accident, let's call it that, which happened to a friend of mine. He has dedicated the past 10 years to building his own company. He has ~15 employees, and all the company's data was on this one server. This included all past and present projects, lots of customer data, information he had been contracted to keep safe, all contact information, etc. All nicely encrypted with LUKS. I had been pestering him for a long time to start doing backups, but he never did. Too busy, short of funds, you get the idea. He was confident his RAID1 would save him. His last backup was 8 months old. That was his server's uptime too. He had changed his LUKS password right before the last reboot, 8 months before this. Now he rebooted his server and then realized he had not written the new password down, and he didn't remember it. All he could remember was that it was very long, and that it had several words approximately arranged in some way, with some sort of capitalization and possibly symbols thrown in.
You can imagine the degree of demoralization among his employees, and the rage of customers who had to resend their information for processing, thereby learning their data was "temporarily" unavailable. To make a long story short, it took me about 40 hours of work, 14 days of runtime and a specialized program to generate and test more than a million passwords before I finally found his LUKS password.
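For the curious, the brute forcing boils down to feeding candidate passphrases to cryptsetup and checking the exit code. A minimal sketch of the idea (device path and wordlist are placeholders; the real job needed a dedicated candidate generator and two weeks of runtime, because LUKS key derivation is deliberately slow):
# try each candidate against the LUKS header without activating the device
while IFS= read -r pw; do
    if printf '%s' "$pw" | cryptsetup luksOpen --test-passphrase --key-file=- /dev/sdb1 2>/dev/null; then
        echo "Found it: $pw"
        break
    fi
done < candidates.txt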
Several years back, our iSeries administrator at the time was doing some cleanup in the area of the computer room where our IBM iSeries servers were sitting. This was around 8:30 in the morning, just as I was getting going with whatever I was working on at the time. The screen went blank, and a few seconds later the phone calls started coming in.
Come to find out, the power cord was wrapped around a table leg just enough that it came out when he moved the table.
About two hours later, after the system had recovered itself from the power-down, people were able to work again.
Early in my time as a system administrator I came up with a new way of doing the inventory process (stock taking) for our retail shops. I took a lot of laptops, connected barcode scanners to them, and made the process ten times faster than it used to be when we wrote every article down with pen and paper. I also bought some Symbol PDT DOS handheld terminals, and to extend their battery life I built my own battery packs and wired them up by hand. That night and the next morning I was proud as a peacock, walking around the office telling everyone how smart I was.
The nightmare started when I was sending the data up to the server to run the stock calculation and compare it against the lists. One of the Symbol devices with an extra battery pack had lost its data: one of my hand-connected wires had come loose and the device had been left without power for too long.
Now the work of around 100 employees had gone down the drain. What was the point of the lists from the other 13 or 15 devices if I didn't have all of them? How could I know which stock was missing?
To give a sense of the scale of the disaster: we close our shops for stock taking on only a few days a year, and that event costs the company a lot of money and effort.
Lucky for me, our director and the head of that retail operation were reasonable and accepted the inventory lists as they stood on the computer for that year.
After that I always make two copies of the data while the work is still in progress and again just after we finish the inventory process, and of course I don't brag anymore.
A long time ago, I decided to change the mount point of my data partition. So I created a new directory, changed the mount point in /etc/fstab, and deleted the directory it was previously mounted on.
The thing is, I only realized that the partition was still mounted on the old directory when Nautilus showed me a progress bar (for what should have been a 4 KB deletion). Thankfully I was able to cancel it before too much damage was done, but I did lose some files.
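The order that would have avoided it, as a sketch (device and directory names made up): unmount first, then re-point fstab, and only then remove the old, now-empty directory:
mkdir /data-new
umount /data-old                 # make sure nothing is mounted there any more
# edit the relevant /etc/fstab line so the partition mounts on /data-new
mount /data-new
mountpoint -q /data-old || rmdir /data-old   # rmdir refuses unless the directory is empty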