4

I saw this comment:

"Been there, done that. We aliased the killall-command on the solaris-boxes after that: alias killall='echo ORLLY?' =) – Commander Keen May 28 at 12:03"

In response to an answer.

It made me wonder, what do sysadmins do to prevent stupid mistakes, either for themselves or others?

In a previous organization where every change had to go through change management (even adding a host to /etc/hosts), we did copy/paste instructions for all change records. If it required additional commands/procedures that weren't in the record, a new ticket was opened.

jtimberman
  • 7,511
  • 2
  • 33
  • 42

25 Answers

9

Surprised no-one has added... go home when you're tired! Your brain does not do its best work when you're 10 or 12 hours into the day. Go home, grab a beer, get some shut-eye, and hit the ground running in the morning!

I also find "peer review" useful... "hey bob, I'm just going to bargle the frargle - you see anything up with that?" just saying it out loud can solidify what you are doing in your mind.

Now we return you to "technical solutions for a single, tired brain" ;)

Tom Newton
  • 4,021
  • 2
  • 23
  • 28
  • I agree with not doing something critical when you're tired... better to do it the next day. – Hapkido Jun 06 '09 at 01:44
  • 1
    We have a policy at our office, that no major changes get made after 3pm, because we're all starting to look forward to hometime and start making mistakes. – Mark Henderson Jun 07 '09 at 07:07
  • 2
    I like that - an official "3pm" policy - we can add that to "never make a big change on a friday" (As it costs more to rectify mistakes on saturday!) – Tom Newton Jun 07 '09 at 12:43
8

I don't do anything to prevent them, but rather, plan around the expectation that I'm bound to make horrific ones.

hark
  • 560
  • 2
  • 6
5

Just don't make a mistake that was obviously noted or explained in the documentation... which is good advice in general: READ THE DOCUMENTATION FIRST

l0c0b0x
  • 11,697
  • 6
  • 46
  • 76
5

I take a maxim from my carpenter friends...

Measure twice cut once.

Before doing things that may result in me standing in an unemployment queue...

Think twice run once.

5

I'm strongly against protective aliases like rm="rm -i".

Once you retrain your brain to expect rm to be safe, you become very dangerous on any machine without those protections. I'd much rather train my fingers to type "rm -i" or just use mv instead of rm, since those aren't likely to get me into trouble in a new environment.
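To make the mv-instead-of-rm habit concrete, here's a minimal sketch; the function name and trash location are my own choices, not a standard tool:

```shell
# Hypothetical "trash" function: move files into a per-user trash
# directory instead of deleting them, so a slip of the fingers is recoverable.
trash() {
    local dir="${TRASH_DIR:-$HOME/.trash}"
    mkdir -p "$dir"
    local stamp f
    stamp=$(date +%Y%m%d%H%M%S)   # timestamp so repeated names don't clobber
    for f in "$@"; do
        mv -- "$f" "$dir/${stamp}.$(basename -- "$f")"
    done
}
```

Unlike an alias, a habit like this degrades gracefully: on a machine without the function, your fingers still type a command that doesn't exist, rather than a destructive one.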

Nick Russo
  • 51
  • 1
4

Among others, these might prove valuable:

alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
alias mysql='mysql --safe-updates' (or add to your .my.cnf)
set -o noclobber

Also, if you often do a lot of browsing your database, but don't often have to make a lot of changes, create a separate user that only has SELECT privileges on tables.
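The SELECT-only account takes only a couple of statements; the user, password, and database names below are placeholders, and this obviously needs a live MySQL server and admin privileges:

```shell
# Hypothetical names throughout; run as a MySQL admin user.
mysql -u root -p <<'SQL'
CREATE USER 'browse'@'localhost' IDENTIFIED BY 'change-this-password';
GRANT SELECT ON exampledb.* TO 'browse'@'localhost';
SQL
```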

Dan Udey
  • 1,460
  • 12
  • 17
4

Many commands have an option that just shows the output as if the command were run, but doesn't actually do it. (E.g. rsync --dry-run) Look for them, and use them.
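When a tool lacks a dry-run flag, you can fake one for your own scripts. A sketch — the wrapper name and the DRY_RUN variable are my invention, not a standard convention:

```shell
# Prefix destructive commands with "run"; set DRY_RUN=1 to preview them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "DRY RUN: $*"
    else
        "$@"
    fi
}

DRY_RUN=1 run rm -rf /tmp/some-dir   # prints the command, deletes nothing
```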

Randy Orrison
  • 490
  • 6
  • 11
4

Automate whatever you can. Whenever you rely on yourself doing something manually, you allow the possibility for mistakes.

Use various techniques to write robust shell scripts.

When preparing a batch job (for loop, clusterssh job, etc.), prepend the commands that do stuff with echo to make sure that they look sane.
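The echo-prepend habit looks like this in practice (host names made up):

```shell
# Dry pass: with echo in front, the loop only prints what it would do.
for host in web1 web2 web3; do
    echo ssh "$host" 'sudo reboot'
done
# Once the printed commands look sane, delete the echo and run it for real.
```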

Josh Kelley
  • 963
  • 1
  • 7
  • 17
4

Checklists and scripting

For every complex task, there is a checklist or a script that will save your butt.

If it's good enough for surgeons and airline pilots, it's good enough for us.

Matt Simmons
  • 20,218
  • 10
  • 67
  • 114
3

When it really matters, I sit down a week before, and write the entire thing down in a Wiki page. The intent is to cut and paste the entire action without a single live edit. Basically, write a script, but with a human able to abort and restart any action.

The next day, I read it and fix it.

The next day, I read it again and fix it.

The next day, I read it again and fix it.

2-3 days before the real execution, I run it once on a machine that I can mess up. Scratch that, a machine that I will mess up. Then I fix the wiki page.

The next day, I read it again and fix it.

On the actual execution date, I run it on the first production system. Then I fix the wiki page.

The 2nd production system usually works without a problem.

Example use: Migrating from an old SAN to a new SAN, with no downtime. Including "hot" Fibre Channel cable migrations.

It sucked. But what a rush when I pulled it off!

Craig Lewis
  • 141
  • 1
2

If you have no idea what you're doing, hire someone else to do it instead of trying to figure it out yourself.

Sasha Chedygov
  • 353
  • 1
  • 5
  • 13
2

We have a policy to only edit system configuration with a script that backs up the configuration file first, before letting you edit it. It's basically a wrapper around vi, but it does the job pretty well: it's very easy to roll back even the most complex changes.
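A minimal sketch of such a wrapper (ours did more; the function name is hypothetical):

```shell
# Back up the file with a timestamp, then hand it to the editor.
safeedit() {
    local file="$1"
    local backup="${file}.$(date +%Y%m%d%H%M%S).bak"
    cp -p -- "$file" "$backup" || return 1   # no backup, no edit
    "${EDITOR:-vi}" "$file"
}
```

Rolling back even a botched multi-file change is then just copying the newest .bak copies back into place.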

wzzrd
  • 10,269
  • 2
  • 32
  • 47
  • 2
    The other solution here is to check your config files (e.g. all of /etc) into version control (e.g. git), and then have a cronjob commit/push new changes every – Dan Udey Jun 05 '09 at 19:26
  • 1
    There is a tool for this job called etckeeper – cstamas Jun 05 '09 at 20:07
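The cron-driven approach from the comments above can be sketched in a few lines (etckeeper packages this up properly; the commit identity below is a placeholder):

```shell
# Commit any changes under a config directory; run from cron, e.g. hourly.
autocommit() {
    cd "$1" || return 1
    [ -d .git ] || git init -q
    git add -A
    # Commit only when something actually changed.
    git diff --cached --quiet || \
        git -c user.name=autocommit -c user.email=autocommit@localhost \
            commit -q -m "autocommit $(date -u +%Y-%m-%dT%H:%MZ)"
}
```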
2

I'm careful to be specific about what I'm doing before I do anything. A script that deletes all files in the current working directory, for example, can work fine in my test but do something bad later on.

MathewC
  • 6,877
  • 9
  • 38
  • 53
  • I worked on a team where the primary tools / script author had a space in the wrong place: "rm -r / *" - ran the script and wiped out an entire production cluster. – jtimberman Jun 05 '09 at 19:36
2

I think it's hard to protect yourself from yourself.. if I knew I was doing it wrong in the first place, then I wouldn't have done it. That said, there are a couple of things I try to remember:

  1. Read through instructions before attempting the task. This is sometimes hard because who really likes instructions?
  2. Read ALL prompts. If there is a prompt, it was designed for a purpose.. reading these and not rushing through clicking has definitely saved me a couple of Homer DOH! moments.
  3. Document difficult tasks. Most of the time when I complete something new and challenging that has not been previously documented, I'll take the time to write up some notes on the task.
  4. Backups
2

We colored the bash prompts differently on dev / stage / production systems. "Oh shit, I was on production?!?!?!"

Trey
  • 186
  • 1
  • 6
1

Some tips for linux machines:

alias rm="rm -i"
alias mv="mv -i"
  • disable ctrl-alt-delete
  • install molly-guard : protects remote machines from accidental shutdowns/reboots
  • install metche : configuration monitor to ease collective administration
rkthkr
  • 8,503
  • 26
  • 38
1

I don't get out of bed.

Failing that I read twice and click once.

Shawn Anderson
  • 542
  • 7
  • 14
1

Document everything you do; you can use it later as a script when you have to redo the task. Peer review. Double-check, and use a staging machine to test the stuff you want to do/change. Automate, and keep everything configuration-related under some version control system.

Most important: "don't be afraid of making mistakes - you will make them". Most often this will make it easier for you to work. Mistakes will happen; just be prepared to clean them up nicely.

f.ederi.co
  • 69
  • 1
  • 3
1

Since I'm too noob to comment on Trey's answer about colored prompts above, I have to post another answer.

This is how I have colored various command prompts: $ cat ~/.bashrc

export FGGRAY=37
export BGRED=41
export BGYELLOW=43
export BGGREEN=42
export HIGHLIGHT=01
export NORMAL=00

export PS1="[\u@\[\e[${FGGRAY};${BGRED};${HIGHLIGHT}m\]\h\[\e[${NORMAL}m\] \W]\\$ "

$ cat ~/.cshrc

setenv FGGRAY 37
setenv BGRED 41
setenv BGYELLOW 43
setenv BGGREEN 42
setenv HIGHLIGHT 01
setenv NORMAL 00
setenv ESC "^["

set prompt = "[%n@%{${ESC}[${FGGRAY};${BGRED};${HIGHLIGHT}m%}%m%{${ESC}[${NORMAL}m%} %~]%# "

It took me surprisingly long to get those prompts working and somewhat readable. Naming the colors made it easy to change a system from production to staging and back (because our staging machine became "production" during beta testing cycles, which was part of the problem).

The astute reader will note that I'm using ANSI escape sequences which don't work everywhere. They worked fine on RedHat, but I haven't tested other OSes.


Craig Lewis
  • 141
  • 1
1

Don't work on sensitive things when you feel tired!

voyager
  • 698
  • 1
  • 6
  • 13
0

By far the most widespread practice in this vein is to set alias rm="rm -i" and alias mv="mv -i".

chaos
  • 7,463
  • 4
  • 33
  • 49
0

Automated versioning on the most important configuration files and logon scripts, so everything remains traceable.

Berzemus
  • 1,162
  • 3
  • 11
  • 19
0

I suppose it really depends on your business. In my previous post as a Jr. Linux Sysadmin, anything going bad was VERY bad. We had clients who depended on things, programmers who didn't do a great job of securing/saving their code, and people in other departments messing with things they had no right to touch.

In my current position, mistakes aren't too terribly bad. The other day, my boss accidentally rm -rf *ed the wrong directory. Was it a pain to rewrite the scripts? You bet. Did we lose much money? Nope.

All I can say is follow the mantra previously mentioned: think twice, do once. And, because we all know that doesn't always work out, have some kind of recovery plan. Personally I'm a fan of an rsync'd directory that saves all important files nightly, but that's because it works for me. Other people may need backup solutions that are far more frequent.

Tedd Johnson
  • 71
  • 2
  • 10
0

In addition to many things listed, I use Zsh as my shell.

/var/lib/mysql% rm ib_ *
zsh: sure you want to delete all the files in /var/lib/mysql [yn]? n
Juliano
  • 5,402
  • 27
  • 28
0

Some sensitive and dangerous tasks are performed in pairs, not alone. GNU screen is used when possible, so the same terminal is shared by two admins working together.

For example, once I had a RAID disk failing when I was 300+ km away from the server, and the on-site admin wasn't too sure of the procedure. He correctly identified and replaced the failing disk, but was afraid of dealing with the beast that is the RAID command-line management interface (called afacli). It was a tight situation for him: the array was degraded, meaning that if another disk failed, serious data loss would ensue.

So, we joined a shared screen session, and I watched him issue the commands to set the new disk as a fallback, then watched the RAID rebuild itself onto the new disk.

Juliano
  • 5,402
  • 27
  • 28