4

I saw this comment:

"Been there, done that. We aliased the killall-command on the solaris-boxes after that: alias killall='echo ORLLY?' =) – Commander Keen May 28 at 12:03"

In response to an answer.

It made me wonder, what do sysadmins do to prevent stupid mistakes, either for themselves or others?

In a previous organization where every change had to go through change management (even adding a host to /etc/hosts), we did copy/paste instructions for all change records. If it required additional commands/procedures that weren't in the record, a new ticket was opened.

jtimberman
  • 7,511
  • 2
  • 33
  • 42

25 Answers

9

Surprised no-one has added... go home when you're tired! Your brain does not do its best work when you're 10 or 12 hours into the day. Go home, grab a beer, get some shut-eye, and hit the ground running in the morning!

I also find "peer review" useful... "hey bob, I'm just going to bargle the frargle - you see anything up with that?" just saying it out loud can solidify what you are doing in your mind.

Now we return you to "technical solutions for a single, tired brain" ;)

Tom Newton
  • 4,021
  • 2
  • 23
  • 28
  • I agree with not doing something critical when you're tired... better to do it the next day. – Hapkido Jun 06 '09 at 01:44
  • 1
    We have a policy at our office, that no major changes get made after 3pm, because we're all starting to look forward to hometime and start making mistakes. – Mark Henderson Jun 07 '09 at 07:07
  • 2
    I like that - an official "3pm" policy - we can add that to "never make a big change on a friday" (As it costs more to rectify mistakes on saturday!) – Tom Newton Jun 07 '09 at 12:43
8

I don't do anything to prevent them, but rather, plan around the expectation that I'm bound to make horrific ones.

hark
  • 560
  • 2
  • 6
5

Just don't make a mistake that was obviously noted or explained in the documentation... which is good advice in general: READ THE DOCUMENTATION FIRST

l0c0b0x
  • 11,697
  • 6
  • 46
  • 76
5

I take a maxim from my carpenter friends...

Measure twice cut once.

Before doing things that may result in me standing in an unemployment queue...

Think twice run once.

5

I'm strongly against protective aliases like rm="rm -i".

Once you retrain your brain to expect rm to be safe, you become very dangerous on any machine without those protections. I'd much rather train my fingers to type "rm -i" or just use mv instead of rm, since those aren't likely to get me into trouble in a new environment.
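To make the mv-instead-of-rm habit concrete, here's a minimal sketch; the function name and trash location are my own choices, not a standard tool:

```shell
# Hypothetical "trash" function: move files into a per-user trash
# directory instead of deleting them, so a slip of the fingers is recoverable.
trash() {
    local dir="${TRASH_DIR:-$HOME/.trash}"
    mkdir -p "$dir"
    local stamp f
    stamp=$(date +%Y%m%d%H%M%S)   # timestamp so repeated names don't clobber
    for f in "$@"; do
        mv -- "$f" "$dir/${stamp}.$(basename -- "$f")"
    done
}
```

Unlike an alias, a habit like this degrades gracefully: on a machine without the function, your fingers still type a command that doesn't exist, rather than a destructive one.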

Nick Russo
  • 51
  • 1
4

Among others, these might prove valuable:

alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
alias mysql='mysql --safe-updates' (or add to your .my.cnf)
set -o noclobber

Also, if you often do a lot of browsing your database, but don't often have to make a lot of changes, create a separate user that only has SELECT privileges on tables.
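The SELECT-only account takes only a couple of statements; the user, password, and database names below are placeholders, and this obviously needs a live MySQL server and admin privileges:

```shell
# Hypothetical names throughout; run as a MySQL admin user.
mysql -u root -p <<'SQL'
CREATE USER 'browse'@'localhost' IDENTIFIED BY 'change-this-password';
GRANT SELECT ON exampledb.* TO 'browse'@'localhost';
SQL
```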

Dan Udey
  • 1,460
  • 12
  • 17
4

Many commands have an option that just shows the output as if the command were run, but doesn't actually do it. (E.g. rsync --dry-run) Look for them, and use them.
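When a tool lacks a dry-run flag, you can fake one for your own scripts. A sketch — the wrapper name and the DRY_RUN variable are my invention, not a standard convention:

```shell
# Prefix destructive commands with "run"; set DRY_RUN=1 to preview them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "DRY RUN: $*"
    else
        "$@"
    fi
}

DRY_RUN=1 run rm -rf /tmp/some-dir   # prints the command, deletes nothing
```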

Randy Orrison
  • 490
  • 6
  • 11
4

Automate whatever you can. Whenever you rely on yourself doing something manually, you allow the possibility for mistakes.

Use various techniques to write robust shell scripts.

When preparing a batch job (for loop, clusterssh job, etc.), prepend the commands that do stuff with echo to make sure that they look sane.
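The echo-prepend habit looks like this in practice (host names made up):

```shell
# Dry pass: with echo in front, the loop only prints what it would do.
for host in web1 web2 web3; do
    echo ssh "$host" 'sudo reboot'
done
# Once the printed commands look sane, delete the echo and run it for real.
```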

Josh Kelley
  • 963
  • 1
  • 7
  • 17
4

Checklists and scripting

For every complex task, there is a checklist or a script that will save your butt.

If it's good enough for surgeons and airline pilots, it's good enough for us.

Matt Simmons
  • 20,218
  • 10
  • 67
  • 114
3

When it really matters, I sit down a week before, and write the entire thing down in a Wiki page. The intent is to cut and paste the entire action without a single live edit. Basically, write a script, but with a human able to abort and restart any action.

The next day, I read it and fix it.

The next day, I read it again and fix it.

The next day, I read it again and fix it.

2-3 days before the real execution, I run it once on a machine that I can mess up. Scratch that, a machine that I will mess up. Then I fix the wiki page.

The next day, I read it again and fix it.

On the actual execution date, I run it on the first production system. Then I fix the wiki page.

The 2nd production system usually works without a problem.

Example use: Migrating from an old SAN to a new SAN, with no downtime. Including "hot" Fibre Channel cable migrations.

It sucked. But what a rush when I pulled it off!

Craig Lewis
  • 141
  • 1
2

If you have no idea what you're doing, hire someone else to do it instead of trying to figure it out yourself.

Sasha Chedygov
  • 353
  • 1
  • 5
  • 13
2

We have a policy to only edit system configuration with a script that backs up the configuration file first, before letting you edit it. It's basically a wrapper around vi, but it does the job pretty well: it's very easy to roll back even the most complex changes.
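A minimal sketch of such a wrapper (ours did more; the function name is hypothetical):

```shell
# Back up the file with a timestamp, then hand it to the editor.
safeedit() {
    local file="$1"
    local backup="${file}.$(date +%Y%m%d%H%M%S).bak"
    cp -p -- "$file" "$backup" || return 1   # no backup, no edit
    "${EDITOR:-vi}" "$file"
}
```

Rolling back even a botched multi-file change is then just copying the newest .bak copies back into place.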

wzzrd
  • 10,269
  • 2
  • 32
  • 47
  • 2
    The other solution here is to check your config files (e.g. all of /etc) into version control (e.g. git), and then have a cronjob commit/push new changes every – Dan Udey Jun 05 '09 at 19:26
  • 1
    There is a tool for this job called etckeeper – cstamas Jun 05 '09 at 20:07
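The cron-driven approach from the comments above can be sketched in a few lines (etckeeper packages this up properly; the commit identity below is a placeholder):

```shell
# Commit any changes under a config directory; run from cron, e.g. hourly.
autocommit() {
    cd "$1" || return 1
    [ -d .git ] || git init -q
    git add -A
    # Commit only when something actually changed.
    git diff --cached --quiet || \
        git -c user.name=autocommit -c user.email=autocommit@localhost \
            commit -q -m "autocommit $(date -u +%Y-%m-%dT%H:%MZ)"
}
```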
2

I'm careful to be specific about what I'm doing before I do anything. A script that deletes all files in the current working directory, for example, can work fine in my test but do something bad later on.

MathewC
  • 6,877
  • 9
  • 38
  • 53
  • I worked on a team where the primary tools / script author had a space in the wrong place: "rm -r / *" - ran the script and wiped out an entire production cluster. – jtimberman Jun 05 '09 at 19:36
2

I think it's hard to protect yourself from yourself.. if I knew I was doing it wrong in the first place, then I wouldn't have done it. That said, there are a couple of things I try to remember:

  1. Read through instructions before attempting the task. This is sometimes hard because who really likes instructions?
  2. Read ALL prompts. If there is a prompt, it was designed for a purpose.. reading these and not rushing through clicking has definitely saved me a couple of Homer DOH! moments.
  3. Document difficult tasks. Most of the time when I complete something new and challenging that has not been previously documented, I'll take the time to write up some notes on the task.
  4. Backups
2

We colored the bash prompts differently on dev / stage / production systems. "Oh shit, I was on production?!?!?!"

Trey
  • 186
  • 1
  • 6
1

Some tips for linux machines:

alias rm="rm -i"
alias mv="mv -i"
  • disable ctrl-alt-delete
  • install molly-guard : protects remote machines from accidental shutdowns/reboots
  • install metche : configuration monitor to ease collective administration
rkthkr
  • 8,503
  • 26
  • 38
1

I don't get out of bed.

Failing that I read twice and click once.

Shawn Anderson
  • 542
  • 7
  • 14
1

Document everything you do; you can use it later as a script when you have to redo the task. Peer review. Double-check, and use a staging machine to test the stuff you want to do/change. Automate, and keep everything configuration-related under some version control system.

Most important: "don't be afraid of making mistakes - you will make them". Most often this will make it easier for you to work. Mistakes will happen; just be prepared to clean them up nicely.

f.ederi.co
  • 69
  • 1
  • 3
1

Since I'm too noob to comment on Trey's answer about colored prompts above, I have to post another answer.

This is how I have colored various command prompts: $ cat ~/.bashrc

export FGGRAY=37
export BGRED=41
export BGYELLOW=43
export BGGREEN=42
export HIGHLIGHT=01
export NORMAL=00

export PS1="[\u@\[\e[${FGGRAY};${BGRED};${HIGHLIGHT}m\]\h\[\e[${NORMAL}m\] \W]\\$ "

$ cat ~/.cshrc

setenv FGGRAY 37
setenv BGRED 41
setenv BGYELLOW 43
setenv BGGREEN 42
setenv HIGHLIGHT 01
setenv NORMAL 00
setenv ESC "^["

set prompt = "[%n@%{${ESC}[${FGGRAY};${BGRED};${HIGHLIGHT}m%}%m%{${ESC}[${NORMAL}m%} %~]%# "

It took me surprisingly long to get those prompts working and somewhat readable. Naming the colors made it easy to change a system from production to staging and back (because our staging machine became "production" during beta testing cycles, which was part of the problem).

The astute reader will note that I'm using ANSI escape sequences which don't work everywhere. They worked fine on RedHat, but I haven't tested other OSes.


Craig Lewis
  • 141
  • 1
1

Don't work on sensitive things when you feel tired!

voyager
  • 698
  • 1
  • 6
  • 13
0

By far the most widespread practice in this vein is to set alias rm="rm -i" and alias mv="mv -i".

chaos
  • 7,463
  • 4
  • 33
  • 49
0

Automated versioning on the most important configuration files and logon scripts, so everything remains traceable.

Berzemus
  • 1,162
  • 3
  • 11
  • 19
0

I suppose it really depends on your business. In my previous post as a Jr. Linux Sysadmin, anything going bad was VERY bad. We had clients who depended on things, programmers who didn't do a great job of securing/saving their code, and people in other departments messing with things they had no right to touch.

In my current position, mistakes aren't too terribly bad. The other day, my boss accidentally rm -rf *ed the wrong directory. Was it a pain to rewrite the scripts? You bet. Did we lose much money? Nope.

All I can say is follow the mantra previously mentioned: think twice, do once. And, because we all know that doesn't always work out, have some kind of recovery plan. Personally I'm a fan of an rsync'd directory that saves all important files nightly, but that's because it works for me. Other people may need backup solutions that are far more frequent.

Tedd Johnson
  • 71
  • 2
  • 10
0

In addition to many things listed, I use Zsh as my shell.

/var/lib/mysql% rm ib_ *
zsh: sure you want to delete all the files in /var/lib/mysql [yn]? n
Juliano
  • 5,402
  • 27
  • 28
0

Some sensitive and dangerous tasks are performed in pairs, not alone. GNU screen is used when possible, so the same terminal is shared by two admins working together.

For example, once I had a RAID disk failing when I was 300+ km away from the server, and the on-site admin wasn't too sure of the procedure. He correctly identified and replaced the failing disk, but was afraid of dealing with the beast that is the RAID command-line management interface (called afacli). It was a tight situation for him: the array was degraded, meaning that if another disk failed, serious data loss would ensue.

So, we joined a shared screen session, and I watched him issue the commands to set the new disk as a fallback, then watched the RAID rebuild itself onto the new disk.

Juliano
  • 5,402
  • 27
  • 28