This isn't a question about how to cope with or limit downtime or data loss; I know all about that. I'm putting together a 'stories' section for my PASS post-con on disaster recovery and I'd like to be able to share some more recent and impressive tales than the ones I have from my days at Microsoft, although if you've heard me present my corruption deck any time over the last 3 years, you'll remember them all being doozies.

So, think of this as kind of a confessional (although I can't provide absolution :-) and of course, all stories told here happened to a friend or colleague, or at a previous company, unless you're brave and want to 'fess up. I won't pass judgement on or ridicule any answers, and will only provide insight if asked to.

Really, the idea is for everyone to learn from mistakes and mis-steps. As an example of a story I heard, see A sad tale of mis-steps and corruption.

Not sure if this will work or not on this forum, but it's worth a try.


PS If you haven't seen my corruption session and heard the stories, it was the #2 session at TechEd IT Pro last year and they video-taped it: see TechEd: 80 minute video of Corruption Survival Techniques presentation. The blog post links to a bunch of corrupt databases and demo-scripts you can download and play with too (no advertising or anything like that on our site, just info).

Paul Randal
  • too bad we can't post as 'anonymous coward' – Nick Kavadias Jun 01 '09 at 08:50
  • I'm already the "star" of your sad tale of mis-steps. We're still trying to get everything back up over here so I'll come back and update this thread when I finally get back up and running. – SQLChicken Jun 04 '09 at 17:43

4 Answers


Other than the classic "I forgot to include the WHERE clause and I wasn't inside a transaction" update/delete statement?

Kept getting the databases on one server going offline in our lab environment. The drive the MDF files lived on would just disappear, SQL would hiccup, and I'd need to manually bring the databases back online when the drive re-appeared (usually a few minutes later). Spent the better part of a week with the ops guys trying to determine why the drive was going away. It was a LUN on the SAN, with redundant paths to the switch.

Long story short, turned out the fiber cables weren't fully clicked into their ports on the switch, and the cables had shifted during some recent maintenance. They now rested in the cavity between the rack door and the recess it closes into. When the door closed, it pulled on the cables just enough to cause the plugs to ride out and break connection. The door wasn't locked, just swinging freely, and when the door to the lab was opened/shut, the air movement caused the rack door to swing back and forth.
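As for the missing-WHERE classic mentioned at the top, here is a minimal sketch of the habit that guards against it: run the change inside a transaction, check how many rows it touched, and roll back if the number is larger than expected. It uses Python's built-in `sqlite3` module purely for illustration; the `guarded_update` helper, the `accounts` table and the row threshold are all hypothetical and not from any system described here.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_max_rows):
    """Run a data-changing statement, rolling back if it touches too many rows.

    Hypothetical helper: returns (committed, affected_row_count).
    """
    cur = conn.cursor()
    try:
        # With sqlite3's default settings, a transaction is opened
        # implicitly before an UPDATE/DELETE/INSERT statement.
        cur.execute(sql, params)
        affected = cur.rowcount
        if affected > expected_max_rows:
            # More rows than expected: most likely a missing WHERE clause.
            conn.rollback()
            return False, affected
        conn.commit()
        return True, affected
    except Exception:
        conn.rollback()
        raise


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 200), (3, 300)]
    )
    conn.commit()

    # Forgot the WHERE clause: every row is touched, so the change is rolled back.
    print(guarded_update(conn, "UPDATE accounts SET balance = 0", (), 1))

    # Intended statement: one row is touched, so the change is committed.
    print(guarded_update(conn, "UPDATE accounts SET balance = 0 WHERE id = ?", (2,), 1))
```

The same idea carries over to any database that supports explicit transactions: start the transaction first, look at the affected row count, and only then commit.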

Dave Dustin

At a small company I was at, we had just rolled out a basic SharePoint Services site. We were small but our employees were around the world, so the web access and MS Office integration for SharePoint was amazing (everything else sucked, but that's another story). Since we didn't have much money and we were small, we kept it simple: one SQL server with RAID and one web server, also with RAID. About 1 week and 5 gigs of project data into it, the power supply failed in the SQL box. We had a day of downtime waiting for delivery of the new one. We could have rolled the backups onto another server, but since we were still new to SharePoint the DR plan was still in development, and we figured it would take as long to figure it all out as it would to just wait for the power supply to arrive. Since we knew that as soon as we had a new power supply we'd be online and not have to fail back, we just opted to wait and not risk messing up SharePoint.


Human error resulted in a two terabyte MS-SQL database having all of its indexes dropped. They noticed fairly quickly and decided to rebuild the indexes. Unfortunately that process took over 48 hrs. In hindsight it would have been easier (and caused much less downtime) to restore from tape.

Shawn Anderson
  • And generated a lot less log too. Was the downtime because they took the database offline or because performance sucked without the indexes, or because they were offline index rebuilds? Thanks! – Paul Randal Jun 04 '09 at 16:54
  • They went offline for the index. Trying to stay online would have effectively been the same as offline since performance would have been so poor. – Shawn Anderson Jun 07 '09 at 06:50

A few years ago, while working for an auto finance company, I brought down one db server during a deployment. That is one of the major screw-ups I have been involved in during my professional life, although I came out squeaky clean from that issue.

We had one-way transactional replication from SQL 2K (SP3) to SQL 2K (SP3) and, as a company policy, replication had to be torn down and rebuilt during any deployment that involved table(s) in replication. At some point, a decision was made to upgrade to SP4 and the changes were rolled out to all prod servers, but replication wasn't rebuilt after the upgrade.

A couple of weeks later, my project (I was a database developer and a contractor) was due for deployment and I was at the data center supporting it (deployments were usually done at midnight). Replication was brought down and the project deployment was successful, but the replication rebuild failed after 2 hours. The SCM person restarted it at 3 am without reading the complete error message, and it failed again after 2 hours, by which time we were nearing the SLA. I knew I had to call my manager at 5 am, and a lot of calls were made to escalate the issue to all levels/groups.

The DBA group took over the issue at 6 am and I was kept in the dark about the troubleshooting steps; my manager asked me 3 times in 2 hours to check whether my scripts were responsible for the screw-up. My head was on the line. 4 prod DBAs and 2 managers were hot on this issue and a ticket was raised with MSFT, yet even after 3 pm the issue was NOT resolved, until I figured out what had really happened. In one article (table), we had a unique index on a column but the data quality wasn't good. We had '' and null values, and the remaining millions of records were legitimate values, although some legacy data was questionable. After the SP4 upgrade, SQL Server was trying to transform '' and null values to null on the subscriber side, and it failed because of the unique key/index violation. The bad data was removed after getting high-level permission from the business group, and I got to keep my job for another year.

Lesson learned: Test, Test & Test each and every program you have before moving ahead with an upgrade.
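One way to picture the kind of pre-upgrade data check that would have caught this: take the values in the uniquely indexed column, apply the same '' to null conversion described above, and see whether anything now appears more than once. This is only a rough sketch in Python, not what was actually run; the function name is made up, and it assumes SQL Server's behavior of allowing at most one null in a unique index.

```python
from collections import Counter


def unique_collisions_after_normalizing(values):
    """Return values that would occur more than once in a unique column
    once '' is stored as null (None). Hypothetical helper for illustration.
    """
    # Apply the conversion described in the story: '' becomes null.
    normalized = [None if v == "" else v for v in values]
    counts = Counter(normalized)
    # Assumption: the unique index treats null as a single value,
    # so two nulls count as a duplicate.
    return {v: n for v, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up sample data: one '' and one null alongside legitimate values.
    column = ["A1", "", None, "B2"]
    print(unique_collisions_after_normalizing(column))
```

An empty result means the conversion would not introduce duplicates; anything else is data to clean up or discuss with the business before the upgrade.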

Sankar Reddy