One of my client's sites received a direct lightning hit last week (coincidentally on Friday the 13th!).
I was remote to the site, but working with someone onsite, I discovered a strange pattern of damage: both internet links were down and most servers were inaccessible. Much of the damage occurred in the MDF, but one fiber-connected IDF also lost 90% of the ports on a switch stack member. Enough spare switch ports were available to redistribute cabling elsewhere and reprogram, but there was downtime while we chased down affected devices.
This was a new building/warehousing facility, and a lot of planning went into the design of the server room. The main server room runs off an APC Smart-UPS RT 8000VA double-conversion online UPS, backed by a generator. There was proper power distribution to all connected equipment. Offsite data replication and systems backups were in place.
In all, the damage (that I'm aware of) was:
- Failed 48-port line card on a Cisco 4507R-E chassis switch.
- Failed Cisco 2960 switch in a 4-member stack. (Oops... loose stacking cable.)
- Several flaky ports on a Cisco 2960 switch.
- HP ProLiant DL360 G7 motherboard and power supply.
- Elfiq WAN link balancer.
- One Multitech fax modem.
- WiMax/Fixed-wireless internet antenna and power-injector.
- Numerous PoE-connected devices (VoIP phones, Cisco Aironet access points, IP security cameras).
Most of the issues were tied to losing an entire switch blade in the Cisco 4507R-E. This carried some of the VMware NFS networking and the uplink to the site's firewall. A VMware host failed, but HA took care of the VMs once storage networking connectivity was restored. I was forced to reboot/power cycle a number of devices to clear funky power states. So the time to recovery was short, but I'm curious about what lessons should be learned...
- What additional protections should be implemented to protect equipment in the future?
- How should I approach warranty and replacement? Cisco and HP are replacing items under contract. The expensive Elfiq WAN link balancer has a blurb on its website that basically says "too bad, use a network surge protector". (It seems like they expect this type of failure.)
- I've been in IT long enough to have encountered electrical storm damage in the past, but with very limited impact; e.g. a cheap PC's network interface or the destruction of mini switches.
- Is there anything else I can do to detect potentially flaky equipment, or do I simply have to wait for odd behavior to surface?
- Was this all just bad luck, or something that should really be accounted for in disaster recovery?
With enough $$$, it's possible to build all sorts of redundancies into an environment, but what's a reasonable balance of preventative/thoughtful design and effective use of resources here?
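On the flaky-equipment question, one low-effort thing I've started doing is baselining interface error counters on the surviving switches and diffing them over time: ports whose input errors keep climbing after the strike are candidates for the "quietly damaged" pile. A rough sketch of the idea (the parsing is against Cisco IOS `show interface`-style text; the exact counter lines and threshold are assumptions for illustration, not a polished tool):

```python
import re

def parse_interface_errors(show_interface_text):
    """Parse 'show interface'-style output into {interface: input_error_count}."""
    errors = {}
    current = None
    for line in show_interface_text.splitlines():
        # Interface header lines look like: "GigabitEthernet1/0/1 is up, line protocol is up"
        m = re.match(r'^(\S+) is \S+, line protocol', line)
        if m:
            current = m.group(1)
            continue
        # Counter lines look like: "  12 input errors, 12 CRC, 0 frame"
        m = re.search(r'(\d+) input errors', line)
        if m and current:
            errors[current] = int(m.group(1))
    return errors

def flag_rising_errors(baseline, current, threshold=0):
    """Return interfaces whose input-error count grew past threshold since baseline."""
    return sorted(
        iface for iface, count in current.items()
        if count - baseline.get(iface, 0) > threshold
    )
```

Snapshot the counters right after cleanup, then re-run the diff daily for a couple of weeks; anything that shows up repeatedly in `flag_rising_errors` gets swapped out rather than waiting for odd behavior to surface on its own.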