We recently changed our mirroring setup to include a witness so that we support automatic failover. Yesterday our datacenter had planned maintenance on some of its network equipment (I believe they replaced some routers and the like).

Sadly, this caused some instability in our network, which I can't do anything about now (except write an angry email!). What really troubles me is the following scenario.

We have about 10 databases running on a principal we call the 90; the 91 is our mirror and the 92 runs as our witness.

Yesterday at 09:35 the witness and the mirror claimed to have lost their connection to the principal and promoted the mirror to principal. The principal (on the 90), however, never claimed to have lost its connection to the witness and carried on doing its thing (remaining up). Then at around 09:54 the 90 claimed to have lost its connection to the mirror (which makes sense, since at 09:25 this became the principal... it just realised this 30 minutes too late?!).

At this point we have two quorums: the 90 can see the 92 and remains principal, and the 91, which has been promoted, can also see the 92.
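To make the "two quorums" reasoning concrete, here is a toy model in Python. It is not SQL Server's actual algorithm, only a sketch of the rule of thumb that a server keeps quorum while it can reach at least one of the other two, and that the mirror is promoted when both it and the witness agree the principal is gone. The function names and the `views_0935` data are made up for illustration; the views are my reading of what each server's log claimed.

```python
# Toy model only: NOT SQL Server's real implementation.
# Each server has its own view of which peers it can reach;
# during a network fault these views can be asymmetric.

def has_quorum(server, view):
    """A server keeps quorum if it believes it can reach at least one peer."""
    return len(view[server]) >= 1

def mirror_promotes(view, principal="90", mirror="91", witness="92"):
    """The mirror takes over when it and the witness both think the
    principal is gone but can still see each other."""
    return (principal not in view[mirror]
            and principal not in view[witness]
            and witness in view[mirror]
            and mirror in view[witness])

def principals(view, principal="90", mirror="91", witness="92"):
    """Return every server that would act as principal in this toy model."""
    result = []
    if has_quorum(principal, view):
        result.append(principal)  # old principal sees no reason to step down
    if mirror_promotes(view, principal, mirror, witness):
        result.append(mirror)     # mirror and witness agree to fail over
    return result

# Hypothetical views reconstructed from the logs described above.
views_0935 = {
    "90": {"91", "92"},  # the 90 reports no lost connections
    "91": {"92"},        # the 91 reports losing the 90
    "92": {"91"},        # the 92 reports losing the 90
}

# A clean, symmetric failure for comparison: the 90 is fully cut off.
views_symmetric = {"90": set(), "91": {"92"}, "92": {"91"}}

print(principals(views_0935))       # ['90', '91']
print(principals(views_symmetric))  # ['91']
```

In this sketch the double principal only appears when the views disagree, which is what the logs seem to describe: the 91 and 92 thought the 90 was gone while the 90 thought everything was fine.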

Sadly, at this point the principal (90) started throwing some horrible deadlocks and refused to respond to any commands. The clients, which have a connection string set up with a failover partner, could however still ping/connect to the 90, meaning none of them failed over to the 91. In the end we restarted the SQL Server instance on the 90, which made all the databases fail over correctly.
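The client side of this can be sketched the same way. This is a simplified picture of how I understand a failover-partner connection string to behave on the initial connection, not the driver's real code: the failover partner is only tried when the first server does not accept the connection. `pick_server` and the `can_connect` callback are hypothetical names.

```python
# Simplified sketch, not the real driver logic: the failover partner
# is only tried when the first server refuses the connection.

def pick_server(can_connect, primary="90", failover_partner="91"):
    """Return the server a client ends up on in this toy model."""
    if can_connect(primary):
        return primary            # first server answers, so the client stays here
    if can_connect(failover_partner):
        return failover_partner   # first server unreachable, try the partner
    raise ConnectionError("neither partner accepted a connection")

# During the incident the 90 still accepted connections (it was just hung),
# so in this model the clients never move to the 91.
print(pick_server(lambda server: True))            # 90

# After the instance on the 90 was restarted, connections to it fail
# for a while and the clients fall through to the 91.
print(pick_server(lambda server: server == "91"))  # 91
```

This would explain why a server that is reachable but unresponsive is worse for the clients than one that is simply down.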

I personally have no idea how this situation occurred, or whether there is anything wrong with our setup. The fact that we somehow had two principals really bothers me, especially since the original principal was up and running for around 30 minutes, leaving us with a gap of 30 minutes by the time we managed to turn it into a mirror.

Any information on this would be much appreciated, since at the moment we are pretty shell-shocked by all this.

1 Answer

At the time (09:35), did you check the mirroring monitor to see what the status was? Do you receive notifications of these events?

It is possible that the connection between the 90 and the 91/92 was dropped because of the network maintenance, so the 91 became principal as well. For the clients nothing changed, because their connection to the 90 was still up, thus creating your situation.

What I probably would have done (easy to say now :)) had I been aware of the two-principal situation:

On the 90, which was running fine at the time, serving clients etc., remove the mirroring configuration from the databases. This way the databases stay consistent and everything stays online.

Afterwards you can reconfigure mirroring without any downtime.

Peter
  • Yes, that's actually something we tried. We're not sure why, but we had some horrible deadlock on the database in question and couldn't remove the mirroring at all. In the end, maybe I should have killed all running processes on the DB. – F.B. ten Kate Aug 31 '11 at 11:48
  • I'm wondering though: if the 90 and the 91/92 lose connection, the 91 is promoted, and then the connection is re-established, shouldn't the 90 then get an "update" making it go mirror? Or turn the 91 back into a mirror or something? – F.B. ten Kate Aug 31 '11 at 11:49
  • Seems logical, since those two form the quorum. I think a reboot of the 90 would do the trick, but I think it will still cause inconsistency because clients could have made updates on the 90. – Peter Aug 31 '11 at 12:47
  • Yeah, I guess we might have gotten "unlucky" with the horrible lock that happened on the principal, making us virtually unable to recover. Luckily it's only a 20-minute gap or so. I've just set up some alerts on state changes at least... I'm not having this stuff surprise me like it did again! – F.B. ten Kate Aug 31 '11 at 12:51