3

Forgive me if I'm not totally clear here. It is not intentional; I'm a senior-level developer in a very small company who has to act as a manager at the moment.

Anyway, the story is that we have 2 older Dell servers with SQL Server 2008 Standard in a "cluster". I put that in quotes because I'm still not 100% clear what that means. We have 2 brand new blade servers and want to move the existing databases to the new hardware.

Ok, so here is the gotcha. We need to do this with little or no down time. I'm being told that we can evict the passive node, then pull in one of the new servers. But I'm also being told that this is a dangerous step because something could go wrong that would cause the cluster to fail and then we would be left with nothing because the active server would not be able to come back up.

Does anyone have any thoughts on how to handle this? I'm being told that the only way to ensure success is to have at least a day of down time where we bring up a new cluster on the new hardware and then migrate the databases 1 by 1.

[Edit] Since it is still related to this question, I'd like to add another one. Is it possible for us to remove a machine from the cluster, create a new cluster with the removed node as the active machine, and then bring a new server into that? That would effectively preserve the old cluster while the new machines get swapped in and out, in case something goes wrong.

John

4 Answers

1

little or no down time

While it's of little help now, you should be running Enterprise if you need high availability. The most obvious feature you would be using in this situation is the ability to have up to 16 nodes in a cluster, so in your case you would just have added 2 more nodes and then removed the ones you no longer wanted. I would also consider upgrading the version while you are upgrading the hardware.

... But I'm also being told that this is a dangerous step because something could go wrong that would cause the cluster to fail and then we would be left with nothing because the active server would not be able to come back up.

Anything is possible. While I've never seen a Server 2008 / SQL 2008 failover cluster simply drop dead, it's theoretically possible. Note that the active node is not "down" during the node upgrade, so there is nothing to take down. The cluster is simply running on 1 node without the possibility of failover. The reasonable worst-case scenario is that the old node is somehow dead and the replacement won't add, in which case you would be running without failover capability until the issue that prevents the server from being added is resolved.

I'm being told that the only way to ensure success is to have at least a day of down time where we bring up a new cluster on the new hardware and then migrate the databases 1 by 1.

That's probably the only way to ensure the success of the guy doing the work. I'd ask the innocent question: "If it takes a day of downtime to move a cluster, why would I cluster in the first place? I could buy 2 machines and leave 1 off and ready to go for that kind of availability." In short, you need to find someone who has actually worked with clusters before and understands the technology involved. Presuming there are no unique issues (e.g. your company wrote some almost-cluster-aware software that runs on the cluster), I'd think most professional Microsoft admins would be embarrassed to say it would take a day of downtime to replace or add hardware in an existing, working cluster.

Jim B
  • Your last paragraph pretty much sums up exactly what I was thinking which is why I posted this. Thanks so much. – John Jun 10 '10 at 11:41
0

First off, the strategy recommended at the end of your question is the way I would recommend doing it as well, but seeing as that is not an option, this is how I would handle it. You seem confused about what a cluster is. Basically, both servers have SQL Server and cluster services installed, and with a command through cluster services you can "roll" SQL from one server to the other. If I were in your shoes I would do it as you have suggested: roll all services to one node, remove the second node from the cluster, add one of your new servers as a cluster node, roll all services to the new cluster node, remove the remaining old node from the cluster, and add the second new node. (Standard edition supports only two nodes in a failover cluster instance, so the last old node has to come out before the second new one goes in.)
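The rolling swap described above can be sketched as a small dry-run model. This is only an illustration in Python, not cluster tooling: the node names are hypothetical, nothing here talks to a real cluster, and the two-node limit is an assumption based on Standard edition, which is why the last old node is evicted before the second new node is added.

```python
# Hypothetical dry-run model of the rolling swap; no real cluster is touched.
MAX_NODES = 2  # assumption: two-node limit of SQL Server 2008 Standard


def rolling_swap(old_nodes, new_nodes):
    """Return the ordered steps that replace old_nodes with new_nodes."""
    members = list(old_nodes)
    active = members[0]          # assume everything already runs on the first old node
    incoming = list(new_nodes)
    steps = []

    def check():
        # The instance must always have an owner, and stay within the node limit.
        assert active in members and 1 <= len(members) <= MAX_NODES

    for old in reversed(list(old_nodes)):    # passive node goes first
        if old == active:
            # Move the instance to a new node before evicting its current owner.
            target = next(n for n in members if n in new_nodes)
            steps.append(f"fail over {active} -> {target}")
            active = target
            check()
        members.remove(old)
        steps.append(f"evict {old}")
        check()
        if incoming:
            new = incoming.pop(0)
            members.append(new)
            steps.append(f"add {new}")
            check()
    return steps


for step in rolling_swap(["OLD1", "OLD2"], ["NEW1", "NEW2"]):
    print(step)
```

The `check()` call after every step is the point of the sketch: at no time is the instance without an owner, but between each eviction and the following add there is only one node, so there is no failover protection during those windows.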

**Please note: if you are unfamiliar with cluster services and/or clustered SQL installations and you attempt this on your live system, this could end very, very badly for you. As in far worse than the one day of planned downtime. I would either hire a consultant with experience with clusters or, if that is not an option, set up a test environment where you can test the process inside and out.

Here is a link to the steps for adding a node to your cluster.

Charles
0

You don't need to break the old cluster at all unless you want to use the hardware again. I would recommend the following:

  • Create a new cluster with the new blades
  • Install SQL on the new cluster, keeping the drive letters, paths, port and instance name (if applicable) the same
  • After installation, restore/replace the master and msdb databases from the old SQL Server to the new one to get the logins and jobs, or else script the jobs out on the old one and use sp_help_revlogin to transfer the logins
  • Log ship or mirror the database from the old server to the new one to get the data up to date

This will get your new instance into the same state as the old one, along with a fresh install of the OS and SQL. To cut over to the new cluster you can do the following, assuming that the name of your old instance is INSTA and the new one is INSTB:

  • Take the old SQL instance offline
  • Recover the databases on the new server
  • Delete the INSTA DNS record from Active Directory's DNS
  • Create a new CNAME (alias) DNS record named INSTA in Active Directory DNS that points to INSTB
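The "recover the databases" step can be sketched as follows, assuming log shipping was used and every database on the new server is still in the restoring state with the last log backup already applied. This is a minimal Python sketch that only builds the T-SQL text; the database names are hypothetical, and with mirroring the cutover would be a failover instead.

```python
def quote_name(name):
    """Bracket-quote a T-SQL identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def recovery_script(databases):
    """Build one RESTORE ... WITH RECOVERY statement per database."""
    return "\n".join(
        f"RESTORE DATABASE {quote_name(db)} WITH RECOVERY;" for db in databases
    )


# Hypothetical database names, for illustration only.
print(recovery_script(["Sales", "Inventory"]))
```

Generating the script ahead of time keeps the cutover window down to taking the final log backup, restoring it, and running these statements.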

Once this is done, the applications will still be connecting to the old name of the SQL instance, but that name will now take them to the new server. You may need to run "ipconfig /flushdns" on all the application servers to make the DNS change take effect faster; ping the old name to confirm when it resolves to the new server. We use this method for cutover because it allows us to keep the old cluster around in case we need to roll back. You will not be able to bring the old SQL instance up until you change its SQL Server Network Name parameter to something else, but once that is done you would just point the DNS alias back to the old one if you want to roll back.
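Instead of pinging by hand, the check that the old name now resolves to the new server could be scripted. This is a rough sketch in Python: `wait_for_alias` is a hypothetical helper, the IP address in the usage note is a placeholder, and the result only reflects the resolver on the machine where it runs, so it would need to be run on each application server.

```python
import socket
import time


def wait_for_alias(name, expected_ip, resolve=socket.gethostbyname,
                   attempts=30, delay=10.0, sleep=time.sleep):
    """Poll DNS until `name` resolves to `expected_ip`; return True on success."""
    for _ in range(attempts):
        try:
            if resolve(name) == expected_ip:
                return True
        except OSError:
            # Lookup failed (for example, the old record is deleted and the
            # alias is not visible yet); keep polling.
            pass
        sleep(delay)
    return False
```

For example, `wait_for_alias("INSTA", "10.0.0.2")` would return True once the alias points at the new cluster's address, or False after roughly five minutes with the defaults. The `resolve` and `sleep` parameters exist only so the function can be exercised without real DNS.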

Jason Cumberland
  • What type of downtime would this involve at the point where they recover the databases to the new cluster? The thing we are trying to avoid is anything more than a few minutes tops. – John Jun 09 '10 at 19:59
0

I don't know the specifics of the hardware, so I can't say for certain this would work, but my suggestion would be to image the old passive node over to the new server. Something like Acronis, which can restore an image onto different hardware, should let you basically move the passive node to the new blade. Once it is there, you can power it up, verify that it is functioning properly (as much as you can), and then try to fail over to the new hardware. Although there are many things that could go wrong, as Jim B said, there is a good chance it will either fail over properly to the new hardware, or not work and simply have to go back to the old hardware. If it works, you can repeat the process on the other node. If it doesn't, you can just power the old passive node back on (you wouldn't have to destroy it) and try something else.

Paul Kroon
  • Thanks, I suggested the imaging solution and was told it wasn't possible. Specifically I was told that as soon as the image is transferred booting up the new blade would result in a "blue screen". Are you suggesting that Acronis will be able to allow for the differences in drivers? – John Jun 10 '10 at 11:40
  • Yes, Acronis has a feature called Universal Restore (a little extra $, I believe) that abstracts the Windows HAL. As long as you run the restore on the hardware you're putting the image onto, it will allow you to restore and install the new drivers in the same way you would a fresh install. The one thing to watch out for is HDD drivers, which you might need to provide to Acronis during the restore if Windows needs them to boot. That would cause a blue screen, but is easily fixed by providing that driver. – Paul Kroon Jun 10 '10 at 16:28
  • Of course, imaging SQL Server (aside from being unsupported) is a great way to cause all sorts of problems for SQL. I doubt Acronis makes a SQL Server universal restore module. – Jim B Jun 10 '10 at 16:36
  • http://us2.download.acronis.com/pdf/TrueImageEnterpriseServerEcho_ug.en.pdf - section 6.3.4. It does this through VSS, but that's only necessary if it is done live. You must be referring to the possible issue here: http://support.microsoft.com/kb/899159. I've done dozens of re-images of SQL servers this way, and haven't had a problem yet, but YMMV depending on the environment. Without knowing the details of what uses the database, I couldn't comment on whether this name issue could cause problems for you. – Paul Kroon Jun 10 '10 at 16:51
  • There are also issues if you add/remove features at a later time. Section 6.3.4 appears to refer to restoring the server as a backup, not making an image to restore to a different server. I've worked on deploying an app that uses SQL Server on the back end and have seen many issues when customers deploy SQL Server from images. – Jim B Jun 10 '10 at 17:27