6

I'm trying to figure out what to do for a small business that has been plagued by ridiculous hardware problems. Right now, this business runs on five or six desktop machines; no server infrastructure is in place. On top of that, and I'm not embellishing this, they have seen four hardware failures this year to date, and it's got them bordering on madness.

I've already discussed with them the notion of putting a Small Business Server in place (they're a Microsoft shop), and they're receptive to the idea. I also plan on getting my feet wet with System Center Essentials to keep an eye on things. The focus then becomes ensuring that this server remains available.

Also, I've just read through this other high availability thread. Much like the guy in that thread, I'm very new to IT, coming from a programming background instead.

Some ideas come to mind:

  • Simple RAID-5 with hot-swap drives (edit: and a hot spare)
  • Get two cheaper server machines and configure them to run one virtualized server with live migration (I've done some reading, but sadly I can't tell whether SBS Standard and SCE will support this)
  • Failover clustering? I got this term from the other thread but haven't been exposed to it in the past.

Is there a best practice when it comes to this? The business owner is willing to dig into his pockets a little for this because he's becoming terrified of downtime, but I've got no experience with these to lead me in one direction over the other.

I'd appreciate your wisdom!

edit: To provide some additional detail on the problems they've experienced, it's been a weird mix of inexplicable failures.

  • The power switch on the chassis fails to power on the system: the motherboard had an onboard switch, which provided a stop-gap solution; however, swapping out the case didn't fix the problem, and later, swapping out the motherboard didn't fix it either.
  • Two identical machines have both suffered drive failures in their RAID-1 arrays, and both machines were assembled no more than 5 months ago.
  • Boot failure issues: one system in RAID-1 fails to boot at all. Unfortunately I didn't write down the original error message, but in my notes I have that a "Failed to save startup options" error in Windows Repair & Recovery led me to this thread, which supported my suspicion that it was a hardware-related issue.

edit: Also, the machines are running in a collection of home offices, so residential-grade electrical is at play. I guess this may be more of a contributing factor than I'd given it credit for. However, the machines are all run on desks (literally desktops!) and not on the floor; I don't believe dustiness is involved.

bwerks
  • So these were home-built machines and not from a vendor (i.e., Dell, HP)? – GregD Aug 17 '10 at 16:35
  • Yes, the machines were all custom-made, and from very nice (i.e. way overkill) parts, at that. The owner has strong predispositions against HP, Dell, and IBM, despite my urging towards Dell (who I've personally had great experience with). – bwerks Aug 17 '10 at 16:40
  • Is there really a business need for the types of setups you're inquiring about (i.e., SBS virtualized in Hyper-V, HA, etc.)? I mean, are these folks working on life-or-death software here? I'm just curious because you're talking about 5-6 users. While the owner may be frustrated by the downtime he's seen, all of the HA that you're talking about setting up seems like a knee-jerk reaction to that frustration. The most stable environments I've set up and managed have been the simplest ones. – GregD Aug 17 '10 at 16:42
  • I'm really not trying to sound curmudgeonly. When you start entering the realm of HA, costs and complexity can skyrocket. Part of my job as a sysadmin is to translate business needs into technology needs. While the owner may be willing to throw money at his current issue (downtime), there are much simpler ways of accomplishing that without needlessly complicating the management of the domain. – GregD Aug 17 '10 at 16:48
  • I would do my best to get him to ditch "custom made" hardware and go with a vendor like Dell. I've had the pleasure of supporting Dells in an enterprise environment for the past 12 years. Their servers and workstations are rock-solid. The only exception I've seen to this is a small form-factor Optiplex (280?) they made about 6 years ago. Those had some cooling issues, and the power supplies were known to give out early. Their tech support is outstanding, especially if you're lucky enough to be Dell certified. – GregD Aug 17 '10 at 16:51
  • Your curmudgeoning is well-received; I know where you're coming from, but from the owner's perspective every day they lose due to machine downtime could pay for an entire machine. It's by no means mission-critical or anything, but they've got a lot of clients and a lot of simultaneously running projects, so the delays make them look bad to all of them simultaneously. The rationale is pretty much just as you describe it--"we've got the money to throw at it; just make the problems go away." – bwerks Aug 17 '10 at 16:53
  • I agree 100% with support for Dell--I'm actively trying to steer in this direction; however, at this point they've got two ailing systems assembled in the last half year, both of which were very expensive. Somewhat contrary to the previously given outlook, they don't want to just deep-six these custom builds in order to pick up a bunch of new Optiplexes. Touchy situation, to be sure. – bwerks Aug 17 '10 at 16:56
  • Ah, the old, "I really want you to fix this problem, but I really want to tell you how to fix it." Explain that the domain is only as solid as its weakest link. The most expensive HA systems aren't worth squat if you've got nothing to access them with. Fix those systems and offer to sell them on eBay to offset the cost of the new Dells. – GregD Aug 17 '10 at 17:01
  • Haha. Yeah, I know what you mean; I've actually suggested that they get a "DR" desktop just to sit around unplugged in case all the workstations should be unavailable at once. The case for the server was just that if all the business data were in one place (also they're interested in/curious about SharePoint), then they'd still be operational in the case that one of these myriad hardware problems occurs. I like the eBay idea, though, for cost recouping; I'll bring that to them. – bwerks Aug 17 '10 at 17:09
  • +1 But you should add the `[fault-tolerance]`-tag – Jonas Nov 09 '10 at 11:05

5 Answers

5

First of all, SCE is overkill for 5-6 desktop machines. WSUS is probably a better option and is free.

You haven't said much about what exactly has failed. Was it a part in the machine? Is this a dusty environment? My primary support environment is approximately 40 users with approximately 10 servers (not including virtualized ones). We buy Dell machines (Optiplexes), and we have had maybe 4 hardware failures in the last 5 years on ALL of that stuff. So what you're seeing on the workstations isn't normal.

Do they have a proper server room/location for the server (with cooling and not a lot of dust, at least?)

RAID-5 with hot swap is an inexpensive way to go on this server and provides some protection against hard drive failure. I would also add in redundant power supplies (inexpensive) and a UPS.

  • Server class hardware
  • RAID on the hard drives. (Edited to add) Having a hot spare available is probably overkill, since most drives under warranty can be overnighted. With 3 drives in a RAID-5, for instance, you can lose one drive and be okay until the new one arrives. Lose more than one drive, however, and you're screwed no matter how you look at it.
  • Redundant power supplies
  • Proper warranty (with Dell, for instance, we get next-business-day service with keep-your-hard-drive, because we can live with a day of downtime on any of our servers)
  • Backup solution

Failover clustering? You are starting to enter a realm that is both costly and complex for such a small environment. Remember that in such a small environment, while uptime is important, you'll also want to keep things as simple as possible.

As for the workstations, address the problem (which you haven't been extremely clear about). Perhaps you could purchase an "extra" workstation that has your base image on it and just sits there taking all of your updates from WSUS, which you could use as a swap-out machine if one of their workstations dies (which is what we do). We also have a shitload of parts that we can swap in to replace the most common parts that die (power supplies, RAM, hard drives) until the warranty part arrives.

Backups. No amount of redundancy is a substitute for good backups. You have numerous options here. With such a small environment you could look at many over-the-wire solutions (Mozy and Carbonite come to mind) which take care of offsite storage and automation at the same time for a reasonable cost. You could also put in a tape solution and use a service like Iron Mountain to vault the tapes off-site. Whatever you do, do not take tapes home with you, especially if they have valuable information on them (SSNs, etc.).

GregD
  • To be clear, I was interested in the SCOM/SCVMM aspects of SCE more than the SCCM side. Given my limited knowledge of WSUS and SCCM both, I'm under the impression that SCCM is a beefed-up wrapper/replacement for WSUS, right? It's mainly the health monitoring I'm after, and from what I was quoted by MS it would be about a $200 deployment--which didn't seem terribly overkill. – bwerks Aug 17 '10 at 16:31
  • It's not overkill in terms of pricing. It's overkill in terms of setup, configuration and options for your small environment. – GregD Aug 17 '10 at 16:33
2

From my experience, SBS has its own set of problems, especially if you set it up clustered. The maintenance effort is way too big for such a small shop.

Set up a proper little server: 4 disks, RAID (5, 10, or 6), a PCIe RAID controller, a basic file server, and a UPS (thanks, TomTom).

Mail for just a few people is probably best handled by an external provider.

Stay away from SCE and similar overkill, since you would have to have a VPN, Active Directory, and similar pieces in place. Setting all this up is a major effort, and perhaps not in the best interest of your customer.

By guiding your small customer to a simple, yet efficient and reliable solution, you will make them and yourself happy.

Teach them to look into the event logs, and maybe give them a simple script that checks for disk warnings (a minimal sketch follows below). Visit them regularly, if they want that, and check the logs for them. Deal with the problems one at a time.
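For example, here's a minimal sketch of what such a disk-warning check could look like, assuming Python is installed on the server and using the built-in wevtutil tool; the "disk" event source and the 20-event limit are assumptions you'd tune to the actual controller and drives:

```python
# check_disk_events.py - minimal sketch: report recent disk errors/warnings
# from the Windows System event log using the built-in wevtutil tool.
import subprocess

# XPath filter: events from the "disk" source at Error (level 2) or Warning (level 3).
QUERY = "*[System[Provider[@Name='disk'] and (Level=2 or Level=3)]]"

# Query the System log, newest first, plain-text output, at most 20 events.
result = subprocess.run(
    ["wevtutil", "qe", "System", f"/q:{QUERY}", "/f:text", "/c:20", "/rd:true"],
    capture_output=True, text=True,
)

if result.stdout.strip():
    print("Recent disk warnings/errors found - check the drives:")
    print(result.stdout)
else:
    print("No recent disk warnings/errors in the System log.")
```

Scheduled through Task Scheduler, or just run by hand during your regular visits, something like this gives them an early warning before a flaky drive becomes another outage.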

Posipiet
  • I'd say that you'd have problems with clustering Windows SBS since it doesn't support any type of out-of-the-box Microsoft clustering solution. – Evan Anderson Aug 19 '10 at 00:30
1

This is not primarily a hardware issue. Get a UPS - NOW. One that is ON-LINE (i.e., filters the electricity).

On top of that, and I'm not embellishing this, they have seen four hardware failures this year to date

This is either comical - VERY rare - or based on, for example, fluctuating power or something the server did not handle well. This is NOT normal, and the chance of that happening "just because" is EXTREMELY low. Like lottery-winning low. I have seen similar behavior - but based on either CRAP power supplies or... unstable power with spikes, partially induced within the home (I've seen servers die when you turn on the lights, thanks to a very bad switch where you could see sparks).

  • SCE is not needed. WSUS is enough.
  • SBS does not really support what you need in terms of uptime - but you could try running it on a virtualization platform. It DOES run in Hyper-V... I know people doing that for demo purposes.
TomTom
  • Yeah, I agree. These failure rates seem impossible to me. And yeah; I meant that I was thinking of running a virtualized SBS in hyper-v, but I'm not sure what the underlying infrastructure for that would cost; mostly because I'm not entirely sure what would be required of the underlying infrastructure itself, heh. – bwerks Aug 17 '10 at 16:34
  • Well, my home server has 16 GB of RAM, 8 hard discs with hardware RAID, on a micro-ATX board based on AMD, running Hyper-V ;) That said, I am not sure about the price... the RAID controller itself was freaking expensive, as was the SAS backplane for the 8 discs. – TomTom Aug 17 '10 at 16:37
  • Agreed that that many hardware failures seems a bit excessive - OP should look into line-balancing UPSs (I had similar problems at an apt. of mine that would get frequent 'brownouts' until I got such a UPS). – pjz Aug 17 '10 at 17:34
  • I know the machines that failed are indeed on UPS, but I'm waiting on confirmation that they're line-balancing; I don't know for sure that they're not purely for outages right now. – bwerks Aug 17 '10 at 18:36
  • Cheap UPSs are often not in-line or on-line... they basically take over within x ms when a power failure occurs. Spikes in the line go "through" them. The most expensive ones basically ALWAYS use the battery, which they constantly charge, so the power quality is always perfect. In the middle you have those that filter the power. – TomTom Aug 18 '10 at 06:40
1

Just some additional insights:

  • Use RAID-6 instead of RAID-5 + hot spare. RAID-6 keeps two independent parity blocks per stripe, so the array can survive 2 disks failing at the same time (see the quick comparison sketched after this list). Or just use RAID-5, and have working DR backups
  • First focus on having redundancy INSIDE the server box (disks, power supply, cooling)
  • Buy some premium support service for the server box, with a response time SLA for hardware failure (it's much cheaper than a cluster solution)
  • Buy some (good) on-line UPS
  • Implement some availability solution based on replication, like DoubleTake Availability. There's a version of DoubleTake Availability tailored for Windows SBS which is very inexpensive. You will need 2 servers to do that, but your downtime in case of hardware failure will decrease to less than 10 minutes
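To make the first point concrete, here's a rough back-of-the-envelope comparison (a sketch with assumed disk counts and sizes, not figures from this thread):

```python
# Compare usable capacity and fault tolerance for 4 x 1 TB disks.
disks, size_tb = 4, 1.0

# RAID-5 + hot spare: one disk's worth of parity, one disk sitting idle as a spare.
raid5_hotspare_usable = (disks - 2) * size_tb
# RAID-6: two disks' worth of parity spread across all members, no idle disk.
raid6_usable = (disks - 2) * size_tb

print(f"RAID-5 + hot spare: {raid5_hotspare_usable:.0f} TB usable; "
      "a second failure is only survivable after the spare finishes rebuilding")
print(f"RAID-6:             {raid6_usable:.0f} TB usable; "
      "any two simultaneous disk failures are survivable")
```

Same usable space either way with 4 disks; the difference is that RAID-6 tolerates a second failure immediately, while the hot-spare setup is exposed during the rebuild window.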
ThiagoH
0

I don't understand what problem the server is supposed to be solving.

If all four machines came from the same vendor, and there's nothing unusual about your location (very high humidity/dust, static electricity, lightning, or very unreliable power), you need a new hardware vendor. Whatever Dell, HP, and IBM did to get on the owner's bad side, the supplier for these machines is worse, at least from a hardware point of view. You'd get better reliability buying the cheapest machines you can find at Wal-Mart.

It may be that it's not wholly the vendor's fault - maybe someone specified particular hardware and/or insisted on some very low-spec gear - but they still should have refused to build machines that badly configured, or else done something heroic to replace the bad machines.

I suggest you buy some middle-of-the-road PCs from Dell/HP/Lenovo (or kick the butt of the current supplier to support what they sold), sign up for some paid Dropbox accounts (or box.net, or NetDocuments) to share files, and have your ISP or Google handle the mail and web serving.

[Yes, "cloud" services are theoretically less secure than owning your own server - but if this is running in a bunch of home offices, the data is at risk if any of those homes are burglarized, or if someone's family member uses the work machine to run random malicious software from the internet when the employee's not home or on vacation. The biggest danger of downtime will come from consumer-grade net connections, not the cloud provider's downtime.]

It sounds like you need less hardware and simpler hardware if you want reliability, not more complicated and more expensive hardware/software.

gbroiles