
Our company runs thousands of e-commerce web sites on two clusters in two separate data centers.

Basically, all we require to operate is rack-mountable server nodes. Each node needs:

1. 4 or 8 cores
2. 32 GB RAM
3. One 250 GB SATA disk
4. 2-port gigabit Ethernet adapters
5. Ability to boot Windows XP Pro

That's it. We run about 40 such nodes in a fully redundant, always-up (hopefully!) cluster (we wrote the clustering part ourselves).
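
To give a flavor of what the clustering has to do (this is not our actual code, just a rough sketch; the host names and port below are made up), each node basically watches its peers and pulls a node out of rotation when it stops answering on its service port:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Illustrative only: try to open a TCP connection to each peer's service port
    // and report which nodes are unreachable, so they can be pulled from rotation.
    public class HeartbeatCheck {
        private static final int SERVICE_PORT = 8080;   // made-up port
        private static final int TIMEOUT_MS = 2000;

        public static void main(String[] args) {
            String[] peers = { "node01", "node02", "node03" };   // made-up host names
            for (String host : peers) {
                System.out.println(host + (isAlive(host) ? ": OK" : ": UNREACHABLE"));
            }
        }

        // A peer counts as alive if it accepts a TCP connection within the timeout.
        static boolean isAlive(String host) {
            Socket s = new Socket();
            try {
                s.connect(new InetSocketAddress(host, SERVICE_PORT), TIMEOUT_MS);
                return true;
            } catch (IOException e) {
                return false;
            } finally {
                try { s.close(); } catch (IOException ignored) { }
            }
        }
    }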

Previously, we bought our systems white-boxed: basically, we had a small shop custom-build our servers to our specs from Supermicro parts.

This scheme was working well until our last round of node purchases, which has had an extremely high failure rate (30% failed within 6 months). There's no single cause: bad PSUs, bad memory, fried motherboards, etc.

My questions are these:

Will we get more consistent reliability if we purchase from a name-brand vendor (IBM/Dell/HP), or are we basically in the same reliability crapshoot we were in before? Remember, these are low-end servers; we are not going to transition to a mainframe or anything exotic.

Will our reliability vary with the form factor of the servers? That is to say, will 2U servers be any more reliable than high-density servers that pack 2 nodes into a 1U box?

Has anybody out there transitioned from white-box servers to name-brand servers (or changed form factors) and have a tale to tell?

SvrGuy

6 Answers


The brand names, in general, tend to be more reliable than whiteboxes (although Supermicro doesn't count as "white box" in my world), but you will still have the occasional run of bad luck with hardware from the name brands. What you do tend to get, though, if you've got a large purchasing volume and a history with one of the bigger kids, is a quick turnaround on fixing those sorts of problems. If you get a dud batch of motherboards from a whitebox vendor, there's little chance they'll have a pile of spares sitting around to replace them with, whereas a big name will have spares coming out their ears -- and long-term, loyal customers (i.e. "cash cows") will get that stock first.

Ultimately, though, it's computer hardware, and this sort of thing is why we run extensive burn-in tests on all hardware received. This stuff happens with alarming regularity once you get into large-scale management, and having it fail on the test rack is a far better option than having it fail in production (even if you do have massively redundant systems).
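
To give an idea of what I mean by burn-in (a crude sketch only, not a substitute for proper tools like memtest86+ or the vendor's own diagnostics): since you're running a JVM anyway, even a simple Java program that keeps every core busy and repeatedly write-verifies a big chunk of memory, left running for a day or two, will shake out a surprising number of marginal boxes before they ever see production.

    import java.util.Arrays;
    import java.util.Random;

    // Crude burn-in sketch: one worker thread per core, each endlessly filling a
    // large byte array with pseudo-random data and verifying it. Real burn-in
    // should also exercise disk and network, and run for many hours.
    public class BurnIn {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            for (int i = 0; i < cores; i++) {
                final int id = i;
                new Thread(new Runnable() {
                    public void run() { stress(id); }
                }).start();
            }
        }

        static void stress(int id) {
            byte[] block = new byte[64 * 1024 * 1024];   // 64 MB per worker
            byte[] expected = new byte[block.length];
            Random seeds = new Random(id);
            for (long pass = 0; ; pass++) {
                long seed = seeds.nextLong();
                new Random(seed).nextBytes(block);       // write the pattern to memory
                new Random(seed).nextBytes(expected);    // regenerate the same pattern
                if (!Arrays.equals(block, expected)) {   // any mismatch points at bad RAM/CPU
                    System.err.println("worker " + id + " verify FAILED on pass " + pass);
                }
            }
        }
    }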

Also, "runs XP Pro" -- are you serious?

womble
  • Regarding the XP Pro: we only needed an OS that could run a JVM, end of story. XP Pro is an OS our developers are all familiar with, and it's cheap (we pay about $99/node). We don't need any of the services typically associated with a server OS. Basically, all we needed was an OS that runs a good JVM, has a TCP/IP stack, and boots. We considered running FreeBSD as our node OS but rejected it simply because our personnel are more used to Windows. – SvrGuy Dec 29 '09 at 17:04

Change the builder, but keep the brand.

Really, Supermicro hardware is very good. If you're getting such high failure rates, I'd first suspect that the build guys are messing it up.

Javier
  • I can't make a call re: Supermicro being "really good" or not -- I have no experience. The symptoms, though, make me wonder whether the builder had a failure in their electrostatic discharge protection system or procedures (a new employee who didn't observe protocol, a failed ground connection on the ESD mats, etc.). – Evan Anderson Dec 29 '09 at 00:12
  • I have had some experience with Supermicro; their servers are good enough. Such a failure rate must be caused by something else, e.g. overheating or, more probably, the build guys. – Taras Chuhay Dec 29 '09 at 10:33

Supermicro is a very reliable brand, from the motherboards to their full solutions.

A good builder should stand behind their work and help you out however possible. Going with a major brand like Dell or HP will get you the same thing.

As for the configuration type: the more heat you have in one spot, the higher the failure rate can be, so 2 nodes in a 1U are going to put off more heat than 1 node in a 2U. If you have enough cooling in your rack, though, this shouldn't be a factor at all.

Ryan Gibbons

One nice thing about Dell is that they build your servers to spec, and they do it in a very clean environment, which adds to the longevity of their servers. In my experience, never opening a server adds to longevity; I'd say that if a server works after the first year, it's likely to keep working for a long time. Further, you want to keep your servers in a good data center that provides a good environment, both electrically and physically. Steady temperatures matter: varying temperatures kill hardware much faster.

As for form factor, any decent supplier like the well-known brand names constructs their systems in such a manner as to negate the majority of form-factor effects. Personally, I'd say it doesn't matter, although that isn't entirely true: Dell, HP and IBM are well known for flaming each other's blade center designs. :-) But I dare say they are all pretty darn good anyway, so at the end of the day it's their hardware replacement plans and TCO that matter, as long as it's a serious corporation.

We stick with Dell because they're cheaper than IBM and HP and, in my experience, have very low failure rates because of the way they distribute their stuff (build to spec and ship). This also saves me a bunch of time. Last time I shopped HP, I bought some 30 blades with assorted disks, storage, etc. It was delivered as some 316 boxes; Dell would ship it as more like 10. :-) I don't like spending three hours unboxing hardware, then having to drag it into the data center and get it into racks (because that's the only safe place to leave hardware anyway).

As far as temperature goes, I'd look into the 55xx-series Xeon CPUs, especially the L variants. They are highly energy-efficient, usually running at around 60 watts.

And, hehe, yes, what's with the XP? Are you running your web servers on XP Pro? :-)

Rune Nilssen
  • Regarding the XP Pro 64: we only needed an OS that could run a JVM, end of story. XP Pro is an OS our developers are all familiar with, and it's cheap. We don't need any of the services typically associated with a server OS. I have never heard a real, concrete reason to switch to another server OS: lots of "FreeBSD is rock solid," etc., but no one has ever followed that up with a solid "because XYZ" or provided real failure data. We did performance tests comparing Linux/FreeBSD and Windows XP Pro, and XP Pro actually came out on top for our workload on identical hardware. – SvrGuy Dec 29 '09 at 17:10
  • Fair enough :-) I've just never used XP Pro in a production environment like this. But there's at least one advantage with most Linux distros (FreeBSD is, IMHO, high-maintenance), and that's rarely having to reboot them on a regular basis due to updates, except for kernel updates of course. Furthermore, most distros tend to be quick on their feet rolling out fixes for security issues, while MS sometimes takes a while to do so. They're also low-cost, depending on the distro you prefer. It really depends, though; in a setup as large as yours, I can see several server features I'd probably use. :) – Rune Nilssen Dec 29 '09 at 18:12

The selling point for me when buying hardware from large OEMs is the fact that, as opposed to smaller vendors, large OEMs build thousands of machines every day and have their manufacturing/assembly process fine-tuned to a science. They have parts manufacturers and engineers at their beck and call, and have parts depots and service technicians in every major metro area. Not only is the equipment "road tested" before it's delivered to you, it comes with thousands of man-hours of experience and engineering behind it. IMHO this translates into reliability, stability, and consistency.

joeqwerty

One thing I don't like about lower-end hardware is ventilation. With high-density 1U or 2U servers, fans (and lots of them) are critical, and so are thermal zones. The IBM/HP/Dell servers have this down to a science, and they also have numerous temperature and fan-speed sensors, plus management software that will alert you if something is out of whack.

If you already have all of this covered, I wouldn't focus on switching hardware brands.

Most good servers are rated for up to about 95 degrees F (35 degrees C) inlet temperature, but it can quickly get much hotter than that in a rack or enclosure with poor ventilation.
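
The alerting side doesn't have to be fancy, either. As a toy sketch (assuming whatever monitoring agent you already have can dump the inlet temperature to a text file; the path and threshold below are made up):

    import java.io.BufferedReader;
    import java.io.FileReader;

    // Toy threshold check: read an inlet temperature (degrees F) that some existing
    // monitoring agent is assumed to write to a text file, and complain if it is
    // over the rated limit. The file path and threshold are illustrative only.
    public class InletTempCheck {
        private static final double MAX_INLET_F = 95.0;

        public static void main(String[] args) throws Exception {
            BufferedReader reader = new BufferedReader(new FileReader("C:\\monitor\\inlet_temp_f.txt"));
            double tempF = Double.parseDouble(reader.readLine().trim());
            reader.close();
            if (tempF > MAX_INLET_F) {
                System.err.println("ALERT: inlet temp " + tempF + " F exceeds " + MAX_INLET_F + " F");
                System.exit(1);   // non-zero exit so a scheduler or monitor can react
            }
            System.out.println("Inlet temp OK: " + tempF + " F");
        }
    }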

Greg Askew