22

When specifying servers, like (I would assume) many engineers who aren't experts in storage, I'll generally play it safe (and perhaps be a slave to marketing) by standardising on a minimum of 10k SAS drives (which are therefore "enterprise"-grade with a 24x7 duty cycle, etc.) for "system" data (usually the OS and sometimes applications), and reserve 7.2k midline/nearline drives for storage of non-system data where performance isn't a significant factor. This all assumes 2.5" (SFF) disks, as 3.5" (LFF) disks are only really relevant to high-capacity, low-IOPS requirements.

In situations where there isn't a massive amount of non-system data, I'll generally place it on the same disks/array as the system data, meaning the server only has 10k SAS drives (generally a "One Big RAID10" type of setup these days). Only if the size of the non-system data is significant do I usually consider putting it on a separate array of 7.2k mid/nearline disks to keep the cost/GB down.

This has led me to wonder: in some situations, could those 10k disks in the RAID10 array have been replaced with 7.2k disks without any significant negative consequences? In other words, am I sometimes over-spec'ing (and keeping the hardware vendors happy) by sticking to a minimum of 10k "enterprise"-grade disks, or is there a good reason to always stick to that as a minimum?

For example, take a server that acts as a hypervisor with a couple of VMs for a typical small company (say 50 users). The company has average I/O patterns with no special requirements. Typical 9-5, Mon-Fri office, with backups running for a couple of hours a night. The VMs could perhaps be a DC and a file/print/app server. The server has a RAID10 array with 6 disks to store all the data (system and non-system data). To my non-expert eye, it looks as though mid/nearline disks may do just fine. Taking HP disks as an example:

  • Workload: Midline disks are rated for <40% workload. With the office only open for 9 hours a day and average I/O during that period unlikely to be anywhere near maximum, it seems unlikely the workload would go over 40%. Even with a couple of hours of intense I/O at night for backups, my guess is it would still be below 40%
  • Speed: Although the disks are only 7.2k, performance is improved by spreading the I/O across six spindles (see the rough numbers sketched after this list)
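
To put rough numbers on both bullets, here's a back-of-the-envelope sketch. The busy-hours profile and the per-drive IOPS figures (~75 for 7.2k, ~140 for 10k) are rules of thumb I'm assuming for illustration, not anything from HP's quickspecs:

```python
# Back-of-the-envelope only: the busy-hours profile and per-drive IOPS
# figures below are assumptions/rules of thumb, not vendor numbers.

def duty_cycle(profile):
    """Fraction of a 24h day the disks are busy.
    profile: list of (hours, fraction_busy) tuples."""
    return sum(hours * busy for hours, busy in profile) / 24

# Assumed profile: 9 office hours at ~25% busy, 2 backup hours at ~100% busy.
print(f"Estimated workload: {duty_cycle([(9, 0.25), (2, 1.0)]):.0%}")  # ~18%

def raid10_iops(disks, per_disk_iops):
    """Rough random IOPS for RAID 10: reads hit all spindles, writes pay a
    penalty of 2 because each write lands on both halves of a mirror."""
    return disks * per_disk_iops, disks * per_disk_iops // 2  # (reads, writes)

print("6x 7.2k (~75 IOPS/disk):  reads/writes ~", raid10_iops(6, 75))   # (450, 225)
print("6x 10k  (~140 IOPS/disk): reads/writes ~", raid10_iops(6, 140))  # (840, 420)
```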

So, my question: is it sensible to stick to a minimum of 10k SAS drives, or are 7.2k midline/nearline disks actually more than adequate in many situations? If so, how do I gauge where the line is and avoid being a slave to ignorance by playing it safe?

My experience is mostly with HP servers, so the above may have a bit of an HP slant to it, but I would assume the principles are fairly vendor-independent.

dbr
  • Sure, I can certainly add some more specifics, although it may take me 24hrs or so. I tried to ask a question that was specific enough to be able to answer in a meaningful way, but also general enough to try and determine some basic principles. I should have mentioned, I'm assuming SFF rather than LFF disks. – dbr Jan 18 '16 at 00:36
  • SFF 7.2k midline disks make no sense because of capacity and duty limitations. If you're talking about HP equipment _(my specialty)_, 900GB and 1.2TB 10k SAS drives will be the best option if you're not using SSDs. If you are in the US, 900GB SAS should be ~$300-400 if you have a good vendor. – ewwhite Jan 18 '16 at 00:44
  • From a vendor perspective, these are things that are a function of what makes sense with manufacturer options. Very few 3.5"-equipped servers exist from the major manufacturers (Dell/HP). However, companies that are more DIY or use Supermicro, for instance, will have more chassis flexibility. – ewwhite Jan 18 '16 at 01:10
  • Minor grammatical complaint: if you say "substitute X for Y", that implies you had Y to start with and are replacing it with X. – pjc50 Jan 18 '16 at 12:07
  • @pjc50, I agree, it's pretty confusing whether dbr wants to replace 10k drives with 7.2k drives or vice versa. I queued an edit to fix it. – poolie Jan 18 '16 at 15:58
  • Are you sure you live in 2015? For some years now my OS drive has been a small SSD (saves power, etc.), and I wouldn't touch any HDD for high performance either. – TomTom Jan 18 '16 at 17:48
  • @TomTom No, I'm in 2016 :) In all seriousness, I've not really considered it. As I said in my post, I'll generally go for a "one big RAID 10" approach these days, so the OS will go on there. Separating out the OS onto a separate SSD seems wasteful if it's not really necessary. I'd be interested to hear your thoughts. Would you use a single SSD or a mirrored pair? Perhaps this would make a good SF question by itself... – dbr Jan 18 '16 at 19:37
  • Mirrored pair for OS. HP even sell OS/boot-specific SSDs. – ewwhite Jan 18 '16 at 19:47
  • It's probably my inexperience again, but I'm struggling to see the justification for the extra cost of having a separate pair of SSDs for the OS if it would already be on fairly fast HDDs (in my case, typically a RAID10 array with 4, 6 or more disks). I could perhaps understand it if the server only has the host OS locally and looks to shared storage for everything else. – dbr Jan 18 '16 at 19:58
  • Coming from a cost-sensitive job, I had a budget of $X and a requirement to store Y TB. Another question was: do I need more physical spindles (are more, smaller drives better for my workload), or can I get away with larger storage per spindle and fewer of them? – Criggie Jan 18 '16 at 20:41
  • Cost does not matter - it is a few USD for a machine worth thousands. We're talking about a small SSD or a pair, 120GB generally. But flexibility does. One RAID 10 is not suitable for everything. I assume you have never seen a machine with 20+ SSDs split into various workload areas for a high-performance database server? – TomTom Jan 19 '16 at 07:06
  • tbh 3.5" 7.2k nearline drives are still way ahead on size/power and size/cost, and combined with SSD cache/tiers they make a good combination. At LFF, reliability and warranties are the same. But as @ewwhite says, the numbers simply don't add up for the 2.5" drives. – JamesRyan Jan 19 '16 at 11:27
  • @TomTom I totally appreciate that SSDs make complete sense for some high performance workloads like a database server, but we're talking about the boot volume on a typical server – dbr Jan 22 '16 at 19:31
  • @dbr And let me guess - you never bothered to even look up the price difference between a small "boot grade" SSD and a hard disc? Because it's pretty much non-existent. But the advantages are huge - during patch day, for example. – TomTom Jan 24 '16 at 20:48
  • @TomTom Yikes, that was a bit strong! I did take a look at the RRPs using iQuote, and the prices were higher, but perhaps that's not the best place to look. I've just done some Googling and it looks as though iQuote isn't showing all the disks that Quickspecs suggest should be available. Also, as I mentioned before, I'll tend to go for a "one big RAID 10" type of approach rather than boot + data. Without doing the sums, I would have thought switching to using dedicated boot SSDs would increase the cost. – dbr Jan 24 '16 at 20:59
  • @dbr The problem with the "one big RAID 10" is that it is not flexible. And it would be quite huge. For me "a big RAID" has a LOT of discs - my file servers are normally built on a 2+24 layout to start (2 boot discs, 24 slots for storage discs). Smaller machines have no discs except boot and store stuff on the network. I seriously do not want the OS being impacted when the storage network overloads with IO - and I do want patch days to come and go as non-events. The price for an SSD or two is trivial for anything but trivial servers. – TomTom Jan 24 '16 at 21:10
  • @TomTom It's certainly true that I'm generally dealing with fairly simple requirements compared to you and many others on here. I appreciate that more complex requirements mean you can't get away with a simplistic approach. In that situation, a few extra £££ in some areas is much more easily justified by the benefits and reassurance they can bring. Sounds like it's just a case of apples and oranges. – dbr Jan 24 '16 at 21:23

3 Answers

25

There's an interesting intersection of server design, disk technology and economics here:

Also see: Why are Large Form Factor (LFF) disks still fairly prevalent?

  • The move toward dense rackmount and small form-factor servers. E.g. you don't see many tower offerings anymore from the major manufacturers, whereas the denser product lines enjoy more frequent revisions and have more options/availability.
  • Stagnation in 3.5" enterprise (15k) disk development - 600GB 15k 3.5" is about as large as you can go.
  • Slow advancement in 2.5" nearline (7.2k) disk capacities - 2TB is the largest you'll find there.
  • Increased availability and lower pricing of high capacity SSDs.
  • Storage consolidation onto shared storage. Single-server workloads that require high capacity can sometimes be serviced via SAN.
  • The maturation of all-flash and hybrid storage arrays, plus the influx of storage startups.

The above are why you generally find manufacturers focusing on 1U/2U servers with 8-24 2.5" disk drive bays.

3.5" disks are for low-IOPs high-capacity use cases (2TB+). They're best for external storage enclosures or SAN storage fronted by some form of caching. In enterprise 15k RPM speeds, they are only available up to 600GB.

2.5" 10k RPM spinning disks are for higher IOPS needs and are generally available up to 1.8TB capacity.

2.5" 7.2k RPM spinning disks are a bad call because they offer neither capacity, performance, longevity nor price advantages. E.g. The cost of a 900GB SAS 10k drive is very close to that of a 1TB 7.2k RPM SAS. Given the small price difference, the 900GB drive is the better buy. In the example of 1.8TB 10k SAS versus 2.0TB 7.2k SAS, the prices are also very close. The warranties are 3-year and 1-year, respectively.

So for servers and 2.5" internal storage, use SSD or 10k. If you have capacity needs and 3.5" drive bays available internally or externally, use 7.2k RPM.

For the use cases you've described, you're not over-configuring the servers. If they have 2.5" drive bays, you should really just be using 10k SAS or SSD. The midline disks lose on performance and capacity, have a significantly shorter warranty, and won't save much on cost.

ewwhite
  • Thanks for taking the time to put this together. I'll have a chance to give it some proper thought tomorrow. Just having a quick look at prices, it looks like about a 30% jump between the 1TB 7.2k and 900GB 10k, which isn't massive (I'm in the UK if it matters). Could possibly be a factor if you're on a tight budget though where you're trying to make reasonable savings in several places and disk selection is just one of them. I'd be interested to hear what you think about the question from a purely technical perspective too. – dbr Jan 18 '16 at 01:16
  • From a technical perspective, there's no advantage to a 7200 RPM 2.5" disk. If the costs seem too far off, keep shopping. There's little difference in this market. If this is for boot disk purposes, SSD is a good alternative. But I can't think of any reason I'd use an HP 7200 2.5" disk in a server today. Also, read your HP quickspecs closely: midline drives have shorter warranties. – ewwhite Jan 18 '16 at 01:27
  • In general this answer is great. But like with anything else, "it depends." In the example of a 900GB 10k vs 1TB 7200 disk, the 1TB disk will run cooler and therefore perhaps last longer, and will be less expensive. If you don't need the additional performance, then it's a waste of money, both the original capital cost and operations. For one server, it doesn't matter much. For 10, it starts to add up. – Dan Pritts Jan 18 '16 at 04:33
  • Really, the disk that runs slower will last longer? Any article I am missing? – vasin1987 Jan 18 '16 at 10:29
  • @vasin1987 HP warranties are shorter for 7200RPM disks than 10k RPM disks - So yes, that reflects duty cycle, longevity and design decisions. – ewwhite Jan 18 '16 at 12:35
  • @DanPritts The pricing is very close in the example you gave, and we're talking about servers that have advanced comprehensive thermal monitoring and cooling features. Heat is not going to be an issue. A bunch of 7200RPM internal drives is a riskier choice than 10k (or SSD). – ewwhite Jan 18 '16 at 12:42
  • I did a quick price check on current seagate 2.5" drives. the 1TB drive is ~$200, the 900GB drive ~$300. The 900GB uses about 2.5w more "average operating power" also. ... I didn't mean to suggest that the server or drive would overheat - clearly unlikely. But all things being equal, there is less stress on a 7200rpm drive than on a 10k drive; that may lead to longer life. Regardless, I'm willing to concede your point - the faster drive is slightly less risk (likely to perform, not much more likely to fail). Whether that is worth the additional expense is for the reader to decide. – Dan Pritts Jan 18 '16 at 17:22
  • Seagate offers 5 year warranties on both their 7200rpm and 10k sas drives. Warranty length is more likely to be a pricing decision than an expected lifetime decision. I would expect either drive to outlast the warranty. – Dan Pritts Jan 18 '16 at 17:26
  • @ewwhite I appreciate that there's no technical advantage gained from using a 7.2k midline disk over a 10k. My question was more - would 7.2k drives be technically adequate in a situation like the one I describe in my question, or would they never be a good idea? If they are, then if you're on a tight budget, then I can see there being a case for them there. As others have said, lots of small savings add up. – dbr Jan 18 '16 at 19:43
  • @dbr There's no reason to use them. They are technically capable, but not a good purchase. – ewwhite Jan 18 '16 at 19:44
  • @ewwhite because there are options (e.g. 10k SAS) that would much more "comfortably" fit the requirements and are only marginally more expensive? – dbr Jan 18 '16 at 19:49
  • From a vendor/manufacturer's perspective, yes. They are *steering* you to 10k and SSD for 2.5". If you were white-boxing, go 7200 RPM. In fact, my ZFS storage vendor, [PogoStorage](http://www.pogolinux.com/products/osnexus-linux-zfs-storage-servers.php), uses 7200 RPM 2.5" for their ZFS arrays because the caching and SSD tiering eliminate the need to spec faster disks. – ewwhite Jan 18 '16 at 19:53
5

There are at least a few things that could cause problems with SOME drive types:

  • Drives that are not meant to deal with the vibration load of a chassis having many drives (unlikely problem with any drive specified as RAID/NAS-capable)

  • Firmware that does not allow TLER, or needs time-consuming manual reconfiguration of the drive to enable it (ditto)

  • Drives that have never been tested with the RAID controller used, and might have unrecognized bugs that surface in such a setup

  • Internal drive write caches that behave in a way (physical writing is out of order or very delayed) that causes a lot of confusion in case of a hard shutdown (the RAID controller should be configured to force these OFF; a potential problem if the firmware ever ignores that - see untested drives :). A quick way to inspect both this and the TLER setting is sketched after this list

  • Drives might occasionally run internal maintenance routines that could make the drive behave slowly, or respond with enough delay to make the RAID controller think it has failed (related to TLER)

  • SATA in general, as it is usually implemented, has fewer safeguards than SAS against a drive with completely shot or hung electronics hanging everything on the controller (not a theoretical risk; certain disk+controller brand combinations love that failure mode).
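
For the TLER and write-cache points above, the settings can at least be inspected directly on Linux. A minimal check sketch, assuming smartmontools and sdparm are installed and the drives are visible to the OS (drives hidden behind a hardware RAID controller need that controller's passthrough options instead):

```python
# Minimal check sketch: assumes Linux with smartmontools and sdparm installed,
# and drives visible to the OS (not hidden behind a RAID controller).
import subprocess

def report(device):
    # SCT Error Recovery Control: the TLER-style timeout mentioned above.
    subprocess.run(["smartctl", "-l", "scterc", device], check=False)
    # WCE: the drive's internal write-cache enable bit.
    subprocess.run(["sdparm", "--get=WCE", device], check=False)

for dev in ("/dev/sda", "/dev/sdb"):  # hypothetical device names
    report(dev)
```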

rackandboneman
  • These seem like reasons to use drives qualified with the server hardware and application stack, but not specifically about 10k vs 7k2 rpm. – poolie Jan 18 '16 at 16:22
  • The question can easily be (mis?)understood as "can a non-enterprise 7.2k disk, or one designated for single-drive enterprise use, safely be used in this application?". And "safely" would usually imply addressing risks of data loss or failure-related downtime. – rackandboneman Jan 18 '16 at 16:31
4

HUGE issue:

(May be a teeny bit off-topic - but it's important!)

When you are dealing with SSDs (as is often the case, or at least the temptation), a lot of them have a nasty problem: they cannot always recover from spontaneous power outages!

This is a tiny problem with HDDs. HDDs usually have enough capacitance to power their logic and enough angular momentum to carry the platters through finishing off writing a 512-byte block in the event that power is lost mid-write. Once in a rare while this will not work, resulting in something called a "torn write" - where a single block may be partially written. The partial write (albeit rare) will cause a checksum failure on the block - i.e. that individual block will be bad. This can usually be detected as bad by the disk circuitry itself and corrected by the upstream RAID controller.
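
As a toy illustration of that detection (real drives store per-sector ECC rather than a CRC32, but the principle is the same): a torn write leaves a mix of old and new data under a checksum that matches neither, so the block reads back as bad.

```python
# Toy sketch: a torn write is detectable because the sector's stored checksum
# no longer matches its contents (real drives use per-sector ECC, not CRC32).
import zlib

def torn_write(old, new, bytes_written):
    """Simulate power failing after only part of the new sector hit the platter."""
    return new[:bytes_written] + old[bytes_written:]

old_payload, old_crc = b"A" * 512, zlib.crc32(b"A" * 512)
new_payload, new_crc = b"B" * 512, zlib.crc32(b"B" * 512)

torn = torn_write(old_payload, new_payload, 200)
# The stored checksum is, at best, one of the two originals - neither matches.
print(zlib.crc32(torn) in (old_crc, new_crc))   # False -> block reads as bad
```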

SSDs are a different animal. They usually implement something called "wear leveling" - where they don't just write "block X" to a fixed physical location for "block X" like an HDD does. Instead, they try to write to different places on the flash media - and they try to aggregate or combine writes (using a bit of buffering). Writing to different places involves keeping a "map" of where things are written, which is itself buffered and written out in a manner meant to reduce wear. Part of the wear leveling can even involve moving data that's already on the device and hasn't been written recently.

The problem is that when the SSD loses power, it has a lot of data in memory (unflushed), it has some data that has been written out to different/changed locations - and it has these maps in its own memory which need to be flushed out to make any sense of the structure of all the data on the device.

MANY SSDs do not have the logic or circuitry to keep their controllers up and alive long enough after a spontaneous power-out to safely flush all this data to flash before they die. This doesn't just mean that the one block you wrote could now be in jeopardy - other blocks, even all the blocks on the device, could be in trouble. Many devices also have problems where they not only lose all the data on the device, but the device itself becomes bricked and unusable.
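
A deliberately oversimplified toy model of that mapping problem (nothing like real controller firmware, just the shape of the issue) shows why losing the unflushed map is worse than losing one in-flight write:

```python
# Toy illustration only - no real FTL works like this. The point: data is
# written to fresh pages, the logical->physical map lives in RAM, and if the
# map updates never reach flash, even successfully written data is lost.

class ToyFTL:
    def __init__(self):
        self.flash = {}       # physical page -> data (survives power loss)
        self.ram_map = {}     # logical block -> physical page (volatile)
        self.flash_map = {}   # last copy of the map flushed to flash
        self.next_page = 0

    def write(self, lba, data):
        self.flash[self.next_page] = data    # data goes to a fresh page
        self.ram_map[lba] = self.next_page   # map updated in RAM only
        self.next_page += 1

    def flush_map(self):
        self.flash_map = dict(self.ram_map)  # persist the map (costs wear)

    def power_loss(self):
        self.ram_map = {}                    # volatile state is gone

    def read(self, lba):
        mapping = self.ram_map or self.flash_map
        return self.flash.get(mapping.get(lba), "UNREADABLE")

ftl = ToyFTL()
ftl.write(0, "old data"); ftl.flush_map()        # map safely persisted
ftl.write(0, "new data"); ftl.write(1, "other")  # map updates still in RAM
ftl.power_loss()
print(ftl.read(0))  # stale "old data" at best
print(ftl.read(1))  # UNREADABLE: the data hit flash, but the map never did
```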

This is all true in theory - but (working in the storage industry) I/we have seen this happen way too many times on way too many devices - including some of our own personal laptops!

Many vendors have discussed making "enterprise grade SSDs" where they specifically add devices ("super-caps") and other circuitry to allow a clean "flush" - but it's very, very hard to find any device whose datasheet specifically states that it has sufficient, explicit, tested protection against such events.

Obviously, if you buy a "high end storage array" from a top-tier vendor which utilizes flash technology, either the drives or the system as a whole has been designed with all of this taken into account. Make sure it has!

The problem with respect to your question is: if you have a RAID array and several of the disks are the "bad" SSDs without this protection, a "spontaneous power outage" could lose ALL the data on MULTIPLE disks, rendering RAID reconstruction impossible.

"But I use a UPS"

It is also generally important to note that "spontaneous power outage" can include situations like BSODs and kernel locks/crashes/panics - where you have no choice but to pull the plug on the system to recover.

Brad
  • It is rare that someone will pull the plug on a hung system (unless it is trashing the disk) quickly enough to not allow disks of any type to flush their caches. And in that case, conventional HDDs with enabled caches can produce the same mess, albeit with less chance of bricking but still with a significant chance of data corruption - ReiserFS and early NTFS tended to end up shot by that, because they handled journal data being written for an activity that didn't actually happen (or vice versa, both likely with out-of-order cache flushing) VERY badly. – rackandboneman Jan 19 '16 at 00:44
  • A properly designed SSD won't corrupt or lose data in the event that data hasn't been fully flushed. As the physical location of each logical sector can change on every write, the previous version of the data in each logical sector should still exist in the event that the update has not been flushed yet. You can still lose data if the firmware suffers from design flaws or implementation bugs. – kasperd Jan 19 '16 at 08:29
  • @kasperd Consumer SSDs are sold on the basis of speed, and they make compromises to do that. While it should be possible to maintain integrity the way you suggest, the fact is that most manufacturers' drives (at least at consumer level) simply don't. Also, when they reach EoL most don't fail gracefully. – JamesRyan Jan 19 '16 at 11:19
  • @JamesRyan Stories about manufacturers cheating with the flushing of data to persistent storage in order to come out better in some performance metric are not new. We have heard about that happening also in the days of hard disks. It is not because this is what consumers want. It is because consumers only see some of the metrics and don't know how the manufacturer has been cheating in other areas to achieve it. Sometimes manufacturers get away with cheating, sometimes they don't. (I'm sure somebody could come up with a car analogy inspired by recent news.) – kasperd Jan 19 '16 at 12:06
  • SSDs are a different animal. They have map tables that tell WHERE the data is. They are moving and relocating data and adjusting these maps. They NEED to coalesce their writes (i.e. defer, bunch them up & write later) to avoid write amplification. The maps themselves can't be written too aggressively and need to follow these same rules. We can talk about "proper designs" and flaws - but SSDs aren't as "simple" as journaled filesystems (which aren't simple). I'm speaking from a LOT of experience, testing, specifications, and I may or may not have spoken to a manufacturer - or two - or three in my job. – Brad Jan 19 '16 at 14:31
  • If you really care for your data, you use ZFS. :-) – Martin Schröder Jan 20 '16 at 18:32
  • True. Only recently have the key SSD manufacturers started to address this kind of problem in non-enterprise SSDs. Two examples: [recent SanDisk SSDs flush the map table every second](http://www.sandisk.com/Assets/docs/Unexpected_Power_Loss_Protection_Final.pdf) and Micron (from M500/C500) uses small capacitors to protect data-at-rest. While they are not enterprise grade and not 100% protected against data loss, these precautions should prevent (or make very rare) NAND metadata corruption and total disk bricking. – shodanshok Jan 23 '16 at 20:31
  • See also [here](https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf) – shodanshok Jan 23 '16 at 20:40