
The Problem

We have a performance issue on an existing platform, so I'm turning to the hive mind for a second opinion. So far the issue relates to IOPS rather than throughput.

The Scenario

A blade centre of 16 hosts, each with 64GB of RAM (it's a Dell M1000e with M610s, but that's probably not relevant), running 500 VMs. All are web servers or associated web technologies (MySQL, load balancers, etc.); around 90% are Linux and the rest Windows. The hypervisor is VMWare vSphere. We need to provide host HA, so local storage is out; as such, the hosts just boot from an SD card.

A bit of background thinking

At the moment we are up to 6 hosts (at current growth the blade centre will be at full capacity in a year's time) and we are running iSCSI to a Dell MD3220i with an MD1220 for expansion.

Possible options we have considered, and immediate thoughts along with them:

  • Spreading the VMs across NFS datastores, and running NFS storage that meets the performance requirement for up to a given number of VMs. NFS seems cheaper to scale, as well as being abstracted a bit more than block-level storage, so we can move it around as needed.
  • Adding more MD3220i controllers/targets. We are concerned, though, that doing this could have a negative effect on how VMWare handles having lots of targets.
  • Swapping all disks from nearline SAS to SSD. This ought to entirely solve the IOPS issue (see the rough numbers after this list), but has the obvious side effect of slashing our storage capacity, and it's still very expensive.
  • vSphere 5 has a storage appliance. We haven't researched it much yet, so we can't say how well it actually holds up at this scale.
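
As a rough sanity check on the SSD option (these are ballpark per-spindle figures, and I'm assuming something like 24 populated 2.5" bays rather than a counted number):

    24 x 7.2k nearline SAS @ ~75-100 IOPS each  ->  roughly 1,800-2,400 IOPS raw (less after RAID write penalty)
    shared across 500 VMs                        ->  only a handful of IOPS per VM
    24 x SSD @ ~5,000 IOPS each                  ->  well over 100,000 IOPS raw

Even if those numbers are off by a factor of two, the gap explains why the array feels IOPS-bound long before it is throughput-bound.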

The Question

What sort of storage would you run underneath all of that? It wouldn't need to scale to another blade centre; it would just need to provide relatively good performance for all of those VMs.

I'm not looking for "Buy SAN x because it's the best" answers. I'm looking for thoughts on the various SAN technologies (iSCSI, FC, FCoE, InfiniBand, NFS, etc), different types of storage (SATA, SAS, SSD), and methodologies for handling storage for 100s of VMs (Consolidation, Separation, Sharding, etc).

Absolutely any thoughts, links, guides, pointers, etc. are welcome on this. I'd also love to hear thoughts on the options we've already considered above.

Many thanks in advance for any input!

Update 5th March '12

Some fantastic responses so far, thank you very much everyone!

Going by the responses to this question so far, I'm beginning to think the following is the route to take:

  • Tier the available storage to the VMWare cluster and place VM disks on suitable storage for their workloads.
  • Potentially make use of a SAN that is able to manage the placement of data on to suitable storage automagically.
  • InfiniBand looks to be the most cost-effective way to get the required bandwidth with the hosts at full capacity.

It definitely sounds like it would be worth making use of the pre-sales services of a major SAN vendor to get their take on the scenario.

I'm going to continue to consider this problem for a while. In the meantime, any more advice is gratefully received!

SimonJGreen
  • Also, Mellanox has a 40GbE switch/NIC deal that's quite extraordinary, coming very close to InfiniBand in terms of $/performance. At that point I'd consider a Nexenta box with a couple of 40GbE cards as a viable option. – tony roth Mar 08 '12 at 14:54

5 Answers


The key to a good VMWare storage platform is understanding what kind of load VMWare generates.

  • First, since you host a lot of servers, the workload is typically random. There are many IO streams going at the same time, and not many of them can be successfully pre-cached.
  • Second, it's variable. During normal operations you may see 70% random reads; however, the instant you decide to move a VM to a new datastore or something, you'll see a massive 60GB sequential write. If you're not careful about architecture, this can cripple your storage's ability to handle normal IO.
  • Third, a small portion of your environment will usually generate a large portion of the storage workload.

The best way to approach building storage for a VMWare platform is to start with the fundamentals.

  • You need the ability to service a large random read workload, which means smaller, faster drives and possibly SSD. Most modern storage systems allow you to move data around automatically depending on how it's accessed. If you are going to use SSD, you want to ensure this is how you use it: as a way of gradually reducing hot-spots. Whether you use SSD or not, it helps to be able to spread all the work across all the drives, so look for something with storage pooling.
  • You need the ability to service intermittent large writes, which doesn't care as much about the spindle speed of the underlying drives, but does care about the controller stack's efficiency and the size of the cache. If you have mirrored caching (which is not optional unless you're willing to go back to backups whenever you have a controller failure), the bandwidth between the two caches used for mirroring will usually be your bottleneck for large sequential writes. Ensure that whatever you get has a high-speed controller (or cluster) interconnect for write caching. Do your best to get a high-speed front-end network with as many ports as you can while remaining realistic on price. The key to good front-end performance is to put your storage load across as many front-end resources as possible.
  • You can seriously reduce costs by having a tier for low priority storage, as well as thin provisioning. If your system isn't automatically migrating individual blocks to cheap large/slow drives (like nearline SAS or SATA with 7200 RPM and 2TB+ sizes), try to do it manually. Large slow drives are excellent targets for archives, backups, some file systems, and even servers with low usage.
  • Insist that the storage is VAAI integrated so that VMWare can de-allocate unused parts of the VMs as well as the datastores.
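
As a quick sanity check once the array is in place, the hosts will tell you whether the VAAI primitives are actually being offered; on an ESXi 5 shell it's something along these lines (the device identifier below is a placeholder):

    # Show VAAI support (ATS/Clone/Zero/Delete) for one device; the naa. ID is a placeholder
    esxcli storage core device vaai status get -d naa.60012345000000000000000000000001

    # Or dump the status for every device and look for "Delete Status: supported"
    esxcli storage core device vaai status get
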
Basil

My big VMWare deployments are NFS and iSCSI over 10GbE. That means dual-port 10GbE HBAs in the servers, as well as in the storage head. I'm a fan of ZFS-based storage for this. In my case it's wrapped around commercial NexentaStor, but some choose to roll their own.

The key feature of ZFS-based storage in this context is the ARC/L2ARC caching functionality, which lets you tier storage: the most active data finds its way into RAM (the ARC), with SSD as a second tier (the L2ARC). Running your main storage pool off 10k or 15k SAS drives would also be beneficial.
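
As a rough illustration of how that tiering is expressed on the ZFS side (pool and device names below are placeholders, not from my environment):

    # Add an SSD as L2ARC read cache to an existing pool
    zpool add tank cache c1t5d0

    # Add a mirrored SSD pair as a dedicated log (ZIL/SLOG) to absorb sync writes, which NFS generates a lot of
    zpool add tank log mirror c1t6d0 c1t7d0

    # Check the resulting layout
    zpool status tank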

This is another case of profiling and understanding your workload. Work with someone who can analyze your storage patterns and help you plan. On the ZFS/NexentaStor side, I like PogoStorage. Without that type of insight, the transport method (FC, FCoE, iSCSI, NFS) may not matter. Do you have any monitoring of your existing infrastructure? What does I/O activity look like now?

ewwhite
  • How big are these deployments out of curiosity? And what sort of workload? – SimonJGreen Mar 05 '12 at 23:35
  • Multiple hosts. Largest has 90 mixed-use VMs, including Linux, Windows infra (File/AD/Exchange), VDI and database systems. The RAM on the storage units is high (96GB+) and I have 1.2TB of L2ARC read cache on enterprise SSDs. – ewwhite Mar 05 '12 at 23:50
  • You'll have to forgive my ignorance here, and to be clear I don't doubt you're doing the right thing. Why do you have that much RAM in the storage units? Is it used for buffers? – SimonJGreen Mar 05 '12 at 23:56
  • Ah, I've just read about ZFS and ARC/L2ARC. That is awesome sauce :) – SimonJGreen Mar 05 '12 at 23:56

The key question is: "where's the bottleneck?" You mention IOPS, but does that mean that you've positively identified the disks themselves as the bottleneck, or merely that the SAN ports aren't running at capacity, or that the VMs are in far more iowait than you'd like?

If you've definitely identified that the disks are the limiting factor, then switching to NFS or infiniband or whatever isn't going to do squat for your performance -- you need SSDs (or at least tiered storage with SSDs in the mix) or a whole bundle more spindles (a solution which has itself gotten a whole lot more expensive recently since the world's stepper motor production got washed into the ocean).

If you're not 100% sure where the bottleneck actually is, though, you need to find that first -- swapping out parts of your storage infrastructure more-or-less at random based on other people's guesses here isn't going to be very effective (especially given how expensive any changes are going to be to implement).
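
For example, a first pass at that measurement might be as simple as the following (intervals and output paths are arbitrary placeholders); the point is to look at device latency on the hosts and iowait inside the guests before spending a cent:

    # On each ESXi host: capture 10-second samples for an hour in batch mode,
    # then compare DAVG (device latency) with KAVG (kernel/queueing) for the storage adapters
    esxtop -b -d 10 -n 360 > /vmfs/volumes/datastore1/esxtop-capture.csv

    # Inside a busy Linux guest: watch await and %util per block device
    iostat -x 10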

womble
  • Absolutely correct; I always assume that the person posting the question has done their homework. That said, after doing quite a few performance consultations I mostly just give up and say add more or faster drives, and more than 98% of the time the problem is resolved. The other 2% is overcommitted beyond belief. – tony roth Mar 04 '12 at 18:53
  • "I always assume that the person posting the question has done their homework" -- baaaaaad assumption... – womble Mar 04 '12 at 19:46
  • This answer is perfect. On many occasions I have set out to tackle a problem like this one with some preconceived notion of what the problem was. Nine times out of ten it ends in tears when I learn that I simply did not know enough about the problem. Carefully profile, determine what the bottleneck is, and then proceed. You can ask the "hive mind" for help, or you can turn to a SAN vendor for assistance. Also, if you are having trouble profiling, NetApp and/or EMC will be glad to help you figure your stats out and then size a solution for you. Both have good software for doing this. – SvrGuy Mar 04 '12 at 21:54
  • I was basing this diagnosis on the combined output of `esxtop` on all hosts (showing disk utilisation), taking the total CMD/s and comparing that to benchmarks on the SAN we use. The total CMD/s is consistently high when taking the benchmark results as a headline. SSDs definitely seem to be a good option from a tech perspective; they're just still horrendously expensive per GB. Tiered storage might be a solution, though. On a side note/FYI, according to a recent press release I received, WD are back up to production levels on disks. – SimonJGreen Mar 05 '12 at 23:39
  • How was the benchmark on the SAN done? The limiting factor *could* still be network, as opposed to the disks themselves. At least you've got a benchmark to start from, though, if you want to start playing with different things to make things run faster, which is crucially important. – womble Mar 06 '12 at 07:59

If you want iSCSI or NFS then, at a minimum, you'll want a few 10/40Gb ports or InfiniBand, which is the cheapest option by far, although native storage solutions for InfiniBand seem to be limited. The issue will be the I/O module options for the blade centre: usually 8Gb FC or 10/1GbE, and maybe InfiniBand. Note that InfiniBand can be used with NFS, and nothing comes close to it in terms of performance/price. If the blade centre supports QDR InfiniBand, I'd do that with a Linux host of some kind with a QDR InfiniBand TCA, serving NFS. Here's a good link describing this: http://www.zfsbuild.com/2010/04/15/why-we-chose-infiniband-instead-of-10gige

In short, if the blade centre can support QDR InfiniBand and you can afford native InfiniBand, then that's the solution you should pick.

Currently you can get 40GbE switches far cheaper (that's a strange thought) than 10GbE switches, but I doubt your blade centre will support that.
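
Whichever transport you land on, presenting the NFS export to the hosts is the easy part; on ESXi 5 it's roughly this (the address, export path and datastore name are placeholders):

    # Mount the NFS export as a datastore on each host
    esxcli storage nfs add --host=192.168.10.10 --share=/export/vmstore --volume-name=nfs-tier1

    # Confirm it's mounted and accessible
    esxcli storage nfs list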

tony roth
  • These are the connectivity options for the blade centre: http://www.dell.com/us/enterprise/p/poweredge-m1000e/pd InfiniBand does look good, and at this quantity of guest VMs the cost is justifiable. What would you do on the SAN side? – SimonJGreen Mar 03 '12 at 19:17
  • Whatever Dell has that supports InfiniBand should be your SAN solution. – tony roth Mar 03 '12 at 20:41
  • It doesn't look like Dell has any IB-based storage, so I'd think that option might be a stretch in this case. Both Sun and SGI have IB-based SANs; not sure what their costs are. – tony roth Mar 03 '12 at 21:06
  • They don't offer IB storage, but they do offer IB connectivity. I have no qualms with using another storage vendor, we have no love for Dell in that regard. – SimonJGreen Mar 05 '12 at 23:35
  • Then either Sun or SGI will have a solution for you; not sure what the current model numbers are. – tony roth Mar 06 '12 at 13:27

Local storage is out? I am quite happy with the write throughput on my local RAID 5s, mirrored with DRBD8 to the cluster partner of my Xen machine... (but this is "not supported", of course).

Aside from that, I am quite sure that MySQL is your performance problem (I've never seen a worse DB). Try to tune it away and/or try to put the whole DB into the filesystem cache (for read access)...
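
If the MySQL VMs are running InnoDB, the single biggest lever for read IO is usually the buffer pool; a minimal my.cnf sketch (the sizes are illustrative and depend on the RAM each VM actually has):

    [mysqld]
    # Serve hot data from memory instead of hitting the shared storage
    innodb_buffer_pool_size = 1G
    # Trade up to ~1 second of transactions on crash for far fewer forced flushes
    innodb_flush_log_at_trx_commit = 2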

Nils
  • The OP has an existing VMWare solution and is running with diskless hosts. Local storage does not make sense. – ewwhite Mar 04 '12 at 21:54
  • Local storage might include using local storage blades as well. But VMWare won't support that, I suppose. – Nils Mar 04 '12 at 21:58
  • I don't believe that Dell offers local storage blades, and I'm not sure I've seen them from anyone else. I've seen drive blades that attach to a single blade, not ones that offer storage to everything in the chassis. You'd need an interconnect for that; it would essentially be a chassis-local SAN, right? – mfinni Mar 05 '12 at 15:51
  • Sorry @Nils, I'm pretty sure you didn't read the question properly. – SimonJGreen Mar 06 '12 at 00:01
  • Nils, looking at the D2200sb: "The enclosure backplane provides a PCI Express connection to the adjacent c-Class server blade and enables high performance storage access without any additional cables. ... Use the HP P4000 Virtual SAN Appliance Software (VSA) to turn the D2200sb into an iSCSI SAN for use by all servers in the enclosure and any server on the network." – mfinni Mar 06 '12 at 01:10
  • So, that is a drive blade that attaches to a single server, and you can run VSA software on that blade to make an iSCSI SAN. So, I think I'm still correct. It's using the built-in ethernet as the interconnect and a server running VSA as the iSCSI block device. Otherwise, it's just local storage for the server that it's next to, via the PCIe connector between the two slots. – mfinni Mar 06 '12 at 01:14
  • It is a SAS array that connects to a SAS fabric. That SAS fabric forwards the disk traffic to individual blades. That is comparable to a small local SAN, comparable to the "low priority storage" outlined in Basil's answer. – Nils Mar 15 '12 at 20:24