
I'm still new to ZFS. I've been using Nexenta but I'm thinking of switching to OpenIndiana or Solaris 11 Express. Right now, I'm at a point of considering virtualizing the ZFS server as a guest within either ESXi, Hyper-V or XenServer (I haven't decided which one yet - I'm leaning towards ESXi for VMDirectPath and FreeBSD support).

The primary reason is that I seem to have enough resources to go around that I could easily run one to three other VMs concurrently. Mostly Windows Server. Maybe a Linux/BSD VM as well. I'd like the virtualized ZFS server to host all the data for the other VMs so their data could be kept on physically separate disks (the ZFS disks), mounted via iSCSI or NFS.

The server currently has an AMD Phenom II with 6 total cores (2 unlocked), 16GB RAM (maxed out) and an LSI SAS 1068E HBA with (7) 1TB SATA II disks attached (planning on RAIDZ2 with a hot spare). I also have (4) 32GB SATA II SSDs attached to the motherboard. I'm hoping to mirror two of the SSDs as a boot mirror (for the virtual host) and leave the other two SSDs for ZIL and L2ARC (for the ZFS VM guest). I'm willing to add two more disks to store the VM guests and allocate all seven of the current disks as ZFS storage. Note: the motherboard does not have IOMMU support (the 880G chipset doesn't support it), but I do have an 890FX board that does, if IOMMU makes a big difference.

My questions are:

1) Is it wise to do this? I don't see any obvious downside (which makes me wonder why no one else has mentioned it). I feel like I could be making a huge oversight, and I'd hate to commit to this and move over all my data only to have it go FUBAR over some minute detail I missed.

2) ZFS virtual guest performance? I'm willing to take a small performance hit, but I'd think that if the VM guest has full access to the disks, the disk I/O performance hit would at the very least be negligible (compared to running ZFS non-virtualized). Can anyone speak to this from experience hosting a ZFS server as a VM guest?

osij2is
  • You say you want to host data for all the other VMs. Do you foresee yourself wanting deduplication at some point? If so, this should really be on its own machine, as deduplication is very memory intensive. Why not take a look at something like SmartOS for your ZFS needs? That way you get a hypervisor too. – devicenull Jun 14 '12 at 01:57
  • I've thought about dedupe but for the immediate time being, no, I'd rather not use it. I'll investigate SmartOS. I haven't heard of it so I'll check that out. – osij2is Jun 14 '12 at 15:06

1 Answer


I've built a number of these "all-in-one" ZFS storage setups. Initially inspired by the excellent posts at Ubiquitous Talk, my solution takes a slightly different approach to the hardware design, but yields the same result: encapsulated, virtualized ZFS storage.

To answer your questions:

  • Determining whether this is a wise approach really depends on your goals. What are you trying to accomplish? If you have a technology (ZFS) and are searching for an application for it, then this is a bad idea. You're better off using a proper hardware RAID controller and running your VMs on a local VMFS partition. It's the path of least resistance. However, if you have a specific reason for wanting to use ZFS (replication, compression, data security, portability, etc.), then this is definitely possible if you're willing to put in the effort.

  • Performance depends heavily on your design, regardless of whether you're running on bare metal or virtualized. Using PCI passthrough (or AMD IOMMU in your case) is essential, as it gives your ZFS VM direct access to the SAS storage controller and disks. As long as your VM is allocated an appropriate amount of RAM and CPU resources, performance is near-native. Of course, your pool design matters: consider mirrors over RAIDZ2, since ZFS scales performance across vdevs, not across the number of disks (a quick sketch of the two layouts follows this list).
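To make the vdev point concrete, here is a rough sketch of the two layouts using your seven data disks. The pool and device names are placeholders for illustration; substitute whatever format reports on your system. Striped mirrors, three vdevs plus a hot spare, where random IOPS scale with the vdev count:

# zpool create tank mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0 mirror c2t4d0 c2t5d0 spare c2t6d0

versus a single RAIDZ2 vdev, which yields more usable space but roughly the random IOPS of a single disk:

# zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0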


My platform is VMware ESXi 5 and my preferred ZFS-capable operating system is NexentaStor Community Edition.

This is my home server. It is an HP ProLiant DL370 G6 running ESXi from an internal SD card. The two mirrored 72GB disks in the center are linked to the internal Smart Array P410 RAID controller and form a VMFS volume. That volume holds a NexentaStor VM. Remember that the ZFS virtual machine needs to live somewhere on stable storage.

There is an LSI 9211-8i SAS controller connected to the drive cage housing six 1TB SATA disks on the right. It is passed through to the NexentaStor virtual machine, which sees the raw disks and arranges them as a RAID 1+0 (striped mirror) pool. The disks are el-cheapo Western Digital Green WD10EARS drives, aligned properly with a modified zpool binary.
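For reference, you can check a pool's alignment with zdb; an ashift of 12 means 4K-aligned, 9 means 512-byte sectors (the pool name here is a placeholder):

# zdb -C tank | grep ashift

Newer OpenZFS releases let you force this at pool creation with -o ashift=12, but the illumos/NexentaStor builds of this era didn't expose that option, hence the modified zpool binary for these 4K-sector drives.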

I am not using a ZIL device or any L2ARC cache in this installation.


The VM has 6GB of RAM and 2 vCPUs allocated. In ESXi, if you use PCI passthrough, a memory reservation for the full amount of the VM's assigned RAM will be created.

I give the NexentaStor VM two network interfaces. One is for management traffic. The other is part of a separate vSwitch and has a vmkernel interface (without an external uplink). This allows the VM to provide NFS storage mountable by ESXi through a private network. You can easily add an uplink interface to provide access to outside hosts.
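Under the hood, that arrangement amounts to something like the following; the dataset name, subnet, and addresses are made up for illustration, and the export path follows the dataset's mountpoint. On the storage VM:

# zfs create tank/vmstore
# zfs set sharenfs=rw=@172.16.0.0/24,root=@172.16.0.0/24 tank/vmstore

Then on the ESXi host, mount it as an NFS datastore across the private vSwitch:

# esxcli storage nfs add -H 172.16.0.10 -s /tank/vmstore -v zfs-vmstore

In practice the NexentaStor GUI handles the share for you; the commands just show what it's doing.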

Install your new VMs on the ZFS-exported datastore. Be sure to set the "Virtual Machine Startup/Shutdown" parameters in ESXi. You want the storage VM to boot before the guest systems and shut down last.



Here are bonnie++ and iozone results from a run directly on the NexentaStor VM. ZFS compression is disabled for the test so the numbers are more representative, but in practice the default ZFS compression (not gzip) should always be enabled.

# bonnie++ -u root -n 64:100000:16:64

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
saint           12G   156  98 206597  26 135609  24   410  97 367498  21  1478  17
Latency               280ms    3177ms    1019ms     163ms     180ms     225ms
Version  1.96       ------Sequential Create------ --------Random Create--------
saint               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
    64:100000:16/64  6585  60 58754 100 32272  79  9827  58 38709 100 27189  80
Latency              1032ms     469us    1080us     101ms     375us   16108us

# iozone -t1 -i0 -i1 -i2 -r1m -s12g

    Iozone: Performance Test of File I/O

    Run began: Wed Jun 13 22:36:14 2012

    Record Size 1024 KB
    File size set to 12582912 KB
    Command line used: iozone -t1 -i0 -i1 -i2 -r1m -s12g
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Throughput test with 1 process
    Each process writes a 12582912 Kbyte file in 1024 Kbyte records

    Children see throughput for  1 initial writers  =  234459.41 KB/sec
    Children see throughput for  1 rewriters        =  235029.34 KB/sec
    Children see throughput for  1 readers          =  359297.38 KB/sec
    Children see throughput for 1 re-readers        =  359821.19 KB/sec
    Children see throughput for 1 random readers    =   57756.71 KB/sec
    Children see throughput for 1 random writers    =  232716.19 KB/sec

This is a NexentaStor DTrace graph showing the storage VM's IOPS and transfer rates during the test run: 4,000 IOPS and 400+ megabytes/second are quite reasonable for such low-end disks, though the large block size helps.
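The graph comes from NexentaStor's built-in analytics, but if you just want a quick IOPS spot-check from the storage VM's shell, a DTrace one-liner along these lines works (a rough sketch, not the tool that drew the chart):

# dtrace -qn 'io:::start { @iops = count(); } tick-1sec { printa("IOPS: %@d\n", @iops); trunc(@iops); }'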

Other notes.

  • You'll want to test your SSDs to see whether they can be passed through to a VM individually, or whether DirectPath grabs the entire motherboard controller.
  • You don't have much CPU power, so limit the storage unit to 2 vCPUs.
  • Don't use RAIDZ1/Z2/Z3 unless you really need the disk space.
  • Don't use deduplication. Compression is free and very useful for VMs. Deduplication would require much more RAM + L2ARC in order to be effective.
  • Start without the SSDs and add them later if necessary; certain workloads never touch the ZIL or L2ARC (a sketch of adding them follows this list).
  • NexentaStor is a complete package, and there's a benefit to having a solid management GUI; that said, I've heard of success with Napp-It as well.
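If you do end up needing the SSDs, adding them later is non-disruptive; roughly, with placeholder pool and device names, one SSD as a log device and one as cache:

# zpool add tank log c4t0d0
# zpool add tank cache c4t1d0

Enabling the default compression is a one-liner as well:

# zfs set compression=on tank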
ewwhite
  • +1. Thanks for all the information! To answer your question, I'm doing this for a few reasons. I'm largely doing this to leverage the other CPU cores to run one or two other VMs (not doing ZFS) and to provide an iSCSI target to my Opteron virtual server. My reasons for ZFS are (in no particular order) compression and data security & replication. Dedupe looks very cool, but in terms of resources and my data, I'm not sure if it's necessary. I'm using Nexenta right now but I was considering moving to Solaris Express or OpenIndiana if I continue to pile on the disks to exceed the 18TB limit. – osij2is Jun 14 '12 at 15:45
  • So, I understand your comment on whether or not to use the SSDs for L2ARC or ZIL and I'm willing to do just that. See the performance first, THEN determine whether to add ZIL and/or ARC. As for mirroring vs. RAIDZ, after reading your comments and reading this blog post (http://constantin.glez.de/blog/2010/01/home-server-raid-greed-and-why-mirroring-still-best) I guess mirroring holds a slight edge. I don't really need the disk space, but if I can have some redundancy & fast read/write capabilities, I think I'll switch to that. Whatever storage space I could eke out really wouldn't be worth it. – osij2is Jun 14 '12 at 15:54
  • Plus, remember that the compression is useful. I do pay for commercial Nexenta for client systems and anything larger than 18TB. But the same tips apply to OpenIndiana. – ewwhite Jun 14 '12 at 16:13
  • Are you using an E1000 vnic or a VMXNet3 vnic for the NFS network? Because [I'm only getting 1gbps between Nexenta/Solaris and VMware](http://serverfault.com/questions/537532/why-am-i-getting-only-1gbps-between-solaris-and-vmware) using a similar setup and can't figure out how to get more speed. What version of NexentaStor? I suspect the version they currently have available is broken... – Josh Sep 10 '13 at 16:56