25

I'm looking at building a largish ZFS pool (150TB+), and I'd like to hear people's experiences with data loss scenarios due to failed hardware, in particular distinguishing between instances where just some data is lost vs. the whole filesystem (or if there even is such a distinction in ZFS).

For example: let's say a vdev is lost due to a failure like an external drive enclosure losing power, or a controller card failing. From what I've read, the pool should go into a faulted mode, but if the vdev is returned, does the pool recover? Or not? And if the vdev is partially damaged, does one lose the whole pool, just some files, etc.?
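For concreteness, here is roughly what I imagine the recovery would look like once the enclosure comes back (the pool name is made up); I'd love to know whether this is realistic:

```
# Hypothetical pool name. After power/cabling to the enclosure is restored:
zpool status -v tank     # is the pool ONLINE, DEGRADED, or FAULTED?
zpool clear tank         # clear device error counters once the vdev is back
zpool scrub tank         # verify and repair anything written while degraded

# If the pool won't come back cleanly:
zpool import             # list pools that can be imported
zpool import -F tank     # recovery mode: roll back the last few transactions
```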

What happens if a ZIL device fails? Or just one of several ZILs?

Truly any and all anecdotes or hypothetical scenarios backed by deep technical knowledge are appreciated!

Thanks!

Update:

We're doing this on the cheap since we are a small business (9 people or so) but we generate a fair amount of imaging data.

The data is mostly smallish files, by my count about 500k files per TB.

The data is important but not uber-critical. We are planning to use the ZFS pool to mirror a 48TB "live" data array (in use for 3 years or so), and use the rest of the storage for 'archived' data.

The pool will be shared using NFS.

The rack is supposedly on a building backup generator line, and we have two APC UPSes capable of powering the rack at full load for 5 mins or so.

ewwhite
Cyclone
  • If you don't already know what you are doing, get a consultant and/or take some courses. I doubt all the specifics you need can be covered in one simple answer. – Lucas Kauffman Jul 24 '12 at 13:01
  • So you're still planning on using cheapo consumer 7.2k SATAs then? *sigh* – Chopper3 Jul 24 '12 at 13:08
  • @Chopper3 Actually, I intentionally didn't say that... I'm giving serious consideration to buying 2TB SAS drives instead of 3TB SATA drives, though I've seen plenty of people say they've been using SATA drives just fine. – Cyclone Jul 24 '12 at 19:47
  • SATA disks for ZFS are not really a good mix. You won't find many people recommending that setup nowadays. At the scale you're talking about (150TB), it's an expensive and unnecessary mistake. [Take a look at this, though](http://blog.solori.net/2010/09/17/quick-take-zfs-and-early-disk-failure/). – ewwhite Jul 24 '12 at 20:39

2 Answers

21

Design the right way and you'll minimize the chances of data loss in ZFS. You haven't explained what you're storing on the pool, though. In my applications, it's mostly serving VMWare VMDKs and exporting zvols over iSCSI. 150TB isn't a trivial amount, so I would lean on a professional for scaling advice.

I've never lost data with ZFS.

I have experienced just about everything else: faulted zpools, bad NICs, and problem SATA disks that eventually pushed me toward SAS.

But through all of that, there was never an appreciable loss of data. Just downtime. For the VMWare VMDKs sitting on top of this storage, an fsck or reboot was often necessary following an event, but no worse than any other server crash.

As for ZIL device loss, it depends on the design, what you're storing, and your I/O and write patterns. The ZIL devices I use are relatively small (4GB-8GB) and function like a write cache. Some people mirror their ZIL devices; with the high-end STEC SSD devices, mirroring becomes cost-prohibitive, so I use single DDRDrive PCIe cards instead. Plan for battery/UPS protection and use SSDs or PCIe cards with super-capacitor backup (similar to RAID controller BBWC and FBWC implementations).
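For illustration, attaching a mirrored log device looks roughly like this (pool and device names are just examples); with the mirror in place, losing a single ZIL device is a non-event:

```
# Example pool/device names only. Attach a mirrored SLOG so a single
# log-device failure cannot take the ZIL with it:
zpool add tank log mirror c4t0d0 c4t1d0

# Check pool and log-device health:
zpool status tank

# If one log device fails, replace it in place:
zpool replace tank c4t0d0 c4t2d0
```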

Most of my experience has been on the Solaris/OpenSolaris and NexentaStor side of things. I know people use ZFS on FreeBSD, but I'm not sure how far behind the zpool versions and other features are. For pure storage deployments, I'd recommend going the NexentaStor route (and talking to an experienced partner), as it's a purpose-built OS and there are more critical deployments running on Solaris derivatives than on FreeBSD.

ewwhite
  • I updated my question with some more info, but I'm particularly interested in knowing more details regarding: 'never an appreciable loss of data', and what that means/involved. Also interested in knowing more about recovering those faulted zpools, handling the bad NICs, and even the problems with the SATA drives and switching over to SAS (though you'll be happy to know, I'll likely go with 2TB SAS over 3TB SATA, on your recommendation). – Cyclone Jul 24 '12 at 20:45
  • Non-appreciable-loss == a few seconds of transactional data, or a [crash-consistent state](http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/293-crash-no-goo.html). And the bad NICs were isolated to a single VMWare host and resulted in problems at the VM-level. Not the underlying ZFS storage. – ewwhite Jul 24 '12 at 21:15
  • The `design the right way` link is broken now. – Saurabh Nanda Jan 30 '19 at 18:31
11

I accidentally overwrote both ZILs on the last version of OpenSolaris, which caused the entire pool to be irrevocably lost. (Really bad mistake on my part! I didn't understand that losing the ZIL would mean losing the pool. Fortunately recovered from backup with downtime.)

Since version 151a, though (I don't know offhand what zpool version that corresponds to), this problem has been fixed, and I can testify that it works.
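For anyone hitting this today, the recovery on those newer releases looks roughly like this (the pool and device names are just examples): the pool can be imported even with its separate log device missing, at the cost of whatever synchronous writes were still sitting in the lost ZIL.

```
# Example pool name. With the log device gone, a plain import refuses;
# -m imports anyway, discarding anything left in the missing ZIL:
zpool import -m tank

# Then drop the dead log device and attach a fresh (mirrored) one:
zpool status tank
zpool remove tank c3t0d0          # the failed/missing log device
zpool add tank log mirror c3t1d0 c3t2d0
```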

Other than that, I've lost ZERO data on a 20TB server, including through several further cases of user error, multiple power failures, disk mismanagement, misconfigurations, numerous failed disks, etc. Even though the management and configuration interfaces on Solaris change frequently and maddeningly from version to version and present a significant, ever-shifting skills target, it is still the best option for ZFS.

Not only have I not lost data on ZFS (after my terrible mistake), but it constantly protects me. I no longer experience data corruption, which had plagued me for the last 20 years on any number of servers and workstations, with what I do. Silent (or just "pretty quiet") data corruption has killed me numerous times, when the data rolls off the backup rotation but has in fact become corrupt on disk, or in other scenarios where the backups backed up the corrupt versions. That has been a far bigger problem for me than losing data in a big way all at once, which is almost always backed up anyway. For this reason, I just love ZFS and can't comprehend why checksumming and automatic healing haven't been standard features in file systems for a decade. (Granted, truly life-or-death systems usually have other ways of ensuring integrity, but still, enterprise data integrity is important too!)
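To make that concrete, the routine that does the protecting for me looks roughly like this (the pool name is just an example): every read is checksum-verified on the fly, and a periodic scrub walks the whole pool and repairs anything redundancy can fix.

```
# Example pool name. Walk every block in the pool, verify checksums,
# and rewrite bad copies from a healthy mirror/parity copy:
zpool scrub tank

# Show per-device READ/WRITE/CKSUM error counters and list any files
# that could not be repaired:
zpool status -v tank
```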

Word to the wise: if you don't want to descend into ACL hell, don't use the CIFS server built into ZFS. Use Samba. (You said you use NFS, though.)

I disagree with the SAS vs. SATA argument, at least the suggestion that SAS is always preferred over SATA for ZFS. I don't know whether that comment was referring to platter rotation speed, presumed reliability, interface speed, or some other attribute (or maybe just "they cost more and are generally not used by consumers, therefore they are superior"). A recently released industry survey (still in the news, I'm sure) found that SATA actually outlives SAS on average, at least within the survey's significant sample size, which frankly shocked me. I can't recall whether that covered "enterprise" versions of SATA, or consumer models, or which speeds, but in my considerable experience, enterprise and consumer models fail at statistically equivalent rates. (There is the problem of consumer drives taking too long to time out on failure, which is definitely important in the enterprise, but that hasn't bitten me, and I think it is more relevant to hardware controllers that could take the entire volume offline in such cases. That's not a SAS vs. SATA issue, and ZFS has never failed me over it.) As a result of that experience, I now use a mix of 1/3 enterprise and 2/3 consumer SATA drives.

Furthermore, I've seen no significant performance hit with this mix of SATA when configured properly (e.g. as a stripe of three-way mirrors), but then again I have a low IOPS demand, so depending on how large your shop is and your typical use cases, YMMV. I've definitely noticed that per-disk built-in cache size matters more for my latency issues than platter rotational speed does.
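For reference, by "a stripe of three-way mirrors" I mean a layout roughly like this (pool and device names here are just placeholders): each top-level vdev is a three-disk mirror, and ZFS stripes writes across the vdevs.

```
# Placeholder pool/device names. Two top-level vdevs, each a 3-way mirror;
# data is striped across the vdevs, and each mirror survives two disk failures:
zpool create tank \
    mirror c1t0d0 c1t1d0 c1t2d0 \
    mirror c1t3d0 c1t4d0 c1t5d0

zpool status tank
```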

In other words, it's an envelope with multiple parameters: cost, throughput, IOPS, type of data, number of users, administrative bandwidth, and common use-cases. To say that SAS is always the right solution is to disregard a large universe of permutations of those factors.

But either way, ZFS absolutely rocks.

bubbles
  • Thanks for taking the time to respond. Your experience with ZFS is consistent with mine. My comments on drive selection were specifically about nearline SAS versus SATA disks. The main difference is the interface; they're mechanically equivalent. The best practice in ZFS-land now is to avoid SATA because of the dual-ported interfaces, better error correction, and manageable timeouts that SAS offers. – ewwhite Aug 27 '12 at 21:17
  • I ended up going with 3TB SAS disks but.... before doing so I cobbled together 30 or so mixed disks (5 400GB SATA, 12 750GB SATA, 14 1TB SAS) that I put into the same SAS-expander enclosure; really a worst-case scenario. These drives also had ~2-3 years of runtime already. I then wrote a program that ran 8 threads randomly reading, writing, and deleting files on the pool (something like the sketch below) and ran it for over a week, reading and writing >100TB to the pool with no problems, averaging 100-400MB/sec R/W. I suspect the warnings about using SATA instead of SAS might be old news now. – Cyclone Aug 29 '12 at 04:27
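(Not the exact program referenced above, which wasn't posted; just a rough sketch of that kind of stress harness, with made-up paths and sizes.)

```
#!/usr/bin/env bash
# Rough sketch only; /tank/stress and the file sizes are made up.
# Launch 8 workers that loop forever, randomly writing, reading back,
# and deleting files on the pool. Stop with: kill $(cat /tmp/stress.pids)

TESTDIR=/tank/stress
mkdir -p "$TESTDIR"
: > /tmp/stress.pids

worker() {
    local i=$1
    while :; do
        local f="$TESTDIR/worker$i.$RANDOM.$RANDOM"
        # Write 1-100 MB of random data
        dd if=/dev/urandom of="$f" bs=1024k count=$(( (RANDOM % 100) + 1 )) 2>/dev/null
        # Read the file back and discard the data
        dd if="$f" of=/dev/null bs=1024k 2>/dev/null
        # Delete roughly half of the files we create
        (( RANDOM % 2 == 0 )) && rm -f "$f"
    done
}

for i in {1..8}; do
    worker "$i" &
    echo $! >> /tmp/stress.pids
done
wait
```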