4k hard drives, FreeBSD, gpart, and ZFS



I have three hard drives, each with the following camcontrol identify output:

root@cirmos:/root # camcontrol identify ada1
pass2: <WDC WD10EZEX-00RKKA0 80.00A80> ATA-8 SATA 3.x device
pass2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)

protocol              ATA/ATAPI-8 SATA 3.x
device model          WDC WD10EZEX-00RKKA0
firmware revision     80.00A80
serial number         WD-WMC1S4587539
WWN                   50014ee003930f6e
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       1953525168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6 

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes  yes
write cache                    yes  yes
flush cache                    yes  yes
overlap                        no
Tagged Command Queuing (TCQ)   no   no
Native Command Queuing (NCQ)   yes      32 tags
SMART                          yes  yes
microcode download             yes  yes
security                       yes  no
power management               yes  yes
advanced power management      no   no
automatic acoustic management  no   no
media status notification      no   no
power-up in Standby            yes  no
write-read-verify              no   no
unload                         no   no
free-fall                      no   no
data set management (TRIM)     no
root@cirmos:/root # 

As you can see above, the sector size is detected as:

sector size           logical 512, physical 4096, offset 0

There are already some topics here on 4k drive tuning. I want to create a ZFS pool (raidz) from the three drives above, and I have the following questions:

  1. Are these drives 4k drives? (I ask because the physical sector size is 4k but the logical size is reported as 512 bytes.)
  2. What is the recommended gpart layout for the above drives to get correct alignment? (I want to create one freebsd-zfs partition.)
  3. Is there any zpool tuning I should consider? (The root, system, and swap will not be on these drives; they are only for "pure" file storage and home directories.)


Posted 2013-06-17T13:19:41.040

Reputation: 807



Starting with point 2: by all best practices, ZFS should be "fed" whole drives to manage. No special partitioning is required.

As to the rest of it:

This link has a lot of useful hints, of which I'll repeat some.

Each vdev (like a mirror or raidz) has a single ashift. ashift=9 is 512 byte sectors, ashift=12 is 4k sectors. (calculated as 2^ashift=sector-size)
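The relationship 2^ashift = sector size can be sketched as a small shell function. This is only an illustration: ashift_for is a hypothetical helper name, not something ZFS or FreeBSD ships.

```shell
# ashift_for SECTOR_SIZE: print the exponent n such that 2^n = SECTOR_SIZE.
# Hypothetical helper for illustration; not part of ZFS or FreeBSD.
ashift_for() {
    size=$1
    shift_bits=0
    # Reject zero, negatives, and anything that is not a power of two.
    if [ "$size" -le 0 ] || [ $(( size & (size - 1) )) -ne 0 ]; then
        echo "sector size must be a positive power of two" >&2
        return 1
    fi
    # Halve the size until it reaches 1, counting the steps.
    while [ "$size" -gt 1 ]; do
        size=$(( size / 2 ))
        shift_bits=$(( shift_bits + 1 ))
    done
    echo "$shift_bits"
}

ashift_for 512    # prints 9
ashift_for 4096   # prints 12
```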

To help ensure future-forward compatibility, without having to destroy and recreate the pool later, it is generally recommended to use ashift=12 regardless of actual drive capabilities (since it can't be changed after vdev creation).

From the link:

# gnop create -S 4096 ada0
# zpool create tank raidz ada0.nop ada1 ada2
# zdb | grep ashift
     ashift: 12

The gnop command creates a forced 4k-alignment passthrough device for ada0 as ada0.nop. Then the pool is created, and ZFS uses ashift=12 for the whole vdev. With the pool/vdev created, it is recommended to get rid of the ada0.nop passthrough device.

# zpool export tank
# gnop destroy ada0.nop
# zpool import tank

Now the pool will import with devices ada0, ada1, and ada2. And it will still have the locked-in ashift=12 that it was created with.

That's it. With ZFS managing the whole drives, you're set and ready to go.
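If you would rather script the ashift check than read the zdb output by eye, something along these lines may work. It is a rough sketch: ashift_ok is a made-up helper name, and the sample text stands in for real zdb output so the snippet can be run anywhere.

```shell
# ashift_ok MIN: read `zdb` output on stdin; succeed only if at least one
# "ashift: N" line is present and every N is >= MIN.
# Hypothetical helper for illustration.
ashift_ok() {
    awk -v min="$1" '
        $1 == "ashift:" { seen = 1; if ($2 + 0 < min) bad = 1 }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}

# Sample text stands in for real `zdb` output here.
if printf '     ashift: 12\n' | ashift_ok 12; then
    echo "all vdevs use ashift >= 12"
else
    echo "at least one vdev has a smaller ashift (or none was found)"
fi
```

On a live system the input would presumably come from zdb itself, e.g. zdb | ashift_ok 12.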


Posted 2013-06-17T13:19:41.040

Reputation: 1 886

The "best practice" of "feeding ZFS the whole disk" is a Solaris-ism that does not apply to FreeBSD. Under Solaris, ZFS would disable caching if "fed" a partition vs. a real disk. FreeBSD has no such issue. – Adam Strohl – 2015-04-17T05:27:50.567

@AdamStrohl Meh. (that's a technical term.) If you think that proposing to feed ZFS partitions instead of whole drives is a great idea, then propose it as an answer, with sources that say it is such a [great idea] (inside or outside FreeBSD). (hint, it isn't anything resembling a great (or even good) idea.) – killermist – 2015-04-23T16:07:12.743

@killermist I've submitted it as an answer. I've run across a few places where "partitions are fine" has been clearly stated: http://lists.freebsd.org/pipermail/freebsd-questions/2013-January/%E2%80%A6 and http://forums.freebsd.org/threads/zfs-and-disk-labeling-question.33896 among others.

Additionally we have dozens of clients with ZFS under partitions in production spanning many, many servers. What specifically "isn't a great idea" about this? Maybe I can do some testing?

– Adam Strohl – 2015-04-24T19:00:22.437

By giving ZFS the entire drive, you also accept a risk: if you later need to replace a failed drive and the replacement has even one sector fewer than the old disk, you won't be able to use it.

IMO, a better practice is to create a partition that's slightly smaller than the entire disk. Such a margin would allow you to use other disk models of the "same" capacity. – Marcin Kaminski – 2014-03-03T18:08:26.293

Meh. But to elaborate, how often do replacement drives (especially with time being an element of consideration) DECREASE in size? – killermist – 2014-04-17T00:03:22.773

I've had this happen to me twice in the past. I'd rather set it up in a way that takes the guesswork out of the process in the future. – Marcin Kaminski – 2014-04-22T13:04:50.767


Are those 4k drives? Yes, you can see that they report 4096 byte physical which is the indicator for this. The 512 byte logical reporting is a result of drive manufacturers' attempt at backwards compatibility (and thus confuses things).
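As a rough illustration of reading that line programmatically, the sketch below pulls the logical and physical sizes out of camcontrol identify output and names the combination. The function names sector_sizes and classify_drive are hypothetical, and the labels 512e, 4Kn, and 512n are the usual industry shorthand rather than anything camcontrol prints.

```shell
# sector_sizes: read `camcontrol identify` output on stdin and print
# "LOGICAL PHYSICAL" taken from its "sector size" line.
# Hypothetical helper for illustration.
sector_sizes() {
    sed -n 's/^sector size *logical \([0-9][0-9]*\), physical \([0-9][0-9]*\),.*/\1 \2/p'
}

# classify_drive LOGICAL PHYSICAL: label the drive by its sector sizes.
classify_drive() {
    if [ "$1" -eq 512 ] && [ "$2" -eq 4096 ]; then
        echo "512e (4k physical, 512-byte emulation)"
    elif [ "$1" -eq 4096 ] && [ "$2" -eq 4096 ]; then
        echo "4Kn (4k native)"
    elif [ "$1" -eq 512 ] && [ "$2" -eq 512 ]; then
        echo "512n (512-byte native)"
    else
        echo "other"
    fi
}

# Sample line copied from the question; on a live system you would pipe
# `camcontrol identify ada1` into sector_sizes instead.
sample='sector size           logical 512, physical 4096, offset 0'
sizes=$(printf '%s\n' "$sample" | sector_sizes)
logical=${sizes%% *}
physical=${sizes##* }
classify_drive "$logical" "$physical"
```

For the drives in the question this lands in the first branch, i.e. 4k physical sectors behind 512-byte emulation.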

gpart? In your situation I would use the following commands to gpart out the disk:

# -- Force ashift to be at least 12
sysctl vfs.zfs.min_auto_ashift=12;

# -- Create GPT tables
gpart create -s gpt ada0 &&
gpart create -s gpt ada1 &&
gpart create -s gpt ada2;

# -- Create partitions, align start/stop to 1 MiB boundaries
gpart add -a 1m -t freebsd-zfs -l disk0 ada0 && 
gpart add -a 1m -t freebsd-zfs -l disk1 ada1 && 
gpart add -a 1m -t freebsd-zfs -l disk2 ada2;

# -- Not needed under FreeBSD 10.1 but sometimes is on
#    older versions to get /dev/gpt to update.
#    Run if you don't see /dev/gpt/disk0 etc devices:
true > /dev/ada0; true > /dev/ada1; true > /dev/ada2;

# -- Create temporary GNOP 4k devices
gnop create -S 4k /dev/gpt/disk0 &&
gnop create -S 4k /dev/gpt/disk1 &&
gnop create -S 4k /dev/gpt/disk2;

# --  Create the zpool
zpool create -f -m /mnt zstorage raidz /dev/gpt/disk0.nop /dev/gpt/disk1.nop /dev/gpt/disk2.nop;

# -- Export (same pool name as created above)
zpool export zstorage;

# -- Remove temporary GNOP devices
gnop destroy /dev/gpt/disk0.nop &&
gnop destroy /dev/gpt/disk1.nop &&
gnop destroy /dev/gpt/disk2.nop;

# -- Bring back pool with "real" devices
zpool import -d /dev/gpt zstorage;

# -- Check status
zpool status;

# -- Verify ashift is 12
zdb | grep ashift

gpart-ing has no performance penalty or drawbacks that we're aware of or have seen. We have had this deployed in dozens of production locations for many, many years. It also confers the following advantages:

  • You can label (the -l) partitions (e.g. disk0, disk1) so you know which disks are which, even if their port numbers change (ada0 might not always be disk0). gpart show -l will show the GPT table with those labels.
  • While not applicable to you, it lets you boot off ZFS and also have swap partitions (e.g. using GMIRROR) on the same disks.
  • Due to 1 MiB alignment, you end up with a little bit of free space at the end of the disk because your partition is rounded to 1 MiB. This avoids a situation where you replace a drive with a different vendor and it ends up being ever-so-slightly smaller and thus unusable.
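
To make the 1 MiB rounding concrete, here is a back-of-the-envelope sketch using the sector count from the question. The GPT overhead figures (first usable LBA of 40, 33 sectors reserved at the end for the backup table) are assumptions for illustration; check gpart show on your own disks for the real layout.

```shell
# All values are in 512-byte logical sectors; 1 MiB = 2048 such sectors.
ALIGN=2048

# align_up N A: smallest multiple of A that is >= N.
align_up()   { echo $(( ( ($1 + $2 - 1) / $2 ) * $2 )); }
# align_down N A: largest multiple of A that is <= N.
align_down() { echo $(( ($1 / $2) * $2 )); }

total=1953525168              # "LBA48 supported" value from the question
first_usable=40               # assumption: first usable LBA after the GPT
end_usable=$(( total - 33 ))  # assumption: backup GPT takes the last 33 sectors

start=$(align_up "$first_usable" "$ALIGN")
end=$(align_down "$end_usable" "$ALIGN")     # exclusive end of the partition

echo "partition start:   $start"
echo "partition sectors: $(( end - start ))"
echo "unused tail:       $(( total - end )) sectors"
```

Under these assumptions the partition starts at sector 2048 and the unused tail is 1456 sectors (728 KiB), which is the small margin that helps when a replacement drive is marginally smaller.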

You'll also notice that the first step above is sysctl vfs.zfs.min_auto_ashift=12; and the last step is to check the resulting ashift. Under ZFS, ashift=9 is the default, which is appropriate for 512-byte disks, but on 4k disks it causes write amplification and a loss of performance, similar in effect (though not in cause) to partition misalignment. We've seen cases where, for unknown reasons, ZFS does not pick ashift=12 automatically even with GNOP, so the sysctl forces the issue. This page describes the whole thing nicely: http://savagedlight.me/2012/07/15/freebsd-zfs-advanced-format/

Tuning? Depends on your work load. We now enable LZ4 compression on all new deployments as it has proven to have negligible overhead at worst and at best increases performance drastically for compressible files.

# -- Set compression on
zfs set compression=lz4 zstorage;

# -- See compression performance
zfs get used,compressratio,compression,logicalused zstorage;

The only "down side" is that this will affect benchmarking: bonnie++ will report some insane(ly awesome) numbers when compression is turned on, and they likely don't reflect real-world performance. The same goes for dd if=/dev/zero of=test.dat style benchmarking.
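
A quick way to see why zero-filled test files are misleading is to compare how zeros and random bytes compress. In this sketch gzip stands in for LZ4 only because it is installed almost everywhere; the ratios will differ from what ZFS reports, but the effect is the same in kind.

```shell
# Compare how well zero-filled and random data compress.
# gzip is a stand-in for LZ4 here (an assumption made for portability).
workdir=$(mktemp -d)

# 64 KiB of zeros and 64 KiB of random bytes.
dd if=/dev/zero    of="$workdir/zero.dat"   bs=1024 count=64 2>/dev/null
dd if=/dev/urandom of="$workdir/random.dat" bs=1024 count=64 2>/dev/null

# Compressed size in bytes of each file.
zero_packed=$(gzip -c "$workdir/zero.dat"   | wc -c | tr -d ' ')
random_packed=$(gzip -c "$workdir/random.dat" | wc -c | tr -d ' ')

echo "zeros  compress to $zero_packed bytes"
echo "random compress to $random_packed bytes"

rm -rf "$workdir"
```

The zero-filled file shrinks to almost nothing while the random one barely changes, so a benchmark that writes from /dev/zero onto a compressed dataset mostly measures the compressor, not the disks.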

Adam Strohl

Posted 2013-06-17T13:19:41.040

Reputation: 141