21

We are looking into using BtrFS on an array of SSD disks and I have been asked to verify that BtrFS does in fact perform TRIM operations upon deleting a file. So far I have been unable to verify that the TRIM command is sent to the disks.

I know BtrFS is not considered production ready, but we like the bleeding edge, therefore I'm testing it. The server is Ubuntu 11.04 server 64-bit release (mkfs.btrfs version 0.19). I have installed the Linux 3.0.0 kernel as the BtrFS changelog states that bulk TRIM is not available in the kernel shipped with Ubuntu 11.04 (2.6.38).

Here's my testing methodology (initially adopted from http://andyduffell.com/techblog/?p=852, with modifications to work with BtrFS):

  • Manually TRIM the disks before starting: for i in {0..10} ; do let A="$i * 65536" ; hdparm --trim-sector-ranges $A:65535 --please-destroy-my-drive /dev/sda ; done
  • Verify the drive was TRIM'd: ./sectors.pl | grep + | tee sectors-$(date +%s)
  • Partition the drive: fdisk /dev/sda
  • Make the file system: mkfs.btrfs /dev/sda1
  • Mount: sudo mount -t btrfs -o ssd /dev/sda1 /mnt
  • Create a file: dd if=/dev/urandom of=/mnt/testfile bs=1k count=50000 oflag=direct
  • Verify the file is on the disk: ./sectors.pl | tee sectors-$(date +%s)
  • Delete the test file: rm /mnt/testfile
  • See that the test file is TRIM'd from the disk: ./sectors.pl | tee sectors-$(date +%s)
  • Verify the TRIM'd blocks: diff the two most recent sectors-* files (see the sketch just after this list)
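For the diff step, one way to do it in a single command (a minimal sketch, assuming the sectors-* files are the only files with that prefix in the current directory and their names contain no spaces):

# compare the two most recent sector maps; positions that change from '+' to '.' were TRIM'd
diff $(ls -t sectors-* | head -2 | tac)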

At this point, the pre-delete and post-delete verifications still show the same disk blocks in use. I should instead see a reduction in the number of in-use blocks. Waiting an hour after the test file is deleted (in case it takes a while for the TRIM command to be issued) still shows the same blocks in use.

I have also tried mounting with the -o ssd,discard options, but that doesn't seem to help at all.
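For reference, the active mount options can be double-checked after mounting (a minimal sketch; the /mnt mount point matches the steps above):

sudo mount -t btrfs -o ssd,discard /dev/sda1 /mnt
grep /mnt /proc/mounts    # the options field should include "discard"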

Partition that was created from fdisk above (I keep the partition small so the verification can go faster):

root@ubuntu:~# fdisk -l -u /dev/sda

Disk /dev/sda: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x6bb7542b

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1              63      546209      273073+  83  Linux

My sectors.pl script (I know this is inefficient, but it gets the job done):

#!/usr/bin/perl -w

use strict;

# Walk a range of LBAs with `hdparm --read-sector` and print a map of the
# device: '.' for sectors containing only zeros, '+' for sectors holding data.
my $device = '/dev/sda';
my $start  = 0;
my $limit  = 655360;

foreach ($start..$limit) {
    # Start a new row every 50 sectors, prefixed with the sector number.
    printf "\n%6d ", $_ if !($_ % 50);
    my @sector = `/sbin/hdparm --read-sector $_ $device`;
    my $status = '.';
    foreach my $line (@sector) {
        chomp $line;
        next if $line eq '';
        next if $line =~ /$device/;
        next if $line =~ /^reading sector/;
        # Any hex-dump line that is not all zeros means the sector holds data.
        if ($line !~ /0000 0000 0000 0000 0000 0000 0000 0000/) {
            $status = '+';
        }
    }
    print $status;
}
print "\n";

Is my testing methodology flawed? Am I missing something here?

Thanks for the help.

Shane Meyers
  • I wholly support testing bleeding edge things, but just so you know, as of right now, btrfs doesn't have an fsck that actually, you know, fixes things: https://btrfs.wiki.kernel.org/index.php/Main_Page - so just watch out for that. – Matt Simmons Sep 02 '11 at 01:10
  • @Matt - Good point about the missing fsck. My understanding is that the first version of an fsck should ship within the next few weeks, so we should be covered by the time we move this to production. Additionally, we'll have multiple copies of our data, so if we lose one copy, we have at least two more copies to restore from. But I fully agree that this is not the file system for people with irreplaceable data for now. – Shane Meyers Sep 02 '11 at 16:10
  • Probably won't change anything, but you might as well try running a `sync` after rm'ing the file. – zebediah49 Sep 02 '11 at 18:16
  • I want to say that I tried running a `sync` after removing the file and the results were still the same. I will double check that though when I'm back in the office after the weekend is over. – Shane Meyers Sep 03 '11 at 05:31
  • if you don't mind bleeding edge, have you considered http://zfsonlinux.org/ ? native (i.e. in kernel, not fuse) ZFS for linux. they're close to an official "release", and have RCs available (including a PPA for Ubuntu - easy enough to rebuild for debian too) – cas Sep 15 '11 at 13:13

6 Answers

4

So after many days working on this, I was able to demonstrate that BtrFS does use TRIM. I was unable to get TRIM to work successfully on the server that we will be deploying these SSDs to. However, when testing using the same drive plugged into a laptop, the tests succeeded.

Hardware used for all of this testing:

  • Crucial m4 SSD 512GB
  • HP DL160se G6
  • LSI LSISAS9200-8e HBA
  • generic SAS enclosure
  • Dell XPS m1210 laptop

After many failed attempts at verifying BtrFS on the server, I decided to try this same test using an old laptop (removing the RAID card layer). The initial attempts at this test using both Ext4 and BtrFS on the laptop failed (data not TRIM'd).

I then upgraded the SSD drive firmware from version 0001 (as shipped out of the box) to version 0009. The tests were repeated with Ext4 and BtrFS and both filesystems successfully TRIM'd the data.

To ensure the TRIM command had time to run, I did a rm /mnt/testfile && sync && sleep 120 before performing validation.
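btrfs also has its own sync subcommand, which I believe forces a commit of the filesystem (a sketch, assuming your btrfs-progs build includes it; this is in addition to the plain sync above):

btrfs filesystem sync /mnt    # ask btrfs to commit the filesystem mounted at /mnt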

One thing to note if you're attempting this same test: SSDs have erase blocks that they operate on (I don't know the size of the Crucial m4 erase blocks). When the file system sends the TRIM command to the drive, the drive will only erase a complete block; if the TRIM command is specified for a portion of a block, that block will not be TRIM'd due to the remaining valid data within the erase block.
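If you want to see what the kernel thinks the drive's discard characteristics are, the block layer exposes them in sysfs (a sketch; note the reported granularity is not necessarily the physical erase block size, and 0 generally means no discard support along that path):

cat /sys/block/sda/queue/discard_granularity   # smallest discard unit the device reports, in bytes
cat /sys/block/sda/queue/discard_max_bytes     # largest single discard request the device accepts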

To demonstrate what I'm talking about, here is the output of the sectors.pl script above. Periods are sectors that contain only zeros; pluses are sectors with one or more non-zero bytes.

Test file on drive:

24600 .......................................+++++++++++
24650 ++++++++++++++++++++++++++++++++++++++++++++++++++
24700 ++++++++++++++++++++++++++++++++++++++++++++++++++
    -- cut --
34750 ++++++++++++++++++++++++++++++++++++++++++++++++++
34800 ++++++++++++++++++++++++++++++++++++++++++++++++++
34850 +++++++++++++++++++++++++++++.....................

Test file deleted from drive (after a sync && sleep 120):

24600 .......................................+..........
24650 ..................................................
24700 ..................................................
    -- cut --
34750 ..................................................
34800 ..................................................
34850 ......................+++++++.....................

It appears that the first and last sectors of the file fall within different erase blocks from the rest of the file, so some sectors were left untouched.

A takeaway from this: some Ext4 TRIM testing instructions ask the user to verify only that the first sector of the file was TRIM'd. The tester should view a larger portion of the test file to really see whether the TRIM was successful.
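One way to do that is to count non-zero sectors across the whole map instead of eyeballing it (a sketch; sectors-before and sectors-after are placeholder names for two saved runs of sectors.pl):

grep -o '+' sectors-before | wc -l    # non-zero sectors before the delete
grep -o '+' sectors-after | wc -l     # should drop sharply if TRIM worked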

Now to figure out why manually issued TRIM commands sent to the SSD through the RAID card work but automatic TRIM commands do not...

Shane Meyers
  • I thought all HW RAID ate trim commands, nice to see that things are slowly changing. On the other hand, with good modern drives, TRIM matters less and less. – Ronald Pottol Nov 28 '11 at 22:40
4

Based on what I've read, there may be a flaw in your methodology.

You are assuming that TRIM will result in your SSD zeroing the blocks that have been deleted. However, this is often not the case.

That only happens if the SSD implements TRIM so that it zeroes the discarded blocks. You can at least check whether the device reports discard_zeroes_data:

cat /sys/block/sda/queue/discard_zeroes_data
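Depending on your util-linux version, lsblk can show the same discard parameters in one place (a sketch; the --discard option may not exist on older releases such as the one shipped with Ubuntu 11.04):

lsblk --discard /dev/sda    # prints discard granularity, max discard bytes, and whether discarded data reads back as zeros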

Also, even if the SSD does zero on discard, it may take some time -- well after the discard has completed -- for the SSD to actually zero the blocks (this is true of some lesser-quality SSDs).

http://www.redhat.com/archives/linux-lvm/2011-April/msg00048.html

BTW, I was looking for a reliable way to verify TRIM and haven't found one yet. I'd love to know if anyone finds a way.

chrishiestand
3

Here is a testing methodology for Ubuntu 10.10 and Ext4. Maybe it'll help.

https://askubuntu.com/questions/18903/how-to-enable-trim

Oh, and I think you do need the discard parameter on the fstab mount. Not sure if the ssd param is needed, as I think it should auto-detect SSDs.
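Something like this in /etc/fstab, for example (a sketch; the UUID is a placeholder for your own filesystem's UUID, and the mount point matches the question):

UUID=<your-btrfs-uuid>  /mnt  btrfs  defaults,ssd,discard  0  0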

Dave Veffer
  • I have attempted to follow Ext4 SSD verification instructions, but they don't work due to differences in how BtrFS works compared to other file systems. Hence the workflow I came up with. I used the `ssd` mount option to ensure that BtrFS knew to use its SSD-specific code even though it should auto-detect. I also tried using `discard` (as noted above) and it didn't help. – Shane Meyers Sep 01 '11 at 23:36
  • Oh well. Worth a shot :) – Dave Veffer Sep 01 '11 at 23:57
1

For btrfs you need the discard mount option to enable TRIM support.

A very simple but working test for functional TRIM is here: http://techgage.com/article/enabling_and_testing_ssd_trim_support_under_linux/2
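If you would rather not mount with discard, batch TRIM of a mounted filesystem is another option, assuming your kernel supports the FITRIM ioctl on btrfs and your util-linux is new enough to ship fstrim:

fstrim -v /mnt    # trims all unused space on the filesystem mounted at /mnt and reports how much was discarded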

Paweł Brodacki
  • As I mentioned above, I tried my testing with both the `discard` option and the `ssd` option. The BtrFS docs mention the `ssd` option a lot, so I focused my testing there, but neither option resulted in the outcome I expected. Most webpages that show how to test TRIM are for Ext4 and the like. BtrFS cannot be tested using those methodologies due to differences in the design of the file system. – Shane Meyers Sep 02 '11 at 16:05
  • `hdparm --fibmap` is FS agnostic. A block at given LBA address is either zeroed out, or not, whether it's extN, btrfs, xfs, jfs... `ssd` option is irrelevant for trim, see e.g. this discussion on btrfs mailing list: http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg10932.html. – Paweł Brodacki Sep 02 '11 at 18:05
  • I tried using `hdparm --fibmap` but it doesn't work on BtrFS. If you look at the wiper.sh README (distributed alongside hdparm), they explicitly state that "FIEMAP/FIBMAP ioctl() calls are completely unsafe when used on a btrfs filesystem." So hdparm is out, which is too bad as this would make testing go a lot easier. I didn't know that the `ssd` option had nothing to do with TRIM as the docs aren't very clear on the usefulness of the option. – Shane Meyers Sep 03 '11 at 05:27
  • Thank you for the extra information about ioctls, I didn't know it. I think the best place to ask for extra information could be the btrfs mailing list. You'll get first-hand information there. – Paweł Brodacki Sep 03 '11 at 05:48
1

Virtually all SSDs with a SATA interface run some sort of log-structured filesystem that is completely hidden from you. The SATA 'trim' command tells the device that a block is no longer in use and that the underlying log-structured filesystem can erase it /if/ the corresponding erase block (which might be substantially larger) /only/ contains blocks marked with trim.

I have not read the standard docs, which are here: http://t13.org/Documents/MinutesDefault.aspx?keyword=trim, but I'm not sure there is any standard-level guarantee that you'd be able to see the results of a trim command. Even if you can see something change, like the first few bytes being zeroed out at the start of an erase block, I don't think there's any guarantee this is applicable across different devices or even different firmware versions.
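One thing you can check is what the drive itself claims about its TRIM behaviour, since the identify data says whether reads after TRIM are deterministic and/or zeroed (a sketch; the exact wording of the output varies by drive and hdparm version):

hdparm -I /dev/sda | grep -i trim    # look for "Data Set Management TRIM supported" and "Deterministic read ... after TRIM"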

If you think about the way the abstraction might be implemented, it should be possible to make the result of the trim command completely invisible to anyone just reading/writing blocks. Furthermore, it might be hard to tell which blocks are in the same erase block, since only the flash translation layer has to know that, and it might have reordered them logically.

Perhaps there is a SATA command (OEM command perhaps?) for fetching metadata related to the SSDs flash translation layer?

user134450
Joshua Hoblitt
1

Some things to think about (to help answer your "am I missing something?" question):

  • what exactly is /dev/sda? a single SSD? or a (hardware?) RAID array of SSDs?

  • if the latter then what kind of RAID controller?

  • and does your RAID controller support TRIM?

and, finally,

  • does your testing method give you the results you expect if you format /dev/sda1 with something other than btrfs? (a quick Ext4 variant of the test is sketched below)
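For that last point, the same test with Ext4 might look roughly like this, reusing the commands from the question (a sketch; mkfs will of course destroy whatever is currently on /dev/sda1):

mkfs.ext4 /dev/sda1
mount -o discard /dev/sda1 /mnt
dd if=/dev/urandom of=/mnt/testfile bs=1k count=50000 oflag=direct
./sectors.pl | tee sectors-$(date +%s)
rm /mnt/testfile && sync && sleep 120
./sectors.pl | tee sectors-$(date +%s)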
cas