
Is there a good way to prime a ZFS L2ARC cache on Solaris 11.3?

The L2ARC is designed to ignore blocks that have been read sequentially from a file. This makes sense for ongoing operation but makes it hard to prime the cache for initial warm-up or benchmarking.

In addition, highly-fragmented files may benefit greatly from sequential reads being cached in the L2ARC (because on-disk they are random reads), but with the current heuristics these files will never get cached even if the L2ARC is only 10% full.

In previous releases of Solaris 10 and 11, I had success using dd twice in a row on each file: the first pass read the file into the ARC, and the second pass seemed to tickle the buffers so they became eligible for L2ARC caching. The same technique does not appear to work in Solaris 11.3.
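
For reference, the two-pass approach looked like this (filename.bin is a placeholder; the 1024k block size is arbitrary):

dd if=filename.bin of=/dev/null bs=1024k
dd if=filename.bin of=/dev/null bs=1024k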

I have confirmed that the files in question have an 8k recordsize, and I have tried setting zfs_prefetch_disable, but this had no impact on the L2ARC behaviour. UPDATE: zfs_prefetch_disable turns out to be important; see my answer below.

If there is no good way to do it, I would consider using a tool that generates random reads over 100% of a file. This might be worth the time given that the cache is persistent now in 11.3. Do any tools like this exist?

Tom Shaw

2 Answers


With a bit of experimentation I've found four possible solutions.

With each approach, you need to perform the steps and then continue to read more data to fill up the ZFS ARC cache and to trigger the feed from the ARC to the L2ARC. Note that if the data is already cached in memory, or if the compressed size on disk of each block is greater than 32kB, these methods won't generally do anything.
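
Before bothering with any of this, it's worth confirming the block-size criterion for the dataset in question. A minimal check (tank/db is a placeholder for your dataset name): with an 8k recordsize, every block is comfortably under the 32kB on-disk limit even before compression.

zfs get recordsize,compression tank/db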

1. Set the documented kernel flag zfs_prefetch_disable

The L2ARC by default refuses to cache data that has been automatically prefetched. We can bypass this by disabling the ZFS prefetch feature. This flag is often a good idea for database workloads anyway.

echo "zfs_prefetch_disable/W0t1" | mdb -kw

...or to set it permanently, add the following to /etc/system:

set zfs:zfs_prefetch_disable = 1

Now when files are read using dd, they will still be eligible for the L2ARC.
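
A quick read-back to confirm the live value before re-reading the files (mdb's /D format prints the variable in decimal):

echo "zfs_prefetch_disable/D" | mdb -k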

Operationally, this change also improves the behaviour of reads in my testing. Normally, when ZFS detects a sequential read it balances the throughput among the data vdevs and cache vdevs instead of just reading from cache -- but this hurts performance if the cache devices are significantly lower-latency or higher-throughput than the data devices.

2. Re-write the data

As data is written to a ZFS filesystem it is cached in the ARC and (if it meets the block size criteria) is eligible to be fed into the L2ARC. It's not always easy to re-write data, but some applications and databases can do it live, e.g. through application-level file mirroring or moving of the data files.

Problems:

  • Not always possible depending on the application.
  • Consumes extra space if there are snapshots in use.
  • (But on the bright side, the resulting files are defragmented.)
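
As a concrete illustration of the re-write approach, where the application can be quiesced briefly, a simple copy-and-rename re-writes every block (a sketch only; filename.bin is a placeholder, and mv is an atomic rename within the same filesystem):

# copy the file so its blocks are freshly written (and ARC-cached) ...
cp -p filename.bin filename.bin.rewrite
# ... then rename the copy over the original
mv filename.bin.rewrite filename.bin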

3. Unset the undocumented kernel flag l2arc_noprefetch

This is based on reading the OpenSolaris source code and is no doubt completely unsupported. Use at your own risk.

  1. Disable the l2arc_noprefetch flag:

    echo "l2arc_noprefetch/W0" | mdb -kw
    

    Data read into the ARC while this flag is disabled will be eligible for the L2ARC even if it's a sequential read (as long as the blocks are at most 32k on disk).

  2. Read the file from disk:

    dd if=filename.bin of=/dev/null bs=1024k
    
  3. Re-enable the l2arc_noprefetch flag:

    echo "l2arc_noprefetch/W1" | mdb -kw
    

4. Read the data randomly

I wrote a Perl script to read files in 8kB chunks pseudorandomly (based on the ordering of a Perl hash). It may also work with larger chunks but I haven't tested that yet.

#!/usr/bin/perl -W
# Read each file named on the command line in 8kB chunks, visiting the blocks
# in the pseudorandom order of Perl's hash keys so ZFS never sees a
# sequential read pattern.

my $BLOCK_SIZE = 8*2**10;  # 8kB, matching the dataset recordsize
my $MAX_ERRS = 5;          # give up on a file after this many read errors

foreach my $file (@ARGV) {
        print "Reading $file...\n";
        my $size;
        unless($size = (stat($file))[7]) {print STDERR "Unable to stat file $file.\n"; next; }
        unless(open(FILE, "<$file")) {print STDERR "Unable to open file $file.\n"; next; }
        my $buf;
        my %blocks;
        for(my $i=0;$i<$size/$BLOCK_SIZE;$i++) { $blocks{"$i"} = 0; }
        my $errs = 0;
        foreach my $block (keys %blocks) {
                unless(sysseek(FILE, $block*$BLOCK_SIZE, 0) && sysread(FILE, $buf, $BLOCK_SIZE)) {
                        print STDERR "Error reading $BLOCK_SIZE bytes from offset " . $block * $BLOCK_SIZE . "\n";
                        if(++$errs == $MAX_ERRS) { print STDERR "Giving up on this file.\n"; last; }
                        next;
                }
        }
        close(FILE);
}

Problems:

  • This takes a long time and puts a heavy workload on the disk.

Remaining issues

  • The above methods will get the data into main memory, eligible for feeding into the L2ARC, but they don't trigger the feed. The only way I know to trigger writing to the L2ARC is to continue reading data to put pressure on the ARC.
  • On Solaris 11.3 with SRU 1.3.9.4.0, only rarely does the L2ARC grow to the full size expected. The evict_l2_eligible kstat increases even when the SSD devices are under no pressure, indicating that data is being dropped. This remaining rump of uncached data has a disproportionate effect on performance; the kstat one-liners below are what I use to watch for it.
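
To watch the feed while pushing more reads through, I run these in separate terminals (assuming the stats appear under zfs:0:arcstats as on my system; the 5-second interval is arbitrary):

kstat -p zfs:0:arcstats:l2_size 5
kstat -p zfs:0:arcstats:evict_l2_eligible 5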
Tom Shaw
  • Would it be possible to manually reduce the size of the ARC after loading by reducing main memory artificially (running another program that continually reserves more and more memory without releasing it until almost all memory is claimed) to put on the pressure you spoke of? I have never tried it, it just crossed my mind reading your excellent post. – user121391 Jun 24 '16 at 07:49
  • I think that might work, or just making a file in /tmp would do the same. As another alternative, I tried slowly increasing the user_reserve_hint_pct tunable (as per MOS 1663862.1) and that did trigger the L2ARC feed. – Tom Shaw Jun 24 '16 at 12:08

I'd suggest using a real workload and monitoring the result with arcstat.

Something like:

arcstat.py -f "time,read,l2read,hit%,hits,miss%,miss,l2hit%,l2miss%,arcsz,c,l2size" 1

I don't think there's any need to "prime" the cache. If the workload you have doesn't naturally populate the cache, then it's not a representative benchmarking workload, right?

Maybe you have an exceptional use case (what's your dataset size, ARC size and working set size?), but in general the importance of the L2ARC is overemphasized.

ewwhite
  • In our case it actually does make a big difference to performance once the cache is primed, and we have seen that on previous releases of Solaris 10 and 11.1. The working set is about 600GB (lzjb compressed) on spinning, replicated SAN disk. We have 800GB worth of local SSD, so with compressed L2ARC we should be able to fit everything within the cache and never need to hit the SAN for reads. – Tom Shaw Jun 23 '16 at 12:42
  • Then what are you asking? – ewwhite Jun 23 '16 at 12:44
  • "Is there a good way to prime a ZFS L2ARC cache on Solaris 11.3?" – Tom Shaw Jun 23 '16 at 12:45