Tuning ZFS scrubbing, 141KB/s running for 15 days

Question

A pretty basic system running mirror+stripe on 7.2k rpm sas disks, not particularly loaded. No dedup, compression on all datasets. Scrub has been running for 15 days at the speed of a dead snail. Is there some optimization that needs be done, or can it be due to some faulty hw?

Dell R510 with MD1200 enclosure.
2x Xeon E5620
48GB
NexentaStor 3.1.3, Community edition

Some information:

scan: scrub in progress since Mon Apr  1 19:00:05 2013
171G scanned out of 747G at 141K/s, 1187h40m to go
0 repaired, 22.84% done
config:

    NAME                       STATE     READ WRITE CKSUM
    tank                       ONLINE       0     0     0
      mirror-0                 ONLINE       0     0     0
        c7t5000C500414FB2CFd0  ONLINE       0     0     0
        c7t5000C500414FCA57d0  ONLINE       0     0     0
      mirror-1                 ONLINE       0     0     0
        c7t5000C500415C3B1Bd0  ONLINE       0     0     0
        c7t5000C500415C5E4Fd0  ONLINE       0     0     0
      mirror-2                 ONLINE       0     0     0
        c7t5000C500415DC797d0  ONLINE       0     0     0
        c7t5000C500415DC933d0  ONLINE       0     0     0
    logs
      c7t5000A7203006D81Ed0    ONLINE       0     0     0
    cache
      c7t5000A72030068545d0    ONLINE       0     0     0


# iostat -en     
---- errors --- 
s/w h/w trn tot device
0 8887   0 8887 c2t0d0
0   0   0   0 c0t395301D6B0C8069Ad0
0   0   0   0 c7t5000C500415DC933d0
0   0   0   0 c7t5000A72030068545d0
0   0   0   0 c7t5000C500415DC797d0
0   0   0   0 c7t5000C500414FCA57d0
0   0   0   0 c7t5000C500415C3B1Bd0
0   0   0   0 c7t5000C500415C5E4Fd0
0   0   0   0 c7t5000C500414FB2CFd0
0   0   0   0 c7t5000A7203006D81Ed0

The spa_last_io is changed every time I run this

# echo "::walk spa | ::print spa_t spa_name spa_last_io spa_scrub_inflight" | mdb -k
spa_name = [ "syspool" ]
spa_last_io = 0x25661402
spa_scrub_inflight = 0
spa_name = [ "tank" ]
spa_last_io = 0x25661f84
spa_scrub_inflight = 0x21

Every 5 seconds, about 20-25 MB/s is written. Between those writes there's basically no reads or writes.

                          capacity     operations    bandwidth      latency
    pool                       alloc   free   read  write   read  write   read  write
    -------------------------  -----  -----  -----  -----  -----  -----  -----  -----
    syspool                     427G   501G      0      0      0      0   0.00   0.00
      c0t395301D6B0C8069Ad0s0   427G   501G      0      0      0      0   0.00   0.00
    -------------------------  -----  -----  -----  -----  -----  -----  -----  -----
    tank                        903G  1.84T    810  5.21K  1.50M  20.8M   9.42   4.71
      mirror                    301G   627G     22  1.00K  53.0K  3.96M   8.96   3.93
        c7t5000C500414FB2CFd0      -      -     20    244  50.1K  3.97M   6.70   1.14
        c7t5000C500414FCA57d0      -      -     19    242  48.2K  3.97M   7.60   1.12
      mirror                    301G   627G     25   1016  46.8K  4.10M  16.11   5.28
        c7t5000C500415C3B1Bd0      -      -     21    257  41.6K  4.11M   4.63   1.24
        c7t5000C500415C5E4Fd0      -      -     21    255  43.0K  4.11M  16.54   1.15
      mirror                    301G   627G     62    754   119K  3.03M  19.72   3.78
        c7t5000C500415DC797d0      -      -     57    219   114K  3.03M   9.99   1.15
        c7t5000C500415DC933d0      -      -     56    220   119K  3.03M  13.20   1.22
      c7t5000A7203006D81Ed0     260K  46.5G      0      0      0      0   0.00   0.00
    cache                          -      -      -      -      -      -
      c7t5000A72030068545d0    93.1G     8M      0      0      0      0   0.00   0.00
    -------------------------  -----  -----  -----  -----  -----  -----  -----  -----

Are the iostats telling me that I am spending more time waiting for disk then I should have to? Specifically the %b column

# iostat -xe
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot 
sd3       5.1   43.9   20.6  643.8  0.0  0.1    2.9   0   5   0   0   0   0 
sd4       9.4    1.8  141.1  169.6  0.0  0.0    0.5   0   0   0   0   0   0 
sd5       3.1   43.8   15.8  643.8  0.0  0.1    1.4   0   3   0   0   0   0 
sd6       5.2   38.1   14.3  494.4  0.0  0.1    3.0   0   7   0   0   0   0 
sd7       4.2   40.2   11.1  623.2  0.0  0.1    2.7   0   7   0   0   0   0 
sd8       3.6   44.3    9.7  623.2  0.0  0.1    1.5   0   4   0   0   0   0 
sd9       2.9   37.4    7.0  494.4  0.0  0.1    1.3   0   2   0   0   0   0 
sd10      0.7    0.4    3.4    0.0  0.0  0.0    0.0   0   0   0   0   0   0

The latency a tad on the high side?

# zpool iostat 10 10
               capacity     operations    bandwidth      latency
pool        alloc   free   read  write   read  write   read  write
tank         909G  1.83T     86  2.82K   208K  12.7M  22.68  13.63
----------  -----  -----  -----  -----  -----  -----  -----  -----
tank         909G  1.83T     29    857  42.4K  3.50M  17.86   4.47
----------  -----  -----  -----  -----  -----  -----  -----  -----
tank         909G  1.83T     30    947  46.1K  3.54M  15.55   5.67

Applied some tweaking which made little difference. zfs_top_maxinflight set to 127, zfs_scrub_delay to 0, and zfs_scan_idle to 0.

# echo zfs_top_maxinflight | mdb -k
zfs_top_maxinflight:
zfs_top_maxinflight:            127

# echo zfs_scrub_delay/D |mdb -k
zfs_scrub_delay:
zfs_scrub_delay:0

# echo zfs_scan_idle/D |mdb -k
zfs_scan_idle:
zfs_scan_idle:  0


 scan: scrub in progress since Wed Apr 17 20:47:23 2013
    1.85G scanned out of 918G at 1.14M/s, 229h36m to go
    0 repaired, 0.20% done

pre mdb tweak, notice the rather high b% column

$ iostat -nx -M 5

  r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c2t0d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t395301D6B0C8069Ad0
 35.2   44.2    0.3    0.7  0.0  0.4    0.0    5.3   0  32 c7t5000C500415DC933d0
 19.8    3.2    0.2    0.0  0.0  0.0    0.0    0.1   0   0 c7t5000A72030068545d0
 31.2   46.2    0.2    0.7  0.0  0.3    0.0    4.4   0  27 c7t5000C500415DC797d0
 30.6   46.8    0.2    0.8  0.0  0.4    0.0    4.6   0  28 c7t5000C500414FCA57d0
 37.6   53.0    0.3    0.8  0.0  0.4    0.0    4.7   0  33 c7t5000C500415C3B1Bd0
 37.6   53.6    0.3    0.8  0.0  0.5    0.0    5.6   0  39 c7t5000C500415C5E4Fd0
 33.2   46.8    0.3    0.8  0.0  0.5    0.0    6.1   0  33 c7t5000C500414FB2CFd0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7t5000A7203006D81Ed0

post mdb tweak, notice the b% column, 80-85% time in busy wait

$ iostat -nx -M 5 
  r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c2t0d0
  0.2   27.2    0.0    0.3  0.0  1.0    0.0   35.4   0  18 c0t395301D6B0C8069Ad0
129.6   20.2    0.9    0.4  0.0  2.9    0.0   19.5   0  85 c7t5000C500415DC933d0
 48.4    4.0    0.4    0.0  0.0  0.0    0.0    0.1   0   1 c7t5000A72030068545d0
130.4   19.8    0.9    0.4  0.0  3.0    0.0   20.2   0  84 c7t5000C500415DC797d0
125.8   25.8    0.9    0.5  0.0  2.9    0.0   19.2   0  80 c7t5000C500414FCA57d0
131.2   24.2    0.9    0.5  0.0  3.1    0.0   20.3   0  83 c7t5000C500415C3B1Bd0
130.6   25.8    0.9    0.5  0.0  3.5    0.0   22.5   0  88 c7t5000C500415C5E4Fd0
126.8   28.0    0.9    0.5  0.0  2.8    0.0   18.0   0  79 c7t5000C500414FB2CFd0
  0.2    0.0    0.0    0.0  0.0  0.0    0.0    0.1   0   0 c7t5000A7203006D81Ed0

What does `smartctl -A /dev/disk` say about each drive (might have to install `smartctl`, not sure if it comes with the base install). — Chris S, Apr 16 '13 at 13:45
What multiple occurence of iostat -XnE | grep Errors says ? do some error count increment ? — , Apr 16 '13 at 13:37
Nothing of interest, besides "Non-medium error count: 8071" on one disk. All disks sit in a JBOD (Dell MD1200) on the same (single) sas lane — 3molo, Apr 17 '13 at 17:09

Nex7 · Accepted Answer · 2013-05-27T23:38:25.283

ZFS scrub operations operate on some fairly brain-dead principles. Most notably, it only spends time scrubbing when there's nothing else going on. If you poke a pool with just a bit of data access on a fairly constant basis, scrub will effectively starve itself and do nearly nothing.

Tunables to explore, with my quick notes on what it does (I last looked into this awhile ago, though):

zfs_scan_idle - if user I/O occurrs within this many clock ticks, delay scrub I/O by zfs_scrub_delay clock ticks
zfs_scrub_delay - how many clock ticks to delay scrub operation if triggered by zfs_scan_idle
zfs_top_maxinflight - maximum number of scrub I/O per top-level vdev
zfs_scrub_limit - maximum number of scrub I/O per leaf vdev
zfs_scan_min_time_ms - minimum ms to spend per txg on scrub operations
zfs_no_scrub_io - no notes
zfs_no_scrub_prefetch - no notes, name seems to imply not causing prefetch on scrub ops

All of these are changeable on the fly using "echo [tunable]/W0t[number]" to change, and "echo [tunable]/D" to view current setting (which I recommend doing before changing).

So in theory, and in general practice, if you were to, say, change zfs_scan_idle down to 10 (or 1 - or 0, if it supports that, would need to check the code) and zfs_scrub_delay down to 1 (or 0, if it supports that), and if your txg_synctime_ms setting is 5000 or more maybe change zfs_scan_min_time_ms up a bit, it should become a lot more aggressive about actually doing scrub operations even with some level of user I/O occurring.

In your specific case, the %b and asvc_t reported imply some very, very random read workload going on (spinning disks should do better than that if it is truly sequential), and you've already done the "easy" stuff as explained above. So, first I'd turn on zfs_no_scrub_prefetch, to disable prefetch on scrub operations, just to see if that helped. If no joy, depending on the version of Nexenta you're on - you may be running 30/5, 5/1 or 10/5 (that's shorthand we use for the settings of zfs_txg_timeout & (zfs_txg_synctime_ms*1000)). Change zfs_txg_timeout to 10 and zfs_txg_synctime_ms to 5000, then try upping zfs_scan_min_time_ms to 3000 or 4000. This tells ZFS it can spend a lot longer on scrubs, as compared to the default settings on older NexentaStor installs that use 5/1 as the defaults - but careful, this may starve normal I/O if the delay settings have also been set to basically 0!

Hope this helps. Good luck!

I suppose I should note that you modify these settings in bash using "echo /W0t | mdb -kw". And you view current values with "echo /D | mdb -k". My notes say all of these can be changed in flight, none seem to require an /etc/system modification and reboot to take effect. — Nex7, May 09 '13 at 16:19
I should also read the entire question before responding - and stop browsing ServerFault while on conference calls. :) — Nex7, May 09 '13 at 16:22
The %b and asvc_t reported imply some very, very random read workload going on (spinning disks should do better than that if it is truly sequential). First I'd turn on zfs_no_scrub_prefetch, to disable prefetch on scrub operations, just to see if that helped. If no joy, depending on the version of Nexenta you're on - you may be running 30/5, 5/1 or 10/5 (zfs_txg_timeout & (zfs_txg_synctime_ms*1000). Change zfs_txg_timeout to 10 and zfs_txg_synctime_ms to 5000, then try upping zfs_scan_min_time_ms to 3000 or 4000. This tells ZFS it can spend a lot longer on scrubs, may starve normal I/O! — Nex7, May 09 '13 at 16:28
I think you provide very valuable input, but it would be much more helpful if you could add the comments into one good answer. — 3molo, May 10 '13 at 14:15
With zfs_max_topinflight 127, zfs_scan_idle=0 and zfs_scrub_delay=0 I effectively starved all other I/O, but got the scrub to reach 18MB/s so that it could finish in a day or so. Next time I'll be sure to pay close attention to your answer, which contains many parameters I haven't even heard of. — 3molo, May 29 '13 at 18:48
More tuning may have helped, but not necessarily. It is important to note that a ZFS scrub rolls through the data structure, NOT sector by sector on the disks. Which is to say, depending on how the zfs data structure looks on your disks, a scrub operation may look incredibly random -- your disks may be capable of > 100 MB/s of *sequential* read, but completely random read will be another story entirely. Average block size would also matter here. — Nex7, May 31 '13 at 06:00

score 3 · Answer 2 · answered Apr 16 '13 at 12:21

3

I suspect hardware...

Why would you let this run for 15 days? That's not normal. Stop the scrub - zpool scrub -s tank and check the system out.

Which controllers are you using?
Is this the first scrub you've ever run on this pool?
Was there a problem that prompted you to run the scrub in the first place?

answered Apr 16 '13 at 12:21

ewwhite

194,921
91
434
799

1

LSI SAS9200-8e (IT firmware). Not first scrub. No, no real issues (but I have been questioning the sequential read/write performance for a while). – 3molo Apr 16 '13 at 13:16
Updated with latency and wait times, starting to suspect there's always some time to service requests and that prioritizes scrub so low that it crawls to an halt. Any insight is very helpful! – 3molo Apr 16 '13 at 13:29
Scrubs are important to run periodically. Waiting until you have a problem to run a scrub is asking for that problem to blow up into data loss. Scrubs are there to catch silent data corruption (bitrot). A slow running scrub is not a sign of a system problem, just a pool that is kept busy enough not to let the scrub accelerate. – lschweiss Oct 23 '17 at 13:25

score 0 · Answer 3 · answered Dec 10 '17 at 10:33

My answer comes a little late, but if this kind of thing happens to anyone else, here is my take on it: simply try out "dmesg". In my case, I wasn't doing a scrub, but I was copying files to the disks, and I was clearly hearing the disks being active for a few seconds, then all stopping for a longer time, and again working and so on. This was due to the failure of one SATA controller and dmesg gave me all the errors. I thought it was a failing disk at first, but then I realized it was actually the controller.

score -3 · Answer 4 · answered Apr 16 '13 at 12:30

-3

Scrub uses available system downtime, even on an unloaded server, it's about availability. Ram and Processor are the keys to scrub utilization, not disc. The more of these available, the better your scrub performance will be. However, certainly, in this case, the better your discs are laid out, in terms of ZPools, the better your scrub performance will be as well.

So, if your performance has been slow, and that does appear to be the case, I would look to these as potential reasons.

answered Apr 16 '13 at 12:30

user169761

1

1

I don't see any indicator that any resources is scarce. – 3molo Apr 17 '13 at 17:10
1

This is pretty much completely false. CPU & RAM have effectively zero impact on scrub operations (assuming there's any free at all). Having lots of free RAM & CPU will not 'speed up' scrub operations. Scrub is limited by watching incoming I/O to the pool, not by checking for 'available system downtime', whatever that is. – Nex7 May 27 '13 at 23:34

Tuning ZFS scrubbing, 141KB/s running for 15 days

4 Answers4