
We have a ZFS file system running on OpenIndiana with an almost purely read-only workload. Recently we have found that I/O blocks or pauses every minute for around 2-3 seconds. Does anyone know why this happens and how to stop it from occurring?

  pool: data01
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        data01                     ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c6t5E83A97A6BAB722Fd0  ONLINE       0     0     0
            c6t5E83A97BBE7FE8F1d0  ONLINE       0     0     0
            c6t5E83A97CB2D1609Fd0  ONLINE       0     0     0
            c6t5E83A97E0D2BC6ACd0  ONLINE       0     0     0
            c6t5E83A97E1E2A6F66d0  ONLINE       0     0     0
            c6t5E83A97E4C627FC0d0  ONLINE       0     0     0
            c6t5E83A97E44EB7FC0d0  ONLINE       0     0     0
            c6t5E83A97E124F8168d0  ONLINE       0     0     0
            c6t5E83A97E86762E4Dd0  ONLINE       0     0     0
            c6t5E83A97E25716311d0  ONLINE       0     0     0
            c6t5E83A97EBEEB8B60d0  ONLINE       0     0     0
            c6t5E83A97FCF5551F5d0  ONLINE       0     0     0

errors: No known data errors
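
The stalls are easy to catch with a one-second sampling loop; a minimal sketch (data01 is the pool from the config above, and both commands are standard on OpenIndiana):

    # Pool- and device-level throughput once per second; during a stall the
    # read columns drop to or near zero for 2-3 consecutive samples.
    zpool iostat -v data01 1

    # Per-device service times and %b (busy), to tell "no I/O being issued"
    # apart from "I/O issued but the disks are slow".
    iostat -xn 1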

As requested by @jlliagre, here are the results from calling hotkernel during good and bad times.
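
For reference, hotkernel comes from the DTraceToolkit and simply samples kernel program counters; a minimal sketch of how the two runs might be captured (the /opt/DTT install path is an assumption):

    # Sample kernel function activity for ~15 seconds, then stop with Ctrl-C;
    # run once during a quiet period ("good times") and once across a pause
    # ("bad times").
    /opt/DTT/hotkernel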

good times

  • zfs`vdev_is_dead 1 0.0%
  • ip`conn_ip_output 1 0.0%
  • sd`sdintr 1 0.0%
  • mpt_sas`mptsas_start_cmd0 1 0.0%
  • genunix`new_mstate 1 0.0%
  • genunix`segvn_faulta 1 0.0%
  • unix`swtch 1 0.0%
  • unix`tsc_scalehrtime 1 0.0%
  • zfs`ddt_stat_add 1 0.0%
  • unix`lwp_segregs_save 1 0.0%
  • rootnex`rootnex_get_sgl 1 0.0%
  • genunix`ldi_strategy 1 0.0%
  • mpt_sas`mptsas_scsi_init_pkt 1 0.0%
  • genunix`ddi_fm_dma_err_get 1 0.0%
  • zfs`zio_create 1 0.0%
  • zfs`zio_destroy 1 0.0%
  • unix`strcmp 1 0.0%
  • unix`cpu_acpi_idle 1 0.0%
  • ip`ip_xmit_attach_llhdr 1 0.0%
  • genunix`mdi_pi_get_vhci_private 1 0.0%
  • genunix`lbolt_event_driven 1 0.0%
  • unix`disp 1 0.0%
  • genunix`cpu_update_pct 1 0.0%
  • unix`tsc_gethrtime 1 0.0%
  • ip`tcp_build_hdrs 1 0.0%
  • genunix`dnlc_lookup 1 0.0%
  • unix`page_create_va 1 0.0%
  • genunix`kmem_zalloc 1 0.0%
  • unix`mmapobj_map_elf 1 0.0%
  • bnx`bnx_intr_1lvl 1 0.0%
  • genunix`btopr 1 0.0%
  • genunix`fsflush 1 0.0%
  • genunix`gexec 1 0.0%
  • unix`mmu_tlbflush_entry 1 0.0%
  • unix`hment_alloc 1 0.0%
  • specfs`spec_lockcsp 1 0.0%
  • zfs`vdev_stat_update 1 0.0%
  • genunix`stk_getptr 1 0.0%
  • unix`resume 1 0.0%
  • unix`do_splx 1 0.0%
  • unix`page_clr_all_props 1 0.0%
  • bnx`bnx_xmit_ring_xmit_mblk 1 0.0%
  • genunix`list_create 1 0.0%
  • scsi`scsi_uscsi_handle_cmd 1 0.0%
  • unix`page_lookup_nowait 1 0.0%
  • genunix`syscall_mstate 1 0.0%
  • genunix`kmem_free 1 0.0%
  • genunix`as_map_locked 1 0.0%
  • unix`cpu_idle_adaptive 1 0.0%
  • unix`page_lookup_create 1 0.0%
  • sd`sd_start_cmds 1 0.0%
  • unix`0xfffffffffb85 1 0.0%
  • unix`atomic_cas_64 1 0.0%
  • genunix`restorectx 1 0.0%
  • genunix`vmem_is_populator 1 0.0%
  • unix`disp_anywork 1 0.0%
  • unix`prefetch_page_r 1 0.0%
  • sd`xbuf_iostart 1 0.0%
  • zfs`zfs_read 1 0.0%
  • unix`hwblkclr 1 0.0%
  • specfs`sfind 1 0.0%
  • genunix`swapfs_getvp 1 0.0%
  • zfs`vdev_cache_allocate 1 0.0%
  • zfs`buf_hash_find 1 0.0%
  • genunix`taskq_thread 2 0.0%
  • unix`tsc_read 2 0.0%
  • unix`tsc_gethrtimeunscaled 2 0.0%
  • genunix`fop_getpage 2 0.0%
  • unix`inflate_table 2 0.0%
  • unix`ddi_get32 2 0.0%
  • unix`mutex_exit 2 0.0%
  • unix`idle 3 0.0%
  • zfs`lzjb_compress 3 0.0%
  • unix`mutex_enter 6 0.1%
  • unix`page_nextn 9 0.1%
  • genunix`fsflush_do_pages 9 0.1%
  • unix`disp_getwork 12 0.2%
  • unix`acpi_cpu_cstate 441 5.7%
  • unix`i86_mwait 7118 92.7%

bad times

  • genunix`cpu_update_pct 1 0.0%
  • unix`splr 1 0.0%
  • bnx`bnx_timer 1 0.0%
  • unix`av_check_softint_pending 1 0.0%
  • genunix`devid_str_decode_id 1 0.0%
  • pcplusmp`apic_send_ipi 1 0.0%
  • genunix`callout_downheap 1 0.0%
  • unix`prefetch_page_r 1 0.0%
  • dtrace`dtrace_state_clean 1 0.0%
  • genunix`fsflush 1 0.0%
  • unix`link_ptp 1 0.0%
  • ip`ip_input_cksum_v4 1 0.0%
  • ip`tcp_timer_callback 1 0.0%
  • unix`page_pptonum 1 0.0%
  • unix`do_splx 1 0.0%
  • unix`page_nextn 5 0.1%
  • genunix`fsflush_do_pages 6 0.1%
  • unix`acpi_cpu_cstate 782 17.7%
  • unix`i86_mwait
  • More details about the hardware and pool layout, please... – ewwhite Jan 02 '14 at 12:36
  • The zpool contains 12 SSDs in a raidz1 configuration. The load is made up of a large number of small random reads. – PhilHoy Jan 02 '14 at 13:45
  • And how much RAM do you have? Is there any L2ARC? Can you provide the output of `zpool status -v`? – ewwhite Jan 02 '14 at 13:49
  • The server has 64 GB of RAM. The cache hit ratio is fairly poor, so we have decided not to use an L2ARC. – PhilHoy Jan 02 '14 at 14:07
  • You might want to use dtrace to identify what the OS is doing during the pauses. I would suggest starting with the hotkernel script. – jlliagre Jan 02 '14 at 14:31
  • The hotkernel script is not showing anything obvious (to me). – PhilHoy Jan 02 '14 at 15:17
  • @PhilHoy Have you looked at disk-busy figures? – ewwhite Jan 02 '14 at 16:03
  • @PhilHoy Please use the @ + name prefix when replying so that notifications occur. You should run hotkernel twice: the first time during a normal period, the second when the problem occurs, then post the results. – jlliagre Jan 02 '14 at 16:14
  • Have you run a scrub on your system since this problem started? – tegbains Jan 03 '14 at 00:03
  • Status of C-states? And to pre-empt an answer of 'on', turn off C-states. They are a huge killer of performance (for instance, you'll never get full 10Gbe if C-states is on, fun huh?). You can't really (AFAIK) disable them fully in OS. Go into your BIOS, find anything about C-States and disable it. Then find where it talks about power, and set it to 'Performance' or the equivalent. If your CPU is capable of dropping below C1 (or maybe C1E), you're going to feel performance penalties, and at times you wouldn't even expect it. – Nex7 Feb 08 '14 at 08:05
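
A quick way to see whether the CPUs really are dropping into deep C-states during the stalls (a sketch only; it assumes the fbt provider exposes acpi_cpu_cstate on this kernel, the routine visible in the hotkernel output above):

    # Count entries into the ACPI C-state routine per CPU over 10 seconds;
    # compare a quiet window against a window containing a pause.
    dtrace -n 'fbt::acpi_cpu_cstate:entry { @[cpu] = count(); } tick-10s { exit(0); }'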

2 Answers


Well, there are a few things wrong here...

  • Your zpool version is old. zpool upgrade will fix that.
  • The pool design does not lend itself to high performance. A single RAIDZ vdev has the random-read IOPS potential of roughly one disk, since each block is striped across all the data drives. (This is likely the biggest issue here.)
  • You have too many drives in your RAIDZ1. You shouldn't go over 7...
  • There are likely other tunables that may help here. Can you show zfs list?
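
A minimal sketch of commands that would pull that information (data01 is the pool name from the question; the property list is just a suggestion):

    # Pool version currently on disk versus what the installed ZFS supports.
    zpool get version data01
    zpool upgrade -v

    # Dataset layout plus a few properties that matter for small random reads.
    zfs list -r -o name,used,avail,recordsize,primarycache,atime data01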

Take a read of: http://nex7.blogspot.com/2013/03/readme1st.html for more general guidelines.

ewwhite
  • I would like to stress the performance WAS very good until recently, when it started pausing every minute. In between the pauses the performance is great. zfs list: NAME USED AVAIL REFER MOUNTPOINT data01 2.98T 1.71T 42.0K /data01 – PhilHoy Jan 02 '14 at 14:23

Perhaps there is an issue with power management on your machine.

Try disabling it by adding the following line to the /etc/power.conf file:

cpupm disable

then run

pmconfig

and see if the I/O behavior changes.
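
One rough way to confirm the new setting took effect (a sketch; it assumes the cpu_info kstat exposes the C-state fields, as it does on recent illumos builds):

    # After pmconfig, idle CPUs should report a shallow current_cstate.
    kstat -m cpu_info | grep -i cstate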


Update:

As disabling power management doesn't change anything, the fact remains that the kernel is essentially idle during the "bad times", with zero ZFS activity.

I would then infer the application is not I/O bound and is simply doing something else during these periods of time.
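
One way to check that from the application side is to watch its microstates during a pause (a sketch; <pid> stands in for the application's process id):

    # Per-thread microstate accounting, refreshed every second. During a pause,
    # high SLP/LCK/LAT with little USR/SYS suggests the app is waiting on
    # something other than disk I/O.
    prstat -mLp <pid> 1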

jlliagre
  • Tried that and it did not help. Nice ideas though. – PhilHoy Jan 03 '14 at 09:06
  • Thanks for the feedback! Don't forget to revert power management to the default by removing the cpupm line and running pmconfig again. – jlliagre Jan 03 '14 at 09:40
  • Answer updated. – jlliagre Jan 03 '14 at 09:48
  • I had already ruled out the application as a possible cause of the pauses; however, @jlliagre's wise words forced me to take another look. Embarrassingly, I was barking up the wrong tree and the app was to blame. Thanks everyone, and sorry to take up your time. – PhilHoy Jan 03 '14 at 12:36