
I have a RHEL5 workstation that has recently started to "hiccup". Roughly every 30 seconds, it appears to stop executing completely for about 4 seconds. Seemingly nothing runs during that period. Long-running processes seem to catch up on their input afterwards, but new processes simply don't get started.

Concrete examples:

  • I have this loop running in a shell:

    while date; do
       sleep 0.2
    done
    

    Output merely skips over the missing seconds:

    Fri Aug 13 15:20:29 EDT 2010
    Fri Aug 13 15:20:29 EDT 2010
    Fri Aug 13 15:20:29 EDT 2010
    Fri Aug 13 15:20:30 EDT 2010
    Fri Aug 13 15:20:30 EDT 2010
    Fri Aug 13 15:20:30 EDT 2010
    Fri Aug 13 15:20:30 EDT 2010
    Fri Aug 13 15:20:34 EDT 2010
    Fri Aug 13 15:20:34 EDT 2010
    Fri Aug 13 15:20:35 EDT 2010
    Fri Aug 13 15:20:35 EDT 2010
    Fri Aug 13 15:20:35 EDT 2010
    
  • When typing in a terminal, whether on the local console or remotely via ssh or telnet, echo pauses during the unresponsive period but catches up once the machine responds again, apparently with no loss of input, just lag.

  • Pings go unanswered during the unresponsive period, but are answered once the machine comes back:

    64 bytes from xxx: icmp_seq=1911 ttl=64 time=0.203 ms  
    64 bytes from xxx: icmp_seq=1912 ttl=64 time=0.199 ms  
    64 bytes from xxx: icmp_seq=1913 ttl=64 time=3202 ms  
    64 bytes from xxx: icmp_seq=1914 ttl=64 time=2196 ms  
    64 bytes from xxx: icmp_seq=1915 ttl=64 time=1197 ms  
    64 bytes from xxx: icmp_seq=1916 ttl=64 time=195 ms  
    64 bytes from xxx: icmp_seq=1917 ttl=64 time=0.201 ms  
    64 bytes from xxx: icmp_seq=1918 ttl=64 time=0.206 ms
    

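    This would seem to imply that the machine is actually receiving input during the unresponsive period: ping sends each echo request only once and never retransmits, so the late replies must be answers to requests that arrived and were queued during the stall.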

  • vmstat 1 output also stalls, but does not catch up afterwards; it's almost as if those few seconds never happened. It also shows an uptick in processes waiting to run (the r column) and a drop in interrupts (in) and context switches (cs):

    procs -----------memory----------  ---swap-- -----io---- --system-- -----cpu------
     r  b   swpd   free   buff  cache    si   so    bi    bo    in   cs us sy  id wa st
     0  0    132 3111220 305540 588012    0    0     0     0  1035  151  1  1  99  0  0
     0  0    132 3111096 305540 588012    0    0     0     0  1019  125  0  0  99  0  0
     0  0    132 3111220 305540 588012    0    0     0    44  1034  154  0  1  99  0  0
     1  0    132 3111096 305540 588012    0    0     0     0  1016  131  0  0  99  0  0
     6  0    132 3111096 305540 588012    0    0     0     0   417   82  0  0 100  0  0
     0  0    132 3111220 305540 588012    0    0     0     0  1041  155  0  1  99  0  0
     0  0    132 3111096 305540 588012    0    0     0     0  1019  123  1  1  99  0  0
     0  0    132 3111220 305540 588012    0    0     0     0  1032  142  0  1  99  0  0
     0  0    132 3111096 305544 588008    0    0     0    44  1019  134  0  0  99  0  0
    

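The ping observation can be checked with a little arithmetic. The sketch below assumes ping's default one-second interval between probes and takes the four delayed samples from the output above; `arrival_offsets` is a hypothetical helper name. It adds each probe's send offset to its reported round-trip time to estimate when the reply came back.

```shell
# Reads "icmp_seq rtt_ms" pairs on stdin and prints when each reply arrived,
# measured from the moment the first listed probe was sent.
# Assumption: probes are sent exactly 1000 ms apart (ping's default interval).
arrival_offsets() {
  awk 'NR == 1 { first = $1 }
       { printf "icmp_seq=%d arrived at +%d ms\n", $1, ($1 - first) * 1000 + $2 }'
}

# The four delayed samples from the ping output above.
printf '%s\n' '1913 3202' '1914 2196' '1915 1197' '1916 195' | arrival_offsets
```

For these samples the offsets come out as +3202, +3196, +3197 and +3195 ms, so all four replies came back within about 7 ms of each other. That is consistent with the requests being queued during the stall and answered in one burst when the machine resumed.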
Rebooting makes the problem go away for a while. This most recent time it took six days to come back. I'm not sure if that's consistent or not.

I had initially suspected that the problem might be related to the nVidia video driver module, but I shut down X Windows and removed the module, with no change in the symptoms.

There is nothing in dmesg or /var/log/messages that seems remotely relevant or in any way coincides with the hiccups. It does not appear to be an issue with a hard drive, as I would expect iowait to be prominent during the unresponsive period if that were the case, but it's not. It feels unlikely to be a hardware problem, as the hiccups are pretty regular. I've been unable to time them down to milliseconds, but it's a pretty consistent 30/4/30/4/30/4.
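One way to time the hiccups more precisely might be a loop that logs only the gaps instead of every timestamp. This is a minimal sketch, assuming bash and GNU date (for the `%N` nanoseconds format); the function names `now_ms`, `is_stall` and `watch_stalls` and their arguments are hypothetical, not taken from any existing tool.

```shell
#!/bin/bash
# Sketch only: names and parameters are hypothetical.

# Current wall-clock time in milliseconds (relies on GNU date's %N).
now_ms() {
  echo $(( $(date +%s%N) / 1000000 ))
}

# Succeeds when a measured gap (ms) exceeds the threshold (ms).
is_stall() {
  [ "$1" -gt "$2" ]
}

# watch_stalls COUNT INTERVAL_S THRESHOLD_MS
# Takes COUNT samples, sleeping INTERVAL_S between them, and prints a line
# whenever the wall-clock gap between two samples exceeds THRESHOLD_MS.
watch_stalls() {
  local count=$1 interval=$2 threshold=$3
  local prev cur gap i
  prev=$(now_ms)
  for (( i = 0; i < count; i++ )); do
    sleep "$interval"
    cur=$(now_ms)
    gap=$(( cur - prev ))
    if is_stall "$gap" "$threshold"; then
      echo "stall ended at $(date '+%T.%N'): gap of ${gap} ms"
    fi
    prev=$cur
  done
}
```

For example, `watch_stalls 1500 0.2 1000` would sample for about five minutes and report every gap longer than one second. Comparing the reported end times should show how regular the 30/4 pattern really is.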

Any ideas?

wfaulk
  • I just wanted to comment that I never solved this problem and the system doesn't exist any more. – wfaulk May 05 '12 at 13:27

2 Answers


My money is still on a hard disk failure. I've seen similar things happen on personal Windows desktops, and even an old Sun machine exhibited similar freezes. However, I won't claim I dug deep enough into those cases to notice seconds dropping out of a sleeping shell loop. Regardless, you might want to see if you can get any info out of your RAID controller, or otherwise rule out the hard disks.

wfaulk
Christopher Karel
  • Considering that it happens every 30 seconds, and that this is the default interval for flushing to disk (if there isn't any memory pressure), I agree with Christopher. Look at /proc/sys/vm/dirty_expire_centisecs. If it is about 3000, change it to another number and see if the freeze times change as well. – Mark Wagner Aug 13 '10 at 22:08
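The check suggested in the comment above could look roughly like this. It is a sketch under assumptions: `show_tunable` is a hypothetical helper, `dirty_writeback_centisecs` is a related tunable that the comment does not mention, and 1500 is an arbitrary test value rather than a recommendation.

```shell
# Print a centisecond-valued tunable together with its value in seconds.
# The file path is a parameter so the helper can be pointed at any file.
show_tunable() {
  local file=$1 cs
  cs=$(cat "$file") || return 1
  echo "${file##*/}: ${cs} cs = $(( cs / 100 )) s"
}

# Show the current settings if the files exist on this system.
for f in /proc/sys/vm/dirty_expire_centisecs \
         /proc/sys/vm/dirty_writeback_centisecs; do
  if [ -r "$f" ]; then show_tunable "$f"; fi
done

# To test the theory, change the expiry as root and watch whether the
# period of the freezes follows (1500 is an arbitrary example value):
# sysctl -w vm.dirty_expire_centisecs=1500
```

If the interval between freezes tracks the new value, disk writeback becomes the prime suspect; if it stays at 30 seconds, the cause is probably elsewhere.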

My server has hiccups, too. I found this tool: http://www.latencytop.org/. Unfortunately my hiccups are not occurring regularly.

guettli