363

*NOTE: if your server still has issues due to confused kernels, and you can't reboot - the simplest solution proposed with gnu date installed on your system is: date -s now. This will reset the kernel's internal "time_was_set" variable and fix the CPU hogging futex loops in java and other userspace tools. I have straced this command on my own system an confirmed it's doing what it says on the tin *

POSTMORTEM

Anticlimax: only thing that died was my VPN (openvpn) link to the cluster, so there was an exciting few seconds while it re-established. Everything else was fine, and starting up ntp went cleanly after the leap second had passed.

I have written up my full experience of the day at http://blog.fastmail.fm/2012/07/03/a-story-of-leaping-seconds/

If you look at Marco's blog at http://my.opera.com/marcomarongiu/blog/2012/06/01/an-humble-attempt-to-work-around-the-leap-second - he has a solution for phasing the time change over 24 hours using ntpd -x to avoid the 1 second skip. This is an alternative smearing method to running your own ntp infrastructure.


Just today, Sat June 30th, 2012 - starting soon after the start of the day GMT. We've had a handful of servers in different datacentres as managed by different teams all go dark - not responding to pings, screen blank.

They're all running Debian Squeeze - with everything from stock kernel to custom 3.2.21 builds. Most are Dell M610 blades, but I've also just lost a Dell R510 and other departments have lost machines from other vendors too. There was also an older IBM x3550 which crashed and which I thought might be unrelated, but now I'm wondering.

The one crash which I did get a screen dump from said:

[3161000.864001] BUG: spinlock lockup on CPU#1, ntpd/3358
[3161000.864001]  lock: ffff88083fc0d740, .magic: dead4ead, .owner: imapd/24737, .owner_cpu: 0

Unfortunately the blades all supposedly had kdump configured, but they died so hard that kdump didn't trigger - and they had console blanking turned on. I've disabled console blanking now, so fingers crossed I'll have more information after the next crash.

Just want to know if it's a common thread or "just us". It's really odd that they're different units in different datacentres bought at different times and run by different admins (I run the FastMail.FM ones)... and now even different vendor hardware. Most of the machines which crashed had been up for weeks/months and were running 3.1 or 3.2 series kernels.

The most recent crash was a machine which had only been up about 6 hours running 3.2.21.

THE WORKAROUND

Ok people, here's how I worked around it.

  1. disabled ntp: /etc/init.d/ntp stop
  2. created http://linux.brong.fastmail.fm/2012-06-30/fixtime.pl (code stolen from Marco, see blog posts in comments)
  3. ran fixtime.pl without an argument to see that there was a leap second set
  4. ran fixtime.pl with an argument to remove the leap second

NOTE: depends on adjtimex. I've put a copy of the squeeze adjtimex binary at http://linux.brong.fastmail.fm/2012-06-30/adjtimex — it will run without dependencies on a squeeze 64 bit system. If you put it in the same directory as fixtime.pl, it will be used if the system one isn't present. Obviously if you don't have squeeze 64-bit… find your own.

I'm going to start ntp again tomorrow.

As an anonymous user suggested - an alternative to running adjtimex is to just set the time yourself, which will presumably also clear the leapsecond counter.

Bron Gondwana
  • 1,738
  • 3
  • 12
  • 15
  • 58
    There's a leap second today, the 30th. I'm hesitant to imply that is your problem, but I will be watching my Debian machines closely. – jscott Jun 30 '12 at 17:09
  • 1
    yeah, I wondered about the leap second too, particularly with ntpd showing up - the thing is, it's a bit too convenient. I'm not convinced it's the cause yet. – Bron Gondwana Jun 30 '12 at 17:10
  • 1
    Related: [Does CentOS 5.4 properly handle leap seconds?](http://serverfault.com/questions/402087/does-centos-5-4-properly-handle-leap-seconds) – techie007 Jun 30 '12 at 18:45
  • 3
    lkml posting about this http://lkml.indiana.edu/hypermail/linux/kernel/1203.1/04598.html – Daniel S. Sterling Jun 30 '12 at 19:50
  • Interesting... so it could indeed be ntpd that's responsible. Maybe I'll just turn it off until afterwards... – Bron Gondwana Jun 30 '12 at 20:03
  • 2
    Thanks for reporting this! I'm now staring at my servers very, very closely. – Janne Pikkarainen Jun 30 '12 at 20:36
  • I've run `adjtimex -p` on all of my servers; they all report status 17 which I interpret as meaning they are going to add a leap second at midnight UTC (status bit 16). As far as I understand this means the kernel has already received `adjtimex()` and if it was going to livelock it already would have. Anyone know if that's accurate? – grifferz Jun 30 '12 at 21:38
  • 2
    since morning We've lost at least 9 different debian squeeze boxes from various vendors all running stock squeeze 2.6.32 kernel. we haven't been able to get a crash dump due to console blanking as well... – kargig Jun 30 '12 at 18:40
  • nice to get confirmation that it's not just us! I'll update as soon as I get a console dump. You know what they say about watched pots! – Bron Gondwana Jun 30 '12 at 18:42
  • have you had any server crash more than once ? we only had 1 such occasion... – kargig Jun 30 '12 at 19:22
  • not the same server - but I did have one which I restarted because two other machines with the same role next to it had just crashed within hours of each other and I suspected at the time it may be uptime related. They had all been restarted close together before. – Bron Gondwana Jun 30 '12 at 19:36
  • For what it's worth, my company has 17 physical machines running a combined 52 virtual machines, all Debian 6 with a custom kernel to support OpenVZ (still 2.6.32 though). We run one server with ntpd, the rest use ntpdate and are configured to talk to the main server that runs ntpd to update their time once per hour. No crashes yet but people are saying it won't happen til midnight - where are you servers located Bron? We're west coast US so it's only 5pm here. Getting a little nervous, may disable ntpd til tomorrow :) – Sean Jul 01 '12 at 00:10
  • @Sean I'm going to assume your servers are really running UTC and just adjusting local time via `/etc/timezone`... All my Debian boxes survived, `date -u` is reporting `Sun Jul 1 00:14:53 UTC 2012` so I think we're good. – jscott Jul 01 '12 at 00:15
  • I've just noticed a lot of Debian Squeeze machines, running Java (OpenJDK) with Tomcat, with java processes burning CPU at 100%. The OS itself is fine, syslog shows it inserted the leap second correctly. – Ricardo Pardini Jul 01 '12 at 01:20
  • @jscott ah, right... short a few brain cells today :) well, glad to be a bearer of good news - as I said I have 17 physical debian 6 servers and they're all just fine and dandy today. – Sean Jul 01 '12 at 02:19
  • @paradroid If not that, then probably the AWS outage on the East Coast today. – jscott Jul 01 '12 at 02:39
  • FYI, blogpost about Google hacking it's ntpd servers to support "seamless" leap second: [Time, technology and leaping seconds](http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html) – SaveTheRbtz Jul 01 '12 at 01:31
  • I lost one machine at midnight GMT - running Debian Squeeze (possibly still running the Debian Lenny kernel ~2.6.26-2-686-bigmem). But oddly, I have another Debian Squeeze instance that's definitely still running that Lenny kernel and it is still up and running (knock wood). –  Jul 01 '12 at 08:17
  • 5
    The LKML thread indicated that `date -s "\`date\`"` helps - it certainly helped me. – Pointy Jul 01 '12 at 14:25

5 Answers5

318

This is caused by a livelock when ntpd calls adjtimex(2) to tell the kernel to insert a leap second. See lkml posting http://lkml.indiana.edu/hypermail/linux/kernel/1203.1/04598.html

Red Hat should also be updating their KB article as well. https://access.redhat.com/knowledge/articles/15145

UPDATE: Red Hat has a second KB article just for this issue here: https://access.redhat.com/knowledge/solutions/154713 - the previous article is for an earlier, unrelated problem

The work-around is to just turn off ntpd. If ntpd already issued the adjtimex(2) call, you may need to disable ntpd and reboot to be 100% safe.

This affects RHEL 6 and other distros running newer kernels (newer than approx 2.6.26), but not RHEL 5.

The reason this is occurring before the leap second is actually scheduled to occur is that ntpd lets the kernel handle the leap second at midnight, but needs to alert the kernel to insert the leap second before midnight. ntpd therefore calls adjtimex(2) sometime during the day of the leap second, at which point this bug is triggered.

If you have adjtimex(8) installed, you can use this script to determine if flag 16 is set. Flag 16 is "inserting leap second":

adjtimex -p | perl -p -e 'undef $_, next unless m/status: (\d+)/; (16 & $1) && print "leap second flag is set:\n"'

UPDATE:

Red Hat has updated their KB article to note: "RHEL 6 customers may be affected by a known issue that causes NMI Watchdog to detect a hang when receiving the NTP leapsecond announcement. This issue is being addressed in a timely manner. If your systems received the leapsecond announcement and did not experience this issue, then they are no longer affected."

UPDATE: The above language was removed from the Red Hat article; and a second KB solution was added detailing the adjtimex(2) crash issue: https://access.redhat.com/knowledge/solutions/154713

However, the code change in the LKML post by IBM Engineer John Stultz notes there may also be a deadlock when the leap second is actually applied, so you may want to disable the leap second by rebooting or using adjtimex(8) after disabling ntpd.

FINAL UPDATE:

Well, I'm no kernel dev, but I reviewed John Stultz's patch again here: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d

If I'm reading it right this time, I was wrong about there being another deadlock when the leap second is applied. That seems to be Red Hat's opinion as well, based on their KB entry. However, if you have disabled ntpd, keep it disabled for another 10 minutes, so that you don't hit the deadlock when ntpd calls adjtimex(2).

We'll find out if there are any more bugs soon :)

POST-LEAP SECOND UPDATE:

I spent the last few hours reading through the ntpd and pre-patch (buggy) kernel code, and while I may be very wrong here, I'll attempt to explain what I think was going on:

First, ntpd calls adjtimex(2) all the time. It does this as part of its "clock loop filter", defined in local_clock in ntp_loopfilter.c. You can see that code here: http://www.opensource.apple.com/source/ntp/ntp-70/ntpd/ntp_loopfilter.c (from ntp version 4.2.6).

The clock loop filter runs quite often -- it runs every time ntpd polls its upstream servers, which by default is every 17 minutes or more. The relevant bit of the clock loop filter is:

if (sys_leap == LEAP_ADDSECOND)
    ntv.status |= STA_INS;

And then:

ntp_adjtime(&ntv)

In other words, on days when there's a leap second, ntpd sets the "STA_INS" flag and calls adjtimex(2) (via its portability-wrapper).

That system call makes its way to the kernel. Here's the relevant kernel code: https://github.com/mirrors/linux/blob/a078c6d0e6288fad6d83fb6d5edd91ddb7b6ab33/kernel/time/ntp.c

The kernel codepath is roughly this:

  • line 663 - start of do_adjtimex routine.
  • line 691 - cancel any existing leap-second timer.
  • line 709 - grab the ntp_lock spinlock (this lock is involved in the possible livelock crash)
  • line 724 - call process_adjtimex_modes.
  • line 616 - call process_adj_status.
  • line 590 - set time_status global variable, based on flags set in adjtimex(2) call
  • line 592 - check time_state global variable. in most cases, call ntp_start_leap_timer.
  • line 554 - check time_status global variable. STA_INS will be set, so set time_state to TIME_INS and call hrtimer_start (another kernel function) to start the leap second timer. in the process of creating a timer, this code grabs the xtime_lock. if this happens while another CPU has already grabbed the xtime_lock and the ntp_lock, then the kernel livelocks. this is why John Stultz wrote the patch to avoid using hrtimers. This is what was causing everyone trouble today.
  • line 598 - if ntp_start_leap_timer did not actually start a leap timer, set time_state to TIME_OK
  • line 751 - assuming the kernel does not livelock, the stack is unwound and the ntp_lock spinlock is released.

There are a couple interesting things here.

First, line 691 cancels the existing timer every time adjtimex(2) is called. Then, 554 re-creates that timer. This means each time ntpd ran its clock loop filter, the buggy code was invoked.

Therefore I believe Red Hat was wrong when they said that once ntpd had set the leap-second flag, the system would not crash. I believe each system running ntpd had the potential to livelock every 17 minutes (or more) for the 24-hour period before the leap-second. I believe this may also explain why so many systems crashed; a one-time chance of crashing would be much less likely to hit as compared to 3 chances an hour.

UPDATE: In Red Hat's KB solution at https://access.redhat.com/knowledge/solutions/154713 , Red Hat engineers did come to the same conclusion (that running ntpd would continuously hit the buggy code). And indeed they did so several hours before I did. This solution wasn't linked to the main article at https://access.redhat.com/knowledge/articles/15145 , so I didn't notice it until now.

Second, this explains why loaded systems were more likely to crash. Loaded systems will be handling more interrupts, causing the "do_tick" kernel function to be called more often, giving more of a chance for this code to run and grab the ntp_lock while the timer was being created.

Third, is there a chance of the system crashing when the leap-second actually occurs? I don't know for sure, but possibly yes, because the timer that fires and actually executes the leap-second adjustment (ntp_leap_second, on line 388) also grabs the ntp_lock spinlock, and has a call to hrtimer_add_expires_ns. I don't know if that call might also be able to cause a livelock, but it doesn't seem impossible.

Finally, what causes the leap-second flag to be disabled after the leap-second has run? The answer there is ntpd stops setting the leap-second flag at some point after midnight when it calls adjtimex(2). Since the flag isn't set, the check on line 554 will not be true, and no timer will be created, and line 598 will reset the time_state global variable to TIME_OK. This explains why if you checked the flag with adjtimex(8) just after the leap second, you would still see the leap-second flag set.

In short, the best advice for today seems to be the first I gave after all: disable ntpd, and disable the leap-second flag.

And some final thoughts:

  • none of the Linux vendors noticed John Stultz's patch and applied it to their kernels :(
  • why didn't John Stultz alert some of the vendors this was needed? perhaps the chance of the livelock seemed low enough making noise wasn't warranted.
  • I've heard reports of Java processes locking up or spinning when the leap-second was applied. Perhaps we should follow Google's lead and rethink how we apply leap-seconds to our systems: http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html

06/02 Update from John Stultz:

https://lkml.org/lkml/2012/7/1/203

The post contained a step-by-step walk-through of why the leap second caused the futex timers to expire prematurely and continuously, spiking the CPU load.

britoman
  • 3
  • 1
Daniel S. Sterling
  • 1,574
  • 2
  • 10
  • 13
  • 7
    Thanks for the excellent answer. So the rest of our servers are sitting waiting to crash. Lovely. Rolling restarts here we come! – Bron Gondwana Jun 30 '12 at 20:16
  • 3
    How do I know if the `adjtimex` has been issued, does the kernel print something in dmesg? What chance is there that a system that didn't crash before turning ntpd off will crash? – Hubert Kario Jun 30 '12 at 21:20
  • 3
    Hubert: run "adjtimex" (it's usually packaged separately) and look for flag 16 to indicate a leap second pending. – Dominic Cleal Jun 30 '12 at 21:26
  • @m0dlx: got it, thanks!. If the number printed by `echo $((`adjtimex -p | awk '/status/ { print $2 }'`/16))` is odd then kernel is adding a leap second. Now, should I reboot the servers or wait for them to crash... – Hubert Kario Jun 30 '12 at 21:46
  • Hubert, my understanding from Daniel's answer is that ntpd making the original leap second adjtimex call is causing the crash, so perhaps the presence of the flag indicates this event has passed without incident? – Dominic Cleal Jun 30 '12 at 21:57
  • Does this mean all android devices of the world could have a problem too? ... puh 00:01, 01.07.2012 is now showing up on it ;) or is it uct time? – Karussell Jun 30 '12 at 22:01
  • 22
    You're going to hate the rep cap. – Wesley Jun 30 '12 at 22:06
  • @m0dlx: My environment will work all right even if I have time up to a minute off (in a worst case scenario), on the other hand, if it crashes, I'll have to move my carbon-based shell to $WORK, on Sunday, to reset those servers. So... better safe than sorry. – Hubert Kario Jun 30 '12 at 22:22
  • 26
    @WesleyDavid: No worries, the rep cap will reset at UTC midnight. Maybe. – mmyers Jun 30 '12 at 23:01
  • Question: i though the kernel was responsible for adding the leap second, but if i understand correctly, it's a time aware program like NTP which insert the leap second using adjtimex(2) ? – Benoît Jun 30 '12 at 23:38
  • 1
    And we survived. Thanks for all the useful data in this answer! – David Poblador i Garcia Jul 01 '12 at 00:03
  • @Benoît: The kernel is responsible for adding the leap second, but something has to tell it to insert the leap second. How else would it know there was a leap second? – David Schwartz Jul 01 '12 at 00:16
  • 3
    > I've heard reports of Java processes locking up or spinning when the leap-second was applied. Yup, that happened to us (ElasticSearch) – Ask Bjørn Hansen Jul 01 '12 at 03:50
  • @DavidSchwartz so how does NTP keep track of leap seconds ? do you have to keep it update it so it knows there's going to be a leap second, or can it learn it from peers ? – Benoît Jul 01 '12 at 07:08
  • @DavidSchwartz - also, why didn't it get backported to the other stable kernels? I was running 3.2.18 or 3.2.21 on almost all those machines, and thye both came out well after that patch. – Bron Gondwana Jul 01 '12 at 07:57
  • @Benoît: It can learn it from peers. – David Schwartz Jul 01 '12 at 08:10
  • @DavidSchwartz any litterature about this ? How does it broadcast that there's going to be a leap second ? Or does you simply suddenly realize that you're running one second ahead from your peers ? But how do they know they have to insert a leap second ? Is there something like a timezone db for leap seconds that administrators have to keep updated on their servers ? – Benoît Jul 01 '12 at 09:18
  • 2
    @Benoît: Information about immediately pending leap seconds is passed along with time information in the NTP packets. People who operate trusted time servers can manually insert them any time in the 24 hours preceding the leap second or they can configure an NTP [leapfile](http://support.ntp.org/bin/view/Support/ConfiguringNTP#Section_6.14.). (This leap second was [announced](http://hpiers.obspm.fr/iers/bul/bulc/bulletinc.dat) on January 5, so there was plenty of time to prepare.) – David Schwartz Jul 01 '12 at 09:25
  • 1
    +1, but waiting until tomorrow to upvote you, so it doesn't get swallowed by the rep cap :) – Roy Tinker Jul 02 '12 at 21:48
  • Apparently it's more than just a deadlock in ntpd, there's (at least) five different bugs according to http://landslidecoding.blogspot.se/2012/07/linuxs-leap-second-deadlocks.html – slu Jul 03 '12 at 10:09
33

This hit us hard. After restarting many of our hosts, the following turned out to be embarrassingly simple and fully effective without a host restart:

/etc/init.d/ntp stop
ntpdate 0.us.pool.ntp.org
/etc/init.d/ntp start

All that is required is to reset the system clock. Sheesh. What I've give to have known this six hours ago.

HikeOnPast
  • 539
  • 3
  • 8
  • 8
    `date -s "\`date\`"` worked for me. – Pointy Jul 01 '12 at 14:28
  • @DeanB: I posted at 3 am UTC that resetting the clock does the trick, unfortunately it took a while to get moderated up. We did start out rebooting servers too – Gregor Jul 02 '12 at 07:57
24

A simple C program that clears the leap second bit in the kernel's time status field:

#include <sys/timex.h>
#include <string.h>
#include <stdio.h>

int main(int argc, char **argv) {
    struct timex txc;
    int ret;

    (void) argc;
    (void) argv;

    bzero(&txc, sizeof(txc));
    txc.modes = 0;  /* fetch */
    ret = adjtimex(&txc);
    if (ret < 0) {
        perror("adjtimex (get)");
        return 1;
    }

    txc.modes = ADJ_STATUS;
    txc.status &= ~16;
    ret = adjtimex(&txc);
    if (ret < 0) {
        perror("adjtimex (set)");
        return 1;
    }

    return 0;
}

Save as lsec.c, compile with gcc -Wall -Wextra -o lsec lsec.c and run as root.

You'll likely want to stop ntpd before running it, and restart ntpd after the leap second.

jon
  • 890
  • 5
  • 15
  • What does `(void) argc;` accomplish? Silence the warning for the unused variable? Wouldn't using `int main()` accomplish the same? Not trying to be a pedant, I'm genuinely curious. – gparent Nov 29 '12 at 00:25
18

Postmortem it seems ./lsec does not have an effect.

What we're seeing is lots of softirqd processes eating CPU (usually linear to the load of java processes)

What does work to fix POSTMORTEM with leap seconds already applied by ntp is the following:

It appears to be sufficient to just issue:

export LANG="en_EN"; date -s "`date`"

This should reduce the load without ntpd restart or reboot. Alternatively you can issue:

apt-get install ntpdate
/etc/init.d/ntpd stop; ntpdate pool.ntp.org; /etc/init.d/ntpd start
Gregor
  • 181
  • 4
  • why `sntp -s` and not `ntpdate`? – errordeveloper Jul 01 '12 at 04:12
  • ntpdate is just a wrapper to sntp here, sure it's fine to use ntpdate as well. – Gregor Jul 01 '12 at 04:16
  • ah i completely missed there is a ntpdate package for squeeze where it's actually a binary. I've edited my posting to include this. – Gregor Jul 01 '12 at 09:06
  • I've heard similar reports of fixing this issue as well (such as using `date -s`). It sounds like the fix just requires setting the system time instead of slewing it (the default ntpd behavior when offset is small). I'm guessing setting the time causes the kernel's internal time-keeping mechanics to reset themselves. – phemmer Jul 01 '12 at 09:39
  • 4
    My java apps CPU usage spiked too (with high amount of CPU time spent in softirqd), this fixed it. – Hubert Kario Jul 01 '12 at 14:28
  • +1 on high Java process usage (Hadoop processes) and ksoftirqd daemons eating CPU. And date -s "`date`" fixing it. Thankyou! – Erik Forsberg Jul 02 '12 at 07:04
16

http://my.opera.com/marcomarongiu/blog/2012/03/12/no-step-back seems to indicate that the Debian squeeze kernel won't handle the leap second.

This thread on comp.protocols.tim.ntp is of interest, also: https://groups.google.com/forum/?fromgroups#!topic/comp.protocols.time.ntp/KSflIgjUdPE

That said, the leap second hasn't happened yet: 23:59:60 UTC

Finally, https://access.redhat.com/knowledge/articles/15145 has the following to say: "When the leap second occurs, the kernel prints a message to the system log. There is a chance that the printing of this message can cause the kernel to crash in Red Hat Enterprise Linux."

Luca Filipozzi
  • 304
  • 1
  • 4