4

We are experiencing strange behaviour in our MongoDB replica set, a setup of 3 nodes (all Xeon quad-core-class CPUs; 16GB of RAM on one node, 24GB on the other two). The node with less RAM is a normal secondary with priority 0, the other two have priority 1. Recently we have been seeing a replication lag of about 60 seconds every 3 to 4 hours, which disappears by itself after 2-3 minutes (reported by our Nagios checks!).

We have almost no traffic on those machines, just a few databases around 0.3GB in size and one of 5GB. One collection has about 65,000 entries, but it also has an _id index.

The strange thing is that the 16GB secondary has no lag; only the secondary of the two larger machines does. I just made that one primary to see whether the old primary (now a secondary) also shows this behaviour.

Does anyone know what we can do or check? We have no clue.

I checked the load and processes on those machines, the network connectivity and routing, and the disk states - everything is fine.

martinseener
  • 149
  • 11

1 Answer

2

A few quick checks:

  • Are you running on 2.0 or below? Replication got a major overhaul in 2.2
  • Do you have any capped collections? A missing index on _id in a capped collection can cause this kind of lag
  • You mention that the hosts are not too busy - if you have gaps in your new ops, the math used to calculate lag can falsely report lag when no ops are happening
  • How are you calculating the lag? I would definitely try to confirm any lag from the shell - the last optime from the entries in rs.status() would be a good start (see the shell sketch after this list)
  • Double check the network side of things: latency spikes and/or intermittent packet loss could cause this and be transient enough to be hard to detect (take a look at netstat --statistics before and after a lag spike, for example - see if retransmits or errors are increasing)
  • If you are running 2.2, see if switching the host the lagging secondary is syncing from helps - the current source is (somewhat confusingly) revealed by the syncingTo field in rs.status(). Switching is done using the rs.syncFrom() command.
  • If it's not there already, get the set into MMS and see if anything is spiking on/around the same time as the lag spike to point you in the right direction.
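
For reference, here is a minimal shell sketch of the optime check mentioned above. It only relies on the standard replSetGetStatus fields (members, stateStr, optimeDate, and syncingTo, the last of which only appears on 2.2+); the hostname passed to rs.syncFrom() is just a placeholder.

    // Compare each member's optimeDate against the primary's to get a rough
    // per-member lag, independent of the monitoring plugin's own maths.
    var s = rs.status();
    var primaryOptime = null;
    s.members.forEach(function (m) {
        if (m.stateStr === "PRIMARY") { primaryOptime = m.optimeDate; }
    });
    s.members.forEach(function (m) {
        var lagSecs = primaryOptime ? (primaryOptime - m.optimeDate) / 1000 : "n/a";
        print(m.name + " [" + m.stateStr + "] lag: " + lagSecs + "s" +
              (m.syncingTo ? " (syncing from " + m.syncingTo + ")" : ""));
    });

    // On 2.2+ only, run on the lagging secondary to switch its sync source:
    // rs.syncFrom("host2.example.com:27017");   // placeholder hostname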

If, after all that, you still don't know what's causing this, then it may be beyond answering on Server Fault in a reasonable way (it would need a look at logs, stats, etc.) - I'd recommend the mongodb-user Google group as the next step.

Adam C
  • 5,132
  • 2
  • 28
  • 49
  • Hey, thanks for your input. Here are the answers: we're running all 3 on 2.0.6 (Debian Squeeze AMD64 with the stock kernel); we have no capped collections; can you explain your point 3 more precisely, please?; we use the check_mongodb.py script (https://github.com/mzupan/nagios-plugin-mongodb), which runs db.runCommand({ "replSetGetStatus" : 1 }) and just uses the optime (Unix timestamp) to calculate the lag; I'll recheck the network settings, but it "should" be no problem; and I'll have a look at MMS too. Thanks for the help, I'll post an answer with my findings soon! – martinseener Dec 12 '12 at 08:28
  • On point 3, basically you can get false positives in terms of lag if you don't have operations replicating all the time (no updated optime on the secondary). For the lag calculation to be accurate there has to be a constant stream of ops - this is why the MMS lag graph has a bunch of logic to eliminate this kind of event. It's tough to diagnose, but you could do a simple test by doing a low level of inserts (say 1 every 5 seconds) to a test collection, thereby guaranteeing a constant flow (see the sketch after these comments). – Adam C Dec 12 '12 at 09:24
  • We switched the secondary (which had the lag) to primary yesterday by simply running rs.stepDown() on the old primary. Since then we have had no Nagios warnings/criticals for lag, and rs.status() always shows the same optime (or optimeDate) on all 3 instances. I'm still working through your other points and will have a look at MMS - sounds interesting! We are also considering an upgrade to the latest 2.2.x in the next few weeks, but we have to test it carefully beforehand with our Rails application! – martinseener Dec 12 '12 at 11:04
  • Today we switched back to the old primary and the replication lag occurred again on the secondary, even though it didn't happen while that secondary was primary for more than a day. Very awkward!! – martinseener Dec 13 '12 at 09:51
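
As a rough illustration of the constant-write test suggested in the comments above, here is a sketch for the mongo shell; the lagtest.heartbeat namespace is made up for illustration.

    // Insert one small document every 5 seconds so the secondaries' optimes keep
    // advancing and the lag calculation never runs on stale numbers.
    var hb = db.getSiblingDB("lagtest").heartbeat;   // made-up throwaway namespace
    for (var i = 0; i < 720; i++) {                  // roughly one hour of heartbeats
        hb.insert({ n: i, ts: new Date() });
        sleep(5000);                                  // mongo shell helper, milliseconds
    }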