
Let me preface this by saying that this is a follow-up question to this topic.

That was "solved" by switching from Solaris (SmartOS) to Ubuntu for the memcached server. Now we've multiplied load by about 5x and are running into problems again.

We are running a site that is doing about 1,000 requests per minute; each request hits Memcached with approximately 3 reads and 1 write, so Memcached handles roughly 65 operations per second. Total data in the cache is about 37 MB, and each key contains a very small amount of data (a JSON-encoded array of integers amounting to less than 1 KB).

We have set up a benchmarking script on these pages and fed the data into StatsD for logging. The problem is that there are spikes where Memcached takes a very long time to respond. These do not appear to correlate with spikes in traffic.
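The timing wrapper is roughly like the sketch below. This is a hypothetical sketch, not the exact script; the StatsD host, port, metric name, and function name are placeholders. It times each Memcached get and fires the duration at StatsD over UDP in the standard "name:value|ms" format.

<?php
// Hypothetical sketch of the timing wrapper (not the actual benchmarking script).
function timed_get(Memcached $mc, $key, $statsdHost = '127.0.0.1') {
    $start = microtime(true);
    $value = $mc->get($key);
    $ms = (microtime(true) - $start) * 1000;

    // StatsD timing metric: "name:value|ms" sent over UDP port 8125.
    $sock = @fsockopen('udp://' . $statsdHost, 8125);
    if ($sock) {
        fwrite($sock, sprintf('memcached.get:%0.3f|ms', $ms));
        fclose($sock);
    }
    return $value;
}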

[Graph: Execution Time from StatsD]

What could be causing these spikes? Why would memcached take over a second to reply? We just booted up a second server to put in the pool and it didn't make any noticeable difference in the frequency or severity of the spikes.

This is the output of getStats() on the servers:

Array
(
    [-----------] => Array
        (
            [pid] => 1364
            [uptime] => 3715684
            [threads] => 4
            [time] => 1336596719
            [pointer_size] => 64
            [rusage_user_seconds] => 7924
            [rusage_user_microseconds] => 170000
            [rusage_system_seconds] => 187214
            [rusage_system_microseconds] => 190000
            [curr_items] => 12578
            [total_items] => 53516300
            [limit_maxbytes] => 943718400
            [curr_connections] => 14
            [total_connections] => 72550117
            [connection_structures] => 165
            [bytes] => 2616068
            [cmd_get] => 450388258
            [cmd_set] => 53493365
            [get_hits] => 450388258
            [get_misses] => 2244297
            [evictions] => 0
            [bytes_read] => 2138744916
            [bytes_written] => 745275216
            [version] => 1.4.2
        )

    [-----------:11211] => Array
        (
            [pid] => 8099
            [uptime] => 4687
            [threads] => 4
            [time] => 1336596719
            [pointer_size] => 64
            [rusage_user_seconds] => 7
            [rusage_user_microseconds] => 170000
            [rusage_system_seconds] => 290
            [rusage_system_microseconds] => 990000
            [curr_items] => 2384
            [total_items] => 225964
            [limit_maxbytes] => 943718400
            [curr_connections] => 7
            [total_connections] => 588097
            [connection_structures] => 91
            [bytes] => 562641
            [cmd_get] => 1012562
            [cmd_set] => 225778
            [get_hits] => 1012562
            [get_misses] => 125161
            [evictions] => 0
            [bytes_read] => 91270698
            [bytes_written] => 350071516
            [version] => 1.4.2
        )

)

Edit: Here are the results of setting and then retrieving 10,000 values.

Normal:

Stored 10000 values in 5.6118 seconds.
Average: 0.0006
High: 0.1958
Low: 0.0003

Fetched 10000 values in 5.1215 seconds.
Average: 0.0005
High: 0.0141
Low: 0.0003

When Spiking:

Stored 10000 values in 16.5074 seconds.
Average: 0.0017
High: 0.9288
Low: 0.0003

Fetched 10000 values in 19.8771 seconds.
Average: 0.0020
High: 0.9478
Low: 0.0003
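For reference, the numbers above come from a simple set/fetch loop; a minimal sketch of such a benchmark (assumed shape, not the exact script used; server address and key prefix are placeholders) is:

<?php
// Minimal sketch of the 10,000-value set benchmark; the fetch pass is analogous.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211); // placeholder server

$times = array();
for ($i = 0; $i < 10000; $i++) {
    $start = microtime(true);
    $mc->set('bench_' . $i, json_encode(array($i, $i + 1, $i + 2)));
    $times[] = microtime(true) - $start;
}

printf("Stored %d values in %.4f seconds.\n", count($times), array_sum($times));
printf("Average: %.4f\nHigh: %.4f\nLow: %.4f\n",
    array_sum($times) / count($times), max($times), min($times));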
Brad Dwyer
  • Might I suggest you run a test(!) instance of that memcached under whatever the equivalent of strace is on Solaris (IINM it is called truss; I haven't dealt with Solaris for ages though...) – rackandboneman May 09 '12 at 22:04

3 Answers


There may be a problem with the network stack. I had a similar problem with memcached, and the cause was running out of Linux conntrack entries. Can you check the output of netstat -s -t before and after the spikes to look for TCP errors and retransmits? You can also use Wireshark to capture a traffic dump for more information on the problem.

DukeLion

In addition to examining your network stats, could you upgrade to version 1.4.10 (or higher), which has performance improvements? From the 1.4.10 release notes:

This release is focused on thread scalability and performance improvements. This release should be able to feed data back faster than any network card can support as of this writing.

While our servers don't get the traffic yours do, in our case an upgrade to 1.4.10 helped, as did enabling compression (we had values larger than 1 MB) and the binary protocol. See my answer on Drupal SE for details.
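With the PHP Memcached client, the binary protocol and compression mentioned above are one-time option changes. A minimal sketch, assuming the php-memcached extension and a placeholder server address:

<?php
// Minimal sketch: enable the binary protocol and compression on the
// php-memcached client (server address is a placeholder).
$mc = new Memcached();
$mc->setOption(Memcached::OPT_BINARY_PROTOCOL, true);
$mc->setOption(Memcached::OPT_COMPRESSION, true);
$mc->addServer('127.0.0.1', 11211);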

KM.

The issue turned out to be that the calling machine was eating up all of its available CPU. This caused odd behavior and made it appear that memcached was the problem when, in fact, it was not; the slow responses were just a symptom of the bigger problem.

Scaling the web tier horizontally solved the issue.

If you are on Joyent, the useful command to see how much of your CPU cap you're using is jinf -c.

Brad Dwyer