
We have been having ongoing trouble with our ElastiCache Redis instance swapping. Amazon seems to have some crude internal monitoring in place which notices swap usage spikes and simply restarts the ElastiCache instance (thereby losing all our cached items). Here's the chart of BytesUsedForCache (blue line) and SwapUsage (orange line) on our ElastiCache instance for the past 14 days:

[Chart: Redis ElastiCache BytesUsedForCache (blue) and SwapUsage (orange), past 14 days]

You can see the pattern: growing swap usage appears to trigger reboots of our ElastiCache instance, at which point we lose all our cached items (BytesUsedForCache drops to 0).

The 'Cache Events' tab of our ElastiCache dashboard has corresponding entries:

Source ID         | Type          | Date                             | Event
cache-instance-id | cache-cluster | Tue Sep 22 07:34:47 GMT-400 2015 | Cache node 0001 restarted
cache-instance-id | cache-cluster | Tue Sep 22 07:34:42 GMT-400 2015 | Error restarting cache engine on node 0001
cache-instance-id | cache-cluster | Sun Sep 20 11:13:05 GMT-400 2015 | Cache node 0001 restarted
cache-instance-id | cache-cluster | Thu Sep 17 22:59:50 GMT-400 2015 | Cache node 0001 restarted
cache-instance-id | cache-cluster | Wed Sep 16 10:36:52 GMT-400 2015 | Cache node 0001 restarted
cache-instance-id | cache-cluster | Tue Sep 15 05:02:35 GMT-400 2015 | Cache node 0001 restarted

(snip earlier entries)

Amazon claims:

SwapUsage -- in normal usage, neither Memcached nor Redis should be performing swaps

Our relevant (non-default) settings:

  • Instance type: cache.r3.2xlarge
  • maxmemory-policy: allkeys-lru (we had been using the default volatile-lru previously without much difference)
  • maxmemory-samples: 10
  • reserved-memory: 2500000000
  • Checking the INFO command on the instance, I see mem_fragmentation_ratio between 1.00 and 1.05 (see the snippet below)
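For anyone who wants to pull the same figure, here is a minimal sketch of reading those memory stats with redis-py; the endpoint hostname is a placeholder, not our real node:

```python
# Sketch only: read Redis memory stats via INFO using redis-py.
# The endpoint below is a placeholder for your ElastiCache node.
import redis

r = redis.StrictRedis(host="my-cache.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)
mem = r.info("memory")
print(mem["mem_fragmentation_ratio"], mem["used_memory_human"])
```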

We have contacted AWS support but didn't get much useful advice: they suggested cranking reserved-memory up even higher (the default is 0, and we already have 2.5 GB reserved). We do not have replication or snapshots set up for this cache instance, so I believe no BGSAVEs should be occurring and causing additional memory use.

The maxmemory cap of a cache.r3.2xlarge is 62495129600 bytes, and although we hit our cap (minus our reserved-memory) quickly, it seems strange to me that the host operating system would feel pressured to use so much swap, and so quickly, unless Amazon has cranked up the OS swappiness setting for some reason. Any ideas why we'd be causing so much swap usage on ElastiCache/Redis, or workarounds we might try?

Josh Kupershmidt

2 Answers


Since nobody else had an answer here, I thought I'd share the only thing that has worked for us. First, these ideas did not work:

  • larger cache instance type: was having the same problem on smaller instances than the cache.r3.2xlarge we're using now
  • tweaking maxmemory-policy: neither volatile-lru nor allkeys-lru seemed to make any difference
  • bumping up maxmemory-samples
  • bumping up reserved-memory
  • forcing all clients to set an expiration time, generally at most 24 hours with a few rare callers allowing up to 7 days, but the vast majority of callers using 1-6 hours' expiration time.

Here is what finally did help, a lot: running a job every twelve hours that performs a SCAN over all keys in chunks (COUNT) of 10,000 (a rough sketch of the job is at the end of this answer). Here is the BytesUsedForCache of that same instance, still a cache.r3.2xlarge under even heavier usage than before, with the same settings as before:

[Chart: BytesUsedForCache after adding the periodic SCAN job]

The sawtooth drops in memory usage correspond to the times of the cron job. Over this two-week period our swap usage has topped out at ~45 MB (it used to top out at ~5 GB before each restart). And the Cache Events tab in ElastiCache reports no more Cache Restart events.

Yes, this seems like a kludge that users shouldn't have to do themselves, and Redis should be smart enough to handle this cleanup on its own. So why does it work? Well, Redis does little or no pre-emptive cleaning of expired keys, instead relying on lazy eviction of expired keys during GETs. Or, if Redis realizes memory is full, it will start evicting keys for each new SET, but my theory is that at that point Redis is already hosed.
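For concreteness, here is a minimal sketch of that cleanup job using redis-py; the endpoint hostname is a placeholder, and only the COUNT of 10,000 and the twelve-hour cadence come from the description above:

```python
# Sketch of the twice-daily cleanup job: walk the entire keyspace with SCAN
# in chunks of 10,000. The endpoint hostname is a placeholder.
import redis

r = redis.StrictRedis(host="my-cache.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)

cursor = 0
while True:
    # SCAN returns (next_cursor, keys); iterating until the cursor wraps back
    # to 0 touches every key, which gives Redis a chance to notice and drop
    # entries whose TTL has already expired.
    cursor, _keys = r.scan(cursor=cursor, count=10000)
    if cursor == 0:
        break
```

Scheduled from cron every twelve hours, that is the whole job.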

Josh Kupershmidt
  • Josh, just wondering if you had any more progress on working on this issue? We're running into a similar situation. Are you still using the same solution as before? – Andrew C Feb 06 '17 at 23:17
  • @AndrewC we do still have this same cache instance hanging around, with similar sawtooth behavior from SCANs, and just a few lingering swap usage spikes in the last 3 months -- nowhere near as bad as I posted in the question, mainly due to offloading activity away from this instance, and the `SCAN` job in the answer still provoking cleanup. AWS does now offer [Redis Cluster](https://aws.amazon.com/about-aws/whats-new/2016/10/amazon-elasticache-for-redis-adds-sharding-support-with-redis-cluster/) features which I bet would help for heavy usage. – Josh Kupershmidt Feb 15 '17 at 21:13
  • good to hear; we took a similar approach to offloading cache load onto separate caches. What's your hypothesis on how clustering would help reduce swap usage? Just by reducing overall load? – Andrew C Feb 16 '17 at 22:07
  • @JoshKupershmidt you're my hero. – Moriarty Jul 21 '17 at 01:11
  • You mentioned that redis will start evicting keys for each new SET (write operations). I am curious how SCAN (a read operation) is keeping the memory usage tight in your solution? – srgsanky Mar 05 '22 at 04:52

I know this may be old, but I ran into this in the AWS documentation.

https://aws.amazon.com/elasticache/pricing/ states that the r3.2xlarge has 58.2 GB of memory.

https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/ParameterGroups.Redis.html states that the system maxmemory is 62 GB (this is when the maxmemory-policy kicks in) and that it can't be changed. It seems that no matter what, with Redis in AWS we will swap..

tlindhardt
  • AWS has it right -- they say maxmemory is `62495129600` bytes, which is exactly 58.2 GiB. The [pricing page](https://aws.amazon.com/elasticache/pricing/) you linked has memory in units of GiB, not GB. The `maxmemory` parameter is presumably not modifiable because there are better knobs provided by Redis, such as `reserved-memory` (though that one didn't help me...), that are modifiable, and AWS doesn't want you misconfiguring the node by e.g. telling Redis to use more memory than the Elasticache VM actually has. – Josh Kupershmidt Aug 15 '18 at 22:24
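For illustration, the unit conversion behind that comment, with the numbers taken straight from the posts above (Python REPL):

```python
>>> 62495129600 / 1024**3   # GiB -- the unit the pricing page uses ("58.2")
58.203125
>>> 62495129600 / 1000**3   # GB -- roughly the "62 GB" in the parameter docs
62.4951296
```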