I received a report from a Redis user, and I'm not sure what to reply since I'm not an expert on Linux and its scheduler. However, we (as the Redis project) need to figure out this kind of issue, especially since with Redis Cluster we'll have many Redis instances running at the same time on a single box. So I'm asking for some help here.
Problem:
- Kernel: "Linux redis1 2.6.32-305-ec2 #9-Ubuntu SMP Thu Apr 15 08:05:38 UTC 2010 x86_64 GNU/Linux"
- plenty of free RAM, no other processes doing significant I/O.
- Important: this is running on a big EC2 instance, not a real server. I have never seen anything like this in a non-virtualized environment. The EC2 instance was: "High-Memory Extra Large Instance 17.1 GB memory, 6.5 ECU (2 virtual cores with 3.25 EC2 Compute Units each), 420 GB of local instance storage, 64-bit platform".
Basically, once you restart a big Redis instance, the system gets so slow that you can no longer type in the shell. While loading, Redis uses 100% of CPU (it loads data as fast as possible) and reads the dump.rdb file sequentially. The I/O is not particularly high, as loading is CPU-bound, not I/O-bound.
Why on earth should a box with two CPUs, plenty of RAM, and nothing swapped to disk basically stop working under this workload?
I have the impression this has a lot to do with the fact that it's an EC2 instance, and so with the virtualization technology used, as I load 24 GB Redis datasets on my box all the time without any problem (even with other Redis instances running under high load).
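One way to test this suspicion would be to look at where the CPU time actually goes while Redis is loading. What follows is only a minimal sketch, assuming a Linux /proc/stat with the usual column order (user, nice, system, idle, iowait, irq, softirq, steal); the function names parse_cpu_line, cpu_breakdown and sample are made up for illustration and are not part of Redis or of the original report.

```python
# Column order of the aggregate "cpu" line in /proc/stat (assumed layout).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat content as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        # The aggregate line is labelled exactly "cpu"; per-core lines are "cpu0", "cpu1", ...
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # If fewer columns are present, treat the missing ones as 0.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_breakdown(before, after):
    """Percentage of CPU time spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def sample(path="/proc/stat"):
    """Read the current counters from the given file (hypothetical helper)."""
    with open(path) as f:
        return parse_cpu_line(f.read())
```

Calling sample() twice a few seconds apart while dump.rdb is being loaded, and passing both results to cpu_breakdown(), gives a rough picture: a high "steal" share would point at the hypervisor giving CPU time to other guests, a high "iowait" share would point at the disk, and a high "user" share would match the CPU-bound loading described above.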
Thanks for any hint!
Salvatore
Edit: adding some feedback I received from Twitter:
from @ezmobius: @antirez first thing to do is try it from /mnt or the local ephemeral drives to see if its EBS flakiness, 2nd is to make sure its not the "first write penalty" (google it) and if it is then you need to dd 0's across the disk first.
from @dvirsky: @antirez I'm running many redis instances on exactly such ec2 nodes. I've noticed some slowdown on bgsave but not this phenomenon.