65

Since Kubernetes 1.8, it seems I need to disable swap on my nodes (or set --fail-swap-on to false).

I cannot find the technical reason why Kubernetes insists on the swap being disabled. Is this for performance reasons? Security reasons? Why is the reason for this not documented?
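For readers who want to see whether a node currently has swap configured before deciding between disabling it and setting `--fail-swap-on` to false, here is a minimal sketch in Python. It parses the `SwapTotal` field of `/proc/meminfo` on Linux. The function names are made up for illustration, and this is not necessarily how the kubelet itself performs its check.

```python
def swap_total_kib(meminfo_text: str) -> int:
    """Return the SwapTotal value (in KiB) from /proc/meminfo-style text, or 0 if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line format: "SwapTotal:       2097148 kB"
            return int(line.split()[1])
    return 0


def swap_enabled(meminfo_text: str) -> bool:
    """True if the text reports a non-zero amount of configured swap."""
    return swap_total_kib(meminfo_text) > 0


def node_has_swap(path: str = "/proc/meminfo") -> bool:
    """Convenience wrapper (hypothetical name) that reads the real file on a Linux host."""
    with open(path) as f:
        return swap_enabled(f.read())
```

On a Linux node, `node_has_swap()` returning `True` means swap is still active, for example because a swap partition is listed in `/etc/fstab`.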

d4nyll
Jeroen Jacobs
  • [how to enable swap](https://stackoverflow.com/questions/47094861/error-while-executing-and-initializing-kubeadm/62158455#62158455) in k8s (suitable for specific setups) – Jossef Harush Kadouri Jun 02 '20 at 18:25
  • Disabling swap may protect data from [data remanence](https://en.wikipedia.org/wiki/Data_remanence) attacks. – vhs Dec 14 '20 at 13:19

4 Answers

50

The idea of Kubernetes is to pack instances tightly, as close to 100% utilization as possible. All deployments should be pinned with CPU/memory limits, so if the scheduler sends a pod to a machine, it should never need swap at all. You don't want to swap, since it will slow things down.

It's mainly for performance.
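The placement idea can be sketched as a toy first-fit check on memory requests. This is only an illustration under simplifying assumptions: the `Node` class and `pick_node` function are hypothetical, and the real scheduler filters and scores nodes on many more criteria than free memory.

```python
from dataclasses import dataclass
from typing import List, Optional

GIB = 1024 ** 3


@dataclass
class Node:
    name: str
    allocatable_bytes: int    # memory the node offers to pods
    requested_bytes: int = 0  # sum of memory requests already placed on it

    def free_bytes(self) -> int:
        return self.allocatable_bytes - self.requested_bytes


def pick_node(nodes: List[Node], pod_request_bytes: int) -> Optional[Node]:
    """Return the first node whose unrequested memory covers the pod's request."""
    for node in nodes:
        if node.free_bytes() >= pod_request_bytes:
            return node
    return None  # no node fits; the pod would stay pending


nodes = [
    Node("node-a", allocatable_bytes=8 * GIB, requested_bytes=5 * GIB),   # 3 GiB free
    Node("node-b", allocatable_bytes=16 * GIB, requested_bytes=8 * GIB),  # 8 GiB free
]
chosen = pick_node(nodes, 4 * GIB)
print(chosen.name if chosen else "unschedulable")
```

Because placement is decided against requests rather than against RAM plus swap, a pod that fits on a node is expected to fit in physical memory.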

Mike
  • 3
    ya the idea is if a node only has 3gig free to use.. and your new pod wants 4.. its going to go on another node. – Mike Nov 02 '17 at 14:10
  • 4
    This doesn't make all that much sense to me. Surely you could pack your nodes a bit further by letting the OS put some infrequently used memory pages in swap without harming performance in a noticeable way? – Frederik Baetens Jul 30 '19 at 13:35
  • 2
    another reason why kubes is dumb – tgwaste Jan 22 '20 at 02:45
  • 1
    Of course that reason is absurd. Linux always swaps, because it loads code on-demand. Which is the reason why Linux performance is always worse without swap. – Jan Hudec Jun 24 '20 at 14:41
  • 2
    So they recommend that the entire system crashes to a halt when the OOM killer starts reaping the first time you run out of RAM? How is that not a performance concern? – Shadur May 25 '21 at 11:55
35

TL;DR not properly using swap is just a lazy hack that demonstrates a poor understanding of the memory subsystems and a lack of fundamental systems administration skills. Designing infrastructure services and not understanding these systems is bound to end in failure.

So, I've got some commentary on this: it seems more like laziness to me than a feature or requirement. It's absolutely possible to properly handle swap, analyze the memory, and determine how to properly utilize the memory subsystem without hitting swap. There is a litany of tools built around this, and you can quite easily guarantee that a process will not utilize swap, so the performance argument is incorrect. It's simply lazy coding to not put this instrumentation in, and overall the complete removal of swap will be to the detriment of system performance. The key here is using it properly. I'll agree that swapping pods out to disk will impact performance; however, there are a number of things that should be swapped out to disk.

Additionally, the Linux kernel is designed to utilize swap, and completely disabling it is going to have negative consequences. A better way to handle this would be to pin the pods into main memory and not allow them to swap to disk, reduce the VFS cache pressure so that it does not swap unless it is absolutely necessary, and even then you could cause pinned processes to fail `malloc` in the event that main memory is exhausted.
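As a starting point for this kind of tuning, here is a minimal Python sketch that reads the current kernel settings from `/proc/sys/vm` on Linux. The helper names are hypothetical. Note that `vm.swappiness` is the knob usually associated with how readily the kernel swaps out anonymous pages, while `vm.vfs_cache_pressure` governs how aggressively dentry and inode caches are reclaimed; which values suit a given node is left open here.

```python
from pathlib import Path


def parse_sysctl_int(raw: str) -> int:
    """Parse the single integer stored in a /proc/sys file (trailing newline included)."""
    return int(raw.strip())


def read_vm_setting(name: str, root: str = "/proc/sys/vm") -> int:
    """Read one integer VM sysctl, e.g. 'swappiness' or 'vfs_cache_pressure'."""
    return parse_sysctl_int((Path(root) / name).read_text())


def show_vm_settings() -> None:
    """Print the two settings discussed above; Linux only, call it explicitly."""
    for name in ("swappiness", "vfs_cache_pressure"):
        print(f"vm.{name} = {read_vm_setting(name)}")
```

Calling `show_vm_settings()` on a node prints both values, which can then be adjusted with `sysctl` if desired.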

Depending on the processes in the containers, a hard failure of the container, or having it killed by the OOM killer, could result in some pretty disastrous outcomes. I understand, however, that the processes run in these containers should ideally be stateless and ephemeral, but in 20 years of running systems, I have not once seen everyone follow the intended design to the letter 100% of the time.

Furthermore, this doesn't take into account future technologies such as non-volatile memory and newer memory systems like Intel 3D XPoint, which can be used to extend main memory significantly using hybrid disk/memory systems. These types of systems can be used directly as supplemental main memory, or can back swap files that extend main memory with negligible performance impact.

  • 10
    I highly doubt the maintainers of the Kubernetes project are lazy. None of the arguments proposed seem to be within the context of a containerized ecosystem running in Kubernetes. – spuder Jan 19 '19 at 21:49
  • 1
    "I highly doubt the maintainers of the kubernetes project are lazy." And why is that? Most of the software I have to support is atrocious, so to me, all devs are lazy/incompetent by default. Docker's authorization (or more like, the lack of) is also a complete joke. – bviktor Mar 04 '20 at 13:29
  • 1
    They are not lazy per se, it's just that their priorities seem to be on other aspects of K8S - it's still got a long-ass todo list, like better handling of storage. – Vladimir Akopyan May 31 '20 at 23:36
  • The allocated memory on an average Linux system is usually 10 or more times the amount of memory actually used. Most of the pages never get accessed (the stacks), but some do get touched and then never used again. If the system can't swap those pages, it will have less space to load code into and cache files in, which will hurt performance. Even the pods shouldn't really be pinned into memory. Maybe give them a lower swappiness. – Jan Hudec Jun 24 '20 at 14:49
  • Please do show a link for your claim that "the Linux Kernel is designed to use Swap". This is false in my experience, you will have a much more stable system without swap, swap is a relic from a time when RAM was scarce. When you run out of ram, the Out of memory killer will do its job. Swap, in any case, slows everything down, because the system and apps "expect" RAM to be fast. Windows on the other hand IS designed to need paging, you can erase the page file and set it to not use one and it still will :-D – Markus Bawidamann Oct 21 '20 at 04:44
  • @MarkusBawidamann, please keep in mind that on Linux, the swap file also stores paged-out pages. Linux does not have dedicated pagefiles. – Alain Pannetier Nov 14 '21 at 19:49
26

The reason for this, as I understand it, is that the kubelet isn't designed to handle swap situations and the Kubernetes team aren't planning to implement this as the goal is that pods should fit within the memory of the host.

From GitHub issue #53533:

Support for swap is non-trivial. Guaranteed pods should never require swap. Burstable pods should have their requests met without requiring swap. BestEffort pods have no guarantee. The kubelet right now lacks the smarts to provide the right amount of predictable behavior here across pods.
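The three classes named in the quote are derived from each container's requests and limits. Here is a simplified Python sketch of that classification, as I understand the documented rules; the `qos_class` function and its dictionary layout are illustrative assumptions, and quantities are compared as plain strings, so `"1Gi"` and `"1024Mi"` would wrongly count as different.

```python
from typing import Dict, List

# Each container is described as {"requests": {...}, "limits": {...}} (hypothetical layout).
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort from its containers."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"  # nothing requested, nothing promised
    return "Guaranteed" if guaranteed else "Burstable"


pod = [{"requests": {"cpu": "500m", "memory": "1Gi"},
        "limits": {"cpu": "500m", "memory": "1Gi"}}]
print(qos_class(pod))
```

A Guaranteed pod has its full memory limit reserved up front, which is why the quote says it should never require swap.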

sanmai
Rory McCune
  • And the problem with that is that, when it comes to it, guaranteed pods just require more swapping when there is no swap. If the anonymous memory can't be swapped out, the kernel will swap out the code instead, because swapping is still enabled; there is just no place to swap anonymous pages to. – Jan Hudec Jun 24 '20 at 15:50
1

There is a ticket to enable it again; you'll get more insight there:

https://github.com/kubernetes/kubernetes/issues/53533

rzr
  • 3
    This link-only answer does not contribute anything new. The github issue is already mentioned in [this almost two years old answer](https://serverfault.com/a/886266/58830). – Andrew Savinykh Aug 03 '19 at 02:44