How to load entire process memory from swap quickly (linux)?


I run a bunch of CPU-hungry processes in parallel; they normally use a few GB of memory each. From time to time they also allocate a large amount of memory (say 150-250 GB). Usually at most one of the processes does so at a time, so they fit in the available RAM (384 GB on my machine). However, it sometimes happens that several of them allocate this large amount at the same time and (obviously) everything slows down because of swapping.

In such cases I stop all but one of the memory-hog processes, to allow it to compute effectively. But it takes ages to swap a stopped process back in, since that means loading tens of gigabytes from disk in a random access pattern. Hence the question: how can I force a single process to sequentially load its entire memory from swap?

So far I've only found the swappiness kernel hint, which (with the help of cgroups) can prevent a process from swapping out further, but it doesn't help with the performance of swapping back in. Turning off swap entirely is obviously not possible, since the other, stopped processes have to occupy space there.
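
(For reference, a minimal sketch of what I mean by the swappiness hint via cgroups; it assumes a cgroup-v1 memory controller mounted at /sys/fs/cgroup/memory, uses "memhogs" purely as an example group name, and must be run as root:)

#!/usr/bin/python2
# Sketch only: lower swappiness for one process via a cgroup-v1 memory cgroup.
# Paths and the cgroup name are assumptions, not my exact setup.
import os
import sys

pid = sys.argv[1]                        # pid of the process to confine
cg = "/sys/fs/cgroup/memory/memhogs"     # example cgroup directory

if not os.path.isdir(cg):
    os.mkdir(cg)                         # create the cgroup if missing

with open(os.path.join(cg, "memory.swappiness"), "w") as f:
    f.write("0")                         # discourage swapping out this group

with open(os.path.join(cg, "cgroup.procs"), "w") as f:
    f.write(pid)                         # move the process into the cgroup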

Building my own mini-scheduler is also not an option: the processes are various small scripts/programs in Python, and the memory peaks generally happen inside library calls, so I cannot predict when a peak will occur.


Just to be clear: I'm not considering buying terabytes of RAM; at this scale it is too expensive. Putting swap on an SSD/SSD array would help only a little (measured), so it is also not the solution I'm looking for.


(partial self-answer):

It seems that truly sequential swap reading (of only the pages that belong to a single process) is hardly possible without kernel hacking: I measured swapoff -a and it certainly did not read the swap sequentially. And it would be logical for it to read faster if such an optimization were easy to implement.

Currently my best approach is to read the whole process memory through the /proc/[pid]/mem pseudo-file, using the script below (which must be run as root):

#!/usr/bin/python2
import re
import sys
pid = sys.argv[1] # process pid given as the first argument

print(pid) # just to avoid mistakes

CHUNKSIZE=10485760  # single read() invocation block size

total=0
maps_file = open("/proc/"+pid+"/maps", 'r')
mem_file = open("/proc/"+pid+"/mem", 'rb', 0)  # open memory unbuffered, in binary mode
for line in maps_file.readlines():  # for each mapped region
    m = re.match(r'([0-9A-Fa-f]+)-([0-9A-Fa-f]+) ([-r])', line)
    if m.group(3) == 'r':  # if this is a readable region
        start = int(m.group(1), 16)
        end = int(m.group(2), 16)
        mem_file.seek(start)  # seek to region start
        togo = end-start # number of bytes to read
        while togo > CHUNKSIZE: # read the region sequentially, one block at a time
            mem_file.read(CHUNKSIZE) 
            togo -= CHUNKSIZE
            total += CHUNKSIZE
            print(total/1048576) # be verbose, print megabytes read so far
        mem_file.read(togo) # read remaining region contents
        total += togo # count the remaining bytes as read
        print(total/1048576) # be verbose...
maps_file.close()
mem_file.close()

It happens to fail at the very last bytes of memory, but generally it works with the same performance as swapoff and loads only the given process. The script is a modified snippet from this answer.

bzaborow

Posted 2016-01-29T08:09:16.743

Reputation: 51

What about a RAID0 array of 16 SSDs for swap space? If you can get your hands on a 16e SAS RAID card, it should provide you with up to 24 Gbps of bandwidth (3 GB/s) – aaaaa says reinstate Monica – 2016-01-29T08:19:57.690

I've measured, and an SSD is not fast enough (in terms of random access) to handle such a load more effectively than sequential reading from a RAID of HDDs. It's a single-threaded load, and random access even on fast SSDs doesn't reach hundreds of MB per second. RAID won't help here at all. I hoped there would be no answers like this or "buy more RAM" :( – bzaborow – 2016-01-29T09:03:01.393

Just crossing the t's and dotting the i's: you didn't say you can't get more RAM (1.5 TB?) or whether you considered RAIDing the swap file. I'm interested for our own setup; it might come in handy – aaaaa says reinstate Monica – 2016-01-29T09:08:29.330

OK, you're right. I'm adding this explicitly to the description. I'm sorry if you feel offended. – bzaborow – 2016-01-29T09:15:39.900

I might be wrong, but it seems that swap maps memory at the block level, so it is not possible to split swap by process. Could you run each process in a separate virtual machine with its own swap file, and assign a priority to each machine? Sorry, I have no experience with these things, just a suggestion. – aaaaa says reinstate Monica – 2016-01-29T09:28:18.900

As far as I know you are right, it is block-level. In my question I mean loading all swapped-out blocks owned by a process, not anything like per-process swap. EDIT: a separate VM for each process might be an option, as I could hibernate it. I'll try it if I get no better answer. Thanks! – bzaborow – 2016-01-29T09:31:44.160

Ideally you could add hooks into something and suspend a process that's starting to grow before it gets too big, if there is already significant memory pressure. That way the system would do this management for you automatically. IDK if you could do this without modifying the kernel, though. Maybe set a small-ish ulimit value for RSS or VM size, and handle the resulting signal when it's exceeded by checking whether it's OK to proceed, or whether some other process already holds the "using lots of RAM" lock. (I forget exactly what happens when you exceed a ulimit value; mmap might just fail.) – Peter Cordes – 2016-11-14T22:16:54.003
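
(A minimal sketch of the ulimit idea from the comment above, in the same Python as the script; the cap value, the dummy allocation and the retry policy are placeholders, and it relies on the allocation simply failing with MemoryError rather than a signal:)

#!/usr/bin/python2
# Sketch only: cap this process's address space so a huge allocation fails
# instead of pushing the machine into swap; the limit and the "is it safe
# to proceed" step are assumptions, not a tested recipe.
import resource

SOFT_CAP = 8 * 1024**3                              # e.g. 8 GB (placeholder)
_, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (SOFT_CAP, hard))

def big_library_call():
    return bytearray(16 * 1024**3)                  # would exceed the cap

try:
    data = big_library_call()
except MemoryError:
    # e.g. wait here until no other process holds the "using lots of RAM"
    # lock, then lift the soft cap back to the hard limit and retry
    resource.setrlimit(resource.RLIMIT_AS, (hard, hard))
    data = big_library_call()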

No answers