Poor concurrent IO performance, how to trade latency for throughput?

I have a fairly IO-heavy GPU-limited process that needs to read in random files from a folder in a local hard drive. When the process is running by itself, I get a consistent throughput of around 30 MB/s, but when there are two competing processes, the total throughput drops to barely 7 MB/s.

How can I maximise throughput when having two programs? Latency is not a problem.

Each file is in the order of 1-20 MB, The processes are running on independent GPUs, and they use very little CPU. The same effect is observed if I launch one GPU and one pure IO process at the same time.

There are no differences between the available schedulers: deadline, cfq, and noop. I also tried to increase the read deadline time to 5 s, without changes.

Machine details:

Fedora Linux with kernel 4.16.7-200.fc27.x86_64
i7-4770 CPU @ 3.40GHz
32 GB of RAM, of which 20 are taken by running processes.
Swap is enabled, but empty.
The drive is a WDC WD2003FYYS-0, 2TB, but I see the same if I move everything to other drives.
cat big_file > /dev/null gives a throughput of nearly 100 MB/s, so there is bandwidth for both.
The whole data is around 500 GB.

More info:

I moved the files to a different, faster drive that is not being used for anything else, and used compression. The overall throughput is slightly improved.
Giving maximum io priority to one of the processes improved performance by 10%.
Running iostat -x 1 shows that utilisation is around 87 % when running one process, and 100 % when running two.
The processes are reading random files. If I had only one process, it can provide more than double the throughput each of the individual ones can consume.

Davidmh

Posted 2018-07-17T16:34:34.613

Reputation: 111

3Get an SSD. Hard drives are awful with concurrent access. – Mokubai – 2018-07-17T16:43:15.607

@Mokubai I don't need more than what a HD can provide, and since my requests are queued, I don't mind latency, so the scheduler could, for example, give a full second dedicated to each process, at maximum throughput. – Davidmh – 2018-07-17T20:49:50.923

A shot in the dark: experiment with ionice -c best-effort with different -n levels for the two processes; or even try -c realtime for one of them. – Kamil Maciorowski – 2018-07-17T21:21:06.720

@KamilMaciorowski it seems to help marginally (~10% improvement). Thanks! – Davidmh – 2018-07-20T12:55:53.187

But you do need more than your hard drives can provide. SSDs are several orders of magnitude faster at random access than mechanical hard drives. You could of course create a custom daemon to manage access to files, with a larger read and write buffer, but why bother? – Daniel B – 2018-07-20T13:48:17.460

@DanielB if every process was assigned a slice of one second with full access to the drive it would be enough, because they have internal buffers. That is something that is within the realm of the OS scheduler. – Davidmh – 2018-07-22T16:59:33.530

The OS scheduler is designed to be fair and treat every process in the system equally. They have been tuned to give every process a chance at accessing the disk and a good likelihood of their disk accesses being handled within tens of milliseconds rather than hundreds of milliseconds. If you need something more than the defaults provided by the provided scheduler then you can modify or write your own but that would be a programming question out of the scope of this site. What we can say though is that there are other devices that are much more suited to small random accesses called SSDs. – Mokubai – 2018-07-27T12:35:29.143

Poor concurrent IO performance, how to trade latency for throughput?

Answers