Why doesn't an NVMe connection on an SSD make non-sequential access faster?

1

Why would it matter whether you're transferring a 1GB file at quintuple the speed (of a SATA SSD) or transferring 1,000 1MB files at quintuple the speed? Either way it should amount to quintuple the speed of the SATA SSD.

But in real life, non-sequential access on NVMe turns out to have only a small benefit over a SATA SSD.

EDIT

Since the answers are concentrating on the difference between large and small files, let me clarify my question:

  • Yes, small files will have overhead.
  • And yes, they will waste time by reading data that will be ignored.
  • But this is irrelevant to my question, since every read and write (including those pesky little MFT writes etc…) will (or rather should) see the 5x speed gain.

Saying that there is wasted drive access doesn't change that. I'm not asking why 1GB is not as fast as 1,000 1MBs. I'm asking why:

(1GB_NVMe / 1GB_SSD) != (1000x1MB_NVMe / 1000x1MB_SSD)
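
For concreteness, here is a minimal sketch (Python) of how the two ratios could be measured. The mount points and file names are placeholders, the test files are assumed to already exist on each drive, and the OS page cache would have to be dropped between runs for the timings to mean anything:

    # Placeholder paths: one 1GB file and a folder of 1,000 1MB files on each drive.
    import time
    from pathlib import Path

    DRIVES = {"nvme": Path("/mnt/nvme"), "sata": Path("/mnt/sata")}

    def read_all(path, bufsize=1 << 20):
        """Read one file to the end and return the elapsed wall-clock time."""
        start = time.perf_counter()
        with open(path, "rb", buffering=0) as f:
            while f.read(bufsize):
                pass
        return time.perf_counter() - start

    def time_large(root):
        return read_all(root / "big_1GB.bin")            # one 1GB file

    def time_small(root):
        return sum(read_all(p) for p in sorted((root / "small").glob("*.bin")))

    large = {name: time_large(root) for name, root in DRIVES.items()}
    small = {name: time_small(root) for name, root in DRIVES.items()}

    # The question expects these two ratios to be roughly equal (~5x);
    # in practice the second one comes out much closer to 1.
    print("1GB speedup:      ", large["sata"] / large["nvme"])
    print("1000x1MB speedup: ", small["sata"] / small["nvme"])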

ispiro

Posted 2019-09-05T17:58:17.533

Reputation: 1 259

in real life... You say that like it's a universal truth. Can you substantiate this claim? Please do not only respond in the comments. Instead, [edit] the post with this information. – I say Reinstate Monica – 2019-09-05T18:34:56.630

@TwistyImpersonator I've searched the web for a contradictory source and only found one lead which turned out to be useless. It seems like there's practically no disagreement about that. That's why I omitted that from the question. Just like I omitted the fact that SSDs are faster than HDDs.

– ispiro – 2019-09-05T18:50:01.313

Seek latency becomes a real problem with random reads and all the speed benefits of SSDs are lost with tiny reads: https://superuser.com/a/1168029/19943 This question feels like a slightly differently phrased duplicate of that one...

– Mokubai – 2019-09-05T18:57:14.937

@Mokubai Every seek should see the speed difference. Your answer there explains the difference between small and large files. Not the difference between different types of drives. Every seek, every write amplification, every part should see the speed gain. – ispiro – 2019-09-05T19:05:23.140

@ispiro you're making assumptions about how the electronics work. In order to obtain their massive speeds they do a lot of work with parallelism and queueing, both with the flash chips and in the main electronics. Once you do lots of seeks that are below a certain threshold you start to lose any benefits of how the device was designed. Seeks are faster, but parallel transfers, queueing and caching are what achieve the phenomenal sequential speeds. Small transfers see only the benefits of an effective single controller thread and/or flash chip. – Mokubai – 2019-09-05T19:13:11.643

@Mokubai If you could expound on your comment, that might just be the answer to my question. – ispiro – 2019-09-05T19:15:34.113

@Mokubai: This seems like an over-simplification. I have on purpose said only that OS bookkeeping and cache flushing interrupt the smooth transfer of data, as I don't think the interaction between the OS and the SSD can be simply defined; it depends on too many parameters. – harrymc – 2019-09-05T19:27:03.190

@harrymc I've posted what I mean as an answer. While I agree that there is some added time and overhead in queuing and seeking, a lot of the performance benefit of these drives is due to how the controllers are designed and how the memory devices are laid out around them. – Mokubai – 2019-09-05T19:37:39.350

@ispiro This isn’t a bad question, but your tone is a bit harsh. Your edit clarifies things, but seriously, saying something like “…let’s go over this again…” is not an inspiring statement. – JakeGould – 2019-09-05T19:42:09.200

@JakeGould Thanks. Point taken. I edited that line. Did you mean there was another part where the tone was harsh? – ispiro – 2019-09-05T19:46:39.263

@ispiro Nope! Your edit is perfect. And the question is decent. – JakeGould – 2019-09-05T20:06:44.030

Answers

4

The problem here is that while NVMe drives, and SSDs in general, are faster than spinning rust because they use flash memory, the ability of NVMe to transfer multiple gigabytes of data per second comes from the way the flash memory is arranged around the controller.

Fast flash devices effectively use a scheme similar to RAID0 across what are, on their own, simply fast flash memory chips. Each chip on its own can handle a certain speed, but tied together with its siblings the array can achieve a much higher aggregate speed by having data written to and read from multiple devices simultaneously.

Effectively, large transfers can take advantage of this parallelism and request multiple blocks from multiple chips, reducing what would be 8 seek times down to a single seek (across multiple chips) plus one larger transfer. The controller has the buffering and queueing needed to stream the data out sequentially in whichever direction is required.
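
As a rough illustration of that trade-off (this is just a toy model; the chip count, seek cost and per-chip bandwidth are made-up round numbers, not measurements of any real drive):

    # Toy model: one request striped across 8 chips vs. 8 dependent requests
    # that each pay their own seek and transfer serially.
    CHIPS = 8          # independent flash channels/chips
    SEEK_S = 50e-6     # assumed per-request lookup/seek cost on one chip
    CHIP_MBPS = 400    # assumed sustained bandwidth of a single chip

    def one_striped_transfer(mb):
        """One big request: a single seek, then all chips transfer in parallel."""
        return SEEK_S + (mb / CHIPS) / CHIP_MBPS

    def serial_requests(count, mb_each):
        """Small dependent requests: each pays a seek and a single-chip transfer."""
        return count * (SEEK_S + mb_each / CHIP_MBPS)

    print(f"1 striped 8MB read : {one_striped_transfer(8) * 1e3:.2f} ms")
    print(f"8 serial 1MB reads : {serial_requests(8, 1) * 1e3:.2f} ms")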

The individual flash chips themselves may also be configured to read ahead a few blocks for future requests and, for writes, to cache data in a small internal buffer, further reducing delays for future requests.

The problem with working on lots of small files is that it defeats all of the smarts used to achieve a single massive transfer. The controller has to work through a queue, going between flash devices: requesting a block of data, waiting for the response, looking at the next item in the queue, requesting that data, and so on.

If the data being read or written is on another chip then the controller might be able to use multiple channels, but if many of the requests land on the same chip for a period, as they can for lots of small writes, then what you end up seeing is the performance of a single flash chip rather than the full performance of an array of chips.

So thousands of small reads or writes could actually show you the performance of only a small part of your NVMe device, rather than what the device is fully capable of under so-called "perfect" conditions.
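
Put in terms of the ratio in the question, a back-of-the-envelope model along these lines (all figures below are illustrative assumptions, not specs of real drives) shows why the gap collapses for small, dependent requests:

    # Assumed figures: the same 8-chip internals behind both interfaces, with
    # the SATA link capped at ~550 MB/s and the NVMe link at ~3500 MB/s.
    CHIP_MBPS = 400
    CHIPS = 8
    SATA_CAP = 550
    NVME_CAP = 3500

    def throughput(busy_chips, interface_cap):
        """Aggregate throughput: striped chip bandwidth, clipped by the link."""
        return min(busy_chips * CHIP_MBPS, interface_cap)

    seq_ratio = throughput(CHIPS, NVME_CAP) / throughput(CHIPS, SATA_CAP)  # ~5.8x
    rnd_ratio = throughput(1, NVME_CAP) / throughput(1, SATA_CAP)          # ~1.0x

    print(f"large sequential transfer: NVMe/SATA = {seq_ratio:.1f}x")
    print(f"small dependent requests : NVMe/SATA = {rnd_ratio:.1f}x")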

Mokubai

Posted 2019-09-05T17:58:17.533

Reputation: 64 434

There is a problem with that: The disk knows nothing of files, only of sectors. So if the OS fed it enough data it wouldn't need to slow down. This means that the bottleneck is with the OS being too slow on many files, not with the way NVMe works. – harrymc – 2019-09-05T19:42:28.033

Thanks. This makes sense. – ispiro – 2019-09-05T19:49:30.687

@harrymc Are you saying that the OS should have fed the drive more but doesn't, or that it does, but that the loss is by the OS wasting other time? – ispiro – 2019-09-05T19:50:25.023

Yes, I'm saying that the problem is that the OS is not writing out data at the full speed possible with NVMe. The explanation of Mokubai may be correct as to what happens when there is not enough data being sent, but that's not the root cause of the problem. The root cause in my opinion is very simply inefficient OS cache algorithms and disk drivers, problems which show up more clearly with these fast disks. – harrymc – 2019-09-05T19:55:17.397

@harrymc that could be the case if a flash device were a dumb slab of disk like spinning rust, but there is another source of seek latency: the wear leveller, or "flash translation layer", which is wholly within the NVMe controller. For large reads and writes this again becomes a single check and then a burst out to the memory devices; for a queue of things it becomes yet another bottleneck of "where's this" and "where's that" within the drive itself. I'm not meaning to say mine is the one true answer, but the drive itself, being a more complicated device, holds a lot of the cards performance-wise. – Mokubai – 2019-09-05T19:57:36.163

@Mokubai This last point of yours would be the same for SATA as well, wouldn't it? If so, it wouldn't explain the lack of serious gain by NVMe. – ispiro – 2019-09-05T20:03:39.283

I think that your argument strengthens my claim that the OS is driving the NVMe in an inefficient way for many files. The hardware can go much faster when the OS is not dealing with (quote) "where's this" and "where's that". Burst mode should work for chunks of 1 MB. And as ispiro said, we see exactly the same slow-down for SATA disks, which means that the problem is with the OS. – harrymc – 2019-09-05T20:05:49.773

@ispiro performance SATA SSDs have effectively the same internals as NVMe drives; you just don't really notice it because they are bottlenecked at the interface. Both SATA and NVMe get the same "worst case" performance, but for the best case the NVMe drive can go way ahead. – Mokubai – 2019-09-05T20:06:29.840

@harrymc I think your argument is flawed in that you assume that the MFT is the ONLY latency bound lookup table. In reality, flash controllers do another layer of indirection so as to present a virtual block device to the OS (due to wear leveling and garbage collection). The block lookup table is often stored in the flash, which adds an extra low queue depth latency bound step. This is often offset (on high end controllers) using onboard DRAM cache. As with all caches, certain work loads will often result in cache misses, which will add "significant" latency to block access. – Aron – 2019-09-20T05:09:43.703
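
(To illustrate the comment above, here is a toy model of that mapping-table lookup; the segment size, DRAM capacity and latencies are invented for illustration only.)

    # Toy flash-translation-layer lookup: only a few map segments fit in the
    # controller's DRAM, so scattered accesses keep missing the cached slice.
    from collections import OrderedDict

    DRAM_SLOTS = 4             # assumed number of map segments cached in DRAM
    HIT_US, MISS_US = 2, 80    # assumed lookup cost: cached vs. fetched from flash
    SEGMENT = 1024             # assumed logical blocks covered per map segment

    cache = OrderedDict()      # LRU of recently used map segments

    def lookup(lba):
        """Translate one logical block address; return the lookup cost in µs."""
        seg = lba // SEGMENT
        if seg in cache:
            cache.move_to_end(seg)
            return HIT_US
        cache[seg] = None
        if len(cache) > DRAM_SLOTS:
            cache.popitem(last=False)   # evict the least recently used segment
        return MISS_US                  # map segment had to be read from flash

    seq_lbas = list(range(8192))                   # 8192 consecutive blocks
    seq_avg = sum(map(lookup, seq_lbas)) / len(seq_lbas)
    cache.clear()
    rnd_lbas = list(range(0, 8192 * 100, 997))     # blocks scattered across the drive
    rnd_avg = sum(map(lookup, rnd_lbas)) / len(rnd_lbas)
    print(f"avg lookup cost: sequential {seq_avg:.1f} µs, scattered {rnd_avg:.1f} µs")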

2

Copying many files involves the overhead of creating their entries in the Master File Table (MFT), which is an integral component of the NTFS file system (other file systems have their equivalents).

This means that creating a file entails first searching the MFT for the name, in order to avoid duplicates, then allocating the space, copying the file, and finally completing the entry in the MFT.

It is this bookkeeping overhead that dramatically slows down the copying of many files. The overhead involves matching work in the operating system: updating RAM tables, handling interrupts, servicing system calls, and so on. Closing a file also causes the operating system to flush it to the disk, which takes time and disturbs the smooth copying of the data, not letting NVMe achieve performance closer to its potential.
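
As a rough sketch of where that per-file overhead sits relative to the data transfer (the paths below are placeholders, and an explicit os.fsync stands in for the flush-on-close step described above):

    # Copy 1,000 small files one by one. Only copyfileobj() is a long transfer
    # that a faster interface can speed up ~5x; the name lookup, entry creation
    # and flush around it are short per-file operations that keep interrupting
    # the smooth streaming of data.
    import os
    import shutil
    from pathlib import Path

    SRC = Path("/mnt/sata/small")   # placeholder: folder of 1,000 1MB files
    DST = Path("/mnt/nvme/small")   # placeholder destination
    DST.mkdir(parents=True, exist_ok=True)

    for src in sorted(SRC.glob("*.bin")):
        dst = DST / src.name                    # name search + new MFT entry
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout, length=1 << 20)  # the actual data
            fout.flush()
            os.fsync(fout.fileno())             # flush on close: data + metadata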

Note: This problem of slow copying of many files is not unique to NVMe. We see exactly the same problem for any kind of fast disk, whether mechanical or SSD, SATA or NVMe. To my way of thinking this proves that the problem lies in inefficient OS handling of this case, perhaps because of inefficient cache-memory algorithms and/or disk drivers.

harrymc

Posted 2019-09-05T17:58:17.533

Reputation: 306 093

This shouldn't matter because every access to the MFT should be 5 times as fast. The bottom line is we're doing the same amount of work, and every part of it is 5 times as fast. (And I assume that the CPU work here, which would, indeed, be the same, is negligible.) – ispiro – 2019-09-05T18:08:21.900

I added more info to the answer. – harrymc – 2019-09-05T18:35:22.263

"Updating RAM tables" shouldn't be a bottleneck compared with IO (unless you're saying that NVMe is that fast). The same goes for computer interrupts. "Closing a file also causes the operating system to flush it to the disk" - this should be 5 times (or whatever) as fast. System calls contain 2 parts: IO, which should see that speed gain, and non-IO, which should be negligible. Everything breaks up into 2: IO, where we should see the full speed gain, and non-IO, which should be negligible. Unless NVMe is really reaching near-RAM speed. – ispiro – 2019-09-05T18:45:57.857

It is a problem when the OS operates in an inefficient manner that cannot drive the NVMe at full speed when there are many files. – harrymc – 2019-09-05T19:59:12.240