NVMe SSD: Why is 4K writing faster than reading?

52

11

I have a Samsung 960 Pro 512 GB NVMe SSD running on PCIe Gen 3 x4. I use the Samsung NVMe Driver 2.0.0.1607. The SSD is running fine. However, I don't understand why 4K writes are faster than 4K reads. I am using AS SSD Benchmark:

[Image: benchmark results screenshot]

It is a factor of 3! Is there something wrong (with my system or AS SSD Benchmark) or is this normal?

musbach

Posted 2017-01-16T18:04:55.563

Reputation: 691

And still so much faster than a rotating hard drive! – Zan Lynx – 2017-01-16T20:02:45.023

Answers

80

4K reads are about the hardest thing the drive can do. They are among the smallest block sizes the drive can handle, and there is no way for the drive to preload large quantities of data. In fact, they are probably quite inefficient if the drive's read-ahead logic is intending to read anything larger than 4 KB.

"Normal" drive reads are more likely to be larger than 4 KB, as there are very few files that are that small, and even the page file is likely to be read in large chunks, as it would be odd for a program to have "only" 4 KB of memory paged out. This means that in a 4K test, any preloading the drive tries to do will actually penalise its throughput.

4K reads might pass through the drive buffer, but the "random" part of the test makes them entirely unpredictable. The controller won't know when the drive might need the more usual "large" reads again.

4K writes on the other hand can be buffered, queued, and written out sequentially in an efficient manner. The drive buffer can do a lot of the catch-and-write work that it was designed for, and the wear leveller might even allocate all those 4K writes to the same drive erase block, occasionally turning what is a 4K "random" write into something closer to a sequential write.

In fact, I suspect that this is what is happening in the "4K-64Thrd" writes. The "64Thrd" test apparently uses a large queue depth, signalling to the drive that it has a large amount of data to read or write. This triggers a lot of clustering of writes, so throughput approaches the sequential write speed of the drive. There is still an overhead to performing a 4K write, but now the potential of the buffer is fully exposed. In the read version of the test the drive controller, now recognising that it is under constant heavy load, stops preloading data, possibly bypasses the buffer, and instead switches to a "raw" read mode, again approaching the sequential read speed.

Basically the drive controller can do something to make a 4K write more efficient, especially if a cluster of them arrive at a similar time, while it can't do anything to make a single 4K read more efficient, especially if it is trying to optimise dataflow by pre-loading data into the cache.
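
The clustering of buffered writes described above can be illustrated with a toy model. This is only a minimal sketch, not how the 960 Pro's firmware actually works: the function name `pack_buffered_writes`, the 4-page erase block and the logical addresses are all made up for illustration. It shows how writes to scattered logical addresses can end up on consecutive physical pages.

```python
def pack_buffered_writes(lbas, pages_per_block=4):
    """Assign buffered writes to consecutive physical pages.

    Hypothetical toy model: each buffered 4K write gets the next free
    page, so scattered logical addresses end up side by side in the
    same erase block. Real controllers are far more complex, and real
    erase blocks hold many more than 4 pages.
    """
    placement = {}
    for slot, lba in enumerate(lbas):
        block, page = divmod(slot, pages_per_block)
        placement[lba] = (block, page)
    return placement


# Four "random" logical addresses sitting in the write buffer.
random_lbas = [9137, 12, 40455, 7701]
placement = pack_buffered_writes(random_lbas)

for lba, (block, page) in placement.items():
    print(f"LBA {lba:>6} -> erase block {block}, page {page}")
```

In this model all four writes land in erase block 0 on pages 0 to 3, which is what a sequential write would look like from the flash side.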

Mokubai

Posted 2017-01-16T18:04:55.563

Reputation: 64 434

5 Not a part of the answer itself, but I suspect that the "4K read" (non 64Thrd version) is actually exposing the drive default read block size as either 32K or 64K. This would be either 2600 / 50 = 52 (64K with some overhead + the original 4K read) or 1200 / 50 = 24 (32K with some overhead + the 4K read reducing it). – Mokubai – 2017-01-16T19:13:06.020

16 Good answer overall, but I don't believe "there are very few files that are that small" at all. In fact I suspect that on most systems the majority of files are 4k or smaller. They don't take up the majority of space, but that's another matter. – hobbs – 2017-01-17T01:15:49.487

3 The simplest answer is probably this: If you do them one at a time, you can't overlap the reads at all because you don't even find out what block the next read is for until you return the data from the previous read. But you can overlap the writes completely since you can get all the data for the next write while you're still working on the previous one. – David Schwartz – 2017-01-17T04:16:00.137

2 @hobbs If you take NTFS, for example, the default cluster size is 4K (or a multiple thereof), meaning that the NTFS filesystem itself works in 4K blocks even though the files and/or metadata themselves are smaller. So smaller files don't make any difference. For all intents and purposes a Windows system reads/writes in 4K blocks or multiples of that. – Tonny – 2017-01-17T16:06:52.390

1 @hobbs: With NTFS, you're likely to get the read of such small files for free (!). Small files are stored in the directory entry itself, adjacent to the file name. You have to hit a fairly particular file size close to 4KB to have an actual 4KB file on disk. – MSalters – 2017-01-18T16:02:34.120

Upvote for write through cache and sequential prefetch thoughts. – mckenzm – 2017-01-19T02:00:22.830

@MSalters that sounds cool, do you have a doc/spec link about that? – Nick T – 2017-01-19T17:55:37.110

@NickT: Not handy, but it's fairly easy to google. – MSalters – 2017-01-19T19:45:11.320

15

Other answers have already explained why writing may be faster than reading. I would like to add that for this drive this is absolutely normal, as confirmed by benchmarks that you can find in reviews.

ArsTechnica's review

ArsTechnica has reviewed the drive, both your version (512 GB) and the 2 TB one:

[Image: ArsTechnica graph] (This graph is not immediately visible in the review: it is the 5th one in the first gallery, and you have to click on it.)

The performance of these 2 models is very similar, and their numbers look like yours: the drive can read at 37 MB/s and write at 151 MB/s.

AnandTech's review

AnandTech has also reviewed the drive: they used the 2TB model, averaging the results of tests with a queue depth of 1, 2 and 4. These are the graphs:

[Images: AnandTech 4K read graph, AnandTech 4K write graph]

The drive reads at 137 MB/s and writes at 437 MB/s. The numbers are much higher than yours, but that is probably due to the higher queue depths. Anyway, the write speed is about 3 times the read speed, as in your case.

PC World's review

One more review, by PC World: they tested the 1 TB version, and the results for 4K are 30 MB/s for reading and 155 MB/s for writing:

[Image: PC World graph]

The write speed is in line with yours, but here the drive is even slower at reading. The result is that the ratio is five to one, not three to one.

Conclusion

Reviews confirm that for this drive it is normal that the write speed for random 4K is much faster than the read speed: depending on the test, it can even be 5 times faster.
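
For reference, the write-to-read ratios can be worked out directly from the figures quoted above. This is just arithmetic on the numbers cited from the three reviews; the variable names are my own.

```python
# Random 4K throughput in MB/s, as quoted from the three reviews above.
reviews = {
    "ArsTechnica": {"read": 37, "write": 151},
    "AnandTech": {"read": 137, "write": 437},
    "PC World": {"read": 30, "write": 155},
}

# Write-to-read ratio for each review.
ratios = {name: r["write"] / r["read"] for name, r in reviews.items()}

for name, ratio in ratios.items():
    print(f"{name}: writes are {ratio:.1f}x faster than reads")
```

The ratios come out at roughly 4.1, 3.2 and 5.2 respectively.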

Your drive is fine. There's no reason to believe it is faulty, or that your system has a problem.

Fabio says Reinstate Monica

Posted 2017-01-16T18:04:55.563

Reputation: 1 062

8

The SSD controller caches writes in its onboard NVRAM and flushes them to the flash media at opportune times. Write latency is thus the cache access latency, typically 20 µs. Reads, by contrast, are served from the media, with an access time of 120-150 µs at best.
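
As a rough sanity check, these latencies can be turned into throughput figures. The sketch below assumes requests are issued strictly one at a time (queue depth 1), that the quoted latency is the only cost per request, and that MB means 10^6 bytes; the function name is my own. Real results will be lower because of host and protocol overhead.

```python
BLOCK_BYTES = 4096  # one 4K request


def qd1_throughput_mb_s(latency_us, block_bytes=BLOCK_BYTES):
    """Upper-bound throughput in MB/s with one request in flight at a time."""
    requests_per_second = 1_000_000 / latency_us
    return requests_per_second * block_bytes / 1_000_000


write_mb_s = qd1_throughput_mb_s(20)        # cache access latency
read_best_mb_s = qd1_throughput_mb_s(120)   # best-case media access
read_worst_mb_s = qd1_throughput_mb_s(150)

print(f"write: {write_mb_s:.1f} MB/s")
print(f"read:  {read_worst_mb_s:.1f} to {read_best_mb_s:.1f} MB/s")
```

Under these assumptions writes top out at about 205 MB/s and reads at about 27 to 34 MB/s, which reproduces the rough shape of the gap seen in 4K benchmarks.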

Andrey Kuzmin

Posted 2017-01-16T18:04:55.563

Reputation: 81

1

Expanding on Andrey's answer, you need to look at the overhead involved before the SSD can signal to the computer that the operation is complete.

For a write, the data must merely be written to an internal RAM cache. Later it will be written to flash memory, along with other 4k blocks and metadata needed to check, error correct and locate it.

For a read, the SSD must first locate the data. The location that the computer wants to read is called the logical address, and it has no direct relationship with the physical location of the data in flash memory. The SSD translates the logical address into a physical one, based on the geometry of the flash memory (the way the cells are arranged), bad block remapping, wear levelling and various other factors. It then has to wait for any other operations to finish before retrieving the data from flash, checking it and, if required, re-reading it and applying error correction, possibly even rewriting the whole block somewhere else.

While the total time taken by a write operation may be longer than a typical read operation, the time taken for the SSD to report that the operation completed to the extent that it can process further commands is lower. With large blocks the overhead is not the limiting factor, but with many small blocks it starts to limit read/write speed.
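
To make the difference between the two paths concrete, here is a deliberately simplified sketch. `ToyFTL` is a hypothetical class, not any real firmware design: it ignores geometry, bad blocks, wear levelling and error correction, and serving reads from the RAM cache is my own assumption. It only shows that a write can be acknowledged before anything reaches flash, while a read of flushed data needs a logical-to-physical lookup first.

```python
class ToyFTL:
    """Hypothetical, greatly simplified flash translation layer."""

    def __init__(self):
        self.ram_cache = {}  # logical address -> data not yet on flash
        self.l2p = {}        # logical address -> physical page number
        self.flash = []      # physical pages, filled in order

    def write(self, lba, data):
        # The write is reported complete as soon as the data is in RAM.
        self.ram_cache[lba] = data
        return "acknowledged"

    def flush(self):
        # Later, cached blocks are written out together and the
        # logical-to-physical map is updated.
        for lba, data in self.ram_cache.items():
            self.flash.append(data)
            self.l2p[lba] = len(self.flash) - 1
        self.ram_cache.clear()

    def read(self, lba):
        # Assumption in this sketch: data still in the cache is served from RAM.
        if lba in self.ram_cache:
            return self.ram_cache[lba]
        # Otherwise translate the logical address, then fetch from flash.
        physical_page = self.l2p[lba]
        return self.flash[physical_page]


ftl = ToyFTL()
status = ftl.write(5000, b"hello")  # returns before any flash access
ftl.write(17, b"world")
ftl.flush()
data = ftl.read(5000)               # needs the l2p lookup first
print(status, data, ftl.l2p)
```

After the flush, logical addresses 5000 and 17 sit on physical pages 0 and 1, so the physical layout reflects arrival order and not the logical addresses.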

user3241

Posted 2017-01-16T18:04:55.563

Reputation: 215