Saturday, October 9, 2021

Zynq Ultrascale+ Bare Metal NVMe: 2GB/s with FatFs + exFAT

This is a quick follow-up to my original post on speed testing bare metal NVMe with the Zynq Ultrascale+ AXI-PCIe bridge. There, I demonstrated a lightweight NVMe driver running natively on one Cortex-A53 core of the ZU+ PS that could comfortably achieve >1GB/s write speeds to a suitable M.2 NVMe SSD, such as the Samsung 970 Evo Plus. That's without any hardware acceleration: the NVMe queues are maintained in external DDR4 RAM attached to the PS, by software running on the A53.

I was actually able to get to much higher write speeds, over 2.5GB/s, writing directly to the SSD (no file system) with block sizes of 64KiB or larger. But this only lasts as long as the SLC cache: modern consumer SSDs use either TLC or QLC NAND flash, which stores three or four bits per cell. That's slower to write than single-bit SLC, so drives allocate some of their free space as an SLC buffer to achieve higher peak write speeds. Once the SLC cache runs out, the drive drops down to a lower sustained write speed.

It's not easy to find good benchmarks for sustained sequential writing. The best I've seen are from Tom's Hardware and AnandTech, but only as curated data sets in specific reviews, not as a global data set. For example, this Tom's Hardware review of the Sabrent Rocket 4 Plus 4TB has good sustained sequential write data for competing drives. And, this AnandTech review of the Samsung 980 Pro has some more good data for fast drives under the Cache Size Effects test. My own testing with some of these drives, using ZU+ bare metal NVMe, has largely aligned with these benchmarks.

The unfortunate trend is that, while peak write speeds have increased dramatically in the last few years, sustained sequential write speeds may have actually gotten worse. This trend can be seen globally as well as within specific product lines. (It might even be true across different date codes of the same drive.) Take for example the Samsung 970 Pro, an MLC (two bit per cell) drive released in 2018: it had no SLC cache, but could write its full 1TB capacity at over 2.5GB/s. Its successor, the 980 Pro, has much higher peak SLC cache write speeds, nearing 5GB/s with PCIe Gen4, but dips below 1.5GB/s at some points after the SLC cache runs out.

Things get more complicated when considering the allocation state of the SSD. The sustained write benchmarks are usually taken after the entire SSD has been deallocated, via a secure erase or whole-drive TRIM. This restores the SLC cache and resets garbage collection to some initial state. If instead the drive is left "full" and old blocks are overwritten, the SLC cache is not recovered. However, this may also result in faster and more steady sustained sequential writing, as it prevents the undershoot that happens when the SLC cache runs out and must be unloaded into TLC.

So in certain conditions and with the right SSD, it's just possible to get to sustained sequential write speeds of 2GB/s with raw disk access. But, what about with a file system? I originally tested FatFs with the drive formatted as FAT32, reasoning (incorrectly) that an older file system would be simpler and have less overhead. But as it turns out, exFAT is a much better choice for fast sustained sequential writing.

The most important difference is how FAT32 and exFAT check for and update cluster allocation. Clusters are the unit of memory allocated for file storage - all files take up an integer number of clusters on the disk. The clusters don't have to be sequential, though, so the File Allocation Table (FAT) contains chained lists of clusters representing a file. For sequentially-written files, this list is contiguous. But the FAT allows for clusters to be chained together in any order for non-contiguous files. Each 32b entry in the FAT is just a pointer to the next cluster in the file.

FAT32 cluster allocation entirely based on 32b FAT entries.

In FAT32, the cluster entries are mandatory and a sequential write must check and update them as it progresses. This means that for every cluster written (64KiB in maxed-out FAT32), 32b of read and write overhead is added. In FatFs, this gets buffered until a full LBA (512B) of FAT updates is ready - that's 128 entries, or every 8MiB of data at 64KiB per cluster - but when this happens there's a big penalty for stopping the flow of sequential writing to check and update the FAT.
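To make that cost concrete, here's a minimal sketch (not the actual FatFs internals) of the FAT update pattern for a purely sequential FAT32 write. blkdev_write() is a hypothetical stand-in for the block-device write hook, and the read-modify-write and second FAT copy that a real file system would also handle are left out:

```c
/* Illustrative only: how a sequential FAT32 write has to chain clusters
 * in the FAT, flushing a buffered FAT sector every 128 clusters
 * (128 entries * 64KiB clusters = 8MiB of data per interruption). */
#include <stdint.h>
#include <string.h>

#define FAT_ENTRY_BYTES 4u                                  /* one 32b entry per cluster */
#define SECTOR_BYTES    512u
#define ENTRIES_PER_SEC (SECTOR_BYTES / FAT_ENTRY_BYTES)    /* 128 */

/* Hypothetical raw block write: one 512B sector at 'lba'. */
extern int blkdev_write(uint32_t lba, const void *buf, uint32_t nsect);

/* Chain 'nclust' sequential clusters starting at 'first'. */
static void fat32_chain_sequential(uint32_t fat_lba, uint32_t first, uint32_t nclust)
{
    uint32_t fat_buf[ENTRIES_PER_SEC];
    memset(fat_buf, 0, sizeof fat_buf);

    for (uint32_t i = 0; i < nclust; i++) {
        uint32_t clust = first + i;

        /* Each FAT entry points at the next cluster; the last cluster
         * gets the end-of-chain marker. */
        fat_buf[clust % ENTRIES_PER_SEC] =
            (i == nclust - 1u) ? 0x0FFFFFFFu : clust + 1u;

        /* FAT sector full (or chain done): the sequential data stream
         * has to stop while the FAT sector is written out. */
        if ((clust % ENTRIES_PER_SEC) == ENTRIES_PER_SEC - 1u || i == nclust - 1u) {
            blkdev_write(fat_lba + clust / ENTRIES_PER_SEC, fat_buf, 1);
            memset(fat_buf, 0, sizeof fat_buf);
        }
    }
}
```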

In exFAT, the cluster entries in the FAT are optional. Cluster allocation is handled by a bitmap, with one bit representing each cluster (0 = free, 1 = allocated). For a sequentially-written file, this is all that's needed; only non-contiguous files need the 32b cluster entries to create a chain in the FAT. As a result, sequential writing overhead is greatly reduced: with one bit instead of 32 per cluster, the allocation updates happen 32x less frequently.

exFAT cluster allocation using bitmap only for sequential files.
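A simplified sketch of the exFAT side: allocating a run of clusters for a sequential file just means setting bits in the allocation bitmap (on-disk layout and cluster numbering details are glossed over here):

```c
/* Illustrative only: exFAT allocation bitmap, one bit per cluster.
 * A 512B bitmap sector covers 4096 clusters, so with 64KiB clusters it
 * only needs to be touched once per 256MiB of sequential data
 * (once per 4GiB with 1MiB clusters). */
#include <stdint.h>

/* Mark clusters [first, first + nclust) as allocated in an in-memory
 * copy of the bitmap (cluster numbering simplified to start at 0). */
static void exfat_bitmap_allocate(uint8_t *bitmap, uint32_t first, uint32_t nclust)
{
    for (uint32_t c = first; c < first + nclust; c++)
        bitmap[c / 8] |= (uint8_t)(1u << (c % 8));

    /* No per-cluster FAT chain is written: a file marked as contiguous
     * ("no FAT chain" flag in its directory entry) needs only this
     * bitmap plus its start cluster and length. */
}
```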

The cluster size in exFAT is also not limited to 64KiB. Using larger clusters further reduces the allocation update frequency, at the expense of more dead space between files. If the plan is to write multi-GB files anyway, having 1MiB clusters really isn't a problem. And speaking of multi-GB files, exFAT doesn't have the 4GiB file size limit that FAT32 has, so the file creation overhead can also be reduced. This does put more data "at risk" if a power failure occurs before the file is closed. (Most of the data would probably still be in flash, but it would need to be recovered manually.)
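For reference, this is roughly what formatting for 1MiB clusters looks like with the FatFs f_mkfs() API (R0.14-style MKFS_PARM; FF_USE_MKFS and FF_FS_EXFAT must be enabled in ffconf.h, and "0:" is assumed to be the logical drive mapped to the NVMe disk I/O layer):

```c
/* Sketch: format the drive as exFAT with 1MiB clusters using FatFs. */
#include "ff.h"

static BYTE workbuf[FF_MAX_SS];     /* work area for f_mkfs */

FRESULT format_exfat_1mib(void)
{
    const MKFS_PARM opt = {
        .fmt     = FM_EXFAT,        /* force exFAT */
        .n_fat   = 1,               /* exFAT always uses a single FAT */
        .align   = 0,               /* auto data area alignment */
        .n_root  = 0,               /* not used by exFAT */
        .au_size = 1024 * 1024      /* 1MiB allocation unit (cluster) */
    };
    return f_mkfs("0:", &opt, workbuf, sizeof workbuf);
}
```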

Altogether, these features reduce the overhead of exFAT to almost nothing:

With 1MiB clusters and 16GiB files, it's possible to get ~2GB/s of sustained sequential file writing onto a 980 Pro for its entire 2TB capacity. I think this is probably the fastest implementation of FatFs in existence right now. The data block size still needs to be at least 64KiB, to keep the driver overhead low. But if a reasonable amount of streaming data can be buffered in RAM, this isn't too much of a constraint. And of course you do have to keep the SSD cool.
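The write loop itself doesn't need anything exotic. Here's a rough sketch of the streaming pattern under those assumptions (1MiB blocks, 16GiB files, exFAT enabled so FSIZE_t is 64-bit). fill_buffer() is a hypothetical placeholder for whatever produces the data, and the f_expand() call is optional (it requires FF_USE_EXPAND):

```c
/* Sketch: stream one large file to the SSD in big blocks so FatFs can
 * pass them straight through to the NVMe driver. */
#include "ff.h"

#define BLOCK_BYTES (1024u * 1024u)                 /* >=64KiB per f_write */
#define FILE_BYTES  (16384ull * 1024u * 1024u)      /* 16GiB per file */

extern void fill_buffer(BYTE *buf, UINT len);       /* hypothetical data source */
static BYTE block[BLOCK_BYTES] __attribute__((aligned(64)));

FRESULT stream_one_file(const TCHAR *path)
{
    FIL fil;
    UINT bw;
    FRESULT fr = f_open(&fil, path, FA_CREATE_ALWAYS | FA_WRITE);
    if (fr != FR_OK) return fr;

    /* Optionally reserve a contiguous data area up front. */
    f_expand(&fil, FILE_BYTES, 1);

    for (FSIZE_t written = 0; written < FILE_BYTES; written += bw) {
        fill_buffer(block, BLOCK_BYTES);
        fr = f_write(&fil, block, BLOCK_BYTES, &bw);
        if (fr != FR_OK || bw != BLOCK_BYTES) break;
    }

    /* Closing flushes the directory entry and allocation info. */
    FRESULT frc = f_close(&fil);
    return (fr != FR_OK) ? fr : frc;
}
```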

I've updated the bare metal NVMe test project to Vivado/Vitis 2021.1 here. It would still require some effort to port to a different board, and I still make no claims about the suitability of this driver for any real purposes. But if you need to write massive amounts of data and don't want to mess around in Linux (or want to try something similar in Linux user space...) it might be a good reference.