Saturday, October 9, 2021

Zynq Ultrascale+ Bare Metal NVMe: 2GB/s with FatFs + exFAT

This is a quick follow-up to my original post on speed testing bare metal NVMe with the Zynq Ultrascale+ AXI-PCIe bridge. There, I demonstrated a lightweight NVMe driver running natively on one Cortex-A53 core of the ZU+ PS that could comfortably achieve >1GB/s write speeds to a suitable M.2 NVMe SSD, such as the Samsung 970 Evo Plus. That's without any hardware acceleration: the NVMe queues are maintained in external DDR4 RAM attached to the PS, by software running on the A53.

I was actually able to get to much higher write speeds, over 2.5GB/s, writing directly to the SSD (no file system) with block sizes of 64KiB or larger. But this only lasts as long as the SLC cache: modern consumer SSDs use TLC or QLC NAND flash, which stores three or four bits per cell. That's denser but slower to write than single-bit SLC, so drives allocate some of their free space as an SLC buffer to achieve higher peak write speeds. Once the SLC cache runs out, the drive drops down to a lower sustained write speed.

It's not easy to find good benchmarks for sustained sequential writing. The best I've seen are from Tom's Hardware and AnandTech, but only as curated data sets in specific reviews, not as a global data set. For example, this Tom's Hardware review of the Sabrent Rocket 4 Plus 4TB has good sustained sequential write data for competing drives. And, this AnandTech review of the Samsung 980 Pro has some more good data for fast drives under the Cache Size Effects test. My own testing with some of these drives, using ZU+ bare metal NVMe, has largely aligned with these benchmarks.

The unfortunate trend is that, while peak write speeds have increased dramatically in the last few years, sustained sequential write speeds may have actually gotten worse. This trend can be seen globally as well as within specific product lines. (It might even be true across different date codes of the same drive.) Take for example the Samsung 970 Pro, an MLC (two bits per cell) drive released in 2018 that had no SLC cache but could write MLC at over 2.5GB/s across its full 1TB capacity. Its successor, the 980 Pro, has much higher peak SLC cache write speeds, nearing 5GB/s with PCIe Gen4, but dips below 1.5GB/s at some points after the SLC cache runs out.

Things get more complicated when considering the allocation state of the SSD. The sustained write benchmarks are usually taken after the entire SSD has been deallocated, via a secure erase or whole-drive TRIM. This restores the SLC cache and resets garbage collection to some initial state. If instead the drive is left "full" and old blocks are overwritten, the SLC cache is not recovered. However, this may also result in faster and more steady sustained sequential writing, as it prevents the undershoot that happens when the SLC cache runs out and must be unloaded into TLC.

So in certain conditions and with the right SSD, it's just possible to get to sustained sequential write speeds of 2GB/s with raw disk access. But, what about with a file system? I originally tested FatFs with the drive formatted as FAT32, reasoning (incorrectly) that an older file system would be simpler and have less overhead. But as it turns out, exFAT is a much better choice for fast sustained sequential writing.

The most important difference is how FAT32 and exFAT check for and update cluster allocation. Clusters are the unit of memory allocated for file storage - all files take up an integer number of clusters on the disk. The clusters don't have to be sequential, though, so the File Allocation Table (FAT) contains chained lists of clusters representing a file. For sequentially-written files, this list is contiguous. But the FAT allows for clusters to be chained together in any order for non-contiguous files. Each 32b entry in the FAT is just a pointer to the next cluster in the file.

FAT32 cluster allocation entirely based on 32b FAT entries.

In FAT32, the cluster entries are mandatory and a sequential write must check and update them as it progresses. This means that for every cluster written (64KiB in maxed-out FAT32), 32b of read and write overhead is added. In FatFs, this gets buffered until a full LBA (512B) of FAT update is ready, but when this happens there's a big penalty for stopping the flow of sequential writing to check and update the FAT.
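
To put rough numbers on that overhead (my arithmetic, not a measurement from this post): with 64KiB clusters and 32-bit FAT entries, a 512B FAT sector only covers 128 clusters, so the sequential stream gets interrupted for a FAT read-modify-write roughly once every 8MiB written.

    /* Worked example (C): FAT32 metadata update frequency for a purely
     * sequential write, using the cluster and sector sizes quoted above. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned cluster_bytes = 64u * 1024u; /* maxed-out FAT32 cluster */
        const unsigned entry_bytes   = 4u;          /* one 32-bit FAT entry per cluster */
        const unsigned sector_bytes  = 512u;        /* FAT updates buffered per LBA */

        unsigned clusters_per_fat_sector = sector_bytes / entry_bytes;          /* 128 */
        unsigned data_per_fat_flush = clusters_per_fat_sector * cluster_bytes;  /* 8 MiB */

        printf("FAT32: one FAT sector read-modify-write per %u MiB written\n",
               data_per_fat_flush >> 20);
        return 0;
    }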

In exFAT, the cluster entries in the FAT are optional. Cluster allocation is handled by a bitmap, with one bit representing each cluster (0 = free, 1 = allocated). For a sequential file, this is all that's needed. Only non-contiguous files need to use the 32b cluster entries to create a chain in the FAT. As a result, sequential writing overhead is greatly reduced, since the allocation updates happen 32x less frequently.

exFAT cluster allocation using bitmap only for sequential files.
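
Repeating that arithmetic for the exFAT bitmap (again, just a sketch): one bit per cluster means a 512B bitmap sector covers 4096 clusters, so at the same 64KiB cluster size the allocation metadata only needs touching once every 256MiB.

    /* Same calculation for exFAT's allocation bitmap: 1 bit per cluster
     * instead of a 32-bit FAT entry. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long cluster_bytes = 64u * 1024u;  /* same 64KiB clusters */
        const unsigned long sector_bytes  = 512u;         /* bitmap updates buffered per LBA */

        unsigned long clusters_per_bitmap_sector = sector_bytes * 8u;                    /* 4096 */
        unsigned long data_per_bitmap_flush = clusters_per_bitmap_sector * cluster_bytes;

        printf("exFAT: one bitmap sector update per %lu MiB written\n",
               data_per_bitmap_flush >> 20);
        /* 4096 clusters per bitmap sector vs. 128 per FAT sector = 32x fewer interruptions. */
        return 0;
    }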

The cluster size in exFAT is also not limited to 64KiB. Using larger clusters further reduces the allocation update frequency, at the expense of more dead space between files. If the plan is to write multi-GB files anyway, having 1MiB clusters really isn't a problem. And speaking of multi-GB files, exFAT doesn't have the 4GiB file size limit that FAT32 has, so the file creation overhead can also be reduced. This does put more data "at risk" if a power failure occurs before the file is closed. (Most of the data would probably still be in flash, but it would need to be recovered manually.)
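
For reference, formatting the drive this way from FatFs itself looks roughly like the following. This is a sketch assuming FatFs R0.14's f_mkfs()/MKFS_PARM interface with FF_USE_MKFS and FF_FS_EXFAT enabled in ffconf.h; the "0:" volume string and work buffer size are placeholders.

    #include "ff.h"

    /* Format the SSD as exFAT with 1MiB allocation units (clusters). */
    FRESULT format_exfat_1mib(void)
    {
        static BYTE work[64 * 1024];      /* working buffer for f_mkfs() */
        MKFS_PARM opt = {
            .fmt     = FM_EXFAT,          /* force exFAT */
            .n_fat   = 0,                 /* default */
            .align   = 0,                 /* default data area alignment */
            .n_root  = 0,                 /* unused for exFAT */
            .au_size = 1024 * 1024        /* 1MiB clusters */
        };
        return f_mkfs("0:", &opt, work, sizeof work);
    }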

Altogether, these features make the overhead of exFAT almost negligible:

With 1MiB clusters and 16GiB files, it's possible to get ~2GB/s of sustained sequential file writing onto a 980 Pro for its entire 2TB capacity. I think this is probably the fastest implementation of FatFs in existence right now. The data block size still needs to be at least 64KiB, to keep the driver overhead low. But if a reasonable amount of streaming data can be buffered in RAM, this isn't too much of a constraint. And of course you do have to keep the SSD cool.
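
The write loop itself doesn't need anything exotic. A minimal sketch (not the post's actual benchmark code; getNextChunk() is a hypothetical stand-in for whatever RAM buffer feeds the stream) looks like this:

    #include "ff.h"

    #define CHUNK_BYTES (1024 * 1024)       /* >=64KiB per f_write to keep driver overhead low */

    extern const BYTE *getNextChunk(void);  /* hypothetical streaming data source */

    /* Stream n_chunks * 1MiB of data into a single large file. */
    FRESULT stream_to_file(const TCHAR *path, UINT n_chunks)
    {
        FIL f;
        UINT bw;
        FRESULT res = f_open(&f, path, FA_WRITE | FA_CREATE_ALWAYS);
        if (res != FR_OK) return res;

        for (UINT i = 0; i < n_chunks && res == FR_OK; i++) {
            res = f_write(&f, getNextChunk(), CHUNK_BYTES, &bw);
            if (res == FR_OK && bw != CHUNK_BYTES) res = FR_DENIED;  /* volume full */
        }

        FRESULT res_close = f_close(&f);  /* flushes the directory entry and allocation bitmap */
        return (res != FR_OK) ? res : res_close;
    }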

I've updated the bare metal NVMe test project to Vivado/Vitis 2021.1 here. It would still require some effort to port to a different board, and I still make no claims about the suitability of this driver for any real purposes. But if you need to write massive amounts of data and don't want to mess around in Linux (or want to try something similar in Linux user space...) it might be a good reference.

7 comments:

  1. Awesome. Shane, it'd be good to have a call. ash.dovejay AT orbastro.com

  2. Hi Shane. Thank you so much for this reference material. It has been a great resource for learning about nvme drives and working with the xdma ip in baremetal! Great work!

  3. Hi Shane, thank you so much! I previously got the raw read and write working and have now moved to the exFAT implementation. I got it working and am able to create files in the root directory. However, when I started creating directories I started to have issues. For example, I will create a directory and then move into that directory using the ch_dir function. I will then create a new file in that directory and write to it. After I am done writing, I close the file. I get no return errors from any of the file functions (f_open, f_write, f_close, etc.). When I connect the SSD to my computer, the directory folder is created, but the file does not show up in the folder. When I look at the properties of the SSD, it is clear that the data was written, because the used space increases every time I write a file.

    Any tips or advice to debug this?

    Replies
    1. That is strange. I don't think I've used f_chdir before. Is the behavior the same if you access the file using the full path? In other words, opening "/dir/file.ext" instead of f_chdir to "/dir" and then opening "file.ext".

      http://elm-chan.org/fsw/ff/doc/filename.html

      If that still doesn't work, you might be able to search the raw disk image for the file name you created and see if there is even an entry for it. I have used WinHex for things like this, since it can parse the exFAT file system structures (with certain licenses), but HxD can also work to just look at the raw disk.

    2. Hi, yes I have tried this as well and I had issues. I ended up landing on somewhat of a solution. I first open the file in the root directory, write to it, and then close the file. I then will move the file to the directory I wish to have it in.

      Another issue has surfaced while testing this. I will sometimes have an unexpected problem where f_sync() stalls. The actual stall happens deeper in this function's code. The function that stalls is "load_obj_xdir". I'm not sure why this is failing.

  4. Hi Shane,

    I am seeing some weird behavior. I will create a file, write to it, close it, and then move it to a folder. This all happens without issues. If I create another file and move it to the same folder, both files end up getting deleted. I'm not sure what is going on with this.

    Also, I had issues when creating a file inside of a folder. I created the folder and then opened a file as "directory/filename.txt". I am able to write to the file without errors, however, when I try to close the file, it returns "FR_INT_ERR".

    Do you have any insight into what might be happening?

    Thank you!

    Replies
    1. Are you using asynchronous writes and/or reads in your application? By that I mean not requiring nvmeGetSlip() to be zero before issuing the next write or read command from FatFs diskio.c. If so, you may want to try using blocking reads and writes that don't issue a new command until nvmeGetSlip() == 0. This will limit the maximum throughput, but if it gives you different behavior with respect to your file system issues, it might be a good clue.

      If you find that things are working correctly with all-blocking reads and writes, then the next step could be to limit the asynchronous reads and writes to only bulk data. I do this based on the source or destination address of the transfer, only allowing slip if it's in a circular buffer. This way, file system reads and writes still block. That might prevent a case where FatFs tries to read a file system entry that was just written but hasn't made it through the NVMe command pipeline yet.
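
      Roughly, a fully blocking disk_write() in diskio.c could look like the sketch below. This assumes nvmeGetSlip() reports the number of in-flight commands and that completions get serviced while polling; nvmeWrite() and its return value are placeholder names, not necessarily the real driver calls.

          #include "ff.h"
          #include "diskio.h"

          /* Placeholder prototypes for the bare metal driver calls; names
           * are illustrative, not necessarily the real ones. */
          extern int nvmeWrite(const void *buf, LBA_t lba, UINT count);
          extern unsigned int nvmeGetSlip(void);

          DRESULT disk_write(BYTE pdrv, const BYTE *buff, LBA_t sector, UINT count)
          {
              (void)pdrv;

              /* Issue the NVMe write (placeholder driver call). */
              if (nvmeWrite(buff, sector, count) != 0) return RES_ERROR;

              /* Blocking behavior: don't return to FatFs until nothing is
               * in flight, so metadata reads can't race the pipeline. */
              while (nvmeGetSlip() != 0) { /* poll/service completions */ }

              return RES_OK;
          }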
