Monday, July 15, 2019

Benchmarking NVMe through the Zynq Ultrascale+ PL PCIe Linux Root Port Driver

I want to be able to sink 1GB/s into an NVMe SSD from a Zynq Ultrascale+ device, something I know is technically possible but I haven't seen demonstrated without proprietary hardware accelerators. The software approach - through Linux and the Xilinx drivers - has enough documentation scattered around to make it work, if you have a lot of patience. But the only speed reference I could find for it is this Z-7030 benchmark of 84.7MB/s. I found nothing for the newer ZU+, with the XDMA PCIe Bridge driver. I wasn't expecting it to be fast enough, but it seemed worth the effort to do a speed test.

For hardware, I have my TE0803 carrier, fitted with the ZU4CG version of the TE0803 module. All I need for this test is JTAG, UART, and the four PL-side GT transceivers, for PCIe Gen3 x4. I made a JTAG + UART cable out of the Digilent combo part, which is directly supported in Vivado and saves a separate USB port for the terminal. Using Trenz's bare board files, it was pretty quick to set up.

TE0803 Carrier + Dead-Bugged Digilent JTAG+USB Adapter.
Next, I wanted to validate the PCIe routing with a loopback test, following this video as a guide. I made my own loopback out of the cheapest M.2 to PCIe x4 adapter Amazon has to offer by desoldering the PCIe x4 connector and putting in twisted pairs. This worked out nicely since I could intentionally mismatch the length of one pair to get a negative result, confirming I wasn't in some internal loopback mode.

The three-eyed...something.
For most of the rest of this test, I'm roughly following the script from the FPGA Drive example design readme files, with deviations for my custom board and for Vivado 2019.1 support. The scripts there generate a Vivado project and block design with the Processing System and the XDMA PCIe Bridge. I had a few hardware differences that had to be taken care of manually (EMIO UART, inverted SSD reset signal), but having a reference design to start from was great.

The example design includes a standalone application for simply checking that a PCIe drive enumerates on the bus, but it isn't built for the ZU+. As the readme mentions, there had been no standalone driver for the XDMA PCIe Bridge. Well, as of Vivado 2019.1, there is! In SDK, the standalone project for xdmapcie_rc_enumerate_example.c can be imported directly from the peripheral driver list in system.mss from the exported hardware.

XDMA standalone driver example is available as of Vivado 2019.1!
I installed an SSD and ran this project and much to my amazement, the enumeration succeeded. By looking at the PHY Status/Control register at offset 0x144 from the Bridge Register Memory Map base address (0x400000000 here), I was also able to confirm that link training had finished and the link was Gen3 x4. (Documentation for this is in PG194.) Off to a good start, then.
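
For reference, here's roughly what that check looks like from the standalone application. This is just a minimal sketch - the macro names and the function are mine, the base address is the 0x400000000 from my address editor, and decoding the link rate/width fields is left to the PG194 register table:

  #include "xil_io.h"
  #include "xil_printf.h"

  #define BRIDGE_BASE      0x400000000ULL  /* Bridge Register Memory Map base (from the address editor) */
  #define PHY_STATUS_CTRL  0x144           /* PHY Status/Control register offset, per PG194 */

  void print_phy_status(void)
  {
      /* Raw register dump; decode the link rate and link width fields using the PG194 table. */
      u32 phy = Xil_In32(BRIDGE_BASE + PHY_STATUS_CTRL);
      xil_printf("PHY Status/Control = 0x%x\r\n", phy);
  }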

Installed a 1TB Samsung 970 EVO Plus.
Unfortunately, that's where the road seems to end in terms of quick and easy setup. The next stage involves PetaLinux, which is a toolchain for building the Xilinx Linux kernel. I don't know about other people, but every time the words "Linux" and "toolchain" cross my path, I automatically lose a week of time to setup and debugging. This was no exception.

Unsurprisingly, PetaLinux tools run in Linux. I went off on a bit of a tangent trying to see if they would run in WSL2. They do, if you contain your project in the Linux file system. In other words, I couldn't get it to work on /mnt/c/... but it worked fine if the project was in ~/home/... But, WSL2 is a bit bleeding edge still and there's no USB support as of now. So you can build, but not JTAG boot. If you boot from an SD card, though, it might work for you.

So I started over with a VirtualBox VM running Ubuntu 18.04, which was mercifully easy to set up. For reasons I cannot at all come to terms with, you need at least 100GB of VM disk space for the PetaLinux build environment, all to generate a boot image that measures in the 10s of MB. I understand that tools like this tend to clone in entire repositories of dependencies, but seriously?! It's larger than all of my other development tools combined. I don't need the entire history of every tool involved in the build...

And here I thought Xilinx was a disk space hog...
The build process, even not including the initial creation of this giant environment, is also painfully slow. If you are running it on a VM, throw as many cores at it as you can and then still plan to go do something else for an hour. I started from the build script in the FPGA Drive example design, making sure it targeted cpu_type="zynqMP" and pcie_ip="xdma".

This should set up the kernel properly, but some of the config options in PetaLinux 2019.1 might not exactly match the packaged configs. There's a reference here explaining how to manually configure the kernel for PCIe and NVMe hosting on the Z-7030. I went through that, subbing in what I thought were correct ZU+ and XDMA equivalents where necessary. Specifically:
  • It seems like as of PetaLinux 2019.1 (kernel 4.19.0), there's an entirely new menu item under Bus support for PCI Express Port Bus support. Including this expands the menu with other PCI Express-specific items, which I left at their default states.
  • Under Bus support > PCI controller drivers, Xilinx XDMA PL PCIe host bridge support has to be included. I don't actually know if the NWL PCIe Core driver is also required, but I left it in since it was enabled by default; it's the driver for the PS-side PCIe controller, so it shouldn't be needed for this PL-side test, but it's harmless.
  • Some things related to NVMe are in slightly different places. There's an item called Enable the block layer on the main config page that I assume should be included. Under Device Drivers, Block devices should be included. And under Device Drivers > NVME Support, NVM Express block device should also be included. (The kernel config symbols I believe these all map to are collected just below.)
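
For reference, these are the kernel config symbols I believe those menu items map to, based on the 4.19 linux-xlnx Kconfig. Double-check them against the help text in menuconfig; I'm only fairly sure this mapping is right:

  CONFIG_PCIEPORTBUS=y      # PCI Express Port Bus support
  CONFIG_PCIE_XDMA_PL=y     # Xilinx XDMA PL PCIe host bridge support
  CONFIG_PCIE_XILINX_NWL=y  # NWL PCIe Core (PS-side PCIe), left at its default
  CONFIG_BLOCK=y            # Enable the block layer
  CONFIG_BLK_DEV=y          # Block devices
  CONFIG_BLK_DEV_NVME=y     # NVM Express block device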

The rest of the kernel and rootfs config seems to match the Z-7030 setup linked above pretty closely. But I will admit it took me three attempts to create a build that worked, and I don't know exactly what trial-and-error steps I did between each one. Even once the correct controller driver (pcie-xdma-pl.c) was being included in the build, I couldn't get it to compile successfully without this patch. I don't know what the deal is with that, but with the patch applied I finally got a build that would enumerate the SSD on the PCIe bus:

Output from lspci -vv confirms link speed and width.
I had already partitioned the drive off-board, so I skipped over those steps and went straight to the speed tests as described here. I tested a few different block sizes and counts with pretty consistent results: about 460MB/s write and 630MB/s read.

Not sure about those correctable errors. I guess it's better than un-correctable errors.
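
If you'd rather not fight with dd, the same kind of sequential write test is easy to do in C. Here's a rough sketch - the mount point, block size, and count are placeholders, and it's not exactly what I ran, just the general idea, with an fsync() so the page cache doesn't flatter the number:

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
      const size_t bs = 1 << 20;   /* 1MiB per write (placeholder) */
      const int count = 1024;      /* 1GiB total (placeholder) */
      char *buf = malloc(bs);
      memset(buf, 0xA5, bs);

      int fd = open("/mnt/ssd/speedtest.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (int i = 0; i < count; i++)
          write(fd, buf, bs);
      fsync(fd);   /* flush the page cache so the time reflects the drive, not DRAM */
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double s = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
      printf("%.1f MB/s\n", (double)bs * count / s / 1e6);
      close(fd);
      return 0;
  }
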
That is actually pretty fast, compared to the Z-7030 benchmark. The ZU+ and the new driver seem to make much better use of the SSD. But it's still about a factor of two below what I want. There could be some extra performance to squeeze out with driver optimization, but at this point I feel like the effort will be better spent looking into hardware acceleration, which has been demonstrated to reach 1GB/s write speeds, even on older hardware.

Since there's no published datasheet or pricing information for that or any other NVMe hardware accelerator, I'm not inclined to even consider it as an option. At the very least, I plan to read through the open specification and see what is actually required of an NVMe host. If it's feasible, I'd definitely prefer an ultralight custom core to a black-box IP...but that's just me. In the meantime, I have some parallel development paths to work on.

26 comments:

  1. Hi,
    Have you ever tried dd with oflag=direct?

    I used it and the NVMe speed was 2 to 3 times better.

    Replies
    1. I'll definitely give that a try! Thanks for the tip. Follow-up question: What, if anything, do you have to do during a normal non-dd sequential write (from a huge block of RAM to the SSD) to achieve the "direct" speed?

    2. When you open a file, you can use the O_DIRECT flag and it's the same as dd's direct flag.

      Also, the sendfile syscall is useful for achieving better performance. A study of "block size" per write is also recommended.

      Some simple code you can use to test speed in Python is:

      import os
      from sendfile import sendfile  # from the pysendfile package
      IMAGE_SIZE = 2 * 1080 * 1920   # one 1920x1080 frame at 2 bytes/pixel
      # Source: a udmabuf DMA buffer in DDR; destination: a file on the SSD, opened with O_DIRECT
      f_in = os.open('/dev/udmabuf0', os.O_RDONLY)
      f_out = os.open('/mnt/disk0/image.raw', os.O_WRONLY | os.O_DIRECT | os.O_CREAT | os.O_SYNC)
      sent = sendfile(f_out, f_in, 0, IMAGE_SIZE)  # in-kernel copy, no user-space buffer

    3. Unfortunately, it's looking like the PetaLinux version of dd does not support the oflag switch. (It's a stripped-down version.) I will see if it's possible to compile in the full version (or maybe hdparm?), as well as Python for testing direct file opening, at some point in the future. Thanks again for the tips!

    4. Would you try a Debian distribution on the ZynqMP? https://github.com/ikwzm/ZynqMP-FPGA-Linux We are using it and it works really great. 100% recommended. (Goodbye to the ugly PetaLinux!)

  2. Hi Shane.

    I'm working a similar problem. I have a Zynq 030 writing to a Samsung 960 across x4 PCIe. Our data source is a set of hardware sensors that write around 1300 MB/s into the DRAM attached to the PS. (Note the PS DRAM itself is just 4 GB/s, so zero-copy is critical.) I don't think we'll end up saving all of it but we'd like to save over 450 MB/s.

    For our first attempt, we mmap()ed the DRAM buffers into the virtual address space of a process, and then issued write()s from that to files on an ext4 filesystem. 72 MB/s. dd if=/dev/zero of=/media/nmve/foo bs=2M count=1024 delivers 78 MB/s.

    Next, we rebuilt the ext4 filesystem to eliminate the journal. This made it possible to open writes to the filesystem with O_DIRECT. However, it's not possible to write from mmap()ed physical memory to an O_DIRECT file... the write fails with illegal address. We have a hack right now where we memcopy from the mmap()ed physical memory to a 1 MB aligned_alloc() buffer, and then write from that to the O_DIRECT file. 108 MB/s. We do not understand why Linux won't let us write from an address which has been mmap()ed from /dev/mem.

    We have a trivial test program which just sequentially writes 1MB blocks from an aligned_alloc() buffer to an O_DIRECT filehandle. It gets 240 MB/s, which suggests the Linux filesystem is never going to hit 450 MB/s, even if we get rid of that awful copy.

    I don't think we'll need actual hardware (FPGA firmware) to write 450 MB/s though. It should be possible to issue commands directly to the NVMe from software by creating a command and completion queue pair and writing directly to the command queue. The data is in contiguous 1 MB hunks in physical memory, so we only need to issue at most a thousand NVMe commands per second. I don't think the ARM will have much trouble with that. I'm not sure what your data source looks like, but you might consider this approach. I would think that even hundreds of thousands of NVMe commands per second should be possible.

    One thing we'd be very interested in is partitioning the NVMe drive, and having Linux mount one partition while we operate a command queue on the other. Since the NVMe can handle thousands of queue pairs, this is possible in principle (we only need one queue pair!) We don't know what it takes to convince Linux to play along. If you discover anything we'd love to know.

    Replies
    1. Good to know I'm not alone! I need to take a look at the Linux driver to see what it would take to strip out just the NVMe queue interface to run standalone on the ARM. That's an interesting approach that I hadn't considered. I don't really need Linux, although it would simplify UI a lot.

      Currently, I have a capture front-end that is entirely on the PL and a very small ARM program that does link training and configuration of that peripheral. I run the ARM program out of OCM and disable all but the PL-PS slave ports on the DDR controller during capture. I'd love it if I could dedicate one port to writing from the front end to the head of a circular buffer and one port for the PCIe DMA bridge to read from the tail and write to the SSD. How to make that happen is the question!

      I'll definitely post an update as I do more testing.

    2. Check the udmabuf driver (https://github.com/ikwzm/udmabuf) for doing that. Using it, you are able to mmap and sequentially read a DDR memory buffer.

    3. Oh that's neat! Reminds me a lot of resource binding for GPUs. I think once I get to the UI stage, I'll need that to be able to put menu overlays on top of the frame before it gets sent to a display. Thanks for the tip.

    4. This little test loop gets 244 MB/s on a Zynq-7030:

      #define _GNU_SOURCE  /* needed for O_DIRECT */
      #include <fcntl.h>
      #include <stdint.h>
      #include <stdlib.h>
      #include <unistd.h>
      int main(void) {
          const int buffersize = 2*1024*1024;
          int outputfile = open("/dev/nvme0n1p1", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0777);
          uint8_t *buffer = (uint8_t *) aligned_alloc(4096, buffersize);
          for(int i = 0; i < 1000; i++) {
              ssize_t written = write(outputfile, buffer, buffersize);
          }
          return 0;
      }

      Would getting a buffer from udmabuf instead of aligned_alloc() improve performance? The pages would be pinned in physical memory, so perhaps the kernel doesn't have to do work to ensure they are present before sending write commands to the NVMe. I don't know if that's a lot of work -- I really don't understand what's holding up the Linux kernel at all in this test.

    5. "It should be possible to issue commands directly to the NVMe from software by creating a command and completion queue pair and writing directly to the command queue."

      As a follow-up, this is exactly what I wound up doing. My application runs in OCM RAM, but I allocate the queues in the external DDR, with the data. This way all PCIe traffic is addressed to external DDR and I don't have to waste valuable OCM space for big queues. With this I am able to get to 2GiB/s for 64GiB written in 1MiB blocks to a 970 Evo Plus. Will post more details once I get it fully integrated.

    6. Hello. Could you please update your progress in detail? I am trying to find a way to solve this issue.

    7. I wound up writing a bare-metal NVMe driver to handle my data moving requirements, which got its own post here:

      https://scolton.blogspot.com/2019/11/zynq-ultrascale-fatfs-with-bare-metal.html

      It's pretty application-specific (no O.S., heavily favors writes over reads), so I'm not sure how useful it'll be to others. There were a number of good suggestions for Linux speed-ups in other comments on this post, though, including using O_DIRECT to bypass page caching or using a faster user-space NVMe driver (SPDK is mentioned, but I also found NVMeDirect and depthcharge's NVME driver in my research).

      I couldn't test O_DIRECT in PetaLinux, but commenter kuku below mentions hitting 1.2GB/s using O_DIRECT with the regular NVMe driver in a different distribution of Linux.

      Lastly, since the time of writing, FPGADrive has posted new benchmarks for the stock Petalinux NVMe drive with the Ultrascale+. Their read benchmark matches up with mine (~625MB/s), but their write benchmark is still lower for some reason.

  3. Hi guys,
    what do you think about an NVMe driver from SPDK (https://spdk.io/doc/nvme.html)? Maybe it would help.

    Replies
    1. That looks promising! I like that it's just a C program running in user space. I'm not seeing any good references to it running in a Petalinux environment, but the fact that it's simple C gives me some hope that it could be ported.

  4. Hi,
    Have you checked the actual negotiated link shown in "LnkSta"?
    I'm doing a similar thing with PCIe in the PL and can't get a Gen3 link (it's 2.5GT/s, Gen1, even though the device is capable of establishing a Gen3 link). In that condition the SSD write speed is about 600MB/s. It should be quite a bit higher with a Gen3 link.

    Replies
    1. LnkSta reports: 8GT/s (ok), Width x4 (ok). I also checked that directly in the PCIe PHY register, offset 0x144 from the bridge's base address. If that also only shows Gen1 despite all the hardware being configured for Gen3, it might make sense to try the IBERT loopback test.

    2. Thanks, I did it! The core registers showed gen3 capability so I assumed that for some reason the negotiation wasn't being done... I forced the PERSTN of the SSD manually and that was it! Gen3 and 1.2 GB/s in a Samsung 960 EVO :D

    3. Great! What driver are you using for the NVMe? 1.2GB/s would be perfect for my application.

    4. Sorry, I thought I'd replied to this. I used the same XDMA PCIe bridge driver (this version: https://github.com/Xilinx/linux-xlnx/blob/xlnx_rebase_v4.14/drivers/pci/host/pcie-xdma-pl.c)

    5. Last question, hopefully: What Linux distro are you using? I see below you were able to use dd with oflag=direct to test the speed, which I couldn't do from Petalinux. (The version of dd that gets compiled in by default doesn't support the oflag switch.) I'm sure I could figure out how to compile the full version in, but if there's a different distribution that works out-of-the-box, so to speak, I would rather switch to that. For example, a commenter above suggested https://github.com/ikwzm/ZynqMP-FPGA-Linux.

    6. That's right, same here: https://github.com/ikwzm/ZynqMP-FPGA-Linux

      Forget petalinux as long as you can, you'll have a happier life! ;)

  5. Hey kuku, how are you measuring 1.2 GB/s? What's your filesystem setup, are you using O_DIRECT, and which memory are you using inside the Zynq?

    Replies
    1. Hi, I'm writing a file with O_DIRECT in an ext4 fs from `/dev/zero`.

      `dd if=/dev/zero of=/mnt/ssd/test_file bs=8M count=128 conv=fdatasync oflag=direct`

  6. Hello Shane, very interesting work. Do you happen to have any other reference for the video (as the link seems to be broken)?

    Replies
    1. I'm not sure if this is the same video, but it covers pretty much the same process: https://www.youtube.com/watch?v=CjcqOTDtPPY
