Shane Colton: Benchmarking NVMe through the Zynq Ultrascale+ PL PCIe Linux Root Port Driver

Monday, July 15, 2019

Benchmarking NVMe through the Zynq Ultrascale+ PL PCIe Linux Root Port Driver

I want to be able to sink 1GB/s into an NVMe SSD from a Zynq Ultrascale+ device, something I know is technically possible but I haven't seen demonstrated without proprietary hardware accelerators. The software approach - through Linux and the Xilinx drivers - has enough documentation scattered around to make it work, if you have a lot of patience. But the only speed reference I could find for it is this Z-7030 benchmark of 84.7MB/s. I found nothing for the newer ZU+, with the XDMA PCIe Bridge driver. I wasn't expecting it to be fast enough, but it seemed worth the effort to do a speed test.

For hardware, I have my TE0803 carrier with the ZU4CG version of the TE0803. All I need for this test is JTAG, UART, and the four PL-side GT transceivers, for PCIe Gen3 x4. I made a JTAG + UART cable out of the Digilent combo part, which is directly supported in Vivado and saves a separate USB port for the terminal. Using Trenz's bare board files, it was pretty quick to set up.

TE0803 Carrier + Dead-Bugged Digilent JTAG+USB Adapter.

Next, I wanted to validate the PCIe routing with a loopback test, following this video as a guide. I made my own loopback out of the cheapest M.2 to PCIe x4 adapter Amazon has to offer by desoldering the PCIe x4 connector and putting in twisted pairs. This worked out nicely since I could intentionally mismatch the length of one pair to get a negative result, confirming I wasn't in some internal loopback mode.

The three-eyed...something.

For most of the rest of this test, I'm roughly following the script from the FPGA Drive example design readme files, with deviations for my custom board and for Vivado 2019.1 support. The scripts there generate a Vivado project and block design with the Processing System and the XDMA PCIe Bridge. I had a few hardware differences that had to be taken care of manually (EMIO UART, inverted SSD reset signal), but having a reference design to start from was great.

The example design includes a standalone application for simply checking that a PCIe drive enumerates on the bus, but it isn't built for the ZU+. As the readme mentions, there had been no standalone driver for the XDMA PCIe Bridge. Well, as of Vivado 2019.1, there is! In SDK, the standalone project for xdmapcie_rc_enumerate_example.c can be imported directly from the peripheral driver list in system.mss from the exported hardware.

XDMA standalone driver example is available as of Vivado 2019.1!

I installed an SSD and ran this project and much to my amazement, the enumeration succeeded. By looking at the PHY Status/Control register at offset 0x144 from the Bridge Register Memory Map base address (0x400000000 here), I was also able to confirm that link training had finished and the link was Gen3 x4. (Documentation for this is in PG194.) Off to a good start, then.
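For anyone who would rather read the same register from a running Linux system than from the standalone application, here is a minimal sketch. It assumes Python 3 is present on the target, that /dev/mem is enabled in the kernel and opened as root, and that the bridge base address is the 0x400000000 used in this design. The helper name read_reg32 is hypothetical. It returns only the raw 32-bit value; the bit fields still have to be decoded against PG194.

```python
import mmap
import os
import struct

BRIDGE_BASE = 0x400000000   # Bridge Register Memory Map base address in this design
PHY_STATUS_CTRL = 0x144     # PHY Status/Control register offset (see PG194)


def read_reg32(base, offset, mem_path="/dev/mem"):
    """Return the little-endian 32-bit word at physical address base + offset."""
    page = mmap.ALLOCATIONGRANULARITY
    addr = base + offset
    aligned = addr & ~(page - 1)      # mmap offsets must be page aligned
    delta = addr - aligned            # position of the register inside the mapping
    fd = os.open(mem_path, os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, delta + 4, mmap.MAP_SHARED, mmap.PROT_READ,
                       offset=aligned) as window:
            return struct.unpack_from("<I", window, delta)[0]
    finally:
        os.close(fd)


# Example call on the target (needs root and real hardware behind the address):
#   value = read_reg32(BRIDGE_BASE, PHY_STATUS_CTRL)
#   print(f"PHY Status/Control = 0x{value:08X}")
```

The mem_path parameter exists only so the helper can be exercised against an ordinary file on a development machine before pointing it at real hardware.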

Installed a 1TB Samsung 970 EVO Plus.

Unfortunately, that's where the road seems to end in terms of quick and easy setup. The next stage involves PetaLinux, which is a toolchain for building the Xilinx Linux kernel. I don't know about other people, but every time the words "Linux" and "toolchain" cross my path, I automatically lose a week of time to setup and debugging. This was no exception.

Unsurprisingly, PetaLinux tools run in Linux. I went off on a bit of a tangent trying to see if they would run in WSL2. They do, if you contain your project in the Linux file system. In other words, I couldn't get it to work on /mnt/c/... but it worked fine if the project was in ~/home/... But, WSL2 is a bit bleeding edge still and there's no USB support as of now. So you can build, but not JTAG boot. If you boot from an SD card, though, it might work for you.

So I started over with a VirtualBox VM running Ubuntu 18.04, which was mercifully easy to set up. For reasons I cannot at all come to terms with, you need at least 100GB of VM disk space for the PetaLinux build environment, all to generate a boot image that measures in the 10s of MB. I understand that tools like this tend to clone in entire repositories of dependencies, but seriously?! It's larger than all of my other development tools combined. I don't need the entire history of every tool involved in the build...

And here I thought Xilinx was a disk space hog...

The build process, even not including the initial creation of this giant environment, is also painfully slow. If you are running it on a VM, throw as many cores at it as you can and then still plan to go do something else for an hour. I started from the build script in the FPGA Drive example design, making sure it targeted cpu_type="zynqMP" and pcie_ip="xdma".

This should set up the kernel properly, but some of the config options in PetaLinux 2019.1 might not exactly match the packaged configs. There's a reference here explaining how to manually configure the kernel for PCIe and NVMe hosting on the Z-7030. I went through that, subbing in what I thought were correct ZU+ and XDMA equivalents where necessary. Specifically:

It seems like as of PetaLinux 2019.1 (v4.19.0), there's an entire new menu item under Bus support for PCI Express Port Bus support. Including this expands the menu with other PCI Express-specific items, which I left at whatever their default state was.

Under Bus support > PCI controller drivers, Xilinx XDMA PL PCIe host bridge support has to be included. I don't actually know if the NWL PCIe Core is also required, but left it in since it was enabled by default. It might be the driver for the PS-side PCIe?

Some things related to NVMe are in slightly different places. There's an item called Enable the block layer on the main config page that I assume should be included. Under Device Drivers, Block devices should be included. And under Device Drivers > NVME Support, NVM Express block device should also be included.
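Since it took several attempts to get a working build, a quick check of the generated kernel .config against the menu items above can save a rebuild. The sketch below is one way to do that. The CONFIG_ symbol names are my best guess at what sits behind each menu entry and are assumptions, not something confirmed by the build script; the help screen in menuconfig shows the actual symbol for each item. The function name missing_options is hypothetical.

```python
REQUIRED_OPTIONS = [
    "CONFIG_BLOCK",          # Enable the block layer
    "CONFIG_BLK_DEV",        # Device Drivers > Block devices
    "CONFIG_PCIEPORTBUS",    # Bus support > PCI Express Port Bus support
    "CONFIG_PCIE_XDMA_PL",   # Xilinx XDMA PL PCIe host bridge support (assumed symbol)
    "CONFIG_BLK_DEV_NVME",   # NVM Express block device
]


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to y or m in a .config."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]
```

Passing the text of the kernel .config to missing_options returns an empty list when every listed option is built in or built as a module.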

The rest of the kernel and rootfs config seems to match pretty closely the Z-7030 setup linked above. But I will admit it took me three attempts to create a build that worked, and I don't know exactly what trial-and-error steps I did between each one. Even once the correct controller driver (pcie-xdma-pl.c) was being included in the build, I couldn't get it to compile successfully without this patch. I don't know what the deal is with that, but after that I finally got a build that would enumerate the SSD on the PCIe bus:

Output from lspci -vv confirms link speed and width.
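The link check can also be scripted rather than read off the screen. The sketch below pulls the negotiated speed and width out of the LnkSta lines in lspci -vv output and maps the transfer rate to a PCIe generation. It assumes the usual "LnkSta: Speed ..., Width x..." layout, with or without the parenthesised annotations newer lspci versions add. The function name parse_link_status is hypothetical.

```python
import re

# Per-lane transfer rate to PCIe generation
GENERATION_BY_RATE = {"2.5GT/s": 1, "5GT/s": 2, "8GT/s": 3, "16GT/s": 4}

# Matches "LnkSta: Speed 8GT/s, Width x4" and "LnkSta: Speed 8GT/s (ok), Width x4 (ok)"
LNKSTA_RE = re.compile(
    r"LnkSta:\s*Speed\s+([\d.]+GT/s)(?:\s*\([^)]*\))?,\s*Width\s+x(\d+)"
)


def parse_link_status(lspci_text):
    """Return a list of (rate, generation, width), one per LnkSta line found."""
    results = []
    for match in LNKSTA_RE.finditer(lspci_text):
        rate = match.group(1)
        width = int(match.group(2))
        # Generation is None for a rate that is not in the table
        results.append((rate, GENERATION_BY_RATE.get(rate), width))
    return results
```

A Gen3 x4 link shows up as ("8GT/s", 3, 4). With a root port and an SSD on the bus there is one entry per device, in the order lspci lists them.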

I had already partitioned the drive off-board, so I skipped over those steps and went straight to the speed tests as described here. I tested a few different block sizes and counts with pretty consistent results: about 460MB/s write and 630MB/s read.

Not sure about those correctable errors. I guess it's better than un-correctable errors.
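For reference, the shape of a block-size-and-count write test like the one above can be sketched in a few lines. This is not the command I ran; it is a stand-in with the same structure, assuming Python 3 on the target and a mounted partition on the SSD. The function name timed_write and any path passed to it are hypothetical. It calls fsync before stopping the clock so the figure includes flushing the page cache to the drive.

```python
import os
import time


def timed_write(path, block_size, count):
    """Write count blocks of zeros to path, flush to the device, return MB/s."""
    view = memoryview(bytes(block_size))
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(count):
            written = 0
            while written < block_size:      # os.write may write fewer bytes than asked
                written += os.write(fd, view[written:])
        os.fsync(fd)                          # include the time to push cached data out
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    return block_size * count / elapsed / 1e6
```

Calling it with a few different block sizes and counts, as with the tests above, shows whether the result is sensitive to either.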

That is actually pretty fast, compared to the Z-7030 benchmark. The ZU+ and the new driver seem like they're able to make much better use of the SSD. But, it's still about a factor of two below what I want. There could be some extra performance to squeeze out from driver optimization, but at this point I feel like the effort will be better-spent looking into hardware acceleration, which has been demonstrated to get to 1GB/s write speeds, even on older hardware.
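To put the "factor of two" in numbers, the arithmetic is sketched below. It compares the measured rates with the 1GB/s target and with the raw Gen3 x4 link rate after 128b/130b line encoding. The link figure ignores packet and protocol overhead, so it is an upper bound rather than an achievable throughput. The function name link_bandwidth_mb_s is hypothetical.

```python
def link_bandwidth_mb_s(gt_per_s, lanes, payload_bits, total_bits):
    """Raw PCIe link bandwidth in MB/s after line encoding, before protocol overhead."""
    return gt_per_s * 1e9 * lanes * payload_bits / total_bits / 8 / 1e6


gen3_x4 = link_bandwidth_mb_s(8.0, 4, 128, 130)   # Gen3 uses 128b/130b encoding
target = 1000.0          # the 1GB/s goal, in MB/s
measured_write = 460.0   # from the speed tests above
measured_read = 630.0

print(f"Gen3 x4 link ceiling: {gen3_x4:.0f} MB/s")
print(f"write shortfall: {target / measured_write:.2f}x")
print(f"read shortfall:  {target / measured_read:.2f}x")
```

The link ceiling comes out a little under 4GB/s, so the link itself is not what holds the write speed to roughly 2.2 times below the target.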

Since there's no published datasheet or pricing information for that or any other NVMe hardware accelerator, I'm not inclined to even consider it as an option. At the very least, I plan to read through the open specification and see what actually is required of an NVMe host. If it's feasible, I'd definitely prefer an ultralight custom core to a black-box IP...but that's just me. In the meantime, I have some parallel development paths to work on.

26 comments:

Anonymous, August 20, 2019 at 7:34 PM
Hi,
Have you ever tried dd with oflag=direct?

I used it and the nvme speed was x2 or x3 times better.

Ambivalent Engineer, August 28, 2019 at 10:28 PM
Hi Shane.

I'm working a similar problem. I have a Zynq 030 writing to a Samsung 960 across x4 PCIe. Our data source is a set of hardware sensors that write around 1300 MB/s into the DRAM attached to the PS. (Note the PS DRAM itself is just 4 GB/s, so zero-copy is critical.) I don't think we'll end up saving all of it but we'd like to save over 450 MB/s.

For our first attempt, we mmap()ed the DRAM buffers into the virtual address space of a process, and then issued write()s from that to files on an ext4 filesystem. 72 MB/s. dd if=/dev/zero of=/media/nmve/foo bs=2M count=1024 delivers 78 MB/s.

Next, we rebuilt the ext4 filesystem to eliminate the journal. This made it possible to open writes to the filesystem with O_DIRECT. However, it's not possible to write from mmap()ed physical memory to an O_DIRECT file... the write fails with illegal address. We have a hack right now where we memcopy from the mmap()ed physical memory to a 1 MB aligned_alloc() buffer, and then write from that to the O_DIRECT file. 108 MB/s. We do not understand why Linux won't let us write from an address which has been mmap()ed from /dev/mem.

We have a trivial test program which just sequentially writes 1MB blocks from an aligned_alloc() buffer to an O_DIRECT filehandle. It gets 240 MB/s, which suggests the Linux filesystem is never going to hit 450 MB/s, even if we get rid of that awful copy.

I don't think we'll need actual hardware (FPGA firmware) to write 450 MB/s though. It should be possible to issue commands directly to the NVMe from software by creating a command and completion queue pair and writing directly to the command queue. The data is in contiguous 1 MB hunks in physical memory, so we only need to issue at most a thousand NVMe commands per second. I don't think the ARM will have much trouble with that. I'm not sure what your data source looks like, but you might consider this approach. I would think that even hundreds of thousands of NVMe commands per second should be possible.

One thing we'd be very interested in is partitioning the NVMe drive, and having Linux mount one partition while we operate a command queue on the other. Since the NVMe can handle thousands of queue pairs, this is possible in principle (we only need one queue pair!) We don't know what it takes to convince Linux to play along. If you discover anything we'd love to know.

Mates, September 4, 2019 at 4:55 AM
Hi guys,
what do you think about an NVMe driver from SPDK https://spdk.io/doc/nvme.html ? Maybe it would help.

kuku, September 4, 2019 at 10:08 AM
Hi,
Have you checked the actual negotiated link shown in "LnkSta"?
I'm doing a similar thing with PCIe in PL and can't get a Gen3 link (it's 2.5GT/s, Gen1, even though the device is capable of establishing a Gen3 link). In that condition the SSD write speed is about 600MB/s. It should be considerably higher with a Gen3 link.

Ambivalent Engineer, September 9, 2019 at 11:00 AM
Hey kuku how are you measuring 1.2 GB/s? What filesystem setup, are you using O_DIRECT, and which memory are you using inside the Zynq?

EnthuMan, November 15, 2022 at 8:31 AM
Hello Shane, very interesting work. Do you happen to have any other reference for the video (it seems to be broken)?