Shane Colton: Zynq Ultrascale+ FatFs and Direct Speed Tests with Bare Metal NVMe via AXI-PCIe Bridge

Friday, November 29, 2019

Blue wire PCIe REFCLK still hanging in there...

It's time to return to the problem of sinking 1GB/s of data onto an NVMe drive from a Zynq Ultrascale+ SoC. Last time, I benchmarked the Xilinx Linux drivers and found that they were fast, but not quite fast enough. In the comments of that post, there were many good suggestions for how to make up the difference without having to resort to a hardware accelerator. The consensus is that the hardware, namely the stock AXI-PCIe bridge, should be fast enough.

While a lot of the suggestions were ways to speed up the data transfer in Linux, and I have no doubt those would work, I also just don't want or need to run Linux in this application. The sensor input and wavelet compression modules are entirely built in Programmable Logic (PL), with only a minimal interface to the Processing System (PS) for configuration and control. So, I'm able to keep my entire application in the 256KB On-Chip Memory (OCM), leaving the external DDR4 RAM bandwidth free for data transfer.

After compression, the data is already in the DDR4 RAM where it should be visible to whatever DMA mechanism is responsible for transferring data to an NVMe drive. As Ambivalent Engineer points out in the comments:

It should be possible to issue commands directly to the NVMe from software by creating a command and completion queue pair and writing directly to the command queue.

In other words, write a bare metal NVMe driver to interface with the AXI-PCIe bridge directly for initiating and controlling data transfers. This seems like a good fit, both to this specific application and to my general proclivity, for better or worse, to move to lower-level code when I get stuck. A good place to start is by exploring the functionality of the AXI-PCIe bridge itself.

AXI-PCIe Bridge

Part of the reason it took me a while to understand the AXI-PCIe bridge is that it has many names. The version for Zynq-7000 is called AXI Memory Mapped to PCI Express (PCIe) Gen2, and is covered in PG055. The version for Zynq Ultrascale is called AXI PCI Express (PCIe) Gen 3 Subsystem, and is covered in PG194. And the version for Zynq Ultrascale+ is called DMA for PCI Express (PCIe) Subsystem, and is nominally covered in PG195. But, when operated in bridge mode, as it will be here, it's actually still documented in PG194. I'll be focusing on this version.

Whatever the name, the block diagram looks like this:

AXI-PCIe Bridge Root Port block diagram, adapted from PG194 Figure 1.

The AXI-Lite Slave Interface is straightforward, allowing access to the bridge control and configuration registers. For example, the PHY Status/Control Register (offset 0x144) has information on the PCIe link, such as speed and width, that can be useful for debugging. When the bridge is configured as a Root Port, as it must be to host an NVMe drive, this address space also provides access to the PCIe Configuration Space of both the Root Port itself, at offset 0x0, and the enumerated Endpoint devices, at other offsets.

PCIe Configuration Space layout, adapted from PG213 Table 2-35.

If the NVMe drive has successfully enumerated, its Endpoint PCIe Configuration Space will be mapped to some offset in the AXI-Lite Slave register space. In my case, with no switch involved, it shows up as Bus 1, Device 0, Function 0 at offset 0x1000. Here, it's possible to check the Device ID, Vendor ID, and Class Codes. Most importantly, the BAR0 register holds the PCIe memory address assigned to the device. The AXI address assigned to BAR0 in the Address Editor in Vivado is mapped to this PCIe address by the bridge.

Reads from and writes to the AXI BAR0 address are done through the AXI Slave Interface. This is a full AXI interface supporting burst transactions and a wide data bus. In another class of PCIe device, it might be responsible for transferring large amounts of data to the device through the BAR0 address range. But for an NVMe drive, BAR0 just provides access to the NVMe Controller Registers, which are used to set up the drive and inform it of pending data transfers.

The AXI Master Interface is where all NVMe data transfer occurs, for both reads and writes. One way to look at it is that the drive itself contains the DMA engine, which issues memory reads and writes to the system (AXI) memory space through the bridge. The host requests that the drive perform these data transfers by submitting them to a queue, which is also contained in system memory and accessed through this interface.

Bare Metal NVMe

Fortunately, NVMe is an open standard. The specification is about 400 pages, but it's fairly easy to follow, especially with help from this tutorial. The NVMe Controller, which is implemented on the drive itself, does most of the heavy lifting. The host only has to do some initialization and then maintain the queues and lists that control data transfers. It's worth looking at a high-level diagram of what should be happening before diving into the details of how to do it:

System-level look at NVMe data flow, with primary data streaming from a source to the drive.

After BAR0 is set, the host has access to the NVMe drive's Controller Registers through the bridge's AXI Slave Interface. They are just like any other device/peripheral control registers, used for low-level configuration, status, and control of the drive. The register map is defined in the NVMe Specification, Section 2.

One of the first things the host has to do is allocate some memory for the Admin Submission Queue and Admin Completion Queue. A Submission Queue (SQ) is a circular buffer of commands submitted to the drive by the host. It's written by the host and read by the drive (via the bridge AXI Master Interface). A Completion Queue (CQ) is a circular buffer of notifications of completed commands from the drive. It's written by the drive (via the bridge AXI Master Interface) and read by the host.

The Admin SQ/CQ are used to submit and complete commands relating to drive identification, setup, and control. They can be located anywhere in system memory, as long as the bridge has access to them, but in the diagram above they're shown in the external DDR4. The host software notifies the drive of their address and size by setting the relevant Controller Registers (via the bridge AXI Slave Interface). After that, the host can start to submit and complete admin commands:

The host software writes one or more commands to the Admin SQ.
The host software notifies the drive of the new command(s) by updating the Admin SQ doorbell in the Controller Registers through the bridge AXI Slave Interface.
The drive reads the command(s) from the Admin SQ through the bridge AXI Master Interface.
The drive completes the command(s) and writes an entry to the Admin CQ for each, through the bridge AXI Master Interface. Optionally, an interrupt is triggered.
The host reads the completion(s) and updates the Admin CQ doorbell in the Controller Registers, through the AXI Slave Interface, to tell the drive where to place the next completion.
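As a rough illustration of the five steps above, here is a minimal sketch in C of the host side of one queue pair. It is not the driver's actual code: the type and function names (`queue_pair_t`, `qp_submit`, `qp_poll`) and the queue depth are hypothetical, and the doorbells are passed in as plain pointers so the logic can be exercised without hardware. On a real system the queues would have to live in memory the drive can reach through the bridge, and cache maintenance or memory barriers would be needed around the queue accesses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_DEPTH 16u  /* hypothetical queue size, in entries */

/* 64-byte Submission Queue Entry. CDW0 holds the opcode (bits 7:0)
 * and the command identifier (bits 31:16). */
typedef struct { uint32_t cdw[16]; } sqe_t;

/* 16-byte Completion Queue Entry. Bit 0 of status is the phase tag. */
typedef struct {
    uint32_t dw0;
    uint32_t dw1;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t cid;
    uint16_t status;
} cqe_t;

typedef struct {
    sqe_t sq[QUEUE_DEPTH];
    cqe_t cq[QUEUE_DEPTH];
    uint16_t sq_tail;              /* next free SQ slot */
    uint16_t cq_head;              /* next CQ slot to check */
    uint32_t phase;                /* phase tag expected on a new entry */
    volatile uint32_t *sq_tail_db; /* SQ tail doorbell register */
    volatile uint32_t *cq_head_db; /* CQ head doorbell register */
} queue_pair_t;

void qp_init(queue_pair_t *qp, volatile uint32_t *sq_db,
             volatile uint32_t *cq_db)
{
    assert(qp && sq_db && cq_db);
    memset(qp, 0, sizeof(*qp));
    qp->phase = 1;  /* the drive writes phase = 1 on its first pass */
    qp->sq_tail_db = sq_db;
    qp->cq_head_db = cq_db;
}

/* Steps 1-2: copy the command into the SQ, then ring the tail doorbell. */
void qp_submit(queue_pair_t *qp, const sqe_t *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;
    qp->sq_tail = (uint16_t)((qp->sq_tail + 1u) % QUEUE_DEPTH);
    *qp->sq_tail_db = qp->sq_tail;
}

/* Step 5: consume up to max_count new completions, then ring the head
 * doorbell. Returns the number of completions consumed. */
unsigned qp_poll(queue_pair_t *qp, unsigned max_count)
{
    unsigned n = 0;
    while (n < max_count &&
           (uint32_t)(qp->cq[qp->cq_head].status & 1u) == qp->phase) {
        qp->cq_head = (uint16_t)((qp->cq_head + 1u) % QUEUE_DEPTH);
        if (qp->cq_head == 0) {
            qp->phase ^= 1u;  /* expected phase flips on every wrap */
        }
        n++;
    }
    if (n > 0) {
        *qp->cq_head_db = qp->cq_head;
    }
    return n;
}
```

The phase tag is what lets the host tell a fresh completion from a stale one without clearing the CQ: the drive inverts the bit it writes each time it wraps around the queue, so the host only has to track which value it expects next.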

In some cases, an admin command may request identification or capability data from the drive. If the data is too large to fit in the Admin CQ entry, the command will also specify an address to which to write the requested data. For example, during initialization, the host software requests the Controller Identification and Namespace Identification structures, described in the NVMe Specification, Section 5.15.2. These contain information about the capabilities, size, and low-level format (below the level of file systems or even partitions) of the drive. The space for these IDs must also be allocated in system memory before they're requested.

Within the IDs is information that indicates the Logical Block (LB) size, which is the minimum addressable memory unit in the non-volatile memory. 512B is typical, although some drives can also be formatted for 4KiB LBs. Many other variables are given in units of LBs, so it's important for the host to grab this value. There are also maximum and minimum page sizes, defined in the Controller Registers themselves, which apply to system memory. It's up to the host software to configure the actual system memory page size in the Controller Registers, but it has to be between these two values. 4KiB is both the absolute minimum and the typical value. It's still possible to address system memory in smaller increments (down to 32-bit alignment); this value just affects how much can be read/written per page entry in an I/O command or PRP List (see below).
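To make the size fields concrete, here is a small sketch in C of how these values can be decoded. The field positions follow the NVMe register and Identify Namespace layouts as I understand them (CAP.MPSMIN in bits 51:48, CAP.MPSMAX in bits 55:52, CC.MPS in bits 10:7, LBADS in bits 23:16 of an LBA Format descriptor). The function names are hypothetical and are not taken from the driver source.

```c
#include <assert.h>
#include <stdint.h>

/* Page size limits from the Controller Capabilities (CAP) register.
 * Each field encodes a size of 2^(12 + field) bytes. */
uint64_t cap_min_page_size(uint64_t cap)
{
    return 1ull << (12u + (unsigned)((cap >> 48) & 0xFu));
}

uint64_t cap_max_page_size(uint64_t cap)
{
    return 1ull << (12u + (unsigned)((cap >> 52) & 0xFu));
}

/* Return cc with the CC.MPS field (bits 10:7) set for page_size bytes.
 * page_size must be a power of two within the limits reported in cap. */
uint32_t cc_with_page_size(uint32_t cc, uint64_t cap, uint64_t page_size)
{
    uint32_t mps = 0;
    assert((page_size & (page_size - 1u)) == 0);
    assert(page_size >= cap_min_page_size(cap));
    assert(page_size <= cap_max_page_size(cap));
    while ((4096ull << mps) < page_size) {
        mps++;
    }
    return (cc & ~(0xFu << 7)) | (mps << 7);
}

/* Logical Block size from an LBA Format descriptor in the Namespace
 * Identification structure: LB size = 2^LBADS bytes. */
uint32_t lbaf_block_size(uint32_t lbaf)
{
    uint32_t lbads = (lbaf >> 16) & 0xFFu;
    assert(lbads >= 9u && lbads < 32u);
    return 1u << lbads;
}
```

With an LBADS of 9 this gives the typical 512B LB, and an MPS field of 0 gives the typical 4KiB page.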

Once all identification and configuration tasks are complete, the host software can then set up one or more I/O queue pairs. In my case, I just want one I/O SQ and one I/O CQ. These are allocated in system memory, then the drive is notified of their address and size via admin commands. The I/O CQ must be created first, since the I/O SQ creation references it. Once created, the host can start to submit and complete I/O commands, using a similar process as for admin commands.
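The two admin commands involved can be sketched as follows. This is an illustrative C fragment based on my reading of the NVMe specification (Create I/O Completion Queue is opcode 0x05, Create I/O Submission Queue is opcode 0x01, and the queue size field is zero-based), with hypothetical helper names. It only fills in the 64-byte command. Submitting it goes through the Admin SQ as described earlier.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t cdw[16]; } sqe_t;  /* 64-byte command */

#define OPC_CREATE_IO_SQ 0x01u
#define OPC_CREATE_IO_CQ 0x05u

/* Fields shared by both commands: opcode, command ID, queue base address
 * in PRP1 (CDW6-7), and queue size / queue ID in CDW10. */
void fill_create_queue(sqe_t *cmd, uint8_t opcode, uint16_t cid,
                       uint64_t base, uint16_t qid, uint16_t entries)
{
    assert(cmd && entries >= 2u);
    assert((base & 0xFFFu) == 0);  /* assumes 4KiB page alignment */
    memset(cmd, 0, sizeof(*cmd));
    cmd->cdw[0] = ((uint32_t)cid << 16) | opcode;
    cmd->cdw[6] = (uint32_t)(base & 0xFFFFFFFFu);
    cmd->cdw[7] = (uint32_t)(base >> 32);
    cmd->cdw[10] = ((uint32_t)(entries - 1u) << 16) | qid;
}

/* Create I/O CQ: physically contiguous (PC = 1), interrupts disabled. */
void make_create_io_cq(sqe_t *cmd, uint16_t cid, uint64_t base,
                       uint16_t qid, uint16_t entries)
{
    fill_create_queue(cmd, OPC_CREATE_IO_CQ, cid, base, qid, entries);
    cmd->cdw[11] = 1u;
}

/* Create I/O SQ: names the CQ it posts to, so that CQ must already exist. */
void make_create_io_sq(sqe_t *cmd, uint16_t cid, uint64_t base,
                       uint16_t qid, uint16_t entries, uint16_t cqid)
{
    fill_create_queue(cmd, OPC_CREATE_IO_SQ, cid, base, qid, entries);
    cmd->cdw[11] = ((uint32_t)cqid << 16) | 1u;
}
```

The ordering constraint shows up in the last function: the SQ command carries the ID of its CQ in CDW11, which is why the CQ has to be created first.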

I/O commands perform general purpose writes (from system memory to non-volatile memory) or reads (from non-volatile memory to system memory) over the bridge's AXI Master Interface. If the data to be transferred spans more than two memory pages (typically 4KiB each), then a Physical Region Page (PRP) List is created along with the command. For example, a write of 24 512B LBs starting in the middle of a 4KiB page might reference the data like this:

A PRP List is required for data transfers spanning more than two memory pages.

The first PRP Address in the I/O command can have any 32-bit-aligned offset within a page, but subsequent addresses must be page-aligned. The drive knows whether to expect a PRP Address or PRP List Pointer in the second PRP field of the I/O command based on the amount of data being transferred. It will also only pull as much data as is needed from the last page on the list to reach the final LB count. There is no requirement that the pages in the PRP list be contiguous, so it can also be used as a scatter-gather list with 4KiB granularity. The PRP List for a particular command must be kept in memory until it's completed, so some kind of PRP Heap is necessary if multiple commands can be in flight.
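The PRP rules can be sketched in C for the simple case of a physically contiguous buffer. The names (`build_prp`, `prp_pair_t`) are hypothetical, the page size is assumed to be 4KiB, and the sketch assumes the whole PRP List fits in a single page that does not need chaining. It also assumes a bare metal setup where a pointer value is the address the drive sees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NVME_PAGE_SIZE 4096u  /* assumed system memory page size */

typedef struct {
    uint64_t prp1;  /* first PRP Address (any 32-bit-aligned offset) */
    uint64_t prp2;  /* second PRP Address, PRP List Pointer, or 0 */
} prp_pair_t;

/* Describe len bytes starting at addr. Writes page addresses into list
 * when more than two pages are touched. Returns the number of list
 * entries used (0 if no list was needed) or -1 if list is too small. */
int build_prp(uint64_t addr, uint32_t len, uint64_t *list,
              size_t list_cap, prp_pair_t *out)
{
    uint64_t first_len, remaining, next_page, pages, i;

    assert(out && len > 0 && (addr & 3u) == 0);
    out->prp1 = addr;
    out->prp2 = 0;

    /* Bytes available between addr and the end of its page. */
    first_len = NVME_PAGE_SIZE - (addr % NVME_PAGE_SIZE);
    if (len <= first_len) {
        return 0;                      /* fits in the first page */
    }

    remaining = len - first_len;
    next_page = addr - (addr % NVME_PAGE_SIZE) + NVME_PAGE_SIZE;
    pages = (remaining + NVME_PAGE_SIZE - 1u) / NVME_PAGE_SIZE;
    if (pages == 1) {
        out->prp2 = next_page;         /* two pages: no list needed */
        return 0;
    }

    if (list == NULL || pages > list_cap) {
        return -1;
    }
    for (i = 0; i < pages; i++) {
        list[i] = next_page + i * NVME_PAGE_SIZE;  /* page-aligned */
    }
    out->prp2 = (uint64_t)(uintptr_t)list;         /* PRP List Pointer */
    return (int)pages;
}
```

For the example in the text, 24 LBs of 512B (12288 bytes) starting 2KiB into a page touch four pages: the first is referenced by PRP1 and the other three go into the PRP List. For a scatter-gather transfer the list entries would come from the application instead of being computed from one base address.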

Some (most?) drives also have a Volatile Write Cache (VWC) that buffers write data. In this case, an I/O write completion may not indicate that the data has been written to non-volatile memory. An I/O flush command forces this data to be written to non-volatile memory before a completion entry is written to the I/O CQ for that flush command.

That's about it for things that are described explicitly in the specification. Everything past this point is implementation detail that is much more application-specific.

A key question the host software NVMe driver needs to answer is whether or not to wait for a particular completion before issuing another command. For admin commands that run once during initialization and are often dependent on data from previous commands, it's fine to always wait. For I/O commands, though, it really depends. I'll be using write commands as an example, since that's my primary data direction, but there's a symmetric case for reads.

If the host software issues a write command referencing a range of data in system memory and then immediately changes the data, without waiting for the write command to be completed, then the write may be corrupted. To prevent this, the software could:

Wait for completion before allowing the original data to be modified. (Maybe there are other tasks that can be done in parallel.)
Copy the data to an intermediate buffer and issue the write command referencing that buffer instead. The original data can then be modified without waiting for completion.

Both could have significant speed penalties. The copy option is pretty much out of the question for me. But usually I can satisfy the first constraint: If the data is from a stream that's being buffered in memory, the host software can issue NVMe write commands that consume one end of the stream while the data source is feeding in new data at the other end. With appropriate flow control, these write commands don't have to wait for completion.

My "solution" is just to push the decision up one layer: the driver never blocks on I/O commands, but it can inform the application of the I/O queue backlog as the slip between the queues, derived from sequentially-assigned command IDs. If a particular process thinks it can get away without waiting for completions, it can put more commands in flight (up to some slip threshold).

An example showing the driver ready to submit Command ID 72, with the latest completion being Command ID 67. The doorbells always point to the next free slot in the circular buffer, so the entry there has the oldest ID.
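One way to compute such a slip value is sketched below in C. This is an assumption about how it could be done, not the driver's actual definition: slip is taken to be the number of commands submitted but not yet completed, the command IDs are 16-bit and assigned sequentially, and completions are assumed to arrive in order. The names and the threshold of 16 are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLIP_LIMIT 16u  /* hypothetical maximum number of commands in flight */

typedef struct {
    uint16_t next_cid;       /* ID the next submitted command will get */
    uint16_t last_done_cid;  /* ID of the most recent completion */
} io_tracker_t;

/* Commands submitted but not yet completed. The cast back to uint16_t
 * keeps the result correct when the IDs wrap past 65535. */
uint16_t io_slip(const io_tracker_t *t)
{
    assert(t != NULL);
    return (uint16_t)(t->next_cid - t->last_done_cid - 1u);
}

/* Non-blocking flow control check for the application layer. */
int io_can_submit(const io_tracker_t *t)
{
    return io_slip(t) < SLIP_LIMIT;
}
```

With the values from the example above (next ID 72, latest completion 67), commands 68 through 71 are in flight, so the slip is 4.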

I'm also totally fine with polling for completions, rather than waiting for interrupts. Having a general-purpose completion polling function that takes as an argument a maximum number of completions to process in one call seems like the way to go. NVMeDirect, SPDK, and depthcharge all take this approach. (All three are good open-source references for light and fast NVMe drivers.)

With this set up, I am able to run a speed test by issuing read/write commands for blocks of data as fast as possible by trying to keep the I/O slip at a constant value:

Speed test for raw NVMe write/read on a 1TB Samsung 970 Evo Plus.

For smaller block transfers, the bottleneck is on my side, either in the driver itself or by hitting a limit on the throughput of bus transactions somewhere in the system. But for larger block transfers (32KiB and above) the read and write speeds split, suggesting that the drive becomes the bottleneck. And that's totally fine with me, since it's hitting 64% (write) and 80% (read) of the maximum theoretical PCIe Gen3 x4 bandwidth.
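For reference, those percentages can be turned into rough absolute numbers. The calculation below assumes the theoretical maximum is taken as the raw Gen3 line rate after 128b/130b encoding, ignoring packet overhead. The post does not say which figure was used as the 100% mark, so treat the results as approximate.

```latex
\[
B_{\max} = 8\,\mathrm{GT/s} \times 4\ \text{lanes} \times \frac{128}{130}
           \times \frac{1\,\mathrm{B}}{8\,\mathrm{b}} \approx 3.94\,\mathrm{GB/s}
\]
\[
0.64 \times 3.94 \approx 2.52\,\mathrm{GB/s}\ \text{(write)}, \qquad
0.80 \times 3.94 \approx 3.15\,\mathrm{GB/s}\ \text{(read)}
\]
```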

Sustained write speeds begin to drop off after about 32GiB. The Samsung Evo SSDs have a feature called TurboWrite that uses some fraction of the non-volatile memory array as fast Single-Level Cell (SLC) memory to buffer writes. Unlike the VWC, this is still non-volatile memory, but it gets transferred to more compact Multi-Level Cell (MLC) memory later since it's slower to write multi-level cells. The 1TB drive that I'm using has around 42GB of TurboWrite capacity according to this review, so a drop off in sustained write speeds after 32GiB makes sense. Even the sustained write speed is 1.7GB/s, though, which is more than fast enough for my application.

A bigger issue with sustained writes might be getting rid of heat. This drive draws about 7W during max speed writing, which nearly doubles the total dissipated power of the whole system, probably making a fan necessary. Then again, at these write speeds a 0.2kg chunk of aluminum would only heat up about 25°C before the drive is full... In any case, the drive will also need a good conduction path to the rear enclosure, which will act as the heat sink.
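The aluminum estimate can be checked with a quick energy balance. The inputs that are not in the text are assumptions: a specific heat of about 900 J/(kg·K) for aluminum, the 1.7GB/s sustained write speed as the fill rate, and all 7W going into the aluminum with no heat lost to the surroundings.

```latex
\[
t_{\text{fill}} \approx \frac{1000\,\mathrm{GB}}{1.7\,\mathrm{GB/s}} \approx 590\,\mathrm{s}
\]
\[
\Delta T = \frac{P\,t_{\text{fill}}}{m\,c}
         \approx \frac{7\,\mathrm{W} \times 590\,\mathrm{s}}
                      {0.2\,\mathrm{kg} \times 900\,\mathrm{J/(kg\,K)}}
         \approx 23\,\mathrm{K}
\]
```

That lands close to the roughly 25°C quoted above.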

FatFs

I am more than content with just dumping data to the SSD directly as described above and leaving the task of organizing it to some later, non-time-critical process. But, if I can have it arranged neatly into files on the way in, all the better. I don't have much overhead to spare for the file system operations, though. Luckily, ChaN gifted the world FatFs, an ultralight FAT file system module written in C. It's both tiny and fast, since it's designed to run on small microcontrollers. An ARM Cortex-A53 running at 1.2GHz is certainly not the target hardware for it. But, I think it's still a good fit for a very fast bare metal application.

FatFs supports exFAT, but using exFAT still requires a license from Microsoft. I think I can instead operate right on the limits of what FAT32 is capable of:

A maximum of 2^32 LBs. For 512B LBs, this supports up to a 2TiB drive. This is fine for now.
A maximum cluster size (unit of file memory allocation and read/write operations) of 128 LBs. For 512B LBs, this means 64KiB clusters. This is right at the point where I hit maximum (drive-limited) write speeds, so that's a good value to use.
A maximum file size of 4GiB. This is the limit of my RAM buffer size anyway. I can break up clips into as many files as I want. One file per frame would be convenient, but not efficient.
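The arithmetic behind these limits is easy to check. The sketch below, with hypothetical names, assumes 512B LBs and 4-byte FAT entries, and also works out how much of the FAT a sequentially written file touches.

```c
#include <assert.h>
#include <stdint.h>

#define LB_SIZE          512ull  /* bytes per Logical Block (assumed) */
#define LBS_PER_CLUSTER  128ull  /* FAT32 maximum */
#define FAT32_ENTRY_SIZE 4ull    /* one 32-bit FAT entry per cluster */

/* Largest volume addressable with 2^32 LBs. */
uint64_t fat32_max_volume_bytes(void)
{
    return (1ull << 32) * LB_SIZE;
}

/* Largest cluster size. */
uint64_t fat32_cluster_bytes(void)
{
    return LBS_PER_CLUSTER * LB_SIZE;
}

/* Clusters needed for a file, rounding up to whole clusters. */
uint64_t clusters_for_file(uint64_t file_bytes)
{
    assert(file_bytes < (1ull << 32));  /* FAT32 file size limit */
    return (file_bytes + fat32_cluster_bytes() - 1u) / fat32_cluster_bytes();
}

/* Bytes of FAT entries that must be written for that file. */
uint64_t fat_bytes_for_file(uint64_t file_bytes)
{
    return clusters_for_file(file_bytes) * FAT32_ENTRY_SIZE;
}
```

For a 1GiB file with 64KiB clusters, that is 16384 clusters and 64KiB of FAT entries, or 128 LBs of file system bookkeeping on top of the data itself.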

Linking FatFs to NVMe couldn't really get much simpler: FatFs's diskio.c device interface functions already request reads and writes in units of LBs, a.k.a. sectors. There's also a sync function that matches up nicely to the NVMe flush command. The only potential issue is that FatFs can ask for byte-aligned transfers, whereas NVMe only allows 32-bit alignment. My tentative understanding is that this can only happen via calls to f_read() or f_write(), so the application can guard against it.

For file system operations, FatFs reads and writes single sectors to and from a working buffer in system memory. It assumes that the read or write is complete when the disk_read() or disk_write() function returns, so the diskio.c interface layer has to wait for completion for NVMe commands issued as part of file system operations. To enforce this, but still allow high-speed sequential file writing from a data stream, I check the address of the disk_write() system memory buffer. If it's in OCM, I wait for completion. If it's in DDR4, I allow slip. For now, I wait for completion on all disk_read() calls, although a similar mechanism could work for high-speed stream reading. And of course, disk_ioctl() calls for CTRL_SYNC issue an NVMe flush command and wait for completion.

Interface between FatFs and NVMe through diskio.c, allowing stream writes from DDR4.
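The address check can be sketched as a small policy function in C. The names are hypothetical and the address map is an assumption: the 256KB OCM is taken to sit at 0xFFFC0000, and anything outside it is treated as DDR4, which should be checked against the actual design. The 32-bit alignment requirement from the previous section is included as an assertion.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed address map; verify against the actual design. */
#define OCM_BASE 0xFFFC0000ull
#define OCM_SIZE 0x00040000ull  /* 256KB */

typedef enum {
    WRITE_WAIT = 0,       /* block until the NVMe completion arrives */
    WRITE_ALLOW_SLIP = 1  /* return immediately, completion polled later */
} write_policy_t;

int addr_in_ocm(uint64_t addr)
{
    return addr >= OCM_BASE && addr < OCM_BASE + OCM_SIZE;
}

int addr_is_dword_aligned(uint64_t addr)
{
    return (addr & 3u) == 0;
}

/* Decide how a write from the given system memory buffer should be
 * handled. File system working buffers live in OCM and are reused as
 * soon as the call returns, so those writes must wait. Stream data in
 * DDR4 is protected by the application's flow control instead. */
write_policy_t disk_write_policy(uint64_t buf_addr)
{
    assert(addr_is_dword_aligned(buf_addr));  /* required for a PRP */
    return addr_in_ocm(buf_addr) ? WRITE_WAIT : WRITE_ALLOW_SLIP;
}
```

A disk_write() implementation along these lines would call the policy function with the buffer address and, for WRITE_WAIT, poll for the completion before returning.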

I also clear the queue prior to a read to avoid unnecessary read/write turnarounds in the middle of a streaming write. This logic obviously favors writes over reads. Eventually, I'd like to make a more symmetric and configurable diskio.c layer that allows fast stream reading and writing. It would be nice if the application could dynamically flag specific memory ranges as streamable for reads or writes. But for now this is good enough for some write speed testing:

Speed test for FatFs NVMe write on a 1TB Samsung 970 Evo Plus.

There's a very clear penalty for creating and closing files, since the process involves file system operations, including reads and flushes, that will have to wait for NVMe completions. But for writing sequentially to large (1GiB) files, it's still exceeding my 1GB/s requirement, even for total transfer sizes beyond the TurboWrite limit. So I think I'll give it a try, with the knowledge that I can fall back to raw writing if I really need to.

Utilization Summary

The good news is that the NVMe driver (not including the Queues, PRP Heap, and IDs) and FatFs together take up only about 27KB of system memory, so they should easily run in OCM with the rest of the application. At some point, I'll need to move the .text section to flash, but for now I can even fit that in OCM. The source is here, but be aware it is entirely a test implementation not at all intended to be a drop-in driver for any other application.

The bad news is that the XCZU4 is now pretty much completely full...

We're gonna need a bigger chip.

The AXI-PCIe bridge takes up 12960 LUTs, 17871 FFs, and 34 BRAMs. That's not even including the additional AXI Interconnects. The only real hope I can see for shrinking it would be to cut the AXI Slave Interface down to 32-bit, since it's only needed to access Controller Registers at BAR0. But I don't really want to go digging around in the bridge HDL if I can avoid it. I'd rather spend time optimizing my own cores, but I think no matter what I'll need more room for additional features, like decoding/HDMI preview and the subsampled 2048x1536 mode that might need double the number of Stage 1 Wavelet horizontal cores.

So, I think now is the right time to switch over to the XCZU6, along with the v0.2 Carrier. It's pricey, but it's a big step up in capability, with twice the DDR4 and more than double the logic. And it's closer to impedance-matched to the cost of the sensor...if that's a thing. With the XCZU6, I think I'll have plenty of room to grow the design. It's also just generally easier to meet timing constraints with more room to move logic around, so the compile times will hopefully be lower.

Hopefully the next update will be with the whole 3.8Gpx/s continuous image capture pipeline working together for the first time!

138 comments:

Unknown, April 2, 2020 at 3:24 AM
Is this project open source?

Ales Gorkic, June 9, 2020 at 7:54 PM
Very impressive work on this camera from your side Shane.
I have been thinking to do something similar for our own cameras at Optomotive, but then I found your blog. I also used some TE0807 ZU7EV FPGA modules for development, but now they are accumulating dust.
Can you please email me directly to talk?

Suby, August 6, 2020 at 8:25 PM
Dear Shane
Could you share the bd file of the WAVE-vivado project file?

Have a nice day.
Thank you & regards

Unknown, March 20, 2021 at 6:27 AM
Hi Shane,

Excellent work! I'm using your driver for a multiple sensors based 360 camera. My SSD is the cheaper XPG S11 pro. I did some modifications to your driver code (mainly the namespace differences of my SSD), I was able to achieve 2300MB/s write and 3500MB/s raw read. It's pretty close to manufacturer's claim.

Regarding your driver, I think you can restore the CPU DCache. But instead of flushing/invalidating the entire cache, you could just flush/invalidate a single cache line. The I/O SQ/CQ entries are less than 64bytes in total.

Right now I'm porting these to the PS side GTR-PCIe. 1.5GB/s of write speed should be good enough for most cases under Gen2 x4. BTW, do you have any experience with that. I could access the controller's BAR space but namespace probing failed with NVME time out error. My guess is the controller could not access the AXI-master port to R/W the main DDR4 memory.

Best,

Unknown, May 5, 2021 at 2:10 PM
Wow, that'd be a lot of work. In part due to managing all the status/ctrl/crc on your own for the link layer.

Are you referring to the NWLogic soft-core being very expensive? FYI, I noticed Xilinx will have an Artix Ultrascale+ with hardened Gen 4.0 x4 core within. That FPGA might be cheaper given its market aim. I also got a Zynq 7015 SoM. Timing closure is hard for a Gen2x4 on Artix-7 fabric. Xilinx did a poor job on their AXI Memory Mapped Bridge IP. I saw something like 14 layer depth of combinatory logic internal to their IP core.

I recently hacked the SLVS-EC protocol using 7 series HRIO by reducing its frequency. If signal integrity is properly tuned, I believe it's possible to overclock correctly at 1600Mbps or 2304 on a HP IO bank. Will write that up later on my blog. https://landingfield.wordpress.com/category/cmos-camera-project/

Shane Colton, May 7, 2021 at 9:01 PM
SLVS-EC on HPIO is definitely something I would be interested in. I've only ever seen it done with GTx ports, I guess since those have clock recovery built in. I'll have to catch up on your blog so I can follow along!

I've looked into NWLogic (Rambus now, I guess) and PLDA PCIe cores. They're all outside of my price range, for sure. My main motivation is to be able to use chips that don't have hard PCIe blocks, since Xilinx supply is so constrained now that you have to take what you can get. But it's also just a fun challenge.

Interestingly, the ZU+ -2 speed grade GTH can technically do Gen4 with the PHY IP. So maybe with a custom MAC and DMA I could also eventually get Gen4 speeds on future ZU+ projects.

Shane Colton, August 7, 2021 at 1:57 PM
As a follow-up, I've since tested this with exFAT (still through FatFs) and the performance is significantly better than FAT32. Maybe not all the way back to raw disk_write() speeds, but certainly getting up towards that. The main reason is that exFAT handles sequential cluster allocation with a bitmap (0 = free, 1 = allocated) rather than a 32-bit entry in the FAT chain. So, the file system overhead for sequential writing is 32x lower. It also allows cluster sizes greater than the FAT32 limit of 64KiB, which means fewer calls to disk_write() for larger files. At some point I will do some new benchmarking and post an update.

KLS, September 29, 2021 at 4:25 PM
Hey Shane, thanks for the super cool post!

I'm trying to do something similar, and I've taken your WAVE_TestSSD Vitis application and tried to run it to get the same baseline as you for reading and writing to/from an NVMe SSD. I'm running into an issue, however, in the PCIe Enumeration step; when I run XDmaPcie_EnumerateFabric in my main function, my application is erroring out, and I've traced it to a call of XDmaPcie_ReserveBarMem. It looks like the PMemMaxAddr is 0 for me, and I'm checking that it's not and it errors out. Did you ever have to set this parameter separately in order for that function call to work for you? Thanks!

Unknown, November 9, 2021 at 8:33 AM
Hi,
I am trying to do the same project but on a Zynq MYD-C7Z015. I am stuck in the NVMe initialization step, particularly in this part:
// NVME Controller Registers, via AXI BAR
// Root Port Bridge must be enabled through regRootPortStatusControl for R/W access.
u64 * regCAP = (u64 *)(0xB0000000); // Controller Capabilities
u32 * regCC = (u32 *)(0xB0000014); // Controller Configuration
u32 * regCSTS = (u32 *)(0xB000001C); // Controller Status
u32 * regAQA = (u32 *)(0xB0000024); // Admin Queue Attributes
u64 * regASQ = (u64 *)(0xB0000028); // Admin Submission Queue Base Address
u64 * regACQ = (u64 *)(0xB0000030); // Admin Completion Queue Base Address
u32 * regSQ0TDBL = (u32 *)(0xB0001000); // Admin Submission Queue Tail Doorbell
u32 * regCQ0HDBL = (u32 *)(0xB0001004); // Admin Completion Queue Head Doorbell
u32 * regSQ1TDBL = (u32 *)(0xB0001008); // I/O Submission Queue Tail Doorbell
u32 * regCQ1HDBL = (u32 *)(0xB000100C); // I/O Completion Queue Head Doorbell
I don't know where these addresses come from, or where I should map them when using my Zynq device.

Unknown, November 10, 2021 at 8:25 AM
Also please Kindly explain to me how to determine the base address 0x10000000 using for the following pointers in your code:

// Submission and Completion Queues
// Must be page-aligned at least large enough to fit the queue sizes defined above.
sqe_prp_type * asq = (sqe_prp_type *)(0x10000000); // Admin Submission Queue
cqe_type * acq = (cqe_type *)(0x10001000); // Admin Completion Queue
sqe_prp_type * iosq = (sqe_prp_type *)(0x10002000); // I/O Submission Queue
cqe_type * iocq = (cqe_type *)(0x10003000); // I/O Completion Queue

// Identify Structures
idController_type * idController = (idController_type *)(0x10004000);
idNamespace_type * idNamespace = (idNamespace_type *)(0x10005000);
logSMARTHealth_type * logSMARTHealth = (logSMARTHealth_type *)(0x10006000);

// Dataset Management Ranges (256 * 16B = 4096B)
dsmRange_type * dsmRange = (dsmRange_type *)(0x10007000);

Tom G, March 27, 2022 at 12:15 PM
Hi Shane! This is some great work... an amazing read.
I am curious though - why 16 commands for the IO slip? Was this a number you found convenient or a compromise you found during testing? I would imagine submitting more commands shifts any bottleneck towards the PCIe core, but then results in a larger backlog (and larger queue) to keep track of.

Zak Bouhanna, August 15, 2022 at 7:35 AM
Hi Shane,
Many thanks for the amazing tutorial, it's a bright light one can see at the end of the long and tedious FPGA design tunnel (at least to me).
I'm currently trying to build a system comprising of the zynqmp (zcu102) + ad9361 to capture RF data and unload directly to the PS-PCIe NVMe drive.
The issue I'm facing is the speed. I target 200MB/s sustained speed within 15min of continuous wr operations. The maximum I could achieve is 75MB/s at the end of 15min. My current configuration utilises:
* meta-adi-core/ & meta-adi-xilinx/ user layers
* PS-PCIe x2 using BAR0 = 64MB
* everything else is set to default

Is there a way I can push the PS-PCIe to 200MB/s?
Many thanks in advance

dahunt, September 30, 2022 at 12:25 PM
Hi shane, I did have an issue of writing zeros to the file rather than the data. I just found out that I need to set the cluster size of EXFAT to 128k for it to actually insert the data correctly into the file. This is on a Samsung 980 pro 1 TB. I believe your example hasn't set to 1 MB. I just thought I'd throw this out there if anyone else experiences the same problem.

keeran, March 6, 2023 at 7:04 PM
Hi Shane, thank you for this. I was able to reproduce it with outstanding results. Now I'm looking to do the same using the PCIe with the processing system. I can get it running in PetaLinux but would like to do it baremetal. Do you happen to know if nvme baremetal drivers are available for PS?

Simon Burkhardt, November 17, 2023 at 8:37 AM
What a brilliant idea and execution thereof! It is exactly what I am looking for on my VCU118 using a 32 bit microblaze. The entire interface communication using SQ and CQ seems to be a popular programming concept (looking at the ERNIC IP / RoCE). I am in the process of porting your SSD_test code to my bare metal microblaze and I am getting a NVME_ERROR_ACQ_TIMEOUT during nvmeIdentifyController() because I assume the admin command is never executed. I believe this is because the driver only writes the SQE to DDR memory but does not trigger the PCIe IP to fetch the new SQE. (Maybe my BAR0 addressing is still wrong) but I am missing a mechanism to ring the SQ doorbell. Do you have a pointer why I am getting the timeout error or how to manually ring the SQ doorbell?

Anonymous, December 26, 2023 at 8:23 PM
Hi,

First of all, I would like to say this post is extremely helpful and appreciate the level of detail you explained PCIe in. I am using your project as a base and building off of it for my own implementation. I am very new to PCIe so I apologize if this is a dumb question, but how would I modify the project to become a one-lane implementation. I understand what has to be done on the PL side, but I am having a hard time digging through the vitis code for where the number of lanes is defined. Any help with this would be amazing. Thank you!

Anonymous, December 27, 2023 at 8:28 PM
So I modified the PHY_OK parameter and that didn't work, so now I am digging through the functions again. I believe my issue is before this step, because I am failing to establish the PCIe link. It fails when checking if the Link is up. Code snip:

/* check if the link is up or not */
for (Retries = 0; Retries < XDMAPCIE_LINK_WAIT_MAX_RETRIES; Retries++) {
    if (XDmaPcie_IsLinkUp(XdmaPciePtr)){
        Status = TRUE;
    }
    usleep(XDMAPCIE_LINK_WAIT_USLEEP_MIN);
}
if (Status != TRUE ) {
    xil_printf("Warning: PCIe link is not up.\r\n");
    return XST_FAILURE;
}

I assume that the core is not being able to establish a connection. Any debugging ideas? Maybe I made a mistake in the PL?

Anonymous, April 1, 2024 at 12:44 PM
Hello Shane Colton, your article has been extremely helpful to me, but I have a few questions.
We have recently been using Xilinx Zynq Ultrascale+ ZU4EV FPGA to implement data transfer functionality using the PCIe interface with peripheral SSD. We are using Vivado's block design, utilizing the DMA/Bridge Subsystem for PCI Express v4.1 IP in root complex mode for implementation. However, during our experiments, when trying to check if the PCIe is linked up by reading, the return value is always 0, indicating failure. We are unsure where the issue lies. We have referenced your block diagram, and I believe ours does not have significant issues. Could the problem be in the parameter settings of the XDMA IP? For example, could there be errors in the settings of PCIe to AXI translation, AXI to PCIe translation, or the BAR0 value in the address editor? I would like to ask what scenarios typically result in reading a value of 0 when reading the register offset 0x144? If possible, could you please provide your .xsa file or Vivado project related to this part for our reference? Thank you!
Igor, April 10, 2024 at 8:09 PM
Hello Shane, I found your blog and decided to try it. My hardware is a ZCU106 + FPGADRIVE.
diskWriteTest() seems to work: I can see (on the ILA) that the proper data is read from DDR. However, fsWriteTest() is not working for me. I am getting FR_NO_FILESYSTEM from mount_volume(). I've tried both exFAT and FAT32, with the same result.
check_fs is finding the BS_55AA word, but at BS_JmpBoot I see zeros instead of
"\xEB\x76\x90" "EXFAT "
Any suggestions?
BTW, your blog is a pleasure to read!
Anonymous, July 8, 2024 at 12:25 AM
Hi Shane,
The project and your heart are so wonderful.
I have a question. Will this code work on the Picozed 7015 SOM?
I already have a block design and Vitis code from GitHub that use AXI PCIe for enumeration, whereas you are using DMA for enumeration.
So, what changes should I make to integrate your read and write code into my current PCIe enumeration code?
Thanks in advance,
NITHIN
Anonymous, July 10, 2024 at 2:34 AM
Hi Shane,
The following are my macros:

    // AXI/PCIE Bridge and Device Registers
    u32 * regPhyStatusControl = (u32 *)(0x400000144);
    #define PHY_OK_GEN1_X1 0x00000800

    int nvmeInitBridge(void)
    {
        u32 phyStatusValue = *regPhyStatusControl;
        printf("PHY Status/Control Register Value: 0x%08X\n", phyStatusValue);
        if(*regPhyStatusControl != PHY_OK_GEN1_X1) {
            printf("Expected PHY Status/Control Register Value: 0x%08X\n", PHY_OK_GEN1_X1);
            return NVME_ERROR_PHY;
        }
        ...

The value printed is:

    PHY Status/Control Register Value: 0x2D2542EA
    Expected PHY Status/Control Register Value: 0x00000800

It's not returning the expected value.
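
When a raw register value does not match a single expected constant, it can help to split it into fields rather than compare the whole word. The sketch below assumes the field positions commonly documented for the bridge's PHY Status/Control register at offset 0x144 (link rate in bit 0, link width in bits 2:1, LTSSM state in bits 8:3, link up in bit 11). Those positions are an assumption here and should be checked against the product guide for the IP version in use; phy_status_t and decode_phy_status are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED field layout of the PHY Status/Control register (offset 0x144).
 * Verify every bit position against the product guide for your IP version. */
typedef struct {
    bool     rate_bit0;    /* bit 0: link rate select */
    unsigned lanes;        /* bits 2:1 encode x1, x2, x4, x8 */
    unsigned ltssm_state;  /* bits 8:3: LTSSM state code */
    bool     link_up;      /* bit 11: link up */
} phy_status_t;

/* Split a raw 32-bit register value into the assumed fields. */
static phy_status_t decode_phy_status(uint32_t reg)
{
    phy_status_t s;
    s.rate_bit0   = (reg & 0x1u) != 0;
    s.lanes       = 1u << ((reg >> 1) & 0x3u);
    s.ltssm_state = (reg >> 3) & 0x3Fu;
    s.link_up     = ((reg >> 11) & 0x1u) != 0;
    return s;
}
```

An exact comparison against one constant also compares the LTSSM bits and any reserved or upper bits, so it passes only if every field matches. Printing the decoded fields separately shows which one differs. A value with many unrelated upper bits set may also suggest that the read is not reaching the bridge register at all, although that cannot be determined from the value alone.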
Anonymous, July 12, 2024 at 1:14 AM
Hi Shane,
Is a PHY status check needed inside nvmeInitBridge()?
The link is already checked in PcieInitRootComplex(XAxiPcie *AxiPciePtr, u16 DeviceId):

    Status = XAxiPcie_IsLinkUp(AxiPciePtr);
    if (Status != TRUE ) {
        xil_printf("Link:\r\n - LINK NOT UP!\r\n");
        return XST_FAILURE;
    }

    xil_printf("Link:\r\n - LINK UP, Gen%d x%d lanes\r\n",
        get_pcie_link_speed(AxiPciePtr),get_pcie_link_width(AxiPciePtr));

I ask because if I comment out nvmeInit() inside main, everything in your code works perfectly.
Please reply.
Anonymous, July 16, 2024 at 3:49 AM
Hi Shane,
Which base address do we write to, and where is that base address calculated?
I'm asking because I wrote to LBA 0x0 and read it back after a full power cycle, but the data was not there. If the write reached the NVMe drive, the data should be there, right?
How can I solve this?

Anonymous, July 23, 2024 at 1:44 AM
Hi Shane,
In the function nvmeIdentifyController(), why do we assign 0x06 to sqe.OPC?
And why do we expect these values: if (idController->SQES != 0x66) { return NVME_ERROR_QUEUE_TYPE; } and if (idController->CQES != 0x44) { return NVME_ERROR_QUEUE_TYPE; }?
Are these numbers constant, or do we need to change them?
Please reply.
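
For context on those constants: in the NVMe specification, 0x06 is the admin opcode for Identify, and the SQES and CQES bytes of the Identify Controller data each pack two 4-bit fields, the required entry size in the low nibble and the maximum in the high nibble, both as a power of two in bytes. A small sketch of that decoding follows; the helper names are hypothetical and not taken from nvme.c.

```c
#include <stdint.h>

/* Identify opcode in the NVMe admin command set. */
#define NVME_ADMIN_OPC_IDENTIFY 0x06u

/* SQES/CQES layout: bits 3:0 = required entry size, bits 7:4 = maximum
 * entry size, each expressed as log2 of the size in bytes. */
static uint32_t entry_size_required(uint8_t es)
{
    return 1u << (es & 0x0Fu);
}

static uint32_t entry_size_maximum(uint8_t es)
{
    return 1u << (es >> 4);
}
```

Decoded this way, 0x66 means 64-byte submission queue entries and 0x44 means 16-byte completion queue entries for both the required and maximum sizes, which are the standard entry sizes for the NVM command set. The check in the driver presumably rejects a controller that reports anything else, because the queue structures are laid out for those sizes.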

Anonymous, August 1, 2024 at 8:38 PM
Hi,

Again, thank you for this resource. It is a huge help. I have been running this project on the APU (Cortex-A53) since December 2023 and it works great. I am now trying to port it to the RPU (Cortex-R5). I moved the sources over and ran the project, but the PCIe and NVMe drivers no longer work. I was wondering if you had any insight on this. The PCIe link reports that it is up; however, the LTSSM is in state 0x02 ("polling active").

Thank you so much!
Anonymous, August 6, 2024 at 9:01 PM
Hi again,

I still have not been able to get this to work. Did you have to do anything to account for the fact that, on the RPU, PS DDR no longer starts at 0x0 but at 0x100000 because of the ATCM and BTCM? Also, were there any PL changes that you made, such as in the address editor?

Thank you so much!
Anonymous, October 15, 2024 at 7:32 PM
Thank you for this. I have it working beautifully on a 1CG. It has saved me many, many hours.
Anonymous, December 12, 2024 at 10:14 PM
Hi Shane. This work is really interesting, and thanks for documenting it in this blog. Have you considered licensing your work? If so, could you add a license to your GitHub code base?
Anonymous, March 12, 2025 at 6:32 PM
Hello Shane, I was wondering if you had any more information about the changes you made to nvme.c / nvme.h / nvme_priv.h to make it work on the R5. I'm working on porting it over, and I think I did what is listed throughout the comments, but I can't get past reading the capabilities, so I must have something set up wrong. I am getting an NVME_ERROR_MIN_PAGE_SIZE error.
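
As background for that error, the NVMe Controller Capabilities register (CAP, offset 0x00) is 64 bits wide, and the minimum and maximum memory page sizes are encoded in its upper half: MPSMIN in bits 51:48 and MPSMAX in bits 55:52, each meaning 2^(12 + value) bytes. The sketch below shows that decoding from two 32-bit reads, which is one way to handle the register on a 32-bit core. The helper names are hypothetical, and how nvme.c actually reads CAP is not shown in this thread.

```c
#include <stdint.h>

/* Combine two 32-bit reads of CAP: offset 0x00 (low) and 0x04 (high). */
static uint64_t cap_from_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* MPSMIN is CAP bits 51:48; minimum page size is 2^(12 + MPSMIN) bytes. */
static uint32_t cap_min_page_bytes(uint64_t cap)
{
    unsigned mpsmin = (unsigned)((cap >> 48) & 0xFu);
    return 1u << (12u + mpsmin);
}

/* MPSMAX is CAP bits 55:52; maximum page size is 2^(12 + MPSMAX) bytes. */
static uint32_t cap_max_page_bytes(uint64_t cap)
{
    unsigned mpsmax = (unsigned)((cap >> 52) & 0xFu);
    return 1u << (12u + mpsmax);
}
```

Printing the raw low and high words alongside the decoded sizes makes it easier to tell a drive that genuinely reports an unexpected minimum page size from a read that is returning the wrong data, for example because of an address mapping difference on the new core.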
Anonymous, March 19, 2025 at 8:20 PM
Hi Shane,

Thank you for continuing to be a resource to me. Previously I got PCIe NVMe working on the ZCU104 in tandem with the FPGADrive module. I have now moved to a custom carrier. I have successfully gotten the PCIe link up and am able to get through the "nvmeInitBridge()" and "nvmeInitAdminQueue()" functions; however, I get stuck in the "nvmeInitController()" function when trying to enable the controller, at this line: "*regCC |= REG_CC_EN;". Do you have any suggestions on how to debug this?

Thank you!
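
One general way to narrow down a stall at controller enable is to bound the wait and look at the status bits rather than spinning indefinitely. In the NVMe specification, CAP bits 31:24 (TO) give the worst-case time for CSTS.RDY to change after CC.EN is written, in units of 500 ms, and CSTS (offset 0x1C) has RDY in bit 0 and CFS (Controller Fatal Status) in bit 1. The sketch below decodes those values; the function and type names are hypothetical and are not the ones used in nvme.c.

```c
#include <stdint.h>

#define CSTS_RDY (1u << 0)   /* controller ready */
#define CSTS_CFS (1u << 1)   /* controller fatal status */

/* CAP.TO (bits 31:24 of the low dword) is in units of 500 ms. */
static uint32_t cap_timeout_ms(uint32_t cap_lo)
{
    return ((cap_lo >> 24) & 0xFFu) * 500u;
}

typedef enum { CTRL_WAITING, CTRL_READY, CTRL_FATAL } ctrl_state_t;

/* Classify one CSTS sample; a fatal status takes priority over ready. */
static ctrl_state_t classify_csts(uint32_t csts)
{
    if (csts & CSTS_CFS) {
        return CTRL_FATAL;
    }
    if (csts & CSTS_RDY) {
        return CTRL_READY;
    }
    return CTRL_WAITING;
}
```

A polling loop built on these would sample CSTS until classify_csts() returns something other than CTRL_WAITING or until cap_timeout_ms() has elapsed. That distinguishes a controller that never becomes ready from one that reports a fatal status, and both of those from a bus access that itself never completes.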
