Blue wire PCIe REFCLK still hanging in there... |
While a lot of the suggestions were ways to speed up the data transfer in Linux, and I have no doubt those would work, I also just don't want or need to run Linux in this application. The sensor input and wavelet compression modules are entirely built in Programmable Logic (PL), with only a minimal interface to the Processing System (PS) for configuration and control. So, I'm able to keep my entire application in the 256KB On-Chip Memory (OCM), leaving the external DDR4 RAM bandwidth free for data transfer.
After compression, the data is already in the DDR4 RAM where it should be visible to whatever DMA mechanism is responsible for transferring data to an NVMe drive. As Ambivalent Engineer points out in the comments:
It should be possible to issue commands directly to the NVMe from software by creating a command and completion queue pair and writing directly to the command queue.
In other words, write a bare metal NVMe driver to interface with the AXI-PCIe bridge directly for initiating and controlling data transfers. This seems like a good fit, both to this specific application and to my general proclivity, for better or worse, to move to lower-level code when I get stuck. A good place to start is by exploring the functionality of the AXI-PCIe bridge itself.
AXI-PCIe Bridge
Part of the reason it took me a while to understand the AXI-PCIe bridge is that it has many names. The version for Zynq-7000 is called AXI Memory Mapped to PCI Express (PCIe) Gen2, and is covered in PG055. The version for Zynq Ultrascale is called AXI PCI Express (PCIe) Gen 3 Subsystem, and is covered in PG194. And the version for Zynq Ultrascale+ is called DMA for PCI Express (PCIe) Subsystem, and is nominally covered in PG195. But, when operated in bridge mode, as it will be here, it's actually still documented in PG194. I'll be focusing on this version.
Whatever the name, the block diagram looks like this:
AXI-PCIe Bridge Root Port block diagram, adapted from PG194 Figure 1. |
The AXI-Lite Slave Interface is straightforward, allowing access to the bridge control and configuration registers. For example, the PHY Status/Control Register (offset 0x144) has information on the PCIe link, such as speed and width, that can be useful for debugging. When the bridge is configured as a Root Port, as it must be to host an NVMe drive, this address space also provides access to the PCIe Configuration Space of both the Root Port itself, at offset 0x0, and the enumerated Endpoint devices, at other offsets.
PCIe Configuration Space layout, adapted from PG213 Table 2-35. |
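As a concrete example of how little is involved on the AXI-Lite side, here is a minimal sketch of a link check. The base address is a placeholder (it should come from your own address map or xparameters.h); the LTSSM field location (bits [8:3]) and the L0 value (0x10) are the ones referenced in PG194/PG213 and in the comments below.
// Minimal sketch: poll the PHY Status/Control register over the AXI-Lite slave.
// CTL0_BASE is a placeholder for wherever that interface is mapped in your design.
#include <stdint.h>
#define CTL0_BASE (0x50000000UL) // example mapping only, not universal
#define REG_PHY_STATUS_CONTROL (*(volatile uint32_t *)(CTL0_BASE + 0x144))

static int pcieLinkReady(void)
{
    uint32_t ltssm = (REG_PHY_STATUS_CONTROL >> 3) & 0x3F; // LTSSM state, bits [8:3]
    return (ltssm == 0x10);                                // 0x10 = L0, link fully up
}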
Reads from and writes to the AXI BAR0 address are done through the AXI Slave Interface. This is a full AXI interface supporting burst transactions and a wide data bus. In another class of PCIe device, it might be responsible for transferring large amounts of data to the device through the BAR0 address range. But for an NVMe drive, BAR0 just provides access to the NVMe Controller Registers, which are used to set up the drive and inform it of pending data transfers.
The AXI Master Interface is where all NVMe data transfer occurs, for both reads and writes. One way to look at it is that the drive itself contains the DMA engine, which issues memory reads and writes to the system (AXI) memory space through the bridge. The host requests that the drive perform these data transfers by submitting them to a queue, which is also contained in system memory and accessed through this interface.
Bare Metal NVMe
Fortunately, NVMe is an open standard. The specification is about 400 pages, but it's fairly easy to follow, especially with help from this tutorial. The NVMe Controller, which is implemented on the drive itself, does most of the heavy lifting. The host only has to do some initialization and then maintain the queues and lists that control data transfers. It's worth looking at a high-level diagram of what should be happening before diving in to the details of how to do it:
System-level look at NVMe data flow, with primary data streaming from a source to the drive. |
After BAR0 is set, the host has access to the NVMe drive's Controller Registers through the bridge's AXI Slave Interface. They are just like any other device/peripheral control registers, used for low-level configuration, status, and control of the drive. The register map is defined in the NVMe Specification, Section 2.
One of the first things the host has to do is allocate some memory for the Admin Submission Queue and Admin Completion Queue. A Submission Queue (SQ) is a circular buffer of commands submitted to the drive by the host. It's written by the host and read by the drive (via the bridge AXI Master Interface). A Completion Queue (CQ) is a circular buffer of notifications of completed commands from the drive. It's written by the drive (via the bridge AXI Master Interface) and read by the host.
The Admin SQ/CQ are used to submit and complete commands relating to drive identification, setup, and control. They can be located anywhere in system memory, as long as the bridge has access to them, but in the diagram above they're shown in the external DDR4. The host software notifies the drive of their address and size by setting the relevant Controller Registers (via the bridge AXI Slave Interface). After that, the host can start to submit and complete admin commands:
- The host software writes one or more commands to the Admin SQ.
- The host software notifies the drive of the new command(s) by updating the Admin SQ doorbell in the Controller Registers through the bridge AXI Slave Interface.
- The drive reads the command(s) from the Admin SQ through the bridge AXI Master Interface.
- The drive completes the command(s) and writes an entry to the Admin CQ for each, through the bridge AXI Master Interface. Optionally, an interrupt is triggered.
- The host reads the completion(s) and updates the Admin CQ doorbell in the Controller Registers, through the AXI Slave Interface, to tell the drive where to place the next completion.
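Steps 1 and 2 boil down to very little code. A sketch, building on the nvme_sqe_t layout above; the queue depth, function name, and doorbell address (borrowed from the example BAR0 mapping that appears later in the comments) are placeholders, and cache maintenance is omitted for brevity:
// Sketch of steps 1-2: write a command into the Admin SQ, then ring its doorbell.
#define ASQ_DEPTH 16

static nvme_sqe_t asq[ASQ_DEPTH] __attribute__((aligned(4096)));
static volatile uint32_t * const regSQ0TDBL = (uint32_t *)0xB0001000;
static uint16_t asqTail = 0;

static void nvmeSubmitAdminSketch(const nvme_sqe_t *cmd)
{
    asq[asqTail] = *cmd;                  // 1. write the SQE into the circular buffer
    asqTail = (asqTail + 1) % ASQ_DEPTH;  //    advance the tail
    *regSQ0TDBL = asqTail;                // 2. doorbell = new tail index
}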
In some cases, an admin command may request identification or capability data from the drive. If the data is too large to fit in the Admin CQ entry, the command will also specify an address to which to write the requested data. For example, during initialization, the host software requests the Controller Identification and Namespace Identification structures, described in the NVMe Specification, Section 5.15.2. These contain information about the capabilities, size, and low-level format (below the level of file systems or even partitions) of the drive. The space for these IDs must also be allocated in system memory before they're requested.
Within the IDs is information that indicates the Logical Block (LB) size, which is the minimum addressable memory unit in the non-volatile memory. 512B is typical, although some drives can also be formatted for 4KiB LBs. Many other variables are given in units of LBs, so it's important for the host to grab this value. There's also a maximum and minimum page size, defined in the Controller Registers themselves, which applies to system memory. It's up to the host software to configure the actual system memory page size in the Controller Registers, but it has to be between these two values. 4KiB is both the absolute minimum and the typical value. It's still possible to address system memory in smaller increments (down to 32-bit alignment); this value just affects how much can be read/written per page entry in an I/O command or PRP List (see below).
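As a sketch of that configuration step, assuming the standard NVMe register layout (CAP.MPSMIN in bits [51:48], CC.MPS in bits [10:7], page size = 2^(12 + MPS)) and the 0xB0000000 BAR0 example mapping that appears later in the comments:
// Sketch: read the minimum page size from CAP and program it into CC.MPS.
static volatile uint64_t * const regCAP = (uint64_t *)0xB0000000;
static volatile uint32_t * const regCC  = (uint32_t *)0xB0000014;

static uint32_t nvmeSetMinPageSize(void)
{
    uint32_t mpsmin = (uint32_t)((*regCAP >> 48) & 0xF);
    *regCC = (*regCC & ~(0xFu << 7)) | (mpsmin << 7);   // CC.MPS field
    return 1u << (12 + mpsmin);                         // typically 4096
}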
Once all identification and configuration tasks are complete, the host software can then set up one or more I/O queue pairs. In my case, I just want one I/O SQ and one I/O CQ. These are allocated in system memory, then the drive is notified of their address and size via admin commands. The I/O CQ must be created first, since the I/O SQ creation references it. Once created, the host can start to submit and complete I/O commands, using a similar process as for admin commands.
I/O commands perform general purpose writes (from system memory to non-volatile memory) or reads (from non-volatile memory to system memory) over the bridge's AXI Master Interface. If the data to be transferred spans more than two memory pages (typically 4KiB each), then a Physical Region Page (PRP) List is created along with the command. For example, a write of 24 512B LBs starting in the middle of a 4KiB page might reference the data like this:
A PRP List is required for data transfers spanning more than two memory pages. |
The first PRP Address in the I/O command can have any 32-bit-aligned offset within a page, but subsequent addresses must be page-aligned. The drive knows whether to expect a PRP Address or PRP List Pointer in the second PRP field of the I/O command based on the amount of data being transferred. It will also only pull as much data as is needed from the last page on the list to reach the final LB count. There is no requirement that the pages in the PRP list be contiguous, so it can also be used as a scatter-gather with 4KiB granularity. The PRP List for a particular command must be kept in memory until it's completed, so some kind of PRP Heap is necessary if multiple commands can be in flight.
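A sketch of the PRP bookkeeping for one command, building on the nvme_sqe_t sketch above and assuming 4KiB pages and a physically contiguous source buffer; prpHeap is a stand-in for the PRP Heap just mentioned, sized here for transfers up to 2MiB:
// Sketch: fill PRP1/PRP2 (and a PRP List if needed) for a single I/O command.
#define NVME_PAGE 4096ULL

static uint64_t prpHeap[512] __attribute__((aligned(4096)));

static void nvmeFillPRPsSketch(nvme_sqe_t *cmd, uint64_t addr, uint64_t numBytes)
{
    uint64_t firstPageBytes = NVME_PAGE - (addr % NVME_PAGE);
    uint64_t nextPage = addr - (addr % NVME_PAGE) + NVME_PAGE;

    cmd->prp1 = addr;                                   // any 32-bit-aligned offset
    if (numBytes <= firstPageBytes) {
        cmd->prp2 = 0;                                  // fits in one page
    } else if (numBytes <= firstPageBytes + NVME_PAGE) {
        cmd->prp2 = nextPage;                           // second (and last) page
    } else {
        uint64_t remaining = numBytes - firstPageBytes; // pages 2..N go in the list
        for (int i = 0; remaining > 0; i++) {
            prpHeap[i] = nextPage;                      // page-aligned entries
            nextPage += NVME_PAGE;
            remaining = (remaining > NVME_PAGE) ? (remaining - NVME_PAGE) : 0;
        }
        cmd->prp2 = (uint64_t)(uintptr_t)prpHeap;       // PRP List pointer
    }
}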
Some (most?) drives also have a Volatile Write Cache (VWC) that buffers write data. In this case, an I/O write completion may not indicate that the data has been written to non-volatile memory. An I/O flush command forces this data to be written to non-volatile memory before a completion entry is written to the I/O CQ for that flush command.
That's about it for things that are described explicitly in the specification. Everything past this point is implementation detail that is much more application-specific.
A key question the host software NVMe driver needs to answer is whether or not to wait for a particular completion before issuing another command. For admin commands that run once during initialization and are often dependent on data from previous commands, it's fine to always wait. For I/O commands, though, it really depends. I'll be using write commands as an example, since that's my primary data direction, but there's a symmetric case for reads.
If the host software issues a write command referencing a range of data in system memory and then immediately changes the data, without waiting for the write command to be completed, then the write may be corrupted. To prevent this, the software could:
- Wait for completion before allowing the original data to be modified. (Maybe there are other tasks that can be done in parallel.)
- Copy the data to an intermediate buffer and issue the write command referencing that buffer instead. The original data can then be modified without waiting for completion.
Both could have significant speed penalties. The copy option is pretty much out of the question for me. But usually I can satisfy the first constraint: If the data is from a stream that's being buffered in memory, the host software can issue NVMe write commands that consume one end of the stream while the data source is feeding in new data at the other end. With appropriate flow control, these write commands don't have to wait for completion.
My "solution" is just to push the decision up one layer: the driver never blocks on I/O commands, but it can inform the application of the I/O queue backlog as the slip between the queues, derived from sequentially-assigned command IDs. If a particular process thinks it can get away without waiting for completions, it can put more commands in flight (up to some slip threshold).
An example showing the driver ready to submit Command ID 72, with the latest completion being Command ID 67. The doorbells always point to the next free slot in the circular buffer, so the entry there has the oldest ID. |
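The slip itself is cheap to compute when command IDs are assigned sequentially, as in the example above (72 - 67 = 5 commands in flight). A sketch, with names that are only placeholders:
// Sketch of the slip bookkeeping; unsigned 16-bit subtraction handles wraparound.
static uint16_t nextSubmitCID = 0;      // assigned when a command is submitted
static uint16_t latestCompleteCID = 0;  // updated by the completion poller

static uint16_t nvmeIOSlipSketch(void)
{
    return (uint16_t)(nextSubmitCID - latestCompleteCID);
}
// Application-side policy: keep submitting while the slip is below some
// threshold, otherwise poll for completions first.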
I'm also totally fine with polling for completions, rather than waiting for interrupts. Having a general-purpose completion polling function that takes as an argument a maximum number of completions to process in one call seems like the way to go. NVMeDirect, SPDK, and depthcharge all take this approach. (All three are good open-source references for light and fast NVMe drivers.)
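A sketch of what such a polling function can look like, building on the nvme_cqe_t sketch above and using the Phase Tag bit to detect new entries; the queue depth and doorbell address (again from the example mapping in the comments) are placeholders, and cache invalidation of the CQ memory is omitted:
// Sketch: process at most maxCompletions new I/O completions, then return.
#define IOCQ_DEPTH 16

static nvme_cqe_t iocq[IOCQ_DEPTH] __attribute__((aligned(4096)));
static volatile uint32_t * const regCQ1HDBL = (uint32_t *)0xB000100C;
static uint16_t iocqHead = 0;
static uint8_t iocqPhase = 1;           // expected phase of the next new entry

static int nvmePollIOCompletionsSketch(int maxCompletions)
{
    int n = 0;
    while (n < maxCompletions) {
        volatile nvme_cqe_t *cqe = &iocq[iocqHead];
        if (((cqe->dw3 >> 16) & 0x1) != iocqPhase) { break; } // nothing new yet
        // ... record (cqe->dw3 & 0xFFFF) as latestCompleteCID, check status ...
        iocqHead = (iocqHead + 1) % IOCQ_DEPTH;
        if (iocqHead == 0) { iocqPhase ^= 1; }                // wrapped around
        *regCQ1HDBL = iocqHead;         // hand the slot back to the drive
        n++;
    }
    return n;
}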
With this set up, I can run a speed test that issues read/write commands for blocks of data as fast as possible while trying to keep the I/O slip at a constant value:
Speed test for raw NVMe write/read on a 1TB Samsung 970 Evo Plus. |
For smaller block transfers, the bottleneck is on my side, either in the driver itself or by hitting a limit on the throughput of bus transactions somewhere in the system. But for larger block transfers (32KiB and above) the read and write speeds split, suggesting that the drive becomes the bottleneck. And that's totally fine with me, since it's hitting 64% (write) and 80% (read) of the maximum theoretical PCIe Gen3 x4 bandwidth.
Sustained write speeds begin to drop off after about 32GiB. The Samsung Evo SSDs have a feature called TurboWrite that uses some fraction of the non-volatile memory array as fast Single-Level Cell (SLC) memory to buffer writes. Unlike the VWC, this is still non-volatile memory, but it gets transferred to more compact Multi-Level Cell (MLC) memory later since it's slower to write multi-level cells. The 1TB drive that I'm using has around 42GB of TurboWrite capacity according to this review, so a drop off in sustained write speeds after 32GiB makes sense. Even the sustained write speed is 1.7GB/s, though, which is more than fast enough for my application.
A bigger issue with sustained writes might be getting rid of heat. This drive draws about 7W during max speed writing, which nearly doubles the total dissipated power of the whole system, probably making a fan necessary. Then again, at these write speeds a 0.2kg chunk of aluminum would only heat up about 25°C before the drive is full... In any case, the drive will also need a good conduction path to the rear enclosure, which will act as the heat sink.
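That estimate checks out roughly: filling 1TB at the ~1.7GB/s sustained rate takes about 590s, during which 7W deposits around 4.1kJ; with aluminum's specific heat of roughly 900J/(kg·K), that works out to a temperature rise of about 4100 / (0.2 × 900) ≈ 23°C.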
FatFs
I am more than content with just dumping data to the SSD directly as described above and leaving the task of organizing it to some later, non-time-critical process. But, if I can have it arranged neatly into files on the way in, all the better. I don't have much overhead to spare for the file system operations, though. Luckily, ChaN gifted the world FatFs, an ultralight FAT file system module written in C. It's both tiny and fast, since it's designed to run on small microcontrollers. An ARM Cortex-A53 running at 1.2GHz is certainly not the target hardware for it. But, I think it's still a good fit for a very fast bare metal application.
FatFs supports exFAT, but using exFAT still requires a license from Microsoft. I think I can instead operate right on the limits of what FAT32 is capable of:
- A maximum of 2^32 LBs. For 512B LBs, this supports up to a 2TiB drive. This is fine for now.
- A maximum cluster size (unit of file memory allocation and read/write operations) of 128 LBs. For 512B LBs, this means 64KiB clusters. This is right at the point where I hit maximum (drive-limited) write speeds, so that's a good value to use.
- A maximum file size of 4GiB. This is the limit of my RAM buffer size anyway. I can break up clips into as many files as I want. One file per frame would be convenient, but not efficient.
For file system operations, FatFs reads and writes single sectors to and from a working buffer in system memory. It assumes that the read or write is complete when the disk_read() or disk_write() function returns, so the diskio.c interface layer has to wait for completion for NVMe commands issued as part of file system operations. To enforce this, but still allow high-speed sequential file writing from a data stream, I check the address of the disk_write() system memory buffer. If it's in OCM, I wait for completion. If it's in DDR4, I allow slip. For now, I wait for completion on all disk_read() calls, although a similar mechanism could work for high-speed stream reading. And of course, disk_ioctl() calls for CTRL_SYNC issue an NVMe flush command and wait for completion.
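A sketch of that decision in the diskio.c layer, assuming a recent FatFs disk_write() signature; the OCM address range (the ZU+ 256KB OCM) and the nvmeWrite()/nvmeWaitForIOCompletion() helpers are placeholders rather than the actual driver API, and error handling is omitted:
// Sketch: block on NVMe completion only for buffers that live in OCM.
#include <stdint.h>
#include "ff.h"
#include "diskio.h"

#define OCM_BASE 0xFFFC0000UL   // assumed ZU+ OCM mapping
#define OCM_SIZE 0x00040000UL   // 256KB

DRESULT disk_write(BYTE pdrv, const BYTE *buff, LBA_t sector, UINT count)
{
    (void)pdrv;
    nvmeWrite(buff, sector, count);                 // placeholder driver call

    if ((uintptr_t)buff >= OCM_BASE &&
        (uintptr_t)buff < (OCM_BASE + OCM_SIZE)) {
        nvmeWaitForIOCompletion();                  // file system data: block here
    }                                               // DDR4 stream data: allow slip
    return RES_OK;
}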
Interface between FatFs and NVMe through diskio.c, allowing stream writes from DDR4. |
I also clear the queue prior to a read to avoid unnecessary read/write turnarounds in the middle of a streaming write. This logic obviously favors writes over reads. Eventually, I'd like to make a more symmetric and configurable diskio.c layer that allows fast stream reading and writing. It would be nice if the application could dynamically flag specific memory ranges as streamable for reads or writes. But for now this is good enough for some write speed testing:
Speed test for FatFs NVMe write on a 1TB Samsung 970 Evo Plus. |
There's a very clear penalty for creating and closing files, since the process involves file system operations, including reads and flushes, that will have to wait for NVMe completions. But for writing sequentially to large (1GiB) files, it's still exceeding my 1GB/s requirement, even for total transfer sizes beyond the TurboWrite limit. So I think I'll give it a try, with the knowledge that I can fall back to raw writing if I really need to.
Utilization Summary
The good news is that the NVMe driver (not including the Queues, PRP Heap, and IDs) and FatFs together take up only about 27KB of system memory, so they should easily run in OCM with the rest of the application. At some point, I'll need to move the .text section to flash, but for now I can even fit that in OCM. The source is here, but be aware it is entirely a test implementation not at all intended to be a drop-in driver for any other application.
The bad news is that the XCZU4 is now pretty much completely full...
We're gonna need a bigger chip. |
The AXI-PCIe bridge takes up 12960 LUTs, 17871 FFs, and 34 BRAMs. That's not even including the additional AXI Interconnects. The only real hope I can see for shrinking it would be to cut the AXI Slave Interface down to 32-bit, since it's only needed to access Controller Registers at BAR0. But I don't really want to go digging around in the bridge HDL if I can avoid it. I'd rather spend time optimizing my own cores, but I think no matter what I'll need more room for additional features, like decoding/HDMI preview and the subsampled 2048x1536 mode that might need double the number of Stage 1 Wavelet horizontal cores.
So, I think now is the right time to switch over to the XCZU6, along with the v0.2 Carrier. It's pricey, but it's a big step up in capability, with twice the DDR4 and more than double the logic. And it's closer to impedance-matched to the cost of the sensor...if that's a thing. With the XCZU6, I think I'll have plenty of room to grow the design. It's also just generally easier to meet timing constraints with more room to move logic around, so the compile times will hopefully be lower.
Hopefully the next update will be with the whole 3.8Gpx/s continuous image capture pipeline working together for the first time!
Is this project open source?
Updated link to the lightweight NVMe driver source:
https://github.com/coltonshane/SSD_Test
(I moved it to Vivado/Vitis 2021.1, added project creation scripts, and fixed a few minor issues.)
Very impressive work on this camera from your side Shane.
I have been thinking to do something similar for our own cameras at Optomotive, but then I found your blog. I also used some TE0807 ZU7EV FPGA modules for development, but now they are accumulating dust.
Can you please email me directly to talk?
Thanks! I took a look at some of your cameras - they look awesome! You can reach me directly at colton (dot) shane (at) gmail (dot) com. I might have some questions for you about SLVS-EC and about utilizing the EV's VCU, if you're able to share that info.
Dear Shane
Could you share the bd file of the WAVE-vivado project?
Have a nice day.
Thank you & regards
New 2021.1 repo with BD and project configuration scripts is here:
https://github.com/coltonshane/SSD_Test
Hi Shane,
Excellent work! I'm using your driver for a multiple sensors based 360 camera. My SSD is the cheaper XPG S11 pro. I did some modifications to your driver code (mainly the namespace differences of my SSD), I was able to achieve 2300MB/s write and 3500MB/s raw read. It's pretty close to manufacturer's claim.
Regarding your driver, I think you can restore the CPU DCache. But instead of flushing/invalidating the entire cache, you could just flush/invalidate a single cache line. The I/O SQ/CQ entries are less than 64bytes in total.
Right now I'm porting these to the PS side GTR-PCIe. 1.5GB/s of write speed should be good enough for most cases under Gen2 x4. BTW, do you have any experience with that. I could access the controller's BAR space but namespace probing failed with NVME time out error. My guess is the controller could not access the AXI-master port to R/W the main DDR4 memory.
Best,
That's excellent! Do you know how big the S11 Pro's SLC cache is? And what sort of write speeds you can get after the SLC cache is full? Wondering if I should add it to my list of drives to benchmark.
Thanks for the tip on the DCache flushing, that makes sense.
I haven't tried porting it to the PS-side GTR PCIe. It should be possible, but I'd have to do a bit of reading on the PS PCIe controller to understand the differences. Maybe the address translation/masking is different?
The SLC cache is pretty big on S11 Pro 512G. I got more than a minute (close to 90s) of sustained writes from 1.5GB/s total camera streams before being throttled down to 600MB/s. Then it continued on for a very long time. I probably need to do tests in a more systematic way.
The register mapping on PS-PCIe looks very different. Probably it's IP from a different vendor.
I got it working on PSU PCIe. I need to manually enable memory access and bus mastering bit on the PCIe configuration for rootport.
SSD R/W speed could saturate Gen2 x4 link.
Nice! I'm just now starting a new project: building a PCIe controller (down to the MAC) in PL logic. For chips that don't have the Integrated Block for PCI Express, the only other solutions cost as much as a sports car. Insanity. I bought a PCIe book for $90...we'll see how far that takes me.
Wow, that'd be a lot of work. In part due to managing all the status/ctrl/crc on your own for the link layer.
Are you referring to the NWLogic soft-core being very expensive? FYI, I noticed Xilinx will have an Artix Ultrascale+ with hardened Gen 4.0 x4 core within. That FPGA might be cheaper given its market aim. I also got a Zynq 7015 SoM. Timing closure is hard for a Gen2x4 on Artix-7 fabric. Xilinx did a poor job on their AXI Memory Mapped Bridge IP. I saw something like 14 levels of combinatorial logic internal to their IP core.
I recently hacked the SLVS-EC protocol using 7 series HRIO by reducing its frequency. If signal integrity is properly tuned, I believe it's possible to overclock correctly at 1600Mbps or 2304 on a HP IO bank. Will write that up later on my blog. https://landingfield.wordpress.com/category/cmos-camera-project/
SLVS-EC on HPIO is definitely something I would be interested in. I've only ever seen it done with GTx ports, I guess since those have clock recovery built in. I'll have to catch up on your blog so I can follow along!
I've looked into NWLogic (Rambus now, I guess) and PLDA PCIe cores. They're all outside of my price range, for sure. My main motivation is to be able to use chips that don't have hard PCIe blocks, since Xilinx supply is so constrained now that you have to take what you can get. But it's also just a fun challenge.
Interestingly, the ZU+ -2 speed grade GTH can technically do Gen4 with the PHY IP. So maybe with a custom MAC and DMA I could also eventually get Gen4 speeds on future ZU+ projects.
As a follow-up, I've since tested this with exFAT (still through FatFs) and the performance is significantly better than FAT32. Maybe not all the way back to raw disk_write() speeds, but certainly getting up towards that. The main reason is that exFAT handles sequential cluster allocation with a bitmap (0 = free, 1 = allocated) rather than a 32-bit entry in the FAT chain. So, the file system overhead for sequential writing is 32x lower. It also allows cluster sizes greater than the FAT32 limit of 64KiB, which means fewer calls to disk_write() for larger files. At some point I will do some new benchmarking and post an update.
Are you going to update your project on github? Also, can you provide your block diagram, or vivado project on github?
Finally got around to updating the project to Vivado/Vitis 2021.1 and adding some project creation scripts. New repo is here: https://github.com/coltonshane/SSD_Test.
DeleteHey Shane, thanks for the super cool post!
I'm trying to do something similar, and I've taken your WAVE_TestSSD Vitis application and tried to run it to get the same baseline as you for reading and writing to/from an NVMe SSD. I'm running into an issue, however, in the PCIe Enumeration step; when I run XDmaPcie_EnumerateFabric in my main function, my application is erroring out, and I've traced it to a call of XDmaPcie_ReserveBarMem. It looks like the PMemMaxAddr is 0 for me, and it's checking that it's not and it errors out. Did you ever have to set this parameter separately in order for that function call to work for you? Thanks!
I think there must be some configuration difference. In my project, PMemMaxAddr is also zero, but the check in the XDMA driver is disabled by an #ifdef. Anyway, I've updated the project to Vivado/Vitis 2021.1 with project creation scripts, so maybe that will resolve the issue. New repo is here:
https://github.com/coltonshane/SSD_Test.
This comment has been removed by the author.
I haven't worked with the RFSoC at all - the PCIE and GT primitives are a little different. That said, the format of the constraints looks fine to me. What error message are you getting? If it's not able to resolve the get_cells argument, I usually just keep tweaking it in the TCL console with the Synthesized design open until I get a hit, then move that back to the .xdc.
DeleteHi,
I am trying to do the same project but on a Zynq MYD-C7Z015. I am stuck in the NVMe initialization step, particularly in this part:
// NVME Controller Registers, via AXI BAR
// Root Port Bridge must be enabled through regRootPortStatusControl for R/W access.
u64 * regCAP = (u64 *)(0xB0000000); // Controller Capabilities
u32 * regCC = (u32 *)(0xB0000014); // Controller Configuration
u32 * regCSTS = (u32 *)(0xB000001C); // Controller Status
u32 * regAQA = (u32 *)(0xB0000024); // Admin Queue Attributes
u64 * regASQ = (u64 *)(0xB0000028); // Admin Submission Queue Base Address
u64 * regACQ = (u64 *)(0xB0000030); // Admin Completion Queue Base Address
u32 * regSQ0TDBL = (u32 *)(0xB0001000); // Admin Submission Queue Tail Doorbell
u32 * regCQ0HDBL = (u32 *)(0xB0001004); // Admin Completion Queue Head Doorbell
u32 * regSQ1TDBL = (u32 *)(0xB0001008); // I/O Submission Queue Tail Doorbell
u32 * regCQ1HDBL = (u32 *)(0xB000100C); // I/O Completion Queue Head Doorbell
I don't know where these addresses are, and where I should map them when using my Zynq device.
Those are the addresses for the NVMe controller registers through the AXI Slave interface of the XDMA bridge (PG194/PG195), which is mapped at 0xB0000000. These get translated to the BAR0 address by the bridge.
For the 7Z015, I think you're using the AXI Memory Mapped to PCI Express (PG055)? I think it works similarly. Wherever its AXI Slave IF is mapped in your design, that's the base address for these registers in the AXI domain.
Hope that helps!
In my address editor I can see that I have 2 slave interfaces, the first one with Base Name BAR0 at offset address 0x50000000, and the second one with Base Name CTL0. So I must replace your address base 0xB0000000 with 0x50000000, right?
Yes, that sounds correct.
The other interface (CTL0) is the AXI-Lite Slave that connects to the PCIe controller. That's the one that would have things like the PHY Status and Control register (offset 0x144) and the PCIe Configuration Space (where the BAR0 address is set). Most of that should be handled by the Xilinx driver for the bridge, though.
First I would like to thank you for your continued support. I tried using address 0x50000000. This address is also found in the xparameters.h file as XPAR_AXI_PCIE_0_AXIBAR_0. I changed the code to the following:
// NVME Controller Registers, via AXI BAR
// Root Port Bridge must be enabled through regRootPortStatusControl for R/W access.
u64 * regCAP = (u64 *)(0x50000000); // Controller Capabilities
u32 * regCC = (u32 *)(0x50000014); // Controller Configuration
u32 * regCSTS = (u32 *)(0x5000001C); // Controller Status
u32 * regAQA = (u32 *)(0x50000024); // Admin Queue Attributes
u64 * regASQ = (u64 *)(0x50000028); // Admin Submission Queue Base Address
u64 * regACQ = (u64 *)(0x50000030); // Admin Completion Queue Base Address
u32 * regSQ0TDBL = (u32 *)(0x50001000); // Admin Submission Queue Tail Doorbell
u32 * regCQ0HDBL = (u32 *)(0x50001004); // Admin Completion Queue Head Doorbell
u32 * regSQ1TDBL = (u32 *)(0x50001008); // I/O Submission Queue Tail Doorbell
u32 * regCQ1HDBL = (u32 *)(0x5000100C); // I/O Completion Queue Head Doorbell
But the code now gets stuck in the nvmeInitController function at the following lines:
*regCC &= ~REG_CC_IOCQES_Msk;
*regCC |= (0x4) << REG_CC_IOCQES_Pos;
It seems that the Zynq is unable to access the address of regCC. I am sure that I am missing something. Do I need to initialize BAR0? Do I need to do any kind of initialization to access this address?
The bridge must be enabled in order to read and write those addresses. The function nvmeInitBridge checks some registers before enabling the bridge - maybe it's failing those checks? Or maybe the control interface register pointers (regPhyStatusControl, regRootPortStatusControl, regDeviceClassCode) also need to be updated to match your address mapping?
I checked that the bridge is enabled by checking that the value is 1. I also tried the bare metal rc_enumerate_example and it enables the bridge using the following code:
/* Bridge enable */
XAxiPcie_GetRootPortStatusCtrl(AxiPciePtr, &RegVal);
RegVal |= XAXIPCIE_RPSC_BRIDGE_ENABLE_MASK;
XAxiPcie_SetRootPortStatusCtrl(AxiPciePtr, RegVal);
But even after that I cannot access address 0x50000000 using the Xil_In32 function or a pointer; the processor gets completely stuck.
Maybe if you tell me how you figured out the 0xB0000000 address I can follow the exact steps to make sure that I am not missing anything.
Have you also checked the mapping of the three CTL0 registers in nvme.c (regPhyStatusControl, regRootPortStatusControl, regDeviceClassCode)? In my design, the AXI-Lite Slave (CTL0) is mapped to 0x50000000. So if you now have the AXI-Slave (BAR0) mapped there, then CTL0 must be mapped to a different base address.
As for why the AXI-Slave BAR0 interface is mapped to 0xB0000000 in my design: There's no specific reason for it. I allowed the Address Editor in Vivado to assign it and it wound up there. In my design, 0x00000000-0x7FFFFFFF is PS DDR4 RAM (2GiB) and then beyond that are some other AXI peripherals, so I guess 0xB0000000 was the first free space.
Yes, I have checked the mapping of the three CTL0 registers, and I changed the base address to 0x60000000 and I can read the Class Codes of the SSD.
Also, I deeply debugged the initialization process and I found that the processor gets stuck inside the nvmeInitController(u32 tTimeout_ms) function starting from the following lines:
// I/O Completion Queue Entry Size
// TO-DO: This is fixed at 4 in NVMe v1.4, but should be pulled from Identify Controller CQES.
*regCC &= ~REG_CC_IOCQES_Msk;
*regCC |= (0x4) << REG_CC_IOCQES_Pos;
So basically I can access the BAR0 addresses:
u32 * regAQA = (u32 *)(0x50000024); // Admin Queue Attributes
u64 * regASQ = (u64 *)(0x50000028); // Admin Submission Queue Base Address
u64 * regACQ = (u64 *)(0x50000030); // Admin Completion Queue Base Address
but the processor gets stuck when it tries to access regCC at address u32 * regCC = (u32 *)(0x50000014); // Controller Configuration
What do you suggest?
Another note is that my PCIe NVMe link is x1, not x4, and I modified PHY_OK as follows
#define PHY_OK 0x00008b0
in order to get through the nvmeInitBridge() function.
I figured out the problem: it was in the BAR0 vector translation in the IP. By default it is set to 0x40000000; I changed it to 0x00000000 and everything worked.
Great! Glad you were able to figure it out.
Also, please kindly explain to me how you determined the base address 0x10000000 used for the following pointers in your code:
// Submission and Completion Queues
// Must be page-aligned at least large enough to fit the queue sizes defined above.
sqe_prp_type * asq = (sqe_prp_type *)(0x10000000); // Admin Submission Queue
cqe_type * acq = (cqe_type *)(0x10001000); // Admin Completion Queue
sqe_prp_type * iosq = (sqe_prp_type *)(0x10002000); // I/O Submission Queue
cqe_type * iocq = (cqe_type *)(0x10003000); // I/O Completion Queue
// Identify Structures
idController_type * idController = (idController_type *)(0x10004000);
idNamespace_type * idNamespace = (idNamespace_type *)(0x10005000);
logSMARTHealth_type * logSMARTHealth = (logSMARTHealth_type *)(0x10006000);
// Dataset Management Ranges (256 * 16B = 4096B)
dsmRange_type * dsmRange = (dsmRange_type *)(0x10007000);
These are the NVMe queues themselves and some other NVMe data structures. They can be anywhere in system memory, as long as the AXI Master interface of the bridge has access to them. I've put them in PS DDR4 RAM at 0x10000000 in order to reserve the bottom 256MiB (0x00000000-0x0FFFFFFF) for executing code. But they can go anywhere.
Their location is passed to the NVMe controller during initialization, for example during nvmeInitAdminQueue() and nvmeCreateIOQueues().
Hi Shane! This is some great work... an amazing read.
I am curious though - why 16 commands for the IO slip? Was this a number you found convenient or a compromise you found during testing? I would imagine submitting more commands shifts any bottleneck towards the PCIe core, but then results in a larger backlog (and larger queue) to keep track of.
Thanks! I think you're right that as long as the allowed slip is enough to accommodate the latency, it won't be the bottleneck. I just chose a number that clearly got me into drive-write-speed-limited operation. I don't think there's much harm in making it larger, though, up to half the maximum queue depth size. It does increase the time that the data and PRP lists have to stay resident in memory, I guess.
Hi Shane,
Many thanks for the amazing tutorial, it's a bright light one can see at the end of the long and tedious FPGA design tunnel (at least to me).
I'm currently trying to build a system consisting of the ZynqMP (ZCU102) + AD9361 to capture RF data and unload directly to the PS-PCIe NVMe drive.
The issue I'm facing is the speed. I target 200MB/s sustained speed within 15min of continuous write operations. The maximum I could achieve is 75MB/s at the end of 15min. My current configuration utilises:
* meta-adi-core/ & meta-adi-xilinx/ user layers
* PS-PCIe x2 using BAR0 = 64MB
* everything else is set to default
Is there a way I can push the PS-PCIe to 200MB/s?
Many thanks in advance
You might have to do some A/B testing to figure out what the bottleneck is. I don't think it's the bus: Gen2 x2 would still have a theoretical maximum of 1250MB/s, so even with overhead it should be more than fast enough. A commenter above mentions using the PS-GTR (in x4) for a 360 camera with pretty high data rates.
I'm not familiar with meta-adi-*, but I guess that means you are in Linux. Have you tried a "dd" speed test, to rule out the ADI layer as a source of slowdown? That was where I started on this project, in this post: https://scolton.blogspot.com/2019/07/benchmarking-nvme-through-zynq.html. There are a bunch of useful comments there on using direct transfers to speed up the Linux NVMe driver as well.
This comment has been removed by the author.
Hi Shane, I did have an issue of writing zeros to the file rather than the data. I just found out that I need to set the cluster size of exFAT to 128k for it to actually insert the data correctly into the file. This is on a Samsung 980 Pro 1TB. I believe your example has it set to 1MB. I just thought I'd throw this out there if anyone else experiences the same problem.
Were you able to troubleshoot this any further? Is it something specific about writing zeros, rather than data? What does the failure look like? I can try to replicate it to see if it's something in the driver itself.
Hi Shane, thank you for this. I was able to reproduce it with outstanding results. Now I'm looking to do the same using the PCIe with the processing system. I can get it running in PetaLinux but would like to do it baremetal. Do you happen to know if nvme baremetal drivers are available for PS?
There aren't any that I'm aware of, but there is a PS-PCIe driver with example code that gets you up to the point where the bus is enumerated and BAR0 assigned. From there, it should be possible to link this NVMe driver to that BAR0 and make it work. (There are some other minor changes required such as the check that the link is up will now have to check the PS-PCIe registers.) Someone has definitely gotten it working, though. (See comment thread starting from March 20, 2021.)
I was able to configure the proper BAR address and got it working. Thank you! Your example helped a lot. I just had to remove the NVME bridge init as the PS isn't using the PCIe subsystem.
Thanks again!
What a brilliant idea and execution thereof! It is exactly what I am looking for on my VCU118 using a 32 bit microblaze. The entire interface communication using SQ and CQ seems to be a popular programming concept (looking at the ERNIC IP / RoCE). I am in the process of porting your SSD_test code to my bare metal microblaze and I am getting a NVME_ERROR_ACQ_TIMEOUT during nvmeIdentifyController() because I assume the admin command is never executed. I believe this is because the driver only writes the SQE to DDR memory but does not trigger the PCIe IP to fetch the new SQE. (Maybe my BAR0 addressing is still wrong) but I am missing a mechanism to ring the SQ doorbell. Do you have a pointer why I am getting the timeout error or how to manually ring the SQ doorbell?
I see where my problem is. The queues are not in DDR but in the PCIe memory space, correct? Where exactly is 0x1000_0000 because I cannot find it in your AXI address space.
sqe_prp_type * asq = (sqe_prp_type *)(0x10000000); // Admin Submission Queue
The queues can technically be anywhere the PCIe IP's AXI master can access, but in this case they are in fact in DDR. So 0x1000_0000 is a physical DDR address for the PS DDR, with no extra translation or mapping.
The driver does ring the ASQ doorbell at the end of nvmeSubmitAdminCommand(). If the BAR0 mapping wasn't correct, regSQ0TDBL wouldn't be pointing to the right place. But I don't think you'd make it past nvmeInitController() in that case, since the other NVMe Controller registers would also be wrong or unreadable.
You can try setting a breakpoint just before nvmeSubmitAdminCommand() and reading regCSTS on the debugger to see if it's readable and, if so, if it reports ready (Bit 0 set). Be careful not to leave the watch open when restarting, though, because it will be a memfault until the PCIe link is up.
To clarify further about the BAR0 mapping: the addresses assigned to the NVMe Controller registers should match the address you assign to the BAR0 AXI Slave of the PCIe IP in the Vivado Address Editor. For example,
u64 * regCAP = (u64 *)(0xB0000000);
implies that AXI Slave is assigned to 0xB000_0000. It should really be pulled from xparameters.h rather than hard-coded, but I was probably being lazy.
However if you made it past nvmeInitAdminQueue() and nvmeInitController(), you probably already figured that out.
Just reading through the comment history on this post reminded me of two more things:
One person above ran into an issue where the BAR0 _translation_ was set to 0x4000_0000 by default. This would mean that if you write to 0xB000_0000 in AXI space, it would map to 0x4000_0000 in PCIe space. It should still work correctly if the PCIe driver assigns 0x4000_0000 to the device's BAR0 register, but if for some reason it doesn't, that could also lead to problems. You can check the device's PCIe Config BAR0 register after pcieInit() to see what's actually there, and if it matches the configuration of the PCIe IP in Vivado. The address translation is confusing, so I just set it to 0x0000_0000.
Second, since you're using a 32-bit processor, you'll have to break up regCAP, regASQ, and regACQ into two parts, I think. I did it like this:
u32 * regCAP_L = (u32 *)(NVME_CONTROLLER_BASE + 0x0000);
u32 * regCAP_H = (u32 *)(NVME_CONTROLLER_BASE + 0x0004);
u32 * regCC = (u32 *)(NVME_CONTROLLER_BASE + 0x0014);
u32 * regCSTS = (u32 *)(NVME_CONTROLLER_BASE + 0x001C);
u32 * regAQA = (u32 *)(NVME_CONTROLLER_BASE + 0x0024);
u32 * regASQ_L = (u32 *)(NVME_CONTROLLER_BASE + 0x0028);
u32 * regASQ_H = (u32 *)(NVME_CONTROLLER_BASE + 0x002C);
u32 * regACQ_L = (u32 *)(NVME_CONTROLLER_BASE + 0x0030);
u32 * regACQ_H = (u32 *)(NVME_CONTROLLER_BASE + 0x0034);
u32 * regSQ0TDBL = (u32 *)(NVME_CONTROLLER_BASE + 0x1000);
u32 * regCQ0HDBL = (u32 *)(NVME_CONTROLLER_BASE + 0x1004);
u32 * regSQ1TDBL = (u32 *)(NVME_CONTROLLER_BASE + 0x1008);
u32 * regCQ1HDBL = (u32 *)(NVME_CONTROLLER_BASE + 0x100C);
I guess you already figured that out too, though, or it wouldn't even compile.
Thanks for the explanation. I now understand better how the SQ mechanism is supposed to work (regSQ0TDBL). I still cannot get any Admin SQE to complete. My best guess is that I did not do a correct port from the 64-bit to the 32-bit architecture. In my opinion I don't need to split the registers into _H and _L. But perhaps somewhere else I don't have the correct setting yet (maybe when memcpy of the SQE, endianness problems or the countless pointer casts). Unfortunately, Vitis bugs prevent me from using a 64-bit MicroBlaze. Do you have a 32-bit version lying around somewhere?
I am currently also suppressing the assert error in XDmaPcie_ReserveBarMem from XDmaPcie_EnumerateFabric that other users reported. And I am on PCIe x2 but adjusted for the value in PHY_OK.
You're right about the register pointers - they shouldn't need to be split. I forgot that I did that for a completely different reason, as part of a PCIe IP optimization.
Aside from fixing some casts to clear warnings, I don't think I made any substantial changes for 32-bit compatibility when I set up a version for the Cortex-R5. I will double check.
Were you able to check regCSTS to see if the device is reporting ready status (0x00000001)?
Yes, CSTS is 1, and the other registers seem to be correct too.
// AXI PCIE bridge IP (AXI-Lite config interface BASE=0x2000_0000)
0x2000_0144 = 0x0000_1882 // x2 lanes, Gen 3, link is up (PHY Status/Control Register)
0x2010_0008 = 0x0108_02b0 // class code
0x2000_0148 = 0x0000_0001 // bridge is enabled (Root Port Status/Control Register)
// PCIE BAR0 (AXI_B BASE=0x6000_0000) (DDR4 BASE=0x8000_0000)
// PCIE controller register
0x6000_0000 = 0x0080_0030_1400_03ff (controler capability, CAP)
0x6000_0014 = 0x0046_0001 // enabled (controler configuration, CC)
0x6000_0024 = 0x000f_000f // AQA
0x6000_001c = 0x0000_0001 // (controler status, CSTS)
0x6000_0028 = 0x0000_0000_8000_0000 // in DDR offset 0x0000 (ASQ)
0x6000_0030 = 0x0000_0000_8000_1000 // in DDR offset 0x1000 (ACQ)
// create SQE in DDR memory
// ASQ in DDR4 starting at 0x8000_0000
0x8000_0000 = 0x0000_0006 // opcode 6h: identify
0x8000_0004 = 0x0000_0000
0x8000_0008 = 0x0000_0000
0x8000_000C = 0x0000_0000
0x8000_0010 = 0x0000_0000
0x8000_0014 = 0x0000_0000
0x8000_0018 = 0x8000_4000 // PRP1 = pointer to destination buffer to write results of ID opcode
0x8000_001C = 0x0000_0000
0x8000_0020 = 0x0000_0000
0x8000_0024 = 0x0000_0000
0x8000_0028 = 0x0000_0001 // CID = 1
0x8000_002C = 0x0000_0000
0x6000_1000 = 0x01 // write SQ0TDBL (Admin Submission Queue Tail Doorbell)
Hi! Did you ever get to solve this problem? I'm getting the exact same problem.
Some verifications that I have done:
- I've verified that CSTS is 1 before the admin submit command;
- Enumeration is correctly performed, class code is correct, x4 lanes, Gen 3 and link is up;
- AXI to PCIe translation in the IP is set to address 0x000000;
- BAR0 address is 0xB000_0000, and code files reflect this;
- CTL0 address is 0x5_0000_0000 and code files reflect this;
I can get to the identify controller function, but always get NVME_ERROR_ACQ_TIMEOUT.
I have even started experimenting with AXI clocks and resets in the design. I can't get it past the identify function.
Hi,
First of all, I would like to say this post is extremely helpful and I appreciate the level of detail you explained PCIe in. I am using your project as a base and building off of it for my own implementation. I am very new to PCIe so I apologize if this is a dumb question, but how would I modify the project to become a one-lane implementation? I understand what has to be done on the PL side, but I am having a hard time digging through the vitis code for where the number of lanes is defined. Any help with this would be amazing. Thank you!
I think the only necessary software change for x1 would be to the value of PHY_OK (nvme_priv.h, line 34). The bit definitions for the PHY Status/Control Register are in PG194.
The rest should be handled automatically by the AXI-PCIe Bridge IP, so that the driver doesn't need to know how many lanes there are.
Okay awesome, again I thank you so much for your help! I will look into this. Also, I was wondering if there is any specific format the SSD needs to be in to R/W to the raw disk? Do I format it to cleared?
You shouldn't need to do anything to the SSD to be able to do raw disk R/W. There is an admin command for formatting and an I/O command for TRIM that can be used to deallocate all or part of the disk, but this isn't strictly necessary. If you issue a write command to an already-allocated block, it will overwrite the old data. Depending on the drive, link speed, and write workload, this might affect performance and endurance.
DeleteSo I modified the PHY_OK parameter and that didn't work, so now I am digging through the functions again. I believe my issue is before this step, because I am failing to establish the PCIe link. It fails when checking if the Link is up. Code snip:
/* check if the link is up or not */
for (Retries = 0; Retries < XDMAPCIE_LINK_WAIT_MAX_RETRIES; Retries++) {
if (XDmaPcie_IsLinkUp(XdmaPciePtr)){
Status = TRUE;
}
usleep(XDMAPCIE_LINK_WAIT_USLEEP_MIN);
}
if (Status != TRUE ) {
xil_printf("Warning: PCIe link is not up.\r\n");
return XST_FAILURE;
}
I assume that the core is not being able to establish a connection. Any debugging ideas? Maybe I made a mistake in the PL?
Some ideas for things to try:
1. Make sure you're using Lane 0 of the PHY and the SSD. I don't think it can make an x1 link on any other lane. (Lane 3 might also work, depending on lane swapping ability?)
2. Make sure the REFCLK is present at the PHY GT and SSD. Make sure it's set to the correct channel and frequency in the PL setup.
3. Make sure the GPIO is set up correctly to drive the SSD's PERST# line.
4. Make sure the GT Tx diff pairs have correct-sized (220nF) AC coupling capacitors. (If not, Rx detect might fail.)
5. Check the LTSSM state (regPhyStatusControl[8:3]) - this might give a clue as to how far the PCIe controller is making it through initialization.
Thank you so much for these suggestions!
Delete"1. Make sure you're using Lane 0 of the PHY and the SSD. I don't think it can make an x1 link on any other lane. (Lane 3 might also work, depending on lane swapping ability?)"
- I am using the "FPGADrive Gen4" card in conjunction with the ZCU104. I have the PCIe constrained to GTH Quad 226 (the zcu104 only has 2 quads, this is the quad that is connected to the FMC_LPC). The PCIe lanes are properly constrained to the FMC pins that connect to the "FPGADrive" card. Is that what you were asking about?
"2. Make sure the REFCLK is present at the PHY GT and SSD. Make sure it's set to the correct channel and frequency in the PL setup."
- The REFCLK is properly constrained from the "FPGADrive" card. I will test if the clock is running via and ILA.
"3. Make sure the GPIO is set up correctly to drive the SSD's PERST# line."
- I am confident that this is connected properly.
"4. Make sure the GT Tx diff pairs have correct-sized (220nF) AC coupling capacitors. (If not, Rx detect might fail.)"
- I will have to get back to you on this.
"5. Check the LTSSM state (regPhyStatusControl[8:3]) - this might give a clue as to how far the PCIe controller is making it through initialization."
- Is this the same as the dedicated output of the XDMA_PCIE IP block?
Thank you so much!
Since you're using the ZCU104 and FPGADrive, that rules out a lot of potential hardware issues. The 220nF AC-coupling capacitors are on the FPGADrive board. They also have the lanes properly constrained in their .xdc, as far as I can tell. (PCIe Lane 0 is constrained to B226 CH3.) It might be worth double-checking that in the Implemented Design.
Have you successfully built and run their ZCU104 reference design? It has the bare-metal example project (xdmapcie_rc_enumerate_example.c) in it that should achieve Link Up state. If that works, then it might be best to replace the C source in that project with this driver as a next step, modifying the addresses as necessary.
For the LTSSM, I think the register value is the same as the hard-wired output, but I haven't tested that myself. You're looking for 0x10, which is L0 state. Any other state might indicate where the initialization is failing. (The states are listed in PG213.)
Okay, I think I have it working now. I went back to the "FPGADrive" example design and then used your PS code for the driver. Thank you for the help!
Also, I am having a hard time understanding the PF0_BAR0 inside of the XDMA_PCIE IP block. Do I need it? I believe that you have this option checked in your example design, however, the "FPGADrive" example design does not. I was wondering if you could explain why that is?
Thanks again!
Also, what is the difference between a PCIe BAR and an AXI BAR?
Great, glad it's working now!
I just ran create_projects.tcl from scratch in Vivado 2021.1 and PF0_BAR0 is not enabled for me. I think that does address translation in the device-to-host direction, which I don't use for anything. (SSD just sends read and write requests to the physical memory address.) There is some sparse documentation on it in PG194:
https://docs.xilinx.com/r/en-US/pg194-axi-bridge-pcie-gen3/Root-Port-BAR
The AXI BAR0 can be used to do address translation in the host-to-device direction. This would be useful if you had multiple functions or devices and wanted to assign them different addresses in PCIe address space that aren't just mapped to a contiguous block in the bridge's memory assignment. But for a single SSD, you can just set it to zero and anything you write to the assigned BAR0 address in the memory manager will go to that SSD's NVMe Controller registers.
Thank you so much for all of your help! Thank you for the explanation about BAR0 as well.
Does your driver start reading/writing at the beginning of the SSD or does it start at some offset?
DeleteI am asking this because I am attempting to read off an SSD I have written raw data to. I opened the SSD in linux as "/dev/sda" and then wrote some data to it. I know this works fine and have checked it by reading the data off on another linux machine. However, the data is not matching up when I then read it off via the fpga.
The srcLBA/destLBA in nvmeRead()/nvmeWrite() is the raw LBA, no partition offset or anything like that. It should match dev/sda, I think.
Does the nvmeRead() actually modify the memory at destByte, just with incorrect values? Or is it not modifying the memory at all?
Okay awesome, thank you! Hopefully that won't be an issue then.
I believe that the destByte is being populated, just with incorrect values.
I wrote some code to populate the DDR buffer with about a kB of data. I then use your diskWriteTest() function to write that data (still write 1GB so that all of the block size stuff is taken care of) to the SSD. I then clear DDR by setting that kB space to zeros to ensure that I am not just reading off the same data in DDR but rather from the SSD. I then use your diskReadTest() function to read the data from the SSD and the data that I wanted to write is returned back to me.
Now I am clearing the buffer in DDR for x amount of bytes and then reading 1GB off of the SSD via the diskReadTest() function. I check the data in the buffer and it is not non-zero values.
So upon further investigation, it looks like it is only writing one block at a time to DDR. I feel dumb because when I look at the code this makes sense. I misunderstood the code originally. I thought that it wrote one GB of data in DDR at a time but this is not the case. I see now that it does individual transfers to the base address of the "data" pointer.
DeleteHello Shane Colton, your article has been extremely helpful to me, but i have a few questions.
ReplyDeleteWe have recently been using Xilinx Zynq Ultrascale+ ZU4EV FPGA to implement data transfer functionality using the PCIe interface with peripheral SSD. We are using Vivados' block design, utilizing the DMA/Bridge Subsystem for PCI Express v4.1 IP in root complex mode for implementation. However, during our experiments, when trying to check if the PCIe is linked up by reading, the return value is always 0, indicating failure. We are unsure where the issue lies. We have referenced your block diagram, and I believe ours does not have significant issues. Could the problem be in the parameter settings of the XDMA IP ? For example, could there be errors in the settings of PCIe to AXI translation, AXI to PCIe translation, or the BAR0 value in the address editor ? I woul like to ask what scenarios typically result in reading a value of 0 when reading the register offset 0x144 ? If possible, could you please provide your .xsa file or Vivado project related to this part for our reference ? Thanks you!
Link Up = 0 is being reported by the PHY, so it's below the level of the AXI-PCIe bridge. (Incorrect BAR0 or address translation configuration would not cause it.) It could be a hardware/pinout issue, missing REFCLK, or asserted PERST#. What is the LTSSM State (bits 8:3 in the PHY Status/Control Register, offset 0x144)? That might give some more clues.
You can generate the project using the create_project.tcl script in https://github.com/coltonshane/SSD_Test. It requires the TE0803 board files from Trenz Electronic, or you can retarget it to your ZU4EV board.
Hello Shane Colton,
Thank you for your previous response.
Regarding hardware issues such as the reference clock and PERST#, we have confirmed that there are no problems. We used Vivado's XDMA IP LTSSM debug ports to measure the LTSSM. After observation, we have confirmed that the PCIe link speed has reached 8G, and the LTSSM state has passed through Detect, Polling, and Configuration, and finally reached the L0 state. Therefore, there should be no issues with the hardware connection.
However, the current issue is that the address generated for xdma_baseaddr_CTL0 in our .xsa is 0x9000_0000. When we attempt to read 1024 bytes from this address via "Status = PcieInitRootComplex(&XdmaPcieInstance, XDMAPCIE_DEVICE_ID);", all return values are 0, including the return value for "link up." Could you please advise on what could be causing this? Thank you.
Hello Shane Colton,
Thanks for your project.
We used your .tcl and successfully created the Vivado project, but when we tried to open the block design, it popped up an error message saying (I think) that the version of an IP such as the Zynq block did not match our Vivado version. So we want to ask which Vivado version you used. Thank you very much.
The project on GitHub is from Vivado/Vitis 2021.1.
Just want to make sure I understand correctly:
- Directly probing LTSSM State on the IP shows L0.
- PcieInitRootComplex() is failing.
- Your IP is mapped to 0x9000_0000, but if you read 1024B from that address while in a breakpoint near PcieInitRootComplex(), you get all zeros, including the Link Up and LTSSM State bits of the PHY Status register at offset 0x144.
If so, it does sound like it could be an AXI issue of some type. I think those address mappings should be conveyed to the BSP by the .xsa file, though. The only ones you would need to modify manually are the ones in nvme.c. (It has hard-coded CTL0 at 0x5_0000_0000 and BAR0 at 0xB000_0000.)
Hello Shane,
Thanks for your previous help; we are currently facing another problem.
Current status:
The PCIe link up is successful. My PCIe is ready to use, and I can also read the Vendor ID (0x144D) / Device ID (0xA808) of the SSD (Samsung 970 EVO Plus 500GB), and we are using a ZU4EV-series FPGA.
My question is:
When I call nvmeIdentifyController() the first time, nvmeCompleteAdminCommand() returns NVME_OK, meaning the ASQ write and ACQ read succeeded.
But no data has been written to this memory (idController).
When I call nvmeIdentifyController() again, nvmeCompleteAdminCommand() returns NVME_ERROR_ADMIN_COMMAND_TIMEOUT.
Is there something wrong with the AXI DMA settings in Vivado, or is it a software issue where maybe I missed some steps?
The Register & memory information:
---------------
sqe_prp_type * asq = (sqe_prp_type *)(0x10000000); // Admin Submission Queue
cqe_type * acq = (cqe_type *)(0x10001000); // Admin Completion Queue
idController_type * idController = (idController_type *)(0x10004000);
// AXI PCIE bridge IP (AXI-Lite configure interface BASE=0x5_0000_0000)
0x5_0000_0144 = 0x0000_1884 // x4 lanes, Gen 3, link is up (PHY Status/Control Register)
0x5_0010_0008 = 0x0108_0200 // class code
0x5_0000_0148 = 0x0000_0001 // bridge is enabled (Root Port Status/Control Register)
// PCIE BAR0 (AXI_B BASE=0xB000_0000) (DDR4 BASE=0x1000_0000)
// PCIE controller register
0xB000_0000 = 0x0000_0030_3C03_3FFF // controller capability (CAP)
0xB000_0014 = 0x0046_0001 // enabled (controller configuration, CC)
0xB000_0024 = 0x000F_000F // AQA
0xB000_001C = 0x0000_0001 // (controller status, CSTS)
0xB000_0028 = 0x0000_0000_1000_0000 // in DDR offset 0x0000 (ASQ)
0xB000_0030 = 0x0000_0000_1000_1000 // in DDR offset 0x1000 (ACQ)
// ASQ in DDR4 starting at 0x1000_0000
0x1000_0000 = 0x0000_0006 // opcode 6h: identify
0x1000_0004 = 0x0000_0000
0x1000_0008 = 0x0000_0000
0x1000_000C = 0x0000_0000
0x1000_0010 = 0x0000_0000
0x1000_0014 = 0x0000_0000
0x1000_0018 = 0x1000_4000 // PRP1
0x1000_001C = 0x0000_0000
0x1000_0020 = 0x0000_0000
0x1000_0024 = 0x0000_0000
0x1000_0028 = 0x0000_0001 // CID = 1
0x1000_002C = 0x0000_0000
// ACQ in DDR4 starting at 0x1000_1000
0x1000_1000 = 0x0000_0000
0x1000_1004 = 0x0000_0000
0x1000_1008 = 0x0000_0001
0x1000_100C = 0x0001_0000
I don't see anything obviously wrong. If the bridge can write completions to the ACQ, it should also be able to write to idController. Any AXI-PCIe address mapping problems would affect both.
Could you post the second ASQ entry as well? And check if the CSTS changes after the second attempt?
Hello Shane,
Thank you very much for your previous assistance. We later found that there were issues with the AXI configuration. Now we can read and write to the SSD successfully, thank you!
Hello Shane, found your blog and decided to try it. My HW is ZCU106 + FPGADRIVE.
diskWriteTest() seems to work - I can see (on ILA) that proper data is read from DDR. However, fsWriteTest() is not working for me. I am getting FR_NO_FILESYSTEM from mount_volume(). I've tried both EXFAT and FAT32 - same result.
check_fs() is finding the BS_55AA word, but at BS_JmpBoot I see zeros instead of "\xEB\x76\x90" "EXFAT ".
Any suggestions?
BTW, your blog is a pleasure to read!
I can't think of anything obvious. You might want to try disallowing slip entirely in disk_write() (diskio.c). It's configured to allow slip for source addresses above 0x1000_0000, for streamed writes. But if for some reason FatFs is using memory above that address to stage file system writes during f_mkfs(), it could cause problems. I think that would only happen if your program was placed above 0x1000_0000 by the linker script. Low probability, but easy to try.
You could also try formatting the drive externally and seeing if it will mount. That might help bisect the problem into something related to f_mkfs() (write-related) or something related to f_mount() (read-related).
Hi Shane,
The project and your heart are so wonderful.
I have a question. Will this code work for the PicoZed 7015 SOM?
I already have a block design and Vitis code from GitHub, and it's using AXI PCIe for enumeration, and here you are using DMA for enumeration.
So, what changes should I make to integrate your read and write code into my current PCIe enumeration code?
Thanks in advance
NITHIN
Also, I'm using Gen1 x1 for this example.
I think it should be portable to the Zynq 7000. I'm not using any of the DMA features of PG195, only the AXI-PCIe bridge. So it should be similar in concept to interface with PG055. If there's a PG055 example project that gets to Link Up state, that would be a good place to start. Then you can add in just this NVMe driver with the NVMe Controller register addresses updated to match your AXI memory-mapped BAR0.
The PCIe hard IP handles the width and speed. The only thing you would need to do is modify nvmeInitBridge() to check the PHY registers for PG055 for a valid link.
Hi Shane,
Thanks for your reply.
I saw you have defined the following
#define PHY_OK 0x00001884 // Gen3 x4 link is up, in state L0.
I'm using a PicoZed 7015, and it is Gen1 x1, so what number should I use instead of 0x00001884?
I searched a lot, but I didn't find anything.
Currently, I modified pcie.c in your code to match my block design, and it enumerates, but then it shows the error: "NVMe driver failed to initialize. Error Code, PCIe link must be Gen3 x4 for this example".
Can you help me on this?
If you're using PG055, try to read the value of the PHY Status/Control Register (0x144) after running the PCIe Enumeration example. The register definition is here:
https://docs.amd.com/r/en-US/pg055-axi-bridge-pcie/PHY-Status/Control-Register-Offset-0x144
The value used for PHY_OK should include Link Up and the Rate and Width you want. If you're not able to read that register, or if the Link Up bit isn't set, something else might be wrong.
Hope that helps!
Hi Shane,
The following are my macros:
// AXI/PCIE Bridge and Device Registers
u32 * regPhyStatusControl = (u32 *)(0x400000144);
#define PHY_OK_GEN1_X1 0x00000800
int nvmeInitBridge(void)
{
    u32 phyStatusValue = *regPhyStatusControl;
    printf("PHY Status/Control Register Value: 0x%08X\n", phyStatusValue);
    if (*regPhyStatusControl != PHY_OK_GEN1_X1) {
        printf("Expected PHY Status/Control Register Value: 0x%08X\n", PHY_OK_GEN1_X1);
        return NVME_ERROR_PHY;
    }
    ...
I'm getting the value printed as
PHY Status/Control Register Value: 0x2D2542EA
Expected PHY Status/Control Register Value: 0x00000800
It's not returning expected value
Can you double-check the address of the AXI-PCIe Bridge? I think it should match XPAR_AXI_PCIE_0_BASEADDR.
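A small sketch of what that fix might look like, assuming the BSP generates XPAR_AXI_PCIE_0_BASEADDR in xparameters.h (the exact macro name depends on the instance name in the block design):
#include "xparameters.h"

// PHY Status/Control Register is at offset 0x144 from the bridge's base address.
u32 *regPhyStatusControl = (u32 *)(XPAR_AXI_PCIE_0_BASEADDR + 0x144);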
Hi Shane,
Is there a need for the PHY status check inside nvmeInitBridge()?
Because it is already checked in PcieInitRootComplex(XAxiPcie *AxiPciePtr, u16 DeviceId):
Status = XAxiPcie_IsLinkUp(AxiPciePtr);
if (Status != TRUE) {
    xil_printf("Link:\r\n - LINK NOT UP!\r\n");
    return XST_FAILURE;
}
xil_printf("Link:\r\n - LINK UP, Gen%d x%d lanes\r\n",
    get_pcie_link_speed(AxiPciePtr), get_pcie_link_width(AxiPciePtr));
Because if I comment out nvmeInit() inside main(), everything works perfectly in your code.
Please reply on this.
It is for the NVMe driver to check that the link configuration is as-expected (width and speed). If you don't require that check, you can of course leave it out. It doesn't affect the function of the rest of the driver.
Hi Shane,
To which base address do we write? And where is the base address calculated?
I'm asking because I wrote to LBA 0x0 and read it back after a full power cycle, but the data is not there. If it was written to the NVMe drive, the data should still be there, right?
How to solve this?
Did you see a completion entry in the I/O completion queue?
Hi Shane,
In the function nvmeIdentifyController(), why do we assign 0x06 to sqe.OPC?
And why are we expecting: if (idController->SQES != 0x66) { return NVME_ERROR_QUEUE_TYPE; } and if (idController->CQES != 0x44) { return NVME_ERROR_QUEUE_TYPE; }?
Are these numbers constant? Do we need to change them?
Please reply.
0x06 is the Admin Command Opcode for "Identify".
The SQES and CQES bytes have the maximum [7:4] and minimum [3:0] entry sizes for the submission (SQES) and completion (CQES) queues. I have never seen these be anything other than 0x66 (SQES) and 0x44 (CQES) on any drives I've tested, and the rest of the driver assumes those queue sizes.
You can find more info in the NVMe Specification, which is thankfully not locked behind a paywall.
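(For reference, a short decode sketch of those fields as described above; each nibble is a power-of-two entry size in bytes:)
#include <stdint.h>

// SQES/CQES from Identify Controller pack the maximum [7:4] and minimum [3:0]
// entry sizes, each encoded as a power of two (bytes).
static inline uint32_t entrySizeMin(uint8_t field) { return 1u << (field & 0x0F); }
static inline uint32_t entrySizeMax(uint8_t field) { return 1u << ((field >> 4) & 0x0F); }

// Example: SQES = 0x66 -> 64 B submission entries (min and max);
//          CQES = 0x44 -> 16 B completion entries (min and max).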
Hi,
Again, thank you for this resource. It is a huge help. I have been running this project on the APU (Cortex-A53) since December 2023 and it works great. I am now trying to port this to the RPU (Cortex-R5). I moved the sources over and ran the project, but the PCIe and NVMe drivers are no longer working. I was wondering if you had any insight on this. The PCIe link reports that it is up; however, the LTSSM is in the 0x02 state ("Polling.Active").
Thank you so much!
The LTSSM should be managed by the PCIe IP without much dependence on the CPU, so I can't imagine why changing to the R5 would cause it to get stuck in Polling.Active. Is there a chance that the device is being held in reset? If not, can you try the RC Enumerate example on the R5 to see if that works?
I've run the NVMe driver on the R5 with only some minor modifications, I think mostly to clear compiler warnings about pointer casts. I think somebody else in the comments was running it on a Microblaze. So hopefully once the LTSSM issue is cleared up, it will work.
That was my reaction as well.
I don't believe so; I haven't changed my RTL or source code.
I am wondering if it is an addressing/memory assignment issue. I previously had everything for the PCIe/NVMe stuff stored in PS DDR like you do. However, when I switched to the R5 core, the PS DDR space that I was using previously is no longer available to the RPU in the linker script.
What modifications did you have to do besides pointer casts?
Thank you so much!
I am running the "XDmaPcie_EnumerateFabric" function after the "PcieInitRootComplex" function. I believe that there is an issue here.
There is one thing that stands out. Normally when I run it, it will print out:
"bus: 0, device: 0, function: 0: BAR 0 is not implemented"
"bus: 0, device: 0, function: 0: BAR 1 is not implemented"
However, currently it is printing out this:
"bus: 0, device: 0, function: 0: BAR 0 required IO space; it is unassigned"
"bus: 0, device: 0, function: 0: BAR 1 required IO space; it is unassigned"
I think this means that an IO address space has not been assigned which is weird. Any insight on this would be amazing.
Thank you so much!
I think Bus 0 is within the Root Complex itself. So I'm not sure if those messages are important for the functionality of the external device (which would usually be on Bus 1). Are you using the same bitstream for the A53 version and the R5 version?
Yea, I'm using the same bitstream for both.
I think I figured out what's happening, but I'm not sure how to resolve it.
So within the "xdmapcie.h" Xilinx header file they have the typedef of the struct "XDmaPcie_Config". The variable "BaseAddress" is of type UINTPTR. This variable stores the value of "xdma_0_baseaddr_CTL0", which in my case is 0x400000000. This is a value that needs 35 bits to be represented. However, in the case of the RPU, "BaseAddress" is 32 bits because UINTPTR changes based on the processor architecture. Thus, the base address value of 0x400000000 gets chopped off and becomes 0x0. I have verified this by printing the size and value of "ConfigPtr->BaseAddress" running first on the APU and then on the RPU.
When I am using the APU, the size is 8 (bytes) and the value is 0x400000000. When I am using the RPU, the size is 4 (bytes) and the value is 0x0.
How were you able to get around this in your project? What is your xdma_0_baseaddr_CTL0 address?
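(A standalone illustration of the truncation described above; this is not the Xilinx driver, just a sketch:)
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t  ctl0 = 0x400000000ULL;     // 35-bit xdma_0_baseaddr_CTL0
    uintptr_t base = (uintptr_t)ctl0;    // 4 bytes on the R5, 8 bytes on the A53

    printf("sizeof(base) = %u, base = 0x%llx\r\n",
           (unsigned int)sizeof(base), (unsigned long long)base);
    return 0;                            // prints base = 0x0 on a 32-bit build
}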
I replied in the comment below.
Hi again,
I have not been able to get this to work still. Did you have to do anything to remedy the fact that PS DDR no longer starts at 0x0 on the RPU, but rather at 0x100000, due to the ATCM and BTCM? Also, were there any PL changes that you made, such as in the address editor?
Thank you so much!
Are you trying to run the NVMe queues in ATCM/BTCM? I've never tried that - it might require some changes in the memory map. When running on the R5, I still have the queues in PS DDR, at addresses that don't overlap with ATCM/BTCM. In that case, it shouldn't require any changes to the memory map in the address editor.
Were you able to get xdmapcie_rc_enumerate_example.c to run successfully on the R5?
No, I'm not attempting to do that lol. What address value do you have your queues at?
Unfortunately not, I replied above about all the UINTPTR stuff. Do you have any ideas?
Thank you so much!
Ah, thank you for the reminder! I kept looking for software changes, which were only what I described above (pointer warning fixes). But I forgot to look at the Vivado project and the memory map. Here, there were some important changes to support the R5's 32-bit addressing:
/xdma/S_AXI_LITE is moved to 0xA000_0000 w/ 64M size.
/xdma/S_AXI_B is moved to 0xA800_0000 w/ 1M size.
The commit message is:
"Update AXI-PCIe bridge peripheral addresses to work with 32b addressing on the R5. Note that we must use M_AXI_HPM0_FPD to place the CTL0 aperture at 0xA0000000 in order to satisfy the 512MiB alignment requirement and have bit 28 be cleared to select the bridge/ECAM register space. See PG194 for details."
I honestly don't remember the details of the constraint regarding Bit 28, but it forces the use of low-range HPM0 instead of low-range HPM1.
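(A compile-time sanity check of the two constraints quoted in the commit message; this is only a sketch, and PG194 is the authoritative reference for the bit-28 behavior:)
#define CTL0_BASE_R5 0xA0000000u

// 512 MiB alignment requirement for the CTL0 aperture.
_Static_assert((CTL0_BASE_R5 & ((1u << 29) - 1u)) == 0,
               "CTL0 aperture must be 512 MiB aligned");

// Bit 28 must be clear to select the bridge/ECAM register space.
_Static_assert((CTL0_BASE_R5 & (1u << 28)) == 0,
               "CTL0 aperture must have bit 28 clear");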
Hi,
Thank you so much! I was wondering about PL addressing. I played with it a little bit but was getting aperture issues. I will try these addresses.
Just to verify, BAR0 is now at 0xA800_0000, and CTL0 is now at 0xA000_0000, correct?
When you say constraints, are you talking about like actual constraints in the xdc or something else?
Thank you again so much!
Yes, that's correct. The bit 28 switch is something specific to PG194/PG195: it selects between Bridge mode and DMA mode for that IP. I don't think that applies to PG055. So you may have more flexibility in where you place the two apertures. They just need to be within the 32b address space.
Awesome, thank you so much! I am now able to initialize the PCIe and NVMe drivers as well as pass the enumerate step. I am having some other data transfer issues, but I don't think they are related to the XDMA block.
I was wondering what the performance hit is, in terms of read and write speeds, from moving the project from the APU to the RPU?
Also, have you ever implemented PS PCIe via the Zynq block? I'm looking at implementing PCIe on the ZCU102; however, it does not have a PL PCIe hard core, so I am unable to use the XDMA. Any thoughts?
Thank you again so much!
In my case, I saw a slight decrease from about 3.05GB/s to 2.90GB/s for sequential 128K raw disk writes. I'm not sure where the actual bottleneck is, though. It's possible that with some optimization, there would be no difference, especially for a smaller link. The DMA transfers to/from the drive to DDR4 should be unaffected. It's just the overhead of reading and writing the queues, plus any file system or data management overhead, that might take a small hit from the R5's lower speed.
I haven't personally set up this driver on the PS PCIe, but a couple of people in the comments above have gotten it to work.
Hey,
Thank you again for this post, it continues to be very helpful.
I was wondering if there is a way to have registers in the PL be the source of data rather than a DDR4 address. Or to put it more plainly, instead of passing a pointer to a DDR4 address into the nvmeRead() function, could you have it route directly to a register in the PL somehow?
Thank you again!
Yes, this should be possible. If the PL registers are exposed in an AXI Slave, then the AXI-PCIe bridge's AXI Master just needs access to that slave in the memory map. Same goes for AXI-attached block memory. I know there are some NVMe IP cores that offload the queues to block RAM this way.
Thank you for your reply.
Would I still maintain the queues in PS DDR?
Is there a more "stream-like" data flow where I can simply send an address and data to read and write to/from the SSD? Like is there a way to simplify all of the queue stuff into just an address and data, or is this not possible?
Thank you again!
There are IP Cores where the interface to the SSD is more like a FIFO stream. This one from DG is one example: https://dgway.com/NVMe-IP_X_E.html#block. It handles the queues internally - no software required. But they're not cheap!
Thank you for your reply.
Yea, I've seen blocks like that. Is there a free or open-source solution?
I assume this is not the case, unfortunately.
None that I'm aware of.
Hello again!
I opted for a BRAM/URAM solution, which has worked out nicely and allows easy access from the PL side.
Right now I am sustaining write transactions at a rate of 737 MB/s. I was wondering if there is a way to do consecutive writes to the SSD, rather than writing one block at a time, in order to increase the speed? I am trying to achieve something closer to the 3.05 GB/s rate that you said you are getting. Also, how many PCIe lanes do you have? I only have one PCIe lane due to the constraints of my board, which is probably limiting my performance.
Thank you again so much!
For PCIe 3.0 x1, the raw data rate is only 1GB/s, so practical transfer of 737MB/s is pretty good. You would need four lanes to get to 3GB/s. The SSD also needs to support those speeds. What SSD are you using?
The second half of this post describes how you can pipeline writes to maximize bus speed. The nvmeWrite() function in this driver is non-blocking, so you can set up multiple transfers before the first is completed. But, it becomes the application's responsibility to make sure the data for outstanding writes remains valid until they are completed. In your case, this would probably mean operating the BRAM/URAM as a circular buffer and using nvmeGetSlip() to make sure the number of writes in-progress never exceeds the capacity of the buffer.
Hello, thank you for the speedy reply!
Okay, that makes sense; I figured it was a constraint due to the number of PCIe lanes. I am using the Crucial P3 Plus PCIe 4.0 NVMe M.2 SSD.
Okay cool, so I just call the nvmeWrite() function multiple times to queue consecutive writes, correct? I already have the nvmeGetSlip() logic in place such that if the number of slips is greater than 16, I call nvmeServiceIOCompletions(16). Is this correct?
That's basically it - you would want to put nvmeServiceIOCompletions(16) in a while loop (while slip > 16) since it doesn't wait for 16 completions - it just processes up to 16 if they are available.
The URAM/BRAM buffer would need to be sized so that it can fit more than 16 transactions in that case. You can adjust the number accordingly. Also, there's no way to provide back-pressure in the other direction with this implementation. In other words, if the SSD can't keep up, you'd need some other mechanism to throttle the data source, if that's even possible.
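(A sketch of the pipelined write loop described above. The function names are from this driver, but the exact nvmeWrite() signature is an assumption - something like nvmeWrite(srcByte, destLBA, numLBA) - so check nvme.h. MAX_SLIP must stay below both the I/O queue depth and the number of in-flight transfers the circular BRAM/URAM buffer can hold.)
#include <stdint.h>

#define MAX_SLIP       16u
#define LBA_PER_WRITE  256u          // e.g. 128 KiB transfers with 512 B LBAs

static void streamWrite(uint8_t *buf, uint32_t bufSlots, uint64_t startLBA, uint32_t numWrites)
{
    uint64_t lba = startLBA;
    for (uint32_t i = 0; i < numWrites; i++)
    {
        // Throttle: don't let outstanding writes exceed MAX_SLIP.
        while (nvmeGetSlip() > MAX_SLIP)
        {
            nvmeServiceIOCompletions(16);
        }
        // Circular buffer: this slot's data must stay valid until its write completes.
        uint8_t *src = buf + (uint64_t)(i % bufSlots) * LBA_PER_WRITE * 512u;
        nvmeWrite(src, lba, LBA_PER_WRITE);
        lba += LBA_PER_WRITE;
    }
    // Drain the remaining completions.
    while (nvmeGetSlip() > 0)
    {
        nvmeServiceIOCompletions(16);
    }
}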
Hi again, thank you for your advice.
Could I set the slip threshold to be as low as 2? What is an appropriate range of values to use for this threshold?
There's no lower limit: while(slip > 0) simply means "wait until all NVMe transactions have been completed", no pipelining. The upper limit is theoretically one less than the size of your NVMe I/O queue. But practically, it also depends on how many transactions you can buffer. The data pointed to in outstanding NVMe writes must remain valid until they're completed.
Thank you for this. I have it working beautifully on a 1CG. It has saved me many, many hours.
Glad to hear it!
Could someone share code for the PS (PSU) PCIe, please?