Comments on Shane Colton: Benchmarking NVMe through the Zynq Ultrascale+ PL PCIe Linux Root Port Driver

Shane Colton (2022-12-31):
I'm not sure if this is the same video, but it covers pretty much the same process: https://www.youtube.com/watch?v=CjcqOTDtPPY

EnthuMan (2022-11-15):
Hello Shane, very interesting work. Do you happen to have any other reference for the video (as it seems to be broken)?

Shane Colton (2020-02-27):
I wound up writing a bare-metal NVMe driver to handle my data-moving requirements, which got its own post here:

https://scolton.blogspot.com/2019/11/zynq-ultrascale-fatfs-with-bare-metal.html

It's pretty application-specific (no OS, heavily favors writes over reads), so I'm not sure how useful it'll be to others. There were a number of good suggestions for Linux speed-ups in other comments on this post, though, including using O_DIRECT to bypass page caching or using a faster user-space NVMe driver (SPDK is mentioned, but I also found NVMeDirect and depthcharge's NVMe driver in my research).

I couldn't test O_DIRECT in Petalinux, but the commenter kuku below mentions hitting 1.2GB/s using O_DIRECT with the regular NVMe driver, albeit in a different Linux distribution.

Lastly, since the time of writing, FPGADrive has posted new benchmarks for the stock Petalinux NVMe driver with the Ultrascale+. Their read benchmark matches up with mine (~625MB/s), but their write benchmark is still lower for some reason.

squaredn (2020-02-26):
Hello. Could you please update your progress in detail? I am trying to find a way to solve this issue.

Shane Colton (2019-11-24):
"It should be possible to issue commands directly to the NVMe from software by creating a command and completion queue pair and writing directly to the command queue."

As a follow-up, this is exactly what I wound up doing. My application runs in OCM RAM, but I allocate the queues in the external DDR, with the data. This way all PCIe traffic is addressed to external DDR and I don't have to waste valuable OCM space on big queues. With this I am able to get to 2GiB/s for 64GiB written in 1MiB blocks to a 970 Evo Plus. Will post more details once I get it fully integrated.
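For readers wondering what "writing directly to the command queue" looks like in practice, below is a minimal sketch of building one NVMe write command and ringing the submission queue doorbell. It assumes the admin-queue setup that creates the I/O queue pair has already been done; the queue pointers, doorbell address, and helper names are placeholders, not code from the post.

    #include <stdint.h>
    #include <string.h>

    /* 64-byte NVMe submission queue entry (NVM command set layout). */
    typedef struct {
        uint32_t cdw0;          /* opcode[7:0], fused[9:8], psdt[15:14], command ID[31:16] */
        uint32_t nsid;          /* namespace ID */
        uint32_t cdw2, cdw3;    /* unused for this command */
        uint64_t mptr;          /* metadata pointer (unused here) */
        uint64_t prp1;          /* physical address of the data buffer */
        uint64_t prp2;          /* second PRP entry, or PRP list for transfers > 2 pages */
        uint32_t cdw10;         /* starting LBA, low 32 bits */
        uint32_t cdw11;         /* starting LBA, high 32 bits */
        uint32_t cdw12;         /* bits [15:0] = number of logical blocks, 0-based */
        uint32_t cdw13, cdw14, cdw15;
    } nvme_sqe_t;

    #define NVME_OP_WRITE 0x01u

    /* Placeholders assumed to be set up by the rest of the driver: the I/O
     * submission queue lives in DDR, and sq_doorbell points at the SQ tail
     * doorbell in the controller's BAR0 (0x1000 + 2*qid*(4 << CAP.DSTRD)). */
    extern volatile nvme_sqe_t io_sq[];
    extern volatile uint32_t *sq_doorbell;
    extern uint16_t io_sq_tail;
    #define IO_SQ_DEPTH 64u

    /* Queue one write of (nlb + 1) logical blocks starting at slba, with the
     * data at physical address data_paddr (small enough for two PRP entries). */
    void nvme_submit_write(uint64_t data_paddr, uint64_t slba,
                           uint16_t nlb, uint16_t cid)
    {
        nvme_sqe_t cmd;
        memset(&cmd, 0, sizeof(cmd));
        cmd.cdw0  = NVME_OP_WRITE | ((uint32_t)cid << 16);
        cmd.nsid  = 1;                          /* namespace 1 */
        cmd.prp1  = data_paddr;
        cmd.cdw10 = (uint32_t)slba;
        cmd.cdw11 = (uint32_t)(slba >> 32);
        cmd.cdw12 = nlb;                        /* 0-based block count */

        io_sq[io_sq_tail] = cmd;                /* copy the entry into the queue */
        io_sq_tail = (io_sq_tail + 1u) % IO_SQ_DEPTH;
        __sync_synchronize();                   /* make the entry visible before the doorbell */
        *sq_doorbell = io_sq_tail;              /* ring the SQ tail doorbell */
    }

A real driver also has to poll the matching completion queue and track its phase bit; this only shows the submission side.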
Kuku (2019-09-19):
That's right, same here: https://github.com/ikwzm/ZynqMP-FPGA-Linux

Forget Petalinux as long as you can; you'll have a happier life! ;)

Shane Colton (2019-09-19):
Last question, hopefully: what Linux distro are you using? I see below that you were able to use dd with oflag=direct to test the speed, which I couldn't do from Petalinux. (The version of dd that gets compiled in by default doesn't support the oflag switch.) I'm sure I could figure out how to compile the full version in, but if there's a different distribution that works out of the box, so to speak, I would rather switch to that. For example, a commenter above suggested https://github.com/ikwzm/ZynqMP-FPGA-Linux.

kuku (2019-09-19):
Sorry, I thought I'd replied to this. I used the same XDMA PCIe bridge driver (this version: https://github.com/Xilinx/linux-xlnx/blob/xlnx_rebase_v4.14/drivers/pci/host/pcie-xdma-pl.c).

kuku (2019-09-17):
Hi, I'm writing a file with O_DIRECT on an ext4 fs from `/dev/zero`:

`dd if=/dev/zero of=/mnt/ssd/test_file bs=8M count=128 conv=fdatasync oflag=direct`

Ambivalent Engineer (2019-09-09):
Hey kuku, how are you measuring 1.2 GB/s? What's your filesystem setup, are you using O_DIRECT, and which memory are you using inside the Zynq?

Ambivalent Engineer (2019-09-09):
This little test loop gets 244 MB/s on a Zynq-7030:

    /* needs _GNU_SOURCE defined for O_DIRECT; headers: fcntl.h, stdint.h, stdlib.h, unistd.h */
    const int buffersize = 2*1024*1024;
    int outputfile = open("/dev/nvme0n1p1", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0777);
    uint8_t *buffer = (uint8_t *) aligned_alloc(4096, buffersize);

    for (int i = 0; i < 1000; i++) {
        ssize_t written = write(outputfile, buffer, buffersize);
    }

Would getting a buffer from udmabuf instead of aligned_alloc() improve performance? The pages would be pinned in physical memory, so perhaps the kernel doesn't have to do work to ensure they are present before sending write commands to the NVMe. I don't know if that's a lot of work -- I really don't understand what's holding up the Linux kernel at all in this test.
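One way to try the udmabuf idea: ikwzm's u-dma-buf driver exposes the pinned buffer as a character device (e.g. /dev/udmabuf0) that can be mmap()ed from user space. A rough sketch of that experiment, assuming a buffer has already been configured and that the kernel will accept the mapping as an O_DIRECT source (which is exactly the open question here), could look like this:

    #define _GNU_SOURCE                 /* O_DIRECT */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t buffersize = 2 * 1024 * 1024;

        /* Device node created by the u-dma-buf driver (name is configuration-dependent). */
        int ubuf = open("/dev/udmabuf0", O_RDWR);
        if (ubuf < 0) { perror("open udmabuf"); return 1; }

        /* Map the pinned DMA buffer into this process. */
        uint8_t *buffer = mmap(NULL, buffersize, PROT_READ | PROT_WRITE,
                               MAP_SHARED, ubuf, 0);
        if (buffer == MAP_FAILED) { perror("mmap"); return 1; }

        int out = open("/dev/nvme0n1p1", O_WRONLY | O_DIRECT);
        if (out < 0) { perror("open nvme"); return 1; }

        /* Same loop as above, but sourcing from the pinned udmabuf pages. */
        for (int i = 0; i < 1000; i++) {
            ssize_t written = write(out, buffer, buffersize);
            if (written != (ssize_t)buffersize) { perror("write"); break; }
        }

        close(out);
        munmap(buffer, buffersize);
        close(ubuf);
        return 0;
    }

If the write() fails with EFAULT, as reported below for buffers mmap()ed from /dev/mem, that answers the question in the negative for this path, and a bounce buffer (or the sendfile() route mentioned further down) is the fallback.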
Shane Colton (2019-09-06):
Great! What driver are you using for the NVMe? 1.2GB/s would be perfect for my application.

kuku (2019-09-06):
Thanks, I did it! The core registers showed Gen3 capability, so I assumed that for some reason the negotiation wasn't being done... I forced the PERSTN of the SSD manually and that was it! Gen3 and 1.2 GB/s on a Samsung 960 EVO :D

Anonymous (2019-09-05):
Would you try a Debian distribution on the ZynqMP? https://github.com/ikwzm/ZynqMP-FPGA-Linux We are using it and it works really great. 100% recommended. (Goodbye to the ugly Petalinux.)

Shane Colton (2019-09-04):
LnkSta reports: 8GT/s (ok), Width x4 (ok). I also checked that directly in the PCIe PHY register, offset 0x144 from the bridge's base address. If that also only shows Gen1 despite all the hardware being configured for Gen3, it might make sense to try the IBERT loopback test.
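For anyone who wants to repeat that register check from Linux user space rather than from a debugger, a small /dev/mem peek is enough. The base address below is only a placeholder; substitute the AXI address assigned to the PCIe bridge's register space in your Vivado address editor, and decode the value against the bridge's product guide.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BRIDGE_BASE    0xA0000000UL   /* placeholder: your bridge register base */
    #define PHY_STATUS_OFF 0x144u         /* PCIe PHY status register, per the comment above */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDONLY | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        /* Map one page of the bridge's register space. */
        volatile uint32_t *regs = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED,
                                       fd, BRIDGE_BASE);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        printf("PHY status (offset 0x%x) = 0x%08x\n",
               PHY_STATUS_OFF, (unsigned int)regs[PHY_STATUS_OFF / 4]);

        munmap((void *)regs, 0x1000);
        close(fd);
        return 0;
    }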
Shane Colton (2019-09-04):
That looks promising! I like that it's just a C program running in user space. I'm not seeing any good references to it running in a Petalinux environment, but the fact that it's simple C gives me some hope that it could be ported.

kuku (2019-09-04):
Hi,
Have you checked the actual negotiated link shown in "LnkSta"?
I'm doing a similar thing with PCIe in the PL and can't get a Gen3 link (it's 2.5GT/s, Gen1, even though the device is capable of establishing a Gen3 link). In that condition the SSD write speed is about 600MB/s. It should be quite a bit higher with a Gen3 link.
Mates (2019-09-04):
Hi guys,
What do you think about the NVMe driver from SPDK (https://spdk.io/doc/nvme.html)? Maybe it would help.

Shane Colton (2019-09-03):
Unfortunately, it's looking like the Petalinux version of dd does not support the oflag switch. (It's a stripped-down version.) I will see if it's possible to compile in the full version, or maybe hdparm, as well as Python for testing direct file opening at some point in the future. Thanks again for the tips!

Shane Colton (2019-08-31):
Oh, that's neat! Reminds me a lot of resource binding for GPUs. I think once I get to the UI stage, I'll need that to be able to put menu overlays on top of the frame before it gets sent to a display. Thanks for the tip.

Unknown (2019-08-31):
Check out the udmabuf driver (https://github.com/ikwzm/udmabuf) for doing that. Using it, you are able to mmap and sequentially read a DDR memory buffer.

Shane Colton (2019-08-30):
Good to know I'm not alone! I need to take a look at the Linux driver to see what it would take to strip out just the NVMe queue interface to run standalone on the ARM. That's an interesting approach that I hadn't considered. I don't really need Linux, although it would simplify the UI a lot.

Currently, I have a capture front end that is entirely in the PL and a very small ARM program that does link training and configuration of that peripheral. I run the ARM program out of OCM and disable all but the PL-PS slave ports on the DDR controller during capture. I'd love it if I could dedicate one port to writing from the front end to the head of a circular buffer and one port for the PCIe DMA bridge to read from the tail and write to the SSD. How to make that happen is the question!

I'll definitely post an update as I do more testing.
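The circular-buffer bookkeeping in that scheme is simple enough to live in the small ARM program. A sketch of the idea, with hypothetical helper names and a chunk-granular head (advanced by the capture hardware) and tail (advanced after each SSD write), might look like:

    #include <stdint.h>

    #define CHUNK_BYTES (1024u * 1024u)        /* write granularity to the SSD */
    #define NUM_CHUNKS  2048u                  /* 2 GiB ring somewhere in DDR */
    #define RING_BASE   0x40000000UL           /* placeholder DDR base address */

    static uint32_t tail;                      /* next chunk to send to the SSD */

    extern uint32_t capture_head_index(void);                    /* hypothetical: read from a PL register */
    extern void ssd_write_chunk(uint64_t paddr, uint64_t lba);   /* hypothetical: queue one NVMe write */

    /* Drain every complete chunk between the tail and the capture head. */
    void drain_ring(uint64_t *next_lba)
    {
        uint32_t head = capture_head_index();
        while (tail != head) {
            uint64_t paddr = RING_BASE + (uint64_t)tail * CHUNK_BYTES;
            ssd_write_chunk(paddr, *next_lba);
            *next_lba += CHUNK_BYTES / 512u;   /* assuming 512-byte LBAs */
            tail = (tail + 1u) % NUM_CHUNKS;
        }
    }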
Ambivalent Engineer (2019-08-28):
Hi Shane.

I'm working on a similar problem. I have a Zynq 030 writing to a Samsung 960 across x4 PCIe. Our data source is a set of hardware sensors that write around 1300 MB/s into the DRAM attached to the PS. (Note the PS DRAM itself is just 4 GB/s, so zero-copy is critical.) I don't think we'll end up saving all of it, but we'd like to save over 450 MB/s.

For our first attempt, we mmap()ed the DRAM buffers into the virtual address space of a process and then issued write()s from that to files on an ext4 filesystem: 72 MB/s. dd if=/dev/zero of=/media/nmve/foo bs=2M count=1024 delivers 78 MB/s.

Next, we rebuilt the ext4 filesystem to eliminate the journal. This made it possible to open writes to the filesystem with O_DIRECT. However, it's not possible to write from mmap()ed physical memory to an O_DIRECT file... the write fails with an illegal address. We have a hack right now where we memcpy from the mmap()ed physical memory to a 1 MB aligned_alloc() buffer, and then write from that to the O_DIRECT file: 108 MB/s. We do not understand why Linux won't let us write from an address which has been mmap()ed from /dev/mem.

We have a trivial test program which just sequentially writes 1 MB blocks from an aligned_alloc() buffer to an O_DIRECT file handle. It gets 240 MB/s, which suggests the Linux filesystem is never going to hit 450 MB/s, even if we get rid of that awful copy.

I don't think we'll need actual hardware (FPGA firmware) to write 450 MB/s, though. It should be possible to issue commands directly to the NVMe from software by creating a command and completion queue pair and writing directly to the command queue. The data is in contiguous 1 MB hunks in physical memory, so we only need to issue at most a thousand NVMe commands per second. I don't think the ARM will have much trouble with that. I'm not sure what your data source looks like, but you might consider this approach. I would think that even hundreds of thousands of NVMe commands per second should be possible.

One thing we'd be very interested in is partitioning the NVMe drive and having Linux mount one partition while we operate a command queue on the other. Since the NVMe can handle thousands of queue pairs, this is possible in principle (we only need one queue pair!). We don't know what it takes to convince Linux to play along. If you discover anything, we'd love to know.

Anonymous (2019-08-22):
When you open a file, you can use the O_DIRECT flag; it's the same as dd's direct flag.

Also, the sendfile syscall is useful for achieving better performance. A study of block size per write is also recommended.

A simple bit of code you can use to test speed in Python is:

    import os
    from sendfile import sendfile   # pysendfile package
    IMAGE_SIZE = 2 * 1080 * 1920
    f_in = os.open('/dev/udmabuf0', os.O_RDONLY)
    f_out = os.open('/mnt/disk0/image.raw', os.O_WRONLY | os.O_DIRECT | os.O_CREAT | os.O_SYNC)
    sent = sendfile(f_out, f_in, 0, IMAGE_SIZE)
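For comparison with the C used elsewhere in this thread, the same experiment is only a couple of calls with sendfile(2). Whether a given kernel accepts a character-device source like /dev/udmabuf0 for sendfile() is something to verify, so treat this as a sketch of the test rather than a guaranteed fast path:

    #define _GNU_SOURCE              /* O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t image_size = 2 * 1080 * 1920;   /* one raw frame, as above */

        int f_in  = open("/dev/udmabuf0", O_RDONLY);
        int f_out = open("/mnt/disk0/image.raw",
                         O_WRONLY | O_DIRECT | O_CREAT | O_SYNC, 0644);
        if (f_in < 0 || f_out < 0) { perror("open"); return 1; }

        /* In-kernel copy from the pinned DMA buffer to the file:
         * no user-space bounce buffer involved. */
        off_t offset = 0;
        ssize_t sent = sendfile(f_out, f_in, &offset, image_size);
        if (sent < 0) { perror("sendfile"); return 1; }

        printf("sent %zd bytes\n", sent);
        close(f_in);
        close(f_out);
        return 0;
    }

The attraction is that the copy (if any) happens inside the kernel, which is exactly the cost the memcpy hack described above is trying to avoid.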
Shane Colton (2019-08-21):
I'll definitely give that a try! Thanks for the tip. Follow-up question: what, if anything, do you have to do during a normal non-dd sequential write (from a huge block of RAM to the SSD) to achieve the "direct" speed?