Saturday, April 18, 2020

Full-Speed CMV12000 Subsampled Readout: 1440fps 1080p

Now that I've got a continuous multi-Gpx/s image capture pipeline running, it's time to rearrange some things to break the 1000fps barrier:



For this clip I'm using the CMV12000's X/Y subsampling mode to trade resolution for frame rate, hitting 1440fps at 2048x1088. The overall pixel rate is a little lower than in 4K (3.2Gpx/s vs. 3.8Gpx/s), so it's feasible to send this through the same Zynq Ultrascale+ capture pipeline, with some modifications, to record continuously to an NVMe SSD. With ~4:1 wavelet compression, this writes about 1GB/s to the drive, up to 1000s (16.7min) for a 1TB drive. That would be 16.7 hours of playback at 24fps, though. I figured 30 seconds real-time and 30 minutes of playback was enough water droplet footage for now.
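For the record, the arithmetic behind those numbers is easy to check. Here's a quick back-of-the-envelope calculation in C, assuming 10-bit pixels and the ~4:1 compression ratio quoted above:

#include <stdio.h>

// Sanity check of the 2K subsampled recording numbers.
// Assumes 10-bit pixels and a ~4:1 wavelet compression ratio.
int main(void)
{
    const double w = 2048.0, h = 1088.0, fps = 1440.0;
    const double bits_per_px = 10.0, compression = 4.0;

    double px_rate    = w * h * fps;                      // ~3.2 Gpx/s
    double raw_rate   = px_rate * bits_per_px / 8.0;      // bytes/s, uncompressed
    double ssd_rate   = raw_rate / compression;           // ~1 GB/s to the drive
    double record_s   = 1e12 / ssd_rate;                  // seconds to fill 1 TB
    double playback_h = record_s * fps / 24.0 / 3600.0;   // hours at 24 fps

    printf("pixel rate: %.2f Gpx/s\n", px_rate / 1e9);
    printf("SSD write rate: %.2f GB/s\n", ssd_rate / 1e9);
    printf("1 TB record time: %.0f s (%.1f min)\n", record_s, record_s / 60.0);
    printf("playback at 24 fps: %.1f h\n", playback_h);
    return 0;
}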

CMV12000 Subsampling

In a previous post, I covered the pipeline architecture for continuously recording 400fps 4K video from a CMV12000 image sensor to an NVMe SSD. That was a 4096x2304 (16:9) frame, slightly larger than 4K UHD. The sensor's native resolution is 4096x3072 (4:3), which it can read in at 300fps. By reading in fewer rows, the maximum frame rate is increased. Going wider than 16:9 would allow frame rates higher than 400fps, but since the sensor always reads in full 4096px-wide rows, the speed gain is only linear.

To go much faster, it's necessary to read in fewer columns as well. Not all sensors can do this; reading whole rows may be baked into the hardware architecture. The CMV12000 doesn't support arbitrary readout width, but it does support 2x subsampling. In this mode, every other four-pixel square (Bayer group) is skipped in both the X and Y directions. The remaining squares are transmitted on the LVDS channels using an alternate packing:

CMV12000 subsampled readout (color, X-flipped).
Each of the 64 LVDS channels alternates between two rows, with the lower 32 channels handling two even (G1/R1) rows and the upper 32 channels handling two odd (B1/G2) rows. This alternate data packing allows the subsampled image, with 1/4 as many total pixels, to be read out nearly 4x faster. There is a small amount of extra overhead time that makes the actual gain not quite 4x.

Subsampling drops the resolution from 4K to 2K but preserves the crop factor of the sensor, since the full width and height are still used. This is preferable to cropping a 2048px-wide image out of the middle. It doesn't give any increase in sensitivity though; to do that would require binning (averaging the larger 4x4 squares to generate the final 2x2). The CMV12000 does support binning, but the overhead is so bad that you might as well read out the 4K image and do it in post (assuming you have the data storage bandwidth, which I certainly do). So to go ~4x faster, I will need ~4x more light.

Light sensitivity of subsampling vs. binning.
Before worrying about a shortage of photons, though, I first need to deal with a shortage of programmable logic. To fit everything on the XCZU4, my main bottlenecks are BRAMs and LUTs. I managed to add the decoder for HDMI output with no increase in either by sacrificing the third wavelet stage. But I've known for a long time that the day would come when I would need to add 128 more Stage 1 horizontal cores to handle the subsampled inputs.

It might seem odd that more cores are needed to process a smaller image. Even at the higher frame rate, the pixel input rate is lower than in 4K. Surely the existing horizontal cores could time-multiplex to handle the data? But, the wavelet cores must operate on groups of adjacent pixels. In this case, adjacency describes the nearest horizontal pixels of the same color, since applying a difference operation to pixels of different colors would not have the desired result. And whatever the color, pixels from another row are not horizontally adjacent. Since each LVDS channel now services two color fields and two rows, it must feed four independent wavelet cores.

In 2K mode, each LVDS channel feeds four independent Stage 1 horizontal cores.
So, the total number of Stage 1 horizontal cores doubles from 128 to 256. This jump has been on my mind since the early stages of the design, and I tried to optimize the horizontal cores as much as possible. A big part of this was reducing the operating pixel width from 16-bit to 12-bit, which brought the per-core LUT count down from 107 to 83. As this is the first stage of the pipeline, it's easy to verify that it won't saturate on 10-bit inputs. The horizontal cores operate in-line with the input using only distributed memory, so no additional BRAMs are required. But there's no way around the additional 10,000 or so LUTs, and that will bring me right up to the limits of this chip.

Since I knew there would be very few LUTs remaining for switching modes, I originally thought the 4K and 2K modes might have to exist as entirely separate PL configurations, their bitstreams loaded as needed by software. I've seen other cameras do this; it looks like a software reset when changing capture formats. And while it only takes a few seconds, I really dislike the workflow and the idea of maintaining two configurations.

So, I spent some time looking at the actual differences between modes at all stages of the pipeline and decided that I could and should build the switch. I had this mode change in mind early in the design, so I tried to minimize the number of touch points required in each of the modules to switch between 4K and 2K. Even so, there are a number of small changes needed in the Wavelet, Encoder, and HDMI modules. They are collectively driven by a master switch in each module's AXI slave registers. I'll go through them in pipeline order below.

Wavelet Stage 4K/2K Switch

First, no actual switching is required to distribute the inputs to the Stage 1 horizontal cores; each channel always connects to the same four cores. Instead, the cores are gated by a master pixel counter based on their color and, when in 2K mode, also their row. The 2K mode switch turns on this extra enable gate and offsets the counter that handles first/last row states by one bit, to account for the half-width rows. Miraculously, this did not add any LUTs to the horizontal cores. I assume the extra logic just got merged into existing smaller LUTs...I'll take it.

The most complicated part of the switch happens next, at the interface between the Stage 1 horizontal and vertical cores. Instead of distributing outputs from four adjacent horizontal cores into a single row of a vertical core BRAM, the 2K interface distributes outputs from eight horizontal cores into two rows of a vertical core BRAM. Since the rows are half as wide, this takes the same number of pixel clock cycles (128). So, as will be the case at many points in the pipeline, this just boils down to rearranging the bits of the BRAM write address:

Aspect ratio change and read/write addressing of the Stage 1 vertical core BRAMs in 4K vs. 2K mode.
Conceptually, the aspect ratio of the vertical core BRAM changes from 8 rows of 256px to 16 rows of 128px. The figure above shows where writes and reads occur in the BRAM at a given relative pixel count. Reads occur on half-counts since the Stage 1 vertical DWT operates at double px_clk frequency. The read address generator is also modified by the switch to account for the new aspect ratio. Only the eight most recent rows are actively written or read, so in 2K mode the BRAM is twice as big as it needs to be. The latency of the vertical core is also halved, since it's determined by the number of rows required to complete the vertical DWT operation. This will come into play later.
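In C-flavored pseudocode, the write-address rearrangement boils down to something like this (the exact bit packing is my own illustration; only the 8x256 vs. 16x128 aspect ratios above are taken from the actual design):

#include <stdint.h>

// Illustrative Stage 1 vertical core BRAM write addressing. The same
// 2048-entry BRAM is addressed as 8 rows of 256px in 4K mode and as
// 16 rows of 128px in 2K mode, just by re-wiring which counter bits
// select the row versus the column.
static inline uint16_t s1v_wr_addr_4k(uint16_t row, uint16_t col)
{
    return (uint16_t)(((row & 0x7u) << 8) | (col & 0xFFu));   // {row[2:0], col[7:0]}
}

static inline uint16_t s1v_wr_addr_2k(uint16_t row, uint16_t col)
{
    return (uint16_t)(((row & 0xFu) << 7) | (col & 0x7Fu));   // {row[3:0], col[6:0]}
}

In hardware, the 4K/2K switch just selects between the two wirings, which costs little more than a mux per address bit.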

The Stage 1 vertical core buffers the alternating-row 2K mode inputs into a single-row format that's compatible with the rest of the pipeline, so changes after this point are relatively minor. Each Stage 1 vertical core feeds its output row to a Stage 2 horizontal core. The only modification required there is to offset the counter that handles first/last row states by one bit, to account for the half-width rows. Then, the Stage 2 vertical core just needs some more BRAM address rearrangement:

Aspect ratio change and read/write addressing of the Stage 2 vertical core BRAMs in 4K vs. 2K mode.
Like the Stage 1 vertical core BRAM, the aspect ratio is changed from 8 rows of 256px to 16 rows of 128px. But since the first stage already rearranged things into single rows, the write addressing here is more straightforward: In both 4K and 2K mode, only a single row is filled at a time (by two adjacent Stage 2 horizontal cores). The row width is halved, but there's no write interleaving between the two rows. Ultimately, this is just a different arrangement of the write address bits. The read address generator is similarly modified to grab the right data for the Stage 2 vertical DWT. As with the Stage 1 vertical core, the BRAM is twice as big as it needs to be, and the latency is halved.

Encoder 4K/2K Switch

The compression stage doesn't care about the aspect ratio change, since the only context it uses for variable-length encoding is an immediate group of four pixels. However, it does need to know the adjusted latency of both wavelet stages, since the first pixel to be encoded will arrive sooner in 2K mode. For that, I just made all the latency offsets software-defined, through the encoder's AXI slave registers. And that should be the only change required here...

Except things are never that easy. I noticed that after plugging in the expected latency values in 2K mode, two of the four color fields (R1 and G2) were dropping one pixel per row. It took a while to isolate this to the encoder, and then even more staring at this module to figure out what the problem was. Since the only change I made was to the latency offsets, I figured there had to be some fundamental difference between how the local pixel counter (px_count_e) drives the encoder states during row transitions with different offsets, and there was:

Encoder gating in 4K mode, showing the difference between sequential and combinational px_count_e_updated.
The above shows px_count_e at the first row overhead time (ROT) in 4K mode. It's negative since pixels haven't made it to the encoder yet, but the same behavior happens at all subsequent row transitions. During ROT, the sensor is not sending pixel data and all the pixel counters (including px_count_e) hold their previous values. A signal called px_count_e_updated is cleared, which gates the encoder from sending pixels to RAM (via an intermediate shift register called e_buffer). This signal was previously sequential, which would add one clock cycle delay between the ROT and when the encoder is gated. It should have been combinational, to line up correctly with the ROT.

But the write to e_buffer also only takes place every other group of four pixel clocks, for reasons discussed here. In 4K mode, the ROT happens to fall in a period where writes don't occur anyway. The sequential vs. combinational difference didn't matter to the final e_buffer_wr_en signal. But in 2K mode, the new latency offsets just happen to put the ROT one cycle before the start of a four-cycle write sequence, where the difference does matter:

Encoder gating in 2K mode, showing the difference between sequential and combinational px_count_e_updated.
After switching over to combinational logic for px_count_e_updated, the missing pixel returned, and things were almost happy again. It turns out there was a similar issue at the quantizer and encoder modules themselves, before the write to e_buffer. This was simply due to them not being enable-gated at all, though. (Again, it must have been working thanks to lucky latency offsets in 4K mode.) Gating each with the same combinational px_count_e_updated signal worked fine.

HDMI 4K/2K Switch

But wait, isn't the HDMI output always 1080p? While that is true, it doesn't mean there's nothing to be done here. In 4K mode, only the Stage 2 wavelet compression is decoded, leaving a 2K preview image (really, four color fields that are each 1024px wide) to be output via HDMI. This greatly reduces the size of the HDMI module, since it only has to decode four of the sixteen codestreams and do one stage of inverse DWT. However, getting to the same preview size in 2K mode would mean complete decoding, requiring all sixteen codestreams and two wavelet stages. I simply don't have room to do that, so I'm going to cheat.

The first step is to change how the viewport is mapped to a pixel count. To achieve arbitrary scaling of the preview image, I first normalize the viewport to 16-bit, i.e. top-left (0, 0) to bottom-right (65535, 65535). The x and y components, vxNorm and vyNorm, are shifted around to create the pixel counters that drive the output pipeline. When switching from 4K to 2K, each component gets right-shifted by one and the split between x and y moves over by one bit in the final counter:
Mapping between 16-bit (vxNorm, vyNorm) coordinates and opx_count in 4K vs. 2K mode.
This remapping means that the entire output pipeline operates at half resolution in 2K mode. The preview will actually just be scaled up from the four LL1 color fields, which are each 512px wide. There will still be bilinear interpolation to help smooth out the result, but it will be blurrier than the 1080p preview in 4K mode. But again there isn't really an alternative, at least not with the resources I have left on this chip.
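For illustration, the counter construction might look roughly like this in C. The LL1 field widths (1024px in 4K mode, 512px in 2K mode) come from above, but the exact shift amounts and x/y bit split are my own assumptions:

#include <stdint.h>

// Hypothetical sketch of the (vxNorm, vyNorm) -> opx_count mapping.
static inline uint32_t opx_count_4k(uint16_t vxNorm, uint16_t vyNorm)
{
    uint32_t x = vxNorm >> 6;      // 0..1023 across a 1024px-wide LL1 field
    uint32_t y = vyNorm >> 6;
    return (y << 10) | x;          // x occupies the low 10 bits
}

static inline uint32_t opx_count_2k(uint16_t vxNorm, uint16_t vyNorm)
{
    uint32_t x = vxNorm >> 7;      // each component right-shifted by one more
    uint32_t y = vyNorm >> 7;      // 0..511 across a 512px-wide LL1 field
    return (y << 9) | x;           // the x/y split moves over by one bit
}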

The output pixel counter (opx_count) drives all parts of the decoding process, starting with a RAM reading FIFO through the HDMI module's AXI master. No changes are required there or in the decoder itself, other than modifying the latency offsets accordingly. These have always been software-defined, so I just added the expected values for 2K mode and they worked without any hassle. (There was no equivalent sequential vs. combinational bug, thankfully.)

After this, the modifications to the Stage 2 inverse vertical wavelet cores are pretty simple and almost the same as in the forward direction. Each color field's IV2 core uses a single URAM for row storage. In 2K mode, the aspect ratio is changed from 16 rows of 1024px to 32 rows of 512px, by rearranging read and write address bits:

Aspect ratio change and read/write addressing of the Stage 2 inverse vertical core URAMs in 4K vs. 2K mode.
Unlike the forward direction, the Stage 2 inverse horizontal wavelet cores also use URAMs for row storage and these likewise need address bit rearrangement to change aspect ratios for 2K mode:

Aspect ratio change and read/write addressing of the Stage 2 inverse horizontal core URAMs in 4K vs. 2K mode.
And finally, the bilinear interpolation module needs to be adjusted to automatically scale up the preview image by 2x, so it can fill the viewport using the 512px-wide color field LL1 outputs. This can be done quickly by passing the shifted vxNorm and vyNorm values to the module, although this isn't quite correct, as will be discussed below. It's good enough for now, though.

Debayering

Applying an ordinary debayering algorithm, whatever it is, to the 2K subsampled raw data doesn't really work. This is because the physical spacing between pixels is no longer symmetric. For example, a red pixel is closer to its green and blue neighbors to the left and below than to the right and above. A proper bilinear interpolation needs to take this asymmetry into account, by modifying the location of pixel centers for each color field accordingly. More advanced algorithms are still built on the assumption of symmetric neighbors, so they'd all need modification to some degree.

Asymmetric neighboring pixels in subsampled mode can be handled by modifying interpolation pixel centers (left) or with an intermediate supersampling step (right).
Alternatively, the subsampled data can be supersampled by 2x to estimate the missing pixels (G2' and G1' in the image above) and then run through the ordinary debayer algorithm in 4K. The final output can then be scaled back to 2K to reflect the true information content of the data. This path takes longer for what may be an equivalent result for simpler debayer algorithms, but it might have advantages for more complex algorithms. All this will probably be obsoleted by neural networks that upscale 240p images to 16K in a few years anyway, so I'm not going to worry about it.

It is important to adapt the debayer algorithm for the subsampled pixel locations somehow, though, or there will be significant artifacts. The following comparison shows three different algorithms: nearest-neighbor, bilinear, and a Microsoft 5x5 interpolator that I like. For each, a reference 4K capture and 4K debayer is compared to a 2K subsampled capture with an unmodified 2K debayer and a 2K subsampled capture with a supersampled 4K debayer.

Comparison of three different interpolation algorithms with 4K capture/debayer, 2K subsampled capture with unmodified 2K debayer, and 2K subsampled capture with supersampled 4K debayer.
None of these simple algorithms can do much to recover resolution - for that I defer to the AI supersampling state of the art - but using an unmodified 2K debayer on subsampled raw data creates significant color checkerboarding artifacts on edges. Supersampling the data by 2x and running a simple 4K debayer at least bypasses the problem of neighboring pixel asymmetry.

Resource Utilization

Squeezing in the 4K/2K switch was beyond what I'd hoped to fit on the XCZU4, but it just barely works. The switch itself really only adds LUTs where BRAM/URAM address bits are remapped or where pixel counts are shifted to account for the aspect ratio change. The main addition is the 128 new Stage 1 horizontal wavelet cores, which really push the resource utilization to the limits.

The XCZU4 with everything crammed in.
At this point I'm at 77143 LUTs (87.82%), 93883 FFs (53.44%), 118 BRAMs (92.20%), 14 URAMs (29.17%) and 146 DSPs (20.05%). But, since most of my cores are running at px_clk (60MHz) or HDMI clock (74.25MHz) frequency, the timing constraints are not too difficult to meet. The exception seems to be things that interact with the 250MHz AXI clock, including the encoder and decoder BRAM FIFOs. These have to have some amount of manual placement help to meet timing.

The good news is I don't really have much else to add to the programmable logic. I've already built in placeholder URAMs for UI overlays in the HDMI module, so those just need to be filled in by software. I might add some more color processing to the HDMI output, but that will mostly use DSPs, and possibly URAMs for color look-up tables, which should be no problem to add. I'm really happy that everything fits on the XCZU4, not just because the bigger chips are way more expensive, but because it's been a much better lesson in optimizing cores to fit resource constraints than if I had just switched to the XCZU7 early on.

Saturday, March 14, 2020

HDMI, the Hard Way

If I were to rank the components of this project in terms of the ratio of their actual vs. expected difficulty, the NVMe interface would probably be lowest, since it was nowhere near as hard as I thought it would be. The CMV12000 input (easy, expected to be easy) and wavelet engine (hard, expected to be hard) would be somewhere in the middle. And the new top of the list, the hardest module that should have been easy, would be the HDMI output.

HDMI

There seem to be two main reference designs for outputting an HDMI signal from a Zynq SoC. Zynq-7000 series boards such as the ZC70x and Zedboard use an external HDMI transmitter, the ADV7511, to convert a parallel RGB interface into serial HDMI TMDS outputs. Zynq Ultrascale+ boards such as the ZCU10x and UltraZed-EV Carrier Card use the built-in serial transceivers of the ZU+ to drive the TMDS outputs through a SN65DP159 HDMI retimer. The latter is a more modern approach, supporting up to 4K60 through the HDMI TX Subsystem IP. But, that IP is not included with Vivado. It also requires three free GTH transceiver channels, which I don't have on the XCZU4. (Its four available channels are in use for PCIe Gen3 to the SSD.)

There's nothing wrong with using an external HDMI transmitter with the ZU+, though. I left a PL GPIO bank open specifically for a parallel RGB pixel bus, either for an LCD controller or an HDMI interface. I opted for the slightly newer ADV7513, which supports up to 1080p60 at 8-bit. This is perfectly acceptable as a preview and local playback resolution. Outputting a full 4K frame over HDMI might be useful for interfacing with a RAW recorder, but that is out of the question anyway at 400fps. In fact, I only really need a 24-30fps HDMI output, which means a very manageable 74.25MHz pixel clock, based on the CEA-861 standard.

HDMI timing parameters for 1920x1080p 24/25/30Hz with a 74.25MHz pixel clock.
Generating the required pixel clock, sync, and dummy RGB signals in the ZU+ Programmable Logic (PL) is pretty simple; I set that up as a module on day one of playing with HDMI. Typically, you'd just point this module at a frame buffered in RAM and let it pull the real data. (There are video DMAs and drivers that will do this more-or-less automatically.) But here's where I run into a slight problem: I don't actually have a frame buffered in RAM.
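The CEA-861 totals also make the frame-rate math easy to check: frame rate = pixel clock / (horizontal total x vertical total). A quick check in C, using just the standard totals (front porch, sync, and back porch breakdowns omitted):

#include <stdio.h>

// 1080p modes that share a 74.25 MHz pixel clock (CEA-861 totals).
int main(void)
{
    const double pclk = 74.25e6;
    const struct { const char *name; int h_total, v_total; } modes[] = {
        { "1080p24", 2750, 1125 },
        { "1080p25", 2640, 1125 },
        { "1080p30", 2200, 1125 },
    };
    for (int i = 0; i < 3; i++) {
        double fps = pclk / ((double)modes[i].h_total * modes[i].v_total);
        printf("%s: %d x %d total -> %.3f Hz\n",
               modes[i].name, modes[i].h_total, modes[i].v_total, fps);
    }
    return 0;
}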

The Hard Way

While it is possible to write the full 3.8Gpx/s raw frame data to RAM on the ZU+, it would be futile to try doing any significant processing on it there. Even if I used all three 128b AXI bus connections between the PL and the memory controller at 250MHz, that would allow for less than three accesses per pixel...including the initial write. The Processing System (PS) has a similar memory access constraint, although processing pixels serially on the ARM cores is much too slow anyway. So I made the decision early on to implement the wavelet compression engine in PL hardware and write the ~5:1 compressed codestreams to RAM instead, on their way to the SSD.
The capture pipeline, with the main data path highlighted and shown decreasing in width where compression occurs at the PL Encoder, before data is written to DDR4 RAM.
"No problem," you might say, "just split off raw data from the sensor and feed it to the HDMI module." Unfortunately, this doesn't quite work: In the time it takes the HDMI scan to complete one row, the capture pipeline has processed 50+ rows from the CMV12000. The input and output are just not in sync, and any attempt to buffer partial frames between them would require much more block RAM than I have available. It would also cause frame tearing that would ruin any attempt to preview periodic phenomenon with the global shutter.

The only real choice is to put the HDMI output module after the RAM buffer, which means decoding compressed frame data on the way out:
The only logical place to put the HDMI output, and not just because I left space for it there in the block diagram.
The HDMI module reads codestream data from RAM as an AXI Master, decodes the pixel values, and runs an Inverse Discrete Wavelet Transform (IDWT) to recover the raw image. While this is a lot more work, it pays off twofold because the same module can be used for playback by reading frames back out of the SSD into RAM and pointing the decoder at them.

Notwithstanding the design effort, the actual resource utilization of this module should be pretty low. For one, only four of the sixteen codestreams need to be decoded to reconstruct a 2048px-wide image to use for the preview; there's no need to decode any LH1, HL1, or HH1 data. Also, the preview frame rate is at least 10x slower than the capture frame rate, so the amount of parallelism needed in the decoding and IDWT pipeline is much lower. Still, it's more logic on an already-crowded chip.

Kill Your Darlings

At this point I'm stubbornly committed to fitting this design on the XCZU4. With the capture pipeline complete, I was getting pretty close to maxing out this chip, especially the LUTs (65593 / 87840) and BRAMs (122 / 128). And this was after a significant optimization pass on all the cores, including trimming pixel math operations from 16-bit to 12-bit where applicable and removing debug interfaces. These bottlenecks were already causing routing difficulty that was pushing up compile times, so I needed to make more room somehow. And then one day I woke up and decided to delete Wavelet Stage 3.
An example showing the effect of deleting the third DWT stage without changing the target compression ratios of any other stages. The red bars are each sized proportionally to the compressed sub-band they represent.
Stage 3 only handles 1/16 of the total data throughput, but it is visually the most significant and thus uses the least amount of compression. In the example above, replacing Stage 3's output with a raw 1/4-scale average image (LL2) has a relatively small effect on the overall compression ratio. It's also not a complete loss, since the 1:1 LL2 will yield slightly better visual quality if the other subbands remain unchanged. The distribution of bandwidth that achieves the best image quality with an overall compression ratio of 5:1 is still an unknown, but ditching Stage 3 probably isn't restricting the search space too far.

Although Stage 3 is by far the smallest wavelet core, removing it also simplifies a lot of downstream logic. The "XX3" encoder, which previously handled all four Stage 3 subbands by cycling through different inputs and quantizer settings, now becomes a pass-through for raw LL2 data. It also now has the same latency as the HL2, LH2, and HH2 encoders. This latency is the new maximum and is significantly lower than the former XX3 latency. (It's no longer necessary to wait for six whole LL2 rows for the Stage 3 DWT.) There's a symmetric payoff on the decoder side as well.

So while I'm sad to see it go, I think it's the right call for now. Having three stages probably does improve the compression performance (objectively, the PSNR at a given compression ratio), but I think I can still achieve good image quality at an overall ratio of 5:1 with only two. Not even including prospective decoder savings, the reduction in LUTs (-4575), FFs (-5320), and most crucially BRAMs (-8) is well worth it.

Working Backwards

In many ways, the HDMI output module is just a mirror image of the pixel input pipeline, from the deserialized CMV12000 input pixels to the AXI Master that writes encoded data to RAM. The 74.25MHz HDMI clock runs a master pixel counter that scans across and down the output frame. Whereas the CMV12000 clocks in 64 pixels in parallel, though, the HDMI only has to clock out one.

Or does it? Each HDMI pixel (in RGB 4:4:4 format) consists of an 8-bit red, green, and blue value, whereas the Bayer-masked sensor input is split into four interleaved color fields. Each color field's decoded LL1 image will only be 1024px wide. One option would be to center this in the HDMI frame and pull the 8-bit R, G, and B values directly from each color field's LL1:
1:1 scaling from LL1 color field pixels to HDMI pixels.
In this case, each HDMI clock requires one pixel from each of the four color fields (the two greens are averaged). The logic couldn't really get any simpler. But, it makes poor use of the 1920x1080 HDMI frame, especially for widescreen aspect ratios. An alternative would be to scale everything up by a factor of two:
2:1 scaling from LL1 color field pixels to HDMI pixels.
Now, a debayering method has to be used to reconstruct the missing color values at each pixel. For this application, a simple average of the neighboring pixels would be fine. (The off-line decoder uses a more complex, higher-quality method.) Each HDMI pixel now references as many as four pixels from each color field. But, these pixels don't all update at each HDMI clock. The average pixel consumption from each color field is actually only one per four HDMI clocks, as expected from the 2:1 scaling factor.

But a 2:1 scaled preview doesn't fit in 1920x1080. The cropping isn't too bad for widescreen aspect ratios, but it's unusable for 4:3. Switching between 1:1 and 2:1 scaling depending on the aspect ratio would work, but adds a lot of conditional logic for a still-compromised result. An arbitrary software-controlled scaling between 1:1 and 2:1 would be so much better. So, time to break out the DSPs:
Arbitrary scaling from LL1 color field pixels to HDMI pixels, using bilinear interpolation.
To achieve arbitrary scaling, the four 1024px-wide LL1 color fields are resampled onto a 65536px-wide grid, accounting for the offsets between the centers of pixels of each color. Then, a viewport is defined within the HDMI frame and normalized onto this 16-bit grid (using DSPs). The four pixel centers of each color field that box in the normalized viewport coordinate are used for bilinear interpolation (using more DSPs) to produce the R, G, and B values. This is also the debayer step, thanks to the pixel center offsets.
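In software terms, each color field's contribution looks roughly like the sketch below. It's simplified: bounds handling, the per-color pixel-center offsets, and the exact fixed-point arrangement of the DSP datapath are all omitted.

#include <stdint.h>

// Simplified bilinear sample of one LL1 color field at a viewport
// coordinate normalized to the 65536-wide grid (64 grid units per pixel
// for a 1024px-wide field).
#define GRID_BITS  16
#define FIELD_BITS 10                          // 1024px-wide LL1 color field
#define PITCH_BITS (GRID_BITS - FIELD_BITS)    // log2(65536 / 1024) = 6
#define PITCH      (1u << PITCH_BITS)

static uint16_t ll1_bilinear(const uint16_t *field, int stride,
                             uint16_t vxNorm, uint16_t vyNorm)
{
    uint32_t x0 = vxNorm >> PITCH_BITS;            // top-left of the 2x2 box
    uint32_t y0 = vyNorm >> PITCH_BITS;
    uint32_t fx = vxNorm & (PITCH - 1);            // fractional position
    uint32_t fy = vyNorm & (PITCH - 1);

    uint32_t p00 = field[y0 * stride + x0];
    uint32_t p10 = field[y0 * stride + x0 + 1];
    uint32_t p01 = field[(y0 + 1) * stride + x0];
    uint32_t p11 = field[(y0 + 1) * stride + x0 + 1];

    // Two horizontal lerps, then one vertical lerp (the DSPs' job).
    uint32_t top = p00 * (PITCH - fx) + p10 * fx;
    uint32_t bot = p01 * (PITCH - fx) + p11 * fx;
    return (uint16_t)((top * (PITCH - fy) + bot * fy) >> (2 * PITCH_BITS));
}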

One thing I actually do have plenty of is DSPs, and this seems like a great use for 14 of them. Being able to reposition and rescale the preview image from software makes life a lot easier. The downside is that sixteen LL1 pixels are required to generate a single HDMI pixel. But as with the 2:1 case, the input pixels don't all change with every HDMI clock. The average LL1 pixel consumption rate will depend on the scale, but if the viewport width is always at least 1024px, it will never exceed one LL1 pixel per color field per HDMI clock. All upstream logic in the decoder is designed with this constraint in mind.

Ultra-IDWT

Next upstream is the Inverse Discrete Wavelet Transform (IDWT). One of the most significant simplifications achieved by deleting Wavelet Stage 3 is that the HDMI output module only has to do one stage of IDWT: Stage 2. This stage recovers LL1 from the LL2, LH2, HL2, and HH2 subbands. The order of operations is reversed in the IDWT: vertical first, then horizontal. Since we're working backwards from the HDMI output, let's look at the horizontal core first.

The forward horizontal DWT core is heavily optimized for speed and size using only FF-based distributed memory. In the inverse direction, there's a lot more breathing room. Only four cores are needed (one per color field) and they only need to process at most one pixel per HDMI clock. So, I am able to combine the horizontal IDWT with a block RAM buffer and output shift register pretty easily. I'm almost completely out of BRAMs, but I have plenty of UltraRAM (URAM) for this.
Horizontal IDWT and output buffer for one color field built around a single URAM.
Each URAM is 32KiB, enough to store 16 rows of LL1 data. The oldest two rows (N+0 and N+1) feed output shift registers that end in the four pixels the bilinear interpolator needs. The horizontal IDWT is performed on data from Row N+3, its result written back to Row N+2. As in the forward direction, pixels are processed in 64-bit groups of four: two interleaved pairs of low-pass and high-pass values become four LL1 outputs. Two half-speed shift registers unpack 64-bit URAM reads for the IDWT and pack the results into 64-bit writes. Running the IDWT as a single combinational step is not as efficient as using sequential lifting steps, as in the forward horizontal DWT, but it's a bit simpler to do with shift registers. Meanwhile, new data from the vertical stage is fed in at Row N+6.
Vertical IDWT for one color field built around a single URAM.
The vertical IDWT cores are also each built around a single URAM. In this case, the URAM is split in half for low-pass (HL2/LL2) and high-pass (HH2/LH2) vertical data. Four pixels each from three rows of low-pass data (N+0 to N+2) and one row of high-pass data (N+9) are processed every four clocks to create two four-pixel outputs to write to horizontal core URAM. In a shameful waste of clock cycles, input rows are scanned twice and the output write alternates between the even and odd IDWT results. (There are other ways to deal with the 2:1 row scanning ratio, but I'm willing to trade power for simpler logic right now.) Meanwhile, raw interleaved LL2, LH2, HL2, and HH2 data are written in to rows somewhere just ahead of the IDWT read pointers.

Decompressor and Distributor

Each horizontal and vertical core operates on a single color field, but the four input codestreams are instead separated by subband (LL2, LH2, HL2, HH2), with all four color fields being present in each codestream. The codestreams also cycle through four different column positions in a given row, since the Stage 2 forward vertical DWT uses four cores in parallel. A distributor remaps decoded subband data to the appropriate write address in one of the vertical IDWT cores. This is also a good place to interleave the high-pass and low-pass data, which facilitates the horizontal IDWT.
After decoding, subband pixels are redistributed to the appropriate location in each color field's vertical IDWT buffer.
The distributor writes four pixels into one of the four vertical core URAMs at most once per HDMI clock, to satisfy the one pixel per color field per clock constraint discussed above. For viewport widths greater than 1024px, the distribution is gated by the master pixel counter, which only updates when the interpolators actually need new pixels.

Continuing upstream, the distributor receives 16-bit signed pixel values from the four codestream decompressors. Each one takes in codestream data from RAM as-needed, decoding four pixels at a time by reversing the variable length code used by the encoder. The pixels are then multiplied by the inverse of the quantizer multiplication factor, using more DSPs, to recover their full range.

Raw codestream data is read in from RAM by an AXI Master into BRAM FIFOs at the entrance to each decompressor. I'm using precious BRAMs here, for the built-in FIFO functionality and to make the decoder RAM reader symmetric to the encoder RAM writer. A round-robin arbiter checks the FIFO levels to see when more data needs to be read. I'm only using a 64-bit AXI Master on the decoder, since the bandwidth already far exceeds the worst-case HDMI output requirement.

Start-Of-Frame Context

So far, the HDMI output pipeline looks a lot like the sensor input pipeline in reverse. But one subtle way in which they differ is in Start-Of-Frame (SOF) context: the state of the pipeline at the beginning of each frame. In the interest of speed, the input pipeline is not flushed between frames. Furthermore, codestream addresses for a given frame are updated during the Frame Overhead Time (FOT) interrupt, while some data is still in the pipeline, so the very bottom of Frame N-1 becomes the top of Frame N in memory.

Overlap between Frame N-1 and Frame N in memory. SOF N marks the sector-aligned start of "Frame N" in RAM, set during the FOT interrupt from the CMV12000. The decoder seeks the actual start of Frame N data.
If the decoder processes every frame, this isn't a problem: it can wrap cleanly through the overlapping region to get the data it needs for both frames. But the HDMI output only processes a subset of the frames captured. It needs to be able to find the start of any individual frame and process it independently. This is needed for seeking in a playback context too. But I can't afford the time it would take to flush the input pipeline between each frame. So instead I need to completely capture the state of the pipeline at the SOF boundary.

As it turns out, this isn't too bad, since there are only a few places where data can remain in the input pipeline at the SOF: 
  1. In the pre-encoder pixel memory: registers or BRAM buffers that are part of sensor input, DWT or quantizer operations. These have a fixed latency of 6336px for Stage 1+2. The decoder can offset its pixel counter by this amount, essentially discarding the overlapping pixels into the space between VSYNC and the start of the viewport.
     
  2. In the 128-bit e_buffer register of each codestream that accumulates encoded data before writing it to that codestream's BRAM FIFO. The number of bits remaining in this register is neatly captured by its write index, e_buffer_idx.
     
  3. In the codestream BRAM FIFO itself. This is captured by the FIFO read level, already used as the AXI write trigger. Since these FIFOs are 64-bit write and 128-bit read, care must be taken to keep track of the write level LSB as well, to know if there's an extra half-word in memory that can't be read yet.
The last two combine to give a number of bits to discard for each codestream: 

e_buffer_idx + 128 * fifo_rd_count + 64 * fifo_wr_count[0] 

To fully capture the SOF context, these three values are written to the frame headers during the FOT interrupt. A VSYNC interrupt from the HDMI module prompts software to read the header of the next frame to be displayed, calculate the number of bits to discard for each codestream, and pass it to the decoder along with the codestream start addresses. That number of bits are then discarded by the decoders prior to attempting to decode any pixels.
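In C, the per-codestream discard count is just that expression evaluated from the header fields. The struct below is illustrative only, not the actual frame header layout:

#include <stdint.h>

// SOF context captured for one codestream during the FOT interrupt.
typedef struct {
    uint8_t  e_buffer_idx;   // bits pending in the 128-bit e_buffer
    uint32_t fifo_rd_count;  // 128-bit words available in the BRAM FIFO
    uint32_t fifo_wr_count;  // 64-bit write counter (LSB flags a half-word)
} sof_context_t;

static uint32_t bits_to_discard(const sof_context_t *c)
{
    return c->e_buffer_idx
         + 128u * c->fifo_rd_count
         + 64u  * (c->fifo_wr_count & 1u);
}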

High-level architecture of the encoder and decoder interactions with the CPU and RAM.
In total, the HDMI output module (decoder and all) uses 4363 LUTs, 4227 FFs, and 4 BRAMs, less than what was saved by deleting Wavelet Stage 3. It adds 8 URAMs and 26 DSPs, but I'm not running short of those (yet). Except for the AXI Master, it runs entirely on the 74.25MHz HDMI clock, so it shouldn't be too much of a timing burden. There might be room for a bit more optimization, but I'm happy with the functionality it gives for its size.

Focus Assist

The main reason I wanted to get the HDMI module done now, ahead of some of the other remaining tasks, is so I can use the real-time preview for testing. It sucks to have to pull frames off one-by-one through USB to iterate on framing, exposure, and especially focus. Having a 1080p 30fps preview on something like an Atomos Shinobi on-camera monitor makes life a lot easier, and moves in the direction of standalone operation.


One neat trick you can do with wavelets is overmultiply the high-pass subbands (LH1, HL1, HH1) to highlight edges in the preview image. This effect is useful for focus assist. Most on-camera monitors can do this anyway (by running a high-pass filter on the HDMI data), but it's essentially free to do in the decoder since the subbands pass through a multiplier anyway to undo the quantization division. I'll take free features any day.
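Conceptually, it's just an extra gain folded into the dequantization multiply for the high-pass codestreams, something like this sketch (fixed-point details simplified):

#include <stdint.h>

// Dequantization with an optional focus-assist gain folded in. inv_q undoes
// the encoder's quantization division; focus_gain > 1 exaggerates the
// high-pass subbands to highlight edges in the preview.
static inline int32_t dequantize(int16_t coded, int32_t inv_q,
                                 int32_t focus_gain, int is_high_pass)
{
    int32_t v = (int32_t)coded * inv_q;
    if (is_high_pass)
        v *= focus_gain;    // the "free" edge enhancement
    return v;
}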

Macro Machining

With the newfound ability to actually focus the image in a reasonable amount of time, I'm finally able to play with a new lens: The Irix Cine 150mm T3.0 Macro. I started drooling over this lens for close-up high-speed stuff after watching this review. I'm no lens expert, but I feel like this lens series competes with ones 3x its price in terms of image quality. My first test was to attempt to get some macro shots of my mini mill:

Shooting my mini-mill at 400fps with the Irix 150mm T3.0 Macro lens.
The HDMI output was crucial for this, since the lens has an insanely shallow depth-of-field at T3.0, less than the width of the cutting tool. The CMV12000 is not a particularly good low-light sensor, so with an exposure time of around 1.87ms, I needed to add a good deal of light. To make things more interesting, I threw in some cheap IKEA RGBs as well. It took a while to get set up, but the result was promising:


I'll probably repeat this with a more interesting subject (this was just a piece of scrap aluminum) and a more stable mount. If I can get more light, it might be good to close down to T5.6 or so as well, to get a bit more depth of field, and drop the exposure to a 180° shutter for less motion blur on the cutter. But the lens is terrific and I'm happy with the quality of the two-stage wavelet compression so far. The above clip has an average compression ratio of right around 6:1, helped along by the ultra-shallow depth of field.

Next Up

The last major HDL task on this project is modifying the pipeline to accept 2K subsampled frames from the sensor at higher frame rates (up to around 1400fps at 1080p!). This will probably be a separate Vivado project and bitstream, since it requires substantial modifications to the input pipeline. It also needs twice as many Stage 1 horizontal cores, since four rows are being read in simultaneously instead of two.

But I may tackle some simpler but no less important usability tasks first. For one, I still don't have pass-through Mass Storage Device access to the SSD over USB C. This is necessary for getting footage off without opening the camera (or using RAM as intermediate storage). With that and a bit of on-camera UI work (record button, simple menus), I can finally run everything completely standalone soon.

Wednesday, December 18, 2019

Continuous 3.8Gpx/s (4K 400fps+) Image Capture Pipeline

In the original Freight Train of Pixels post, I laid out three main technical challenges to building a continuous recording 3.8Gpx/s imager. All three have now been dealt with, using a Zynq Ultrascale+ SoC as a hardware base. The detailed implementation of each one has its own post:

[The Source] - Full-speed read-in of the CMV12000's 64 LVDS channels.
[The Pipe] - Hardware wavelet compression engine.
[The Sink] - Sustained 1GB/s writing to an NVMe SSD.

Now it's time to put all three pieces together and run it as a full pipeline:

Since YouTube uploads are at the mercy of H.264, here's a PNG frame as well.

There are lots of technical details to dive into, but the first thing to point out is that this is 12000 frames of continuously-recorded 4K 400fps video. That's 30s in real-time and 500s of playback at 24fps, something very few existing high-speed imaging systems can do. And I can keep going. This clip is "only" 24GB of a 1TB SSD. To fill the entire 1TB would take about 20 minutes at this bit rate. That's 20 minutes of real-time, 5.5 hours of playback at 24fps.

This is made possible mostly due to the insane speed of modern SSDs. High-speed cameras typically use RAM to buffer raw frame data in real-time, transferring short clips to non-volatile storage after the capture period. But with NVMe flash write speeds now well into the GB/s range, maybe continuous direct recording to a single drive (or a RAID 0 array) will catch on. Besides the ability to capture long clips, it also allows an alternative trigger-free user interface that would be familiar to anyone: push to record, push to stop.

The real MVP here is the 1TB Samsung 970 Evo Plus NVMe SSD, a good example of modern consumer electronics running laps around everything above it in the pro/industrial world.
Of course, the other enabling factor is the use of wavelet compression to reduce the data rate by a ratio of at least 5:1. This might seem like cheating, but since a similar compression ratio is utilized by REDCODE, Blackmagic RAW, Apple ProRes RAW, and probably many other "raw" formats, I don't feel the least bit guilty. On a sensor like the CMV12000, a lightweight compression pass might actually help the image quality, since it'll inherently denoise the image somewhat.

I also picked a good hardware platform for this project: the lower-end Zynq Ultrascale+ SoCs, like that of the wonderful XCZU4CG module I've been using, have just barely enough resources to pull it off. I'm using 134 of the 144 LVDS-capable I/O, all four PCIe Gen3-capable transceivers, and most of the programmable logic. I was actually going to move up to the XCZU6, which has more than double the logic, but it seems to be on a different branch of the product tree that a) doesn't have PCIe hard blocks and b) isn't supported by Vivado HL WebPACK. For now, I'll just try my best to optimize for the XCZU4. I still have the XCZU5 and XCZU7 available to me if needed, though.

XCZU4 module transferred over to the v0.2 carrier, which has some new connectors including a barrel jack for power in, a full-size HDMI out, and some isolated GPIO.
Although I'm sticking with the same ZU+ module, I did finally get around to building a new carrier. The main addition is an HDMI transmitter, which I'll probably play with in the coming weeks. There's also a new barrel jack power input and an isolated GPIO connector. Other than a PCIe reference clock routing fix, I didn't touch any of the existing high-speed signals. I also (re)discovered good drag soldering technique, so there were zero issues with the 160-pin headers this time around.

Most importantly, I felt confident enough in the design now to risk a color sensor on this board. Unlike my stockpile of monochrome CMV12000s from eBay, I bought this one new, at full price, and I do not want to break it. I haven't had any power supply issues in months of testing on the v0.1 board, so after a thorough multimeter check on the v0.2 board, I permanently soldered on the color sensor and crossed my fingers.

My one and only CMV12000-2E5C1PA, now committed to this board.
The wavelet compression engine was designed for the Bayer-masked color sensor, with independent encoding for each color field, so I didn't have to change any hardware or software to accommodate it. All the color processing happens off-board, a point of reckoning discussed below. But from an image capture point of view, it's 100% drop-in.

Returning to the integration of the three main pieces of the image capture pipeline, there were a handful of small but important details to sort out regarding how frames are buffered in RAM and how they are written out to files on the SSD:



  1. The AXI Master port on the Encoder module now transfers data from the 16 compressor codestream FIFOs to RAM in increments of 512B, one logical block / sector on disk. This greatly simplifies downstream file writing operations.
     
  2. The (now sector-aligned) addresses of the next RAM write for each compressor codestream are presented to software via the Encoder module's AXI Slave. These addresses increment automatically as data is written, but can be overwritten by software.
     
  3. The CMV Input module generates an interrupt at the start of the Frame Overhead Time (FOT), a short (~20-30μs) period of deadtime between frame readouts where there shouldn't be any RAM writing happening.
     
  4. During the FOT interrupt, software reads the Encoder RAM write addresses, resets them if necessary, and records the start address and size of each compressor codestream for a given frame as part of a 512B frame header. Frame headers are stored in a circular buffer in RAM.
     
  5. In the main loop, frames are dequeued from RAM as needed by writing the frame header, followed by each of the 16 codestreams it references, to a file with FatFs. A new file is created every so often to prevent the file size from exceeding the 4GiB FAT32 limit.
This is all pretty easy to implement, at least compared to the three main hardware modules themselves. Most of it is ARM software, which can be iterated and debugged much faster than programmable logic. Also, having the codestream RAM address and size baked into the frame header helps with validating and analyzing the output:
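For reference, a hypothetical C layout of such a header (field names and ordering are illustrative, not the actual format):

#include <stdint.h>

// One 512B frame header: per-codestream pointers plus the timestamps and
// write backlog used for the delay analysis below.
#define NUM_CODESTREAMS 16

typedef struct {
    uint32_t cs_addr[NUM_CODESTREAMS];   // RAM start address of each codestream
    uint32_t cs_size[NUM_CODESTREAMS];   // codestream size for this frame
    uint64_t ts_read;                    // time stamp: frame read into RAM
    uint64_t ts_write;                   // time stamp: frame submitted to SSD
    uint32_t backlog;                    // frame write backlog at submission
    uint8_t  pad[512 - 2 * 4 * NUM_CODESTREAMS - 2 * 8 - 4];  // pad to one sector
} frame_header_t;

_Static_assert(sizeof(frame_header_t) == 512, "header must be exactly one sector");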

The codestream size per frame shows how bandwidth is being distributed to the 16 codestreams, with more bits per frame going to the low-frequency subbands. Compression ratios relative to raw 10-bit data are shown on the codestream size axis. Spikes can be seen during the portions of the clip where the steel wool burns brightest. There's also a gradual increase in codestream size over the 30 seconds, especially in the high-frequency subbands, which I believe is due to image sensor noise increasing with temperature. Feeding the codestream size back into the quantizer settings to maintain a roughly constant bit rate will be a problem for another day.

Each codestream is given a range of RAM addresses to use as a circular buffer for frame data. At the start of a clip, the encoder RAM write addresses are set to the bottom of their range. As data is written, the addresses increment automatically. When they reach the top of their range, software resets them during the FOT interrupt. Each codestream does this independently, but the overall frame buffering capability is limited by the most frequently reset stream. In this case, XX3 resets approximately every 400 frames, so a maximum of 1s can be buffered. Ideally, the RAM ranges would be sized proportionally to the codestream bit rates to maximize the number of frames that can be buffered.

The RAM frame buffer allows the pipeline to ride out NVMe writing delays that last for several frame intervals. Each frame is time stamped once when it is read into RAM and again when it is submitted for writing to the SSD. The two time stamps and the frame write backlog count are added to the frame header, and can be used to review the delay.

Here, the read-to-read interval between frames is a steady 2.5ms, as it must be for 400fps. (A spike would indicate a dropped frame.) The read-to-write delay is typically 10ms, corresponding to a four frame backlog. This offset is set by software to allow for a small number of frames between the read and write pointer, which could be used to generate a low-latency local preview. There is a spike in read-to-write delay every 240 frames when a new file is created, since the file creation NVMe operations are slower than streaming writes. Most of these spikes are only one frame, but there were instances of higher delays, up to 20 total frames (50ms). This is still easily absorbed by the RAM buffer, although it would be good to understand where the extra delay comes from.

So all 12000 frames (113.2Gpx) made it onto the SSD in 30 seconds. That's the last stop on the freight train of pixels - once they hit the flash, they're no longer volatile or time-critical. So in some sense this project is done. But from a practical standpoint, there's still an equal amount of computation that has to happen to decode the frames and run the inverse DWTs, not to mention the additional load of debayering, color correction, and eventual transcoding. What the ZU+ does at 400fps, my laptop CPU struggles to undo at 0.25fps. So implementing decode and inverse DWT in a GPU-accelerated way just became high-priority. Luckily, I remember that I do at least have an existing GPU debayer and color correction solution.


I had to do a little work to modernize it, but since laptop hardware has also gotten way better since I wrote it, it can scrub through 4K raw just fine. Hopefully, I can implement the decode and IDWT there as well, to save the 4s per frame it currently takes to do those on the CPU. I'll also need the ZU+ to be able to do it, for preview and playback, but since it only has to run at 30fps that should be much easier than the forward direction.

Besides that, there are a few side-challenges I still have to take on:
  1. The HDMI output, so I can see what the hell I'm doing. This will involve some amount of decoding and inverse DWT, at least of the 3rd and maybe 2nd wavelet stage, to generate a usable preview image with a menu and status overlay. I don't have the ability to output 4K, so a 1080p preview will have to suffice.
     
  2. USB mass-storage device access to the SSD. As much as I love my Sabrent NVMe SSD enclosure, I fear I am approaching the insertion/removal cycle limit on this drive. I have already demonstrated USB 3.0-speed mass storage device access to ZU+ RAM, so this should be a simple SCSI bridge project. Bonus points if I can implement a few custom SCSI commands to start/stop recording or do other useful control tasks.
     
  3. An alternative programmable logic configuration for the CMV12000's subsampled read-out mode. In this mode, the X and Y read-out skips every other 2x2 Bayer block, for a maximum resolution of 2048x1536 at a frame rate of 1050fps. Or an even more interesting 1472fps at 1080p. Because this mode reads in four rows in parallel, it will require more Stage 1 Horizontal DWT Cores, which might be tough on the XCZU4.
But having the full capture pipeline in place is a good milestone, and I'm happy that it's pretty close to what I had in mind back in June. Some of the details changed as I learned about the capabilities and limitations of the ZU+, but the high-level architecture is as-planned. Being able to capture 3.8Gpx/s continuously (for less than 5nJ/px, too) is something that I think is new and only recently possible with the latest generation SSDs and FPGA hardware working together. Keeping an eye on new components and trying to figure out interesting corner cases where they might be useful is something I enjoy, so this was a fun challenge.

Friday, November 29, 2019

Zynq Ultrascale+ FatFs and Direct Speed Tests with Bare Metal NVMe via AXI-PCIe Bridge

Blue wire PCIe REFCLK still hanging in there...
It's time to return to the problem of sinking 1GB/s of data onto an NVMe drive from a Zynq Ultrascale+ SoC. Last time, I benchmarked the Xilinx Linux drivers and found that they were fast, but not quite fast enough. In the comments of that post, there were many good suggestions for how to make up the difference without having to resort to a hardware accelerator. The consensus is that the hardware, namely the stock AXI-PCIe bridge, should be fast enough.

While a lot of the suggestions were ways to speed up the data transfer in Linux, and I have no doubt those would work, I also just don't want or need to run Linux in this application. The sensor input and wavelet compression modules are entirely built in Programmable Logic (PL), with only a minimal interface to the Processing System (PS) for configuration and control. So, I'm able to keep my entire application in the 256KB On-Chip Memory (OCM), leaving the external DDR4 RAM bandwidth free for data transfer.

After compression, the data is already in the DDR4 RAM where it should be visible to whatever DMA mechanism is responsible for transferring data to an NVMe drive. As Ambivalent Engineer points out in the comments:
It should be possible to issue commands directly to the NVMe from software by creating a command and completion queue pair and writing directly to the command queue.
In other words, write a bare metal NVMe driver to interface with the AXI-PCIe bridge directly for initiating and controlling data transfers. This seems like a good fit, both to this specific application and to my general proclivity, for better or worse, to move to lower-level code when I get stuck. A good place to start is by exploring the functionality of the AXI-PCIe bridge itself.

AXI-PCIe Bridge

Part of the reason it took me a while to understand the AXI-PCIe bridge is that it has many names. The version for Zynq-7000 is called AXI Memory Mapped to PCI Express (PCIe) Gen2, and is covered in PG055. The version for Zynq Ultrascale is called AXI PCI Express (PCIe) Gen 3 Subsystem, and is covered in PG194. And the version for Zynq Ultrascale+ is called DMA for PCI Express (PCIe) Subsystem, and is nominally covered in PG195. But, when operated in bridge mode, as it will be here, it's actually still documented in PG194. I'll be focusing on this version. 

Whatever the name, the block diagram looks like this:

AXI-PCIe Bridge Root Port block diagram, adapted from PG194 Figure 1.
The AXI-Lite Slave Interface is straightforward, allowing access to the bridge control and configuration registers. For example, the PHY Status/Control Register (offset 0x144) has information on the PCIe link, such as speed and width, that can be useful for debugging. When the bridge is configured as a Root Port, as it must be to host an NVMe drive, this address space also provides access to the PCIe Configuration Space of both the Root Port itself, at offset 0x0, and the enumerated Endpoint devices, at other offsets.
PCIe Configuration Space layout, adapted from PG213 Table 2-35.
If the NVMe drive has successfully enumerated, its Endpoint PCIe Configuration Space will be mapped to some offset in the AXI-Lite Slave register space. In my case, with no switch involved, it shows up as Bus 1, Device 0, Function 0 at offset 0x1000. Here, it's possible to check the Device ID, Vendor ID, and Class Codes. Most importantly, the BAR0 register holds the PCIe memory address assigned to the device. The AXI address assigned to BAR0 in the Address Editor in Vivado is mapped to this PCIe address by the bridge.
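Poking at the endpoint's config space is then just memory-mapped reads through the AXI-Lite interface at the right offsets. A bare-metal sketch (BRIDGE_BASE is a placeholder for wherever the AXI-Lite slave is mapped in Vivado; 0x144 and 0x1000 are the offsets mentioned above, and 0x00/0x08/0x10 are the standard config space locations of the IDs, class code, and BAR0):

#include <stdint.h>
#include <stdio.h>

#define BRIDGE_BASE     0xA0000000UL    // hypothetical AXI-Lite base address
#define PHY_STATUS_CTRL 0x144UL         // bridge PHY Status/Control register
#define EP_CFG_OFFSET   0x1000UL        // Bus 1, Device 0, Function 0 config space

static inline uint32_t reg_rd(uintptr_t addr)
{
    return *(volatile uint32_t *)addr;
}

void pcie_check_link_and_endpoint(void)
{
    uint32_t phy  = reg_rd(BRIDGE_BASE + PHY_STATUS_CTRL);      // link speed/width
    uint32_t id   = reg_rd(BRIDGE_BASE + EP_CFG_OFFSET + 0x00); // Device/Vendor ID
    uint32_t cls  = reg_rd(BRIDGE_BASE + EP_CFG_OFFSET + 0x08); // Class Code / Rev ID
    uint32_t bar0 = reg_rd(BRIDGE_BASE + EP_CFG_OFFSET + 0x10); // BAR0 (low word)

    printf("PHY: 0x%08lX  VID/DID: 0x%08lX  class: 0x%08lX  BAR0: 0x%08lX\n",
           (unsigned long)phy, (unsigned long)id,
           (unsigned long)cls, (unsigned long)bar0);
}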

Reads from and writes to the AXI BAR0 address are done through the AXI Slave Interface. This is a full AXI interface supporting burst transactions and a wide data bus. In another class of PCIe device, it might be responsible for transferring large amounts of data to the device through the BAR0 address range. But for an NVMe drive, BAR0 just provides access to the NVMe Controller Registers, which are used to set up the drive and inform it of pending data transfers.

The AXI Master Interface is where all NVMe data transfer occurs, for both reads and writes. One way to look at it is that the drive itself contains the DMA engine, which issues memory reads and writes to the system (AXI) memory space through the bridge. The host requests that the drive perform these data transfers by submitting them to a queue, which is also contained in system memory and accessed through this interface.

Bare Metal NVMe

Fortunately, NVMe is an open standard. The specification is about 400 pages, but it's fairly easy to follow, especially with help from this tutorial. The NVMe Controller, which is implemented on the drive itself, does most of the heavy lifting. The host only has to do some initialization and then maintain the queues and lists that control data transfers. It's worth looking at a high-level diagram of what should be happening before diving in to the details of how to do it:

System-level look at NVMe data flow, with primary data streaming from a source to the drive.
After BAR0 is set, the host has access to the NVMe drive's Controller Registers through the bridge's AXI Slave Interface. They are just like any other device/peripheral control registers, used for low-level configuration, status, and control of the drive. The register map is defined in the NVMe Specification, Section 2.

One of the first things the host has to do is allocate some memory for the Admin Submission Queue and Admin Completion Queue. A Submission Queue (SQ) is a circular buffer of commands submitted to the drive by the host. It's written by the host and read by the drive (via the bridge AXI Master Interface). A Completion Queue (CQ) is a circular buffer of notifications of completed commands from the drive. It's written by the drive (via the bridge AXI Master Interface) and read by the host. 
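
Both queues hold fixed-size entries defined by the spec, so the driver mostly just shuffles two small structs around. A minimal sketch of the layouts (field comments per the NVMe spec; nothing here is specific to my implementation):

    #include <stdint.h>

    // 64-byte Submission Queue Entry (NVMe spec, Section 4.2).
    typedef struct {
        uint32_t cdw0;    // [7:0] opcode, [9:8] fused op, [15:14] PRP/SGL select, [31:16] Command ID
        uint32_t nsid;    // Namespace ID (1 for a single-namespace drive)
        uint32_t cdw2;
        uint32_t cdw3;
        uint64_t mptr;    // Metadata Pointer (unused here)
        uint64_t prp1;    // PRP Entry 1: address of the first data page
        uint64_t prp2;    // PRP Entry 2: second page, or pointer to a PRP List
        uint32_t cdw10;   // command-specific, e.g. starting LBA [31:0] for read/write
        uint32_t cdw11;   // command-specific, e.g. starting LBA [63:32]
        uint32_t cdw12;   // command-specific, e.g. [15:0] number of LBs (0-based)
        uint32_t cdw13;
        uint32_t cdw14;
        uint32_t cdw15;
    } sqe_t;
    _Static_assert(sizeof(sqe_t) == 64, "SQE must be 64 bytes");

    // 16-byte Completion Queue Entry (NVMe spec, Section 4.6).
    typedef struct {
        uint32_t dw0;     // command-specific result
        uint32_t dw1;
        uint32_t dw2;     // [15:0] SQ Head Pointer, [31:16] SQ ID
        uint32_t dw3;     // [15:0] Command ID, [16] Phase Tag, [31:17] Status Field
    } cqe_t;
    _Static_assert(sizeof(cqe_t) == 16, "CQE must be 16 bytes");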

The Admin SQ/CQ are used to submit and complete commands relating to drive identification, setup, and control. They can be located anywhere in system memory, as long as the bridge has access to them, but in the diagram above they're shown in the external DDR4. The host software notifies the drive of their address and size by setting the relevant Controller Registers (via the bridge AXI Slave Interface). After that, the host can start to submit and complete admin commands:
  1. The host software writes one or more commands to the Admin SQ.
  2. The host software notifies the drive of the new command(s) by updating the Admin SQ doorbell in the Controller Registers through the bridge AXI Slave Interface.
  3. The drive reads the command(s) from the Admin SQ through the bridge AXI Master Interface.
  4. The drive completes the command(s) and writes an entry to the Admin CQ for each, through the bridge AXI Master Interface. Optionally, an interrupt is triggered.
  5. The host reads the completion(s) and updates the Admin CQ doorbell in the Controller Registers, through the AXI Slave Interface, to tell the drive where to place the next completion.
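
In code, steps 1 and 2 reduce to a struct copy and a single register write. A sketch using the entry structs above (the queue array, its depth, and the register base are placeholders for whatever the application allocates, and it assumes CAP.DSTRD = 0, i.e. a 4-byte doorbell stride):

    #define NVME_REG_BASE     0xA0000000UL   // AXI address assigned to BAR0 in Vivado (assumption)
    #define ADMIN_QUEUE_DEPTH 16

    extern volatile sqe_t admin_sq[ADMIN_QUEUE_DEPTH];  // in DDR4, visible to the bridge AXI Master
    static uint16_t admin_sq_tail = 0;

    static inline void reg_write32(uint32_t offset, uint32_t val)
    {
        *(volatile uint32_t *)(NVME_REG_BASE + offset) = val;
    }

    // Steps 1 and 2: write a command to the Admin SQ, then ring the SQ 0 Tail Doorbell.
    void admin_submit(const sqe_t *cmd)
    {
        admin_sq[admin_sq_tail] = *cmd;                           // 1. write the command entry
        admin_sq_tail = (admin_sq_tail + 1) % ADMIN_QUEUE_DEPTH;  //    advance the circular tail
        // If the D-cache covers the queue memory, it must be flushed to DDR4 here.
        __asm__ volatile("dsb sy");                               //    order the entry before the doorbell
        reg_write32(0x1000, admin_sq_tail);                       // 2. Admin SQ Tail Doorbell
    }
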
In some cases, an admin command may request identification or capability data from the drive. If the data is too large to fit in the Admin CQ entry, the command will also specify an address to which to write the requested data. For example, during initialization, the host software requests the Controller Identification and Namespace Identification structures, described in the NVMe Specification, Section 5.15.2. These contain information about the capabilities, size, and low-level format (below the level of file systems or even partitions) of the drive. The space for these IDs must also be allocated in system memory before they're requested.

Within the IDs is information that indicates the Logical Block (LB) size, which is the minimum addressable memory unit in the non-volatile memory. 512B is typical, although some drives can also be formatted for 4KiB LBs. Many other variables are given in units of LBs, so it's important for the host to grab this value. There are also maximum and minimum page sizes, defined in the Controller Registers themselves, which apply to system memory. It's up to the host software to configure the actual system memory page size in the Controller Registers, but it has to be between these two values. 4KiB is both the absolute minimum and the typical value. It's still possible to address system memory in smaller increments (down to 32-bit alignment); this value just affects how much can be read/written per page entry in an I/O command or PRP List (see below).
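
The page size bounds live in the CAP register and the host's choice goes into CC.MPS, encoded as a power of two above 4KiB. A sketch of that handshake (reusing the hypothetical reg_write32 from the earlier sketch; field positions per the Controller Registers section of the spec):

    #define NVME_CAP 0x00   // Controller Capabilities (64-bit)
    #define NVME_CC  0x14   // Controller Configuration

    static inline uint64_t reg_read64(uint32_t offset) { return *(volatile uint64_t *)(NVME_REG_BASE + offset); }
    static inline uint32_t reg_read32(uint32_t offset) { return *(volatile uint32_t *)(NVME_REG_BASE + offset); }

    // Page size is encoded as 2^(12 + MPS): CAP bounds what the drive accepts,
    // CC.MPS is what the host actually picks (4KiB here). Must be set before CC.EN.
    int nvme_set_page_size(void)
    {
        uint64_t cap    = reg_read64(NVME_CAP);
        uint32_t mpsmin = (cap >> 48) & 0xF;   // CAP.MPSMIN
        uint32_t mpsmax = (cap >> 52) & 0xF;   // CAP.MPSMAX

        uint32_t mps = 0;                      // 2^(12+0) = 4KiB, the usual minimum
        if (mps < mpsmin || mps > mpsmax)
            return -1;                         // drive doesn't support 4KiB pages (unlikely)

        uint32_t cc = reg_read32(NVME_CC);
        cc = (cc & ~(0xFU << 7)) | (mps << 7); // CC.MPS, bits [10:7]
        reg_write32(NVME_CC, cc);
        return 0;
    }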

Once all identification and configuration tasks are complete, the host software can then set up one or more I/O queue pairs. In my case, I just want one I/O SQ and one I/O CQ. These are allocated in system memory, then the drive is notified of their address and size via admin commands. The I/O CQ must be created first, since the I/O SQ creation references it. Once created, the host can start to submit and complete I/O commands, using a similar process as for admin commands.
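
As a sketch of that step, building on the earlier sketches (opcodes from the Admin Command Set; the queue arrays, next_cid(), and admin_wait_completion() are placeholders for the application's own allocation and completion handling):

    #define OPC_CREATE_IO_SQ 0x01   // Admin Command Set opcodes
    #define OPC_CREATE_IO_CQ 0x05
    #define IO_QUEUE_DEPTH   16

    extern volatile sqe_t io_sq[IO_QUEUE_DEPTH];   // physically contiguous, in DDR4
    extern volatile cqe_t io_cq[IO_QUEUE_DEPTH];

    void nvme_create_io_queues(void)
    {
        sqe_t cmd = {0};

        // Create I/O CQ 1 first: the SQ creation below refers to it.
        cmd.cdw0  = OPC_CREATE_IO_CQ | ((uint32_t)next_cid() << 16);
        cmd.prp1  = (uint64_t)(uintptr_t)io_cq;
        cmd.cdw10 = ((IO_QUEUE_DEPTH - 1) << 16) | 1;  // [31:16] queue size - 1, [15:0] queue ID = 1
        cmd.cdw11 = 0x1;                               // [0] physically contiguous, [1] interrupts disabled (polling)
        admin_submit(&cmd);
        admin_wait_completion();

        // Create I/O SQ 1, bound to CQ 1.
        cmd = (sqe_t){0};
        cmd.cdw0  = OPC_CREATE_IO_SQ | ((uint32_t)next_cid() << 16);
        cmd.prp1  = (uint64_t)(uintptr_t)io_sq;
        cmd.cdw10 = ((IO_QUEUE_DEPTH - 1) << 16) | 1;  // [31:16] queue size - 1, [15:0] queue ID = 1
        cmd.cdw11 = (1 << 16) | 0x1;                   // [31:16] CQ ID = 1, [0] physically contiguous
        admin_submit(&cmd);
        admin_wait_completion();
    }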

I/O commands perform general purpose writes (from system memory to non-volatile memory) or reads (from non-volatile memory to system memory) over the bridge's AXI Master Interface. If the data to be transferred spans more than two memory pages (typically 4KiB each), then a Physical Region Page (PRP) List is created along with the command. For example, a write of 24 512B LBs starting in the middle of a 4KiB page might reference the data like this:

A PRP List is required for data transfers spanning more than two memory pages.
The first PRP Address in the I/O command can have any 32-bit-aligned offset within a page, but subsequent addresses must be page-aligned. The drive knows whether to expect a PRP Address or PRP List Pointer in the second PRP field of the I/O command based on the amount of data being transferred. It will also only pull as much data as is needed from the last page on the list to reach the final LB count. There is no requirement that the pages in the PRP list be contiguous, so it can also be used as a scatter-gather list with 4KiB granularity. The PRP List for a particular command must be kept in memory until that command completes, so some kind of PRP Heap is necessary if multiple commands can be in flight.
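
A sketch of filling in those PRP fields for one command (4KiB pages assumed; the PRP heap and slot index are placeholders for however the application tracks in-flight commands, and the sketch assumes the transfer fits in a single PRP List page, i.e. roughly 2MiB or less):

    #define PAGE_SIZE    4096UL
    #define MAX_IO_SLIP  8     // hypothetical cap on in-flight I/O commands

    // One page worth of PRP entries per in-flight command, page-aligned.
    extern uint64_t prp_heap[MAX_IO_SLIP][PAGE_SIZE / 8];

    // Fill PRP1/PRP2 for a transfer of [addr, addr + len). PRP1 may start mid-page;
    // every subsequent page address must be page-aligned.
    void build_prps(sqe_t *cmd, uint64_t addr, uint64_t len, int slot)
    {
        cmd->prp1 = addr;

        uint64_t first = PAGE_SIZE - (addr & (PAGE_SIZE - 1));  // bytes left in the first page
        if (len <= first) {
            cmd->prp2 = 0;                          // one page: PRP2 unused
        } else if (len <= first + PAGE_SIZE) {
            cmd->prp2 = addr + first;               // two pages: PRP2 is the second page address
        } else {
            uint64_t *list = prp_heap[slot];        // more than two pages: PRP2 points to a PRP List
            uint64_t page  = addr + first;
            for (int i = 0; page < addr + len; i++, page += PAGE_SIZE)
                list[i] = page;                     // the drive only pulls what it needs from the last page
            cmd->prp2 = (uint64_t)(uintptr_t)list;
        }
    }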

Some (most?) drives also have a Volatile Write Cache (VWC) that buffers write data. In this case, an I/O write completion may not indicate that the data has been written to non-volatile memory. An I/O flush command forces this data to be written to non-volatile memory before a completion entry is written to the I/O CQ for that flush command.

That's about it for things that are described explicitly in the specification. Everything past this point is implementation detail that is much more application-specific.

A key question the host software NVMe driver needs to answer is whether or not to wait for a particular completion before issuing another command. For admin commands that run once during initialization and are often dependent on data from previous commands, it's fine to always wait. For I/O commands, though, it really depends. I'll be using write commands as an example, since that's my primary data direction, but there's a symmetric case for reads.

If the host software issues a write command referencing a range of data in system memory and then immediately changes the data, without waiting for the write command to be completed, then the write may be corrupted. To prevent this, the software could:
  1. Wait for completion before allowing the original data to be modified. (Maybe there are other tasks that can be done in parallel.)
  2. Copy the data to an intermediate buffer and issue the write command referencing that buffer instead. The original data can then be modified without waiting for completion.
Both could have significant speed penalties. The copy option is pretty much out of the question for me. But usually I can satisfy the first constraint: If the data is from a stream that's being buffered in memory, the host software can issue NVMe write commands that consume one end of the stream while the data source is feeding in new data at the other end. With appropriate flow control, these write commands don't have to wait for completion.

My "solution" is just to push the decision up one layer: the driver never blocks on I/O commands, but it can inform the application of the I/O queue backlog as the slip between the queues, derived from sequentially-assigned command IDs. If a particular process thinks it can get away without waiting for completions, it can put more commands in flight (up to some slip threshold).
An example showing the driver ready to submit Command ID 72, with the latest completion being Command ID 67. The doorbells always point to the next free slot in the circular buffer, so the entry there has the oldest ID.
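
Concretely, the slip falls out of the sequentially-assigned Command IDs with a couple of lines of unsigned arithmetic (a sketch; the variable and function names are mine):

    static uint16_t io_next_cid = 0;        // next Command ID the driver will submit
    static uint16_t io_done_cid = 0xFFFF;   // Command ID of the latest I/O completion

    // Commands in flight, e.g. next CID 72 with latest completion 67 gives a slip of 4.
    // Plain uint16_t arithmetic handles the Command ID wrap-around.
    static inline uint16_t io_slip(void)
    {
        return (uint16_t)(io_next_cid - io_done_cid - 1);
    }

    // The application chooses the threshold; the driver itself never blocks.
    int nvme_ok_to_submit(uint16_t max_slip)
    {
        return io_slip() < max_slip;
    }
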
I'm also totally fine with polling for completions, rather than waiting for interrupts. Having a general-purpose completion polling function that takes as an argument a maximum number of completions to process in one call seems like the way to go. NVMeDirect, SPDK, and depthcharge all take this approach. (All three are good open-source references for light and fast NVMe drivers.)
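
My version of that function looks roughly like this (a sketch building on the structs and register helpers above; it detects new entries with the Phase Tag, which the drive toggles on every pass through the circular buffer, and assumes a 4-byte doorbell stride):

    static uint16_t io_cq_head  = 0;
    static uint8_t  io_cq_phase = 1;   // expected Phase Tag; flips each time the CQ wraps

    // Process up to max_completions new I/O completions, then ring the CQ 1 Head Doorbell.
    int nvme_poll_io_completions(int max_completions)
    {
        int n = 0;
        while (n < max_completions) {
            // If the D-cache covers the CQ memory, invalidate this entry before reading it.
            uint32_t dw3 = io_cq[io_cq_head].dw3;
            if (((dw3 >> 16) & 0x1) != io_cq_phase)
                break;                              // Phase Tag mismatch: no new completion yet

            io_done_cid     = dw3 & 0xFFFF;         // completed Command ID
            uint16_t status = (dw3 >> 17) & 0x7FFF; // 0 = success
            if (status) { /* log or handle the error */ }

            if (++io_cq_head == IO_QUEUE_DEPTH) {   // wrap and flip the expected phase
                io_cq_head = 0;
                io_cq_phase ^= 1;
            }
            n++;
        }
        if (n)
            reg_write32(0x1000 + 3 * 4, io_cq_head);  // CQ 1 Head Doorbell: 0x1000 + (2*1+1)*4
        return n;
    }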

With this set up, I can run a speed test that issues read/write commands for blocks of data as fast as possible while trying to hold the I/O slip at a constant value:

Speed test for raw NVMe write/read on a 1TB Samsung 970 Evo Plus.
For smaller block transfers, the bottleneck is on my side, either in the driver itself or in a limit on the throughput of bus transactions somewhere in the system. But for larger block transfers (32KiB and above) the read and write speeds split, suggesting that the drive becomes the bottleneck. And that's totally fine with me, since it's hitting 64% (write) and 80% (read) of the maximum theoretical PCIe Gen3 x4 bandwidth.

Sustained write speeds begin to drop off after about 32GiB. The Samsung Evo SSDs have a feature called TurboWrite that uses some fraction of the non-volatile memory array as fast Single-Level Cell (SLC) memory to buffer writes. Unlike the VWC, this is still non-volatile memory, but it gets transferred to more compact Multi-Level Cell (MLC) memory later since it's slower to write multi-level cells. The 1TB drive that I'm using has around 42GB of TurboWrite capacity according to this review, so a drop off in sustained write speeds after 32GiB makes sense. Even the sustained write speed is 1.7GB/s, though, which is more than fast enough for my application.

A bigger issue with sustained writes might be getting rid of heat. This drive draws about 7W during max speed writing, which nearly doubles the total dissipated power of the whole system, probably making a fan necessary. Then again, at these write speeds a 0.2kg chunk of aluminum would only heat up about 25ºC before the drive is full... In any case, the drive will also need a good conduction path to the rear enclosure, which will act as the heat sink.
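
For a rough sanity check on that figure (assuming all ~7W ends up in a 0.2kg aluminum block with a specific heat of about 0.9J/g·K, and the ~1.7GB/s sustained rate from above):

    1TB / 1.7GB/s ≈ 590s of maximum-rate writing
    7W × 590s ≈ 4.1kJ into the heat sink
    4.1kJ / (200g × 0.9J/g·K) ≈ 23ºC of temperature rise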

FatFs

I am more than content with just dumping data to the SSD directly as described above and leaving the task of organizing it to some later, non-time-critical process. But, if I can have it arranged neatly into files on the way in, all the better. I don't have much overhead to spare for the file system operations, though. Luckily, ChaN gifted the world FatFs, an ultralight FAT file system module written in C. It's both tiny and fast, since it's designed to run on small microcontrollers. An ARM Cortex-A53 running at 1.2GHz is certainly not the target hardware for it. But, I think it's still a good fit for a very fast bare metal application.

FatFs supports exFAT, but using exFAT still requires a license from Microsoft. I think I can instead operate right on the limits of what FAT32 is capable of:
  • A maximum of 2^32 LBs. For 512B LBs, this supports up to a 2TiB drive. This is fine for now.
  • A maximum cluster size (unit of file memory allocation and read/write operations) of 128 LBs. For 512B LBs, this means 64KiB clusters. This is right at the point where I hit maximum (drive-limited) write speeds, so that's a good value to use.
  • A maximum file size of 4GiB. This is the limit of my RAM buffer size anyway. I can break up clips into as many files as I want. One file per frame would be convenient, but not efficient.
Linking FatFs to NVMe couldn't really get much simpler: FatFs's diskio.c device interface functions already request reads and writes in units of LBs, a.k.a. sectors. There's also a sync function that matches up nicely to the NVMe flush command. The only potential issue is that FatFs can ask for byte-aligned transfers, whereas NVMe only allows 32-bit alignment. My tentative understanding is that this can only happen via calls to f_read() or f_write(), so the application can guard against it.

For file system operations, FatFs reads and writes single sectors to and from a working buffer in system memory. It assumes that the read or write is complete when the disk_read() or disk_write() function returns, so the diskio.c interface layer has to wait for completion for NVMe commands issued as part of file system operations. To enforce this, but still allow high-speed sequential file writing from a data stream, I check the address of the disk_write() system memory buffer. If it's in OCM, I wait for completion. If it's in DDR4, I allow slip. For now, I wait for completion on all disk_read() calls, although a similar mechanism could work for high-speed stream reading. And of course, disk_ioctl() calls for CTRL_SYNC issue an NVMe flush command and wait for completion.

Interface between FatFs and NVMe through diskio.c, allowing stream writes from DDR4.
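
The write side of that glue is only a few lines (a sketch against recent FatFs releases, where the sector argument is an LBA_t; nvme_write() is a placeholder for my driver's non-blocking I/O write, io_slip() and nvme_poll_io_completions() are from the sketches above, and the OCM range is the standard 256KB block at the top of the Zynq Ultrascale+ 32-bit address map):

    #include "ff.h"       // FatFs
    #include "diskio.h"   // FatFs media access interface: DRESULT, disk_write(), ...

    #define OCM_BASE 0xFFFC0000UL   // Zynq Ultrascale+ On-Chip Memory, 256KB

    DRESULT disk_write(BYTE pdrv, const BYTE *buff, LBA_t sector, UINT count)
    {
        (void)pdrv;
        uintptr_t addr = (uintptr_t)buff;

        // Submit a non-blocking NVMe write of `count` LBs starting at `sector`.
        nvme_write(sector, (uint64_t)addr, count);

        if (addr >= OCM_BASE && addr <= 0xFFFFFFFFUL) {
            // File system working buffer in OCM: FatFs assumes the data is on the
            // drive when this returns, so drain the queue.
            while (io_slip() > 0)
                nvme_poll_io_completions(IO_QUEUE_DEPTH);
        } else {
            // Stream data in DDR4: allow slip, just keep the backlog bounded.
            while (io_slip() >= MAX_IO_SLIP)
                nvme_poll_io_completions(IO_QUEUE_DEPTH);
        }
        return RES_OK;
    }
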
I also clear the queue prior to a read to avoid unnecessary read/write turnarounds in the middle of a streaming write. This logic obviously favors writes over reads. Eventually, I'd like to make a more symmetric and configurable diskio.c layer that allows fast stream reading and writing. It would be nice if the application could dynamically flag specific memory ranges as streamable for reads or writes. But for now this is good enough for some write speed testing:

Speed test for FatFs NVMe write on a 1TB Samsung 970 Evo Plus.
There's a very clear penalty for creating and closing files, since the process involves file system operations, including reads and flushes, that will have to wait for NVMe completions. But for writing sequentially to large (1GiB) files, it's still exceeding my 1GB/s requirement, even for total transfer sizes beyond the TurboWrite limit. So I think I'll give it a try, with the knowledge that I can fall back to raw writing if I really need to.

Utilization Summary

The good news is that the NVMe driver (not including the Queues, PRP Heap, and IDs) and FatFs together take up only about 27KB of system memory, so they should easily run in OCM with the rest of the application. At some point, I'll need to move the .text section to flash, but for now I can even fit that in OCM. The source is here, but be aware that it's entirely a test implementation, not at all intended to be a drop-in driver for any other application.

The bad news is that the XCZU4 is now pretty much completely full...

We're gonna need a bigger chip.
The AXI-PCIe bridge takes up 12960 LUTs, 17871 FFs, and 34 BRAMs. That's not even including the additional AXI Interconnects. The only real hope I can see for shrinking it would be to cut the AXI Slave Interface down to 32-bit, since it's only needed to access Controller Registers at BAR0. But I don't really want to go digging around in the bridge HDL if I can avoid it. I'd rather spend time optimizing my own cores, but I think no matter what I'll need more room for additional features, like decoding/HDMI preview and the subsampled 2048x1536 mode that might need double the number of Stage 1 Wavelet horizontal cores. 

So, I think now is the right time to switch over to the XCZU6, along with the v0.2 Carrier. It's pricey, but it's a big step up in capability, with twice the DDR4 and more than double the logic. And it's closer to impedance-matched to the cost of the sensor...if that's a thing. With the XCZU6, I think I'll have plenty of room to grow the design. It's also just generally easier to meet timing constraints with more room to move logic around, so the compile times will hopefully be lower.

Hopefully the next update will be with the whole 3.8Gpx/s continuous image capture pipeline working together for the first time!