Saturday, March 14, 2020

HDMI, the Hard Way

If I were to rank the components of this project in terms of the ratio of their actual vs. expected difficulty, the NVMe interface would probably be lowest, since it was nowhere near as hard as I thought it would be. The CMV12000 input (easy, expected to be easy) and wavelet engine (hard, expected to be hard) would be somewhere in the middle. And the new top of the list, the hardest module that should have been easy, would be the HDMI output.


There seem to be two main reference designs for outputting an HDMI signal from a Zynq SoC. Zynq-7000 series boards such as the ZC70x and Zedboard use an external HDMI transmitter, the ADV7511, to convert a parallel RGB interface into serial HDMI TMDS outputs. Zynq Ultrascale+ boards such as the ZCU10x and UltraZed-EV Carrier Card use the built-in serial transceivers of the ZU+ to drive the TMDS outputs through a SN65DP159 HDMI retimer. The latter is a more modern approach, supporting up to 4K60 through the HDMI TX Subsystem IP. But, that IP is not included with Vivado. It also requires three free GTH transceiver channels, which I don't have on the XCZU4. (Its four available channels are in use for PCIe Gen3 to the SSD.)

There's nothing wrong with using an external HDMI transmitter with the ZU+, though. I left a PL GPIO bank open specifically for a parallel RGB pixel bus, either for an LCD controller or an HDMI interface. I opted for the slightly newer ADV7513, which supports up to 1080p60 at 8-bit. This is perfectly acceptable as a preview and local playback resolution. Outputting a full 4K frame over HDMI might be useful for interfacing with a RAW recorder, but that is out of the question anyway at 400fps. In fact, I only really need a 24-30fps HDMI output, which means a very manageable 74.25MHz pixel clock, based on the CEA-861 standard.

HDMI timing parameters for 1920x1080p 24/25/30Hz with a 74.25MHz pixel clock.
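The 74.25MHz figure can be sanity-checked with a few lines of Python. This is a minimal sketch: the total line and frame sizes (active plus blanking) are the standard CEA-861 values for 1920x1080p, and the names `H_TOTAL`, `V_TOTAL`, and `pixel_clock_hz` are just for illustration.

```python
# Total pixels per line (active + horizontal blanking) for 1920x1080p,
# per CEA-861. All three formats share 1125 total lines per frame.
H_TOTAL = {24: 2750, 25: 2640, 30: 2200}
V_TOTAL = 1125

def pixel_clock_hz(fps):
    """Pixel clock = total pixels per frame * frames per second."""
    return H_TOTAL[fps] * V_TOTAL * fps

for fps in (24, 25, 30):
    print(fps, "Hz ->", pixel_clock_hz(fps) / 1e6, "MHz")
```

The three frame rates differ only in horizontal blanking, which is why one pixel clock covers all of them.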
Generating the required pixel clock, sync, and dummy RGB signals in the ZU+ Programmable Logic (PL) is pretty simple; I set that up as a module on day one of playing with HDMI. Typically, you'd just point this module at a frame buffered in RAM and let it pull the real data. (There are video DMAs and drivers that will do this more-or-less automatically.) But here's where I run into a slight problem: I don't actually have a frame buffered in RAM.

The Hard Way

While it is possible to write the full 3.8Gpx/s raw frame data to RAM on the ZU+, it would be futile to try doing any significant processing on it there. Even if I used all three 128b AXI bus connections between the PL and the memory controller at 250MHz, that would allow for less than three accesses per pixel...including the initial write. The Processing System (PS) has a similar memory access constraint, although processing pixels serially on the ARM cores is much too slow anyway. So I made the decision early on to implement the wavelet compression engine in PL hardware and write the ~5:1 compressed codestreams to RAM instead, on their way to the SSD.
The capture pipeline, with the main data path highlighted and shown decreasing in width where compression occurs at the PL Encoder, before data is written to DDR4 RAM.
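The memory bandwidth argument above can be put into rough numbers. This is a back-of-the-envelope sketch, not a measurement: the bus figures come from the text, but the 10 bits per pixel is an assumption (this post doesn't state the raw pixel depth), and the function name is hypothetical.

```python
def accesses_per_pixel(ports=3, bus_bits=128, bus_hz=250e6,
                       pixel_rate=3.8e9, bits_per_pixel=10):
    """Ideal PL-to-RAM bandwidth divided by the raw pixel data rate.

    bits_per_pixel is an assumed value; everything else is from the text.
    Real AXI/DDR efficiency would push the result lower still.
    """
    bus_bits_per_s = ports * bus_bits * bus_hz        # 96 Gb/s at the defaults
    pixel_bits_per_s = pixel_rate * bits_per_pixel    # 38 Gb/s at the defaults
    return bus_bits_per_s / pixel_bits_per_s

print(round(accesses_per_pixel(), 2))                   # 10-bit assumption
print(round(accesses_per_pixel(bits_per_pixel=12), 2))  # 12-bit assumption
```

Either way the budget is under three reads or writes per pixel, with one of them already spent on the initial write.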
"No problem," you might say, "just split off raw data from the sensor and feed it to the HDMI module." Unfortunately, this doesn't quite work: In the time it takes the HDMI scan to complete one row, the capture pipeline has processed 50+ rows from the CMV12000. The input and output are just not in sync, and any attempt to buffer partial frames between them would require much more block RAM than I have available. It would also cause frame tearing that would ruin any attempt to preview periodic phenomena with the global shutter.

The only real choice is to put the HDMI output module after the RAM buffer, which means decoding compressed frame data on the way out:
The only logical place to put the HDMI output, and not just because I left space for it there in the block diagram.
The HDMI module reads codestream data from RAM as an AXI Master, decodes the pixel values, and runs an Inverse Discrete Wavelet Transform (IDWT) to recover the raw image. While this is a lot more work, it pays off twofold because the same module can be used for playback by reading frames back out of the SSD into RAM and pointing the decoder at them.

Notwithstanding the design effort, the actual resource utilization of this module should be pretty low. For one, only four of the sixteen codestreams need to be decoded to reconstruct a 2048px-wide image to use for the preview; there's no need to decode any LH1, HL1, or HH1 data. Also, the preview frame rate is at least 10x slower than the capture frame rate, so the amount of parallelism needed in the decoding and IDWT pipeline is much lower. Still, it's more logic on an already-crowded chip.

Kill Your Darlings

At this point I'm stubbornly committed to fitting this design on the XCZU4. With the capture pipeline complete, I was getting pretty close to maxing out this chip, especially the LUTs (65593 / 87840) and BRAMs (122 / 128). And this was after a significant optimization pass on all the cores, including trimming pixel math operations from 16-bit to 12-bit where applicable and removing debug interfaces. These bottlenecks were already causing routing difficulty that was pushing up compile times, so I needed to make more room somehow. And then one day I woke up and decided to delete Wavelet Stage 3.
An example showing the effect of deleting the third DWT stage without changing the target compression ratios of any other stages. The red bars are each sized proportionally to the compressed sub-band they represent.
Stage 3 only handles 1/16 of the total data throughput, but it is visually the most significant and thus uses the least amount of compression. In the example above, replacing Stage 3's output with a raw 1/4-scale average image (LL2) has a relatively small effect on the overall compression ratio. It's also not a complete loss, since the 1:1 LL2 will yield slightly better visual quality if the other subbands remain unchanged. The distribution of bandwidth that achieves the best image quality with an overall compression ratio of 5:1 is still an unknown, but ditching Stage 3 probably isn't restricting the search space too far.
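To see why swapping Stage 3 for raw LL2 moves the overall ratio only modestly, here is a small sketch. The subband data fractions follow from the wavelet structure (each Stage 1 high-pass subband is 1/4 of the data, each Stage 2 subband 1/16, each Stage 3 subband 1/64), but the per-stage ratios of 8:1, 4:1, and 2:1 are made-up placeholders, not the values from the figure.

```python
def overall_ratio(bands):
    """bands: list of (fraction_of_raw_data, compression_ratio) pairs."""
    assert abs(sum(f for f, _ in bands) - 1.0) < 1e-9
    return 1.0 / sum(f / r for f, r in bands)

# Placeholder ratios, for illustration only.
stage1_hp = [(1 / 4, 8.0)] * 3      # HL1, LH1, HH1
stage2_hp = [(1 / 16, 4.0)] * 3     # HL2, LH2, HH2

three_stage = stage1_hp + stage2_hp + [(1 / 64, 2.0)] * 4   # LL3, HL3, LH3, HH3
two_stage = stage1_hp + stage2_hp + [(1 / 16, 1.0)]         # raw LL2 at 1:1

print(round(overall_ratio(three_stage), 2))
print(round(overall_ratio(two_stage), 2))
```

With these placeholder numbers the overall ratio drops from roughly 5.8:1 to roughly 4.9:1, because the uncompressed LL2 is only 1/16 of the raw data to begin with.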

Although Stage 3 is by far the smallest wavelet core, removing it also simplifies a lot of downstream logic. The "XX3" encoder, which previously handled all four Stage 3 subbands by cycling through different inputs and quantizer settings, now becomes a pass-through for raw LL2 data. It also now has the same latency as the HL2, LH2, and HH2 encoders. This latency is the new maximum and is significantly lower than the former XX3 latency. (It's no longer necessary to wait for six whole LL2 rows for the Stage 3 DWT.) There's a symmetric payoff on the decoder side as well.

So while I'm sad to see it go, I think it's the right call for now. Having three stages probably does improve the compression performance (objectively, the PSNR at a given compression ratio), but I think I can still achieve good image quality at an overall ratio of 5:1 with only two. Not even including prospective decoder savings, the reduction in LUTs (-4575), FFs (-5320), and most crucially BRAMs (-8) is well worth it.

Working Backwards

In many ways, the HDMI output module is just a mirror image of the pixel input pipeline, from the deserialized CMV12000 input pixels to the AXI Master that writes encoded data to RAM. The 74.25MHz HDMI clock runs a master pixel counter that scans across and down the output frame. Whereas the CMV12000 clocks in 64 pixels in parallel, though, the HDMI only has to clock out one.

Or does it? Each HDMI pixel (in RGB 4:4:4 format) consists of an 8-bit red, green, and blue value, whereas the Bayer-masked sensor input is split into four interleaved color fields. Each color field's decoded LL1 image will only be 1024px wide. One option would be to center this in the HDMI frame and pull the 8-bit R, G, and B values directly from each color field's LL1:
1:1 scaling from LL1 color field pixels to HDMI pixels.
In this case, each HDMI clock requires one pixel from each of the four color fields (the two greens are averaged). The logic couldn't really get any simpler. But, it makes poor use of the 1920x1080 HDMI frame, especially for widescreen aspect ratios. An alternative would be to scale everything up by a factor of two:
2:1 scaling from LL1 color field pixels to HDMI pixels.
Now, a debayering method has to be used to reconstruct the missing color values at each pixel. For this application, a simple average of the neighboring pixels would be fine. (The off-line decoder uses a more complex, higher-quality method.) Each HDMI pixel now references as many as four pixels from each color field. But, these pixels don't all update at each HDMI clock. The average pixel consumption from each color field is actually only one per four HDMI clocks, as expected from the 2:1 scaling factor.

But a 2:1 scaled preview doesn't fit in 1920x1080. The cropping isn't too bad for widescreen aspect ratios, but it's unusable for 4:3. Switching between 1:1 and 2:1 scaling depending on the aspect ratio would work, but adds a lot of conditional logic for a still-compromised result. An arbitrary software-controlled scaling between 1:1 and 2:1 would be so much better. So, time to break out the DSPs:
Arbitrary scaling from LL1 color field pixels to HDMI pixels, using bilinear interpolation.
To achieve arbitrary scaling, the four 1024px-wide LL1 color fields are resampled onto a 65536px-wide grid, accounting for the offsets between the centers of pixels of each color. Then, a viewport is defined within the HDMI frame and normalized onto this 16-bit grid (using DSPs). The four pixel centers of each color field that box in the normalized viewport coordinate are used for bilinear interpolation (using more DSPs) to produce the R, G, and B values. This is also the debayer step, thanks to the pixel center offsets.
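The resampling scheme can be modeled in software as follows. This is a minimal sketch of fixed-point bilinear interpolation on a 16-bit grid, not the actual HDL: the function names are hypothetical, the per-color center offsets (`off_x`, `off_y`) are left as parameters because their real values aren't given here, and the vertical axis is assumed to use the same 64-unit spacing as the horizontal one.

```python
GRID_BITS = 16                        # 65536-wide coordinate grid
FIELD_W = 1024                        # LL1 color field width, in pixels
STEP = (1 << GRID_BITS) // FIELD_W    # 64 grid units between LL1 pixel centers

def viewport_to_grid(h, start, width):
    """Normalize HDMI column h inside a viewport onto the 16-bit grid."""
    return ((h - start) << GRID_BITS) // width

def split(coord, center_offset):
    """Index of the LL1 pixel center at or before coord, plus a 0..STEP-1 weight."""
    rel = coord - center_offset
    idx = rel // STEP
    return idx, rel - idx * STEP

def bilinear(field, x, y, off_x, off_y):
    """Interpolate one color field (a list of rows) at grid point (x, y)."""
    ix, fx = split(x, off_x)
    iy, fy = split(y, off_y)

    def px(r, c):
        # Clamp at the field edges (an assumed edge policy).
        r = min(max(r, 0), len(field) - 1)
        c = min(max(c, 0), len(field[0]) - 1)
        return field[r][c]

    top = px(iy, ix) * (STEP - fx) + px(iy, ix + 1) * fx
    bot = px(iy + 1, ix) * (STEP - fx) + px(iy + 1, ix + 1) * fx
    return (top * (STEP - fy) + bot * fy) // (STEP * STEP)

field = [[0, 64], [128, 192]]
print(bilinear(field, 64, 64, 32, 32))   # halfway between all four pixels
```

Calling `bilinear` once per color field with a different offset for each is what makes the same step double as the debayer: every field is sampled at the same grid point, but relative to its own pixel centers.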

One thing I actually do have plenty of is DSPs, and this seems like a great use for 14 of them. Being able to reposition and rescale the preview image from software makes life a lot easier. The down-side is that sixteen LL1 pixels are required to generate a single HDMI pixel. But as with the 2:1 case, the input pixels don't all change with every HDMI clock. The average LL1 pixel consumption rate will depend on the scale, but if the viewport width is always at least 1024px, it will never exceed one LL1 pixel per color field per HDMI clock. All upstream logic in the decoder is designed with this constraint in mind.


Next upstream is the Inverse Discrete Wavelet Transform (IDWT). One of the most significant simplifications achieved by deleting Wavelet Stage 3 is that the HDMI output module only has to do one stage of IDWT: Stage 2. This stage recovers LL1 from the LL2, LH2, HL2, and HH2 subbands. The order of operations is reversed in the IDWT: vertical first, then horizontal. Since we're working backwards from the HDMI output, let's look at the horizontal core first.

The forward horizontal DWT core is heavily optimized for speed and size using only FF-based distributed memory. In the inverse direction, there's a lot more breathing room. Only four cores are needed (one per color field) and they only need to process at most one pixel per HDMI clock. So, I am able to combine the horizontal IDWT with a block RAM buffer and output shift register pretty easily. I'm almost completely out of BRAMs, but I have plenty of UltraRAM (URAM) for this.
Horizontal IDWT and output buffer for one color field built around a single URAM.
Each URAM is 32KiB, enough to store 16 rows of LL1 data. The oldest two rows (N+0 and N+1) feed output shift registers that end in the four pixels the bilinear interpolator needs. The horizontal IDWT is performed on data from Row N+3, its result written back to Row N+2. As in the forward direction, pixels are processed in 64-bit groups of four: two interleaved pairs of low-pass and high-pass values become four LL1 outputs. Two half-speed shift registers unpack 64-bit URAM reads for the IDWT and pack the results into 64-bit writes. Running the IDWT as a single combinational step is not as efficient as using sequential lifting steps, as in the forward horizontal DWT, but it's a bit simpler to do with shift registers. Meanwhile, new data from the vertical stage is fed in at Row N+6.
Vertical IDWT for one color field built around a single URAM.
The vertical IDWT cores are also each built around a single URAM. In this case, the URAM is split in half for low-pass (HL2/LL2) and high-pass (HH2/LH2) vertical data. Four pixels each from three rows of low-pass data (N+0 to N+2) and one row of high-pass data (N+9) are processed every four clocks to create two four-pixel outputs to write to horizontal core URAM. In a shameful waste of clock cycles, input rows are scanned twice and the output write alternates between the even and odd IDWT results. (There are other ways to deal with the 2:1 row scanning ratio, but I'm willing to trade power for simpler logic right now.) Meanwhile, raw interleaved LL2, LH2, HL2, and HH2 data are written into rows somewhere just ahead of the IDWT read pointers.
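The "three low-pass inputs plus one high-pass input give two outputs" pattern can be illustrated in one dimension. The exact filter isn't restated in this post, so the sketch below assumes a reversible integer 2/6-style wavelet (a pair average and difference, with the difference predicted from the neighboring averages). That filter choice, the rounding, the edge clamping, and all function names are assumptions for illustration, not the actual cores.

```python
def predict(s, i):
    """Predict the pair difference at i from its low-pass neighbors (edges clamped)."""
    left = s[max(i - 1, 0)]
    right = s[min(i + 1, len(s) - 1)]
    return (left - right + 2) >> 2

def forward(x):
    """Forward transform of an even-length list -> (low-pass, high-pass)."""
    n = len(x) // 2
    s = [(x[2 * i] + x[2 * i + 1]) >> 1 for i in range(n)]   # pair averages
    d = [x[2 * i] - x[2 * i + 1] for i in range(n)]          # pair differences
    h = [d[i] - predict(s, i) for i in range(n)]             # prediction residual
    return s, h

def inverse(s, h):
    """Each step reads s[i-1], s[i], s[i+1] and h[i], and emits two samples."""
    out = []
    for i in range(len(s)):
        d = h[i] + predict(s, i)          # undo the prediction
        even = s[i] + ((d + 1) >> 1)      # undo the average/difference pair
        out += [even, even - d]
    return out

x = [3, 200, 17, 17, 1023, 0, 512, 511]
low, high = forward(x)
print(inverse(low, high) == x)
```

In the vertical core the same arithmetic would run across rows rather than along a row, which is where the two alternating even and odd output rows come from.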

Decompressor and Distributor

Each horizontal and vertical core operates on a single color field, but the four input codestreams are instead separated by subband (LL2, LH2, HL2, HH2), with all four color fields being present in each codestream. The codestreams also cycle through four different column positions in a given row, since the Stage 2 forward vertical DWT uses four cores in parallel. A distributor remaps decoded subband data to the appropriate write address in one of the vertical IDWT cores. This is also a good place to interleave the high-pass and low-pass data, which facilitates the horizontal IDWT.
After decoding, subband pixels are redistributed to the appropriate location in each color field's vertical IDWT buffer.
The distributor writes four pixels into one of the four vertical core URAMs at most once per HDMI clock, to satisfy the one pixel per color field per clock constraint discussed above. For viewport widths greater than 1024px, the distribution is gated by the master pixel counter, which only updates when the interpolators actually need new pixels.

Continuing upstream, the distributor receives 16-bit signed pixel values from the four codestream decompressors. Each one takes in codestream data from RAM as-needed, decoding four pixels at a time by reversing the variable length code used by the encoder. The pixels are then multiplied by the inverse of the quantizer multiplication factor, using more DSPs, to recover their full range.

Raw codestream data is read in from RAM by an AXI Master into BRAM FIFOs at the entrance to each decompressor. I'm using precious BRAMs here, for the built-in FIFO functionality and to make the decoder RAM reader symmetric to the encoder RAM writer. A round-robin arbiter checks the FIFO levels to see when more data needs to be read. I'm only using a 64-bit AXI Master on the decoder, since the bandwidth already far exceeds the worst-case HDMI output requirement.
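The round-robin refill policy can be sketched as below. This is a behavioral model only: the function name, the idea of a single shared refill threshold, and the example FIFO levels are all hypothetical.

```python
def next_fifo(levels, last, threshold):
    """Pick the next FIFO to refill, starting just after the last one served.

    levels: current fill level of each decompressor's input FIFO
    last: index of the FIFO served most recently
    threshold: refill when a level drops below this (assumed policy)
    Returns the chosen index, or None if no FIFO needs data.
    """
    n = len(levels)
    for k in range(1, n + 1):
        idx = (last + k) % n
        if levels[idx] < threshold:
            return idx
    return None

print(next_fifo([10, 2, 3, 10], last=1, threshold=4))
```

Starting the search just after the last FIFO served keeps any one codestream from starving the others when several are low at once.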

Start-Of-Frame Context

So far, the HDMI output pipeline looks a lot like the sensor input pipeline in reverse. But one subtle way in which they differ is in Start-Of-Frame (SOF) context: the state of the pipeline at the beginning of each frame. In the interest of speed, the input pipeline is not flushed between frames. Furthermore, codestream addresses for a given frame are updated during the Frame Overhead Time (FOT) interrupt, while some data is still in the pipeline, so the very bottom of Frame N-1 becomes the top of Frame N in memory.

Overlap between Frame N-1 and Frame N in memory. SOF N marks the sector-aligned start of "Frame N" in RAM, set during the FOT interrupt from the CMV12000. The decoder seeks the actual start of Frame N data.
If the decoder processes every frame, this isn't a problem: it can wrap cleanly through the overlapping region to get the data it needs for both frames. But the HDMI output only processes a subset of the frames captured. It needs to be able to find the start of any individual frame and process it independently. This is needed for seeking in a playback context too. But I can't afford the time it would take to flush the input pipeline between each frame. So instead I need to completely capture the state of the pipeline at the SOF boundary.

As it turns out, this isn't too bad, since there are only a few places where data can remain in the input pipeline at the SOF: 
  1. In the pre-encoder pixel memory: registers or BRAM buffers that are part of sensor input, DWT or quantizer operations. These have a fixed latency of 6336px for Stage 1+2. The decoder can offset its pixel counter by this amount, essentially discarding the overlapping pixels into the space between VSYNC and the start of the viewport.
  2. In the 128-bit e_buffer register of each codestream that accumulates encoded data before writing it to that codestream's BRAM FIFO. The number of bits remaining in this register is neatly captured by its write index, e_buffer_idx.
  3. In the codestream BRAM FIFO itself. This is captured by the FIFO read level, already used as the AXI write trigger. Since these FIFOs are 64-bit write and 128-bit read, care must be taken to keep track of the write level LSB as well, to know if there's an extra half-word in memory that can't be read yet.
The last two combine to give a number of bits to discard for each codestream: 

e_buffer_idx + 128 * fifo_rd_count + 64 * fifo_wr_count[0] 

To fully capture the SOF context, these three values are written to the frame headers during the FOT interrupt. A VSYNC interrupt from the HDMI module prompts software to read the header of the next frame to be displayed, calculate the number of bits to discard for each codestream, and pass it to the decoder along with the codestream start addresses. The decoders then discard that many bits before attempting to decode any pixels.
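The software side of that calculation is small. Here is a sketch of it in Python: the formula is the one given above, while the function name and the example header values are invented for illustration.

```python
def bits_to_discard(e_buffer_idx, fifo_rd_count, fifo_wr_count):
    """Bits of Frame N-1 data sitting ahead of Frame N in one codestream.

    e_buffer_idx: bits pending in the 128-bit e_buffer register
    fifo_rd_count: FIFO level in 128-bit read words
    fifo_wr_count: FIFO level in 64-bit write words (only its LSB is used,
                   to account for a trailing half-word)
    """
    return e_buffer_idx + 128 * fifo_rd_count + 64 * (fifo_wr_count & 1)

# Made-up header values for four codestreams: (idx, rd_count, wr_count).
headers = [(37, 3, 7), (0, 0, 0), (5, 2, 4), (127, 1, 3)]
print([bits_to_discard(*h) for h in headers])
```
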

High-level architecture of the encoder and decoder interactions with the CPU and RAM.
In total, the HDMI output module (decoder and all) uses 4363 LUTs, 4227 FFs, and 4 BRAMs, less than what was saved by deleting Wavelet Stage 3. It adds 8 URAMs and 26 DSPs, but I'm not running short of those (yet). Except for the AXI Master, it runs entirely on the 74.25MHz HDMI clock, so it shouldn't be too much of a timing burden. There might be room for a bit more optimization, but I'm happy with the functionality it gives for its size.

Focus Assist

The main reason I wanted to get the HDMI module done now, ahead of some of the other remaining tasks, is so I can use the real-time preview for testing. It sucks to have to pull frames off one-by-one through USB to iterate on framing, exposure, and especially focus. Having a 1080p 30fps preview on something like an Atomos Shinobi on-camera monitor makes life a lot easier, and moves in the direction of standalone operation.

One neat trick you can do with wavelets is overmultiply the high-pass subbands (LH1, HL1, HH1) to highlight edges in the preview image. This effect is useful for focus assist. Most on-camera monitors can do this anyway (by running a high-pass filter on the HDMI data), but it's essentially free to do in the decoder since the subbands pass through a multiplier anyway to undo the quantization division. I'll take free features any day.
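As a rough illustration of why this costs nothing extra, the focus assist gain can be folded into the multiplier that already undoes quantization. The sketch below is a behavioral model only: the function name, the integer gain, and the clamp to a 16-bit signed range are assumptions.

```python
def dequantize(coeff, inv_q, assist_gain=1):
    """Scale a decoded coefficient back to full range, with optional edge boost.

    inv_q: inverse of the encoder's quantizer factor (assumed integer here)
    assist_gain: extra multiplier applied to high-pass subbands for focus assist
    """
    value = coeff * inv_q * assist_gain      # one combined multiplication
    return max(-32768, min(32767, value))    # clamp to 16-bit signed

print(dequantize(5, 8))                  # normal preview
print(dequantize(5, 8, assist_gain=4))   # exaggerated edges for focusing
```

Since `inv_q * assist_gain` can be precomputed in software, the hardware sees only a different multiplier constant for the high-pass subbands.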

Macro Machining

With the newfound ability to actually focus the image in a reasonable amount of time, I'm finally able to play with a new lens: The Irix Cine 150mm T3.0 Macro. I started drooling over this lens for close-up high-speed stuff after watching this review. I'm no lens expert, but I feel like this lens series competes with ones 3x its price in terms of image quality. My first test was to attempt to get some macro shots of my mini mill:

Shooting my mini-mill at 400fps with the Irix 150mm T3.0 Macro lens.
The HDMI output was crucial for this, since the lens has an insanely shallow depth-of-field at T3.0, less than the width of the cutting tool. The CMV12000 is not a particularly good low-light sensor, so with an exposure time of around 1.87ms, I needed to add a good deal of light. To make things more interesting, I threw in some cheap IKEA RGBs as well. It took a while to get set up, but the result was promising:

I'll probably repeat this with a more interesting subject (this was just a piece of scrap aluminum) and a more stable mount. If I can get more light, it might be good to close down to T5.6 or so as well, to get a bit more depth of field, and drop the exposure to 180º shutter for less motion blur on the cutter. But the lens is terrific and I'm happy with the quality of the two-stage wavelet compression so far. The above clip has an average compression ratio of right around 6:1, helped along by the ultra-shallow depth of field.
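For reference, the relationship between exposure time and shutter angle at 400fps works out as follows. This is just the standard shutter angle arithmetic, with hypothetical function names.

```python
def shutter_angle_deg(exposure_s, fps):
    """Shutter angle is the exposed fraction of the frame period, times 360."""
    return 360.0 * exposure_s * fps

def exposure_for_angle(angle_deg, fps):
    """Exposure time (seconds) that gives the requested shutter angle."""
    return angle_deg / 360.0 / fps

print(round(shutter_angle_deg(1.87e-3, 400), 1))   # angle of the clip's exposure
print(exposure_for_angle(180, 400) * 1e3, "ms")    # exposure at 180 degrees
```

So the 1.87ms exposure is about a 269º shutter, and dropping to 180º at 400fps means 1.25ms, about two-thirds of the light.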

Next Up

The last major HDL task on this project is modifying the pipeline to accept 2K subsampled frames from the sensor at higher frame rates (up to around 1400fps at 1080p!). This will probably be a separate Vivado project and bitstream, since it requires substantial modifications to the input pipeline. It also needs twice as many Stage 1 horizontal cores, since four rows are being read in simultaneously instead of two.

But I may tackle some simpler but no less important usability tasks first. For one, I still don't have pass-through Mass Storage Device access to the SSD over USB C. This is necessary for getting footage off without opening the camera (or using RAM as intermediate storage). With that and a bit of on-camera UI work (record button, simple menus), I can finally run everything completely standalone soon.