Saturday, April 18, 2020

Full-Speed CMV12000 Subsampled Readout: 1440fps 1080p

Now that I've got a continuous multi-Gpx/s image capture pipeline running, it's time to rearrange some things to break the 1000fps barrier:



For this clip I'm using the CMV12000's X/Y subsampling mode to trade resolution for frame rate, hitting 1440fps at 2048x1088. The overall pixel rate is a little lower than in 4K (3.2Gpx/s vs. 3.8Gpx/s), so it's feasible to send this through the same Zynq Ultrascale+ capture pipeline, with some modifications, to record continuously to an NVMe SSD. With ~4:1 wavelet compression, this writes about 1GB/s to the drive, up to 1000s (16.7min) for a 1TB drive. That would be 16.7 hours of playback at 24fps, though. I figured 30 seconds real-time and 30 minutes of playback was enough water droplet footage for now.
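The figures above can be checked with a quick back-of-envelope calculation. This is only a sketch: it assumes 10-bit pixels, an exact 4:1 compression ratio, and a nominal 1TB (1e12 byte) drive, and all of the names are made up for illustration.

```python
# Back-of-envelope data rates for the two capture modes.
BITS_PER_PIXEL = 10       # assumed sensor bit depth
COMPRESSION_RATIO = 4.0   # assumed exact; the real ratio is only ~4:1
DRIVE_BYTES = 1e12        # nominal 1TB drive

def pixel_rate(width, height, fps):
    """Raw pixel rate in pixels per second."""
    return width * height * fps

def write_rate_bytes(px_per_s):
    """Compressed write rate in bytes per second."""
    return px_per_s * BITS_PER_PIXEL / 8 / COMPRESSION_RATIO

rate_2k = pixel_rate(2048, 1088, 1440)   # about 3.2 Gpx/s
rate_4k = pixel_rate(4096, 2304, 400)    # about 3.8 Gpx/s
write_2k = write_rate_bytes(rate_2k)     # about 1 GB/s
record_s = DRIVE_BYTES / write_2k        # just under 1000 s
playback_h = record_s * 1440 / 24 / 3600 # hours of playback at 24fps

print(f"2K: {rate_2k / 1e9:.2f} Gpx/s, 4K: {rate_4k / 1e9:.2f} Gpx/s")
print(f"write: {write_2k / 1e9:.2f} GB/s, record: {record_s:.0f} s, "
      f"playback: {playback_h:.1f} h")
```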

CMV12000 Subsampling

In a previous post, I covered the pipeline architecture for continuously recording 400fps 4K video from a CMV12000 image sensor to an NVMe SSD. That was a 4096x2304 (16:9) frame, slightly larger than 4K UHD. The sensor's native resolution is 4096x3072 (4:3), which it can read in at 300fps. By reading in fewer rows, the maximum frame rate is increased. Going wider than 16:9 would allow frame rates higher than 400fps, but since the sensor always reads in full 4096px-wide rows, the speed gain is only linear.
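The linear relationship between row count and frame rate can be sketched as below. This ignores any fixed per-frame overhead, so it is an idealized estimate rather than a datasheet formula, and the function name is hypothetical.

```python
FULL_ROWS = 3072   # native 4:3 row count
FULL_FPS = 300     # full-frame rate at native resolution

def max_fps_for_rows(rows):
    """Idealized frame rate when only `rows` full-width rows are read."""
    return FULL_FPS * FULL_ROWS / rows

fps_16x9 = max_fps_for_rows(2304)        # the 4096x2304 case
fps_1088_rows = max_fps_for_rows(1088)   # cropping rows only, full width

print(fps_16x9, round(fps_1088_rows))
```

Under this estimate, even cropping down to 1088 full-width rows would land well short of 1440fps, which is why the row width has to shrink too.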

To go much faster, it's necessary to read in fewer columns as well. Not all sensors can do this; reading whole rows may be baked into the hardware architecture. The CMV12000 doesn't support arbitrary readout width, but it does support 2x subsampling. In this mode, every other four-pixel square (Bayer group) is skipped in both the X and Y directions. The remaining squares are transmitted on the LVDS channels using an alternate packing:

CMV12000 subsampled readout (color, X-flipped).
Each of the 64 LVDS channels alternates between two rows, with the lower 32 channels handling two even (G1/R1) rows and the upper 32 channels handling two odd (B1/G2) rows. This alternate data packing allows the subsampled image, with 1/4 as many total pixels, to be read out nearly 4x faster. There is a small amount of extra overhead time that makes the actual gain not quite 4x.
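As a rough illustration of which pixels survive, here is a minimal sketch of 2x Bayer-group subsampling on a plain 2D list. It assumes the kept groups are the even-indexed ones (0, 2, 4, ...); which groups the sensor actually keeps is not something this sketch knows, and it says nothing about the LVDS packing.

```python
def subsample_2x(raw):
    """Keep every other 2x2 Bayer group in both X and Y.

    `raw` is a list of rows. Group indices 0, 2, 4, ... are kept,
    which is an assumption made for illustration.
    """
    kept_rows = [row for y, row in enumerate(raw) if (y // 2) % 2 == 0]
    return [[p for x, p in enumerate(row) if (x // 2) % 2 == 0]
            for row in kept_rows]

# Label each pixel of an 8x8 test frame with its (x, y) coordinate.
frame = [[(x, y) for x in range(8)] for y in range(8)]
sub = subsample_2x(frame)

for row in sub:
    print(row)
```

The output has half the width and half the height, and each kept 2x2 group is still a complete Bayer group.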

Subsampling drops the resolution from 4K to 2K but preserves the crop factor of the sensor, since the full width and height are still used. This is preferable to cropping a 2048px-wide image out of the middle. It doesn't give any increase in sensitivity though; to do that would require binning (averaging the larger 4x4 squares to generate the final 2x2). The CMV12000 does support binning, but the overhead is so bad that you might as well read out the 4K image and do it in post (assuming you have the data storage bandwidth, which I certainly do). So to go ~4x faster, I will need ~4x more light.
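For comparison, binning in post could look something like the sketch below: each 4x4 square of raw Bayer data is reduced to one 2x2 group by averaging the four same-color pixels. This is a generic illustration on plain lists, not the sensor's on-chip binning, and the function name is hypothetical.

```python
def bin_bayer_2x(raw):
    """Average same-color pixels of each 4x4 square into one 2x2 group.

    `raw` is a list of rows whose width and height are multiples of 4.
    """
    h, w = len(raw), len(raw[0])
    out = [[0.0] * (w // 2) for _ in range(h // 2)]
    for y in range(h // 2):
        for x in range(w // 2):
            # Top-left same-color pixel of the 4x4 square for this output.
            by = (y // 2) * 4 + (y % 2)
            bx = (x // 2) * 4 + (x % 2)
            total = (raw[by][bx] + raw[by][bx + 2] +
                     raw[by + 2][bx] + raw[by + 2][bx + 2])
            out[y][x] = total / 4
    return out

square = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
binned = bin_bayer_2x(square)
print(binned)
```

Each output pixel averages four samples, which is where the sensitivity gain over plain subsampling comes from.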

Light sensitivity of subsampling vs. binning.
Before worrying about a shortage of photons, though, I first need to deal with a shortage of programmable logic. To fit everything on the XCZU4, my main bottlenecks are BRAMs and LUTs. I managed to add the decoder for HDMI output with no increase in either by sacrificing the third wavelet stage. But I've known for a long time that the day would come when I would need to add 128 more Stage 1 horizontal cores to handle the subsampled inputs.

It might seem odd that more cores are needed to process a smaller image. Even at the higher frame rate, the pixel input rate is lower than in 4K. Surely the existing horizontal cores could time-multiplex to handle the data? But, the wavelet cores must operate on groups of adjacent pixels. In this case, adjacency describes the nearest horizontal pixels of the same color, since applying a difference operation to pixels of different colors would not have the desired result. And whatever the color, pixels from another row are not horizontally adjacent. Since each LVDS channel now services two color fields and two rows, it must feed four independent wavelet cores.

In 2K mode, each LVDS channel feeds four independent Stage 1 horizontal cores.
So, the total number of Stage 1 horizontal cores doubles from 128 to 256. This jump has been on my mind since the early stages of the design, and I tried to optimize the horizontal cores as much as possible. A big part of this was reducing the operating pixel width from 16-bit to 12-bit, which brought the per-core LUT count down from 107 to 83. As this is the first stage of the pipeline, it's easy to verify that it won't saturate on 10-bit inputs. The horizontal cores operate in-line with the input using only distributed memory, so no additional BRAMs are required. But there's no way around the additional 10,000 or so LUTs, and that will bring me right up to the limits of this chip.
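The core and LUT counts above work out as follows. This is just the arithmetic from the text; the two-cores-per-channel figure for 4K mode is inferred from 128 cores across 64 channels.

```python
LVDS_CHANNELS = 64
CORES_PER_CHANNEL_4K = 2   # two color fields, one row (inferred)
CORES_PER_CHANNEL_2K = 4   # two color fields x two rows
LUTS_PER_CORE_16BIT = 107
LUTS_PER_CORE_12BIT = 83

cores_4k = LVDS_CHANNELS * CORES_PER_CHANNEL_4K
cores_2k = LVDS_CHANNELS * CORES_PER_CHANNEL_2K
extra_cores = cores_2k - cores_4k
extra_luts = extra_cores * LUTS_PER_CORE_12BIT
luts_saved = cores_2k * (LUTS_PER_CORE_16BIT - LUTS_PER_CORE_12BIT)

print(cores_4k, cores_2k, extra_luts, luts_saved)
```

So the 12-bit optimization is worth roughly 6,000 LUTs across all 256 cores, against roughly 10,600 LUTs for the 128 new cores.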

Since I knew there would be very few LUTs remaining for switching modes, I originally thought the 4K and 2K modes might have to exist as entirely separate PL configurations, their bitstreams loaded as needed by software. I've seen other cameras do this; it looks like a software reset when changing capture formats. And while it only takes a few seconds, I really dislike the workflow and the idea of maintaining two configurations.

So, I spent some time looking at the actual differences between modes at all stages of the pipeline and decided that I could and should build the switch. I had this mode change in mind early in the design, so I tried to minimize the number of touch points required in each of the modules to switch between 4K and 2K. Even so, there are a number of small changes needed in the Wavelet, Encoder, and HDMI modules. They are collectively driven by a master switch in each module's AXI slave registers. I'll go through them in pipeline order below.

Wavelet Stage 4K/2K Switch

First, no actual switching is required to distribute the inputs to the Stage 1 horizontal cores; each channel always connects to the same four cores. Instead, the cores are gated by a master pixel counter based on their color and, when in 2K mode, also their row. The 2K mode switch turns on this extra enable gate and offsets the counter that handles first/last row states by one bit, to account for the half-width rows. Miraculously, this did not add any LUTs to the horizontal cores. I assume the extra logic just got merged into existing smaller LUTs...I'll take it.

The most complicated part of the switch happens next, at the interface between the Stage 1 horizontal and vertical cores. Instead of distributing outputs from four adjacent horizontal cores into a single row of a vertical core BRAM, the 2K interface distributes outputs from eight horizontal cores into two rows of a vertical core BRAM. Since the rows are half as wide, this takes the same number of pixel clock cycles (128). So, as will be the case at many points in the pipeline, this just boils down to rearranging the bits of the BRAM write address:

Aspect ratio change and read/write addressing of the Stage 1 vertical core BRAMs in 4K vs. 2K mode.
Conceptually, the aspect ratio of the vertical core BRAM changes from 8 rows of 256px to 16 rows of 128px. The figure above shows where writes and reads occur in the BRAM at a given relative pixel count. Reads occur on half-counts since the Stage 1 vertical DWT operates at double px_clk frequency. The read address generator is also modified by the switch to account for the new aspect ratio. Only the eight most recent rows are actively written or read, so in 2K mode the BRAM is twice as big as it needs to be. The latency of the vertical core is also halved, since it's determined by the number of rows required to complete the vertical DWT operation. This will come into play later.
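The aspect ratio change can be pictured as moving one bit between the row and column fields of the same address. The sketch below is only a model of that idea: it treats each address as one pixel slot and puts the row in the upper bits, both of which are assumptions, and it does not capture the interleaved write order shown in the figure.

```python
DEPTH_BITS = 11  # 8 * 256 == 16 * 128 == 2048 slots

def bram_addr(row, col, mode_2k):
    """Pack (row, col) into a BRAM address for the given mode.

    4K mode: 3 row bits, 8 column bits (8 rows of 256px).
    2K mode: 4 row bits, 7 column bits (16 rows of 128px).
    The bit layout is hypothetical.
    """
    col_bits = 7 if mode_2k else 8
    row_bits = DEPTH_BITS - col_bits
    if not 0 <= col < (1 << col_bits):
        raise ValueError("column out of range for this mode")
    row &= (1 << row_bits) - 1   # rows wrap, like a circular row buffer
    return (row << col_bits) | col

addrs_4k = {bram_addr(r, c, False) for r in range(8) for c in range(256)}
addrs_2k = {bram_addr(r, c, True) for r in range(16) for c in range(128)}
print(len(addrs_4k), len(addrs_2k))
```

Both modes cover the same 2048 slots, so the memory itself doesn't change, only how the address bits are interpreted.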

The Stage 1 vertical core buffers the alternating-row 2K mode inputs into a single-row format that's compatible with the rest of the pipeline, so changes after this point are relatively minor. Each Stage 1 vertical core feeds its output row to a Stage 2 horizontal core. The only modification required there is to offset the counter that handles first/last row states by one bit, to account for the half-width rows. Then, the Stage 2 vertical core just needs some more BRAM address rearrangement:

Aspect ratio change and read/write addressing of the Stage 2 vertical core BRAMs in 4K vs. 2K mode.
Like the Stage 1 vertical core BRAM, the aspect ratio is changed from 8 rows of 256px to 16 rows of 128px. But since the first stage already rearranged things into single rows, the write addressing here is more straightforward: In both 4K and 2K mode, only a single row is filled at a time (by two adjacent Stage 2 horizontal cores). The row width is halved, but there's no write interleaving between the two rows. Ultimately, this is just a different arrangement of the write address bits. The read address generator is similarly modified to grab the right data for the Stage 2 vertical DWT. As with the Stage 1 vertical core, the BRAM is twice as big as it needs to be, and the latency is halved.

Encoder 4K/2K Switch

The compression stage doesn't care about the aspect ratio change, since the only context it uses for variable-length encoding is an immediate group of four pixels. However, it does need to know the adjusted latency of both wavelet stages, since the first pixel to be encoded will arrive sooner in 2K mode. For that, I just made all the latency offsets software-defined, through the encoder's AXI slave registers. And that should be the only change required here...

Except things are never that easy. I noticed after plugging in the expected latency values in 2K mode, two of the four color fields (R1 and G2) were actually dropping one pixel per row. It took a while to isolate this to the encoder, and then even more staring at this module to figure out what the problem was. Since the only change I made was to the latency offsets, I figured there had to be some fundamental difference between how the local pixel counter (px_count_e) drives the encoder states during row transitions with different offsets, and there was:

Encoder gating in 4K mode, showing the difference between sequential and combinational px_count_e_updated.
The above shows px_count_e at the first row overhead time (ROT) in 4K mode. It's negative since pixels haven't made it to the encoder yet, but the same behavior happens at all subsequent row transitions. During ROT, the sensor is not sending pixel data and all the pixel counters (including px_count_e) hold their previous values. A signal called px_count_e_updated is cleared, which gates the encoder from sending pixels to RAM (via an intermediate shift register called e_buffer). This signal was previously sequential, which would add one clock cycle delay between the ROT and when the encoder is gated. It should have been combinational, to line up correctly with the ROT.

But the write to e_buffer also only takes place every other group of four pixel clocks, for reasons discussed here. In 4K mode, the ROT happens to fall in a period where writes don't occur anyway. The sequential vs. combinational difference didn't matter to the final e_buffer_wr_en signal. But in 2K mode, the new latency offsets just happen to put the ROT one cycle before the start of a four-cycle write sequence, where the difference does matter:

Encoder gating in 2K mode, showing the difference between sequential and combinational px_count_e_updated.
After switching over to combinational logic for px_count_e_updated, the missing pixel returned, and things were almost happy again. It turns out there was a similar issue at the quantizer and encoder modules themselves, before the write to e_buffer. This was simply due to them not being enable-gated at all, though. (Again, it must have been working thanks to lucky latency offsets in 4K mode.) Gating each with the same combinational px_count_e_updated signal worked fine.
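A toy cycle-by-cycle model shows why the registered gate only matters for some latency offsets. Everything here is an assumption made for illustration: the row length, the ROT length, the write-slot phase, and the offsets are invented, and only the idea (a one-cycle-late gate ANDed with an every-other-group-of-four write slot) comes from the description above.

```python
def wr_en_trace(valid, offset, registered):
    """Per-cycle write enable for a toy e_buffer gate.

    valid[t] is False during row overhead time (ROT). The pixel
    counter holds while valid is False. Writes are allowed only in
    every other group of four counts (slot phase is hypothetical).
    """
    trace = []
    px_count = 0
    prev_valid = valid[0]
    for v in valid:
        px_count_e = px_count - offset
        in_write_slot = (px_count_e // 4) % 2 == 1
        gate = prev_valid if registered else v   # registered = 1 cycle late
        trace.append(gate and in_write_slot)
        prev_valid = v
        if v:
            px_count += 1
    return trace

def disagreements(offset):
    """Cycles where registered and combinational gating differ."""
    valid = [True] * 16 + [False] * 3 + [True] * 16   # row, ROT, row
    seq = wr_en_trace(valid, offset, registered=True)
    comb = wr_en_trace(valid, offset, registered=False)
    return [t for t, (a, b) in enumerate(zip(seq, comb)) if a != b]

print(disagreements(0))   # ROT falls in a no-write period
print(disagreements(4))   # ROT falls at the start of a write sequence
```

With the first offset the two versions produce identical write enables, so the bug is invisible. With the second, they differ on the first ROT cycle and on the first cycle after it.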

HDMI 4K/2K Switch

But wait, isn't the HDMI output always 1080p? While that is true, it doesn't mean there's nothing to be done here. In 4K mode, only the Stage 2 wavelet compression is decoded, leaving a 2K preview image (really, four color fields that are each 1024px wide) to be output via HDMI. This greatly reduces the size of the HDMI module, since it only has to decode four of the sixteen codestreams and do one stage of inverse DWT. However, getting to the same preview size in 2K mode would mean complete decoding, requiring all sixteen codestreams and two wavelet stages. I simply don't have room to do that, so I'm going to cheat.

The first step is to change how the viewport is mapped to a pixel count. To achieve arbitrary scaling of the preview image, I first normalize the viewport to 16-bit, i.e. top-left (0, 0) to bottom-right (65535, 65535). The x and y components, vxNorm and vyNorm, are shifted around to create the pixel counter that drives the output pipeline. When switching from 4K to 2K, each component gets right-shifted by one and the split between x and y moves over by one bit in the final counter:

Mapping between 16-bit (vxNorm, vyNorm) coordinates and opx_count in 4K vs. 2K mode.
This remapping means that the entire output pipeline operates at half resolution in 2K mode. The preview will actually just be scaled up from the four LL1 color fields, which are each 512px wide. There will still be bilinear interpolation to help smooth out the result, but it will be blurrier than the 1080p preview in 4K mode. But again there isn't really an alternative, at least not with the resources I have left on this chip.
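One way to picture the remapping is the sketch below. The bit widths are guesses chosen to match the 1024px and 512px field widths mentioned in the text; the real opx_count layout is only shown in the figure, so treat the shifts and the function itself as hypothetical.

```python
def opx_count(vx_norm, vy_norm, mode_2k):
    """Pack 16-bit normalized viewport coordinates into a pixel count.

    Assumed layout: x in the low bits, y above it. 4K mode keeps
    10 bits of x (1024px fields), 2K mode keeps 9 bits (512px fields).
    """
    x_bits = 9 if mode_2k else 10
    shift = 16 - x_bits          # 2K mode shifts right by one extra bit
    x = vx_norm >> shift
    y = vy_norm >> shift
    return (y << x_bits) | x     # the x/y split moves by one bit in 2K

right_edge_4k = opx_count(65535, 0, mode_2k=False)
right_edge_2k = opx_count(65535, 0, mode_2k=True)
print(right_edge_4k, right_edge_2k)
```

The same viewport position lands on half the x and y index in 2K mode, which is what makes the rest of the output pipeline run at half resolution.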

The output pixel counter (opx_count) drives all parts of the decoding process, starting with a RAM reading FIFO through the HDMI module's AXI master. No changes are required there or in the decoder itself, other than modifying the latency offsets accordingly. These have always been software-defined, so I just added the expected values for 2K mode and they worked without any hassle. (There was no equivalent sequential vs. combinational bug, thankfully.)

After this, the modifications to the Stage 2 inverse vertical wavelet cores are pretty simple and almost the same as in the forward direction. Each color field's IV2 core uses a single URAM for row storage. In 2K mode, the aspect ratio is changed from 16 rows of 1024px to 32 rows of 512px, by rearranging read and write address bits:

Aspect ratio change and read/write addressing of the Stage 2 inverse vertical core URAMs in 4K vs. 2K mode.
Unlike the forward direction, the Stage 2 inverse horizontal wavelet cores also use URAMs for row storage and these likewise need address bit rearrangement to change aspect ratios for 2K mode:

Aspect ratio change and read/write addressing of the Stage 2 inverse horizontal core URAMs in 4K vs. 2K mode.
And finally, the bilinear interpolation module needs to be adjusted to automatically scale up the preview image by 2x, so it can fill the viewport using the 512px-wide color field LL1 outputs. This can be done quickly by passing the shifted vxNorm and vyNorm values to the module, although this isn't quite correct, as will be discussed below. It's good enough for now, though.

Debayering

Applying an ordinary debayering algorithm, whatever it is, to the 2K subsampled raw data doesn't really work. This is because the physical spacing between pixels is no longer symmetric. For example, a red pixel is closer to its green and blue neighbors to the left and below than to the right and above. A proper bilinear interpolation needs to take this asymmetry into account, by modifying the location of pixel centers for each color field accordingly. More advanced algorithms are still built on the assumption of symmetric neighbors, so they'd all need modification to some degree.
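The asymmetry is easy to see by placing the subsampled pixels back at their full-resolution coordinates. The sketch below assumes a G1/R1 over B1/G2 group layout, that even-indexed groups are kept, and that y increases downward; the actual orientation depends on the sensor and on the X-flip, so the left/right labels are illustrative.

```python
GROUP_PITCH = 4  # full-res pixels between kept 2x2 groups (assumed)
OFFSETS = {"G1": (0, 0), "R1": (1, 0), "B1": (0, 1), "G2": (1, 1)}

def center(color, gx, gy):
    """Full-resolution (x, y) of a pixel in kept group (gx, gy)."""
    ox, oy = OFFSETS[color]
    return (GROUP_PITCH * gx + ox, GROUP_PITCH * gy + oy)

red = center("R1", 1, 1)

# Nearest same-row G1 neighbors on either side of the red pixel.
dx_left = red[0] - center("G1", 1, 1)[0]
dx_right = center("G1", 2, 1)[0] - red[0]

# Nearest same-column G2 neighbors below and above the red pixel.
dy_below = center("G2", 1, 1)[1] - red[1]
dy_above = red[1] - center("G2", 1, 0)[1]

print(dx_left, dx_right, dy_below, dy_above)
```

In an ordinary Bayer mosaic all four distances would be 1. Here the neighbors on one side are three times farther away than on the other, so an interpolator that weights them equally is sampling at the wrong location.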

Asymmetric neighboring pixels in subsampled mode can be handled by modifying interpolation pixel centers (left) or with an intermediate supersampling step (right).
Alternatively, the subsampled data can be supersampled by 2x to estimate the missing pixels (G2' and G1' in the image above) and then run through the ordinary debayer algorithm in 4K. The final output can then be scaled back to 2K to reflect the true information content of the data. This path takes longer for what may be an equivalent result for simpler debayer algorithms, but it might have advantages for more complex algorithms. All this will probably be obsoleted by neural networks that upscale 240p images to 16K in a few years anyway, so I'm not going to worry about it.

It is important to adapt the debayer algorithm for the subsampled pixel locations somehow, though, or there will be significant artifacts. The following comparison shows three different algorithms, nearest-neighbor, bilinear, and a Microsoft 5x5 interpolator that I like. For each, a reference 4K capture and 4K debayer is compared to a 2K subsampled capture with an unmodified 2K debayer and a 2K subsampled capture with a supersampled 4K debayer.

Comparison of three different interpolation algorithms with 4K capture/debayer, 2K subsampled capture with unmodified 2K debayer, and 2K subsampled capture with supersampled 4K debayer.
None of these simple algorithms can do much to recover resolution - for that I defer to the AI supersampling state of the art - but using an unmodified 2K debayer on subsampled raw data creates significant color checkerboarding artifacts on edges. Supersampling the data by 2x and running a simple 4K debayer at least bypasses the problem of neighboring pixel asymmetry.

Resource Utilization

Squeezing in the 4K/2K switch was beyond what I'd hoped to fit on the XCZU4, but it just barely works. The switch itself really only adds LUTs where BRAM/URAM address bits are remapped or where pixel counts are shifted to account for the aspect ratio change. The main addition is the 128 new Stage 1 horizontal wavelet cores, which really push the resource utilization to the limits.

The XCZU4 with everything crammed in.
At this point I'm at 77143 LUTs (87.82%), 93883 FFs (53.44%), 118 BRAMs (92.20%), 14 URAMs (29.17%) and 146 DSPs (20.05%). But, since most of my cores are running at px_clk (60MHz) or HDMI clock (74.25MHz) frequency, the timing constraints are not too difficult to meet. The exception seems to be things that interact with the 250MHz AXI clock, including the encoder and decoder BRAM FIFOs. These have to have some amount of manual placement help to meet timing.

The good news is I don't really have much else to add to the programmable logic. I've already built in placeholder URAMs for UI overlays in the HDMI module, so those just need to be filled in by software. I might add some more color processing to the HDMI output, but that will mostly use DSPs, and possibly URAMs for color look-up tables, which should be no problem to add. I'm really happy that everything fits on the XCZU4, not just because the bigger chips are way more expensive, but because it's been a much better lesson in optimizing cores to fit resource constraints than if I had just switched to the XCZU7 early on.