Wednesday, September 4, 2019

CMV12000 Full-Speed (38.4Gb/s) Read-In on Zynq Ultrascale+

In my original Freight-Train-of-Pixels post, I explored three main challenges of building a 3.8Gpx/s imager: the source, the pipe, and the sink. Working backwards, the sink is an NVMe SSD that (hopefully) will be capable of 1GB/s writes. The pipe is a ~5:1 wavelet compression engine that has to make 3.8Gpx/s = 1GB/s in realtime, with minimal effect on image quality. And the source is the CMV12000 image sensor that relentlessly feeds pixel data into this machine. This post focuses on the source, and specifically the read-in mechanism implemented on a Zynq Ultrascale+ SoC for the 38.4Gb/s of LVDS data from the sensor.

Physical Interface

The Source: Breaking out the CMV12000's 64 LVDS pairs was interesting.
The pixel data interface on the CMV12000 is 64 LVDS pairs, each operating at (up to) 300MHz DDR (600Mb/s). In the context of FPGA I/O interfaces, 300MHz DDR is really not that fast. It's just a lot of inputs. Most Zynq Ultrascale+ SoCs have enough LVDS-capable package pins to do this, but it took some searching to find a carrier board that breaks out enough of them to headers. I'm using the Trenz Electronic TE0803, specifically the ZU4CG vesion, which breaks out a total of 72 HP LVDS pairs from the ZU+.

The physical interface for LVDS is a 100Ω differential pair. At 300MHz DDR, the length-matching requirements are not difficult. A bit is about 250mm long, so millimeter-scale mismatches due to an uneven number of 45º left and right turns are not a big deal; no meandering is really needed for intrapair matching. Likewise, I felt it was okay to break some routing rules by splitting a pair for a few millimeters to reach inner CMV12000 pins, rather than pushing trace/space limits to squeeze two traces between pins.

Routing of the LVDS pairs to TE0803 headers. Pairs are length-matched to within ~1mm, but no interpair matching was attempted. The FPGA must deal with interpair length differences as well as the CMV12000's large internal skew.
Interpair skew is still an issue. For ease of routing, no interpair length matching was attempted, resulting in length differences of as much as 30% of a bit interval. But this isn't even the bad news. The CMV12000 has a ~150ps skew between each channel from 1-32 and 33-64. That means that channels 32 and 64 are ~4.7ns behind channels 1 and 33, a whopping 280% of a bit interval. It would be silly to try to compensate for this with length matching, since that's equivalent to about 700mm at c/2!

Deserialization and Link Training

For a brief moment after reading about the CMV12000's massive interchannel skew, I thought I might be screwed. FPGA inputs deal with skew by using adjustable delay elements to add delay to the edges that arrive early, slowing them all down to align with the most fashionably late edge. But the delay elements in the Zynq Ultrascale+ are only guaranteed to provide up to 1.1ns of delay. It's possible to cascade the unused output delay elements with their associated input delay elements, but that's still only 2.2ns.

But I don't need to account for the whole 4.7ns of interchannel skew; I only need to reach the same phase angle in the next bit. At 600Mb/s, that's only 1.67ns away. Delays larger than this can be done with bit slipping, as shown below. Since this still relies on the cascaded delay elements to span one full bit interval, an interesting consequence is that a minimum speed is imposed (about 450Mb/s for 2.2ns of available delay). So I guess it's go fast or go home...

Channels are aligned using an adjustable delay of up to one bit period and integer bit slipping in the deserialized data.
The Ultrascale+ deserializer hardware supports up to 1:8 (byte) deserialization from DDR inputs. The bit slip logic selects a new byte from any position in the 16-bit concatenation of the current and previous raw deserialized bytes. The combination of the delay value and integer bit slip offset independently align each channel.

A complication is that the CMV12000 has 8-, 10-, and 12-bit pixel modes, with the highest readout efficiency in the default 10-bit mode. To go from 8-bit deserialized data to 10-bit pixel data requires building a "gearbox", a nomenclature I really like. An 8:10 gearbox can be built pretty easily with just a few registers:

An 8:10 gearbox, with four states corresponding to alignment of the 10-bit output within two adjacent 8-bit inputs.
The gearbox cycles through four states, registering a 10-bit output from an offset of {0, 2, 4, or 6} within two adjacent 8-bit inputs to pick out whole pixels from the data. This looks simple enough, but there's a subtlety in the fact that five bytes must cycle through the registers for every four pixels. In other words, the input clock (byte_clk) is running 5/4 as fast as the output clock (px_clk). The two clocks must be divided down from the same source (the LVDS clock in this case) to ensure that timing constraints can be evaluated. Additionally, to work as pictured above, the phase of the two clocks must be such the "extra" byte shift occur between states 3 and 0.

The overall input module is pretty tiny, which is good because I have to instantiate 65 of them (64 pixel channels and one control channel). They're built into an AXI-Lite slave peripheral with all the per-channel tweakable parameters as well as the final 10-bit pixel outputs mapped for the ARM to play with. The CMV12000 outputs training data on the pixel channels any time they're not being used to send real data. So, my link training process is:
  1. Find the correct phase for the px_clk so that, as described above, the gearbox works properly. Incorrect phase will result in flickering pixel data as the byte shifts occur in the wrong place relative to the gearbox px_clk state machine. I'm not sure why this phase changes from reset to reset. It's the same value for all 65 channels, so I feel like there should be a way to have it start up deterministically. But for now it's easy enough to try all four values and see which one produces constant data.
     
  2. On each channel, set the sampling point by sweeping through the adjustable delay values looking for an eye center. (Or, since it's not guaranteed that a complete eye will be contained in the available delay range, a sampling point sufficiently far from eye edges.)
     
  3. On the control channel, set the bit slip offset to the value between 3 and 12 that produces the expected training value. This covers all ten possibilities for phasing of the pixel data relative to the deserializer. Note that this requires registering and concatenating three deserialized bytes, rather than two as pictured in the bit slip example above.
     
  4. On each pixel channel, set the bit slip offset to the value closest to the control channel bit slip offset that produces the expected training value. It should be within ±3 of the control channel bit slip offset, since that's the maximum interchannel skew.
This only takes a fraction of a second, so it can easily be done on start-up or even in between captures to protect against temperature-dependent skew. By looking at the total delay represented by the delay tap values and bit slip offsets, it's clear that the CMV12000's interchannel skew is the dominant factor and that the trained delays roughly match the datasheet skew specification of 150ps per channel:

Total CMV12000 channel delays measured by training results.
That's the hard part of the source done, with less trouble than I expected. The output is a 60MHz px_clk and 65 10-bit values that update on that clock. This will be the interface to the middle of the pipeline, the wavelet engine. But I need to be able to test the sensor before that's complete, and more than 64 pixels at a time. Without the compression stage, though, that means writing data at full rate to external DDR4 attached to the ZU+. Although it's a throwaway test, I will need to write to that RAM (at a lower rate) after the compression stage anyway, so this would be good practice.

RAMMING SPEED

The ZU4CG version of the TE0803 has 2GB of 2400MT/s DDR4 configured as x64. That's over 150Gb/s of theoretical memory bandwidth, so the 38.4Gb/s CMV12000 data should be pretty easy. The DDR4 is attached to the PS side of the ZU+, though, and the dedicated DDR controller there is shared by many elements of the system, including the ARM cores. 

The CMV12000 front-end described above exists on the PL side. The fastest interface between the PL and the PS is a set of 128-bit AXI memory buses, exposed as the slave ports S_AXI_HPx_FPD to the PL. There are four such slave ports, but only a maximum of three can be simultaneously routed to the DDR controller:

Up to three 128-bit AXI memory buses can be dedicated to direct PL-PS DDR access.
The Ultrascale+ AXI might be able to go up to 333MHz, according to the datasheet, but 250MHz is the more common setting. That's okay - that's still 96Gb/s of theoretical bus bandwidth. But you can start to see why it's infeasible to store intermediate compression data in external RAM. Even 2.5 accesses per pixel would saturate the bus.

For this test, I set up some custom BRAM FIFOs to use for buffering between the hard-timed pixel input and the more uncertain AXI write transfer. To keep things simple, four adjacent channels share one 64b-wide FIFO, aligning their pixel data to 16 bits. All FIFO writes happen on the px_clk when the control channel indicates valid pixel data.

The other side of the FIFO is a little more confusing. I split channels 1-32 and 33-64 (8 FIFOs each) into two write groups, each with its own AXI master port with 32Gb/s of theoretical bandwidth. The bottom channels drive S_AXI_HP0_FPD and the top drive S_AXI_HP1_FPD, and I rely on the DDR controller to sort out simultaneous write requests.

Bottom channel RAM writing test pipeline, through BRAM FIFO buffers. Top channels are similar.
When the FIFO levels reach a certain threshold, a write transaction is started. Each transaction is 16 bursts of 16 beats of 16 bytes, and the 16 bytes of a beat are views of the FIFO output data. For simplicity, I just alternate between views of the 8 MSBs of 16 pixels to fill each 128-bit beat. I may stick the 2 LSBs from all 64 channels in their own view at some point, but for now I can at least confirm sensor operation with the 8 MSBs.

Without further ado, the first full image off the sensor:

What were you expecting?
It turned out better than I thought, even looking like a VHS tape on rewind as it does. There are both horizontal and vertical defects. The vertical defects were concentrated in one 128px-wide column, served by a single LVDS pair, so that was easily traceable to a marginal solder joint. The horizontal defects were more likely to be missing or corrupted RAM writes. They would change position every frame.

At first I suspected the DDR controller might be struggling to arbitrate between the two PL-PS ports and the ARM. The ARM might try to read program data while the image capture front-end is writing, incurring both a read/write turnaround penalty and a page change penalty. But in that case the AXI slave ports should exert back-pressure on their PL masters by deasserting the AWREADY signal, and I didn't see this happening. To further rule out ARM contention, I moved the ARM program into on-chip memory and disabled all but the two slave ports being used to write data to the DDR controller...still no good.

I also tried different combinations of pixel clock speed (down to 30MHz), AXI clock speed (down to 125MHz), burst size, and total transfer size with no real change. Even with only one port writing, the problem persisted. Then I tried replacing the image views with some FIFO debug info: input/output counters and the difference used to calculate the fill level. I had expected the difference to vary up and down by one or two address units since the counters run on different clocks, but I what I saw were cases where the difference was entirely wrong, possibly triggering bad transfers.

So what I had was a clock domain crossing problem. Rather than describe it in detail, I'll just link this article that I wish I had read beforehand. The crux of it is that the individual bits of the counter can't be trusted to change together and if you catch them in mid-transition during an asynchronous clock overlap, you can get results that are complete nonsense, not just off-by-one. The article details a comprehensive bidirectional synchronization method using Gray code counters, but for now I just tried a simple one-way method where the input counter is driven across the clock domain with an alternating "pump" signal:

Synchronization pump for FIFO input counter.
The pump is driven by the LSB of the input-side counter and synchronized to the AXI clock domain through a series of flip-flops. This only works if the output-side clock is sufficiently faster than the input-side clock that it can always detect every edge of the pump signal. That's the case here, with a 250MHz axi_clk and a 60MHz px_clk. The value of in_cnt_axi, the input counter pumped to the AXI clock domain, is what's compared to the output counter (which is already in the AXI clock domain) to evaluate the FIFO level and trigger AXI transfers. It's the right amount of simple for me, adding only a few flip-flops to the design.

And just like that, clean kerbal portraits.
In theory, I could read in about 170 frames this way (in 0.567s...). It currently takes me 30 seconds to get each frame off over JTAG, though, so I may want to get USB (SuperSpeed!) up and running first. More importantly, I can evaluate sensor stuff independent of the two other main challenges (wavelet pipeline and SSD sink). I'm actually surprised at the okay-ness of the raw image, but there is definitely some fixed pattern noise to contend with. I also want to try the multi-slope HDR mode, which should be great for fitting more dynamic range in the 10-bit data (with no processing on my end!).

I started with the source and sink because, even though they're the more known tasks, they represent external constraints that are actually show-stoppers if they don't work. Now I am confident in everything up to the pixel data hand-off on the source side. The sink side is still a mess, but the hardware has been checked at least. That leaves the more unknown challenge of the wavelet compression engine. But since it's entirely built from logic, with interfaces on both ends that I control, I'm actually less worried about it. In other words, it's nice to not have to think about whether or not to build something from scratch...