Thursday, June 13, 2019

Freight Train of Pixels

I have a problem. After any amount of time at any level of development of anything, I feel the urge to move down one layer into a place where I really shouldn't be. Thus, after spending time implementing capture software for my Point Grey FLIR block cameras, I am now tired of dealing with USB cables and drivers and firmware and settings.

What I want is image data. Pixels straight from a sensor. As many as I can get, as fast as I can get them. To quote Jeremy Clarkson from The Great Train Race (Top Gear S13E1), "Make millions of coals go in there." Except instead of coals, pixels. And instead of millions, trillions. It doesn't matter how. I mean, it does, but I want the only real constraints to be where the pixels are coming from and where they are going. So let's see what's in this rabbit hole.

The Source

The image sensor feeding this monster will be an ams (formerly CMOSIS) CMV12000. It's got lots of pros and a few cons for this type of project, which I'll get into in more detail. But the main reason for the choice is entirely non-technical: This is a sensor that I can get a full datasheet for and purchase without any fucking around. This was true even back in the CMOSIS days, but as an active ams part it's now documented and distributed the same way as their $1 ICs.
The CMV12000 is not $1, sadly, but you can Buy It Now if you really want. For prototyping, I have two monochrome ones that came from a heavily-discounted surplus listing. Hopefully they turn on.
This is a case, then, where the available component drives the design. The CMV12000 is not going to win an image quality shootout with a 4K Sony sensor, but it is remarkably fast for its resolution: up to 300fps at 4096x3072. That's 3.8Gpx/s, somewhere between the total camera interface on a Tesla Full Self Driving Chip (2.5Gpx/s) and the imaging rate of a Phantom Flex 4K (8.9Gpx/s). There was a version jump on this sensor that I think moved it into a different speed class, and that speed is what I'm designing the rest of this pipeline around.
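For a sanity check, here's the arithmetic. I'm assuming the 10-bit output mode, which I believe is the one that hits 300fps; it's also the only way the 38Gb/s figure later in this post works out.

```python
# Back-of-the-envelope data rates for the CMV12000 at full speed,
# assuming 10 bits per pixel.
width, height, fps = 4096, 3072, 300
bits_per_px = 10

px_rate = width * height * fps          # pixels per second
bit_rate = px_rate * bits_per_px        # bits per second

print(f"{px_rate / 1e9:.1f} Gpx/s")     # ~3.8 Gpx/s
print(f"{bit_rate / 1e9:.1f} Gb/s")     # ~37.7 Gb/s
```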

The CMV12000 is also a global shutter CMOS sensor, something more common in industrial and machine vision applications than in consumer cameras. The entire frame is sampled at once, instead of row-by-row as in rolling shutter CMOS. (The standupmaths video on the topic is my favorite.) The advantage is that moving objects and camera panning don't create distortion, which is arguably just correct behavior for an image sensor... A few pro cameras with global shutter have existed, but even those have mostly died out, due to an interlinked set of trade-offs that give rolling shutter designs the advantage in cost and/or dynamic range.

For engineering applications, though, a global shutter sensor with an external trigger is essentially a visual oscilloscope, and can be useful beyond just creating normal video. By synchronizing the exposure to a periodic event, you can measure frequencies or visualize oscillations well beyond the frame rate of the sensor. Here's an example of my global shutter Grasshopper 3 camera capturing the cycle of a pixel-shifting DLP projector. Each state is 1/720s in duration, but the trigger can be set to any multiple of that period, plus or minus a tiny bit, to capture the sequence with an effective frame rate much higher than 720fps.
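If the trigger arithmetic isn't obvious, here's the idea with made-up numbers; only the 1/720s state period is real.

```python
# Stroboscopic sampling of a periodic event with a slow global shutter camera.
# The trigger period is a whole number of event periods plus a small offset,
# so each frame catches the event at a slightly later phase.
event_period = 1 / 720          # seconds per DLP pixel-shift state (from above)
n = 8                           # whole event periods between triggers (arbitrary)
offset = 20e-6                  # phase step per frame, 20 us (arbitrary)

trigger_period = n * event_period + offset
effective_fps = 1 / offset      # one sample every 20 us of event phase

print(f"trigger at {1 / trigger_period:.1f} fps")   # ~89.8 fps, easy for the camera
print(f"effective rate {effective_fps:.0f} fps")    # 50000 fps of phase resolution
```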



Whether a consequence of the global shutter or not, the main on-paper shortcoming of the CMV12000 is the relatively high dark noise of 13e-. For comparison, the Sony IMX294CJK, the 4K sensor in some new cameras with very good low-light capability, is below 2e-. That's a rolling shutter sensor, though. Sony also makes low-noise global shutter CMOS sensors like the IMX253, at around 2.5e-. The extra noise on the CMV12000 will mean that it needs more light for the same image quality compared to these sensors.

Even given adequate light, the higher noise also eats into the dynamic range of the sensor. The signal-to-noise ratio for a given saturation depth will be lower. This means either noisy shadows or blown-out highlights. But the CMV12000 has a feature I haven't seen on any other commercially-available sensor: a per-pixel stepped partial reset. The theory is to temporarily stop accumulating charge on bright pixels when they hit intermediate voltages, while allowing dark pixels to keep integrating. Section 4.5.1 in this thesis has more on this method.

In the example below, the charge reading is simulated for 16 stops of contrast. With baseline lighting, the bottom four stops are lost in the noise and the top four are blown out. Increasing the illumination by 4x recovers two stops on the bottom, but loses two on top. The partial reset capability slows down the brightest pixels, recovering several more stops on top without affecting the dark pixels. The extra light is still needed to overcome the dark noise, but it's less of an issue in terms of dynamic range.
Dynamic range recovery using 3-stage partial reset.
The end result of partial reset is a non-linear pixel response to illumination. This is often done anyway, after the ADC conversion, to create log formats that compress more dynamic range into fewer bits per pixel. Having hardware that does something similar in-pixel, before the ADC, is a powerful feature that's not at all common.
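To make that concrete, here's a toy simulation of a three-stage partial reset. The reset times and clamp levels are numbers I made up for illustration; the real ones are sensor register settings.

```python
# Toy model of a stepped partial reset exposure, per the description above.
# At each reset time, any pixel whose accumulated signal is above the
# corresponding level gets clamped back down to that level; dark pixels are
# unaffected. Times and levels are made up.
full_well = 1.0                             # normalized saturation level
resets = [(0.50, 0.50),                     # (fraction of exposure, clamp level)
          (0.80, 0.55),
          (0.95, 0.60)]

def final_signal(rate):
    """Signal at readout for a pixel accumulating 'rate' full-wells per exposure."""
    signal, t = 0.0, 0.0
    for t_reset, level in resets:
        signal = min(signal + rate * (t_reset - t), full_well)
        signal = min(signal, level)         # the partial reset itself
        t = t_reset
    return min(signal + rate * (1.0 - t), full_well)

for k in range(9):                          # illumination spanning 8 stops
    rate = 0.05 * 2 ** k
    print(f"{rate:6.2f} full-wells/exposure -> {final_signal(rate):.3f}")
```

With these made-up settings, pixels up to about eight full-wells per exposure still come out below saturation, while the dim pixels stay on their normal linear response.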

Another aspect of the CMV12000 that helps with implementation is the pixel data interface: the data is spread out on 64 parallel LVDS output pairs that each serve a group of pixel columns. This extra-wide bus means more reasonable clock speeds: 300MHz DDR (600Mb/s) for full rate. A half-meter wavelength means wide intra-pair routing tolerances. There is still a massive 4.8ns inter-channel skew that has to be dealt with, but it would be futile to try to length match that. The sensor does put out training data meant for synchronizing the individual channels at the receiver, which is a headache I plan to have in the future.
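Per channel, the numbers work out like this (10-bit pixels assumed, as before):

```python
# Per-channel numbers for the 64-pair LVDS bus, assuming 10-bit pixels.
total_rate = 4096 * 3072 * 300 * 10       # ~37.7 Gb/s off the sensor
channels = 64

per_channel = total_rate / channels       # ~590 Mb/s, i.e. 300 MHz DDR
bit_period = 1 / 600e6                    # seconds per bit at the rated 600 Mb/s

print(f"{per_channel / 1e6:.0f} Mb/s per pair")
print(f"bit period {bit_period * 1e9:.2f} ns")
print(f"4.8 ns of skew = {4.8e-9 / bit_period:.1f} bit periods")   # ~2.9 bits
```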

The Sink

I'm starting from the assumption that it's impossible to really do anything permanent with 38Gb/s of data, if you're working with hardware at or below the level of a laptop PC. In an early concept, I was planning to just route the data to a PCIe x4 output and send it into something like an Intel NUC for further processing. But even that isn't fast enough for the CMV12000. (Also, you can buy something like that already. No fun.) And even if you could set up a 40Gb/s link to a host PC through something like Thunderbolt 3, it's really just kicking the problem down the road to more and more general hardware, which probably means more watts per bit per second.

Ultimately, unless the data is consumed immediately (as with a machine vision algorithm that uses one frame and then discards it), or buffered into RAM as a short clip (as with circular buffers in high-speed cameras), the only way to sink this much data reasonably is to compress it. And this is where this project goes off the rails a little.

For starters, I choose 1GB/s as a reasonable sink rate for the data. This is within reach of NVMe SSD write speeds, and makes for completely reasonable recording times of 17min/TB (at maximum frame rate). This is very light compression, as far as video goes - less than 5:1. I think the best tool for the job is probably wavelet compression, rather than something like h.265. It's intra-frame and uses relatively simple logic, which means fast and cheap. But putting aside the question of how fast and how cheap for now, I first just want to make sure the quality would be acceptable.
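The sink-side arithmetic, for the record:

```python
# Sink-side arithmetic for the 1 GB/s target, assuming 10-bit pixels as before.
raw_rate_bits = 4096 * 3072 * 300 * 10        # ~37.7 Gb/s off the sensor
sink_rate_bytes = 1e9                         # 1 GB/s to the SSD

ratio = raw_rate_bits / 8 / sink_rate_bytes   # ~4.7:1 compression
minutes_per_tb = 1e12 / sink_rate_bytes / 60  # ~16.7 min of recording per TB

print(f"compression ratio {ratio:.1f}:1")
print(f"{minutes_per_tb:.1f} min/TB at full frame rate")
```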

There are several good examples of wavelet compression already in use: JPEG2000 uses different wavelet variants for lossless and lossy image compression (a reversible 5/3 and an irreversible 9/7). REDCODE is wavelet-based, and 5:1 is a standard setting described as "visually lossless". CineForm is a wavelet codec recently open-sourced by GoPro. The SDK for CineForm includes a lightweight example project that just compresses a monochrome image with different settings. Running a test image through that with settings close to 5:1 produces good results:
The original monochrome image.
The wavelet transform outputs a 1/8-scale low-frequency thumbnail and three stages of quantized high-frequency blocks, which are sparse and easy to compress. I just zipped this image as a test and got a 5.7:1 ratio with these settings.
The recovered image.
Since these images are going to be destroyed by rescaling anyway, here's a 400% zoom of some high-contrast features.

The choice of wavelet type does matter, but I think the quantization strategy is even more important. The wavelet transform doesn't reduce the size of the data; it just splits it into low-frequency and high-frequency blocks. In fact, for all but the simplest wavelets, the blocks require more bits to store than the original pixels:
Output range maps for different wavelets. All but the simplest wavelets (Haar, Bilinear) have corner cases of low-frequency or high-frequency outputs that require one extra bit to store.
Take the CineForm 2/6 wavelet (a.k.a. reverse biorthogonal 1.3?) as an example: the low-frequency block is just an average of two adjacent pixels, so it doesn't need any more bits than the source data. But the high-frequency blocks look at six adjacent pixels and could, for some corner cases, give a result that's larger than the maximum pixel amplitude. It needs one extra bit to store the result without clipping. Seems like we're going in the wrong direction!
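Here's a quick 1D sketch of where that extra bit comes from. The filter coefficients are my assumption of the usual 2/6 pair (pair average for the low band, a six-tap difference for the high band); even if CineForm's exact definition differs slightly, the range argument works out the same way.

```python
import numpy as np

# One level of a 2/6-style wavelet on a 1D signal. Coefficients are my
# assumption of the standard 2/6 filter pair, not copied from CineForm.
lo = np.array([1, 1]) / 2
hi = np.array([-1, -1, 8, -8, 1, 1]) / 8

def analyze(x):
    """One wavelet level: returns (low band, high band), edges replicated."""
    xp = np.pad(np.asarray(x, float), (2, 2), mode='edge')
    low  = np.array([np.dot(lo, xp[i:i + 2])     for i in range(2, len(xp) - 2, 2)])
    high = np.array([np.dot(hi, xp[i - 2:i + 4]) for i in range(2, len(xp) - 2, 2)])
    return low, high

# Smooth regions produce zero high-band values; only the edge shows up.
low, high = analyze([100, 100, 100, 100, 100, 100, 500, 500, 500, 500, 500, 500])
print(low)    # [100. 100. 100. 500. 500. 500.]
print(high)   # [  0.   0. 100. 100.   0.   0.]

# But the worst-case high-band output for 10-bit input exceeds 10-bit range.
full_scale = 1023
print(full_scale * hi[hi > 0].sum())   # 1278.75, hence the extra bit
```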

As with most image compression techniques, the important fact is that the high-frequency information is less valuable, and can be manipulated and even discarded without as much visual penalty. By applying a deadband and quantization step to the high-frequency blocks, the data becomes more sparse and easier to compress. Since this is the lossy part of the algorithm, the details are hugely important. I have a little sandbox program that I use to play with different wavelet and quantization settings on test images. In most cases, 5:1 compression is very reasonable.
Different wavelets and quantizer settings can be compared quickly in this software sandbox.
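One plausible form of that deadband-plus-quantizer, just to pin down terms (not necessarily exactly what CineForm or my sandbox does):

```python
import numpy as np

# Deadband + uniform quantizer for the high-frequency bands. Values within
# +/- deadband go to zero; everything else is divided down by the step size.
def quantize(band, deadband=4, step=8):
    mag = np.maximum(np.abs(band) - deadband, 0)
    return np.sign(band) * (mag // step)

def dequantize(q, deadband=4, step=8):
    # Reconstruct to the middle of each quantizer bin.
    return np.sign(q) * (np.abs(q) * step + deadband + step / 2)

h = np.array([-1, 0, 2, -3, 5, -37, 113, 7, -6, 1])   # made-up high-band values
q = quantize(h)
print(q)                 # [0 0 0 0 0 -4 13 0 0 0], mostly zeros now
print(dequantize(q))     # the -37 and 113 come back as -40 and 112
```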
That's enough evidence for me that wavelet compression is a completely acceptable trade-off for opening up the possibility of sinking to a normal 1TB SSD instead of an absurd amount of RAM. A very fast RAM buffer is still needed to smooth things out, but it can be limited in size to just as many frames as are needed to ride out pipeline transients. Now, with the source and sink constraints defined, what the hell kind of hardware sits in the middle?

The Pipe

There was never any doubt that the entrance to this pipeline had to be an FPGA. Nothing else can deal with 64 LVDS channels. But instead of just repackaging the data for PCIe and passing it along to some poor single board computer to deal with, I'm now asking the FPGA to do everything: read in the data, perform the wavelet compression, and write it out to an SSD. This will ultimately be smaller and cheaper, since there's no need for a host computer, but it means a much fancier FPGA.

I'm starting from scratch here, so all of this is just an educated guess, but I think a viable solution lies somewhere in the spectrum of Xilinx Zynq UltraScale+ devices. They are FPGA fabric bolted to ARM cores in a single chip. Based on the source and sink requirements, I can narrow down further to something between the ZU4 and ZU7. (Anything below the ZU4 doesn't have the transceivers needed for PCIe Gen3 x4 to the SSD, and anything above the ZU7 is prohibitively expensive.) Within each ZU number, there are also three categories: CG has no extra hardware, EG has a GPU, and EV has a GPU and an h.264/h.265 codec.

In the interest of keeping development cost down, I'm starting with the bottom of this window, the ZU4CG. The GPU and video codec might be useful down the road for 30fps previews or making proxies, but they're too slow to be part of the main pipeline. Since they're fairly sideways-compatible, I think it's reasonable to start small and move up the line if necessary.

I really want to avoid laying out a board for the bare chip, its RAM, and its other local power supplies and accessories. The UltraZed-EV almost works, but it doesn't break out enough of the available LVDS pins. It's also only available with the ZU7EV, the very top of my window. The TE08xx Series of boards from Trenz Electronic is perfect, though, covering a wider range of the parts and breaking out enough IO. I picked up the ZU4CG version for less than the cost of just the ZU4CG on Digi-Key.
Credit card-sized TE0803 board with the ZU4CG and 2GB of RAM. Not counting the FPGA, the processing power is actually a good deal less than what's on a modern smartphone.
One small detail I really like about the TE0803 is that the RAM is wired up as 64-bit wide. Assuming the memory controller can handle it, that would be over 150Gb/s for DDR4-2400, which dwarfs even the CMV12000's data rate. I think the RAM buffer will wind up on the compressed side of the pipeline, but it's good to know that it has the bandwidth to handle uncompressed sensor data too, if necessary.
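The back-of-the-envelope for that:

```python
# Peak DDR4-2400 bandwidth for the 64-bit interface on the TE0803.
transfers_per_s = 2400e6          # DDR4-2400: 2400 MT/s
bus_width_bits = 64

peak = transfers_per_s * bus_width_bits
print(f"{peak / 1e9:.0f} Gb/s peak")                   # ~154 Gb/s
print(f"{peak / 37.7e9:.1f}x the sensor's raw rate")   # ~4x headroom
```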

Time for a motherboard:
The "tall" side has the TE0803 headers, an M.2 connector, USB-C, a microSD slot, power supplies, and an STM32F0 to act as a sort-of power/configuration supervisor. Sensor pins are soldered on this side.
The "short" side has just the sensor and some straggler passives that are under 1mm tall.
Aside from the power supplies, this board is really just a breakout for the TE0803, and the placement of everything is driven by where the LVDS- and PCIe-capable pins are. Everything is a differential pair, pretty much. There are a bunch of different target impedances: 100Ω for LVDS, 85Ω for PCIe Gen3, 90Ω for USB. I was happy to find that JLCPCB offers a standard 6-layer controlled-impedance stackup. They even have their own online calculator. I probably still fucked up somehow, but hopefully at least some of it is right so I can start prototyping the software.

Software? Hardware? What do you call FPGA logic? There are a bunch of somewhat independent tasks to deal with on the chip. At the input side, the pixel data needs to be synchronized using training data to deal with the massive 4.8ns inter-channel skew. The FPGA inputs have a built-in delay tap, but it maxes out at 1.25ns. You can, in theory, cascade these with the adjacent unused output delays, to reach 2.5ns. That's obviously not enough to directly cancel the skew, but it is enough to reach the next 300MHz clock edge. So, possibly some combination of cascaded hardware delays and intentional bit slipping can cover the full range. It's going to be a nightmare.
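The bookkeeping I have in mind looks something like this sketch: split each channel's measured skew into whole bit periods (absorbed by intentional bit slips) plus a fractional remainder (absorbed by the delay taps). The per-channel skew values below are made up; the real ones come from the sensor's training pattern.

```python
# Sketch of the skew-compensation bookkeeping. Whole bit periods get handled
# by bit slips after deserialization, the fractional remainder by the
# cascaded delay taps (up to ~2.5 ns). Skew values are made up.
bit_period_ns = 1 / 0.600         # 600 Mb/s -> ~1.67 ns per bit
max_tap_delay_ns = 2.5            # cascaded input + output delay budget

for skew_ns in [0.3, 1.1, 2.4, 3.6, 4.8]:
    slips = int(skew_ns // bit_period_ns)       # whole bits to slip
    residual = skew_ns - slips * bit_period_ns  # what the delay taps must cover
    assert residual <= max_tap_delay_ns
    print(f"skew {skew_ns:.1f} ns -> slip {slips} bit(s), delay {residual:.2f} ns")
```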

The output side might be even worse. Just look at the number of differential pairs going into the TE0803 headers vs. the number coming out. That's roughly the ratio of how much tighter the timing tolerance is on the PCIe outputs. The edge of one bit won't hit the M.2 connector until a couple more have already left the FPGA. In this case, I have taken the effort to length-match the pairs themselves. I won't know how close I am until I can do a loopback test.
Length matching the PCIe differential pairs to make up for the left turns and TE0803 routing.
Even assuming the routing is okay, there's the problem of NVMe. NVMe is an open specification for what lives on top of the PCIe PHY to control communication with the SSD. It's built into Linux, including versions that can run on the ZU4CG's ARM cores. But that puts the operating system in the pipeline, which sounds like a disaster; I haven't seen any examples of that running at anywhere near 1GB/s. I think hardware-accelerated NVMe might work, but as far as I can tell there are no license-free NVMe cores in existence. I don't have a solution to this problem yet, but I will happily sink hours into anything that prevents me from having to deal with IP vendors.

Sitting right in the middle, between these input and output constraints, is the complete mystery that is the wavelet core. This has to be done in hardware. The ARM cores and even the GPU are just not fast enough, and even if they were, accessing intermediate results would quickly eat the RAM bus. The math operations involved are so compact, though, that it seems natural to implement them in tiny logic/memory cores and then put as many of them in parallel as possible.

The wavelet cores are the most interesting part of this pipeline and require a separate post to cover in enough detail to be meaningful. I have a ton of references on the theory and a bit of a concept for how to turn it into lightweight hardware. As it stands, I know only enough to have some confidence that it will fit on the ZU4CG, in terms of both logic elements and distributed memory for storing intermediate results. (The memory requirement is much less than a full frame, since the wavelets only look ahead/behind a few pixels at a time.) But there is an immense amount of implementation detail to fill in, and I hope to make a small dent in that while these boards are in flight.
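As a rough sanity check on that memory claim (assumptions all mine): the vertical filter of a 2D wavelet stage only needs a handful of rows buffered, not the whole frame.

```python
# Rough line-buffer estimate for one wavelet stage (assumptions mine): a
# 6-tap vertical filter needs ~6 rows buffered. 16 bits per stored value is a
# guess that leaves headroom for the high-band bit growth.
width, height = 4096, 3072
bits_stored = 16
rows_buffered = 6                 # ~one filter support's worth of rows

full_frame_bits = width * height * bits_stored
line_buffer_bits = width * rows_buffered * bits_stored

print(f"full frame: {full_frame_bits / 8 / 2**20:.1f} MiB")        # ~24 MiB
print(f"one stage of line buffers: {line_buffer_bits / 8 / 2**10:.0f} KiB")  # ~48 KiB
```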

To summarize, I still have no clue if, how, or when any of this will work. My philosophy on this project is to send the pixels as fast as they want to go and try to remove anything that gets in the way. It's not really a plan - more of a series of challenges.