The next stop on the Freight Train of Pixels is the wavelet compression engine. Previously, I built up the CMV12000 input module, which turned out to be easier than I thought. The output of that module is a set of 64 10-bit pixels and one 10-bit control signal that update on a 60MHz pixel clock (px_clk). This is too much data to write directly to an NVMe SSD, so I want to compress it by about 5:1 in real-time on the XCZU4 Zynq Ultrascale+ SoC.
Wavelet compression seems like the right tool for the job, prioritizing speed and quality over compression ratio. It needs to run on the SoC's programmable logic (PL), where there's enough parallel computation and memory bandwidth for the task. This post will build up the theory and implementation of this wavelet compression engine, starting from the basics of the discrete wavelet transform and ending with encoded data streams being written to RAM. It's a bit of a long one, so I broke it into several sections:
- [The Discrete Wavelet Transform]
- [The Lifting Scheme]
- [Horizontal 2/6 DWT Core]
- [Vertical 2/6 DWT Core]
- [Input Extensions?]
- [Quantizer]
- [Wrapping Up] - First test video here.
- [More Information] - Source here, plus some references.
The Discrete Wavelet Transform
Suppose we want to store the row vector [150, 150, 150, 150, 150, 150, 150, 150]. All the values are less than 256, so eight bytes works. If this vector is truly from a random stream of bytes, that might be the best we can do. But real-world signals of interest are not random and a pattern of similar numbers reflects something of physical significance, like a stripe in a bar code. This fact can be leveraged to represent a structured signal more compactly. There are many ways to do this, but let's look specifically at the discrete wavelet transform (DWT).
A simple DWT could operate on adjacent pairs of data points, taking their sum (or average) and difference. This two-input/two-output operation would scan, without overlap, through the row vector to produce four averages and four differences, as shown below. Assuming no rounding has taken place, the resulting eight values fully represent the original data, since the process could be reversed. Furthermore, the process can be repeated in a binary fashion on just the averages:
After three levels, all that remains is a single average and seven zeros, representing the lack of difference between adjacent data points at each stage above. This is an extreme example, but in general the DWT will concentrate the information content of a structured signal into fewer elements, paving the way for compression algorithms to follow.
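Here's a minimal C sketch of this process (using sums rather than averages, so nothing is lost to rounding yet; names and layout are just for illustration):

```c
#include <stdio.h>

// One Haar DWT level: n/2 sums into out[0..n/2-1], n/2 differences after.
void haar_level(const int x[], int out[], int n)
{
    for (int i = 0; i < n / 2; i++) {
        out[i]         = x[2 * i] + x[2 * i + 1];   // sum
        out[n / 2 + i] = x[2 * i] - x[2 * i + 1];   // difference
    }
}

int main(void)
{
    int v[8] = {150, 150, 150, 150, 150, 150, 150, 150};
    int tmp[8];
    // Repeat the transform on just the sums: 8 -> 4 -> 2 -> 1.
    for (int n = 8; n >= 2; n /= 2) {
        haar_level(v, tmp, n);
        for (int i = 0; i < n; i++) v[i] = tmp[i];
    }
    for (int i = 0; i < 8; i++) printf("%d ", v[i]);  // 1200 0 0 0 0 0 0 0
    return 0;
}
```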
For image compression, it's possible to perform a 2D DWT by first transforming the data horizontally, then transforming the intermediate result vertically. This results in four outputs representing the average image as well as the high-frequency horizontal, vertical, and diagonal information. The entire process can then be repeated on the new 1/2-scale average image.
A three-stage 2D Haar DWT output. Green indicates zero difference outputs.
The DWT discussed above uses the simplest possible wavelet, the Haar wavelet, which only looks at two adjacent values for both the sum/average and the difference calculation. While this is extremely fast, it has relatively low curve-fitting ability. Consider what happens if the Haar DWT is performed on a ramp signal:
Instead of zeros, the ramp input produces a constant difference output. It's still smaller than the original signal, but not as good for compression as all zeros. It's possible to use more complex wavelets to capture higher-order signal structure. For example, a more complex wavelet may compute the deviation from the local slope instead of the immediate difference, which is back to zero for a ramp input:
To compute the "local slope", some number of data points before and after the pair being processed are needed. The sum/average operation may also be more complex and use more than two data points. One classification system for wavelets is based on how many points they use for the sum and difference operations, their support. The Haar wavelet would be classified as 2/2 (for sum/difference), and is unique in that it doesn't require any data beyond the immediate pair being processed. A wavelet with larger support can usually fit higher-order signals with fewer non-zero coefficients.
The better signal-fitting ability of wavelets with larger support comes at a cost, since each individual operation requires more data and more computation. Now is a good time to introduce the 2/6 wavelet that will be the focus of most of this post. It uses two data points for the sum/average, just like the Haar wavelet, but six for the difference. One way to look at it is as a pair of weighted sums being slid across the data:
The two-point sum is a straightforward moving average. The six-point difference is a little more involved: the four outer points are used to establish the local slope, which is then subtracted from the immediate difference calculated from the two inner points. This results in zero difference for constant, ramp, and even second-order signals. Although it's more work than the Haar wavelet, the computational requirement is still relatively low thanks to weights that are all powers of two.
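As a direct-form C sketch (the sign convention and the ×8 integer scaling on the difference are illustrative assumptions; the lifted form used in the actual hardware comes later):

```c
#include <stdint.h>

// Direct-form 2/6 DWT on one pixel pair, with the difference scaled by 8
// to stay in integers. x points at the even pixel of the pair; x[-2..3]
// must be valid (two extra points on each side).
void dwt26_pair(const int16_t *x, int *sum, int *diff8)
{
    // Two-point sum (a moving average, up to scale).
    *sum = x[0] + x[1];

    // Six-point difference: the immediate difference of the inner pair,
    // minus the local slope established by the two outer pairs. All
    // weights are powers of two: (1, 1, -8, 8, -1, -1) / 8, shown here
    // pre-multiplied by 8. Zero for constant, ramp, and quadratic input.
    *diff8 = 8 * (x[1] - x[0]) + (x[-2] + x[-1]) - (x[2] + x[3]);
}
```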
The 2/6 wavelet is used by the CineForm codec, which was open-sourced by GoPro in 2017. The GitHub page has a great explanation of how wavelet compression works, along with the full SDK and example code. There's also a very easy to follow stripped-down C example that compresses a single grayscale image. If you want to explore the entire history of CineForm, which overlaps in many ways with the history of wavelet-based video compression in general, it's definitely worth reading through the GoPro/CineForm Insider blog.
Besides the added computation, the wavelets other than the Haar also need data from outside the immediate input pair. The 2/6 wavelet, for example, needs two more data points to the left and to the right of the input pair. This creates a problem at the first and last input pair of the vector, where two of the data points needed for the difference calculation aren't available. Typically, the vector is extended by padding it with data that is in some way an extension of the nearby signal. This adds a notion of state to the system that can be its own logical burden.
Different ways to extend data for the 2/6 DWT operation at the start and end of a vector.
There are other subtle issues with the DWT in the context of compression. For one, if the data is not structured, the DWT actually increases the storage requirement. Consider all the possible sum and difference outputs for two random 8-bit data points under the Haar DWT. The sum range is [0:510] and the difference range is [-255:255], both seemingly requiring 9-bit results. The signal's entropy hasn't changed; the extra bits are disguising two new redundancies in the data: 1) The sum and difference are either both even or both odd, since their sum is twice the first input. 2) A larger difference implies a sum closer to 255, since an extreme difference forces both inputs toward opposite ends of their range.
Things get even a bit worse with the 2/6 DWT. If the difference weights are multiplied by 8 to give an integer result, the difference range for six random 8-bit values is [-2550:2550], which would require 13 bits to store. Since there are six inputs per two outputs in the difference operation, it's also harder to see the extra bits as representing some simple redundancies in the transformed data.
65536 random vectors fed through the 2/2 and 2/6 DWT give an idea of the output range.
The differences in a structured signal will still be concentrated heavily in the smaller values, so compression can still be effective. But it would be nice not to have to reserve so much memory for intermediate results, especially in a multi-stage DWT. Rounding or truncating the data after each sum and difference operation seems justified, since a lossy compression algorithm will wind up discarding least significant bits anyway. But it would be nice to keep the DWT itself lossless, deferring lossy compression to a later operation. In fact, there's an efficient algorithm for performing the DWT operations reversibly without introducing as much redundancy in the form of extra result bits.
The Lifting Scheme
What follows are a couple of simple examples of an amazing piece of math that's relatively modern, with the original publication in 1995. There's a lot to the theory, but one core concept is that a DWT can be built from smaller steps, each inherently reversible. For example, the Haar (2/2) DWT could be built from two steps as follows:
Included in the forward Step 2 is a truncation of the difference by right shift. This is equivalent to dividing by two and rounding down, and turns the sum output into an integer average, which only requires as many bits as one input. But, remarkably, no information is really discarded. The C code makes the inherent reversibility pretty clear. Even if you truncated more bits of the difference, it would be reversible. In fact, you could skip the sum entirely and just forward x0 to s.
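In C, the two steps and their inverse look something like this (a sketch; it assumes arithmetic right shift for negative values, which is technically implementation-defined but true of every mainstream compiler):

```c
#include <stdint.h>

// Forward 2/2 (Haar) lifting: two inherently reversible steps.
void haar_forward(int16_t x0, int16_t x1, int16_t *s, int16_t *d)
{
    *d = x1 - x0;           // Step 1: difference
    *s = x0 + (*d >> 1);    // Step 2: s = floor((x0 + x1) / 2)
}

// Inverse: undo the steps in reverse order. Nothing was lost, even
// though the difference was truncated before being added to x0.
void haar_inverse(int16_t s, int16_t d, int16_t *x0, int16_t *x1)
{
    *x0 = s - (d >> 1);     // undo Step 2
    *x1 = *x0 + d;          // undo Step 1
}
```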
The mechanism of the lifting scheme has effectively eliminated one of the two redundancies that was adding bits to the results of the 2/2 DWT. Specifically, the fact that the sum and difference were either both even or both odd has been traded in for a one-bit reduction in the range of the sum output. Likewise, it can help reduce the range of the difference output of the 2/6 DWT without sacrificing reversibility:
One additional step is added that calculates the local slope from two adjacent sum outputs. (The symbols z⁻¹ and z are standard notation for "one sample behind" and "one sample ahead".) After the subtraction, a round-down-bias-compensating constant of two is added in and then the result is right shifted by two, effectively dividing it by four. The result is similar to the four outer 1/8-weighted terms of the six-point difference, but with rounding. However, because this whole step is done identically in the forward and reverse direction, it's still fully reversible.
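A sketch of this step in C, with the sign conventions as assumptions:

```c
#include <stdint.h>

// Step 3 of the 2/6 lifting: correct the pair's difference for the local
// slope, estimated from the sum (integer average) outputs of the pairs
// one behind (z^-1) and one ahead (z). The +2 compensates the round-down
// bias of the >>2.
int16_t lift26_forward3(int16_t d, int16_t s_behind, int16_t s_ahead)
{
    int16_t slope = (int16_t)((s_ahead - s_behind + 2) >> 2);
    return (int16_t)(d - slope);
}

// The identical slope is computed and added back in the inverse
// direction, so the step is fully reversible despite the rounding.
int16_t lift26_inverse3(int16_t d_out, int16_t s_behind, int16_t s_ahead)
{
    int16_t slope = (int16_t)((s_ahead - s_behind + 2) >> 2);
    return (int16_t)(d_out + slope);
}
```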
The calculation of the local slope from two adjacent outputs instead of four adjacent inputs highlights another important efficiency feature of the lifting scheme: intermediate results get reused in later lifting steps. This reduces the total number of shift/add operations (or multiply/add operations, for more complex wavelets). It also means that once the immediate sum and difference steps have been performed on a particular pixel pair, that input pair is no longer needed and its memory can be used to store intermediate results instead. (Fully in-place computation is possible.)
The 2/6 lifting scheme construction as described above will be the basis for the hardware implementation to follow. One important consideration is that the real implementation must be causal, so the "one sample ahead" (z) component of the local slope calculation implies a latency of at least one pixel pair from input to output. This has different consequences for the horizontal DWT, which operates in the same dimension as the sensor readout scan, and the vertical DWT, which must wait for enough completed rows. For this and other reasons, the hardware implementations for each dimension can wind up being quite different.
Horizontal 2/6 DWT Core
For running a multi-stage 2D DWT at 3.8Gpx/s on a Zynq Ultrascale+ SoC, the bottleneck isn't really computation, it's memory access. Writing raw frames to external (PS-side) DDR4 RAM and then doing an in-place 2D DWT on them is infeasible. Even the few Tb/s of BRAM bandwidth available on the PL-side needs to be carefully rationed! For that reason, I decided I wanted the Stage 1 horizontal cores to run directly on the 64 pixel input streams, using only distributed memory. Something like this:
Because the DWT operates on pixel pairs, one register is required to store the even pixel. Then, all the action happens on the odd pixel's clock cycle, as modeled in the sketch after this list:
- Combinational block A does Step 1 and Step 2 of the 2/6 DWT lifting scheme as described above, placing its results in registers D_0 and S_0.
- S_0 and D_0's previous values are shifted into S_1 and D_1.
- S_1's previous value is shifted into S_out.
- Combinational block B does Step 3 of the 2/6 DWT lifting scheme, using S_0 and S_out's previous values to compute the local slope and subtract it from D_1. The result is placed in D_out.
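Here's a behavioral C model of the core (register names follow the list above; bit widths and sign conventions are assumptions, and the real implementation is Verilog):

```c
#include <stdint.h>

// Behavioral model of one Stage 1 horizontal core, "Normal" state only.
typedef struct {
    int16_t x_even;          // even pixel captured on the previous px_clk
    int16_t s0, d0;          // newest sum/difference pair
    int16_t s1, d1;          // one pair behind
    int16_t s_out, d_out;    // two pairs behind; fully lifted outputs
} h_core;

// Runs once per pixel pair, on the odd pixel's clock cycle.
void h_core_update(h_core *c, int16_t x_odd)
{
    // Combinational block A: Steps 1 and 2 on the incoming pair.
    int16_t d_new = (int16_t)(x_odd - c->x_even);
    int16_t s_new = (int16_t)(c->x_even + (d_new >> 1));

    // Combinational block B: Step 3 on the middle pair, using the sums
    // on either side of it (previous values of S_0 and S_out, read here
    // before the register shifts below).
    int16_t slope = (int16_t)((c->s0 - c->s_out + 2) >> 2);
    int16_t d_lifted = (int16_t)(c->d1 - slope);

    // Register shifts, all "simultaneous" on the clock edge.
    c->d_out = d_lifted;     // lifted difference output
    c->s_out = c->s1;        // sum output, two pairs behind the input
    c->s1 = c->s0;  c->d1 = c->d0;
    c->s0 = s_new;  c->d0 = d_new;
}
```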
This is a tiny core: the seven 16-bit registers get implemented as 112 total Flip-Flops (FFs) and the combinational logic takes around 64 Look-Up Tables (LUTs) for its four 16-bit adders. And it has to be tiny, because not only are there 64 pixel input streams, but each stream services two color fields (in the Bayer-masked color version of the sensor). So that's 128 Stage 1 horizontal cores running in parallel. The good news is that they only need to run at px_clk frequency, 60MHz, which isn't much of a timing burden.
Unfortunately, that tiny core was a little too good to be true. The 64 pixel input streams from the CMV12000 are divided up into two rows and 32 columns. Within each column, there are only 128 adjacent pixels, and only 64 of a given color field. Remember the part about the 2/6 DWT requiring input extensions of two samples at the beginning and end of the stream? Now, each core would need logic to handle that. But unlike the actual beginning and end of a row, in most cases the required data actually does exist, just not in the same stream as the one being processed by the core. A better solution, then, is to tie adjacent cores together with the requisite edge pair data:
Now, the horizontal core can be in one of three states: Normal, during which the core has access to everything it needs from within its own column. Last Out, where the core needs to borrow the sum of the first pixel pair from the core to its right (S_pp0_fromR). And First Out, where it needs to borrow the first two sums and the first difference from the core to its right (S_pp0_fromR, S_pp1_fromR, and D_pp0_fromR). The active state is driven by a global pixel counter.
In addition to the extra switching logic (implemented as LUT-based multiplexers), the cores need to store their first and second pixel pair sums (S_pp0, S_pp1) and first pixel pair difference (D_pp0) for later use by the adjacent core. One more register, S_2, is also added as a dedicated local sum history to allow the output sum to be the target of one of the multiplexers. The total resource utilization of the Stage 1 horizontal core is 176 FFs (11x16b registers) and 107 LUTs. That's still pretty small, but 128 cores in parallel eats up about 15% of the available logic in the XCZU4.
Luckily, things get a lot easier in Stage 2 and Stage 3, which only need 32 and 16 horizontal cores, respectively. They're also fed whole pixel pairs from the stage above them, so the X_even register isn't needed. Otherwise, they operate in a similar manner to the Stage 1 core shown above.
Vertical 2/6 DWT Core
While I can get away with using only distributed memory for the horizontal cores, the vertical cores need to store several entire rows. This means using Block RAM (BRAM), which is dedicated synchronous memory available on the Zynq Ultrascale+ PL side. The XCZU4 has 128 BRAMs, each with 36Kib of storage. Each color field is 2048px wide, so a single BRAM could store a single row of a single color field (2048 · 16b = 32Kib). I settled on storing eight rows, requiring a total of 32 BRAMs for the four color fields of Stage 1.

Storing whole rows in each BRAM is the wrong way to do things, though. The data coming from the CMV12000 is primarily parallel by column, and preserving that parallelism for as long as possible is the key to going fast. If all 32 Stage 1 horizontal cores of a given color field had to access a single BRAM every time a new pixel pair was ready at their output (once every four px_clk), it would require eight write accesses per px_clk. While 480MHz is theoretically in-spec for the BRAM, it would make meeting timing constraints much harder. It's much better to divide up the BRAMs into column-parallel groups, each handling 1/8 of the total color field width and receiving data from only four horizontal cores (just a 60MHz write access).
Stage 1 parallel architecture for a single color field. BRAM writes round-robin through four horizontal cores at 60MHz.

Now, each BRAM stores all eight rows for its column group. The vertical 2/6 DWT core can be built around the BRAM, with the DWT operations running on six older rows while two newer rows are being written with horizontal core data. Doing six reads per two writes (180MHz read access) isn't too bad, especially since the BRAM has independent read and write ports. But I can do even better by using a 64-bit read width and processing two pixel pairs at a time in the vertical core. To avoid dealing with a 3/2-speed clock, I decided to implement the reads as six of eight states in a state machine that runs at double px_clk (120MHz).
Since it has access to all the data it needs in the six oldest rows, the vertical core can actually be pretty simple. Input extensions are only needed at the top and bottom of a frame (sort-of, see below); no data is needed from adjacent cores. It's not as computationally efficient as the horizontal core: Step 1 and Step 2 are repeated (via combinational block A) three times on a given row pair as it gets pushed through. This is done intentionally to avoid having to write back local sums to the BRAM, so the vertical core only needs read access.
Since the vertical core operates on 64-bit registers (two pixel pairs at a time), all the multiplexers and adders are four times as wide, giving a total LUT count of 422, roughly four times that of the horizontal core. This seems justified, with a ratio of 4:1 for horizontal:vertical cores in Stage 1. Still, it means the complete Stage 1 uses about 30% of the total logic available on the XCZU4. I do get a bit of a break on the FF count, since this core only has six 64-bit registers (384 FFs, much less than four times the horizontal core FF count).
The output of the vertical core is two 64-bit registers containing the vertical sum and difference outputs of two pixel pairs. Because of the way the horizontal sums and differences are interleaved in the BRAM rows, this handily creates four 32-bit pixel pairs to route to either the next wavelet stage (for the LLx pair) or the compression hardware (for the HLx, LHx, and HHx pairs). The LLx pair becomes the input for the next stage horizontal cores.
At each stage, the color field halves in width and height. But, the vertical cores need to store eight rows at all stages, so only the width reduction comes into play. Thus, Stage 2 needs 16 vertical cores (four per color field) and Stage 3 needs eight (two per color field). The extra factor of two is not lost completely, though: at each stage, the write and read access rates of the vertical core BRAM are also halved. In total, the three-stage architecture looks something like this:
Vertical core outputs that don't go to a later stage (HHx, HLx, LHx, and LL3) are sent to the compression hardware, which is where we're heading next as well, after tying up some loose ends.
Input Extensions?
I mentioned that the horizontal cores are chained together so that the DWTs can access adjacent data as needed within a row, but what happens with the left-most and right-most cores, at the beginning and end of a row? And what happens at the start and end of a frame, in the vertical cores? Essentially, nothing:
- For the horizontal DWT, the left-most and right-most cores are just tied to each other as if they were adjacent, a free way to implement periodic extension.
- For the vertical DWT, the BRAM contents are just allowed to roll over between frames. The first rows of frame N reference the last rows of frame N-1.
This isn't necessarily the optimal approach for an image; symmetric extension will produce smaller differences, which are easier to compress. But, it's fast and cheap. It's also reversible if the inverse DWT implements the same type of extension of its sum inputs. Even the first rows of the first frame can be recovered if the BRAM is known to have started empty (free zero padding).
Quantizer
If you take a typical raw image straight off a camera sensor, run it through a multi-stage 2D DWT, and then zip the result, you'll probably get a compression ratio of around 3:1. This isn't as much a metric of the compression method itself as of the image, or maybe even of our brains. We consider 50dB to be a good signal-to-noise ratio for an image. No surprise, this gives an RMS noise of just under 1.0 in the standard 8-bit display depth of a single color channel. But a 10- or 12-bit sensor will probably have single-digit noise, which winds up on the DWT difference outputs. This single-digit noise is distributed randomly on the frame and requires a few bits per pixel to encode, limiting the lossless compression ratio to around 3:1.
To get to 5:1 or more, it's necessary to introduce a lossy stage in the form of re-quantization to a lower bit depth. It's lossy in the sense of being mathematically irreversible, unlike the quantization in the lifting scheme. Information will be discarded, but we can be selective about what to discard. The goal is to reduce the entropy with as little effect on the subjective visual quality of the image as possible. Discarding bits from the difference channels, especially the highest-frequency ones (HH1, HL1, LH1), has the least visual impact on the final result. This is especially true if the least significant bits of those channels are lost in sensor noise anyway.
Bits can be discarded from the difference channels with far less visual impact than discarding bits from the average.
In a sense, the whole purpose of doing the multi-stage DWT was to separate out the information content into sub-bands that can be prioritized by visual importance:
- LL3. This is kept as raw 10- or 12-bit sensor data.
- LH3, HL3, and HH3.
- LH2, HL2, and HH2.
- LH1, HL1, and HH1.
Arguably, the HHx sub-bands are of lower priority than the LHx and HLx sub-bands on the same level. This would prioritize the fidelity of horizontal and vertical edges over diagonal ones. With these priorities in mind, each sub-band can be configured for a different level of re-quantization in order to achieve a target compression ratio.
In C, the fastest way to implement a configurable quantization step would be a variable right shift, causing the signal to be divided by {2, 4, 8, ...} and then rounded down. But implementing a barrel shifter in LUT-based multiplexers eats resources quickly if you need to process a bunch of 16-bit variable right shifts in parallel. Fortunately, there are dedicated multipliers distributed around the PL-side of the Zynq Ultrascale+ SoC that can facilitate this task. These DSP slices have a 27-bit by 18-bit multiplier with a full 45-bit product register. To implement a configurable quantizer, the input can be multiplied by a software-set value and the output can be the product, right shifted by a constant number of bits (just a bit-select):
This is more flexible than a variable shift, since the multiplier can be any integer value. Effectively, it opens up division by non powers-of-two. For example, 85/256 for an approximate divide-by-3. The DSP slices also have post-multiply logic that can be used, among many other things, to implement different rounding strategies. An unbiased round-toward-zero can be implemented by adding copies of the input's sign bit to the product up to (but not including) the first output bit:
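A C model of the quantizer, using an 8-bit output shift to match the n/256-style settings quoted below (the shift amount and names are assumptions):

```c
#include <stdint.h>

#define Q_SHIFT 8   // constant output bit-select: product >> 8

// Model of one DSP-slice quantizer: multiply by a software-set value q,
// then take the product shifted right by a fixed amount. Adding the
// input's sign bit, replicated across the bits below the first output
// bit, turns the floor division into unbiased round-toward-zero.
int32_t quantize(int16_t x, uint16_t q)
{
    int32_t product = (int32_t)x * q;
    int32_t round   = (x < 0) ? ((1 << Q_SHIFT) - 1) : 0;  // sign bit x 8
    return (product + round) >> Q_SHIFT;                   // assumes
}                                                          // arithmetic shift
```

With q = 32, this reproduces the 32/256 setting used in the test clip below (a divide-by-8); with q = 85, the approximate divide-by-3 mentioned above.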
The XCZU4 has 728 DSP slices distributed throughout its PL-side, so dedicating a bunch of these to the quantization task seems appropriate. The combined output of all the wavelet stages is 64 16-bit values per px_clk, so 64 DSPs would do the trick with very little timing burden. But there's a small catch that has to do with how the final quantized values get encoded.
So far we've done a terrible job at reducing the bit rate: the sensor's 37.75Gb/s (4096 · 3072 · 10b · 300fps) has turned into 61.44Gb/s at the quantizer (64 · 16b · 60MHz). But, if the DWT and quantizer have worked, most of the 16-bit values in HHx, HLx, and LHx should be zeros. A lot will be ±1, fewer will be ±2, fewer ±3, and so on. There will be patches of larger numbers for actual edges, but the probability will be heavily skewed toward smaller values.
If the new probability distribution has an entropy of less than 2.00 bits, it's theoretically possible to achieve 5:1 compression. One way to do this is with a variable-length code, which maps inputs with higher probability to outputs with fewer bits. A known or measured probability distribution of the inputs can be used to create an efficient code, such as in Huffman coding. Adapting the code as the probability distribution changes is a little beyond the amount of logic I want to add at the moment, so I will just take an educated guess at a fixed code that will work okay given the expected distributions:
This encoder looks at four sequential quantized values and determines how many bits are required to store the largest of the four. It then encodes that bit requirement in a prefix and concatenates the reduced-width codes for each of the four values after that. This is all relatively fast and easy to do in discrete logic. Determining how many bits are required to store a value is similar to the find first set bit operation. Some bit bins are grouped to reduce multiplexer count for constructing the coded output.
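In C, the idea looks something like this (the actual prefix code and bin grouping aren't reproduced here; this sketch just emits the field width as a plain 4-bit prefix to show the structure):

```c
#include <stdint.h>

// Bits needed to store v in two's complement: find-first-set of |v|
// plus a sign bit. bits_needed(0) = 0.
static int bits_needed(int16_t v)
{
    int m = (v < 0) ? -v : v;
    int b = 0;
    while (m) { b++; m >>= 1; }
    return v ? b + 1 : 0;
}

// Encode four sequential quantized values: a prefix giving the field
// width, then four fields of that width. Assumes the quantized values
// fit in 15 bits, so the whole code fits in 64 bits.
static uint64_t encode4(const int16_t q[4], int *out_bits)
{
    int width = 0;
    for (int i = 0; i < 4; i++) {
        int b = bits_needed(q[i]);
        if (b > width) width = b;
    }
    uint64_t code = (uint64_t)width;                // 4-bit prefix
    uint64_t mask = width ? ((1u << width) - 1) : 0;
    int pos = 4;
    for (int i = 0; i < 4; i++) {
        code |= ((uint64_t)q[i] & mask) << pos;     // reduced-width field
        pos += width;
    }
    *out_bits = pos;                                // 4 + 4*width bits
    return code;
}
```

Four zeros cost just four bits (1 bpp for that group), while a group containing one large value pays the full width on all four fields.

Applying this code to the three example probability distributions above gives okay results: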
- LH1: 1.38bpp (7.26:1) compared to 1.14bpp (8.77:1) theoretical.
- HL1: 1.39bpp (7.20:1) compared to 1.06bpp (9.43:1) theoretical.
- HH1: 2.41bpp (4.16:1) compared to 1.84bpp (5.44:1) theoretical.
To get closer to the theoretical maximum compression, more logic can be added to revise the code based on observed probabilities. It might also be possible to squeeze out more efficiency by using local context to condition the encoder, almost like an extra mini wavelet stage. But since this has to go really fast, I'm willing to trade compression efficiency for simplicity for now.
I emphasized that the four quantized values need to be sequential, i.e. from a single continuous row of data. The probability distribution depends on this, and the parallelization strategy of the quantizers and encoders must enforce it. The vertical core outputs are all non-adjacent pixel pairs, either from different levels of the DWT, different color fields within a level, or different columns within a color field. So, while a single four-pixel quantizer/encoder input can round-robin through a number of vertical core outputs, the interface must store one previous pixel pair. I decided to group them in 128-bit compressor cores, each with two 4x16b quantizers and encoders:
Each input is a 32-bit pixel pair from a single vertical core output. During the "even" phase, previous pixel pair values are saved in the in_1[n] registers. During the "odd" phase, the quantizers and encoders round-robin through the inputs, encoding the current and previous pixel pairs of each. A final step merges the two codes and repacks them into a 192-bit buffer. When this fills past 128 bits, a write is triggered to a 128-bit FIFO (made with two BRAMs) that buffers data for DDR4 RAM writing, using a similar process as I previously used for raw sensor read-in.
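The repacking is a standard bit-FIFO pattern. A sketch in C (buffer widths follow the description above; the FIFO write is stubbed out and names are mine):

```c
#include <stdint.h>

// Model of the 192-bit repacking buffer: variable-length codes are
// appended at the current bit position; once the fill passes 128 bits,
// the low 128 bits go to the BRAM FIFO and the remainder shifts down.
typedef struct {
    uint64_t w[3];   // 192 bits, w[0] holds the lowest bits
    int fill;        // current fill level in bits
} packer;

static void fifo_write128(const uint64_t w[2]) { (void)w; /* stub */ }

static void pack_bits(packer *p, uint64_t code, int len)  // len <= 64
{
    int word = p->fill >> 6, off = p->fill & 63;
    p->w[word] |= code << off;
    if (off && word < 2) p->w[word + 1] |= code >> (64 - off);
    p->fill += len;

    if (p->fill > 128) {               // flush the low 128 bits
        fifo_write128(p->w);
        p->w[0] = p->w[2];             // move the remainder down
        p->w[1] = p->w[2] = 0;
        p->fill -= 128;
    }
}
```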
The eight-input compressor operates on two pixel pairs simultaneously at each px_clk, covering eight inputs in four px_clk. This matches up well to the eight vertical cores per color field of Stage 1, which update their outputs every fourth px_clk. In total, Stage 1 uses 12 compressor cores: four color fields each for HH1, HL1, and LH1 pixel pair outputs. Things get a little more confusing in Stage 2 and Stage 3, where the outputs are updated less frequently. To balance the load before it hits the RAM writer, it makes sense to increase the round-robin by 2x at each stage. When load-balanced, the overall mapping winds up using 16 compressor cores:
Each core handles 1/16 of the 61.44Gb/s output from the DWT stages, writing compressed data into its BRAM FIFO. An AXI Master module checks the fill level of each FIFO and triggers burst writes to PS DDR4 RAM as needed. A single PL-PS AXI interface has a maximum theoretical bandwidth of 32Gb/s at 250MHz, which is why I needed two of them in parallel for writing raw images to RAM. But since the target compression ratio is 5:1, I should be able to get away with a single AXI Master here. In that case, the BRAM FIFOs are critical for handling transient entropy bursts.
And that's it, we've reached the end of this subsystem. The 128b-wide AXI interface to the PS-side DDR controller is the final output of the wavelet compressor. What becomes of the data after that is a problem for another day.
Wrapping Up
Even though most of the cores described above are tiny, there are just so many of them running in parallel that the combination of CMV12000 Input stage and Wavelet stages described here already takes up 70% of the XCZU4's LUTs, 39% of its FFs, and 69% of its BRAMs.
I'm sure there's still some room for optimization. Many of the modules are running well below their theoretical maximum frequencies, so there's a chance that some more time multiplexing of certain cores could be worthwhile. But I think I would rather stick to having many slow cores in parallel. The next step up would probably be the XCZU6, which has more than double the logic resources. It's getting more expensive than I'd like, but I will probably need it to add more pieces:
- The PCIe bridge and, if needed, NVMe hardware acceleration.
- Decoding and preview hardware, such as HDMI output generation. This can be much less parallel, since it only needs to run at 30fps. Maybe the ARMs can help.
- 128 more Stage 1 horizontal cores to deal with the sensor's 2048x1536 sub-sampling mode, where four whole rows are read in at once. This should run at above 1000fps.
For now, though, let's run this machine:
That's a quick test at 4096x2304, with all the quantizers (other than LL3) set to 32/256 and the frame rate maxed out (400fps at 16:9). This results in an overall compression ratio of about 6.4:1 for this clip. The second half of the clip shows the wavelet output at different stages, although the sensor noise wreaks havoc on the H.264. It's hard to tell anything from the YouTube render, but here's a PNG frame (full size):
No Kerbals were harmed in the making of this video.
The areas of high contrast look pretty good. I'm happy with the reconstruction of the text and QR code on the label. The smoke looks fine - it's mostly out of focus anyway. Wavelet compression is entirely intraframe, so things like smoke and fire don't produce any motion artifacts. There are a few places where the wavelet artifacts do show up, mostly in areas with contrast between two relatively dark features. For example on the color cube string:
Probably, HH3, HL3, and LH3 should get back one or two bits (quantizer setting of 64-128). On the other hand, HH1 might be able to go down one more bit since it looks like it's still encoding mostly sensor noise. I'm not sure how much the quantizer settings will need to change from scene to scene, or even from frame to frame, but overall I think it'll be easy to maintain good image quality at a target ratio of 5:1. I also have a few possibilities for improving the compression efficiency without adding much more logic, such as adding some local context-based prediction to the quantizer DSP's adder port.
I'll probably take a break from wavelets for now, though, and return to the last big component of this system: the NVMe SSD write. Now that the data is below 1GB/s, I need a place to put it. The RAM will still be used as a big FIFO frame buffer, but ultimately the goal is to continuously record. I also want to drop in a color sensor at some point, since this wavelet compression architecture is really meant for that (four independent color fields). Glad to have the most unknown subsystem completed though!
That's way too much information for a single post, but oh well. I'll just end with some wavelet art from the incomplete last frame of the sequence above.
It's interesting to see what uninitialized RAM looks like when run through the decoder and inverse DWT.
More Information
I put the Verilog source for the modules described above here. There isn't enough there to clone a working project from scratch, but you can take a look at the individual cores. Keep in mind that these are just my implementations and I am far from a Verilog master. I would love to see how somebody with 10,000 hours of Verilog experience would write the modules described above.
Here are some unordered references to other good wavelet compression resources:
- The GoPro CineForm SDK and Insider Blog have a ton of good discussion of wavelet compression, including a history of the CineForm codec going back to 2005.
- This paper, by some of the big names, lays the foundation for reversible integer-to-integer wavelet transforms, including the one described above: R. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo, “Wavelet Transforms That Map Integers to Integers,” Applied and Computational Harmonic Analysis, vol. 5, pp. 332-369, 1998.
- The Wavelet Browser, by PyWavelets, has a catalog of wavelets to look at. Interestingly, what I've been calling the 2/6 wavelet is there called the reverse biorthogonal 1.3 wavelet.
- This huge set of course notes on JPEG2000, which is also wavelet-based, has a lot of information on the 5/3 and 9/7 wavelets used there, as well as quantization and encoding strategies.