## Saturday, September 28, 2019

### Fast atan2() alternative for three-phase angle measurement.

Normally, to get the phase angle of a set of (assumed balanced) three-phase signals, I'd do a Clarke Transform followed by a atan2(β,α). This could be atan2f(), for single-precision floating-point in C, or some other approximation that trades off accuracy for speed. The crudest (and fastest) of these is a first-order approximation atan(x) ≈ (π/4)·x which has maximum error of ±4.073º over the range {-1 ≤ x ≤ 1}:
Interestingly, this isn't the best (minimax or least mean square) linear fit over that range. But it's pretty good and has zero error on both ends, so it can be stitched together into a continuous four-quadrant approximation that covers all finite inputs to the two-argument atan2(β,α):
One common implementation determines the quadrant based on α and β and then runs the linear approximation on either x = β/α or x = α/β, whichever is in the range {-1 ≤ x ≤ 1} in that quadrant. The combination of a quadrant offset and the local linear approximation determines the final result.

It's possible to extend this method to three inputs, a set of three-phase signals assumed to be balanced. Instead of quadrants, the input domain is split based on the six possible sorted orders of the three-phase signals. Within each sextant, the middle input (the one crossing zero) is divided by the difference of the other two to form a normalized input, analogous to selecting x = β/α or x = α/β in the atan2() implementation:
This normalized input, which happens to range from -1/3 to 1/3, is multiplied by a linear fit constant to create the local approximation. To follow the pattern of the four-quadrant approximation, a constant of π/2 gives a fit that's not (minimax or least mean square) optimal, but stitches together continuously at sextant boundaries. As with the atan2() implementation, the combination of a sextant offset and the local approximation determine the final result.
For this three-phase approximation the maximum error is ±1.117º, significantly lower than the four-quadrant approximation. If starting from three-phase signals anyway, this method may also be faster, or at least nearly the same speed. The conditional section for selecting a sextant is more complex, but there are fewer intermediate math operations. (Both still have the single pesky floating-point divide for normalization.)

To put this to the test, I tried directly computing the phase of the three flux observer signals on TinyCross's dual motor drive. This usually isn't the best way to derive sensorless rotor angle: An angle tracking observer or PLL-type method can do a better job at filtering out noise by enforcing physical bandwidth constraints. But for this test, I just compute the angle directly using either atan2f(β,α) or one of the two approximations above.

 Computation times for different angle-deriving algorithms.
The three-phase approximation does turn out to be a little faster in this case. To keep the comparison fair, I tried to use the same structure for both approximations: the quadrant/sextant selection conditional runs first, setting bits in a 2- or 3-bit code. That code is then used to look up the offset and the numerator/denominator for the the local linear approximation. This is running on an STM32F303 at 72MHz. The PWM loop period is 42.67μs, so a 1.5-2.0μs calculation per motor isn't too bad, but every cycle counts. It's also "free" accuracy improvement:

The ±4º error ripple in the four-quadrant approximation shows up clearly in real data. The smaller error in the three-phase approximation is mostly lost in other noise. When the error is taken with respect to a post-computed atan2f(), the four-quadrant approximation looks less noisy. But I think this is just a mathematical symptom. When considering error with respect to an independent angle measurement (from Hall sensor interpolation), they show similar amounts of noise.

I don't have an immediate use for this, since TinyCross is primarily sensored and the flux signals are already synchronously logged (for diagnostics only). But clock cycle hunting is a fun hobby.

## Tuesday, September 24, 2019

### TinyCross: First Test Drive and Synchronous Data Logging

With the front wheel drive complete and the steering wheel control board working, it's finally time for a first test drive:

I've been waiting over a year to see if this mountain bike air shock suspension setup would work, and it looks like it does! I haven't done any tuning on it besides setting the preload, but it handles my pretty beat up parking lot nicely, absorbing bumps that would have broken tinyKart in minutes. The steering linkage also seems okay, with good travel and minimal bump steer. There are still some minor mechanical improvements I want to make, but it's nice to see the suspension concept in action after all this time.

I started with front wheel drive so I could see if the motor drive had any Flame Emitting Transistors, but happily it did not. It's the same gate drive design that I use on everything and it always just works, so I shouldn't be surprised anymore. But I am asking a lot of the FDMT80080DC FETs (just one per leg), so I'm working my way up to 120A (peak, line-to-neutral) phase current incrementally. The above test is at 80A and the FETs seem happy, although the motors do get pretty warm already. They might need some i²t thermal protection to handle 120A peaks.

#### Synchronous Data Logging

One of the early lessons I learned in building motor drives is to always log data. Nothing ever works perfectly on the first try, but having data logging built in from the start is the best way I know of to quickly diagnose problems. A lot of the stuff that happens in a motor drive is faster than typical data logging can capture, but a lot of it is also periodic. By synchronizing the data collected to the rotor electrical angle, its possible to reveal detailed periodic signals even with relatively low frequency (50Hz) logging to an SD card. As a quick example, here's a standard data logger plot of motor phase currents over time:
 Phase current vs. time, pretty boring.
This type of plot shows the drive cycle, with periods of high current during acceleration (or braking) and periods of near zero current when coasting or stopped. And it shows that phase currents sometimes exceed 100A even with an 80A command. But a plot of Q-axis (torque-producing) current, which is already synchronous, could give a better summary of this information. The time resolution (40ms) isn't fine enough to show the AC signals.

However, each set of three phase currents is also stamped with a rotor electrical angle measured at the same time (within about 10 microseconds). Cross-plotting the phase currents against their angle stamp, instead of against time, reveals a much more interesting view of the data:
 Same data, different meaning.
Now it's possible to see the three phase current waveforms separated by 120edeg. The peaks are at 0º (Phase A), 120º (Phase B), and -120º (Phase C). There are also negative peaks at the same angles, where braking is occurring. Most interestingly, the shape of the current waveforms at 80A peak is revealed to be asymmetric and far from sinusoidal.

The angular resolution of this type of waveform capture is only limited by the angle measurement, regardless of logging frequency. By contrast, the fastest it would be possible to log a continuous waveform would be at the PWM frequency (23.4kHz, in this case), which gives an speed-dependent angular resolution of 11.3edeg per 1000rpm. It would become difficult to resolve the shape of the current waveform at high speeds. There's always a trade-off, though: Synchronizing low-speed log data with angle stamps is only able to show the average shape of long-term periodic signals. It would not catch a glitch in a single cycle of the phase currents.

While the phase current shape is interesting, the position of the peaks is just a consequence of the current controller. Zero electrical degrees is defined (by me, arbitrarily) as the angle at which Phase A's back EMF is at a peak. The current controller aligns the Phase A current with the Phase A back EMF for maximum torque per amp. So the phase current plot shows that the current controller is doing its job. This information is also captured by the already-synchronous Q-axis and D-axis current signals:

 Q-axis and D-axis current plotted against time.
The Q-axis current represents torque-producing current, aligned with the back EMF, and is the current being commanded by the throttle input. The D-axis current is field-augmenting (or weakening, if negative) current and doesn't contribute to torque production. In this case, the current controller seeks to hit the desired Q-axis current and keep the D-axis current at zero. It does this by varying the voltage vector applied to the motor. More on this later. The Q-axis and D-axis currents are rotor-synchronous values, so they already convey the magnitude and phase of the phase currents, just not the actual shape.

All of this is based on the assumption that the measured rotor angle is correct, i.e. properly defined with respect to the permanent magnets. On this kart, I'm using magnetic rotary sensors mounted to the motor shafts that communicate the rotor angle to the motor controller via SPI and optically-isolated emulated Hall sensor signals. But it's also possible to measure the rotor angle with a flux observer, as long as the motor is spinning sufficiently fast. I have this running in the background, logging flux estimates for each phase.

Again, plotting flux against time doesn't give a whole lot of information. It's interesting to see the observer converge as speed increases from zero at the start, and the average amplitude of about 5mWb is consistent with the motor's rpm/V constant and measured back-EMF. But the real value of this data comes from cross-plotting against the sensor-derived rotor angle:
 Cross-plotting against sensor-derived electrical angle shows substantial offset between the two motors.
The flux from Phase A should cross zero when its back EMF is at its peak, i.e. at an electrical angle of 0º in my arbitrarily-defined system. So, the front-right motor is more correct. The front-left is offset by about 30-45edeg, which is enough to start causing significant differences in torque. Indeed I had noticed some torque steer during the first test drives, which is what prompted me to do the sensor/sensorless angle comparison in the first place.

Since I have all three phases of flux, I can estimate the flux vector angle with some math and compare it to the sensor-derived rotor angle:

 Digging into flux angle offset of the front-left motor a little more.
Both motors have some variation in flux angle offset, but the front-left varies more and is further from the nominal 90º. Except...when it's not. There are two five-second intervals where the average offset of the front-left flux looks like it returns to nearly 90º, both occurring either during or just after applying negative current. However, there's one more negative current pulse, earlier in the test drive, that does not have a flux angle shift. My troubleshooting neural network has been trained over many project iterations to interpret this as the signature of a mechanical problem.

Sure enough, I was able to grab the rotor of the front-left motor and twist with hand strength only (< 5Nm) enough to make the shaft move relative to the rotor can. It only moved about 5º, but that's 35edeg, which is about the offset I had been seeing in the data. The press fit had failed and it was relying on back-up set screws on flats to keep from completely slipping. I suspect this won't be the last motor to fail in this way. I pressed out the shaft, roughed up the surface a little, and pressed it back in with some Loctite 609. I also drilled a hole in the back that can potentially be tapped as a back-up plan. And finally I recalibrated everything and marked the shaft so I'll know if it slips again.

 Reworked shaft, with a 1/4-20 tap drill (not going to tap it unless I absolutely have to), roughed surface, and press-fit augmented with Loctite 609, which should be good up to 25Nm for this surface area (4-5x margin).
After a few more test drives, it looks like it's holding. The front-left flux vs. sensor-derived angle looks much closer to the correct phase as well:

 Phase A flux vs. sensor-derived angle after shaft rework.
There's still a +/-10edeg offset from nominal, which could be from calibration accuracy or static biases like normal shaft twisting. It might be worth investigating more, but it's not enough offset to create any noticeable torque steer on the front wheel drive, so I'm satisfied for now. I will preemptively do the same rework on the remaining three motor shafts.

One other interesting cross-plot to look at is the Q- and D-axis voltage as a function of speed. I mentioned above that the current controller attempts to align the current vector with the back EMF vector by manipulating the voltage vector, the basis of field-oriented control. Due to the electrical time constant (L/R) of the motor, the voltage must lead the back EMF by a varying amount. This shows up as negative D-axis voltage increasing in magnitude with speed (and current).

 Jitter 3D  plot of the voltage vector operating curve.
At 80A and 2500erad/s (~3400rpm and ~27mph), the voltage vector is already leading by 45º, with 12V on both axes. This gives me a rough estimate for the motor's synchronous inductance.
Along with the measured resistance (32mΩ) and flux amplitude (5mWb), this is all that's required for a first-order motor model, and thus a torque-speed curve. Running this through the gear ratio, the force-speed curve at the ground should look something like:

The inductance has a large impact on the maximum speed at which 120A can be driven into the motor in-phase with the back EMF. This determines the maximum power, since above this speed the force drops off faster than the speed increases. The top speed is wherever on the curve the drag forces equal the motor force, probably in the 40-45mph range. This is all without using third harmonic injection, which gives an extra 15% voltage overhead (for the cost of higher peak battery power, of course). If I do turn that on, it will probably come with a gear ratio change to put that extra 15% toward more torque, not more speed.

That's all I wanted to check before building up the second motor controller for the rear wheel drive. I'm very eager to see how it handles with 4WD, and how close to this force-speed curve I can actually get.

## Wednesday, September 4, 2019

### CMV12000 Full-Speed (38.4Gb/s) Read-In on Zynq Ultrascale+

In my original Freight-Train-of-Pixels post, I explored three main challenges of building a 3.8Gpx/s imager: the source, the pipe, and the sink. Working backwards, the sink is an NVMe SSD that (hopefully) will be capable of 1GB/s writes. The pipe is a ~5:1 wavelet compression engine that has to make 3.8Gpx/s = 1GB/s in realtime, with minimal effect on image quality. And the source is the CMV12000 image sensor that relentlessly feeds pixel data into this machine. This post focuses on the source, and specifically the read-in mechanism implemented on a Zynq Ultrascale+ SoC for the 38.4Gb/s of LVDS data from the sensor.

#### Physical Interface

 The Source: Breaking out the CMV12000's 64 LVDS pairs was interesting.
The pixel data interface on the CMV12000 is 64 LVDS pairs, each operating at (up to) 300MHz DDR (600Mb/s). In the context of FPGA I/O interfaces, 300MHz DDR is really not that fast. It's just a lot of inputs. Most Zynq Ultrascale+ SoCs have enough LVDS-capable package pins to do this, but it took some searching to find a carrier board that breaks out enough of them to headers. I'm using the Trenz Electronic TE0803, specifically the ZU4CG vesion, which breaks out a total of 72 HP LVDS pairs from the ZU+.

The physical interface for LVDS is a 100Ω differential pair. At 300MHz DDR, the length-matching requirements are not difficult. A bit is about 250mm long, so millimeter-scale mismatches due to an uneven number of 45º left and right turns are not a big deal; no meandering is really needed for intrapair matching. Likewise, I felt it was okay to break some routing rules by splitting a pair for a few millimeters to reach inner CMV12000 pins, rather than pushing trace/space limits to squeeze two traces between pins.

 Routing of the LVDS pairs to TE0803 headers. Pairs are length-matched to within ~1mm, but no interpair matching was attempted. The FPGA must deal with interpair length differences as well as the CMV12000's large internal skew.
Interpair skew is still an issue. For ease of routing, no interpair length matching was attempted, resulting in length differences of as much as 30% of a bit interval. But this isn't even the bad news. The CMV12000 has a ~150ps skew between each channel from 1-32 and 33-64. That means that channels 32 and 64 are ~4.7ns behind channels 1 and 33, a whopping 280% of a bit interval. It would be silly to try to compensate for this with length matching, since that's equivalent to about 700mm at c/2!

For a brief moment after reading about the CMV12000's massive interchannel skew, I thought I might be screwed. FPGA inputs deal with skew by using adjustable delay elements to add delay to the edges that arrive early, slowing them all down to align with the most fashionably late edge. But the delay elements in the Zynq Ultrascale+ are only guaranteed to provide up to 1.1ns of delay. It's possible to cascade the unused output delay elements with their associated input delay elements, but that's still only 2.2ns.

But I don't need to account for the whole 4.7ns of interchannel skew; I only need to reach the same phase angle in the next bit. At 600Mb/s, that's only 1.67ns away. Delays larger than this can be done with bit slipping, as shown below. Since this still relies on the cascaded delay elements to span one full bit interval, an interesting consequence is that a minimum speed is imposed (about 450Mb/s for 2.2ns of available delay). So I guess it's go fast or go home...

 Channels are aligned using an adjustable delay of up to one bit period and integer bit slipping in the deserialized data.
The Ultrascale+ deserializer hardware supports up to 1:8 (byte) deserialization from DDR inputs. The bit slip logic selects a new byte from any position in the 16-bit concatenation of the current and previous raw deserialized bytes. The combination of the delay value and integer bit slip offset independently align each channel.

A complication is that the CMV12000 has 8-, 10-, and 12-bit pixel modes, with the highest readout efficiency in the default 10-bit mode. To go from 8-bit deserialized data to 10-bit pixel data requires building a "gearbox", a nomenclature I really like. An 8:10 gearbox can be built pretty easily with just a few registers:

 An 8:10 gearbox, with four states corresponding to alignment of the 10-bit output within two adjacent 8-bit inputs.
The gearbox cycles through four states, registering a 10-bit output from an offset of {0, 2, 4, or 6} within two adjacent 8-bit inputs to pick out whole pixels from the data. This looks simple enough, but there's a subtlety in the fact that five bytes must cycle through the registers for every four pixels. In other words, the input clock (byte_clk) is running 5/4 as fast as the output clock (px_clk). The two clocks must be divided down from the same source (the LVDS clock in this case) to ensure that timing constraints can be evaluated. Additionally, to work as pictured above, the phase of the two clocks must be such the "extra" byte shift occur between states 3 and 0.

The overall input module is pretty tiny, which is good because I have to instantiate 65 of them (64 pixel channels and one control channel). They're built into an AXI-Lite slave peripheral with all the per-channel tweakable parameters as well as the final 10-bit pixel outputs mapped for the ARM to play with. The CMV12000 outputs training data on the pixel channels any time they're not being used to send real data. So, my link training process is:
1. Find the correct phase for the px_clk so that, as described above, the gearbox works properly. Incorrect phase will result in flickering pixel data as the byte shifts occur in the wrong place relative to the gearbox px_clk state machine. I'm not sure why this phase changes from reset to reset. It's the same value for all 65 channels, so I feel like there should be a way to have it start up deterministically. But for now it's easy enough to try all four values and see which one produces constant data.

2. On each channel, set the sampling point by sweeping through the adjustable delay values looking for an eye center. (Or, since it's not guaranteed that a complete eye will be contained in the available delay range, a sampling point sufficiently far from eye edges.)

3. On the control channel, set the bit slip offset to the value between 3 and 12 that produces the expected training value. This covers all ten possibilities for phasing of the pixel data relative to the deserializer. Note that this requires registering and concatenating three deserialized bytes, rather than two as pictured in the bit slip example above.

4. On each pixel channel, set the bit slip offset to the value closest to the control channel bit slip offset that produces the expected training value. It should be within ±3 of the control channel bit slip offset, since that's the maximum interchannel skew.
This only takes a fraction of a second, so it can easily be done on start-up or even in between captures to protect against temperature-dependent skew. By looking at the total delay represented by the delay tap values and bit slip offsets, it's clear that the CMV12000's interchannel skew is the dominant factor and that the trained delays roughly match the datasheet skew specification of 150ps per channel:

 Total CMV12000 channel delays measured by training results.
That's the hard part of the source done, with less trouble than I expected. The output is a 60MHz px_clk and 65 10-bit values that update on that clock. This will be the interface to the middle of the pipeline, the wavelet engine. But I need to be able to test the sensor before that's complete, and more than 64 pixels at a time. Without the compression stage, though, that means writing data at full rate to external DDR4 attached to the ZU+. Although it's a throwaway test, I will need to write to that RAM (at a lower rate) after the compression stage anyway, so this would be good practice.

#### RAMMING SPEED

The ZU4CG version of the TE0803 has 2GB of 2400MT/s DDR4 configured as x64. That's over 150Gb/s of theoretical memory bandwidth, so the 38.4Gb/s CMV12000 data should be pretty easy. The DDR4 is attached to the PS side of the ZU+, though, and the dedicated DDR controller there is shared by many elements of the system, including the ARM cores.

The CMV12000 front-end described above exists on the PL side. The fastest interface between the PL and the PS is a set of 128-bit AXI memory buses, exposed as the slave ports S_AXI_HPx_FPD to the PL. There are four such slave ports, but only a maximum of three can be simultaneously routed to the DDR controller:

 Up to three 128-bit AXI memory buses can be dedicated to direct PL-PS DDR access.
The Ultrascale+ AXI might be able to go up to 333MHz, according to the datasheet, but 250MHz is the more common setting. That's okay - that's still 96Gb/s of theoretical bus bandwidth. But you can start to see why it's infeasible to store intermediate compression data in external RAM. Even 2.5 accesses per pixel would saturate the bus.

For this test, I set up some custom BRAM FIFOs to use for buffering between the hard-timed pixel input and the more uncertain AXI write transfer. To keep things simple, four adjacent channels share one 64b-wide FIFO, aligning their pixel data to 16 bits. All FIFO writes happen on the px_clk when the control channel indicates valid pixel data.

The other side of the FIFO is a little more confusing. I split channels 1-32 and 33-64 (8 FIFOs each) into two write groups, each with its own AXI master port with 32Gb/s of theoretical bandwidth. The bottom channels drive S_AXI_HP0_FPD and the top drive S_AXI_HP1_FPD, and I rely on the DDR controller to sort out simultaneous write requests.

 Bottom channel RAM writing test pipeline, through BRAM FIFO buffers. Top channels are similar.
When the FIFO levels reach a certain threshold, a write transaction is started. Each transaction is 16 bursts of 16 beats of 16 bytes, and the 16 bytes of a beat are views of the FIFO output data. For simplicity, I just alternate between views of the 8 MSBs of 16 pixels to fill each 128-bit beat. I may stick the 2 LSBs from all 64 channels in their own view at some point, but for now I can at least confirm sensor operation with the 8 MSBs.

Without further ado, the first full image off the sensor:

 What were you expecting?
It turned out better than I thought, even looking like a VHS tape on rewind as it does. There are both horizontal and vertical defects. The vertical defects were concentrated in one 128px-wide column, served by a single LVDS pair, so that was easily traceable to a marginal solder joint. The horizontal defects were more likely to be missing or corrupted RAM writes. They would change position every frame.

At first I suspected the DDR controller might be struggling to arbitrate between the two PL-PS ports and the ARM. The ARM might try to read program data while the image capture front-end is writing, incurring both a read/write turnaround penalty and a page change penalty. But in that case the AXI slave ports should exert back-pressure on their PL masters by deasserting the AWREADY signal, and I didn't see this happening. To further rule out ARM contention, I moved the ARM program into on-chip memory and disabled all but the two slave ports being used to write data to the DDR controller...still no good.

I also tried different combinations of pixel clock speed (down to 30MHz), AXI clock speed (down to 125MHz), burst size, and total transfer size with no real change. Even with only one port writing, the problem persisted. Then I tried replacing the image views with some FIFO debug info: input/output counters and the difference used to calculate the fill level. I had expected the difference to vary up and down by one or two address units since the counters run on different clocks, but I what I saw were cases where the difference was entirely wrong, possibly triggering bad transfers.

So what I had was a clock domain crossing problem. Rather than describe it in detail, I'll just link this article that I wish I had read beforehand. The crux of it is that the individual bits of the counter can't be trusted to change together and if you catch them in mid-transition during an asynchronous clock overlap, you can get results that are complete nonsense, not just off-by-one. The article details a comprehensive bidirectional synchronization method using Gray code counters, but for now I just tried a simple one-way method where the input counter is driven across the clock domain with an alternating "pump" signal:

 Synchronization pump for FIFO input counter.
The pump is driven by the LSB of the input-side counter and synchronized to the AXI clock domain through a series of flip-flops. This only works if the output-side clock is sufficiently faster than the input-side clock that it can always detect every edge of the pump signal. That's the case here, with a 250MHz axi_clk and a 60MHz px_clk. The value of in_cnt_axi, the input counter pumped to the AXI clock domain, is what's compared to the output counter (which is already in the AXI clock domain) to evaluate the FIFO level and trigger AXI transfers. It's the right amount of simple for me, adding only a few flip-flops to the design.

 And just like that, clean kerbal portraits.
In theory, I could read in about 170 frames this way (in 0.567s...). It currently takes me 30 seconds to get each frame off over JTAG, though, so I may want to get USB (SuperSpeed!) up and running first. More importantly, I can evaluate sensor stuff independent of the two other main challenges (wavelet pipeline and SSD sink). I'm actually surprised at the okay-ness of the raw image, but there is definitely some fixed pattern noise to contend with. I also want to try the multi-slope HDR mode, which should be great for fitting more dynamic range in the 10-bit data (with no processing on my end!).

I started with the source and sink because, even though they're the more known tasks, they represent external constraints that are actually show-stoppers if they don't work. Now I am confident in everything up to the pixel data hand-off on the source side. The sink side is still a mess, but the hardware has been checked at least. That leaves the more unknown challenge of the wavelet compression engine. But since it's entirely built from logic, with interfaces on both ends that I control, I'm actually less worried about it. In other words, it's nice to not have to think about whether or not to build something from scratch...