Saturday, October 5, 2019

KSP: Laythe Colony Part 4, Drop Ships and Lonely Rovers

After the second Jool launch window, I still had 196 days to get a few extra ships off Kerbin before its destruction on Year 3, Day 0. They couldn't transfer to Jool until the third launch window - around Year 3, Day 260 - but they could still get out of harm's way. I hadn't specified exactly how Kerbin is destroyed, but since this entire scenario is based on Seveneves, I think it was reasonable to say that these ships should not sit in cismunar orbit. So I decided to send them out to Minmus for parking.

Colony ship #11 or #12 - I lost count. Parked at Minmus for a front row seat to the end of the world.
By this point I was getting pretty tired of building colony ships. Each one takes about a dozen launches to assemble, crew, and fuel in low Kerbin orbit. But I managed to get two more built and parked at Minmus. I also realized that there would be a little bit of a housing shortage on Laythe with the extra 72 Kerbals these colony ships carry, so I sent up one more HAB1 transfer ship as well. But parking ships in Minmus orbit isn't exactly efficient, and I am running a pretty tight Δv budget. A perfect opportunity, then, to create one last piece of hardware for this mission.

The Drop Ships

The DS1 lander, a last-minute mining platform and fuel tanker for the fleet.
Until now, the only ships in my fleet with mining capabilities were the LR1 rovers, which can refuel space planes on the surface of Laythe. The planes can then climb into Laythe orbit and transfer any spare fuel to the colony ships. But it would take quite a few launches to fully refuel the colony ships this way. Better to mine on a moon with a shallow gravity well, like Pol, and net a bunch more fuel. So I designed a drop ship mining platform/tanker to do just that. Refueling the ships parked at Minmus before the third Jool transfer window would be a good test.

I've done space planes and straightforward powered descent, but never a true VTOL in the sense of a ship that is designed to hover and translate horizontally looking for flat ground or good mining prospects. Most of my knowledge about drop ships comes from watching Cupcake Landers videos. I just tried to make it symmetric, place the C.G. properly, and set up the fuel tanks so that the C.G. doesn't shift much as they drain.

Drop ship mining practice on Minmus.
Even though they're essentially flying fuel tanks, cruising them through mountain ranges in the low gravity of Minmus is easy and actually kind of fun. Normally I'm trying to time suicide burns just right or not stall out my space planes, both of which are more stressful technical tasks. Piloting a drop ship is closer to a sci-fi landing experience. Which reminds me: if you're looking for a quick diversion from the brutally technical challenge of KSP, Outer Wilds is a beautiful (and creepy) exploration/mystery game with some incredible open-world storytelling. Absolutely worth going in blind and playing through.

Back to Minmus mining, though. I only had time to build two of these drop ships. They can operate autonomously, but they also have room for a pilot, for navigation in frontier areas with poor relay coverage, and an engineer, for more efficient mining. I realized while building these final few ships that I neglected to put relay antennas on the colony ships, something that is required for remotely piloting rovers, space planes, or drop ships. Since I might need to do a lot of remote piloting in the Jool system, I decided to steal a couple relay satellites from Kerbin orbit.

Stealing a satellite with the grabby claw I knew would come in handy.
Once these ships leave Kerbin orbit, there won't be any need for a Kerbin comms network anymore, so I (literally) grabbed some of Kerbin's relay satellites with the last two colony ships. It is possible to create a remote piloting connection through a relay satellite in a grabby claw, something I find satisfyingly appropriate for Kerbal-style mission "planning". Anyway, I made a few round trips to Minmus surface to refuel the ships of the third wave and then that was it for Kerbin.

0 Days Remain

On Year 3, Day 0, time was up for the Kerbal home planet. The remaining population (of 432 Kerbals) was in flight, either on the way to Jool or at Minmus awaiting the third transfer window. No more hardware would be launched and the roughly four kilotons of ship and propellant in the fleet would have to become the Laythe colony. But it would still be almost another two years before the first colony ship arrives in the Jool system. Before that, the robotic fleet would have to lay the groundwork.

The Lonely Rovers

The 18 ships of the first Jool launch window arrived at their destination during the second half of Year 3. I set up the transfers such that the relay satellites would arrive first, since having a working comms net in the Jool system would be crucial to the rest of the mission. The RS3 ships and especially the ion engine satellites themselves have plenty of Δv to spare, so I just brute-forced them into useful coverage orbits around Jool and Laythe.

The first relay satellites arrive at Jool. I'm definitely guilty of setting up the WiFi before unpacking...
For the remainder of the ships, though, the Δv budget was tight enough that I definitely wanted to grab Tylo gravity assists on the way in. This created a bit of traffic as several ships would hit the Tylo gateway within days, or sometimes hours, of each other. To get captured using a gravity assist, I aimed to pass "in front of" Tylo, so that its gravity mostly pulls in a direction opposite my orbit and I feed it some of my kinetic energy. After some refinement, I also was able to target a captured orbit with a periapsis similar to the orbital radius of Laythe. From there, it's easy to get a low-energy intercept on the next orbit with just a couple small correction burns at periapsis and apoapsis.

Busy airspace (or, spacespace?) around the Tylo gateway.
Using gravity assist captures off Tylo, or in a few cases off Laythe itself, my average Δv from low Kerbin orbit to low Laythe orbit was about 3475m/s, with a tolerance of about ±350m/s. This is quite a bit below the 4360m/s you get from the subway map, which would have been cutting it very close for some of my ships. As it is, all of the robotic fleet made it to low Laythe orbit with fuel to spare and without having to do any aerobraking. Assuming all the Δv saved went into accelerating Tylo (and it wasn't on rails), its apoapsis would be raised by about 1nm.

Getting to Laythe is not the same as landing on Laythe, though. It's a water world with only a few islands to target. I've landed there before, using a custom deorbit burn tool to target the island on the equator with the flattest terrain. To hit that island, it makes sense to burn over the small island that's about 90º west of there. I set up each ship in a near-circular 100km equatorial orbit and then start a burn just as the ship passes over the coast of that island:

Laythe deorbit burn over the small island on the equator, to hit the flat island about 90º to the east.
After the burn, the lander can ditch its propulsion module (which is mostly empty now and will burn up separately) and prep for entry. For the first phase of the landing, an inflatable heat shield protects the descent package from the initial atmospheric heating.

Landing Phase 1: Using an inflatable heat shield to protect the payload while bleeding off some speed.
As the air gets thicker, the drag on the heat shield overcomes the ability of the reaction wheels to keep it facing forward, so the lander flips around. The fairing still provides thermal and aerodynamic protection for the payload, and the heat shield now becomes more of an air brake, bleeding off even more speed in preparation for the final descent.

Landing Phase 2: The craft flips around, with the heat shield now acting as an air brake.
At about 3km AGL, the speed is low enough to jettison the fairing and deploy the main parachutes. The heat shield stays attached until the main chutes deploy, at which point it can be jettisoned in a controlled orientation so it doesn't crash back into the ship.

Landing Phase 3: Fairing jettisoned, main chutes deployed, heat shield dropped.
Finally, at about 300m AGL, the descent engines kick in and bleed off the final bit of vertical velocity. They don't have much fuel, so the burn has to be timed pretty well. I use the AeroGUI's AGL indicator and the lander's shadow to judge it.

Landing Phase 4: Powered descent. Kicks up a good amount of sand.
That's how things should go. But the first two landings were not quite perfect. I nearly overshot the landing zone on the first try, coming down less than 1km from the eastern shore. This is almost exactly where I landed my first Laythe mission, and I knew it was on a major slope. In the process of preparing for a potentially harrowing post-landing slide into the ocean, I forgot a few steps of the landing checklist and the descent engines didn't start up. The resulting ~15m/s impact was enough to break off the mining rig and fuel tank from the first LR1 rover down. But the drivetrain survived, so it could still act as a scout if it could get up the hill.

The first (hard) landing on Laythe in this mission, dangerously close to the shore.
Having nearly overshot the landing zone into the ocean, I tweaked the deorbit burn a little (from 104m/s to 110m/s). However, this was a little too much tweaking and lander #2 wound up heading straight for the lake in the middle of this island. Luckily, this was a HAB1 lander, which has a little more fuel on board for the powered final descent. I managed to just barely hover-translate to the cliff edge overlooking the lake's eastern shore with no fuel to spare.

Landing #2 involved some last-second piloting to steer away from the lake to the edge of a cliff.
Those two landings gave me the upper and lower limits for the deorbit burn. I used 108m/s as the burn for the remaining 12 landers, and they all touched down safely on the relatively flat land between the lake and the eastern shore.

Typical landing zone after dialing in the exact deorbit burn.
I say relatively flat because it's still filled with sand dunes. They're no problem for the 6- and 8-wheeled rovers, but I need a 1-2km stretch of actually flat terrain to use as a space plane runway. I scouted for a while before settling on the strip marked out by the pink markers in the landing photo above. It's about 1.5km long and 300m wide, near the equator, and aligned well for west-to-east landings. It's completely flat in the crosswind direction and slightly sloped upward in the "upwind" landing direction. I'd prefer something flat in all directions, but this is the next best thing.

By the end of the first wave, I could place the landers with about ±2km accuracy from orbit. But they are rovers, so it's easy to reposition them as needed. The LR1s all grouped together to form the corners of the runway, acting as visible markers for the space planes on approach. They're needed at the runway for refueling anyway, so this seems like the best place for them. In order to avoid excessive part counts in one location, I decided to move the HAB1s, the colony habitats, away from the runway and toward the lake. There, they could be assembled into housing groups.

Setting up some modular housing on the dunes.
It's not a metropolis, but having a mobile and reconfigurable colony seems ideal on the sand dunes of an otherwise pretty desolate water planet. In total, 13.5 of the 14 rovers in the robotic fleet made it to the surface, and all 14 were able to find their way to each other and remotely set up the infrastructure for a colony. It'll be another year before the colony ships arrive in the second wave, but when they do, they'll have a place to stay - with a nice view.


Saturday, September 28, 2019

Fast atan2() alternative for three-phase angle measurement.

Normally, to get the phase angle of a set of (assumed balanced) three-phase signals, I'd do a Clarke Transform followed by an atan2(β,α). This could be atan2f(), for single-precision floating-point in C, or some other approximation that trades off accuracy for speed. The crudest (and fastest) of these is a first-order approximation, atan(x) ≈ (π/4)·x, which has a maximum error of ±4.073º over the range {-1 ≤ x ≤ 1}:
Interestingly, this isn't the best (minimax or least mean square) linear fit over that range. But it's pretty good and has zero error on both ends, so it can be stitched together into a continuous four-quadrant approximation that covers all finite inputs to the two-argument atan2(β,α):
One common implementation determines the quadrant based on α and β and then runs the linear approximation on either x = β/α or x = α/β, whichever is in the range {-1 ≤ x ≤ 1} in that quadrant. The combination of a quadrant offset and the local linear approximation determines the final result.
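Here's a minimal C sketch of that kind of four-quadrant implementation (my own version, for illustration; not necessarily the exact structure used in the timing comparison later):

```c
#include <math.h>  // fabsf()

// First-order atan2 approximation: atan(z) ≈ (π/4)·z for |z| ≤ 1, stitched
// across quadrants. Max error is about ±4.07º. Input (0, 0) is left undefined,
// as with atan2(). Illustrative sketch, not production code.
float atan2_approx(float b, float a)   // b = β, a = α
{
    const float PI_F   = 3.14159265f;
    const float PI_2_F = 1.57079633f;
    const float PI_4_F = 0.78539816f;
    float angle;

    if (fabsf(a) >= fabsf(b)) {
        // |b/a| ≤ 1: local approximation of atan(b/a), offset by 0 or ±π
        angle = PI_4_F * (b / a);
        if (a < 0.0f)
            angle += (b >= 0.0f) ? PI_F : -PI_F;
    } else {
        // |a/b| ≤ 1: local approximation of ±π/2 - atan(a/b)
        angle = ((b >= 0.0f) ? PI_2_F : -PI_2_F) - PI_4_F * (a / b);
    }
    return angle;  // radians, roughly in (-π, π]
}
```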

It's possible to extend this method to three inputs, a set of three-phase signals assumed to be balanced. Instead of quadrants, the input domain is split based on the six possible sorted orders of the three-phase signals. Within each sextant, the middle input (the one crossing zero) is divided by the difference of the other two to form a normalized input, analogous to selecting x = β/α or x = α/β in the atan2() implementation:
This normalized input, which happens to range from -1/3 to 1/3, is multiplied by a linear fit constant to create the local approximation. To follow the pattern of the four-quadrant approximation, a constant of π/2 gives a fit that's not (minimax or least mean square) optimal, but stitches together continuously at sextant boundaries. As with the atan2() implementation, the combination of a sextant offset and the local approximation determines the final result.
For this three-phase approximation, the maximum error is ±1.117º, significantly lower than that of the four-quadrant approximation. If starting from three-phase signals anyway, this method may also be faster, or at least nearly the same speed. The conditional section for selecting a sextant is more complex, but there are fewer intermediate math operations. (Both still have the single pesky floating-point divide for normalization.)
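Here's a sketch of the six-sextant version in the same style, assuming balanced inputs a = cos(θ), b = cos(θ - 120º), c = cos(θ + 120º). The sign and offset pattern depends on the phase labeling convention, so treat this as illustrative rather than drop-in:

```c
// Three-phase angle approximation: in each sextant, the zero-crossing (middle)
// phase is divided by the difference of the other two, giving x in [-1/3, 1/3],
// which is scaled by π/2 and added to the sextant offset. Max error ≈ ±1.1º.
// Assumes balanced, non-degenerate inputs (not all equal).
float three_phase_angle_approx(float a, float b, float c)
{
    const float PI_2_F = 1.57079633f;
    const float PI_3_F = 1.04719755f;  // 60º
    float theta;

    if      (a >= b && b >= c) theta = 0.5f * PI_3_F + PI_2_F * (b / (a - c)); //   0º..60º
    else if (b >= a && a >= c) theta = 1.5f * PI_3_F - PI_2_F * (a / (b - c)); //  60º..120º
    else if (b >= c && c >= a) theta = 2.5f * PI_3_F + PI_2_F * (c / (b - a)); // 120º..180º
    else if (c >= b && b >= a) theta = 3.5f * PI_3_F - PI_2_F * (b / (c - a)); // 180º..240º
    else if (c >= a && a >= b) theta = 4.5f * PI_3_F + PI_2_F * (a / (c - b)); // 240º..300º
    else                       theta = 5.5f * PI_3_F - PI_2_F * (c / (a - b)); // 300º..360º

    return theta;  // radians, in [0, 2π)
}
```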

To put this to the test, I tried directly computing the phase of the three flux observer signals on TinyCross's dual motor drive. This usually isn't the best way to derive sensorless rotor angle: An angle tracking observer or PLL-type method can do a better job at filtering out noise by enforcing physical bandwidth constraints. But for this test, I just compute the angle directly using either atan2f(β,α) or one of the two approximations above.

Computation times for different angle-deriving algorithms.
The three-phase approximation does turn out to be a little faster in this case. To keep the comparison fair, I tried to use the same structure for both approximations: the quadrant/sextant selection conditional runs first, setting bits in a 2- or 3-bit code. That code is then used to look up the offset and the numerator/denominator for the local linear approximation. This is running on an STM32F303 at 72MHz. The PWM loop period is 42.67μs, so a 1.5-2.0μs calculation per motor isn't too bad, but every cycle counts. It's also a "free" accuracy improvement:


The ±4º error ripple in the four-quadrant approximation shows up clearly in real data. The smaller error in the three-phase approximation is mostly lost in other noise. When the error is taken with respect to a post-computed atan2f(), the four-quadrant approximation looks less noisy. But I think this is just a mathematical symptom. When considering error with respect to an independent angle measurement (from Hall sensor interpolation), they show similar amounts of noise.

I don't have an immediate use for this, since TinyCross is primarily sensored and the flux signals are already synchronously logged (for diagnostics only). But clock cycle hunting is a fun hobby.

Monday, September 23, 2019

TinyCross: First Test Drive and Synchronous Data Logging

With the front wheel drive complete and the steering wheel control board working, it's finally time for a first test drive:


I've been waiting over a year to see if this mountain bike air shock suspension setup would work, and it looks like it does! I haven't done any tuning on it besides setting the preload, but it handles my pretty beat up parking lot nicely, absorbing bumps that would have broken tinyKart in minutes. The steering linkage also seems okay, with good travel and minimal bump steer. There are still some minor mechanical improvements I want to make, but it's nice to see the suspension concept in action after all this time.

I started with front wheel drive so I could see if the motor drive had any Flame Emitting Transistors, but happily it did not. It's the same gate drive design that I use on everything and it always just works, so I shouldn't be surprised anymore. But I am asking a lot of the FDMT80080DC FETs (just one per leg), so I'm working my way up to 120A (peak, line-to-neutral) phase current incrementally. The above test is at 80A and the FETs seem happy, although the motors do get pretty warm already. They might need some i²t thermal protection to handle 120A peaks.
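For what it's worth, i²t protection can be as simple as an accumulator that integrates current-squared above the continuous rating and pulls the current limit back down once a thermal budget is used up. A minimal sketch, with made-up numbers (not actual TinyCross firmware):

```c
// Run once per PWM cycle. All constants here are assumptions for illustration.
#define I_CONT    80.0f              // continuous phase current rating, A
#define I2T_LIMIT 32000.0f           // thermal budget, A²·s (120A peaks for ~4s)
#define DT        (1.0f / 23400.0f)  // PWM period, s

static float i2t_acc = 0.0f;

float phase_current_limit(float i_phase, float i_peak_limit)
{
    float excess = i_phase * i_phase - I_CONT * I_CONT;
    i2t_acc += excess * DT;          // "heats up" above the rating, "cools" below it
    if (i2t_acc < 0.0f) i2t_acc = 0.0f;
    return (i2t_acc > I2T_LIMIT) ? I_CONT : i_peak_limit;
}
```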

Synchronous Data Logging

One of the early lessons I learned in building motor drives is to always log data. Nothing ever works perfectly on the first try, but having data logging built in from the start is the best way I know of to quickly diagnose problems. A lot of the stuff that happens in a motor drive is faster than typical data logging can capture, but a lot of it is also periodic. By synchronizing the data collected to the rotor electrical angle, it's possible to reveal detailed periodic signals even with relatively low frequency (50Hz) logging to an SD card. As a quick example, here's a standard data logger plot of motor phase currents over time:
Phase current vs. time, pretty boring.
This type of plot shows the drive cycle, with periods of high current during acceleration (or braking) and periods of near zero current when coasting or stopped. And it shows that phase currents sometimes exceed 100A even with an 80A command. But a plot of Q-axis (torque-producing) current, which is already synchronous, could give a better summary of this information. The time resolution (40ms) isn't fine enough to show the AC signals.

However, each set of three phase currents is also stamped with a rotor electrical angle measured at the same time (within about 10 microseconds). Cross-plotting the phase currents against their angle stamp, instead of against time, reveals a much more interesting view of the data:
Same data, different meaning.
Now it's possible to see the three phase current waveforms separated by 120edeg. The peaks are at 0º (Phase A), 120º (Phase B), and -120º (Phase C). There are also negative peaks at the same angles, where braking is occurring. Most interestingly, the shape of the current waveforms at 80A peak is revealed to be asymmetric and far from sinusoidal.

The angular resolution of this type of waveform capture is only limited by the angle measurement, regardless of logging frequency. By contrast, the fastest it would be possible to log a continuous waveform would be at the PWM frequency (23.4kHz, in this case), which gives a speed-dependent angular resolution of 11.3edeg per 1000rpm. It would become difficult to resolve the shape of the current waveform at high speeds. There's always a trade-off, though: Synchronizing low-speed log data with angle stamps is only able to show the average shape of long-term periodic signals. It would not catch a glitch in a single cycle of the phase currents.
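For reference, an angle-stamped log record can be as simple as the struct below; the field names and scaling are my own guesses for illustration, not the actual TinyCross logger format. Cross-plotting ia/ib/ic against angle instead of t_ms is what produces the waveform view above.

```c
#include <stdint.h>

// One 50Hz SD-card log record: phase currents plus the rotor electrical angle
// sampled within ~10µs of them (hypothetical layout).
typedef struct {
    uint32_t t_ms;        // time stamp, ms
    uint16_t angle;       // rotor electrical angle, 0..65535 → 0..360 edeg
    int16_t  ia, ib, ic;  // phase currents, e.g. 0.01A per count
} log_record_t;
```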

While the phase current shape is interesting, the position of the peaks is just a consequence of the current controller. Zero electrical degrees is defined (by me, arbitrarily) as the angle at which Phase A's back EMF is at a peak. The current controller aligns the Phase A current with the Phase A back EMF for maximum torque per amp. So the phase current plot shows that the current controller is doing its job. This information is also captured by the already-synchronous Q-axis and D-axis current signals:

Q-axis and D-axis current plotted against time.
The Q-axis current represents torque-producing current, aligned with the back EMF, and is the current being commanded by the throttle input. The D-axis current is field-augmenting (or weakening, if negative) current and doesn't contribute to torque production. In this case, the current controller seeks to hit the desired Q-axis current and keep the D-axis current at zero. It does this by varying the voltage vector applied to the motor. More on this later. The Q-axis and D-axis currents are rotor-synchronous values, so they already convey the magnitude and phase of the phase currents, just not the actual shape.

All of this is based on the assumption that the measured rotor angle is correct, i.e. properly defined with respect to the permanent magnets. On this kart, I'm using magnetic rotary sensors mounted to the motor shafts that communicate the rotor angle to the motor controller via SPI and optically-isolated emulated Hall sensor signals. But it's also possible to measure the rotor angle with a flux observer, as long as the motor is spinning sufficiently fast. I have this running in the background, logging flux estimates for each phase.

Again, plotting flux against time doesn't give a whole lot of information. It's interesting to see the observer converge as speed increases from zero at the start, and the average amplitude of about 5mWb is consistent with the motor's rpm/V constant and measured back-EMF. But the real value of this data comes from cross-plotting against the sensor-derived rotor angle:
Cross-plotting against sensor-derived electrical angle shows substantial offset between the two motors.
The flux from Phase A should cross zero when its back EMF is at its peak, i.e. at an electrical angle of 0º in my arbitrarily-defined system. So, the front-right motor is more correct. The front-left is offset by about 30-45edeg, which is enough to start causing significant differences in torque. Indeed I had noticed some torque steer during the first test drives, which is what prompted me to do the sensor/sensorless angle comparison in the first place.

Since I have all three phases of flux, I can estimate the flux vector angle with some math and compare it to the sensor-derived rotor angle:

Digging into flux angle offset of the front-left motor a little more.
Both motors have some variation in flux angle offset, but the front-left varies more and is further from the nominal 90º. Except...when it's not. There are two five-second intervals where the average offset of the front-left flux looks like it returns to nearly 90º, both occurring either during or just after applying negative current. However, there's one more negative current pulse, earlier in the test drive, that does not have a flux angle shift. My troubleshooting neural network has been trained over many project iterations to interpret this as the signature of a mechanical problem.

Sure enough, I was able to grab the rotor of the front-left motor and, with hand strength only (< 5Nm), twist it enough to make the shaft move relative to the rotor can. It only moved about 5º, but that's 35edeg, which is about the offset I had been seeing in the data. The press fit had failed and it was relying on back-up set screws on flats to keep from completely slipping. I suspect this won't be the last motor to fail in this way. I pressed out the shaft, roughed up the surface a little, and pressed it back in with some Loctite 609. I also drilled a hole in the back that can potentially be tapped as a back-up plan. And finally I recalibrated everything and marked the shaft so I'll know if it slips again.

Reworked shaft, with a 1/4-20 tap drill (not going to tap it unless I absolutely have to), roughed surface, and press-fit augmented with Loctite 609, which should be good up to 25Nm for this surface area (4-5x margin).
After a few more test drives, it looks like it's holding. The front-left flux vs. sensor-derived angle looks much closer to the correct phase as well:

Phase A flux vs. sensor-derived angle after shaft rework.
There's still a +/-10edeg offset from nominal, which could be from calibration accuracy or static biases like normal shaft twisting. It might be worth investigating more, but it's not enough offset to create any noticeable torque steer on the front wheel drive, so I'm satisfied for now. I will preemptively do the same rework on the remaining three motor shafts.

One other interesting cross-plot to look at is the Q- and D-axis voltage as a function of speed. I mentioned above that the current controller attempts to align the current vector with the back EMF vector by manipulating the voltage vector, the basis of field-oriented control. Due to the electrical time constant (L/R) of the motor, the voltage must lead the back EMF by a varying amount. This shows up as negative D-axis voltage increasing in magnitude with speed (and current).

Jitter 3D plot of the voltage vector operating curve.
At 80A and 2500erad/s (~3400rpm and ~27mph), the voltage vector is already leading by 45º, with 12V on both axes. This gives me a rough estimate for the motor's synchronous inductance.
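As a back-of-the-envelope check (my own arithmetic, assuming the D-axis current is held near zero so that the D-axis voltage is dominated by the ω·L·Iq cross-coupling term):

L ≈ |Vd| / (ωe·Iq) ≈ 12V / (2500rad/s × 80A) ≈ 60μH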
Along with the measured resistance (32mΩ) and flux amplitude (5mWb), this is all that's required for a first-order motor model, and thus a torque-speed curve. Running this through the gear ratio, the force-speed curve at the ground should look something like:


The inductance has a large impact on the maximum speed at which 120A can be driven into the motor in-phase with the back EMF. This determines the maximum power, since above this speed the force drops off faster than the speed increases. The top speed is wherever on the curve the drag forces equal the motor force, probably in the 40-45mph range. This is all without using third harmonic injection, which gives an extra 15% voltage overhead (at the cost of higher peak battery power, of course). If I do turn that on, it will probably come with a gear ratio change to put that extra 15% toward more torque, not more speed.

That's all I wanted to check before building up the second motor controller for the rear wheel drive. I'm very eager to see how it handles with 4WD, and how close to this force-speed curve I can actually get.

Wednesday, September 4, 2019

CMV12000 Full-Speed (38.4Gb/s) Read-In on Zynq Ultrascale+

In my original Freight-Train-of-Pixels post, I explored three main challenges of building a 3.8Gpx/s imager: the source, the pipe, and the sink. Working backwards, the sink is an NVMe SSD that (hopefully) will be capable of 1GB/s writes. The pipe is a ~5:1 wavelet compression engine that has to squeeze 3.8Gpx/s down to 1GB/s in realtime, with minimal effect on image quality. And the source is the CMV12000 image sensor that relentlessly feeds pixel data into this machine. This post focuses on the source, and specifically the read-in mechanism implemented on a Zynq Ultrascale+ SoC for the 38.4Gb/s of LVDS data from the sensor.

Physical Interface

The Source: Breaking out the CMV12000's 64 LVDS pairs was interesting.
The pixel data interface on the CMV12000 is 64 LVDS pairs, each operating at (up to) 300MHz DDR (600Mb/s). In the context of FPGA I/O interfaces, 300MHz DDR is really not that fast. It's just a lot of inputs. Most Zynq Ultrascale+ SoCs have enough LVDS-capable package pins to do this, but it took some searching to find a carrier board that breaks out enough of them to headers. I'm using the Trenz Electronic TE0803, specifically the ZU4CG version, which breaks out a total of 72 HP LVDS pairs from the ZU+.

The physical interface for LVDS is a 100Ω differential pair. At 300MHz DDR, the length-matching requirements are not difficult. A bit is about 250mm long, so millimeter-scale mismatches due to an uneven number of 45º left and right turns are not a big deal; no meandering is really needed for intrapair matching. Likewise, I felt it was okay to break some routing rules by splitting a pair for a few millimeters to reach inner CMV12000 pins, rather than pushing trace/space limits to squeeze two traces between pins.

Routing of the LVDS pairs to TE0803 headers. Pairs are length-matched to within ~1mm, but no interpair matching was attempted. The FPGA must deal with interpair length differences as well as the CMV12000's large internal skew.
Interpair skew is still an issue. For ease of routing, no interpair length matching was attempted, resulting in length differences of as much as 30% of a bit interval. But this isn't even the bad news. The CMV12000 has a ~150ps skew between each channel from 1-32 and 33-64. That means that channels 32 and 64 are ~4.7ns behind channels 1 and 33, a whopping 280% of a bit interval. It would be silly to try to compensate for this with length matching, since that's equivalent to about 700mm at c/2!

Deserialization and Link Training

For a brief moment after reading about the CMV12000's massive interchannel skew, I thought I might be screwed. FPGA inputs deal with skew by using adjustable delay elements to add delay to the edges that arrive early, slowing them all down to align with the most fashionably late edge. But the delay elements in the Zynq Ultrascale+ are only guaranteed to provide up to 1.1ns of delay. It's possible to cascade the unused output delay elements with their associated input delay elements, but that's still only 2.2ns.

But I don't need to account for the whole 4.7ns of interchannel skew; I only need to reach the same phase angle in the next bit. At 600Mb/s, that's only 1.67ns away. Delays larger than this can be done with bit slipping, as shown below. Since this still relies on the cascaded delay elements to span one full bit interval, an interesting consequence is that a minimum speed is imposed (about 450Mb/s for 2.2ns of available delay). So I guess it's go fast or go home...

Channels are aligned using an adjustable delay of up to one bit period and integer bit slipping in the deserialized data.
The Ultrascale+ deserializer hardware supports up to 1:8 (byte) deserialization from DDR inputs. The bit slip logic selects a new byte from any position in the 16-bit concatenation of the current and previous raw deserialized bytes. The combination of the delay value and integer bit slip offset independently align each channel.

A complication is that the CMV12000 has 8-, 10-, and 12-bit pixel modes, with the highest readout efficiency in the default 10-bit mode. To go from 8-bit deserialized data to 10-bit pixel data requires building a "gearbox", a nomenclature I really like. An 8:10 gearbox can be built pretty easily with just a few registers:

An 8:10 gearbox, with four states corresponding to alignment of the 10-bit output within two adjacent 8-bit inputs.
The gearbox cycles through four states, registering a 10-bit output from an offset of {0, 2, 4, or 6} within two adjacent 8-bit inputs to pick out whole pixels from the data. This looks simple enough, but there's a subtlety in the fact that five bytes must cycle through the registers for every four pixels. In other words, the input clock (byte_clk) is running 5/4 as fast as the output clock (px_clk). The two clocks must be divided down from the same source (the LVDS clock in this case) to ensure that timing constraints can be evaluated. Additionally, to work as pictured above, the phase of the two clocks must be such that the "extra" byte shift occurs between states 3 and 0.
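To make the state machine concrete, here's a small behavioral C model of the 8:10 gearbox; it's a software sketch of the logic described above (assuming MSB-first bit order and a 16-bit window already primed with two bytes), not the actual HDL, and it glosses over the byte_clk/px_clk phasing that the hardware has to get right:

```c
#include <stdint.h>

typedef struct {
    uint16_t window;  // {previous byte, current byte}
    int      state;   // 0..3, selects the bit offset {0, 2, 4, 6}
} gearbox_t;

// next_byte() is a hypothetical callback returning the next raw deserialized byte.
uint16_t gearbox_step(gearbox_t *g, uint8_t (*next_byte)(void))
{
    int offset = 2 * g->state;                                    // 0, 2, 4, 6
    uint16_t px = (uint16_t)(g->window >> (6 - offset)) & 0x3FF;  // 10 bits at 'offset' from the MSB

    // Advance one byte per state, plus the "extra" byte shift between states 3 and 0,
    // so that five bytes pass through the window for every four pixels out.
    g->window = (uint16_t)(g->window << 8) | next_byte();
    if (g->state == 3)
        g->window = (uint16_t)(g->window << 8) | next_byte();

    g->state = (g->state + 1) & 0x3;
    return px;  // one 10-bit pixel
}
```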

The overall input module is pretty tiny, which is good because I have to instantiate 65 of them (64 pixel channels and one control channel). They're built into an AXI-Lite slave peripheral with all the per-channel tweakable parameters as well as the final 10-bit pixel outputs mapped for the ARM to play with. The CMV12000 outputs training data on the pixel channels any time they're not being used to send real data. So, my link training process is:
  1. Find the correct phase for the px_clk so that, as described above, the gearbox works properly. Incorrect phase will result in flickering pixel data as the byte shifts occur in the wrong place relative to the gearbox px_clk state machine. I'm not sure why this phase changes from reset to reset. It's the same value for all 65 channels, so I feel like there should be a way to have it start up deterministically. But for now it's easy enough to try all four values and see which one produces constant data.
     
  2. On each channel, set the sampling point by sweeping through the adjustable delay values looking for an eye center. (Or, since it's not guaranteed that a complete eye will be contained in the available delay range, a sampling point sufficiently far from eye edges.)
     
  3. On the control channel, set the bit slip offset to the value between 3 and 12 that produces the expected training value. This covers all ten possibilities for phasing of the pixel data relative to the deserializer. Note that this requires registering and concatenating three deserialized bytes, rather than two as pictured in the bit slip example above.
     
  4. On each pixel channel, set the bit slip offset to the value closest to the control channel bit slip offset that produces the expected training value. It should be within ±3 of the control channel bit slip offset, since that's the maximum interchannel skew.
This only takes a fraction of a second, so it can easily be done on start-up or even in between captures to protect against temperature-dependent skew. By looking at the total delay represented by the delay tap values and bit slip offsets, it's clear that the CMV12000's interchannel skew is the dominant factor and that the trained delays roughly match the datasheet skew specification of 150ps per channel:

Total CMV12000 channel delays measured by training results.
That's the hard part of the source done, with less trouble than I expected. The output is a 60MHz px_clk and 65 10-bit values that update on that clock. This will be the interface to the middle of the pipeline, the wavelet engine. But I need to be able to test the sensor before that's complete, and more than 64 pixels at a time. Without the compression stage, though, that means writing data at full rate to external DDR4 attached to the ZU+. Although it's a throwaway test, I will need to write to that RAM (at a lower rate) after the compression stage anyway, so this would be good practice.

RAMMING SPEED

The ZU4CG version of the TE0803 has 2GB of 2400MT/s DDR4 configured as x64. That's over 150Gb/s of theoretical memory bandwidth, so the 38.4Gb/s CMV12000 data should be pretty easy. The DDR4 is attached to the PS side of the ZU+, though, and the dedicated DDR controller there is shared by many elements of the system, including the ARM cores. 

The CMV12000 front-end described above exists on the PL side. The fastest interface between the PL and the PS is a set of 128-bit AXI memory buses, exposed as the slave ports S_AXI_HPx_FPD to the PL. There are four such slave ports, but only a maximum of three can be simultaneously routed to the DDR controller:

Up to three 128-bit AXI memory buses can be dedicated to direct PL-PS DDR access.
The Ultrascale+ AXI might be able to go up to 333MHz, according to the datasheet, but 250MHz is the more common setting. That's okay - that's still 96Gb/s of theoretical bus bandwidth. But you can start to see why it's infeasible to store intermediate compression data in external RAM. Even 2.5 accesses per pixel (3.8Gpx/s × 10 bits × 2.5 ≈ 95Gb/s) would saturate the bus.

For this test, I set up some custom BRAM FIFOs to use for buffering between the hard-timed pixel input and the more uncertain AXI write transfer. To keep things simple, four adjacent channels share one 64b-wide FIFO, aligning their pixel data to 16 bits. All FIFO writes happen on the px_clk when the control channel indicates valid pixel data.

The other side of the FIFO is a little more confusing. I split channels 1-32 and 33-64 (8 FIFOs each) into two write groups, each with its own AXI master port with 32Gb/s of theoretical bandwidth. The bottom channels drive S_AXI_HP0_FPD and the top drive S_AXI_HP1_FPD, and I rely on the DDR controller to sort out simultaneous write requests.

Bottom channel RAM writing test pipeline, through BRAM FIFO buffers. Top channels are similar.
When the FIFO levels reach a certain threshold, a write transaction is started. Each transaction is 16 bursts of 16 beats of 16 bytes, and the 16 bytes of a beat are views of the FIFO output data. For simplicity, I just alternate between views of the 8 MSBs of 16 pixels to fill each 128-bit beat. I may stick the 2 LSBs from all 64 channels in their own view at some point, but for now I can at least confirm sensor operation with the 8 MSBs.

Without further ado, the first full image off the sensor:

What were you expecting?
It turned out better than I thought, even looking like a VHS tape on rewind as it does. There are both horizontal and vertical defects. The vertical defects were concentrated in one 128px-wide column, served by a single LVDS pair, so that was easily traceable to a marginal solder joint. The horizontal defects were more likely to be missing or corrupted RAM writes. They would change position every frame.

At first I suspected the DDR controller might be struggling to arbitrate between the two PL-PS ports and the ARM. The ARM might try to read program data while the image capture front-end is writing, incurring both a read/write turnaround penalty and a page change penalty. But in that case the AXI slave ports should exert back-pressure on their PL masters by deasserting the AWREADY signal, and I didn't see this happening. To further rule out ARM contention, I moved the ARM program into on-chip memory and disabled all but the two slave ports being used to write data to the DDR controller...still no good.

I also tried different combinations of pixel clock speed (down to 30MHz), AXI clock speed (down to 125MHz), burst size, and total transfer size with no real change. Even with only one port writing, the problem persisted. Then I tried replacing the image views with some FIFO debug info: input/output counters and the difference used to calculate the fill level. I had expected the difference to vary up and down by one or two address units since the counters run on different clocks, but what I saw were cases where the difference was entirely wrong, possibly triggering bad transfers.

So what I had was a clock domain crossing problem. Rather than describe it in detail, I'll just link this article that I wish I had read beforehand. The crux of it is that the individual bits of the counter can't be trusted to change together and if you catch them in mid-transition during an asynchronous clock overlap, you can get results that are complete nonsense, not just off-by-one. The article details a comprehensive bidirectional synchronization method using Gray code counters, but for now I just tried a simple one-way method where the input counter is driven across the clock domain with an alternating "pump" signal:

Synchronization pump for FIFO input counter.
The pump is driven by the LSB of the input-side counter and synchronized to the AXI clock domain through a series of flip-flops. This only works if the output-side clock is sufficiently faster than the input-side clock that it can always detect every edge of the pump signal. That's the case here, with a 250MHz axi_clk and a 60MHz px_clk. The value of in_cnt_axi, the input counter pumped to the AXI clock domain, is what's compared to the output counter (which is already in the AXI clock domain) to evaluate the FIFO level and trigger AXI transfers. It's the right amount of simple for me, adding only a few flip-flops to the design.

And just like that, clean kerbal portraits.
In theory, I could read in about 170 frames this way (in 0.567s...). It currently takes me 30 seconds to get each frame off over JTAG, though, so I may want to get USB (SuperSpeed!) up and running first. More importantly, I can evaluate sensor stuff independent of the two other main challenges (wavelet pipeline and SSD sink). I'm actually surprised at the okay-ness of the raw image, but there is definitely some fixed pattern noise to contend with. I also want to try the multi-slope HDR mode, which should be great for fitting more dynamic range in the 10-bit data (with no processing on my end!).

I started with the source and sink because, even though they're the more known tasks, they represent external constraints that are actually show-stoppers if they don't work. Now I am confident in everything up to the pixel data hand-off on the source side. The sink side is still a mess, but the hardware has been checked at least. That leaves the more unknown challenge of the wavelet compression engine. But since it's entirely built from logic, with interfaces on both ends that I control, I'm actually less worried about it. In other words, it's nice to not have to think about whether or not to build something from scratch...

Wednesday, August 28, 2019

TinyCross: Electronics Update

Where I left off, TinyCross was at the rolling chassis stage. Mechanically, it went together relatively smoothly, most of the issues having been worked out in CAD. There are a few minor tweaks I'd like to make to make it lighter and narrower, but they're low priority compared to getting a first test drive in. So, on to the electronics.

It always looks so clean until you start adding wires.
I've already done a post on the motor drive design. Since the kart is four wheel drive, one will control the two front motors and one will control the two rear motors. For now I've only built up one, just in case there are any observations from the first build that would require changing parts on the second. Here's what the power side looks like:

TxDrive, power side.
It's one of the weirdest power layouts I've done for a motor drive. The design supports two different FET configurations: one with a single MTI200WX75GD per motor and another with six FDMT80080DCs. Since the MTI200's are perpetually out of stock, I committed to the FDMT solution for this build. It really doesn't look like enough FET, but on paper they're almost identical to the MTI200. I especially like the 1453A pulsed current rating. The board is four layers but only 1oz copper, so I also reinforced some of the high current density paths with 1mm bus wire and copper braid.

The FDMT 8x8 SO8 package creates a few other advantages in this configuration. The parasitic inductance is lower and there's room for local ceramic capacitor decoupling near each half bridge, which will help contain switching transients. The entire power side is also at or below 1mm in height, so the whole surface area of the board, including the somewhat overworked 12V to 5V LDO, can be heat sunk to the chassis through some thermal pad:

TxDrive, signal side.
On the other side of the board, each half bridge gets a 12mm-wide vertical slice with its phase wire exit, gate drive, current sense, and 2x47uF aluminum polymer bus capacitance. An additional 820uF of bulk capacitance per motor gets folded over into the unused volume. The signal board sits in the middle and carries the MCU, its power supply, a CAN transceiver, and the encoder interface.

Two pairs of 12AWG inputs, each with 4mm bullet connectors, support up to about 150A of peak battery current. I find it easier to deal with two 12AWG wires than one 8AWG wire. The six phase outputs are also 12AWG, so everything can pass through a common size grommet on the eventual enclosure. The only other connections are CAN (twisted pair) and the encoders (9-conductor shielded ribbon cable). 

The ribbon cables and phase wires run in parallel down each upper A-arm to the motors. This is the scariest run of wire for many reasons. Electrically, the phase wires are high-dV/dt EMI sources that will capacitively couple onto the encoder cable. This is the main motivation for using shielded cable and the three-phase optoisolated Hall signal configuration. Mechanically, these wires pass through several moving parts. (The encoder cable even passes through the drive belt loop!) They need enough slack to accommodate the entire steering and suspension travel, but the slack needs to be in the right places, with good strain relief everywhere else. The routing is actually pretty clean, and will get cleaner once the drives and encoders have their covers installed.

Front wheel drive fully wired up.
That's all just for the front drive; everything will get repeated for the rear. That means that in total there are four pairs of 12AWG DC wire to route out from the central battery input, and up to 300A of total peak battery current to deal with. And this is where I get to spread the good word about MIDI fuses. They are by far the most power-dense fuse format. I've always used the car audio ones, with questionable voltage rating, but Littelfuse makes some serious ones as well, up to a 75V / 200A model rated to break 2500A! Their triple fuse holder is also perfect for my circuit.

Main power input.
The two battery inputs (each a series string of two Tattu 6S 10Ah Smart packs) get the same dual 12AWG treatment, with back-to-back thick #10 ring terminals (the wonderful McMaster 7113K17) bolted to individual fuses. These connect to a bus bar that feeds the full 4x12AWG positive group. This goes through a master power switch and then splits off two and two to the front and rear drives. Meanwhile, a separate 30A fuse feeds off the bus bar through a small switch to the charger and steering wheel board.

The...steering wheel board?

I am really trying to minimize the number of microcontrollers (and also the number of firmware images) on this kart. Each drive has an STM32F303 that's pretty busy running two motors and really shouldn't be doing anything else. But I can stuff every other process onto a single high-level controller. This controller needs to handle driver interface (including throttle read-in), CAN communication with the drives, and ideally battery management. This constrains it to be somewhere near the center of the kart, and the steering wheel seemed like a logical place.

Why have I not grip-taped my steering wheels before?
I've also always wanted to have an OLED steering wheel display. Having live data will definitely help with troubleshooting. Although it's not absolutely necessary, I decided to use the STM32F746 for this board since it has the DMA2D graphics driver. The OLED is 4-bit monochrome, which isn't a natively-supported output format for the DMA2D. But as long as you're blitting even numbers of pixels, you can still make it work. The interface between the OLED and the main board is a SPI variant, good enough for a 50-60Hz update rate. I was originally going to put it on headers, but for clam shell serviceability it was better to just use thin wires.

Display interface and "hot" side of the BMS.
Also on that side of the board is the battery management system (BMS) cell balance circuitry. This got out of hand quickly since I left almost no room for it: the entire area under the display is pretty much off-limits. But I managed to cram 12 cells worth of balance circuit on each side with the resistors themselves sinking heat into the steering wheel metal. To facilitate routing, the circuit alternates FET/resistor placement for the odd and even cells:

Cell balance group.
To discharge an individual cell, a square wave is driven onto its charge pump, which turns on its FET. This happens to the cell(s) with the highest voltage until they are evened out. Usually this is done during or after charging. During discharge, it's sufficient to just monitor the cell voltages and stop when any one cell reaches a low voltage threshold.

Accurately measuring individual cell voltages is itself an interesting challenge. The main problem is that the cells are offset by up to 48V from the ADC ground. Of course, it's possible to use simple voltage dividers to bring the signals down to below 3.3V. But it would be better to have individual differential measurements of each cell. This means a lot of op-amps or...

One op-amp and a 72V analog multiplexer for cell voltage measurement.
I found some 72V analog multiplexers (MAX14753) that can feed the inputs of one nice op-amp. The muxes are dual 4-to-1 selectors cascaded and wired such that the two outputs are always adjacent cell nodes, which drive the inputs of a differential amplifier. This all fits in a pretty small footprint on the opposite side of the board from the cell balance circuitry. Also on this side of the board are all the connectors, the logic and analog power supplies, the charge cutoff FETs, buffers for driving the cell balance charge pumps, a very sad SD card holder with a reversed footprint, the STM32F7 itself, and a mystery component.

The crowded side of the steering wheel board.
Right now the main purpose of this board is to act as the high-level controller for commanding torque and reading back data from the motor drives. The BMS functionality is a secondary objective, since I can still monitor pack voltage through the drives and charge off-board. The torque command comes from a nice trigger stolen from Amazon's second cheapest RC car transmitter. Like tinyKart, this means all the controls are on the steering wheel - no pedals. The trigger is bidirectional, so it can command positive and negative torque. 

All four motors receive a torque command over CAN at 1kHz that they apply to their current controllers. The motors then take turns replying with their crucial data (electrical angle, speed, voltage, current, and fault status) at 250Hz, and their less important data at 50Hz. This should allow for some fairly tight feedback loops through the central controller for things like speed control, traction control, and torque vectoring. There's also that mystery component, which is controls-related.
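To give an idea of how that fits in standard 8-byte CAN frames, here's a hypothetical packing; the field names, scaling, and message layout are my own illustration, not the actual TinyCross protocol:

```c
#include <stdint.h>

typedef struct {        // steering wheel → drives, 1kHz torque command
    int16_t iq_cmd[4];  // per-motor Q-axis current command, e.g. 0.01A per count
} can_torque_cmd_t;     // 8 bytes

typedef struct {        // one motor's "crucial data" reply, 250Hz, round-robin
    uint16_t angle;     // electrical angle, 0..65535 → 0..360 edeg
    int16_t  speed;     // electrical speed, rad/s
    int16_t  current;   // measured Q-axis current, e.g. 0.01A per count
    uint8_t  vbus;      // bus voltage, e.g. 0.25V per count
    uint8_t  fault;     // fault status bits
} can_fast_reply_t;     // 8 bytes
```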

For now, I'm just starting to test the power system with the front drive only, quite honestly so I can see the fire when it happens. The two motors will just get the same torque command, ramping up slowly to full voltage/current. I did as much testing as I could on power supplies, but it's finally time for batteries. Here's the first batteries-in test, at a very easy 6V/20A (peak line-to-neutral quantities) limit:



It's nothing exciting, but the first batteries-in test is always a bit scary since there's no longer a CC/CV supply keeping things from going out of hand. After I do some wiring and software clean up and make sure the data logging is working, I'll ramp up from there toward the full 24V/120A, and then full four wheel drive. I've learned to expect smoke at some point during this process though, so I'm holding off on building the second drive until I see what fails on the first...