Monday, July 15, 2019

Benchmarking NVMe through the Zynq Ultrascale+ PL PCIe Linux Root Port Driver

I want to be able to sink 1GB/s into an NVMe SSD from a Zynq Ultrascale+ device, something I know is technically possible but I haven't seen demonstrated without proprietary hardware accelerators. The software approach - through Linux and the Xilinx drivers - has enough documentation scattered around to make it work, if you have a lot of patience. But the only speed reference I could find for it is this Z-7030 benchmark of 84.7MB/s. I found nothing for the newer ZU+, with the XDMA PCIe Bridge driver. I wasn't expecting it to be fast enough, but it seemed worth the effort to do a speed test.

For hardware, I have my TE0803 carrier with the ZU4CG version of the TE0803. All I need for this test is JTAG, UART, and the four PL-side GT transceivers, for PCIe Gen3 x4. I made a JTAG + UART cable out of the Digilent combo part, which is directly supported in Vivado and saves a separate USB port for the terminal. Using Trenz's bare board files, it was pretty quick to set up.

TE0803 Carrier + Dead-Bugged Digilent JTAG+USB Adapter.
Next, I wanted to validate the PCIe routing with a loopback test, following this video as a guide. I made my own loopback out of the cheapest M.2 to PCIe x4 adapter Amazon has to offer by desoldering the PCIe x4 connector and putting in twisted pairs. This worked out nicely since I could intentionally mismatch the length of one pair to get a negative result, confirming I wasn't in some internal loopback mode.

The three-eyed...something.
For most of the rest of this test, I'm roughly following the script from the FPGA Drive example design readme files, with deviations for my custom board and for Vivado 2019.1 support. The scripts there generate a Vivado project and block design with the Processing System and the XDMA PCIe Bridge. I had a few hardware differences that had to be taken care of manually (EMIO UART, inverted SSD reset signal), but having a reference design to start from was great.

The example design includes a standalone application for simply checking that a PCIe drive enumerates on the bus, but it isn't built for the ZU+. As the readme mentions, there had been no standalone driver for the XDMA PCIe Bridge. Well, as of Vivado 2019.1, there is! In SDK, the standalone project for xdmapcie_rc_enumerate_example.c can be imported directly from the peripheral driver list in system.mss from the exported hardware.

XDMA standalone driver example is available as of Vivado 2019.1!
I installed an SSD and ran this project and much to my amazement, the enumeration succeeded. By looking at the PHY Status/Control register at offset 0x144 from the Bridge Register Memory Map base address (0x400000000 here), I was also able to confirm that link training had finished and the link was Gen3 x4. (Documentation for this is in PG194.) Off to a good start, then.

Installed a 1TB Samsung 970 EVO Plus.
Unfortunately, that's where the road seems to end in terms of quick and easy setup. The next stage involves PetaLinux, which is a toolchain for building the Xilinx Linux kernel. I don't know about other people, but every time the words "Linux" and "toolchain" cross my path, I automatically lose a week of time to setup and debugging. This was no exception.

Unsurprisingly, PetaLinux tools run in Linux. I went off on a bit of a tangent trying to see if they would run in WSL2. They do, if you contain your project in the Linux file system. In other words, I couldn't get it to work on /mnt/c/... but it worked fine if the project was in ~/home/... But, WSL2 is a bit bleeding edge still and there's no USB support as of now. So you can build, but not JTAG boot. If you boot from an SD card, though, it might work for you.

So I started over with a VirtualBox VM running Ubuntu 18.04, which was mercifully easy to set up. For reasons I cannot at all come to terms with, you need at least 100GB of VM disk space for the PetaLinux build environment, all to generate a boot image that measures in the 10s of MB. I understand that tools like this tend to clone in entire repositories of dependencies, but seriously?! It's larger than all of my other development tools combined. I don't need the entire history of every tool involved in the build...

And here I thought Xilinx was a disk space hog...
The build process, even not including the initial creation of this giant environment, is also painfully slow. If you are running it on a VM, throw as many cores at it as you can and then still plan to go do something else for an hour. I started from the build script in the FPGA Drive example design, making sure it targeted cpu_type="zynqMP" and pcie_ip="xdma".

This should set up the kernel properly, but some of the config options in PetaLinux 2019.1 might not exactly match the packaged configs. There's a reference here explaining how to manually configure the kernel for PCIe and NVMe hosting on the Z-7030. I went through that, subbing in what I thought were correct ZU+ and XDMA equivalents where necessary. Specifically:
  • It seems like as of PetaLinux 2019.1 (v4.19.0), there's an entire new menu item under Bus support for PCI Express Port Bus support. Including this expands the menu with other PCI Express-specific items, which I left at whatever their default state was.
  • Under Bus support > PCI controller drivers, Xilinx XDMA PL PCIe host bridge support has to be included. I don't actually know if the NWL PCIe Core is also required, but left it in since it was enabled by default. It might be the driver for the PS-side PCIe?
  • Some things related to NVMe are in slightly different places. There's an item called Enable the block layer on the main config page that I assume should be included. Under Device Drivers, Block devices should be included. And under Device Drivers > NVME Support, NVM Express block device should also be included.
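The menu entries above can be cross-checked against the generated kernel .config. Here is a minimal sketch of such a check. The Kconfig symbol names are my best-guess mapping of the menu titles, not something confirmed by the build described here, so verify them against your own kernel tree. The NWL core is left out because it is unclear whether it is required.

```python
# Assumed mapping from menu entries to Kconfig symbols. These names are an
# assumption for illustration; check them in your kernel tree.
REQUIRED = {
    "CONFIG_PCIEPORTBUS": "PCI Express Port Bus support",
    "CONFIG_PCIE_XDMA_PL": "Xilinx XDMA PL PCIe host bridge support",
    "CONFIG_BLOCK": "Enable the block layer",
    "CONFIG_BLK_DEV": "Block devices",
    "CONFIG_BLK_DEV_NVME": "NVM Express block device",
}


def parse_config(text):
    """Return {symbol: value} for every 'CONFIG_X=value' line."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            opts[key] = value
    return opts


def missing_options(text, required=REQUIRED):
    """List required symbols that are neither built in ('y') nor modular ('m')."""
    opts = parse_config(text)
    return [sym for sym in required if opts.get(sym) not in ("y", "m")]


# Hypothetical .config excerpt, only to show the output format.
example = """
CONFIG_BLOCK=y
CONFIG_PCIEPORTBUS=y
# CONFIG_BLK_DEV_NVME is not set
"""
for sym in missing_options(example):
    print(f"missing: {sym} ({REQUIRED[sym]})")
```

In practice the text would come from the .config file that the PetaLinux kernel configuration step writes out, rather than from an inline string.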

The rest of the kernel and rootfs config seems to match pretty closely the Z-7030 setup linked above. But I will admit it took me three attempts to create a build that worked, and I don't know exactly what trial-and-error steps I did between each one. Even once the correct controller driver (pcie-xdma-pl.c) was being included in the build, I couldn't get it to compile successfully without this patch. I don't know what the deal is with that, but after that I finally got a build that would enumerate the SSD on the PCIe bus:

Output from lspci -vv confirms link speed and width.
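For scripted checks, the negotiated link speed and width can be pulled out of the lspci -vv text. This is a small sketch, assuming the usual "LnkSta: Speed ..., Width ..." line format. The sample line is illustrative, not copied from the screenshot above.

```python
import re

# Matches e.g. "LnkSta: Speed 8GT/s, Width x4" and the newer
# "LnkSta: Speed 8GT/s (ok), Width x4 (ok)" variant.
LNKSTA = re.compile(
    r"LnkSta:\s*Speed\s+([\d.]+)GT/s(?:\s*\([^)]*\))?,\s*Width\s+x(\d+)"
)

# Per-lane transfer rate in GT/s for the first three PCIe generations.
GEN_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3}


def link_status(lspci_text):
    """Return a list of (GT/s, lane count), one per LnkSta line found."""
    return [(float(s), int(w)) for s, w in LNKSTA.findall(lspci_text)]


# Illustrative sample line (assumption, not the author's actual output).
sample = "\t\tLnkSta:\tSpeed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive-"
for rate, width in link_status(sample):
    print(f"Gen{GEN_BY_RATE.get(rate, '?')} x{width}")
```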
I had already partitioned the drive off-board, so I skipped over those steps and went straight to the speed tests as described here. I tested a few different block sizes and counts with pretty consistent results: about 460MB/s write and 630MB/s read.
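The linked procedure is not reproduced here. As a rough stand-in, this sketch times sequential writes and reads for a given block size and count. It is not the method used for the numbers above, and the function names are hypothetical. The read figure will be inflated by the page cache unless caches are dropped first, and interpreter overhead on a small ARM core may itself limit the result.

```python
import os
import tempfile
import time


def write_throughput(path, block_size, count):
    """Write `count` blocks of `block_size` bytes, fsync, and return MB/s."""
    block = os.urandom(block_size)
    start = time.perf_counter()
    with open(path, "wb", buffering=0) as f:
        for _ in range(count):
            f.write(block)
        os.fsync(f.fileno())  # include the time to flush to the device
    elapsed = time.perf_counter() - start
    return block_size * count / elapsed / 1e6


def read_throughput(path, block_size):
    """Read the whole file in `block_size` chunks and return MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e6


# Tiny demo against a temporary file; point `path` at the mounted SSD and
# use much larger sizes for a meaningful measurement.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "speedtest.bin")
    for block_size in (4096, 65536):
        w = write_throughput(path, block_size, 64)
        r = read_throughput(path, block_size)
        print(f"bs={block_size}: write {w:.0f} MB/s, read {r:.0f} MB/s")
```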

Not sure about those correctable errors. I guess it's better than un-correctable errors.
That is actually pretty fast, compared to the Z-7030 benchmark. The ZU+ and the new driver seem like they're able to make much better use of the SSD. But, it's still about a factor of two below what I want. There could be some extra performance to squeeze out from driver optimization, but at this point I feel like the effort will be better spent looking into hardware acceleration, which has been demonstrated to get to 1GB/s write speeds, even on older hardware.

Since there's no published datasheet or pricing information for that or any other NVMe hardware accelerator, I'm not inclined to even consider it as an option. At the very least, I plan to read through the open specification and see what actually is required of an NVMe host. If it's feasible, I'd definitely prefer an ultralight custom core to a black-box IP...but that's just me. In the meantime, I have some parallel development paths to work on.

Wednesday, June 12, 2019

Freight Train of Pixels

I have a problem. After any amount of time at any level of development of anything, I feel the urge to move down one layer into a place where I really shouldn't be. Thus, after spending time implementing capture software for my Point Grey FLIR block cameras, I am now tired of dealing with USB cables and drivers and firmware and settings.

What I want is image data. Pixels straight from a sensor. As many as I can get, as fast as I can get them. To quote Jeremy Clarkson from The Great Train Race (Top Gear S13E1), "Make millions of coals go in there." Except instead of coals, pixels. And instead of millions, trillions. It doesn't matter how. I mean, it does, but really I want the only real constraints to be where the pixels are coming from and where they are going. So let's see what's in this rabbit hole.

The Source

The image sensor feeding this monster will be an ams (formerly CMOSIS) CMV12000. It's got lots of pros and a few cons for this type of project, which I'll get into in more detail. But the main reason for the choice is entirely non-technical: This is a sensor that I can get a full datasheet for and purchase without any fucking around. This was true even back in the CMOSIS days, but as an active ams part it's now documented and distributed the same way as their $1 ICs.
The CMV12000 is not $1, sadly, but you can Buy It Now if you really want. For prototyping, I have two monochrome ones that came from a heavily-discounted surplus listing. Hopefully they turn on.
This is a case, then, where the available component drives the design. The CMV12000 is not going to win an image quality shootout with a 4K Sony sensor, but it is remarkably fast for its resolution: up to 300fps at 4096x3072. That's 3.8Gpx/s, somewhere between the total camera interface on a Tesla Full Self Driving Chip (2.5Gpx/s) and the imaging rate of a Phantom Flex 4K (8.9Gpx/s). There was a version jump on this sensor that I think moved it into a different category of speed, and that's where I'm placing the lever for pushing this design.

The CMV12000 is also a global shutter CMOS sensor, something more common in industrial and machine vision applications than consumer cameras. The entire frame is sampled at once, instead of row-by-row as in rolling shutter CMOS. (The standupmaths video on the topic is my favorite.) The advantage is that moving objects and camera panning don't create distortion, which is arguably just correct behavior for an image sensor... But although a few pro cameras with global shutter have existed, even those have mostly died out. This is due to an interlinked set of trade-offs that give rolling shutter designs the advantage in cost and/or dynamic range.

For engineering applications, though, a global shutter sensor with an external trigger is essentially a visual oscilloscope, and can be useful beyond just creating normal video. By synchronizing the exposure to a periodic event, you can measure frequencies or visualize oscillations well beyond the frame rate of the sensor. Here's an example of my global shutter Grasshopper 3 camera capturing the cycle of a pixel shifting DLP projector. Each state is 1s/720 in duration, but the trigger can be set to any multiple of that period, plus or minus a tiny bit, to capture the sequence with an effective frame rate much higher than 720fps.
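The trigger trick amounts to equivalent-time sampling: each frame lands a little later in the repeating cycle than the one before. Here is a minimal sketch of the arithmetic. The 1/720 s figure comes from the text and is treated as the period of the repeating event for simplicity. The multiple (12) and the per-frame offset (1/8 of a period) are hypothetical values chosen for illustration.

```python
from fractions import Fraction


def sample_phases(event_period, n_cycles, offset, n_frames):
    """Phase (fraction of one event period) at which each frame is captured."""
    trigger_period = n_cycles * event_period + offset
    return [(k * trigger_period / event_period) % 1 for k in range(n_frames)]


T = Fraction(1, 720)   # event period in seconds, from the text
offset = T / 8         # hypothetical: slip 1/8 of a period per frame
phases = sample_phases(T, n_cycles=12, offset=offset, n_frames=8)

trigger_rate = 1 / (12 * T + offset)   # real camera frame rate
effective_rate = 1 / offset            # equivalent-time sampling rate

print([str(p) for p in phases])
print(f"camera runs at {float(trigger_rate):.2f} fps")
print(f"effective rate is {float(effective_rate):.0f} fps")
```

With these numbers the camera only runs at about 59fps, but successive frames step through the cycle as if it were sampled at 5760fps. This works as long as the event really is periodic over the capture.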

Whether a consequence of the global shutter or not, the main on-paper shortcoming of the CMV12000 is the relatively high dark noise of 13e-. For comparison, the Sony IMX294CJK, the 4K sensor in some new cameras with very good low-light capability, is below 2e-. That's a rolling shutter sensor, though. Sony also makes low-noise global shutter CMOS sensors like the IMX253, at around 2.5e-. The extra noise on the CMV12000 will mean that it needs more light for the same image quality compared to these sensors.

Even given adequate light, the higher noise also eats into the dynamic range of the sensor. The signal-to-noise ratio for a given saturation depth will be lower. This means either noisy shadows or blown-out highlights. But the CMV12000 has a feature I haven't seen on any other commercially-available sensor: a per-pixel stepped partial reset. The theory is to temporarily stop accumulating charge on bright pixels when they hit intermediate voltages, while allowing dark pixels to keep integrating. Section 4.5.1 in this thesis has more on this method.

In the example below, the charge reading is simulated for 16 stops of contrast. With baseline lighting, the bottom four stops are lost in the noise and the top four are blown out. Increasing the illumination by 4x recovers two stops on the bottom, but loses two on top. The partial reset capability slows down the brightest pixels, recovering several more stops on top without affecting the dark pixels. The extra light is still needed to overcome the dark noise, but it's less of an issue in terms of dynamic range.
Dynamic range recovery using 3-stage partial reset.
The end result of partial reset is a non-linear pixel response to illumination. This is often done anyway, after the ADC conversion, to create log formats that compress more dynamic range into fewer bits per pixel. Having hardware that does something similar in-pixel, before the ADC, is a powerful feature that's not at all common.
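A simplified model of this behavior: at each reset point during the exposure, any pixel holding more than a given level is clamped down to that level, then keeps integrating. The sketch below is not the simulation behind the figure above, and every number in it (full well, reset times, clamp levels) is hypothetical rather than taken from the CMV12000 datasheet.

```python
def pixel_response(rate, full_well, resets=()):
    """Charge at the end of a unit-length exposure.

    rate:      electrons the pixel would collect over the full exposure
               with no reset and no saturation.
    resets:    (time_fraction, clamp_level) pairs in time order; at each
               one, charge above clamp_level is reset down to clamp_level.
    """
    charge, t = 0.0, 0.0
    for t_reset, level in resets:
        charge = min(charge + rate * (t_reset - t), level)
        t = t_reset
    return min(charge + rate * (1.0 - t), full_well)


FULL_WELL = 10000                        # hypothetical, electrons
RESETS = ((0.90, 5000), (0.99, 7500))    # hypothetical 3-slope schedule

for stop in range(0, 16, 3):
    rate = 10 * 2 ** stop                # 10 e- at the darkest stop shown
    linear = pixel_response(rate, FULL_WELL)
    stepped = pixel_response(rate, FULL_WELL, RESETS)
    print(f"stop {stop:2d}: linear {linear:8.0f} e-  partial reset {stepped:8.0f} e-")
```

In this model, dark pixels never reach a clamp level and respond exactly as before, while bright pixels only integrate at full slope during the short time after the last reset. That is what produces the piecewise-linear response.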

Another aspect of the CMV12000 that helps with implementation is the pixel data interface: the data is spread out on 64 parallel LVDS output pairs that each serve a group of pixel columns. This extra-wide bus means more reasonable clock speeds: 300MHz DDR (600Mb/s) for full rate. A half-meter wavelength means wide intra-pair routing tolerances. There is still a massive 4.8ns inter-channel skew that has to be dealt with, but it would be futile to try to length match that. The sensor does put out training data meant for synchronizing the individual channels at the receiver, which is a headache I plan to have in the future.

The Sink

I'm starting from the assumption that it's impossible to really do anything permanent with 38Gb/s of data, if you're working with hardware at or below that of a laptop PC. In an early concept, I was planning to just route the data to a PCIe x4 output and send it in to something like an Intel NUC for further processing. But even that isn't fast enough for the CMV12000. (Also, you can buy something like that already. No fun.) And even if you could set up a 40Gb/s link to a host PC through something like Thunderbolt 3, it's really just kicking the problem down the road to more and more general hardware, which probably means more Watts per bit per second.

Ultimately, unless the data is consumed immediately (as with a machine vision algorithm that uses one frame and then discards it), or buffered into RAM as a short clip (as with circular buffers in high-speed cameras), the only way to sink this much data reasonably is to compress it. And this is where this project goes off the rails a little.

For starters, I choose 1GB/s as a reasonable sink rate for the data. This is within reach of NVMe SSD write speeds, and makes for completely reasonable recording times of 17min/TB (at maximum frame rate). This is very light compression, as far as video goes - less than 5:1. I think the best tool for the job is probably wavelet compression, rather than something like h.265. It's intra-frame and uses relatively simple logic, which means fast and cheap. But putting aside the question of how fast and how cheap for now, I first just want to make sure the quality would be acceptable.
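The numbers above can be reproduced with a few lines of arithmetic. This sketch uses the resolution, frame rate, and LVDS figures quoted earlier. Taking the full 64 x 600Mb/s link rate as the raw data rate is an assumption that slightly overestimates the payload.

```python
WIDTH, HEIGHT, FPS = 4096, 3072, 300   # sensor resolution and max frame rate
LANES, LANE_RATE = 64, 600e6           # LVDS pairs and bits/s per pair
SINK_RATE = 1e9                        # target SSD write rate, bytes/s

pixel_rate = WIDTH * HEIGHT * FPS          # pixels per second
link_rate = LANES * LANE_RATE              # bits per second off the sensor
bits_per_pixel = link_rate / pixel_rate    # upper bound, includes overhead
compression = (link_rate / 8) / SINK_RATE  # required compression ratio
minutes_per_tb = 1e12 / SINK_RATE / 60     # recording time on a 1TB drive

print(f"pixel rate:     {pixel_rate / 1e9:.2f} Gpx/s")
print(f"link rate:      {link_rate / 1e9:.1f} Gb/s ({bits_per_pixel:.1f} bits/px)")
print(f"compression:    {compression:.1f}:1")
print(f"recording time: {minutes_per_tb:.1f} min/TB")
```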

There are several good examples of wavelet compression already in use: JPEG2000 uses different variants for lossless and lossy image compression. REDCODE is wavelet-based and 5:1 is a standard setting described as "visually lossless". CineForm is a wavelet codec recently open-sourced by GoPro. The SDK for CineForm includes a lightweight example project that just compresses a monochrome image with different settings. Running a test image through that with settings close to 5:1 produces good results:
The original monochrome image.
The wavelet transform outputs a 1/8-scale low-frequency thumbnail and three stages of quantized high-frequency blocks, which are sparse and easy to compress. I just zipped this image as a test and got a 5.7:1 ratio with these settings.
The recovered image.
Since these images are going to be destroyed by rescaling anyway, here's a 400% zoom of some high-contrast features.

The choice of wavelet type does matter, but I think the quantization strategy is even more important. The wavelet transform doesn't reduce the size of the data, it just splits it into low-frequency and high-frequency blocks. In fact, for all but the simplest wavelets, the blocks require more bits to store than the original pixels:
Output range maps for different wavelets. All but the simplest wavelets (Haar, Bilinear) have corner cases of low-frequency or high-frequency outputs that require one extra bit to store.
Take the CineForm 2/6 wavelet (a.k.a. reverse biorthogonal 1.3?) as an example: the low-frequency block is just an average of two adjacent pixels, so it doesn't need any more bits than the source data. But the high-frequency blocks look at six adjacent pixels and could, for some corner cases, give a result that's larger than the maximum pixel amplitude. It needs one extra bit to store the result without clipping. Seems like we're going in the wrong direction!
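The extra bit can be checked by brute force over the corner cases, where every input pixel is either zero or full scale. This is a minimal sketch assuming 10-bit pixels and one common integer form of the 2/6 high-pass filter (half the pair difference plus a 1/16-weighted difference of the neighboring pair sums). The exact scaling and rounding in the CineForm SDK may differ.

```python
from itertools import product

MAX_PX = 1023  # assumed 10-bit pixels


def haar_high(p):
    """Haar high-pass: half the difference of one pixel pair."""
    return (p[0] - p[1]) >> 1


def wavelet_2_6_high(p):
    """2/6 high-pass for the middle pair of six adjacent pixels."""
    prev_sum, next_sum = p[0] + p[1], p[4] + p[5]
    return ((p[2] - p[3]) >> 1) + ((next_sum - prev_sum) >> 4)


def output_range(fn, taps):
    """Extremes over all corner cases (each pixel either 0 or MAX_PX)."""
    outs = [fn(p) for p in product((0, MAX_PX), repeat=taps)]
    return min(outs), max(outs)


def signed_bits(lo, hi):
    """Smallest two's-complement width that holds every value in [lo, hi]."""
    n = 1
    while lo < -(1 << (n - 1)) or hi > (1 << (n - 1)) - 1:
        n += 1
    return n


for name, fn, taps in (("Haar", haar_high, 2), ("2/6", wavelet_2_6_high, 6)):
    lo, hi = output_range(fn, taps)
    print(f"{name}: range [{lo}, {hi}] needs {signed_bits(lo, hi)} signed bits")
```

Both filters are monotonic in each input, so checking only the corner cases is enough to find the extremes.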

Like most image compression techniques, the important fact is that the high frequency information is less valuable, and can be manipulated and even discarded without as much visual penalty. By applying a deadband and quantization step to the high-frequency blocks, the data becomes more sparse and easier to compress. Since this is the lossy part of the algorithm, the details are hugely important. I have a little sandbox program that I use to play with different wavelet and quantization settings on test images. In most cases, 5:1 compression is very reasonable.
Different wavelets and quantizer settings can be compared quickly in this software sandbox.
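To make the deadband and quantization step concrete, here is a minimal sketch of a generic deadband quantizer for integer high-frequency coefficients. It is not the quantizer used in the sandbox above. The deadband, step size, and sample coefficients are hypothetical.

```python
def quantize(x, deadband, step):
    """Map a coefficient to a small integer; values inside the deadband become 0."""
    mag = abs(x)
    if mag <= deadband:
        return 0
    q = (mag - deadband) // step + 1
    return q if x > 0 else -q


def dequantize(q, deadband, step):
    """Reconstruct at the middle of the quantization bin."""
    if q == 0:
        return 0
    mag = deadband + (abs(q) - 1) * step + step // 2
    return mag if q > 0 else -mag


DEADBAND, STEP = 4, 8   # hypothetical settings

# Hypothetical high-frequency coefficients: mostly small, a few edges.
coeffs = [0, 3, -2, 1, 40, -1, 2, -37, 0, 4, -3, 120]
quantized = [quantize(c, DEADBAND, STEP) for c in coeffs]
restored = [dequantize(q, DEADBAND, STEP) for q in quantized]

print("quantized:", quantized)
print("restored: ", restored)
print("zeros:    ", quantized.count(0), "of", len(quantized))
```

The long runs of zeros are what make the quantized blocks cheap to entropy-code. The cost is a bounded error on every coefficient, which is where the loss comes from.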
That's enough evidence for me that wavelet compression is a completely acceptable trade-off for opening up the possibility of sinking to a normal 1TB SSD instead of an absurd amount of RAM. A very fast RAM buffer is still needed to smooth things out, but it can be limited in size to just as many frames as are needed to ride out pipeline transients. Now, with the source and sink constraints defined, what the hell kind of hardware sits in the middle?

The Pipe

There was never any doubt that the entrance to this pipeline had to be an FPGA. Nothing else can deal with 64 LVDS channels. But instead of just repackaging the data for PCIe and passing it along to some poor single board computer to deal with, I'm now asking the FPGA to do everything: read in the data, perform the wavelet compression, and write it out to an SSD. This will ultimately be smaller and cheaper, since there's no need for a host computer, but it means a much fancier FPGA.

I'm starting from scratch here, so all of this is just an educated guess, but I think a viable solution lies somewhere in the spectrum of Xilinx Zynq Ultrascale+ devices. They are FPGA hardware bolted to ARM cores in a single chip. Based on the source and sink requirements I can narrow down further to something between the ZU4 and ZU7. (Below the ZU4 doesn't have the necessary transceivers for PCIe Gen3 x4 to the SSD, and above the ZU7 is prohibitively expensive.) Within each ZU number, there are also three categories: CG has no extra hardware, EG has a GPU, and EV has a GPU and h.264/h.265 codec.

In the interest of keeping development cost down, I'm starting with the bottom of this window, the ZU4CG. The GPU and video codec might be useful down the road for 30fps previews or making proxies, but they're too slow to be part of the main pipeline. Since they're fairly sideways-compatible, I think it's reasonable to start small and move up the line if necessary.

I really want to avoid laying out a board for the bare chip, its RAM, and its other local power supplies and accessories. The UltraZed-EV almost works, but it doesn't break out enough of the available LVDS pins. It's also only available with the ZU7EV, the very top of my window. The TE08xx Series of boards from Trenz Electronic is perfect, though, covering a wider range of the parts and breaking out enough IO. I picked up the ZU4CG version for less than the cost of just the ZU4CG on Digi-Key.
Credit card-sized TE0803 board with the ZU4CG and 2GB of RAM. Not counting the FPGA, the processing power is actually a good deal less than what's on a modern smartphone.
One small detail I really like about the TE0803 is that the RAM is wired up as 64-bit wide. Assuming the memory controller can handle it, that would be over 150Gb/s for DDR4-2400, which dwarfs even the CMV12000's data rate. I think the RAM buffer will wind up on the compressed side of the pipeline, but it's good to know that it has the bandwidth to handle uncompressed sensor data too, if necessary.

Time for a motherboard:
The "tall" side has the TE0803 headers, an M.2 connector, USB-C, a microSD slot, power supplies, and an STM32F0 to act as a sort-of power/configuration supervisor. Sensor pins are soldered on this side.
The "short" side has just the sensor and some straggler passives that are under 1mm tall.
Aside from the power supplies, this board is really just a breakout for the TE0803, and the placement of everything is driven by where the LVDS- and PCIe-capable pins are. Everything is a differential pair, pretty much. There are a bunch of different target impedances: 100Ω for LVDS, 85Ω for PCIe Gen3, 90Ω for USB. I was happy to find that JLCPCB offers a standard 6-layer controlled-impedance stackup. They even have their own online calculator. I probably still fucked up somehow, but hopefully at least some of it is right so I can start prototyping the software.

Software? Hardware? What do you call FPGA logic? There are a bunch of somewhat independent tasks to deal with on the chip. At the input side, the pixel data needs to be synchronized using training data to deal with the massive 4.8ns inter-channel skew. The FPGA inputs have a built-in delay tap, but it maxes out at 1.25ns. You can, in theory, cascade these with the adjacent unused output delays, to reach 2.5ns. That's obviously not enough to directly cancel the skew, but it is enough to reach the next 300MHz clock edge. So, possibly some combination of cascaded hardware delays and intentional bit slipping can cover the full range. It's going to be a nightmare.
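One way to picture the combination of delays and bit slipping is to split each channel's required delay into whole bit periods (handled by slipping bits in logic) plus a sub-bit remainder (handled by the delay taps). This sketch is only the arithmetic under that assumption, using the 600Mb/s, 2.5ns, and 4.8ns figures from the text. It says nothing about how the training pattern would actually be used to find the skew.

```python
BIT_PERIOD_NS = 1e3 / 600   # 600Mb/s per channel, about 1.667ns per bit
MAX_DELAY_NS = 2.5          # cascaded input + output delay range


def compensate(skew_ns):
    """Split a required delay into whole-bit slips plus a fine delay in ns."""
    slips = int(skew_ns // BIT_PERIOD_NS)
    fine = skew_ns - slips * BIT_PERIOD_NS
    return slips, fine


for skew in (0.5, 1.5, 3.0, 4.8):
    slips, fine = compensate(skew)
    ok = "ok" if fine <= MAX_DELAY_NS else "out of range"
    print(f"skew {skew:.1f}ns -> {slips} bit slip(s) + {fine:.3f}ns delay ({ok})")
```

Because the remainder is always less than one bit period, it stays inside the 2.5ns cascaded range, though it can exceed the 1.25ns of a single delay element.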

The output side might be even worse. Just look at the number of differential pairs going in to the TE0803 headers vs. the number coming out. That's the ratio of how much tighter the timing tolerance is on the PCIe outputs. The edge of one bit won't hit the M.2 connector until a couple more have already left the FPGA. In this case, I have taken the effort to length match the pairs themselves. I won't know how close I am until I can do a loopback test.
Length matching the PCIe differential pairs to make up for the left turns and TE0803 routing.
Even assuming the routing is okay, there's the problem of NVMe. NVMe is an open specification for what lives on top of the PCIe PHY to control communication with the SSD. It's built in to Linux, including versions that can run on the ZU4CG's ARM cores. But that puts the operating system in the pipeline, which sounds like a disaster. I haven't seen any examples of that running at anywhere near 1GB/s. I think hardware-accelerated NVMe might work, but as far as I can tell there are no license-free NVMe cores in existence. I don't have a solution to this problem yet, but I will happily sink hours into anything that prevents me from having to deal with IP vendors.

Sitting right in the middle, between these input and output constraints, is the complete mystery that is the wavelet core. This has to be done in hardware. The ARM cores and even the GPU are just not fast enough, and even if they were, accessing intermediate results would quickly eat the RAM bus. The math operations involved are so compact, though, that it seems natural to implement them in tiny logic/memory cores and then put as many of them in parallel as possible.

The wavelet cores are the most interesting part of this pipeline and require a separate post to cover in enough detail to be meaningful. I have a ton of references on the theory and a little bit of concept for how to turn it into lightweight hardware. As it stands, I know only enough to have some confidence that it will fit on the ZU4CG, in terms of both logic elements and distributed memory for storing intermediate results. (The memory requirement is much less than a full frame, since the wavelets only look ahead/behind a few pixels at a time.) But there is an immense amount of implementation detail to fill in, and I hope to make a small dent in that while these boards are in flight.

To summarize, I still have no clue if, how, or when any of this will work. My philosophy on this project is to send the pixels as fast as they want to go and try to remove anything that gets in the way. It's not really a plan - more of a series of challenges.

Monday, May 27, 2019

KSP: Laythe Colony Part 3, The Colony Ships

Jool Launch Window #2 is all about getting as many Kerbals in transit to Laythe as possible, and that means building a fleet of colony ships. This was actually the first ship designed for this mission, but I only built one as a proof-of-concept before committing to the Robotic Fleet for Launch Window #1. Those habitats, rovers, and relays will arrive first to pave the way for the colony ships.

The colony ships are built in orbit, with each part launched separately on the same heavy-lift boosters that sent up the Robotic Fleet. The core of each ship, around which the rest of the ship is built, is a Passenger Module:

The Passenger Module
The Passenger Module has room for 18 Kerbals (half the crew of each ship), with two main living compartments on each end, a central stack of general purpose seating, and two observation domes. It's meant to be the "comfortable" portion of the ship, to make the multi-year journey more bearable than would be possible in a lander cockpit. Not that Kerbals really care.

One of the main quality of life considerations for the colony ships is the ability to spin to generate artificial gravity in some of the living quarters. For this reason, the rest of the ship is built along an axis passing through the center of the passenger module. Forward, the next part is the Docking Module:

The Docking Module
While it has space for another six Kerbals, the Docking Module is more of a working space than a living space. Since it's on the central axis, there's no artificial gravity. But it has a large science lab and common area for the crew. Most importantly, it serves as the docking interface for the Space Planes, which shuttle crew to and from the colony ships.

The Space Planes
The Space Planes are really the key to this entire mission, providing a way to get hundreds of Kerbals down to Laythe without having to exactly target flat landing sites from orbit. I tweaked and tested the design to the point where getting to orbit, docking with a colony ship, and returning to Kerbin for a runway landing was utterly routine. Each colony ship required four Space Plane round-trips and two one-way trips to fully crew. The two one-ways go with the ship to Laythe, where they will be used to ferry Kerbals down to the surface.

The back-to-back docking configuration for the Space Planes minimizes the moment of inertia along the spin axis. The two planes have to be exactly symmetric, so each interfaces with two medium-size docking ports for alignment. It is possible, with careful flying, to get both ports to engage at the same time. In addition to enforcing symmetry, this makes the final structure much more rigid. Finding parts that are exactly the right spacing on both sides to make this possible was the trickiest part of the design.

I could write an entire post about the Space Plane design, but I think I'll just post some pictures and videos of it kicking ass instead:

The last picture has a story that goes with it: For some reason, after dozens of clean flights, I botched a take-off and slammed back into the KSC runway with the gear still down, breaking off both wings, the outer engines and fuel tanks, the vertical stabilizers, and all but the two inner horizontal control surfaces. The fraction of a plane that was left was somehow still able to gain altitude, do a wide 180° turn, and make a water landing just off shore.

Anyway, back to the colony ships. Behind the Passenger Module is a truss structure I just call The Node:

The Node
This is the lightest and simplest part of the colony ship, primarily serving as a connector between the crew stack and the propulsion. It also carries the large solar panels, some battery storage, extra reaction control systems, and side ports for docking other modules, such as for refueling. Altogether a small but important building block.

To push all this, three Propulsion Modules are launched separately and docked to the back of the Node. These are the full four-engine versions of the propulsion modules used for the Robotic Fleet.

The Propulsion Modules
With 12 engines in total, the colony ships actually have a higher thrust-to-weight ratio than the robotic landers. The entire colony ship, including the two Space Planes, comes in at just under 300 tons and has a fully-fueled Delta-V of about 4400m/s, which should be enough to get to Laythe orbit with just a tiny bit of help from gravity assists off Tylo (or Laythe itself).

The process of building a single colony ship takes 13 separate launches: six for assembly, six crew shuttles (including two permanent ones), and one refueling run. (While the propulsion modules get to orbit fully-fueled, the total Delta-V counts on topping off the two Space Planes.) It's without a doubt the most ambitious in-orbit construction project I've attempted in KSP.

Oh, and I built 10 of these for Launch Window #2. That's about 3,000 tons of hardware, including 20 Space Planes and 360 Kerbals, on the way to Laythe.

196 Days Remain

I had intended for the second launch window to be the last ships out, but my arbitrary deadline of Year 3, Day 0 for the destruction of Kerbin leaves some time to send up a few more. They'll have to wait for the third launch window, possibly in the relative safety of a Minmus orbit, but I can think of a few extra pieces of hardware that would be useful to the colony.

Sunday, March 31, 2019

TinyCross: Chassis Build

Just hit print...
This winter build season is coming to a close: almost all of the mechanical work for TinyCross is done! Here's a recap of how the rolling chassis came together:

The frame and suspension are mostly a kit of aluminum plates and 80/20 extrusion, with almost no post-machining required, so it went together very quickly. After the main box frame assembly and seat mounting, I started with a test build of a single corner of A-arm geometry to make sure I hadn't missed any clearance issues. I also wanted to get a first impression of the stiffness in real life, since that's probably the biggest risk of this new design.

I had no trouble at all with the front-right corner. Everything fit together as planned and the stiffness felt adequate, largely thanks to the zero-slop QA1 ball joints. I was actually a little surprised at how well-behaved it felt. Once all six ball joints and the air shock pins were tightened down, it really did have only the degrees of freedom it should have: one for steering and one for suspension travel. There's no rattling or play at all. With high confidence from the test corner, I went into production line mode for the other three.

PTFE-lined QA1 1/4-28 rod ends (CMR4TS) are the real stars of this build. It would not be possible with McMaster's selection of ball joints, which are either cheap and overly loose or expensive and overly tight.
I say there was almost no post-machining, but just tapping all the 80/20 ends was a whole day of work.
About here is where the perfect build ended, though, because when I went to attach the front-left corner, I discovered that there was a slight interference between the A-arms and the air shock valve stems. The parts I designed, all 2D plates, are 100% symmetric, so I didn't bother to model the other corners. But the shocks themselves are not symmetric, so it wasn't exactly correct to assume that things would fit together the same in the mirrored configuration. Since the interference was minimal, I debated cutting notches in the A-arms for the valve stems. But I was able to find a more satisfying solution.

Can you spot it?
Not modeling the mirrored parts was a semi-legitimate time-saving strategy, but not modeling the rear corners was just laziness. And of course there was a major interference there: the brake calipers would not have cleared the corners of the frame. (In the front, it's no problem since there is extra space for steering travel.) It was nothing that couldn't be solved with a hacksaw and some improvisation, though. I actually like the final outcome better than the original design...

...he justifies, in post.
Minor issues aside, I am pleased with how the chassis turned out. It's much stiffer than tinyKart, thanks to a slight excursion into the third dimension, but still very light. And I went from 50% to 99% confidence on the suspension design after getting my hands on the assembled corners.

So far, so light.
Most of the machining for this build was for turned parts within the four drive modules. The spindle shafts for the wheels were made from 7075 aluminum and support the wheel bearings (6902-2RS). Unlike on tinyKart, the shafts are doubly-supported within a box structure built around the wheel, which should be much more impact tolerant. The large drive pulleys got some weight reduction and a custom bolt pattern to interface with the wheel hubs.

My favorite bit of packaging is the brake caliper occupying the volume inside the belt loop, with the brake disk flush against one side of the wheel pulley. Torque is sourced and sunk from the same side of the wheel - in fact from the same metal plate.

The motor shaft and motor pulley also required some custom machining. This was a weak link on tinyKart: The original design used set screws on shaft flats, but it was prone to loosening over time (or, in one case, completely shearing off the 10mm shaft at the flat). After switching to keyed shafts (via Alien Power Systems motors), those problems mostly went away. But there was still axial play, and the torque was still being transmitted through a 3mm key into an aluminum keyway.

For TinyCross, I wanted to have a clamping and keyed shaft adapter so the torque would be primarily transmitted through friction, with the key as back-up. There's not a lot of room to work within the 15-tooth drive pulley, so that just gets bored out as much as possible and then pressed like hell, with retaining compound, onto a 7075 adapter. This adapter then gets the 10mm bore with a 3mm keyway. But it also gets slotted, turning it into a clamp. Finally, an off-the-shelf 0.75in aluminum clamping collar tightens the whole assembly down onto the motor shaft, with the key in place.

Additionally, the outboard side of the shaft interfaces with another bearing, for double support, and has a pocket for a shaft rotation sense magnet, to be picked up by a rotary encoder IC.

Not messing around.
For brakes, I opted for the same disks, calipers, cables, and levers as on tinyKart. I briefly debated going hydraulic, but the plumbing for four-wheel disk brakes seemed like an unnecessary nightmare. tinyKart never had a problem with braking torque; it could easily lock up both front wheels. It just had so little weight on the front wheels that braking and steering were often mutually exclusive activities. With four-wheel disk brakes, TinyCross should be much more controllable under braking. The TerraTrike dual pull levers are key to making this work: they have a fulcrum between the lever and two cable ends that ensures both cables get pulled with equal force. I have one such lever for the two front disks and one for the rears.
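The equal-force behavior of the dual pull lever follows from a static balance on the floating link that joins the two cables. Here is a minimal sketch in Python, assuming an idealized, frictionless equalizer bar. The function name, the arm lengths, and the 100 N pull are all hypothetical, not measurements from the actual lever.

```python
def cable_tensions(pull_force, arm_1, arm_2):
    """Static balance of an idealized floating equalizer bar.

    pull_force: force applied at the fulcrum (N), assumed value
    arm_1, arm_2: distance from the fulcrum to each cable end (any
                  consistent unit), assumed values

    Force balance:  t1 + t2 = pull_force
    Moment balance: t1 * arm_1 = t2 * arm_2
    """
    total = arm_1 + arm_2
    t1 = pull_force * arm_2 / total
    t2 = pull_force * arm_1 / total
    return t1, t2


# With the fulcrum centered between the cable ends, the split is even.
front_left, front_right = cable_tensions(100.0, 10.0, 10.0)
print(front_left, front_right)
```

Because the bar is free to pivot, the split depends only on the arm ratio and not on how far each cable has to travel, so uneven pad wear or cable stretch on one side should not change the force balance in this idealized model.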

Independent rear brake lever...what could go wrong?
The last piece of the mechanical puzzle is the steering. At each wheel, there's a steering arm that terminates in a place to mount yet another ball joint, using a T-nut. This is driven by a link comprising a threaded rod with an aluminum stiffener, another trick carried over from tinyKart. The aluminum stiffener is compressed and the threaded rod is stretched by two nuts, creating a link that's stiffer than either part by itself.
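To put rough numbers on why the preloaded link is stiffer than either part alone: between the two nuts, the rod and the stiffener have to stretch or shorten together, so their axial stiffnesses (k = EA/L) add like springs in parallel. The sketch below assumes a steel rod and an aluminum tube. The materials, the dimensions, and the function names are illustrative guesses, not values from the actual build.

```python
import math

E_STEEL = 200e9     # Pa, typical Young's modulus for steel (assumed rod material)
E_ALUMINUM = 69e9   # Pa, typical Young's modulus for aluminum alloys


def axial_stiffness(modulus, area, length):
    """Axial stiffness k = E * A / L of a uniform member, in N/m."""
    return modulus * area / length


def preloaded_link_stiffness(rod_dia, tube_od, tube_id, length):
    """Stiffness of the rod, the tube, and the two acting in parallel.

    All dimensions in meters. Valid only while the preload keeps the
    tube in compression and the rod in tension, so neither unloads.
    """
    rod_area = math.pi / 4 * rod_dia ** 2
    tube_area = math.pi / 4 * (tube_od ** 2 - tube_id ** 2)
    k_rod = axial_stiffness(E_STEEL, rod_area, length)
    k_tube = axial_stiffness(E_ALUMINUM, tube_area, length)
    return k_rod, k_tube, k_rod + k_tube


# Hypothetical dimensions: 5.5 mm effective rod diameter,
# 12.7 mm OD x 6.6 mm ID tube, 200 mm between the nuts.
k_rod, k_tube, k_link = preloaded_link_stiffness(0.0055, 0.0127, 0.0066, 0.200)
print(f"rod: {k_rod:.3e} N/m, tube: {k_tube:.3e} N/m, link: {k_link:.3e} N/m")
```

This only covers axial stiffness. The tube also adds bending and buckling resistance that a bare threaded rod lacks, which this sketch does not model.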

For the rear wheels, all that's required are two fixed mounting points for the other ends of these rods. Rear toe angle is set by adjusting the threaded rod.

The toe-setting plate doubles as the TinyCross badge.
The front requires an actual steering mechanism. In lieu of a rack-and-pinion, I used a simple four-bar, driven by the steering column through a universal joint buried in the middle of the front suspension support tower. Each link in the four-bar has its own set of thrust bearings and radial bushings, to minimize the extra linkage slop.

This sort of setup works since the steering throw is very short: ±45° of travel is all it needs. Amazingly, it all clears over the full suspension travel and there doesn't seem to be much bump steer. (I shouldn't be amazed, since it works in CAD, but I am anyway.)

And just like that, it rolls.

Unfortunately, although the frame and suspension were on-target, the four power modules are way over the weight budget. The wheels themselves are annoyingly heavy. I can't do much about that, but I can probably take some weight out of the surrounding assembly. A lot of the design is driven by the off-the-shelf cast aluminum rims. If I am willing to chop down the rim and re-machine its outer bearing bore, I can probably save a little weight and a lot of width. I can maybe even get it below 34in, which would help with getting through doorways.

But for now I will shift focus to the electronics. Preliminary bring-up of the motor drives has been uneventful (that's a good thing), but I'll need to actually hook them up to LiPos next, which could always become interesting in fiery ways.

To be continued...