Friday, November 29, 2019

Zynq Ultrascale+ FatFs and Direct Speed Tests with Bare Metal NVMe via AXI-PCIe Bridge

Blue wire PCIe REFCLK still hanging in there...
It's time to return to the problem of sinking 1GB/s of data onto an NVMe drive from a Zynq Ultrascale+ SoC. Last time, I benchmarked the Xilinx Linux drivers and found that they were fast, but not quite fast enough. In the comments of that post, there were many good suggestions for how to make up the difference without having to resort to a hardware accelerator. The consensus is that the hardware, namely the stock AXI-PCIe bridge, should be fast enough.

While a lot of the suggestions were ways to speed up the data transfer in Linux, and I have no doubt those would work, I also just don't want or need to run Linux in this application. The sensor input and wavelet compression modules are entirely built in Programmable Logic (PL), with only a minimal interface to the Processing System (PS) for configuration and control. So, I'm able to keep my entire application in the 256KB On-Chip Memory (OCM), leaving the external DDR4 RAM bandwidth free for data transfer.

After compression, the data is already in the DDR4 RAM where it should be visible to whatever DMA mechanism is responsible for transferring data to an NVMe drive. As Ambivalent Engineer points out in the comments:
It should be possible to issue commands directly to the NVMe from software by creating a command and completion queue pair and writing directly to the command queue.
In other words, write a bare metal NVMe driver to interface with the AXI-PCIe bridge directly for initiating and controlling data transfers. This seems like a good fit, both to this specific application and to my general proclivity, for better or worse, to move to lower-level code when I get stuck. A good place to start is by exploring the functionality of the AXI-PCIe bridge itself.

AXI-PCIe Bridge

Part of the reason it took me a while to understand the AXI-PCIe bridge is that it has many names. The version for Zynq-7000 is called AXI Memory Mapped to PCI Express (PCIe) Gen2, and is covered in PG055. The version for Zynq Ultrascale is called AXI PCI Express (PCIe) Gen 3 Subsystem, and is covered in PG194. And the version for Zynq Ultrascale+ is called DMA for PCI Express (PCIe) Subsystem, and is nominally covered in PG195. But, when operated in bridge mode, as it will be here, it's actually still documented in PG194. I'll be focusing on this version. 

Whatever the name, the block diagram looks like this:

AXI-PCIe Bridge Root Port block diagram, adapted from PG194 Figure 1.
The AXI-Lite Slave Interface is straightforward, allowing access to the bridge control and configuration registers. For example, the PHY Status/Control Register (offset 0x144) has information on the PCIe link, such as speed and width, that can be useful for debugging. When the bridge is configured as a Root Port, as it must be to host an NVMe drive, this address space also provides access to the PCIe Configuration Space of both the Root Port itself, at offset 0x0, and the enumerated Endpoint devices, at other offsets.
PCIe Configuration Space layout, adapted from PG213 Table 2-35.
If the NVMe drive has successfully enumerated, its Endpoint PCIe Configuration Space will be mapped to some offset in the AXI-Lite Slave register space. In my case, with no switch involved, it shows up as Bus 1, Device 0, Function 0 at offset 0x1000. Here, it's possible to check the Device ID, Vendor ID, and Class Codes. Most importantly, the BAR0 register holds the PCIe memory address assigned to the device. The AXI address assigned to BAR0 in the Address Editor in Vivado is mapped to this PCIe address by the bridge.
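To make that concrete, here's a rough sketch (not my exact code) of poking at the bridge and the endpoint's configuration space from bare metal. The AXI-Lite base address is a placeholder for whatever the Vivado Address Editor assigned; the 0x144 and 0x1000 offsets are the ones described above.

```c
// Sketch: read the PHY status and the endpoint's PCIe config space through
// the bridge's AXI-Lite slave interface.
#include <stdint.h>
#include "xil_printf.h"

#define AXIPCIE_BASE     0xA0000000UL  // placeholder: AXI-Lite base from the Address Editor
#define PHY_STATUS_CTRL  0x144         // PCIe PHY Status/Control register (PG194)
#define ECAM_EP0         0x1000        // config space of Bus 1, Device 0, Function 0

static inline uint32_t bridge_read32(uint32_t off)
{
    return *(volatile uint32_t *)(AXIPCIE_BASE + off);
}

void pcie_dump_endpoint(void)
{
    uint32_t phy  = bridge_read32(PHY_STATUS_CTRL);  // link speed/width fields, per PG194
    uint32_t id   = bridge_read32(ECAM_EP0 + 0x00);  // [15:0] Vendor ID, [31:16] Device ID
    uint32_t cls  = bridge_read32(ECAM_EP0 + 0x08);  // [31:8] Class Code, [7:0] Revision ID
    uint32_t bar0 = bridge_read32(ECAM_EP0 + 0x10);  // PCIe address assigned to BAR0

    xil_printf("PHY: %08x  ID: %08x  Class: %08x  BAR0: %08x\r\n", phy, id, cls, bar0);
}
```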

Reads from and writes to the AXI BAR0 address are done through the AXI Slave Interface. This is a full AXI interface supporting burst transactions and a wide data bus. In another class of PCIe device, it might be responsible for transferring large amounts of data to the device through the BAR0 address range. But for an NVMe drive, BAR0 just provides access to the NVMe Controller Registers, which are used to set up the drive and inform it of pending data transfers.

The AXI Master Interface is where all NVMe data transfer occurs, for both reads and writes. One way to look at it is that the drive itself contains the DMA engine, which issues memory reads and writes to the system (AXI) memory space through the bridge. The host requests that the drive perform these data transfers by submitting them to a queue, which is also contained in system memory and accessed through this interface.

Bare Metal NVMe

Fortunately, NVMe is an open standard. The specification is about 400 pages, but it's fairly easy to follow, especially with help from this tutorial. The NVMe Controller, which is implemented on the drive itself, does most of the heavy lifting. The host only has to do some initialization and then maintain the queues and lists that control data transfers. It's worth looking at a high-level diagram of what should be happening before diving into the details of how to do it:

System-level look at NVMe data flow, with primary data streaming from a source to the drive.
After BAR0 is set, the host has access to the NVMe drive's Controller Registers through the bridge's AXI Slave Interface. They are just like any other device/peripheral control registers, used for low-level configuration, status, and control of the drive. The register map is defined in the NVMe Specification, Section 2.
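Since the register offsets are fixed by the spec, accessing them from bare metal is just volatile reads and writes relative to BAR0. A minimal sketch, where the BAR0 AXI address is a placeholder for whatever the Address Editor assigned:

```c
// Sketch: NVMe controller register access through BAR0 (offsets from the
// NVMe spec, section 2).
#include <stdint.h>

#define NVME_BAR0     0xB0000000UL  // placeholder: AXI address mapped to BAR0

#define NVME_REG_CAP  0x00          // Controller Capabilities (64-bit)
#define NVME_REG_CC   0x14          // Controller Configuration
#define NVME_REG_CSTS 0x1C          // Controller Status
#define NVME_REG_AQA  0x24          // Admin Queue Attributes (sizes)
#define NVME_REG_ASQ  0x28          // Admin SQ base address (64-bit)
#define NVME_REG_ACQ  0x30          // Admin CQ base address (64-bit)
#define NVME_DB_BASE  0x1000        // doorbell registers start here

static inline uint32_t nvme_read32(uint32_t off)
{
    return *(volatile uint32_t *)(NVME_BAR0 + off);
}

static inline void nvme_write32(uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)(NVME_BAR0 + off) = val;
}
```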

One of the first things the host has to do is allocate some memory for the Admin Submission Queue and Admin Completion Queue. A Submission Queue (SQ) is a circular buffer of commands submitted to the drive by the host. It's written by the host and read by the drive (via the bridge AXI Master Interface). A Completion Queue (CQ) is a circular buffer of notifications of completed commands from the drive. It's written by the drive (via the bridge AXI Master Interface) and read by the host. 
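The entry sizes are fixed by the spec (64B submission entries, 16B completion entries), so the queues are just aligned arrays placed somewhere the bridge's AXI Master can reach. A sketch, with a depth and alignment I picked myself:

```c
// Sketch: admin queue memory. Entry sizes come from the NVMe spec; the depth
// is my own choice.
#define ADMIN_QUEUE_DEPTH 16

typedef struct {
    uint32_t dword[16];   // 64B Submission Queue Entry
} nvme_sqe_t;

typedef struct {
    uint32_t dword[4];    // 16B Completion Queue Entry
} nvme_cqe_t;

// Must live in memory the drive can DMA to/from through the bridge (DDR4 here),
// and should be zeroed before the drive is enabled.
static nvme_sqe_t asq[ADMIN_QUEUE_DEPTH] __attribute__((aligned(4096)));
static nvme_cqe_t acq[ADMIN_QUEUE_DEPTH] __attribute__((aligned(4096)));
```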

The Admin SQ/CQ are used to submit and complete commands relating to drive identification, setup, and control. They can be located anywhere in system memory, as long as the bridge has access to them, but in the diagram above they're shown in the external DDR4. The host software notifies the drive of their address and size by setting the relevant Controller Registers (via the bridge AXI Slave Interface). After that, the host can start to submit and complete admin commands:
  1. The host software writes one or more commands to the Admin SQ.
  2. The host software notifies the drive of the new command(s) by updating the Admin SQ doorbell in the Controller Registers through the bridge AXI Slave Interface.
  3. The drive reads the command(s) from the Admin SQ through the bridge AXI Master Interface.
  4. The drive completes the command(s) and writes an entry to the Admin CQ for each, through the bridge AXI Master Interface. Optionally, an interrupt is triggered.
  5. The host reads the completion(s) and updates the Admin CQ doorbell in the Controller Registers, through the AXI Slave Interface, to tell the drive where to place the next completion.
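In code, steps 1, 2, and 5 boil down to writing an entry into the queue array and then writing the new tail (or head) index to the corresponding doorbell. A sketch using the register helpers and admin queue arrays from above, assuming a doorbell stride of 0 (4-byte doorbells starting at offset 0x1000):

```c
// Sketch: admin submission and completion-acknowledge for queue ID 0.
#define NVME_SQ0_TAIL_DB (NVME_DB_BASE + 0x0)  // admin SQ tail doorbell
#define NVME_CQ0_HEAD_DB (NVME_DB_BASE + 0x4)  // admin CQ head doorbell

static uint16_t asq_tail = 0;
static uint16_t acq_head = 0;

void nvme_admin_submit(const nvme_sqe_t *cmd)
{
    asq[asq_tail] = *cmd;                          // step 1: write the SQE
    asq_tail = (asq_tail + 1) % ADMIN_QUEUE_DEPTH;
    // (clean/flush the cache line here if the queue memory is cacheable)
    nvme_write32(NVME_SQ0_TAIL_DB, asq_tail);      // step 2: ring the doorbell
}

void nvme_admin_ack_completion(void)
{
    // ...called after the phase bit shows a new entry at acq[acq_head]...
    acq_head = (acq_head + 1) % ADMIN_QUEUE_DEPTH;
    nvme_write32(NVME_CQ0_HEAD_DB, acq_head);      // step 5: advance the CQ head
}
```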
In some cases, an admin command may request identification or capability data from the drive. If the data is too large to fit in the Admin CQ entry, the command will also specify an address to which to write the requested data. For example, during initialization, the host software requests the Controller Identification and Namespace Identification structures, described in the NVMe Specification, Section 5.15.2. These contain information about the capabilities, size, and low-level format (below the level of file systems or even partitions) of the drive. The space for these IDs must also be allocated in system memory before they're requested.

Within the IDs is information that indicates the Logical Block (LB) size, which is the minimum addressable memory unit in the non-volatile memory. 512B is typical, although some drives can also be formatted for 4KiB LBs. Many other variables are given in units of LBs, so it's important for the host to grab this value. There's also a maximum and minimum page size, defined in the Controller Registers themselves, which applies to system memory. It's up to the host software to configure the actual system memory page size in the Controller Registers, but it has to be between these two values. 4KiB is both the absolute minimum and the typical value. It's still possible to address system memory in smaller increments (down to 32-bit alignment); this value just affects how much can be read/written per page entry in an I/O command or PRP List (see below).
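For example, checking that the usual 4KiB page size is actually allowed just means pulling the MPSMIN/MPSMAX fields out of the CAP register before writing CC.MPS. Bit positions here are from my reading of the spec (section 3.1.1), so verify against the document:

```c
// Sketch: confirm that 4KiB memory pages (CC.MPS = 0) are supported.
int nvme_4k_pages_ok(void)
{
    uint64_t cap = ((uint64_t)nvme_read32(NVME_REG_CAP + 4) << 32)
                 | nvme_read32(NVME_REG_CAP);
    uint32_t mpsmin = (uint32_t)((cap >> 48) & 0xF);  // min page = 2^(12 + MPSMIN)
    uint32_t mpsmax = (uint32_t)((cap >> 52) & 0xF);  // max page = 2^(12 + MPSMAX)
    (void)mpsmax;                                     // 4KiB is already the absolute floor
    return (mpsmin == 0);
}
```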

Once all identification and configuration tasks are complete, the host software can then set up one or more I/O queue pairs. In my case, I just want one I/O SQ and one I/O CQ. These are allocated in system memory, then the drive is notified of their address and size via admin commands. The I/O CQ must be created first, since the I/O SQ creation references it. Once created, the host can start to submit and complete I/O commands, using a similar process as for admin commands.
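The two admin commands look roughly like this, with the CQ created first so the SQ can reference it. Opcodes and field positions follow Section 5 of the spec; the command ID counter, queue depth, and function name are my own, and error/completion checking is omitted:

```c
// Sketch: create one I/O CQ (opcode 0x05) and one I/O SQ (opcode 0x01),
// both physically contiguous (PC = 1), interrupts disabled.
#include <string.h>

#define IO_QID 1
static uint16_t next_admin_cid = 0;

void nvme_create_io_queues(void *iosq_mem, void *iocq_mem, uint16_t depth)
{
    nvme_sqe_t cmd;

    // Create I/O Completion Queue.
    memset(&cmd, 0, sizeof(cmd));
    cmd.dword[0]  = 0x05 | ((uint32_t)next_admin_cid++ << 16);  // opcode, CID
    cmd.dword[6]  = (uint32_t)(uintptr_t)iocq_mem;              // PRP1 (queues sit below 4GB here)
    cmd.dword[10] = ((uint32_t)(depth - 1) << 16) | IO_QID;     // QSIZE (0-based), QID
    cmd.dword[11] = 0x1;                                        // PC = 1, no interrupts
    nvme_admin_submit(&cmd);
    // ...poll the admin CQ for this completion before continuing...

    // Create I/O Submission Queue, bound to the CQ above.
    memset(&cmd, 0, sizeof(cmd));
    cmd.dword[0]  = 0x01 | ((uint32_t)next_admin_cid++ << 16);
    cmd.dword[6]  = (uint32_t)(uintptr_t)iosq_mem;
    cmd.dword[10] = ((uint32_t)(depth - 1) << 16) | IO_QID;
    cmd.dword[11] = ((uint32_t)IO_QID << 16) | 0x1;             // CQID, PC = 1
    nvme_admin_submit(&cmd);
}
```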

I/O commands perform general purpose writes (from system memory to non-volatile memory) or reads (from non-volatile memory to system memory) over the bridge's AXI Master Interface. If the data to be transferred spans more than two memory pages (typically 4KiB each), then a Physical Region Page (PRP) List is created along with the command. For example, a write of 24 512B LBs starting in the middle of a 4KiB page might reference the data like this:

A PRP List is required for data transfers spanning more than two memory pages.
The first PRP Address in the I/O command can have any 32-bit-aligned offset within a page, but subsequent addresses must be page-aligned. The drive knows whether to expect a PRP Address or PRP List Pointer in the second PRP field of the I/O command based on the amount of data being transferred. It will also only pull as much data as is needed from the last page on the list to reach the final LB count. There is no requirement that the pages in the PRP list be contiguous, so it can also be used as a scatter-gather with 4KiB granularity. The PRP List for a particular command must be kept in memory until it's completed, so some kind of PRP Heap is necessary if multiple commands can be in flight.
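Building the PRP entries for a command is mostly arithmetic. Here's a sketch for a contiguous buffer (bare metal with flat addressing, so system addresses are physical), where prp_list is scratch space handed out by the hypothetical PRP Heap; a list spanning more than one page would need to chain to another list page, which I'm ignoring here:

```c
// Sketch: fill the two PRP fields of an I/O command for a transfer of 'len'
// bytes starting at system memory address 'addr', assuming 4KiB pages.
#define NVME_PAGE 4096UL

void nvme_fill_prps(uint64_t addr, uint32_t len,
                    uint64_t *prp1, uint64_t *prp2, uint64_t *prp_list)
{
    uint64_t first_page_bytes = NVME_PAGE - (addr % NVME_PAGE);
    uint64_t next_page = addr - (addr % NVME_PAGE) + NVME_PAGE;

    *prp1 = addr;                                    // any 32-bit-aligned offset is fine
    if (len <= first_page_bytes) {
        *prp2 = 0;                                   // one page: PRP2 unused
    } else if (len <= first_page_bytes + NVME_PAGE) {
        *prp2 = next_page;                           // two pages: PRP2 is the second page
    } else {
        // Three or more pages: PRP2 points to a list of page-aligned entries.
        uint32_t remaining = len - (uint32_t)first_page_bytes;
        int i = 0;
        while (remaining > 0) {
            prp_list[i++] = next_page;
            next_page += NVME_PAGE;
            remaining = (remaining > NVME_PAGE) ? (remaining - (uint32_t)NVME_PAGE) : 0;
        }
        *prp2 = (uint64_t)(uintptr_t)prp_list;       // list must persist until completion
    }
}
```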

Some (most?) drives also have a Volatile Write Cache (VWC) that buffers write data. In this case, an I/O write completion may not indicate that the data has been written to non-volatile memory. An I/O flush command forces this data to be written to non-volatile memory before a completion entry is written to the I/O CQ for that flush command.

That's about it for things that are described explicitly in the specification. Everything past this point is implementation detail that is much more application-specific.

A key question the host software NVMe driver needs to answer is whether or not to wait for a particular completion before issuing another command. For admin commands that run once during initialization and are often dependent on data from previous commands, it's fine to always wait. For I/O commands, though, it really depends. I'll be using write commands as an example, since that's my primary data direction, but there's a symmetric case for reads.

If the host software issues a write command referencing a range of data in system memory and then immediately changes the data, without waiting for the write command to be completed, then the write may be corrupted. To prevent this, the software could:
  1. Wait for completion before allowing the original data to be modified. (Maybe there are other tasks that can be done in parallel.)
  2. Copy the data to an intermediate buffer and issue the write command referencing that buffer instead. The original data can then be modified without waiting for completion.
Both could have significant speed penalties. The copy option is pretty much out of the question for me. But usually I can satisfy the first constraint: If the data is from a stream that's being buffered in memory, the host software can issue NVMe write commands that consume one end of the stream while the data source is feeding in new data at the other end. With appropriate flow control, these write commands don't have to wait for completion.

My "solution" is just to push the decision up one layer: the driver never blocks on I/O commands, but it can inform the application of the I/O queue backlog as the slip between the queues, derived from sequentially-assigned command IDs. If a particular process thinks it can get away without waiting for completions, it can put more commands in flight (up to some slip threshold).
An example showing the driver ready to submit Command ID 72, with the latest completion being Command ID 67. The doorbells always point to the next free slot in the circular buffer, so the entry there has the oldest ID.
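The bookkeeping for this is tiny: command IDs are handed out sequentially at submit time and recorded from each completion entry, so the backlog is just the difference. A sketch (names are mine):

```c
// Sketch: queue slip, derived from sequentially-assigned command IDs.
static uint16_t next_io_cid   = 0;  // assigned to each submitted I/O command
static uint16_t last_done_cid = 0;  // taken from the newest completion entry

uint16_t nvme_io_slip(void)
{
    return (uint16_t)(next_io_cid - last_done_cid);  // wraps cleanly in 16 bits
}

// Application-side throttling might look like:
//   while (nvme_io_slip() >= SLIP_LIMIT) { nvme_poll_completions(16); }
```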
I'm also totally fine with polling for completions, rather than waiting for interrupts. Having a general-purpose completion polling function that takes as an argument a maximum number of completions to process in one call seems like the way to go. NVMeDirect, SPDK, and depthcharge all take this approach. (All three are good open-source references for light and fast NVMe drivers.)
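A sketch of that polling function for the I/O CQ, reusing the names from the sketches above. New completions are detected with the phase bit, which the drive toggles on every pass through the circular buffer (the CQ memory starts out zeroed, so the first pass expects phase = 1):

```c
// Sketch: process up to 'max' completions from the I/O CQ, then update the
// CQ head doorbell. Assumes doorbell stride 0, so CQ 1's head doorbell sits
// at offset 0x100C.
#define IO_QUEUE_DEPTH   64
#define NVME_CQ1_HEAD_DB (NVME_DB_BASE + 0xC)

static nvme_cqe_t iocq[IO_QUEUE_DEPTH] __attribute__((aligned(4096)));
static uint16_t iocq_head  = 0;
static uint8_t  iocq_phase = 1;

int nvme_poll_completions(int max)
{
    int n = 0;
    while (n < max) {
        // (invalidate the cache line for iocq[iocq_head] here if cacheable)
        uint32_t dw3 = iocq[iocq_head].dword[3];
        if (((dw3 >> 16) & 0x1) != iocq_phase) break;  // no new entry yet

        last_done_cid = (uint16_t)(dw3 & 0xFFFF);      // command ID of this completion
        // (check the status field in dw3[31:17] here)

        if (++iocq_head == IO_QUEUE_DEPTH) {
            iocq_head  = 0;
            iocq_phase ^= 1;                           // phase flips on wrap
        }
        n++;
    }
    if (n) nvme_write32(NVME_CQ1_HEAD_DB, iocq_head);  // hand the slots back to the drive
    return n;
}
```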

With this setup, I can run a speed test that issues read/write commands for blocks of data as fast as possible while trying to keep the I/O slip at a constant value:

Speed test for raw NVMe write/read on a 1TB Samsung 970 Evo Plus.
For smaller block transfers, the bottleneck is on my side, either in the driver itself or by hitting a limit on the throughput of bus transactions somewhere in the system. But for larger block transfers (32KiB and above) the read and write speeds split, suggesting that the drive becomes the bottleneck. And that's totally fine with me, since it's hitting 64% (write) and 80% (read) of the maximum theoretical PCIe Gen3 x4 bandwidth.

Sustained write speeds begin to drop off after about 32GiB. The Samsung Evo SSDs have a feature called TurboWrite that uses some fraction of the non-volatile memory array as fast Single-Level Cell (SLC) memory to buffer writes. Unlike the VWC, this is still non-volatile memory, but it gets transferred to more compact Multi-Level Cell (MLC) memory later since it's slower to write multi-level cells. The 1TB drive that I'm using has around 42GB of TurboWrite capacity according to this review, so a drop-off in sustained write speeds after 32GiB makes sense. Even the sustained write speed is 1.7GB/s, though, which is more than fast enough for my application.

A bigger issue with sustained writes might be getting rid of heat. This drive draws about 7W during max speed writing, which nearly doubles the total dissipated power of the whole system, probably making a fan necessary. Then again, at these write speeds a 0.2kg chunk of aluminum would only heat up about 25°C before the drive is full... In any case, the drive will also need a good conduction path to the rear enclosure, which will act as the heat sink.

FatFs

I am more than content with just dumping data to the SSD directly as described above and leaving the task of organizing it to some later, non-time-critical process. But, if I can have it arranged neatly into files on the way in, all the better. I don't have much overhead to spare for the file system operations, though. Luckily, ChaN gifted the world FatFs, an ultralight FAT file system module written in C. It's both tiny and fast, since it's designed to run on small microcontrollers. An ARM Cortex-A53 running at 1.2GHz is certainly not the target hardware for it. But, I think it's still a good fit for a very fast bare metal application.

FatFs supports exFAT, but using exFAT still requires a license from Microsoft. I think I can instead operate right on the limits of what FAT32 is capable of:
  • A maximum of 2^32 LBs. For 512B LBs, this supports up to a 2TiB drive. This is fine for now.
  • A maximum cluster size (unit of file memory allocation and read/write operations) of 128 LBs. For 512B LBs, this means 64KiB clusters. This is right at the point where I hit maximum (drive-limited) write speeds, so that's a good value to use.
  • A maximum file size of 4GiB. This is the limit of my RAM buffer size anyway. I can break up clips into as many files as I want. One file per frame would be convenient, but not efficient.
Linking FatFs to NVMe couldn't really get much simpler: FatFs's diskio.c device interface functions already request reads and writes in units of LBs, a.k.a. sectors. There's also a sync function that matches up nicely to the NVMe flush command. The only potential issue is that FatFs can ask for byte-aligned transfers, whereas NVMe only allows 32-bit alignment. My tentative understanding is that this can only happen via calls to f_read() or f_write(), so the application can guard against it.

For file system operations, FatFs reads and writes single sectors to and from a working buffer in system memory. It assumes that the read or write is complete when the disk_read() or disk_write() function returns, so the diskio.c interface layer has to wait for completion for NVMe commands issued as part of file system operations. To enforce this, but still allow high-speed sequential file writing from a data stream, I check the address of the disk_write() system memory buffer. If it's in OCM, I wait for completion. If it's in DDR4, I allow slip. For now, I wait for completion on all disk_read() calls, although a similar mechanism could work for high-speed stream reading. And of course, disk_ioctl() calls for CTRL_SYNC issue an NVMe flush command and wait for completion.

Interface between FatFs and NVMe through diskio.c, allowing stream writes from DDR4.
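In diskio.c terms, that policy looks something like the sketch below. Here nvme_io_write() stands in for my hypothetical queued-write call, the slip/polling helpers are the ones sketched earlier, and the OCM address range is the ZU+'s 256KB OCM:

```c
// Sketch: FatFs disk_write() that blocks for file system metadata (buffers in
// OCM) but streams for bulk data (buffers in DDR4). One sector = one 512B LB.
#include <stdint.h>
#include "ff.h"
#include "diskio.h"

#define OCM_BASE 0xFFFC0000UL   // ZU+ OCM, 256KB
#define OCM_SIZE 0x00040000UL

DRESULT disk_write(BYTE pdrv, const BYTE *buff, DWORD sector, UINT count)
{
    uintptr_t addr = (uintptr_t)buff;
    int from_ocm   = (addr >= OCM_BASE) && (addr < OCM_BASE + OCM_SIZE);
    (void)pdrv;

    nvme_io_write(addr, sector, count);   // hypothetical: queue the NVMe write command

    if (from_ocm) {
        // File system operation: wait until the drive has completed everything.
        while (nvme_io_slip() > 0) {
            nvme_poll_completions(16);
        }
    }
    // DDR4 stream data: return immediately and let the slip build up.
    return RES_OK;
}
```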
I also clear the queue prior to a read to avoid unnecessary read/write turnarounds in the middle of a streaming write. This logic obviously favors writes over reads. Eventually, I'd like to make a more symmetric and configurable diskio.c layer that allows fast stream reading and writing. It would be nice if the application could dynamically flag specific memory ranges as streamable for reads or writes. But for now this is good enough for some write speed testing:

Speed test for FatFs NVMe write on a 1TB Samsung 970 Evo Plus.
There's a very clear penalty for creating and closing files, since the process involves file system operations, including reads and flushes, that will have to wait for NVMe completions. But for writing sequentially to large (1GiB) files, it's still exceeding my 1GB/s requirement, even for total transfer sizes beyond the TurboWrite limit. So I think I'll give it a try, with the knowledge that I can fall back to raw writing if I really need to.

Utilization Summary

The good news is that the NVMe driver (not including the Queues, PRP Heap, and IDs) and FatFs together take up only about 27KB of system memory, so they should easily run in OCM with the rest of the application. At some point, I'll need to move the .text section to flash, but for now I can even fit that in OCM. The source is here, but be aware that it's entirely a test implementation, not at all intended to be a drop-in driver for any other application.

The bad news is that the XCZU4 is now pretty much completely full...

We're gonna need a bigger chip.
The AXI-PCIe bridge takes up 12960 LUTs, 17871 FFs, and 34 BRAMs. That's not even including the additional AXI Interconnects. The only real hope I can see for shrinking it would be to cut the AXI Slave Interface down to 32-bit, since it's only needed to access Controller Registers at BAR0. But I don't really want to go digging around in the bridge HDL if I can avoid it. I'd rather spend time optimizing my own cores, but I think no matter what I'll need more room for additional features, like decoding/HDMI preview and the subsampled 2048x1536 mode that might need double the number of Stage 1 Wavelet horizontal cores. 

So, I think now is the right time to switch over to the XCZU6, along with the v0.2 Carrier. It's pricey, but it's a big step up in capability, with twice the DDR4 and more than double the logic. And it's closer to impedance-matched to the cost of the sensor...if that's a thing. With the XCZU6, I think I'll have plenty of room to grow the design. It's also just generally easier to meet timing constraints with more room to move logic around, so the compile times will hopefully be lower.

Hopefully the next update will be with the whole 3.8Gpx/s continuous image capture pipeline working together for the first time!

Sunday, November 17, 2019

Zynq Ultrascale+ SuperSpeed RAM Dumping + v0.2 Carrier

I've gotten a lot of mileage out of my v0.1 (very first version) camera PCB. Partly that's because there's not much to it; it's mostly just power supplies, connectors, and differential pairs. But I'm still surprised I haven't broken it yet, and it's only had some minor design issues. I also made a front enclosure for it with an E-mount flange stolen from a macro extension tube (Amazon's cheapest, of course) and slots for some 1/4-20 T-nuts for tripod mounting.

Stealing an E-mount flange from a macro extension tube is maybe my favorite "Amazon's cheapest" hack so far. I'm not even sure how else to do it. Getting a custom CNC flange machined wouldn't be too bad, but what about the leaf springs?
There are some sensor alignment features, but mostly the board just bolts to the back of the front enclosure. The 1/4-20 T-nuts allow for quick and dirty tripod mounting without having to worry about aluminum threads or Heli-Coils.
No real thought was given to connector placement, user interface, battery wiring/charging, cooling, or anything else other than having something to constrain the sensor and lens the right distance from each other and deal with the massive pixel throughput. Still, it's been useful and reliable. At this point, I've tested most of the important hardware and am just about ready to make some functional improvements for v0.2.

One important subsystem I hadn't tested yet, though, is the USB interface. It's not part of the capture pipeline, but it's important that it operate at USB 3.x speeds for reading image data off the SSD later. The Zynq Ultrascale+ has a built-in USB 3.0 PHY using PS-GTR transceivers at 5Gb/s. This isn't quite fast enough for 5:1 compressed image data at full frame rate, but it's more than fast enough for 30fps playback, or direct access for conversion and editing.

At the moment, though, I'm mainly interested in USB 3.0 for reducing the amount of time it takes to get test image sequences out of the PS-side DDR4 RAM. I've so far been using XSCT commands to read blocks of RAM into a file (mrd -bin -file) over JTAG, but this is limited by the 30MHz JTAG interface. That's a theoretical maximum, too. In practice, it takes several minutes to read out even a short image sequence, and up to an hour to dump the entire contents of the RAM (2GiB). This is all for mere seconds of video...

SuperSpeed RAM Dumping

To remedy this, I repurposed the standalone ZU+ USB mass storage class example to map most of the RAM as a virtual disk, then used a raw disk image reader (Win32 Disk Imager) to read it. This is pretty much what the example does anyway, so my modifications were very minor. So far, I've been able to run my application in On-Chip Memory (OCM), leaving the external DDR4 free for image capture. So, I have to explicitly place the virtual disk in DDR4 in the linker script:

In the application, the virtual disk array also needs to be correctly sized and assigned to the dedicated memory section using an __attribute__:
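The declaration looks something like this sketch (the array and section names are placeholders, not the exact ones from the Xilinx example); the linker script then assigns that section to the DDR4 region starting at 0x20000000:

```c
// Sketch: a 1.25GiB virtual disk array forced into a dedicated section that
// the linker script places in external DDR4 at 0x20000000.
#include <stdint.h>

#define VDISK_SIZE (1280UL * 1024UL * 1024UL)   // 1.25GiB

uint8_t virtual_disk[VDISK_SIZE]
    __attribute__((section(".vdisk"), aligned(64)));
```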
With that small modification, the application (including the mass storage device driver) runs in OCM RAM, but references a virtual disk array based in external DDR4 at 0x20000000, which is where the image capture data starts. As with the original example, when plugged in to a host, the device shows up as a blank drive of the defined size. Windows asks to format it, but for now I just click Cancel and use Win32 Disk Imager to read the entire 1.25GiB. This copies the raw contents of the "disk" into a binary file, a process I'm all too familiar with from having to recover files from SD cards with corrupted file systems.

But at first I wasn't getting a SuperSpeed (5Gb/s) connection; it was falling back to High-Speed (480Mb/s) through the external USB3320 PHY. (An external USB 2.0 PHY is required on the ZU+, even for SuperSpeed operation.) To further troubleshoot, I took a look at the DCFG and DSTS registers in the USB module. DCFG indicated a Device Speed of 3'b100 (SuperSpeed), but DSTS indicated a Connection Speed of 3'b000 (High-Speed). I figured this meant the PS-GTR link to the host was failing, and after some more poking around I found that its reference clock source was set to incorrect pins and frequency. In my case, I'm feeding it with a 100MHz reference clock on input 2, so I changed it accordingly:


After that, I was able to get a SuperSpeed connection. As a formatted disk drive, I get sequential read speeds of around 300MB/s. Through Win32 Disk Imager, I can read the entire 1.25GiB virtual disk in about seven seconds. So much better! To celebrate, I set off some steel wool fireworks with Bill Kerman. (Steel wool, especially the ultrafine variety, burns quite spectacularly.)


Since I've been putting off the task of NVMe writing, these are still just image sequences that can fit in the RAM buffer. In this case they're actually compressed about 11:1, well beyond my SSD writing requirement, mostly due to the relatively dark and low-contrast scene. The same quantizer settings in a brighter scene with more detail would yield a lower compression ratio. I did separate out the quantizer values for each subband, so I can experiment more with the quality/data rate trade-off.

The most noticeable defects aren't from wavelet compression, they're just the regular sensor defects. There's definitely some "black sun" artifact in the brightest sparks. There's also a rolling row offset that makes the dark background appear to flicker. I did switch to a different power supply for this test, which could be contributing more electrical noise. In any case, I definitely need to implement row noise correction. The combination of all-intraframe compression and a global shutter does make it pretty good for observing the sometimes crazy behavior of individual sparks, though:

This one was gently falling and then just decided to explode into a dozen pieces, shooting off at 20-30mph.
My favorite, though, is this spark that gets flung off like a pair of binary stars. After a while, they decide to part ways and one goes flying up behind Bill's helmet. The comet-like tails are a motion artifact of the multi-slope exposure mode.
Another thing I learned from this is that I probably need an IR-cut filter. I neglected to record some normal footage of the steel wool burning, but it's nowhere near as bright as it looks here. Much of that is just how human visual perception works. I tried to mitigate it somewhat by using the CMV12000's multi-slope exposure mode to rein in the highlights. But I think there's also some near-infrared adding to the brightness here. I'll have to get an external IR-cut filter to test this theory.

Although the image sequence transfer is 100x faster now, it still takes time to adjust settings and trigger the capture over JTAG. I would very much like to do everything over USB in the near future, at least until I have some form of UI on the camera itself. But I also don't really want to write a custom driver. I might try to abuse the mass storage device driver, since it's already working, by adding in some custom SCSI codes for control. This is also the device class I intend to use eventually as the host interface to the SSD, so I should get to know it well.

v0.2 Carrier

Controlling the camera over USB is not the most user-friendly way of doing things, as I know from wrangling drivers and APIs for previous camera projects. I could maybe see an exception where a Pixel 2 (modern Pixels don't have USB 3.0 anymore, because smartphone progress makes no fucking sense) hosts the camera, presenting a nice preview image and dedicated touch interface. But that's a large chunk of Android development that I don't want or know how to do.

Instead, I think it makes sense to stick to something extremely simple for now: an HDMI output and some buttons. I would love to have a touchscreen LCD, but they're huge time, money, power, and reliability sinks. They're also never bright enough, or if they are they kill the power and thermal budget. Better to just move the problem off-board, where it can be solved more flexibly depending on the scenario. At least that's what I'll tell myself.

It seems like there are two main ways to do HDMI out from a Zynq SoC. The more modern Zynq Ultrascale+ development boards, like the ZCU106, use a PL-side GTH transceiver to directly drive a TMDS retimer. This supports HDMI 2.0 (4K), but would rule out the cheaper TE0803 XCZU4 board, since its four PL-side transceivers are already in use for the SSD. The second method uses a dedicated HDMI transmitter like the Analog Devices ADV7513 as an external PHY, which interfaces to the Zynq over a wide logic-level pixel bus. Even though it only goes up to 1080p, this sounds more like what I want for now. I just need a reasonable preview image.

HDMI output subsystem based on the ADV7513.
I had left a bunch of unused pins in the top right corner expecting to need a wide logic-level pixel bus, either for an LCD or an HDMI transmitter. The tricky part was finding room for the connector and IC. I decided to ditch the microSD card holder, which had a bad footprint anyway, to make the space. Without growing the board, I can fit a full-size (Type A) HDMI connector on the top side and the ADV7513 plus supporting components on the bottom. The TMDS lines do have to change layers once, but they're short and length-matched so I think they'll be okay.

At the same time, I also rerouted a good portion of the right edge of the board. The port I've been using for UART terminal access is gone, replaced by a more general-purpose optically-isolated I/O connector. This can still be used for terminal access, or as a trigger/sync signal. I also added a barrel jack connector for power/charge input. Finally, a 0.1" header off the back of the board has the battery power input and some unprotected I/O for two buttons, a rotary encoder, and a red "recording" LED on a separate board. This UI board would be mounted to the top face, right-hand side, where such things would typically be on a camera.

New right-edge connector layout and top face UI board header.
I consider this to be the bare minimum design for standalone functionality. It will need a simple menu and status overlay on the HDMI output. I'm also skipping any BMS or charge circuitry for now, so the battery must be self-contained (like this 3-cell pack) and charged by a CC/CV adapter. It's well within the power range of USB C charging, so that could be an option in the future, but I don't think it's important enough for this revision.

One of the reasons I don't mind doing more small iterations rather than trying to cram features into one big revision is that I have been able to get these boards relatively fast and cheap from JLCPCB. Originally, I chose their service because they were the first and only place I found with a standard impedance-controlled stack-up, complete with an online calculator. But it's also just the most economical way to get a six-layer impedance controlled board in under two weeks. Each one is around $30. Even including all the power supplies and interfaces, the board is really a minor cost compared to the sensor and SoM it carries.

Other than that, there was only one minor fix that needed to be made regarding the SSD's PCIe reference clock. I had mistakenly assumed this could be generated or forwarded by the ZU+ out of one of its GT clock pairs. But this doesn't seem to be standard practice. Instead, the external clock generator distributes matching reference clocks to both the ZU+ GT clock input and the SSD. I hacked this on to v0.1 with some twisted pair blue wire surgery, but it was easy to reroute for v0.2. Aside from this, I didn't touch any of the differential pairs, or really any other part of the board. Well, I did add one more small component...but that'll be for much later.

These boards should arrive in time for a Thanksgiving weekend soldering session. I plan to build up two this time: one monochrome and, if all goes well, finally, one color. Before then, I'd like to have at least some plan for the NVMe write...

Monday, November 11, 2019

TinyCross: 4WD and Servoless RC Mode

I finished building up the second dual motor drive for TinyCross, which means that the electronics and wiring have finally caught up to the mechanical build and both are 100% complete!


That's not to say that the project is 100% complete; there's still some testing to be done to bring it all the way up to full power, as well as some weight reduction and weatherproofing tasks. But there are no more parts sitting in bins waiting for installation. It should be at peak mass, too, which is good because it's 86lb (39kg) without batteries. The original target was 75lb (34kg) without batteries, but I will settle for 80lbs (36kg) if I can get there.

The second TxDrive went together with no issues, and the software is identical to the front wheel drive. I have both set at 80A right now, which gives a total force at the ground of 112lbf (51kgf). That's about the same peak force as the rebuilt "black" version of tinyKart, which was maybe too much for that frame. But TinyCross is about 20% heavier (with the driver weight included) and 4WD, so it should be able to handle some more. I haven't seen any thermal issues at 4x80A - if anything, the motors run cooler now that all four are sharing the acceleration. Over the next few test drives, I'll work my way up toward the 120A target.

But before that, there's something I've been wanting to try. I have an abundance of actuators and not that many degrees of freedom. I decided to borrow an idea from Twitch X to cash in some of this actuator surplus for one more degree of control, specifically automatic servoless steering. So, a free 1/3-scale RC car mode without adding any parts.

Well okay, I do have to add the receiver.
The steering wheel board reads the throttle and steering PWM signals from a normal RC receiver. The throttle PWM gets directly mapped to a torque command for all four motors. The steering PWM sets a target angle for the steering wheel. The measured angle comes from an IMU, the secret part on the steering wheel board. (Yes, there are all sorts of issues with that...I honestly just don't want to run any more wires.) The angle error drives a feedback controller that outputs differential torque commands to the front motors. Not much to it, really.
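A stripped-down sketch of that loop is below. The gains, names, and sign conventions are made up for illustration, and the real thing also has to sit behind the failsafes described next:

```c
// Sketch: RC steering PWM -> target angle -> differential torque on the
// front motors, with the throttle PWM mapped to a common torque command.
float steer_kp = 2.0f;   // placeholder gain, Nm per radian of angle error

void rc_mode_update(float throttle_cmd, float target_angle, float measured_angle,
                    float torque_cmd[4])   // [FL, FR, RL, RR]
{
    float diff = steer_kp * (target_angle - measured_angle);

    torque_cmd[0] = throttle_cmd - diff;   // front left
    torque_cmd[1] = throttle_cmd + diff;   // front right
    torque_cmd[2] = throttle_cmd;          // rear left: throttle only
    torque_cmd[3] = throttle_cmd;          // rear right
}
```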
I've also seen so many runaway robots and go-karts in my life that I consider it a must to have working failsafes for both radio loss of signal and receiver (PWM) disconnect. It's extra work but trust me, it's worth it! Anyway, time for a test drive:


I wasn't sure how tightly I could tune the steering control loop, since there's a long chain of mechanical mush between the torque output at the motors and the sensor input at the steering wheel. But it works just fine. After a minute I forgot it wasn't really an RC car and tried some curb jumping. Just like Twitch X, the wheels do need traction to be able to control the steering angle. But then again, that is a necessary condition for steering anyway.

I don't actually think there's much point in a go-kart-sized RC car. But it's a short jump from that to an autonomous platform. It might also be useful to adjust the "feel" of the steering during normal driving. Mostly, I just like to abide by the Twitch X philosophy of using your existing actuators to do as much as possible.