Friday, February 27, 2026

TinyCross: Rear Motor Update

In its original design, TinyCross used four identical Alien Power System 6374 170rpm/V brushless motors. But with the weight distribution being heavily biased to the rear, it really makes more sense to use larger motors there. I had seen some examples of electric power steering motors that might be suitable, so I did some eBay exploration to see what was available. I settled on a Hyundai part (56300F2200) that looked common enough that I could always find spares:

Hyundai EPS Motor (left), Trampa 6376 (middle), APS 6374 (right)

I was surprised to find that the torque constant (and thus rpm/V) of the Hyundai EPS motor was actually very similar to the APS 6374 170KV, meaning it could just drop into the design with the same gear ratio. The main difference is that the EPS motor is a 6-pole inrunner instead of a 14-pole outrunner. This makes it easier to get heat off of the stator winding. The resistance is also much lower, and the wire gauge going in suggests it can probably handle even more than the 100A peak I've been running in the rear. But the extra performance overhead comes at a price...it's big!

Although this motor is relatively small in the context of automotive electric power steering, it's still a lot larger than the APS 6374. It weighs almost 2kg with the wire harness still attached, compared to 0.68kg for the APS. It also has a diameter of 88mm and a bolt circle of 109mm, so it would be a tight fit into a drive quadrant designed for a 63mm outrunner with a face mount. The hardest part, though, would be adapting the coupler to TinyCross's drive pulley without increasing the width of the existing design. The original shaft adapter for the APS 6374 used a clamp and key to secure the drive pulley to the 10mm motor shaft:

But the Hyundai EPS motor has a large plastic coupling-half pressed onto a short spline shaft, intended to interface to a mating part with a flexible rubber coupler. So, unlike the old design, the drive shaft can't use the motor's bearing as one of its supports. I would need to add both another bearing and the mating coupling-half to the pulley shaft, without changing the belt location. 

I played with this in CAD for a while and determined that it would probably require moving the motor mounting plane inboard to make more room for the new drive shaft. That was more work than I wanted to do at the time, so I put the project aside for a few months. But luckily when I came back to it, I drew up this evil thing instead:

This is part of a long tradition of unholy drive shaft couplings going all the way back to Cap Kart for which I couldn't really tell you what material path is actually transmitting the torque. Is it the friction from the 6mm bolt clamping three pieces together? Or is it the shear pins? Or do the pins just keep the 6mm bolt from unthreading? Is some of the torque going through the bolt itself?? I could probably dig through an old textbook or simulate it to figure it out, but I like the mystery. Most importantly, it allows for an additional R8 bearing to fit between the pulley and the coupler, just barely, without shifting the motor mounting plane or the belt position:

There is 1mm of clearance between the pulley flange and the new bearing, and 0.33mm of clearance between the bearing and the coupler. It's an absurd packaging and I love it. It's also reasonably easy to machine. The coupler looks like it has some complex angles, but it's really just 4.5mm-width straight cuts along 45º radials:

That makes it possible to do on a manual mill, at least with some amount of janky 45º fixturing:

And with that, the normal rubber couplings fit right in:

I did have to drill them out to clear the head of the 6mm bolt, but since the torque is transmitted between the teeth (this much I know, at least!), that's no problem. Everything fit together as expected, which is to say with almost no clearance, to make a nice adapter module that occupies in the same space as the old drive shaft support:

That still leaves the motor mounting, though, and there was no way to easily adapt the existing plate to fit the new motor. It had to be re-cut entirely:

Some of the existing standoff holes also encroach on the motor's inner flange circle, so unfortunately this adaptation doesn't work with the bone-stock Hyundai EPS motor. I had to mill down the flange where it would interfere. Think of it as weight reduction. Speaking of which, the automotive wire harness can also go away:

Don't need any of that...

And after a changing a few tight-fit clearance holes to loose-fit to soak up some measurement errors on the motor's mounting pattern, everything fits together well enough:

Even though it's substantially larger than the APS 6374, it doesn't look overly bulky or out-of-place in the design. I actually really like the look of the motor's cast aluminum housing to go with the rest of the aluminum chassis. The total weight for the kart without batteries is just a bit over 40kg now. I don't think I can offset the extra weight of these motor completely, but I may at least try to get it back under 40kg at some point.

The Hyundai EPS motors do include some kind of position encoder, maybe a resolver? But I decided to just use the existing outboard sense magnet and rotary angle sensor. Despite the rubber coupling, I think that will be sufficiently accurate with only three pole pairs. (The electrical angle error is only 3x the mechanical angle error instead of 7x.) So I reprogrammed those sensors for six poles and...oh, that's right, one of the rear quadrants suffered a bit of a meltdown last year:

I'm not sure the exact sequence of events, but I think the sensor board got wet and stopped working, then the desynchronized motor windings got too hot and failed short, then one of the low-side FETs died, taking out its gate driver with it. Meanwhile, the other three motors were pulling me along well enough that I didn't notice the problem until it became an odor. Interestingly, it seems to be the FET gate that died; the D-S path was not shorted. I'm not going to think about it too hard right now. I just replaced the FET and gate driver and everything was happy again.

We've had some massive snowstorms this year that cumulatively add up to more than the height of the kart, but the latest one came with a bonus round of an extra inch after the plows had already done the parking lots. This is the perfect amount for some quick snow karting (another tradition):

The sensor board conformal coating is holding up, somehow.

At 80-100A, the Hyundai EPS motors feel pretty much identical to the APS 6374 170KV. They can easily generate and control oversteer on the snow. And after half a battery pack of messing around, they were staying nice and cool. So I think they do have some overhead now, if I want to push even more current in dry conditions. The main problem will be the suspension:

The A-arm flex is not good, even at 100A. At first glance, it seems like most of the displacement is originating from the two quarter-inch aluminum plates that hold the rear ball joints. They're only flexing a little, but that angular offset creates a lot of displacement at the plane of the wheel. Luckily, it's easy to reinforce these - just add even more 80/20:

I'm sure there's an even lighter way to do this, but it actually looks like it belongs. It also gives some new mounting points for accessories, cameras, etc. I will add one to the front as well. Only the top is needed there; the steering linkage bearing plate handles the bottom. Of course, these are even more weight that I now need to offset to get it back under 40kg.

But for now, I think this upgrade is complete. I still need to test the motors at speed when (if) the snow ever melts. They do have higher inductance than the APS 6374 170KV, which might limit their maximum power. They're also designed for 12V, not 48V, so they might lose efficiency at higher speeds due to speed-squared losses. I'll have to do some data logging to find out.

The next upgrade is also in the works...here's a teaser:

Thursday, February 20, 2025

GaNDr: Power Loop Measurement and Optimization

In a half-bridge configuration, the power loop is the path from the DC bus capacitor positive terminal to the output switch node and back to the capacitor negative terminal. This is the path of high dI/dt, passing through both FETs in the half-bridge. It's important to keep this loop as small as possible, to reduce the amount of parasitic inductance interacting with the high dI/dt.

This applies to any half-bridge, but it's especially important for GaNFETs, where the switching event can be on the order of nanoseconds. They're capable of switching this fast due to their very low gate charge and input capacitance, but they still have a significant output capacitance, COSS, that needs to be hard-switched. During a switching event, this forms an underdamped LC oscillator with the power loop inductance and leads to overshoot and ringing.

It's a little hard to predict the loop inductance, although admittedly I didn't even really try. There are ways to ballpark it based on the board geometry, or use a 2D field solver like FEMM to get an estimate. A 3D field solver could probably get pretty close. But nothing beats a direct measurement on the real PCB. So I just assembled one phase of the GaNDr Rev1 PCB for a test:

Nevermind the black wire hack...

Since I'm interested in stuff on the nanosecond time scale, I wanted to use my 5GHz-bandwidth sampling scope for this measurement. But the inputs only go up to ±1V, so directly probing a 50V node with unknown overshoot would be bad. Luckily, the cheapest way to make a high-bandwidth probe is just to make a voltage divider with a physically small resistor and a 50Ω coaxial cable.

Quick and dirty ≈67:1 high-bandwidth probe.

It also matters where the probe is attached. I soldered both the resistor and the probe ground (just the coax shield) to pads near the low-side FET. This excludes all the inductance except what's physically inside the FET. The measured voltage is, as close as possible, VDS of the low-side FET. And here is the measured switching transient:

It is...not great. The overshoot with a 32V input and 1Ω gate resistors is nearly 100%. That does not bode well for 50V operation with 100V parts, and can't be good for efficiency or EMI. It's possible to slow down the rise time by increasing the gate resistance, thus reducing the amount of energy in the transient. But that should be done last, after exhausting all other reasonable methods for reducing the power loop inductance.

Since I now have the ringing frequency, it's possible to calculate the power loop inductance using the COSS datasheet value of 1nF (per FET):

L = 1 C OSS ( 2 π f ) 2 = 1 2.0 nF ( 2 π 160 MHz ) 2 = 0.50 nH

This really isn't much inductance, but it's apparently enough to store sufficient energy during the transient to make a high-Q oscillator with COSS.

Some of the loop inductance is from the DC bus capacitors themselves. I probably should have checked this earlier, since this is usually well-characterized by the manufacturer. The Samsung CL32E475KCIVPNE capacitors used here have the following AC characteristic curves:

At frequencies above about 2MHz, they're really more inductors than they are capacitors: the impedance vs. frequency has a slope of +1 on a log-log plot. The effective inductance at 160MHz is around 0.30nH. But, there are six in parallel. Ideally, this would divide down to 0.050nH, but they may not share the load equally based on their placement. I think they contribute a significant amount of inductance, but not the majority.

Still, it would be nice to reduce the capacitor parasitic inductance if possible. Since the switching transient only involves COSS, which is only on the order of 1nF, maybe smaller and faster capacitors would be better? For example, the 10nF Samsung CL05B103KC5VPNC in 0402 stays mostly capacitive all the way up to 100MHz:

However, being physically smaller also increases the impedance overall, such that the effective inductance at 160MHz is still around 0.30nH. The advantage is that I could fit many more in parallel, to divide down the total inductance more. In reality, the big 1210 capacitors are still very much necessary for filtering the 125kHz PWM frequency, so the solution is probably to try to fit some of the 10nF 0402s on in parallel with the existing six 1210s.

An easy place to put them is where the VDC bus bar solder areas used to be, between the 1210 caps and the high-side FETs:

For each high-side FET, three can be arranged so that their positive terminals are directly adjacent to the drain pads. The negative terminal can be sent to the power ground plane on the first inner layer with a via-in-pad. This also has the advantage of shortening the power loop considerably, at least for the high-frequency switching transient. The only real disadvantage is losing the option for a VDC bus bar, which realistically shouldn't be necessary for the average power levels here.

A riskier move would be to drop in four more 10nF 0402s between the FETs:

The drain pads of the high-side FET (VDC) align with the source pads of the low-side FET (GND), so it's very tempting to bridge the gap with an 0402 capacitor. I even tried this on the physical board, by scratching off some solder mask:

Even with the ugly attachment, this did show a significant improvement, with the overshoot dropping from 94% to 65% and the frequency increasing from 160MHz to 210MHz:

This does seem like it would provide the shortest possible power loop, but an important difference between the soldered-on test and the actual layout is that there is still a mostly solid switch node (output) plane beneath the capacitors as-tested. It's like having a seventh PCB layer. In actuality, that plane gets cut if the capacitors are placed there, changing the power loop from a thin vertical sandwich to multiple horizontal loops, which might very well have higher inductance.

This configuration, with capacitors between the FETs, is actually discussed in the EPC layout guidelines. In order to get back to a thin vertical sandwich power loop, the switch node plane, rather than GND, should be on the first inner layer. But if I fully swap SW and GND layers, the loop sandwich to the outer capacitors becomes thicker, potentially cutting off those capacitors with much higher outer loop inductance.

I decided to try just inserting a small SW plane on the first inner layer, and only between the FETs. The rest of the first and second inner layers remains as GND to hopefully preserve some of the outer capacitor loop performance. And since I'm indecisive, I only did this on one half of the board for the next revision. That'll let me A/B test it against the more normal layout. I expect both to have much more interesting transients with composites of multiple frequencies, so it might be trickier to analyze. We'll see when the Rev 2 boards arrive.

Saturday, January 25, 2025

GaNDr: Motor Drive in the GaNFET Era

It's been a while since I've attempted a new motor drive design, the last one being TinyCross's dual-motor 50V/100A drive about six years ago. I still really like that design, and those drives have worked well for TinyCross so far. But one of my favorite pastimes is looking for new components that might change how I would build something. And in the last six years, there's been an interesting development that I'm curious to explore: the GaNFET is now mainstream. 

While they've been commercially available for a while, they're now both technically and economically viable as an alternative to silicon MOSFETs in certain power ranges, and have been adopted in many new consumer electronic devices. They are also becoming available in more conventional packages, though GaNFET purists would probably still use the bare-die versions for lowest parasitic inductance. Supporting components (mainly gate drivers), documentation, and device models are also now mature and widely available.

Of interest for this project are the EPC2302 and the brand new EPC2361 in 3x5 QFN. These are packaged GaNFETs, vs. the bare-die BGA and LGA options that have been around for a while. Ideally, encapsulating them in a thin plastic package improves robustness without sacrificing too much performance. The package also still exposes the die on the top side for direct cooling.

GaNFETs excel in the Figure-of-Merit (FoM) of on-resistance multiplied by gate charge, RDS(on)·QG, which captures both the conducting and switching losses. Here they are on a 2D plot with some silicon MOSFETs for comparison, with curves of constant FoM shown (lower is better):

The current TinyCross FETs, FDMT80080DC, can't really compete anymore. (I still love them for their pulsed current rating of 1453A, though.) The modern field of 60-80V silicon MOSFETs has significantly better FoM and also tends to publish specifications for lower gate drive voltages besides the standard 10V. The NVMFS5C604NL operated at VGS = 4.5V is especially remarkable. But the EPC GaNFETs still easily win on FoM, and they are 100V parts with half the surface area.

The extremely low QG means GaNFETs can be switched much faster than silicon MOSFETs. An EPC2302 half-bridge with a suitable layout and gate driver should have no problem operating at 100kHz PWM, or even higher. This reduces the need for large electrolytic capacitors on the DC bus, since the ripple current will be much lower. This is probably the largest contribution to space savings, even though the FETs themselves are also physically smaller.

Although I'm not space-constrained on TinyCross, shrinking a power stage is always a fun project. I arbitrarily set a limit of the size of a deck of cards, with the same specification as the previous design (dual motor, 50V and 100A peak). Here's what the resulting layout looks like:

It's considerably smaller than the current TinyCross drive, even with the logic and power consolidated onto one board. There are half as many electrolytic capacitors, with more ceramic capacitors filling in at high frequencies. Optional 2mm bus bars help take some of the load off the PCB copper. The board can be mounted to a heat sink base with six isolated M2 screws located in the middle of each FET group for even mounting pressure. And the wires all exit horizontally, probably through grommets in the seam between the aluminum base and a 3D-printed cover.

This is also a six-layer PCB, which further helps with layout. Six layer boards are now pretty fast and cheap thanks with JLCPCB, so there's not much reason to stick to four layers even for hobby motor drives anymore. The power stage layout follows EPC's recommended structure pretty closely, with the return path on the two inner layers closest to the FETs to minimize the dI/dt loop area:

I did not follow the gate drive recommendations as strictly, since I need a lot of space for the gate drivers themselves. I am using isolated half-bridge drivers, and remarkably there are at least two pin-compatible parts from different manufacturers that could work. The Skyworks Si82E39x looks like the most promising option, with a 4V UVLO option and sub-10ns rise/fall times. It's not even that expensive, at around $4ea. The ADuM4221 from Analog Devices could also work, though. Although I have the driver on the opposite side of the board, I did try to keep the gate drive traces as short as possible:

The gate drive return paths from the switch node (high-side) and negative plane (low-side) are also pretty direct. Will that be good enough? I don't really know - I don't have a 3D field solver. I'll have to wait for the boards. Ideally, with such low gate charge and fast drivers, it should be possible to run PWM frequencies above 100kHz and take the dead time down to around 50ns (less than 1%). Here's what the switching waveforms look like in an LTSpice simulation with the EPC device models:

However, this is just a guess for the loop inductance, so I won't really know how fast it can go until I get it on a scope. But what is the huge current spike on both FETs (shoot-through) during the high-side turn-on? I thought GaNFETs didn't have reverse recovery time? Well, I think it's just the Coss of the low-side FETs getting charged. Even though there's no body diode, there is still parasitic capacitance in the 1-2nF (per FET) range that has to get hard-switched at 50V. At these switching speeds, that requires a lot of current (for a very short period of time). If it turns out to be a problem, the gate drives will have to be artificially slowed down with extra gate resistance.

Another interesting quirk of GaNFETs is present in the LTSpice sim: during the deadtime, the low-side FETs do reverse-conduct, but at a voltage of around 2.25V, significantly higher than the body diode voltage across a silicon MOSFET. This partially eats into the power savings of a GaNFET design, but the more critical problem is what happens with a bootstrapped high-side driver. If the output is down at -2.25V, or even lower, even a crappy bootstrap diode won't drop enough from +5V to prevent the high-side VGS from exceeding 6V, the absolute maximum for these gates.

There are several ways to deal with this. Adding a resistor in series with the bootstrap diode creates low-pass filter with the bootstrap capacitor. A 10Ω resistor with a 0.1µF bootstrap capacitor has a time constant of 1µs, so a 50ns spike of +7V or so won't budge the voltage much. There is still a possibility that at very high duty cycle the average bootstrap charge voltage could exceed 6V, but I think you would run into other issues before that. For good measure, I also included a footprint for an optional 5.1V Zener diode across the bootstrap cap, which is another recommendation I saw.

Lastly, I included a couple footprints for small Schottky diodes in parallel with the low-side FETs. I can only fit something like an SMA or SOD128 package, which max out at 5A for 100V devices. But some have pulse ratings at or above 100A, and these are very short pulses. At those pulse ratings the voltage drop is quite high, but still significantly below 2.25V. The other interesting question is whether the parasitic inductance of the diode and its connection to the bridge will prevent any of the high current from making it to them in the first place. Only the scope will tell.

In order to take advantage of the 100kHz+ low-deadtime PWM, I also want to try out another relatively new part, an MCU from the STM32G4 series. These are the spiritual successor to the F3 series that I've used for so many motor drives. They have lots of analog features, including five (!) ADC converters, DACs, op-amps, and comparators. But the G4 also adds several new peripherals that are specifically well-suited to motor drives:

The high-resolution timer (HRTIM) can drive up to 12 PWM outputs with a resolution as small as 184ps. It's probably overkill, but it allows for close to 16-bit duty cycle resolution at 100kHz, and very accurate deadtime control. For example, part-to-part variation in the gate driver propagation delays could probably be compensated There's also a CORDIC engine for offloading trigonometric functions from the CPU. And there's a Filter Math Accelerator (FMAC) peripheral, which includes a dedicated hardware multiplier/accumulator and memory for accelerating FIR/IIR filter calculations without involving the CPU.

The need for extra computation might make more sense if you've noticed what's missing from this design: there are no inputs for motor position sensing. While I love the simplicity and robustness of TinyCross's optically-isolated virtual Hall sensors, it's nothing interesting in terms of brushless motor control. I did manage to run two instances of a lightweight flux observer in the background on the F3, but I never used it for driving the current control. I'm sure it would have worked at speed, but the main problem with sensorless control for TinyCross is getting starting torque on all four drive motors without them fighting each other.

One way to get the position of the motor before the flux observer converges is with High-Frequency Injection (HFI). The motors on TinyCross have an L/R time constant of around 1.5ms, so at frequencies above 1kHz they are mostly inductors. The inductance on each phase varies depending on the position of the rotor, since the permanent magnet flux changes where each phase's stator steel is on a non-linear B-H curve. That inductance change can be measured with a high-frequency signal on top of the normal drive voltage to derive the position. This is nothing new - it's been basically perfected in VESC - but it should be fun to try to implement with 100kHz+ PWM and the STM32G4's extra processing power.

That's it for now - more to come when there is physical hardware to look at.

Saturday, November 9, 2024

PCIe Deep Dive, Part 5: Flow Control

In Part 2, I described PCIe as a bi-directional memory bus extension and looked at some factors that contribute to the efficiency of the link. A PCIe 3.0 x4 link, with 8GT/s on each lane and an efficiency of around 90%, can support bidirectional data transfer of around 3.6GB/s. But this assumes both sides of the link can consume data that quickly. In reality, a PCIe function is subject to other constraints, like local memory access, that might limit the available data rate below the theoretical maximum. In this case, it's the job of the Data Link Layer's flow control mechanism to throttle the link partner so as not to overflow the receiver's buffers.

For a full-duplex serial interface, flow control can be as simple as two signals used to communicate readiness to receive data. In UART, for example, the RTS/CTS signals serve this purpose: A receiver de-asserts its RTS output when its buffer is full. A transmitter can only transmit when its CTS input is asserted. The RTS outputs of each receiver are crossed over to the CTS inputs of the other side's transmitter to enforce bidirectional flow control.

This works well for homogenous streams of data, but PCIe is packetized, and Transaction Layer Packets (TLPs) can have a variable amount of data in their payload. Additionally, packets representing write requests, read requests, and read completions need different handling. For example: If both sides are saturated with read requests, they need to be able to block further requests without also blocking completions of outstanding reads. So, a more sophisticated flow control system is necessary.

PCIe receivers implement six different buffers to accommodate different types of TLP. These are divided into three categories:

  1. Posted Requests (P). These are Request TLPs that don't require a Completion, such as memory writes. Once the TLP has been acknowledged, the requester assumes the remote memory will be written accordingly with the TLP data.
  2. Non-Posted Requests (NP). These are Request TLPs that do require a Completion, such as memory reads. Configuration reads and writes also fall into this category.
  3. Completions (CPL). These are Completion TLPs completing a Non-Posted Request, such as by returning data from a memory read.

For each of these categories, there is a Header buffer and a Data buffer. The Header buffer stores TLP header information, such as the address and size for memory reads and writes. The Data buffer stores TLP payload data. Separating the Header and Data buffers allows more flow control flexibility. A receiver can limit the number of TLPs received by using Header buffer flow control, or limit the amount of data received by using Data buffer flow control. In practice, both are used simultaneously depending on the external constraints of the PCIe function.

The PCIe specification only describes these buffers conceptually; it doesn't mandate any specific implementation. In an FPGA, the Data buffers could reasonably be built from block RAM, since they need to store several TLPs worth of payload data (order of KiB). The Header buffers only need to store (at most) four or five Double Words (DW = 32b) per TLP, so they could reasonably be built from distributed (LUT) RAM instead. A hypothetical Ultrascale+ implementation is shown below.

The Data buffers are built from two parallel BRAMs (4KiB each). This provides a R/W interface of 128b, matching the datapath width for a PCIe 3.0 x4 link. Conveniently, the unit of flow control for data is defined to be 4DW (128b), so each BRAM address is one unit of data. With 8KiB of total memory, these buffers could, for example, hold up to 16 TLPs worth of optimally-aligned data on a link with a Max. Payload Size of 512B. The capacity could be expanded by cascading more BRAMs or using URAMs, depending on the design requirements.

The Header buffers are built from LUTRAMs instead, with a depth of 64. Each entry represents the header of a single TLP, and may be associated with a block of data in the Data buffer. One header is also defined to be one unit of flow control. A single SLICEM (8 LUTs) can make a 64x7b simple dual-port LUTRAM. These can be parallelized up to any bit width, depending on how much of the header is actually used in the design. The buffer could also be made deeper if necessary, by using more LUTRAMs and MUXes.

For each of these buffers, the receiver issues credits to its link partner's transmitter in flow control units (4DW for Data buffers, one TLP header for Header buffers). The transmitter is only allowed to send a TLP if it has been issued enough credits (of the right type) for that TLP and its data. The receiver issues credits only up to the amount of space available in each of the buffers, updating the amount issued as the buffers are emptied by the Application Layer above.

Flow control credits are initialized and then updated using specific Data Link Layer Packets (DLLPs) for each category. These packets have a common 6B structure with an 8b counter for Header credits and a 12b counter for Data credits:

InitFC1- and InitFC2-type DLLPs are sent as part of a state machine during flow control initialization to issue the starting number of credits for each category (P, NP, Cpl). In the example above, the initial header and data credits could be 64 and 512 (or 63 and 511, to be safe). It's also possible for the receiver to grant "infinite" credits for a particular category of TLP by setting the initial value of HdrFC and/or DataFC to zero. In fact, for normal Root Complex or Endpoint operation, this is required for Completion credits. A Request is only transmitted if there is room in the local receiver buffers for its associated Completion.

UpdateFC-type DLLPs are sent periodically during normal operation to issue more credits to the link partner. Typically, credits are incremented as the Application Layer reads out of the associated buffer. In the case of an AXI Bridge, this could be when the AXI transaction is in progress. The values of HdrFC[7:0] and DataFC[11:0] are the total number of credits issued, mod 256 (header) or 4096 (data). For example, if the initial data credit was 511 and then 64DW were read out of the data buffer, the first UpdateFC value for DataFC could be 511 + 64/4 = 527.

This is mostly equivalent to transmitting the amount of buffer space available, but the head/tail index subtraction is left to the transmitter. It can compare the total credits issued to its local count of total credits consumed (mod 256 or 4096) to check whether there is enough space available for the next TLP. To prevent overflow, the receiver never issues more than 127 header credits or 2047 data credits beyond its current buffer head. (This also means that there's no reason to have header buffers larger than 128 entries, or data buffers larger than 2048x4DW = 32KiB.)

The receiver handles scheduling of UpdateFCs for all buffers that didn't advertise infinite credits during initialization. The PCIe 3.0 specification does have some rules for scheduling UpdateFCs immediately when the link partner's transmitter might be credit-starved, but otherwise there is only a maximum period requirement of 30-45μs. If the buffer has plenty of space, it can be better to space out updates for higher bus efficiency. (Starting with PCIe 6.0, the flow control interval and associated bus efficiency will become constant as part of FLIT mode, a nice simplification.)

A specific application may not actually need all six buffers to be the same size or type. For example, an NVMe host would barely need any Completion buffer space, since its role as a Requester is limited to modifying NVMe Controller Registers, usually one 32b access at a time. It's also unlikely to have any need for Non-Posted Requests with data. The vast majority of its receiver TLP traffic will be memory writes (PH/PD) and memory reads (NPH).

With a PCIe protocol analyzer, it's possible to see the flow control in action. The following is a trace recorded during a 512KiB NVMe read on a PCIe Gen3 x4 link with a Max. Payload Size of 512B. The host-side memory write throughput and DataFC available credits are plotted.

The whole transaction takes 1024 memory write request TLPs, each with 512B payload data. These occur over about 146μs, for an average throughput of about 3.6GB/s. The peak throughput is a little higher, though, at around 3.8GB/s. This means the PCIe link is slightly faster than the AXI bus, and as a result, DataFC credits are consumed faster than they are issued at first. Once the available credits drop below 32, there isn't enough room for a 512B TLP and the transmitter (on the NVMe SSD) is throttled. The link reaches steady-state operation at the AXI-limited throughput.

Depending on the typical access patterns of the application, a larger data buffer could help sustain peak throughput for the entire duration of a typical transfer. Sequential transfers might still hit the credit limit threshold, but it's also possible that there's enough NVMe overhead between them for the credits to recover. This sort of optimization, along with maximizing bus efficiency, would be required to squeeze out even more application-level throughput.

In summary, if PCIe is acting as a memory bus extension, PCIe flow control extends the local memory controller's backpressure mechanism across the link. For example, if an AXI memory bus can't keep up with writes, it will de-assert AWREADY and/or WREADY. The PCIe receiver can still accept memory write TLPs as long as it has room in its buffers, but it can't issue any new PH or PD credits. When the buffers are nearly full, this transfers the backpressure to the link partner.

Monday, January 22, 2024

PCIe Deep Dive, Part 4: LTSSM

The Link Training and Status State Machine (LTSSM) is a logic block that sits in the MAC layer of the PCIe stack. It configures the PHY and establishes the PCIe link by negotiating link width, speed, and equalization settings with the link partner. This is done primarily by exchanging Ordered Sets, easy-to-identify fixed-length packets of link configuration information transmitted on all lanes in parallel. The LTSSM must complete successfully before any real data can be exchanged over the PCIe link.

Although somewhat complex, the LTSSM is a normal logic state machine. The controller executes a specific set of actions based on the current state and its role as either a downstream-facing port (host/root complex) or upstream-facing port (device/endpoint). These actions might include:

  • Detecting the presence of receiver termination on its link partner.
  • Transmitting Ordered Sets with specific link configuration information.
  • Receiving Ordered Sets from its link partner.
  • Comparing the information in received Ordered Sets to transmitted Ordered Sets.
  • Counting Ordered Sets transmitted and/or received that meet specific requirements.
  • Tracking how much time has elapsed in the state (for timeouts).
  • Reading or writing bits in PCIe Configuration Space registers, for software interaction.
Each state also has conditions that trigger transitions to other states. All this is typically implemented in gate-level logic (HDL), not software, although there may be software hooks that can trigger state transitions manually. The top-level LTSSM diagram looks like this:


The entry point after a reset is the Detect state and the normal progression is through Detect, Polling, and Configuration, to the golden state of L0, the Link Up state where application data can be exchanged. This happens first at Gen1 speed (2.5GT/s). If both link partners support a higher speed, they can enter the Recovery state, change speeds, then return to L0 at the higher speed.

Each of these top-level states has a number of substates that define actions and conditions for transitioning between substates or moving to the next top-level state. The following sections detail the substates in the normal path from Detect to L0, including a speed change through Recovery. Not covered are side paths such as low-power states (L0s, L1, L2), since just the main path is complex enough for one post.

Detect


The Detect state is the only one that doesn't involve sending or receiving Ordered Sets. Its purpose is to periodically look for receiver termination, indicating the presence of a link partner. This is done with an implementation-specific analog mechanism built into the PHY.

Detect.Quiet

This is the entry point of the LTSSM after a reset and the reset point after many timeout or fault conditions. Software can also force the LTSSM back into this state to retrain the link. The transmitter is set to electrical idle. In PG239, this is done by setting the phy_txelecidle bit for each lane. The LTSSM stays in this state until 12ms have elapsed or the receiver detects that any lane has exited electrical idle (phy_rxelecidle goes low). Then, it will proceed to Detect.Active.

In the absence of a link partner, the LTSSM will cycle between Detect.Quiet and Detect.Active with a period of approximately 12ms. This period, as well as other timeouts in PCIe, are specified with a tolerance of (+50/-0)%, so it can really be anywhere from 12-18ms. This allows for efficient logic for counter comparisons. For example, with a PHY clock of 125MHz, a count of 2^17 is 1.049ms, so a single 6-input LUT attached to counter[22:17] can just wait for 6'd12 and that will be an accurate-enough 12ms timeout trigger.

Detect.Active

The transmitter for each lane attempts to detect receiver termination on that lane, indicating the presence of a link partner. This is done by measuring the time constant of the RC circuit created by the Tx AC-coupling capacitor and the Rx termination resistor. In PG239, the MAC sets the signal phy_txdetectrx and monitors the result in phy_rxstatus on each lane.

There are three possible outcomes:
  1. No receiver termination is detected on any lane. The LTSSM returns to Detect.Quiet.
  2. Receiver termination is detected on all lanes. The LTSSM proceeds to Polling on all lanes.
  3. Receiver termination is detected on some, but not all, lanes. In this case, the link partner may have fewer lanes. The transmitter waits 12ms, then repeats the receiver detection. If the result is the same, the LTSSM proceeds to Polling on only the detected lanes. Otherwise, it returns to Detect.Quiet.

Polling


In Polling and most other states, link partners exchange Ordered Sets, easy-to-identify fixed-length packets containing link configuration information. They are transmitted in parallel on all lanes that detected receiver termination, although the contents may very per-lane in some states. The most important Ordered Sets for training are Training Sequence 1 (TS1) and Training Sequence 2 (TS2), 16-symbol packets with the following layouts:

TS1 Ordered Set Structure

TS2 Ordered Set Structure

In the Link Number and Lane Number fields, a special symbol (PAD) is reserved for indicating that the field has not yet been configured. This symbol has a unique 8b/10b control code (K23.7) in Gen1/2, but is just defined as 8'hF7 in Gen3. Polling always happens at Gen1 speeds (2.5GT/s).

Polling.Active

The transmitter sends TS1s with PAD for the Link Number and Lane Number. The receiver listens for TS1s or TS2s from the link partner.

The LTSSM normally proceeds to Polling.Configuration when all of the following conditions are met:
  1. Software is not commanding a transition to Polling.Compliance via the Enter Compliance bit in the Link Control 2 register.
  2. At least 1024 TS1s have been transmitted.
  3. Eight consecutive TS1s or TS2s have been received with Link Number and Lane Number set to PAD on all lanes, and not requesting Polling.Compliance unless also requesting Loopback (an unusual corner case).
If the above three conditions are not met on all lanes after 24ms timeout, the LTSSM proceeds to Polling.Configuration anyway if at least one lane received the necessary TS1s and enough lanes to form a valid link have exited electrical idle. Otherwise, it will assume it's connected to a passive test load and go to Polling.Compliance, a substate used to test compliance with the PCIe PHY specification by transmitting known sequences.

Polling.Configuration

The transmitter sends TS2s with PAD for the Link Number and Lane Number. The receiver listens for TS2s (not TS1s) from the link partner.

The LTSSM normally proceeds to Configuration when all of the following conditions are met:
  1. At least 16 TS2s have been transmitted after receiving one TS2.
  2. Eight consecutive TS2s have been received with Link Number and Lane Number set to PAD on any lane.
Unlike in Polling.Active, transmitted TS are only counted after receiving at least one TS from the link partner. This mechanism acts as a synchronization gate to ensure that both link partners receive more than enough TS to clear the state, regardless of which entered the state first.

If the above two conditions are not met after a 48ms timeout, the LTSSM returns to Detect and starts over.

Configuration


The downstream-facing (host/root complex) port leads configuration, proposing link and lane numbers based on the available lanes. The upstream-facing (device/end-point) port echoes back configuration parameters, if they are accepted. The following diagram and description are from the point of view of the downstream-facing port.

Configuration.Linkwidth.Start

The (downstream-facing) transmitter sends TS1s with a Link Number (arbitrary, 0-31) and PAD for the Lane Number. The receiver listens for matching TS1s.

The LTSSM normally proceeds to Configuration.Linkwidth.Accept when the following condition is met:
  1. Two consecutive TS1s are received with Link Number matching that of the transmitted TS1s, and PAD for the Lane Number, on any lane.
It the above condition is not met after a 24ms timeout, the LTSSM returns to Detect and starts over.

Configuration.Linkwidth.Accept

The downstream-facing port must decide if it can form a link using the lanes that are receiving a matching Link Number and PAD for the Lane Numbers. If it can, it assigns sequential Lane Numbers to those lanes. For example, an x4 link can be formed by assigning Lane Numbers 0-3.

The LTSSM normally proceeds to Configuration.Lanenum.Wait when the following condition is met:
  1. A link can be formed with a subset of the lanes that are responding with a matching Link Number and PAD for the Lane Numbers.
An interesting question is how to handle a case where only some of the detected lanes have responded. Should the LTSSM wait at least long enough to handle a missed packet and/or lane-to-lane skew before exiting this state? (I don't actually know the answer, but to me it seems logical to wait for at least a few TS periods before proposing lane numbers.)

If the above condition isn't met after a 2ms timeout, the LTSSM returns to Detect and starts over.

Configuration.Lanenum.Wait

The transmitter sends TS1s with the Link Number and with each lane's proposed Lane Number. The receiver listens for TS1s with a matching Link Number and updated Lane Numbers.

The LTSSM normally proceed to Configuration.Lanenum.Accept when the following condition is met:
  1. Two consecutive TS1s are received with Link Number matching that of the transmitted TS1s and with a Lane Number that has changed since entering the state, on any lane.
Here the spec is more explicit that upstream-facing lanes may take up to 1ms to start echoing the lane numbers, to account for receiver errors or lane-to-lane skew. So (I think) the above condition is meant to be evaluated only after 1ms has elapsed in this state.

If the above condition isn't met after a 2ms timeout, the LTSSM returns to Detect and starts over.

Configuration.Lanenum.Accept

Here, there are three possibilities:
  1. The updated Lane Numbers being received match those transmitted on all lanes, or the reverse (if supported). The LTSSM proceeds to Configuration.Complete.
  2. The updated Lane Numbers don't match the those transmitted, or the reverse (if supported). But, a subset of the responding lanes can be used to form a link. The downstream-facing port reassigns lane numbers for this new link and returns to Configuration.Lanenum.Wait.
  3. No link can be formed. The LTSSM returns to Detect and starts over.
Normally, lane reversal (e.g. 0-3 remapped to 3-0) would be handled by the device if it supports the feature, and its upstream-facing port will respond with matching Lane Numbers. However, if the device doesn't support lane reversal, it can respond with the reversed lane numbers to request the host do the reversal, if possible.

Configuration.Complete

The transmitter sends TS2s with the agreed-upon Link and Lane Numbers. The receiver listens for TS2s with the same.

The LTSSM normally proceeds to Configuration.Idle when all of the following conditions are met:
  1. At least 16 TS2s have been transmitted after receiving one TS2, on all lanes.
  2. Eight consecutive TS2s have been received with the same Link and Lane Numbers as are being transmitted, on all lanes.
If the above condition isn't met after a 2ms timeout, the LTSSM returns to Detect and starts over.

Configuration.Idle

The transmitter sends Idle data symbols (IDL) on all configured lanes. The receiver listens for the same. Unlike Training Sets, these symbols go through scrambling, so this state also confirms that scrambling is working properly in both directions.

The LTSSM normally proceeds to L0 when all of the following conditions are met:
  1. At least 16 consecutive IDL have been transmitted after receiving one IDL, on all lanes.
  2. Eight consecutive IDL have been received, on all lanes.
If the above conditions aren't met after a 2ms timeout, the LTSSM returns to Detect and starts over.

L0

This is the golden normal operational state where the host and device can exchange actual data packets. The LTSSM indicates Link Up status to the upper layers of the stack, and they begin to do their work. One of the first things that happens after Link Up is flow control initialization by the Data Link Layer partners. Flow control is itself a state machine with some interesting rules, but that'll be for another post.

But wait...the link is still operating at 2.5GT/s at this point. If both link partners support higher data rates (as indicated in their Training Sets), they can try to switch to their highest mutually-supported data rate. This is done by transitioning to Recovery, running through the Recovery speed change substates, then returning to L0 at the new rate.

Recovery


Recovery is in many ways the most complex LTSSM state, with many internal state variables that alter state transitions rules and lead to circuitous paths through the substates, even for a nominal speed change. As with the other states, there are way too many edge cases to cover here, so I'll only focus on getting back to L0 at 8GT/s along the normal path. 

Also, since Configuration has been completed, it's assumed that Link and Lane Numbers will match in transmitted and received Training Sequences. If this condition is violated, the LTSSM may fail back to Configuration or Detect depending on the nature of the failure. For simplicity, I'm omitting these paths from the descriptions of each substate.

When changing speeds to 8GT/s, the link must establish equalization settings during this state. In the simplest case, the downstream-facing port chooses a transmitter equalization preset for itself and requests a preset for the upstream-facing transmitter to use. The transmitter presets specify two parameters, de-emphasis and preshoot, that modify the shape of the transmitted waveform to counteract the low-pass nature of the physical channel. This can open the receiver eye even with lower overall voltage swing:

Recovery.RcvrLock

This substate is encountered (at least) three times.

The first time this substate is entered is from L0 at 2.5GT/s. The transmitter sends TS1s (at 2.5GT/s) with the Speed Change bit set. It can also set the EQ bit and send a Transmitter Preset and Receiver Preset Hint in this state. These presets are requested values for the upstream transmitter to use after it switches to 8GT/s. The receiver listens for TS1s or TS2s that also have the Speed Change bit set.

The first exit is normally to Recovery.RcvrCfg when the following condition is met:
  1. Eight consecutive TS1s or TS2s are received with the Speed Change bit matching the transmitted value (1, in this case), on all lanes.
The second time this subtstate is entered is from Recovery.Speed, after the speed has changed from 2.5GT/s to 8GT/s. Now, the link needs to be re-established at the higher data rate. Transitioning to 8GT/s always requires a trip through the equalization substate, so after setting its transmitter equalization, the LTSSM proceeds to Recovery.Equalization immediately.

The third time this subtstate is entered is from Recovery.Equalization, after equalization has been completed. The transmitter sends TS1s (at 8GT/s) with the Speed Change bit cleared, the EC bits set to 2'b00, and the equalization fields reflecting the downstream transmitter's current equalization settings: Transmitter Preset and Cursor Coefficients. The receiver listens for TS1s or TS2s that also have the Speed Change and EC bits cleared.

The third exit is normally to Recovery.RcvrCfg when the following condition is met:
  1. Eight consecutive TS1s or TS2s are received with the Speed Change bit matching the transmitted value (0, in this case), on all lanes.

Recovery.RcvrCfg

This substate is encoutered (at least) twice.

The first time this substate is entered is from Recovery.RcvrLock at 2.5GT/s. The transmitter sends TS2s (at 2.5GT/s) with the Speed Change bit set. It can also set the EQ bit and send a transmitter preset and receiver preset hint in this state. These presets are requested values for the upstream transmitter to use after it switches to 8GT/s. The receiver listens for TS2s that also have the Speed Change bit set.

The first exit is normally to Recovery.Speed when the following condition is met:
  1. Eight consecutive TS2s are received with the Speed Change bit set, on all lanes.
The second time this substate is entered is from Recovery.RcvrLock at 8GT/s. The transmitter sends TS2s (at 8GT/s) with the Speed Change bit cleared. The receiver listens for TS2s that also have the Speed Change bit cleared.

The second exit is normally to Recovery.Idle when the following condition is met:
  1. Eight consecutive TS2s are received with the Speed Change bit cleared, on all lanes.

Recovery.Speed

In this substate, the transmitter enters electrical idle and the receiver waits for all lanes to be in electrical idle. At this point, the transmitter changes to the new higher speed and configures its equalization parameters. In PG239, this is done using the phy_rate and phy_txeq_X signals.

The LTSSM normally returns to Recovery.RcvrLock after waiting at least 800ns and not more than 1ms after all receiver lanes have entered electrical idle.

This state may be re-entered if the link cannot be reestablished at the new speed. In that case, the data rate can be changed back to the last known-good speed.

Recovery.Equalization

The Recovery.Equalization substate has phases, indicated by the Equalization Control (EC) bits of the TS1, that are themselves like sub-substates. From the point of view of the downstream-facing port, Phase 1 is always encountered, but Phase 2 and 3 may not be needed if the initially-chosen presets are acceptable.

In Phase 1, the transmitter sends TS1s with EC = 2'b01 and the equalization fields indicating the downstream transmitter's equalization settings and capabilities: Transmitter Preset, Full Scale (FS), Low Frequency (LF), and Post-Cursor Coefficient. The FS and LF values indicate the range of voltage adjustments possible for transmitter equalization.

The LTSSM normally returns to Recovery.RcvrLock when the following condition is met:
  1. Two consecutive TS1s are received with EC = 2'b01.
This essentially means that the presets chosen in the EQ TS1s and EQ TS2s sent at 2.5GT/s have been applied and are acceptable. If the above condition is not met after a 24ms timeout, the LTSSM returns to Recovery.Speed and changes back to the lower speed. From there, it could try again with different equalization presets, or accept that the link will run at a lower speed.

It's also possible for the downstream port to request further equalization tuning: In Phase 2 and Phase 3 of this substate, link partners can iteratively request different equalization settings and evaluate (via some implementation-specific method) the link quality. In a completely "known" link, these steps can be skipped if one of the transmitter presets has already been validated.

Recovery.Idle

This substate serves the same purpose as Configuration.Idle, but at the higher data rate (assuming the speed change was successful).

The transmitter sends Idle data symbols (IDL, 8'h00) on all configured lanes. The receiver listens for the same. These symbols now go through 8GT/s scrambling, so this state also confirms that 8GT/s scrambling is working properly in both directions.

The LTSSM normally returns to L0 when all of the following conditions are met:
  1. At least 16 consecutive IDL have been transmitted after receiving one IDL, on all lanes.
  2. Eight consecutive IDL have been received, on all lanes.
If the above conditions aren't met after a 2ms timeout, the LTSSM returns to Detect and starts over.

LTSSM Protocol Analyzer Captures

There are lots of places for the LTSSM to go wrong, and since it's running near the very bottom of the stack, it's hard to troubleshoot without dedicated tools like a PCIe Protocol Analyzer. In my tool hunt, I managed to get a used U4301B, so let's put it to use and look at some LTSSM captures.

Side note: Somebody just scored an insane deal on a dual U4301A listing that included the unicorn U4322A probe. If you're that someone and you want to sell me just the probe, let me know! I will take it in any condition just for the spare pins. Also, there is a reasonably-priced U4301B up right now if anyone's looking for one.

But anyway, my Frankenstein U4301B + M.2 interposer is still operational and can be used with the Keysight software to capture Training Sets and summarize LTSSM progress:


You can see the progression through Polling and Configuration, L0, Recovery, and back to L0. In Recovery, you can see the speed change and equalization loops, crossing through the base state of Recovery.RcvrLock three times as described above.

Looking at the Training Sequence traffic itself, the entire LTSSM takes around 9ms to complete in this example, with the vast majority of the time spent in the Recovery state after the speed change. Zooming in shows the details of the earlier states, down to the level of individual Training Sequences.


If any of the states transitions don't go as expected it's possible to look inside the individual Training Sequences to troubleshoot what conditions aren't being met. The exact timing and behavior varies a lot from device to device, though.

So you made it to L0...what next?

L0 / Link Up means the physical link is established, so the upper layers of the PCIe stack can begin to communicate across the link. However, before any application data (memory transactions) can be transferred, the Data Link Layer must initialize flow control. PCIe flow control is itself an interesting topic that deserves a separate post, so I'll end here for now!