Saturday, November 9, 2024

PCIe Deep Dive, Part 5: Flow Control

In Part 2, I described PCIe as a bi-directional memory bus extension and looked at some factors that contribute to the efficiency of the link. A PCIe 3.0 x4 link, with 8GT/s on each lane and an efficiency of around 90%, can support bidirectional data transfer of around 3.6GB/s. But this assumes both sides of the link can consume data that quickly. In reality, a PCIe function is subject to other constraints, like local memory access, that might limit the available data rate below the theoretical maximum. In this case, it's the job of the Data Link Layer's flow control mechanism to throttle the link partner so as not to overflow the receiver's buffers.

For a full-duplex serial interface, flow control can be as simple as two signals used to communicate readiness to receive data. In UART, for example, the RTS/CTS signals serve this purpose: A receiver de-asserts its RTS output when its buffer is full. A transmitter can only transmit when its CTS input is asserted. The RTS outputs of each receiver are crossed over to the CTS inputs of the other side's transmitter to enforce bidirectional flow control.

This works well for homogenous streams of data, but PCIe is packetized, and Transaction Layer Packets (TLPs) can have a variable amount of data in their payload. Additionally, packets representing write requests, read requests, and read completions need different handling. For example: If both sides are saturated with read requests, they need to be able to block further requests without also blocking completions of outstanding reads. So, a more sophisticated flow control system is necessary.

PCIe receivers implement six different buffers to accommodate different types of TLP. These are divided into three categories:

  1. Posted Requests (P). These are Request TLPs that don't require a Completion, such as memory writes. Once the TLP has been acknowledged, the requester assumes the remote memory will be written accordingly with the TLP data.
  2. Non-Posted Requests (NP). These are Request TLPs that do require a Completion, such as memory reads. Configuration reads and writes also fall into this category.
  3. Completions (CPL). These are Completion TLPs completing a Non-Posted Request, such as by returning data from a memory read.

For each of these categories, there is a Header buffer and a Data buffer. The Header buffer stores TLP header information, such as the address and size for memory reads and writes. The Data buffer stores TLP payload data. Separating the Header and Data buffers allows more flow control flexibility. A receiver can limit the number of TLPs received by using Header buffer flow control, or limit the amount of data received by using Data buffer flow control. In practice, both are used simultaneously depending on the external constraints of the PCIe function.

The PCIe specification only describes these buffers conceptually; it doesn't mandate any specific implementation. In an FPGA, the Data buffers could reasonably be built from block RAM, since they need to store several TLPs worth of payload data (order of KiB). The Header buffers only need to store (at most) four or five Double Words (DW = 32b) per TLP, so they could reasonably be built from distributed (LUT) RAM instead. A hypothetical Ultrascale+ implementation is shown below.

The Data buffers are built from two parallel BRAMs (4KiB each). This provides a R/W interface of 128b, matching the datapath width for a PCIe 3.0 x4 link. Conveniently, the unit of flow control for data is defined to be 4DW (128b), so each BRAM address is one unit of data. With 8KiB of total memory, these buffers could, for example, hold up to 16 TLPs worth of optimally-aligned data on a link with a Max. Payload Size of 512B. The capacity could be expanded by cascading more BRAMs or using URAMs, depending on the design requirements.

The Header buffers are built from LUTRAMs instead, with a depth of 64. Each entry represents the header of a single TLP, and may be associated with a block of data in the Data buffer. One header is also defined to be one unit of flow control. A single SLICEM (8 LUTs) can make a 64x7b simple dual-port LUTRAM. These can be parallelized up to any bit width, depending on how much of the header is actually used in the design. The buffer could also be made deeper if necessary, by using more LUTRAMs and MUXes.

For each of these buffers, the receiver issues credits to its link partner's transmitter in flow control units (4DW for Data buffers, one TLP header for Header buffers). The transmitter is only allowed to send a TLP if it has been issued enough credits (of the right type) for that TLP and its data. The receiver issues credits only up to the amount of space available in each of the buffers, updating the amount issued as the buffers are emptied by the Application Layer above.

Flow control credits are initialized and then updated using specific Data Link Layer Packets (DLLPs) for each category. These packets have a common 6B structure with an 8b counter for Header credits and a 12b counter for Data credits:

InitFC1- and InitFC2-type DLLPs are sent as part of a state machine during flow control initialization to issue the starting number of credits for each category (P, NP, Cpl). In the example above, the initial header and data credits could be 64 and 512 (or 63 and 511, to be safe). It's also possible for the receiver to grant "infinite" credits for a particular category of TLP by setting the initial value of HdrFC and/or DataFC to zero. In fact, for normal Root Complex or Endpoint operation, this is required for Completion credits. A Request is only transmitted if there is room in the local receiver buffers for its associated Completion.

UpdateFC-type DLLPs are sent periodically during normal operation to issue more credits to the link partner. Typically, credits are incremented as the Application Layer reads out of the associated buffer. In the case of an AXI Bridge, this could be when the AXI transaction is in progress. The values of HdrFC[7:0] and DataFC[11:0] are the total number of credits issued, mod 256 (header) or 4096 (data). For example, if the initial data credit was 511 and then 64DW were read out of the data buffer, the first UpdateFC value for DataFC could be 511 + 64/4 = 527.

This is mostly equivalent to transmitting the amount of buffer space available, but the head/tail index subtraction is left to the transmitter. It can compare the total credits issued to its local count of total credits consumed (mod 256 or 4096) to check whether there is enough space available for the next TLP. To prevent overflow, the receiver never issues more than 127 header credits or 2047 data credits beyond its current buffer head. (This also means that there's no reason to have header buffers larger than 128 entries, or data buffers larger than 2048x4DW = 32KiB.)

The receiver handles scheduling of UpdateFCs for all buffers that didn't advertise infinite credits during initialization. The PCIe 3.0 specification does have some rules for scheduling UpdateFCs immediately when the link partner's transmitter might be credit-starved, but otherwise there is only a maximum period requirement of 30-45μs. If the buffer has plenty of space, it can be better to space out updates for higher bus efficiency. (Starting with PCIe 6.0, the flow control interval and associated bus efficiency will become constant as part of FLIT mode, a nice simplification.)

A specific application may not actually need all six buffers to be the same size or type. For example, an NVMe host would barely need any Completion buffer space, since its role as a Requester is limited to modifying NVMe Controller Registers, usually one 32b access at a time. It's also unlikely to have any need for Non-Posted Requests with data. The vast majority of its receiver TLP traffic will be memory writes (PH/PD) and memory reads (NPH).

With a PCIe protocol analyzer, it's possible to see the flow control in action. The following is a trace recorded during a 512KiB NVMe read on a PCIe Gen3 x4 link with a Max. Payload Size of 512B. The host-side memory write throughput and DataFC available credits are plotted.

The whole transaction takes 1024 memory write request TLPs, each with 512B payload data. These occur over about 146μs, for an average throughput of about 3.6GB/s. The peak throughput is a little higher, though, at around 3.8GB/s. This means the PCIe link is slightly faster than the AXI bus, and as a result, DataFC credits are consumed faster than they are issued at first. Once the available credits drop below 32, there isn't enough room for a 512B TLP and the transmitter (on the NVMe SSD) is throttled. The link reaches steady-state operation at the AXI-limited throughput.

Depending on the typical access patterns of the application, a larger data buffer could help sustain peak throughput for the entire duration of a typical transfer. Sequential transfers might still hit the credit limit threshold, but it's also possible that there's enough NVMe overhead between them for the credits to recover. This sort of optimization, along with maximizing bus efficiency, would be required to squeeze out even more application-level throughput.

In summary, if PCIe is acting as a memory bus extension, PCIe flow control extends the local memory controller's backpressure mechanism across the link. For example, if an AXI memory bus can't keep up with writes, it will de-assert AWREADY and/or WREADY. The PCIe receiver can still accept memory write TLPs as long as it has room in its buffers, but it can't issue any new PH or PD credits. When the buffers are nearly full, this transfers the backpressure to the link partner.

Monday, January 22, 2024

PCIe Deep Dive, Part 4: LTSSM

The Link Training and Status State Machine (LTSSM) is a logic block that sits in the MAC layer of the PCIe stack. It configures the PHY and establishes the PCIe link by negotiating link width, speed, and equalization settings with the link partner. This is done primarily by exchanging Ordered Sets, easy-to-identify fixed-length packets of link configuration information transmitted on all lanes in parallel. The LTSSM must complete successfully before any real data can be exchanged over the PCIe link.

Although somewhat complex, the LTSSM is a normal logic state machine. The controller executes a specific set of actions based on the current state and its role as either a downstream-facing port (host/root complex) or upstream-facing port (device/endpoint). These actions might include:

  • Detecting the presence of receiver termination on its link partner.
  • Transmitting Ordered Sets with specific link configuration information.
  • Receiving Ordered Sets from its link partner.
  • Comparing the information in received Ordered Sets to transmitted Ordered Sets.
  • Counting Ordered Sets transmitted and/or received that meet specific requirements.
  • Tracking how much time has elapsed in the state (for timeouts).
  • Reading or writing bits in PCIe Configuration Space registers, for software interaction.
Each state also has conditions that trigger transitions to other states. All this is typically implemented in gate-level logic (HDL), not software, although there may be software hooks that can trigger state transitions manually. The top-level LTSSM diagram looks like this:


The entry point after a reset is the Detect state and the normal progression is through Detect, Polling, and Configuration, to the golden state of L0, the Link Up state where application data can be exchanged. This happens first at Gen1 speed (2.5GT/s). If both link partners support a higher speed, they can enter the Recovery state, change speeds, then return to L0 at the higher speed.

Each of these top-level states has a number of substates that define actions and conditions for transitioning between substates or moving to the next top-level state. The following sections detail the substates in the normal path from Detect to L0, including a speed change through Recovery. Not covered are side paths such as low-power states (L0s, L1, L2), since just the main path is complex enough for one post.

Detect


The Detect state is the only one that doesn't involve sending or receiving Ordered Sets. Its purpose is to periodically look for receiver termination, indicating the presence of a link partner. This is done with an implementation-specific analog mechanism built into the PHY.

Detect.Quiet

This is the entry point of the LTSSM after a reset and the reset point after many timeout or fault conditions. Software can also force the LTSSM back into this state to retrain the link. The transmitter is set to electrical idle. In PG239, this is done by setting the phy_txelecidle bit for each lane. The LTSSM stays in this state until 12ms have elapsed or the receiver detects that any lane has exited electrical idle (phy_rxelecidle goes low). Then, it will proceed to Detect.Active.

In the absence of a link partner, the LTSSM will cycle between Detect.Quiet and Detect.Active with a period of approximately 12ms. This period, as well as other timeouts in PCIe, are specified with a tolerance of (+50/-0)%, so it can really be anywhere from 12-18ms. This allows for efficient logic for counter comparisons. For example, with a PHY clock of 125MHz, a count of 2^17 is 1.049ms, so a single 6-input LUT attached to counter[22:17] can just wait for 6'd12 and that will be an accurate-enough 12ms timeout trigger.

Detect.Active

The transmitter for each lane attempts to detect receiver termination on that lane, indicating the presence of a link partner. This is done by measuring the time constant of the RC circuit created by the Tx AC-coupling capacitor and the Rx termination resistor. In PG239, the MAC sets the signal phy_txdetectrx and monitors the result in phy_rxstatus on each lane.

There are three possible outcomes:
  1. No receiver termination is detected on any lane. The LTSSM returns to Detect.Quiet.
  2. Receiver termination is detected on all lanes. The LTSSM proceeds to Polling on all lanes.
  3. Receiver termination is detected on some, but not all, lanes. In this case, the link partner may have fewer lanes. The transmitter waits 12ms, then repeats the receiver detection. If the result is the same, the LTSSM proceeds to Polling on only the detected lanes. Otherwise, it returns to Detect.Quiet.

Polling


In Polling and most other states, link partners exchange Ordered Sets, easy-to-identify fixed-length packets containing link configuration information. They are transmitted in parallel on all lanes that detected receiver termination, although the contents may very per-lane in some states. The most important Ordered Sets for training are Training Sequence 1 (TS1) and Training Sequence 2 (TS2), 16-symbol packets with the following layouts:

TS1 Ordered Set Structure

TS2 Ordered Set Structure

In the Link Number and Lane Number fields, a special symbol (PAD) is reserved for indicating that the field has not yet been configured. This symbol has a unique 8b/10b control code (K23.7) in Gen1/2, but is just defined as 8'hF7 in Gen3. Polling always happens at Gen1 speeds (2.5GT/s).

Polling.Active

The transmitter sends TS1s with PAD for the Link Number and Lane Number. The receiver listens for TS1s or TS2s from the link partner.

The LTSSM normally proceeds to Polling.Configuration when all of the following conditions are met:
  1. Software is not commanding a transition to Polling.Compliance via the Enter Compliance bit in the Link Control 2 register.
  2. At least 1024 TS1s have been transmitted.
  3. Eight consecutive TS1s or TS2s have been received with Link Number and Lane Number set to PAD on all lanes, and not requesting Polling.Compliance unless also requesting Loopback (an unusual corner case).
If the above three conditions are not met on all lanes after 24ms timeout, the LTSSM proceeds to Polling.Configuration anyway if at least one lane received the necessary TS1s and enough lanes to form a valid link have exited electrical idle. Otherwise, it will assume it's connected to a passive test load and go to Polling.Compliance, a substate used to test compliance with the PCIe PHY specification by transmitting known sequences.

Polling.Configuration

The transmitter sends TS2s with PAD for the Link Number and Lane Number. The receiver listens for TS2s (not TS1s) from the link partner.

The LTSSM normally proceeds to Configuration when all of the following conditions are met:
  1. At least 16 TS2s have been transmitted after receiving one TS2.
  2. Eight consecutive TS2s have been received with Link Number and Lane Number set to PAD on any lane.
Unlike in Polling.Active, transmitted TS are only counted after receiving at least one TS from the link partner. This mechanism acts as a synchronization gate to ensure that both link partners receive more than enough TS to clear the state, regardless of which entered the state first.

If the above two conditions are not met after a 48ms timeout, the LTSSM returns to Detect and starts over.

Configuration


The downstream-facing (host/root complex) port leads configuration, proposing link and lane numbers based on the available lanes. The upstream-facing (device/end-point) port echoes back configuration parameters, if they are accepted. The following diagram and description are from the point of view of the downstream-facing port.

Configuration.Linkwidth.Start

The (downstream-facing) transmitter sends TS1s with a Link Number (arbitrary, 0-31) and PAD for the Lane Number. The receiver listens for matching TS1s.

The LTSSM normally proceeds to Configuration.Linkwidth.Accept when the following condition is met:
  1. Two consecutive TS1s are received with Link Number matching that of the transmitted TS1s, and PAD for the Lane Number, on any lane.
It the above condition is not met after a 24ms timeout, the LTSSM returns to Detect and starts over.

Configuration.Linkwidth.Accept

The downstream-facing port must decide if it can form a link using the lanes that are receiving a matching Link Number and PAD for the Lane Numbers. If it can, it assigns sequential Lane Numbers to those lanes. For example, an x4 link can be formed by assigning Lane Numbers 0-3.

The LTSSM normally proceeds to Configuration.Lanenum.Wait when the following condition is met:
  1. A link can be formed with a subset of the lanes that are responding with a matching Link Number and PAD for the Lane Numbers.
An interesting question is how to handle a case where only some of the detected lanes have responded. Should the LTSSM wait at least long enough to handle a missed packet and/or lane-to-lane skew before exiting this state? (I don't actually know the answer, but to me it seems logical to wait for at least a few TS periods before proposing lane numbers.)

If the above condition isn't met after a 2ms timeout, the LTSSM returns to Detect and starts over.

Configuration.Lanenum.Wait

The transmitter sends TS1s with the Link Number and with each lane's proposed Lane Number. The receiver listens for TS1s with a matching Link Number and updated Lane Numbers.

The LTSSM normally proceed to Configuration.Lanenum.Accept when the following condition is met:
  1. Two consecutive TS1s are received with Link Number matching that of the transmitted TS1s and with a Lane Number that has changed since entering the state, on any lane.
Here the spec is more explicit that upstream-facing lanes may take up to 1ms to start echoing the lane numbers, to account for receiver errors or lane-to-lane skew. So (I think) the above condition is meant to be evaluated only after 1ms has elapsed in this state.

If the above condition isn't met after a 2ms timeout, the LTSSM returns to Detect and starts over.

Configuration.Lanenum.Accept

Here, there are three possibilities:
  1. The updated Lane Numbers being received match those transmitted on all lanes, or the reverse (if supported). The LTSSM proceeds to Configuration.Complete.
  2. The updated Lane Numbers don't match the those transmitted, or the reverse (if supported). But, a subset of the responding lanes can be used to form a link. The downstream-facing port reassigns lane numbers for this new link and returns to Configuration.Lanenum.Wait.
  3. No link can be formed. The LTSSM returns to Detect and starts over.
Normally, lane reversal (e.g. 0-3 remapped to 3-0) would be handled by the device if it supports the feature, and its upstream-facing port will respond with matching Lane Numbers. However, if the device doesn't support lane reversal, it can respond with the reversed lane numbers to request the host do the reversal, if possible.

Configuration.Complete

The transmitter sends TS2s with the agreed-upon Link and Lane Numbers. The receiver listens for TS2s with the same.

The LTSSM normally proceeds to Configuration.Idle when all of the following conditions are met:
  1. At least 16 TS2s have been transmitted after receiving one TS2, on all lanes.
  2. Eight consecutive TS2s have been received with the same Link and Lane Numbers as are being transmitted, on all lanes.
If the above condition isn't met after a 2ms timeout, the LTSSM returns to Detect and starts over.

Configuration.Idle

The transmitter sends Idle data symbols (IDL) on all configured lanes. The receiver listens for the same. Unlike Training Sets, these symbols go through scrambling, so this state also confirms that scrambling is working properly in both directions.

The LTSSM normally proceeds to L0 when all of the following conditions are met:
  1. At least 16 consecutive IDL have been transmitted after receiving one IDL, on all lanes.
  2. Eight consecutive IDL have been received, on all lanes.
If the above conditions aren't met after a 2ms timeout, the LTSSM returns to Detect and starts over.

L0

This is the golden normal operational state where the host and device can exchange actual data packets. The LTSSM indicates Link Up status to the upper layers of the stack, and they begin to do their work. One of the first things that happens after Link Up is flow control initialization by the Data Link Layer partners. Flow control is itself a state machine with some interesting rules, but that'll be for another post.

But wait...the link is still operating at 2.5GT/s at this point. If both link partners support higher data rates (as indicated in their Training Sets), they can try to switch to their highest mutually-supported data rate. This is done by transitioning to Recovery, running through the Recovery speed change substates, then returning to L0 at the new rate.

Recovery


Recovery is in many ways the most complex LTSSM state, with many internal state variables that alter state transitions rules and lead to circuitous paths through the substates, even for a nominal speed change. As with the other states, there are way too many edge cases to cover here, so I'll only focus on getting back to L0 at 8GT/s along the normal path. 

Also, since Configuration has been completed, it's assumed that Link and Lane Numbers will match in transmitted and received Training Sequences. If this condition is violated, the LTSSM may fail back to Configuration or Detect depending on the nature of the failure. For simplicity, I'm omitting these paths from the descriptions of each substate.

When changing speeds to 8GT/s, the link must establish equalization settings during this state. In the simplest case, the downstream-facing port chooses a transmitter equalization preset for itself and requests a preset for the upstream-facing transmitter to use. The transmitter presets specify two parameters, de-emphasis and preshoot, that modify the shape of the transmitted waveform to counteract the low-pass nature of the physical channel. This can open the receiver eye even with lower overall voltage swing:

Recovery.RcvrLock

This substate is encountered (at least) three times.

The first time this substate is entered is from L0 at 2.5GT/s. The transmitter sends TS1s (at 2.5GT/s) with the Speed Change bit set. It can also set the EQ bit and send a Transmitter Preset and Receiver Preset Hint in this state. These presets are requested values for the upstream transmitter to use after it switches to 8GT/s. The receiver listens for TS1s or TS2s that also have the Speed Change bit set.

The first exit is normally to Recovery.RcvrCfg when the following condition is met:
  1. Eight consecutive TS1s or TS2s are received with the Speed Change bit matching the transmitted value (1, in this case), on all lanes.
The second time this subtstate is entered is from Recovery.Speed, after the speed has changed from 2.5GT/s to 8GT/s. Now, the link needs to be re-established at the higher data rate. Transitioning to 8GT/s always requires a trip through the equalization substate, so after setting its transmitter equalization, the LTSSM proceeds to Recovery.Equalization immediately.

The third time this subtstate is entered is from Recovery.Equalization, after equalization has been completed. The transmitter sends TS1s (at 8GT/s) with the Speed Change bit cleared, the EC bits set to 2'b00, and the equalization fields reflecting the downstream transmitter's current equalization settings: Transmitter Preset and Cursor Coefficients. The receiver listens for TS1s or TS2s that also have the Speed Change and EC bits cleared.

The third exit is normally to Recovery.RcvrCfg when the following condition is met:
  1. Eight consecutive TS1s or TS2s are received with the Speed Change bit matching the transmitted value (0, in this case), on all lanes.

Recovery.RcvrCfg

This substate is encoutered (at least) twice.

The first time this substate is entered is from Recovery.RcvrLock at 2.5GT/s. The transmitter sends TS2s (at 2.5GT/s) with the Speed Change bit set. It can also set the EQ bit and send a transmitter preset and receiver preset hint in this state. These presets are requested values for the upstream transmitter to use after it switches to 8GT/s. The receiver listens for TS2s that also have the Speed Change bit set.

The first exit is normally to Recovery.Speed when the following condition is met:
  1. Eight consecutive TS2s are received with the Speed Change bit set, on all lanes.
The second time this substate is entered is from Recovery.RcvrLock at 8GT/s. The transmitter sends TS2s (at 8GT/s) with the Speed Change bit cleared. The receiver listens for TS2s that also have the Speed Change bit cleared.

The second exit is normally to Recovery.Idle when the following condition is met:
  1. Eight consecutive TS2s are received with the Speed Change bit cleared, on all lanes.

Recovery.Speed

In this substate, the transmitter enters electrical idle and the receiver waits for all lanes to be in electrical idle. At this point, the transmitter changes to the new higher speed and configures its equalization parameters. In PG239, this is done using the phy_rate and phy_txeq_X signals.

The LTSSM normally returns to Recovery.RcvrLock after waiting at least 800ns and not more than 1ms after all receiver lanes have entered electrical idle.

This state may be re-entered if the link cannot be reestablished at the new speed. In that case, the data rate can be changed back to the last known-good speed.

Recovery.Equalization

The Recovery.Equalization substate has phases, indicated by the Equalization Control (EC) bits of the TS1, that are themselves like sub-substates. From the point of view of the downstream-facing port, Phase 1 is always encountered, but Phase 2 and 3 may not be needed if the initially-chosen presets are acceptable.

In Phase 1, the transmitter sends TS1s with EC = 2'b01 and the equalization fields indicating the downstream transmitter's equalization settings and capabilities: Transmitter Preset, Full Scale (FS), Low Frequency (LF), and Post-Cursor Coefficient. The FS and LF values indicate the range of voltage adjustments possible for transmitter equalization.

The LTSSM normally returns to Recovery.RcvrLock when the following condition is met:
  1. Two consecutive TS1s are received with EC = 2'b01.
This essentially means that the presets chosen in the EQ TS1s and EQ TS2s sent at 2.5GT/s have been applied and are acceptable. If the above condition is not met after a 24ms timeout, the LTSSM returns to Recovery.Speed and changes back to the lower speed. From there, it could try again with different equalization presets, or accept that the link will run at a lower speed.

It's also possible for the downstream port to request further equalization tuning: In Phase 2 and Phase 3 of this substate, link partners can iteratively request different equalization settings and evaluate (via some implementation-specific method) the link quality. In a completely "known" link, these steps can be skipped if one of the transmitter presets has already been validated.

Recovery.Idle

This substate serves the same purpose as Configuration.Idle, but at the higher data rate (assuming the speed change was successful).

The transmitter sends Idle data symbols (IDL, 8'h00) on all configured lanes. The receiver listens for the same. These symbols now go through 8GT/s scrambling, so this state also confirms that 8GT/s scrambling is working properly in both directions.

The LTSSM normally returns to L0 when all of the following conditions are met:
  1. At least 16 consecutive IDL have been transmitted after receiving one IDL, on all lanes.
  2. Eight consecutive IDL have been received, on all lanes.
If the above conditions aren't met after a 2ms timeout, the LTSSM returns to Detect and starts over.

LTSSM Protocol Analyzer Captures

There are lots of places for the LTSSM to go wrong, and since it's running near the very bottom of the stack, it's hard to troubleshoot without dedicated tools like a PCIe Protocol Analyzer. In my tool hunt, I managed to get a used U4301B, so let's put it to use and look at some LTSSM captures.

Side note: Somebody just scored an insane deal on a dual U4301A listing that included the unicorn U4322A probe. If you're that someone and you want to sell me just the probe, let me know! I will take it in any condition just for the spare pins. Also, there is a reasonably-priced U4301B up right now if anyone's looking for one.

But anyway, my Frankenstein U4301B + M.2 interposer is still operational and can be used with the Keysight software to capture Training Sets and summarize LTSSM progress:


You can see the progression through Polling and Configuration, L0, Recovery, and back to L0. In Recovery, you can see the speed change and equalization loops, crossing through the base state of Recovery.RcvrLock three times as described above.

Looking at the Training Sequence traffic itself, the entire LTSSM takes around 9ms to complete in this example, with the vast majority of the time spent in the Recovery state after the speed change. Zooming in shows the details of the earlier states, down to the level of individual Training Sequences.


If any of the states transitions don't go as expected it's possible to look inside the individual Training Sequences to troubleshoot what conditions aren't being met. The exact timing and behavior varies a lot from device to device, though.

So you made it to L0...what next?

L0 / Link Up means the physical link is established, so the upper layers of the PCIe stack can begin to communicate across the link. However, before any application data (memory transactions) can be transferred, the Data Link Layer must initialize flow control. PCIe flow control is itself an interesting topic that deserves a separate post, so I'll end here for now!

Tuesday, August 22, 2023

PCIe Deep Dive, Part 3: Scramblers, CRCs, and the Parallel LFSR

This post continues an exploration into the inner workings of PCIe. The previous post presented a top-level view of the PCIe Controller as a memory bus extension, with discussion of the various overheads associated with wrapping memory transfers into serial data packets. In this post, I want go to the other extreme and look at one of the low-level logic mechanisms that PCIe depends on for reliable data transfer: the parallel Linear-Feedback Shift Register (LFSR). This mechanism efficiently introduces randomness required to ensure DC-balanced serial data, and to validate Transaction Layer Packets (TLPs) with a Cyclic Redundancy Check (CRC).

PCIe 3.0 Scrambler

PCIe signals are driven across AC-coupled differential pairs to increase immunity to noise. The transmitter and receiver may be on different boards, far apart from each other, with significant high-frequency ground offset between them. Adding series capacitors to the differential signal provides low-voltage level shifting capability to deal with this. But, this only works if the data coming across the link is DC-balanced over a data interval much shorter than the time constant formed by the AC coupling capacitor and termination resistor, which is typically 10⁴ to 10⁵ UI.

PCIe 1.0 and 2.0 use 8b/10b encoding to enforce DC balance. This encoding tracks the running disparity of the serial data stream and modifies 10b symbols (representing 8b data) to keep it in balance. This is also the encoding used in USB all the way up to USB 3.x Gen 1 (5Gbps), which is the same speed as PCIe 2.0. It's simple and deterministic, but it has a poor serial encoding efficiency of only 80% (8/10).

By contrast, PCIe 3.0 through PCIe 5.0 use 128b/130b encoding, where two sync bits are prepended to 128b data payloads to form 130b blocks. As discussed in the previous post, this has a much better serial encoding efficiency of 98.5% (128/130). However, the two sync bits are not sufficient to control running disparity with a 128b data payload. Instead, the data is sent through a scrambler, a Pseudo-Random Number Generator (PRNG) that remaps bits in a way that both the transmitter and receiver understand. The output stream is statistically DC-balanced for all real data.

The PCIe implementation of the PRNG for scrambling is as a Linear-Feedback Shift Register (LFSR). In the case of PCIe 3.0, the canonical implementation is a 23-bit shift register with strategically-placed XORs between some bits to instigate pseudo-randomness. The output of the shift register is then XORed with each data bit to generate the scrambled output. Each lane gets its own LFSR scrambler seeded with a different value.

This is simple logic, but it would need to run at 8GHz to be implemented in single-bit fashion like this. That's not really practical even in dedicated silicon, and is completely impossible using FPGA sequential logic. However, it's possible to parallelize the LFSR to any data width pretty easily. The key is in the name: the operation is linear, so the contributions of each bit of input data and the initial LFSR can be superimposed to generate each bit of output data and the final LFSR. This method of parallelizing the LFSR is covered very well at OutputLogic.com, with utilities to generate Verilog implementations of any LFSR and data width. I will only briefly describe the procedure here.

Using for example the 23-bit LFSR and a 32-bit data path (common for each PCIe 3.0 lane with a 250MHz PHY clock), there are a total of 23 + 32 = 55 bits that can contribute to the final LFSR and output data. Set each of those bits to one, and all other bits to zero, then run the LFSR forward by 32 steps, and record the contribution of each input bit to the output data and final LFSR. This creates a big table of bit contributions:

The full parallel operation is just the sum (in mod 2, so XOR) of contributions from each bit of input data and the initial LFSR. Each bit of the output data and final LFSR is the XOR combination of a specific set of input bits, with at most 55 contributing bits. On a Xilinx Ultrascale+ FPGA, wide XORs like this are easy to build using nested six-input LUTs. With two levels, you get 6² = 36 inputs. With three levels, 6³ = 216 inputs. Each level has a propagation time on the order of 1ns, so even nested three deep it's capable of running at 250MHz.

Link CRC

Another use for the parallel LFSR is in the generation and checking of the Link CRC, a 32-bit value used for error detection in TLPs. The LCRC acts like a signature for the bits in the TLP and the value received must match the value calculated by the receiver, or the TLP is rejected. The LCRC mechanism uses a 32-bit LFSR (with XOR positions described by the standard CRC-32 polynomial 0x04C11DB7), seeded to 0xFFFFFFFF at the start of each TLP. At the end of the TLP, the 32-bit LFSR value is mapped to the packet's LCRC through some additional bit manipulation.

The LCRC operation can be parallelized in the same way as the Scrambler. The main differences are that the data is unmodified by the LCRC operation and that the data does contribute to the XOR sum of the next LFSR value. (This is what drives the LCRC to a unique value for each TLP.) In table form, this just changes which quadrants have XOR contributions:

Although there are fewer rows to handle, there are now 128 + 32 = 160 columns. The LCRC is calculated on packets before they are striped across all lanes. So for a PCIe 3.0 x4 link, instead of four 32-bit data paths as in the Scrambler, there is just one 128b data path operating at 250MHz. Any of these bits and any of the 32 bits of the previous clock cycle's LFSR might contribute to the XOR for each bit of the new LFSR. This isn't a problem, though, since three levels of LUT6 can handle up to 216 XOR inputs at 250MHz, as described above.

Where things do get a little complicated is in data alignment. TLP lengths are multiples of one Double Word (DW), or 32b. So, even without considering framing, 3/4 of the possible lengths would not fit evenly into 128b data beats. Each TLP is also prepended with a 32-bit framing token (STP), the latter half of which is fed into the LCRC computation as well. So in fact all cases will involve a partial data beat.

To handle this with a 128b parallel LFSR, the LCRC mechanism must get clever. Based on the length of the packet (which is known once the STP is parsed), the 128b data window can be shifted such that the last data beat will be aligned with the end of the packet. This ensures that the final LFSR value can be used directly to generate the LCRC. Then, the first 128b data beat is padded with zeros up to the middle of the STP token, where the LCRC computation begins. (In the case of a 3DW header with no data, the first and last data beat are the same.) This creates four possible alignment cases that repeat based on the length of the TLP:

Depending on the alignment case, the LFSR is seeded with a different value that accounts for the extra {16, 48, 80, 112} zeros padded onto the first data beat. These seed values are derived by seeding the reference single-bit implementation of the LFSR with 0xFFFFFFFF, then running it backwards for {16, 48, 80, 112} steps with zero data bits. With these seeds, the 128b parallel LFSR can be run on the zero-padded data and give the same final result as the single-bit implementation on the original data.

An interesting follow-up issue is how to handle back-to-back TLPs. Padding the first LCRC beat with zeros potentially means more than a 1:1 bit rate for the LCRC engine compared to the packet data, if there is no idle time between packets. An easy workaround could be to run two LCRC engines that take turns processing packets, although this means twice the logic area. The details are likely to vary in every implementation, so it's not something I will get into here.

Conclusion

The last couple of posts were setup and background for PCIe in general. This one was more of a microscopic view of a particular logic mechanism key to several aspects of PCIe, and how it can be implemented efficiently on modern FPGAs. There are many such interesting logic puzzles to solve in gateware implementations of PCIe, and I wanted to give just one example at the lowest level I understand. I may cover other logic-level tricks in future posts, but first I think it will be more interesting to introduce what might be the scariest part of PCIe: the Link Training and Status State Machine (LTSSM). To be continued...

Sunday, June 11, 2023

PCIe Deep Dive, Part 2: Stack and Efficiency

Before getting too caught up in the inner workings of PCIe, it's probably worth taking a look at the high-level architecture - how it's used in a system and what the PCIe controller stack looks like. PCIe is fundamentally a bi-directional memory bus extension: it allows the host to access memory on a device and a device to access memory on the host.

When a PCIe link is established between the host and a device, the host assigns address space(s) that it can use to access device memory. Optionally, it can also grant permission for the device to access portions of host system memory. In that way, the host and device memory buses are effectively connected. Each PCIe link is a point-to-point connection, but they can be combined with switches into a fabric with many devices (endpoints).

Different types of devices utilize the memory bus bridging capability of PCIe in different ways. For example, an NVMe storage device exposes only a small amount of device memory (the NVMe Controller Registers) that the host uses to configure the device and inform it when new commands have been submitted. All actual data transfer is done by the storage device reading from or writing to host memory. In this way, the NVMe storage device acts as a DMA controller between host memory and non-volatile storage.

NVMe storage device usage of PCIe link (completion steps omitted).

One might ask why the memory buses can't just be directly connected. For one, a native memory interface such as AXI is very wide: it might have 64-256b of data, 32-64b of address, and a bunch of control signals. This works fine inside a chip, but going from chip-to-chip, board-to-board, or across a cable, it's too many signals. The PCIe Controller encapsulates the data, address, and control signals from the memory bus into packets that can be sent across a fast serial link over a small number of differential pairs. This standard interface also allows bridging memory buses with different native interfaces, speeds, and latencies.

With that context in mind, we can look at the PCIe Controller stack, and what role each layer plays in bridging memory transactions between the host and device as efficiently and reliably as possible. The PCIe specification defines three layers: the Transaction Layer (TXL), the Data Link Layer (DLL), and the Physical Layer (PHY). These layers each have a transmit and a receive side. From the point of view of the host, the stack looks like this:

Memory transactions from the host to the device are packaged by the TXL into a Transaction Layer Packet (TLP) with a header containing the address and other control information. The DLL prepends a framing token (STP) and appends a CRC to the TLP to create a Link Packet. This is then split into lanes and serialized by the PHY. The process happens in reverse for memory transactions from device to host, to go from serialized Link Packets back to host memory transactions.

In practice, many architectures (including Ultrascale+) break the PHY into two parts: an upper Media Access Control (MAC) layer and a lower layer still called the PHY. These are connected by the standard PHY Interface for PCI Express (PIPE), published by Intel. It's also useful to add an explicit AXI-PCIe bridge layer above the TXL when the native memory bus is AXI, as it is in the Ultrascale+ architecture. This would be an example of what some references call the Application Layer. Expanded this way, the stack looks like this:

Different Xilinx IPs cover different layers of the stack, as shown above. PG239 (PCI Express PHY) is a low-level (PIPE down) PCIe PHY wrapper for the GTH/GTY serial transceivers. PG213 (UltraScale+ Devices Integrated Block for PCI Express) covers the PCIE4 hardware block that includes the TXL, DLL, and MAC layers, and interfaces to the PHY via PIPE. And PG194 (AXI Bridge for PCI Express Gen3 Subsystem) includes the AXI-PCIe bridge layer on top of the PCIE4 hardware block and PHY. (For Ultrascale+, this is technically implemented as a configuration of PG195, but the relevant documentation is still in PG194.)

All of these Xilinx IPs are included in Vivado at no additional cost, but not every device has the PCIE4 block(s) needed to instantiate PG213 or PG194/PG195. For the Zynq Ultrascale+ line, the product tables show how many PCIe lanes are supported by integrated PCIE4 blocks for each device. In general, the larger and more expensive chips have more available PCIe hardware. But there are exceptions like the ZU6xx, ZU9xx, and ZU15xx, which have none. These can still instantiate PG239, but require a PCIe soft IP to implement the rest of the stack.

Each layer communicates with the next through a data bus that's sized to match the speed of the link. The example above is for a Gen3 x4 link, which supports 32Gb/s of serial data in each direction. In the Ultrascale+ implementation, the 250MHz clock for the 128b internal datapath is derived from the PCIe reference clock, so all layer logic is synchronous with the PHY. This seems like a perfectly-balanced data pipeline, with 32Gb/s of data coming in and going out in each direction. But in practice, overheads limit the maximum link efficiency.

First, PCIe Gen3 uses 128b/130b encoding: for each 128b serial data payload on each lane, a 2b sync header is prepended to create a 130b block. The sync bits tell the receiver whether the block is data or an Ordered Set (control sequence). In order to make room for the sync bits, PIPE requires one invalid data clock cycle in every 65-clock period.

The period for skipping data on the 250MHz side of the PHY is 260ns, while the period for a 130b serial output block is only 16.25ns, so the PHY must implement buffering and a SERDES gearbox to make this work. The effect of the sync bits can be seen in the protocol analyzer raw data, where there are occasionally 1ns gaps in the timestamp. (The full serial data rate including sync bits would be exactly 4B/ns.) These leap-nanoseconds add up to an overall efficiency of 98.5% (64/65), as can be seen by plotting the starting timestamp of each block.

Next, transmitters are required to periodically stop transmitting data and send a SKP Ordered Set (SKP OS), which is used to compensate for clock drift. This should happen every 370-375 blocks, and the SKP OS takes one block to transmit. Stopping the data stream also requires sending an EDS token, which may require one additional block depending on packet alignment. But even in a worst-case scenario this still represents about 99.5% (368/370) efficiency.

We can see the EDS tokens and SKP OS at regular intervals in both directions on the protocol analyzer. Interestingly, the average interval in the Host-to-Device direction is on the short side (365 blocks). Maybe it's not accounting for the 64/65 PIPE TxDataValid efficiency described above. The interval is controlled by the MAC layer, which is in PG213 in this case, so I don't think it's something that can be adjusted. The Device-to-Host direction is spot-on in this case, with a 371-block interval.

DLLs also exchange Data Link Layer Packets (DLLPs) for Ack/Nak and flow control of TLPs. These packets are short (6B), but they must be transmitted with enough regularity to meet latency requirements and ensure receiver buffers don't overflow. There's no simple rule for when these are transmitted, only a set of constraints based on the link operating conditions. To get a feel for the typical link efficiency impact of DLLP traffic, we can look at a 100μs section of bulk data transfer and add up the combined contribution of all DLLPs:

In total, there were 237 DLLPs transmitted in the Host-to-Device direction. Since the packets must be lane-0-aligned on an x4 link, they actually occupy 8B each. This is 1896B of overhead for nearly 400000B of data, again around 99.5% efficiency. This example is mostly unidirectional data transfer from host to device, though. If the device was also sending data to the host, there would be far more Acks going in the Host-to-Device direction. If the Ack count were similar to that of the Device-to-Host direction in this example, the efficiency would drop to around 95%.

Lastly, the biggest overhead is usually for TLP packetization. The TLP header is either 12B or 16B. The DLL adds a 4B framing token (STP) and a 4B Link CRC (LCRC). The payload size can be as high as 4096B, although it's limited to 1024B in the Ultrascale+ implementation (PG213). It's also common for devices to limit the max payload size to 128B, 256B, or 512B, depending on the capability of their PCIe Controller. This gives a range of 84.2% (128/150) to 98.1% (1024/1044) for packetization efficiency with optimally-sized transfers on Ultrascale+ hardware.

In the example capture, data is transferred from host to device in 128B-payload TLPs:

The packet has 20B of overhead for 128B of data, which would be an 86.5% efficiency. However, the host controller also inserts 12B of logical idle (zeros) to align the next STP token to the start of a block. This isn't required by the PCIe protocol, but may be inherent in the implementation of the controller. For this payload size, it drops the efficiency to 80% (128/160). 

That packetization efficiency dominates the overall link efficiency, which hovers between 75% and 80% during periods of stable data transfer:

In this case, increasing the max payload size would have the most positive impact on throughput. PG213 can go up to 1024B, but the device controller may be the limiting factor.

In PCIe 6.0, a big change will be introduced that removes sync bits and consolidates DLLPs, framing tokens, and the LCRC into a fixed 20B overhead in each 256B unit (called a FLIT, for Flow Control unIT). This implies a fixed 92.2% efficiency for everything other than the SKP OS and TLP header overhead, and also a fixed latency for Ack/Nak and flow control, a nice simplification.

But for now we're still in the realm of PCIe Gen3, where we can expect an overall link efficiency in the 75-95% range, depending on the variety of factors described above as well as details of the controller implementations.

The packetization and flow control functions described above are the domain of the Transaction Layer and Data Link Layer, but there are also some really interesting functions of the MAC and PHY layers that facilitate reliable serial data transfer across the physical link. These will have to be topics for one or more future posts, though.

Sunday, May 7, 2023

PCIe Deep Dive, Part 1: Tool Hunt

Over the past few years, I've been developing and improving very fast standalone NVMe-based storage capability for the Zynq Ultrascale+ architecture, to keep up with the absurd speeds of modern SSDs. (Drives like the Seagate Firecuda 530 and Sabrent Rocket 4 Plus-G can now hit 3GB/s+ sustained TLC write speeds, with much higher pSLC cache peaks.) But my knowledge pretty much ended at the interface to the Xilinx DMA/Bridge Subsystem for PCI Express (PG194/PG195). In the usual fashion, I'm now going to dive deeper to explore in more detail how the AXI-PCIe bridge works, and what the PCIe stack actually looks like.

Something I found interesting about PCIe in general is that there seems to be a pretty large barrier built up around the black box. Even just finding learning resources is much harder than it should be. The best I found was PCI Express Technology 3.0 and some accompanying material by MindShare, but even that seems like a prose wrapper on top of the specification. There isn't anything that I would consider a beginner's guide, like you might find for USB or Ethernet.

[Edit by Future Shane] There is a very good series of four articles from Simon Southwell starting here that offers a thorough introduction to PCIe. Definitely check it out if you're going to be exploring PCIe.

For physical tools, the situation is even more bleak. The speeds in PCIe Gen3 (8GT/s) put it in the range where an oscilloscope that can actually measure the signal will cost more than a car. But for all but the lowest-level hardware debugging, a digital capture would suffice, and that's where a protocol analyzer would be nice. Unfortunately, there is no Wireshark equivalent for PCIe; protocol analyzers for it are dedicated hardware that only a few companies develop, and they are priced astronomically.

That is...unless you scout them on eBay for a year.

Biggest "that escalated quickly" of my test equipment stack (ref. PicoScopes below table).

This is a used U4301B that I got in what has to be my second-best eBay score of all time, for less than it would have cost me to rent one for a month. There are only ever a handful of them up for auction at any given time, and the market is so small that the price is basically random, so if you're actually looking for one I can only wish you luck. This one goes up to Gen3 x8, which is fine for my purposes. If you only need Gen1/2 capability, the situation is much better.

[Edit by Future Shane] There is one listed on eBay for a good price right now if anyone else is looking for one. (I'll remove this note after it's no longer available.)

The U4301B is actually just the instrument in the bottom slot of the M9505A AXIe Chassis. This is meant to connect to a PCIe slot on a host machine using an iPass cable and interface card. Newer versions of the chassis controller have a laptop-friendly Thunderbolt connection instead. I "upgraded" mine using an eGPU enclosure, the smaller black box sitting on top.

I said that the U4301B was my second-best eBay score of all time, and that's because the number one is the U4322A probe that I got to go with it, from a different auction. The protocol analyzer is useless without a probe or interposer, and those are even harder to find used. I have never seen a U4322A on eBay before or since the one I got, and all other online listings for them are dead-ends. So the fact that I got one for what might as well be free compared to the new cost is just plain luck.

It was, however, a lot broken...

The probe has two rows of spring-loaded contacts that are meant to touch down on test pads for the PCIe signals. Unfortunately, mine was missing several pins and many others were bent or broken. It had been treated like a scrap cable, rather than a delicate probe. No problem, though, I can just replace the spring pins with some equivalent Mill-Max parts...

...oh, well shit.

This was one of the most ridiculous things I have ever seen under the microscope. Each spring pin has a surface-mount resistor soldered into its tip, and encased in epoxy. What the multi-GHz fuck is going on with these? Well, I suspect they each make up part of a passive probe, also called a Low-Z or Z0 probe. This video explains the concept in detail; it's forming a resistive divider with the 50Ω termination. But it must have extremely low capacitance on the input side of the resistor, hence the resistors embedded in the tips. The good news is that there are no amplifiers in the probe head, so there's not much else that can be broken.

There's no replacement for these pins, so the ones that were missing or broken were a lost cause. But luckily there were enough intact ones to make a full bidirectional x4 link, which is all I really needed. They weren't all in the right locations, so I had to carefully rearrange them with a soldering iron, taking care to use as little solder as possible while still making a strong connection. After making the x4 link, there are only a couple of spare pins remaining, so I need to be very careful with this probe.

Actually the U4322A was not my first choice; what I really wanted was a U4328A M.2 interposer, which taps off the signals at an M.2 connector bridge. But I can convert my basically free U4322A into that using a basically free circuit board. This board just has the test pad footprint for the U4322A in between a short M.2 extension. I carefully mounted the U4322A to the board with standoffs and don't really intend to ever take it off again.

Somewhat to my surprise, this collection of parts actually does work. I was worried that there would be some license nonsense involved, but the instrument license seems to go with the instrument. The host software doesn't require a separate license and worked right away, even through my weird Thunderbolt eGPU enclosure hack. And that's really where the value is. It wouldn't be hard to make an in-system PIPE traffic logger on a Zynq Ultrascale+, and I might do that anyway, but parsing and visualizing the data in a convenient way takes a lot of effort. With the LPA Software, you just get nice graph and packet views straight away:

This all seems like a lot of effort for probing an interface that's now at least two generations old. All this equipment is outdated and could for sure be replaced with a single-board interposer based on a Zynq Ultrascale+. All it needs is two GTH quads, a bunch of RAM, and a high-speed interface to the outside world. But I don't think Keysight or Teledyne LeCroy are interested in that - Gen5 is where the money is. Interestingly, though, the new Keysight Gen5 analyzer is a single-board interposer.

But for now I have Gen3 protocol analysis capability, which is good enough for my purposes. I've used it a bunch in the past few months to explore the different layers of the PCIe stack and components within. There are some really interesting parts that I may cover in future posts. But I'll probably start with an overview of the whole stack, and where the available Xilinx IPs fit into it, since even that is a little confusing at first. There are hard and soft (i.e. HDL) components to it, and not every device has an out-of-the-box solution for making the whole stack. That's enough material for an entire post though, so I'll end this one here.

Saturday, October 9, 2021

Zynq Ultrascale+ Bare Metal NVMe: 2GB/s with FatFs + exFAT

This is a quick follow-up to my original post on speed testing bare metal NVMe with the Zynq Ultrascale+ AXI-PCIe bridge. There, I demonstrated a lightweight NVMe driver running natively on one Cortex-A53 core of the ZU+ PS that could comfortably achieve >1GB/s write speeds to a suitable M.2 NVMe SSD, such as the Samsung 970 Evo Plus. That's without any hardware acceleration: the NVMe queues are maintained in external DDR4 RAM attached to the PS, by software running on the A53.

I was actually able to get to much higher write speeds, over 2.5GB/s, writing directly to the SSD (no file system) with block sizes of 64KiB or larger. But this only lasts as long as the SLC cache: Modern consumer SSDs use either TLC or QLC NAND flash, which stores three or four bits per cell. But it's slower to write than single-bit SLC, so drives allocate some of their free space as an SLC buffer to achieve higher peak write speeds. Once the SLC cache runs out, the drive drops down to a lower sustained write speed.

It's not easy to find good benchmarks for sustained sequential writing. The best I've seen are from Tom's Hardware and AnandTech, but only as curated data sets in specific reviews, not as a global data set. For example, this Tom's Hardware review of the Sabrent Rocket 4 Plus 4TB has good sustained sequential write data for competing drives. And, this AnandTech review of the Samsung 980 Pro has some more good data for fast drives under the Cache Size Effects test. My own testing with some of these drives, using ZU+ bare metal NVMe, has largely aligned with these benchmarks.

The unfortunate trend is that, while peak write speeds have increased dramatically in the last few years, sustained sequential write speeds may have actually gotten worse. This trend can be seen globally as well as within specific lines. (It might even be true within different date codes of the same drive.) Take for example the Samsung 970 Pro, an MLC (two bit per cell) drive released in 2018 that had no SLC cache but could write its full capacity (1TB) MLC at over 2.5GB/s. Its successor, the 980 Pro, has much higher peak SLC cache write speeds, nearing 5GB/s with PCIe Gen4, but dips down to below 1.5GB/s at some points after the SLC cache runs out.

Things get more complicated when considering the allocation state of the SSD. The sustained write benchmarks are usually taken after the entire SSD has been deallocated, via a secure erase or whole-drive TRIM. This restores the SLC cache and resets garbage collection to some initial state. If instead the drive is left "full" and old blocks are overwritten, the SLC cache is not recovered. However, this may also result in faster and more steady sustained sequential writing, as it prevents the undershoot that happens when the SLC cache runs out and must be unloaded into TLC.

So in certain conditions and with the right SSD, it's just possible to get to sustained sequential write speeds of 2GB/s with raw disk access. But, what about with a file system? I originally tested FatFs with the drive formatted as FAT32, reasoning (incorrectly) that an older file system would be simpler and have less overhead. But as it turns out, exFAT is a much better choice for fast sustained sequential writing.

The most important difference is how FAT32 and exFAT check for and update cluster allocation. Clusters are the unit of memory allocated for file storage - all files take up an integer number of clusters on the disk. The clusters don't have to be sequential, though, so the File Allocation Table (FAT) contains chained lists of clusters representing a file. For sequentially-written files, this list is contiguous. But the FAT allows for clusters to be chained together in any order for non-contiguous files. Each 32b entry in the FAT is just a pointer to the next cluster in the file.

FAT32 cluster allocation entirely based on 32b FAT entries.

In FAT32, the cluster entries are mandatory and a sequential write must check and update them as it progresses. This means that for every cluster written (64KiB in maxed-out FAT32), 32b of read and write overhead is added. In FatFs, this gets buffered until a full LBA (512B) of FAT update is ready, but when this happens there's a big penalty for stopping the flow of sequential writing to check and update the FAT.

In exFAT, the cluster entries in the FAT are optional. Cluster allocation is handled by a bitmap, with one bit representing each cluster (0 = free, 1 = allocated). For a sequential file, this is all that's needed. Only non-contiguous files need to use the 32b cluster entries to create a chain in the FAT. As a result, sequential writing overhead is greatly reduced, since the allocation updates happen 32x less frequently.

exFAT cluster allocation using bitmap only for sequential files.

The cluster size in exFAT is also not limited to 64KiB. Using larger clusters further reduces the allocation update frequency, at the expense of more dead space between files. If the plan is to write multi-GB files anyway, having 1MiB clusters really isn't a problem. And speaking of multi-GB files, exFAT doesn't have the 4GiB file size limit that FAT32 has, so the file creation overhead can also be reduced. This does put more data "at risk" if a power failure occurs before the file is closed. (Most of the data would probably still be in flash, but it would need to be recovered manually.)

All together, these features reduce the overhead of exFAT to be almost negligible:

With 1MiB clusters and 16GiB files, it's possible to get ~2GB/s of sustained sequential file writing onto a 980 Pro for its entire 2TB capacity. I think this is probably the fastest implementation of FatFs in existence right now. The data block size still needs to be at least 64KiB, to keep the driver overhead low. But if a reasonable amount of streaming data can be buffered in RAM, this isn't too much of a constraint. And of course you do have to keep the SSD cool.

I've updated the bare metal NVMe test project to Vivado/Vitis 2021.1 here. It would still require some effort to port to a different board, and I still make no claims about the suitability of this driver for any real purposes. But if you need to write massive amounts of data and don't want to mess around in Linux (or want to try something similar in Linux user space...) it might be a good reference.

Sunday, September 12, 2021

TinyCross: New UI and Front Wheel Traction Control

 In the last post, I finally did some actual data logging with TinyCross set up in 4WD, 80A peak per motor, which is the rated current. Based on tinyKart, I know they can handle a a bit more for short durations, maybe even up to 120A. But the data logs (and many instances of having rocks flung into my face) demonstrate that the front wheels reach their traction limit somewhere around 60A on asphalt.

The behavior of front wheel slip on a go-kart is something new to me. In a straight line, the initiation of the slip and the acceleration of the wheel actually isn't the biggest problem. It's when the wheel regains traction and slows down that bad things happen. The restored grip combines with the energy being dumped from the wheel's moment of inertia to generate a quick pulse of torque on that side, which creates a lot of torque steer.

To deal with this, I wanted to implement some form of traction control, at least for the front wheels, so that I could get the most torque out of them as possible without the steering disturbances and rock shooting. But first, I needed a way to easily configure both the motor currents and the traction control settings without having to drag around my laptop everywhere. So, I finally built out the steering wheel UI to include a bunch of settings:

Sorry for the exposure; it's the only way to capture the full OLED refresh period.

Anyone familiar with the MōVI Controller might recognize the OLED display. I chose this for daylight visibility and responsiveness (~50Hz update rate). The menu interface is essentially the same as the one I built the day before NAB 2014... The left knob scrolls through the menu. The right knob adjust settings and, by clicking or holding, performs actions.

In the four corners are three motor parameters for the corresponding motors: S for Status, which shows error codes. F for Forward peak current, and R for Reverse (braking, or actually reversing) peak current. Setting both to zero masks out the CAN command from that motor, triggering a timeout that turns off the gate drivers entirely. A click and hold on S triggers an encoder recalibration for that motor.

In the second column from the left, the first three settings relate to data logging: LS for Logger Status, FN for File Number (click to start a new file), and LT for Logger Time, the time in [ms] for a single row of the data log to be written. Then, there are two parameters for tuning traction control: TT for Traction Threshold, and TG for Traction Gain, which I will explain shortly.

The reason I wanted to be able to adjust peak currents from the steering wheel is because I agree with this early Tesla blog post: "...it's much safer to avoid wheelspin altogether than react to it." If I know the surface supports front wheel current around 60A, there's not much point in setting it higher than that. But, I want to be able to set it higher for testing, or adjust it for different surfaces.

As for the traction control itself, there are a lot of corner cases to think about in 4WD, but the main problem I'm trying to solve is front wheel slip. If I assume the rear wheels are not slipping, then I can use their average speed as a reference. From there, it's easy to see if a front wheels is running faster than that reference, and reduce the current to that motor if so. This only needs two settings: a Traction Threshold (TT) that sets how much wheel slip is allowed, and a Traction Gain (TG) that sets how much to reduce the current per unit slip above the threshold. The Traction Threshold prevents overactuation in normal conditions and allows for speed differential due to turning radius.

But what happens if a rear wheel does slip? Well, then the front wheel might slip too. At that point, I'm probably in some kind of a four wheel sideways drift anyway, so alternate control laws are going to apply. Being able to trigger some rear wheel slip with the throttle is part of the fun, too, so having complete 4WD traction control isn't something I necessarily need to solve.

With the new UI setup and the simple front wheel traction control in place, it was time to do some tuning...

...or not.

At first, everything seemed to be going okay. I did a couple of runs at 60A front current and 80A rear current and the traction control seemed to be working as intended. But then during light regenerative braking at around 30mph, I heard the all-too-familiar sound of a FET popping, followed by some more bad noises and smells from the front drive. Upon inspection, only two FETs actually died, but they also took out many of the power traces, meaning this board was trash.

So what happened? Well, unfortunately, the data log was not very helpful in this case. It did show the speed (30mph) and current command (around -10A), but nothing out of the ordinary up until the point of failure. There is only one data point showing a Q-Axis current of 286A on the front left motor, followed by an undervoltage fault, which might have been the battery sagging or the power input traces getting blown up. So whatever happened, happened quick.

It's been a while since I've actually destroyed a motor controller, so I was a little disappointed. But after some thought, I didn't think this was due to the new traction control stuff. That's only applied during acceleration, and this failure definitely happened under braking. I think it's more likely that the front left motor just lost sync and the back EMF at 30mph was high enough to do damage. Up until now, I have only had a relatively slow overcurrent limit of 160A (or more) for 10ms. These FETs have a pretty insane Safe Operating Area (SOA), but that limit does leave room for exceeding it with currents above 400A:

This system could easily generate a 400A transient if a motor loses sync at 30mph. And the motor position and speed data does cut out at the same data point as the failure. But that's not enough to determine cause and effect. So for now I can only make changes that might help and hope for the best. I added in several more stages of faster overcurrent protection, up to 300A for a single ADC/PWM cycle (42.7μs). These overlap enough to cover the entire R_DS(on)-limited boundary of the SOA (up to the pulse rating of 1450A for 100μs!).

A faster overcurrent trip doesn't help with whatever caused the motor to lose sync in the first place (if that is what happened). I have seen at least a couple previous instances where the encoders, which supply emulated Hall effect sensor signals, have behaved as if they were completely reset. Even though I only use the buffered and optically isolated virtual Hall effect sensor signals for commutation, I was still reading the SPI data anyway. Maybe a SPI read got corrupted by noise and turned into a write that either reconfigured or entirely reset the encoder mid-run? To protect against this, I now disabled the SPI transactions entirely other than during initialization and calibration.

So with these changes and my last and only spare drive, I went back out for another try. This time, I ran into no motor drive issues and was actually able to test and tune the front wheel traction control as I originally intended. The difference is immediately obvious while driving and in the data. First, a test at 80A front, 90A rear, with no traction control:

Front wheel traction control off.

As before, the front right wheel starts slipping at about 60A and spins up to 2-3x the actual ground speed. The front right always seems to lose grip first, a mystery to solve another day. When I let off the throttle and it regains traction, the torque pulse creates substantial torque steer, jerking the steering wheel almost 20º to the left, which I then have to counteract immediately to stay on course. Overall, it's impossible to sustain peak acceleration for more than a second or so before having to deal with the wheel spin and torque steer.

And now with the same currents, but front wheel traction control on:

Front wheel traction control on.

The front right (FR) current now averages a bit below 60A and its speed is held to just a small margin above the actual ground speed. It's never able to build up momentum and then "catch", inducing torque steer. This allows continuous acceleration up to and past 30mph. The front left (FL) also starts to slip in the 20-30mph range, but the traction control catches it too. The overall result is a much more controllable launch and far fewer rocks being thrown up by the front wheels.

After finding traction control settings that I liked, I switched back to current settings that more closely match the actual traction limits: 60A front and 100A rear. This still gives a reasonable 0.45g launch, but with less likelihood of triggering the traction control on asphalt. I'd like to push to >0.5g, to match tinyKart's most extreme configuration, but that'll either require 120A on the rear or changing the gear ratio a bit. At 60A / 100A, the front motors still share enough of the load that the rear motors stay at healthy temperature after some acceleration runs:

Rear motors are doing most of the work, but...

...they are at a reasonable temperature.

And finally I did some less structured testing by just driving through the gravel corner in my parking lot and intentionally adding throttle to induce slip. It behaves pretty well, slipping and oversteering about the right amount to be controllable but still fun:


I think at this point most of the handling bottlenecks are back on the mechanical side. There's a small amount of backlash in the steering column that definitely exaggerates the residual torque steer, especially at high speeds. It's almost all coming from the U-joint, which I may try to shim or replace with one with tighter tolerances. Other than that, I need to do some suspension geometry tweaking to improve handling of lateral transients. Speaking of which, here's one last data capture. See if you can figure out what's going on here...

Mystery data log.