Saturday, November 9, 2024

PCIe Deep Dive, Part 5: Flow Control

In Part 2, I described PCIe as a bidirectional memory bus extension and looked at some factors that contribute to the efficiency of the link. A PCIe 3.0 x4 link, with 8GT/s on each lane and an efficiency of around 90%, can support bidirectional data transfer of around 3.6GB/s. But this assumes both sides of the link can consume data that quickly. In reality, a PCIe function is subject to other constraints, like local memory access, that might limit the achievable data rate to below the theoretical maximum. In that case, it's the job of the Data Link Layer's flow control mechanism to throttle the link partner so that it doesn't overflow the receiver's buffers.

For a full-duplex serial interface, flow control can be as simple as two signals used to communicate readiness to receive data. In UART, for example, the RTS/CTS signals serve this purpose: A receiver de-asserts its RTS output when its buffer is full. A transmitter can only transmit when its CTS input is asserted. The RTS outputs of each receiver are crossed over to the CTS inputs of the other side's transmitter to enforce bidirectional flow control.
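
To make the cross-over concrete, here's a minimal sketch in Python (a toy model, not real UART driver code) of two endpoints where each side's RTS tracks its receive buffer occupancy and gates the other side's transmitter via CTS:

```python
from collections import deque

class UartEndpoint:
    """Toy model of one side of a UART link with RTS/CTS flow control."""
    def __init__(self, buffer_size):
        self.rx_buffer = deque()
        self.buffer_size = buffer_size

    @property
    def rts(self):
        # RTS stays asserted only while there is room to receive more data.
        return len(self.rx_buffer) < self.buffer_size

    def receive(self, byte):
        self.rx_buffer.append(byte)

def try_transmit(rx_side, byte):
    # The receiver's RTS output is wired to the transmitter's CTS input,
    # so the transmitter samples CTS before sending each byte.
    cts = rx_side.rts
    if cts:
        rx_side.receive(byte)
    return cts

rx = UartEndpoint(buffer_size=2)
print([try_transmit(rx, byte) for byte in b"abc"])  # [True, True, False]: buffer fills, CTS de-asserts
```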

This works well for homogeneous streams of data, but PCIe is packetized, and Transaction Layer Packets (TLPs) can carry a variable amount of data in their payload. Additionally, packets representing write requests, read requests, and read completions need different handling. For example: if both sides are saturated with read requests, they need to be able to block further requests without also blocking completions of outstanding reads. So, a more sophisticated flow control system is necessary.

PCIe receivers implement six separate buffers to accommodate the different types of TLP. These are divided into three categories:

  1. Posted Requests (P). These are Request TLPs that don't require a Completion, such as memory writes. Once the TLP has been acknowledged, the requester assumes the remote memory will be written with the TLP data.
  2. Non-Posted Requests (NP). These are Request TLPs that do require a Completion, such as memory reads. Configuration reads and writes also fall into this category.
  3. Completions (CPL). These are Completion TLPs completing a Non-Posted Request, such as by returning data from a memory read.

For each of these categories, there is a Header buffer and a Data buffer. The Header buffer stores TLP header information, such as the address and size for memory reads and writes. The Data buffer stores TLP payload data. Separating the Header and Data buffers allows more flow control flexibility. A receiver can limit the number of TLPs received by using Header buffer flow control, or limit the amount of data received by using Data buffer flow control. In practice, both are used simultaneously depending on the external constraints of the PCIe function.
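
To keep the bookkeeping straight later on, it helps to name the six resulting credit pools. The sketch below uses the usual PH/PD/NPH/NPD/CplH/CplD shorthand, along with which pools a few common TLP types draw from; the Python labels and the (non-exhaustive) table are my own illustration:

```python
from enum import Enum, auto

class CreditType(Enum):
    PH   = auto()  # Posted request headers (e.g. memory writes)
    PD   = auto()  # Posted request data
    NPH  = auto()  # Non-posted request headers (e.g. memory reads, config accesses)
    NPD  = auto()  # Non-posted request data (e.g. config write payload)
    CPLH = auto()  # Completion headers
    CPLD = auto()  # Completion data

# Which credit pools a few common TLP types draw from (illustrative, not exhaustive)
TLP_CREDIT_USE = {
    "MemWr": (CreditType.PH,   CreditType.PD),    # posted, with payload
    "MemRd": (CreditType.NPH,  None),             # non-posted, no payload
    "CfgWr": (CreditType.NPH,  CreditType.NPD),   # non-posted, with payload
    "CplD":  (CreditType.CPLH, CreditType.CPLD),  # completion with data
}
```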

The PCIe specification only describes these buffers conceptually; it doesn't mandate any specific implementation. In an FPGA, the Data buffers could reasonably be built from block RAM, since they need to store several TLPs' worth of payload data (on the order of KiB). The Header buffers only need to store (at most) four or five Double Words (DW = 32b) per TLP, so they could reasonably be built from distributed (LUT) RAM instead. A hypothetical Ultrascale+ implementation is shown below.

The Data buffers are built from two parallel BRAMs (4KiB each). This provides a R/W interface of 128b, matching the datapath width for a PCIe 3.0 x4 link. Conveniently, the unit of flow control for data is defined to be 4DW (128b), so each BRAM address is one unit of data. With 8KiB of total memory, these buffers could, for example, hold up to 16 TLPs worth of optimally-aligned data on a link with a Max. Payload Size of 512B. The capacity could be expanded by cascading more BRAMs or using URAMs, depending on the design requirements.

The Header buffers are built from LUTRAMs instead, with a depth of 64. Each entry represents the header of a single TLP, and may be associated with a block of data in the Data buffer. One header is also defined to be one unit of flow control. A single SLICEM (8 LUTs) can make a 64x7b simple dual-port LUTRAM. These can be parallelized up to any bit width, depending on how much of the header is actually used in the design. The buffer could also be made deeper if necessary, by using more LUTRAMs and MUXes.
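
Working out the credit counts implied by this hypothetical buffer arrangement (the 8KiB and 64-entry numbers come from the example above, not from the spec):

```python
DATA_CREDIT_UNIT_BYTES = 16      # one data flow control unit = 4DW = 16B

data_buffer_bytes = 2 * 4096     # two parallel 4KiB BRAMs, 128b combined width
data_credits = data_buffer_bytes // DATA_CREDIT_UNIT_BYTES    # 512
header_credits = 64              # LUTRAM depth; one entry = one header credit

max_payload_bytes = 512          # example Max. Payload Size
tlps_at_max_payload = data_buffer_bytes // max_payload_bytes  # 16

print(data_credits, header_credits, tlps_at_max_payload)      # 512 64 16
```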

For each of these buffers, the receiver issues credits to its link partner's transmitter in flow control units (4DW for Data buffers, one TLP header for Header buffers). The transmitter is only allowed to send a TLP if it has been issued enough credits (of the right type) for that TLP and its data. The receiver issues credits only up to the amount of space available in each of the buffers, updating the amount issued as the buffers are emptied by the Application Layer above.
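
As a sketch of the transmitter-side rule (my wording, not spec text): a TLP costs one header credit of its category plus one data credit per 4DW of payload, rounded up, and it may only be sent if both costs are covered by credits on hand.

```python
def credits_required(payload_bytes):
    """Credit cost of one TLP: 1 header credit + ceil(payload/16) data credits."""
    header_credits = 1
    data_credits = (payload_bytes + 15) // 16   # one credit per 4DW, rounded up
    return header_credits, data_credits

def can_transmit(hdr_available, data_available, payload_bytes):
    hdr_needed, data_needed = credits_required(payload_bytes)
    return hdr_available >= hdr_needed and data_available >= data_needed

print(credits_required(512))     # (1, 32): a 512B memory write costs 32 data credits
print(can_transmit(1, 31, 512))  # False: not enough data credits, the TLP must wait
```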

Flow control credits are initialized and then updated using specific Data Link Layer Packets (DLLPs) for each category. These packets have a common 6B structure with an 8b counter for Header credits and a 12b counter for Data credits:
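
A rough sketch of packing those counters into the DLLP payload. The bit positions and the 0x80 UpdateFC-P type code are my reading of the PCIe 3.0 DLLP format, so treat them as assumptions to check against the spec; the trailing 16b CRC is omitted.

```python
def pack_fc_dllp(dllp_type, hdr_fc, data_fc):
    """Pack the 4B content of a flow control DLLP (the 2B CRC-16 is not shown).

    Assumed layout: bits [31:24] DLLP type, [21:14] HdrFC, [11:0] DataFC,
    remaining bits reserved (zero). A zero value at InitFC time advertises
    "infinite" credits for that counter.
    """
    assert 0 <= hdr_fc < 256 and 0 <= data_fc < 4096
    word = (dllp_type << 24) | (hdr_fc << 14) | data_fc
    return word.to_bytes(4, "big")

# e.g. an UpdateFC-P (assumed type code 0x80 for VC0) advertising running
# totals of HdrFC = 64 and DataFC = 527
print(pack_fc_dllp(0x80, 64, 527).hex())   # '8010020f'
```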

InitFC1- and InitFC2-type DLLPs are sent as part of a state machine during flow control initialization to issue the starting number of credits for each category (P, NP, Cpl). In the example above, the initial header and data credits could be 64 and 512 (or 63 and 511, to be safe). It's also possible for the receiver to grant "infinite" credits for a particular category of TLP by setting the initial value of HdrFC and/or DataFC to zero. In fact, for normal Root Complex or Endpoint operation, this is required for Completion credits. A Request is only transmitted if there is room in the local receiver buffers for its associated Completion.

UpdateFC-type DLLPs are sent periodically during normal operation to issue more credits to the link partner. Typically, credits are incremented as the Application Layer reads out of the associated buffer. In the case of an AXI Bridge, this could be when the AXI transaction is in progress. The values of HdrFC[7:0] and DataFC[11:0] are the total number of credits issued, mod 256 (header) or 4096 (data). For example, if the initial data credit was 511 and then 64DW were read out of the data buffer, the first UpdateFC value for DataFC could be 511 + 64/4 = 527.

This is mostly equivalent to transmitting the amount of buffer space available, but the head/tail index subtraction is left to the transmitter. It can compare the total credits issued to its local count of total credits consumed (mod 256 or 4096) to check whether there is enough space available for the next TLP. To prevent overflow, the receiver never issues more than 127 header credits or 2047 data credits beyond its current buffer head. (This also means that there's no reason to have header buffers larger than 128 entries, or data buffers larger than 2048x4DW = 32KiB.)
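
Here's a sketch of that arithmetic from the transmitter's point of view, reusing the numbers from the text (data credits shown; the header case is identical with mod 256):

```python
DATA_FIELD_MOD = 4096   # DataFC is a 12-bit running total
HDR_FIELD_MOD = 256     # HdrFC is an 8-bit running total

def credits_available(credit_limit, credits_consumed, modulus=DATA_FIELD_MOD):
    # Works across wrap-around because the receiver never advertises more than
    # half the modulus (minus one) beyond what the transmitter has consumed.
    return (credit_limit - credits_consumed) % modulus

# Example from the text: initial limit of 511 data credits, then 64DW
# (16 credits) freed by the application layer -> next advertised limit is 527.
limit, consumed = 511, 0
limit = (limit + 64 // 4) % DATA_FIELD_MOD
print(limit)                                      # 527
print(credits_available(limit, consumed))         # 527 credit units of space
print(credits_available(limit, consumed) >= 32)   # True: room for one more 512B TLP
```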

The receiver handles scheduling of UpdateFCs for all buffers that didn't advertise infinite credits during initialization. The PCIe 3.0 specification does have some rules for scheduling UpdateFCs immediately when the link partner's transmitter might be credit-starved, but otherwise there is only a maximum period requirement of 30-45μs. If the buffer has plenty of space, it can be better to space out updates for higher bus efficiency. (Starting with PCIe 6.0, the flow control interval and associated bus efficiency will become constant as part of FLIT mode, a nice simplification.)
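
One plausible (not spec-mandated) way to schedule them: send an UpdateFC whenever the periodic timer expires, or early if enough credits have been freed since the last update that the link partner could plausibly be starved. The threshold and timer values below are just placeholders.

```python
UPDATE_PERIOD_US = 30    # maximum interval between UpdateFCs (placeholder value)
EARLY_THRESHOLD = 32     # send early once this many credits have been freed (placeholder)

def should_send_update_fc(us_since_last_update, credits_freed_since_last_update):
    if us_since_last_update >= UPDATE_PERIOD_US:
        return True      # satisfy the periodic requirement
    if credits_freed_since_last_update >= EARLY_THRESHOLD:
        return True      # partner may be credit-starved; update early
    return False
```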

A specific application may not actually need all six buffers to be the same size or type. For example, an NVMe host would barely need any Completion buffer space, since its role as a Requester is limited to modifying NVMe Controller Registers, usually one 32b access at a time. It's also unlikely to have any need for Non-Posted Requests with data. The vast majority of its receiver TLP traffic will be memory writes (PH/PD) and memory reads (NPH).

With a PCIe protocol analyzer, it's possible to see the flow control in action. The following is a trace recorded during a 512KiB NVMe read on a PCIe Gen3 x4 link with a Max. Payload Size of 512B. The host-side memory write throughput and DataFC available credits are plotted.

The whole transaction takes 1024 memory write request TLPs, each with 512B payload data. These occur over about 146μs, for an average throughput of about 3.6GB/s. The peak throughput is a little higher, though, at around 3.8GB/s. This means the PCIe link is slightly faster than the AXI bus, and as a result, DataFC credits are consumed faster than they are issued at first. Once the available credits drop below 32, there isn't enough room for a 512B TLP and the transmitter (on the NVMe SSD) is throttled. The link reaches steady-state operation at the AXI-limited throughput.
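
The back-of-the-envelope numbers behind that trace (my arithmetic, not analyzer output):

```python
tlp_count = 1024
payload = 512                       # bytes per memory write TLP
total_bytes = tlp_count * payload   # 512 KiB
duration_s = 146e-6

print(total_bytes / duration_s / 1e9)  # ~3.59 GB/s average write throughput
print(payload // 16)                   # 32 data credits per TLP: below 32 available, the SSD stalls
```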

Depending on the typical access patterns of the application, a larger data buffer could help sustain peak throughput for the entire duration of a typical transfer. Sequential transfers might still hit the credit limit threshold, but it's also possible that there's enough NVMe overhead between them for the credits to recover. This sort of optimization, along with maximizing bus efficiency, would be required to squeeze out even more application-level throughput.

In summary, when PCIe is acting as a memory bus extension, flow control extends the local memory controller's backpressure mechanism across the link. For example, if an AXI memory bus can't keep up with writes, it will de-assert AWREADY and/or WREADY. The PCIe receiver can still accept memory write TLPs as long as it has room in its buffers, but it can't issue any new PH or PD credits. When the buffers are nearly full, this transfers the backpressure to the link partner.
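
As a final sketch of that coupling (the AXI signal names are real; the surrounding glue logic is hypothetical): new PH/PD credits are issued only as buffered write TLPs drain into the AXI write channel, so a stalled AWREADY/WREADY eventually stalls the link partner's transmitter as well.

```python
def drain_posted_buffer(p_buffer, awready, wready, ph_limit, pd_limit):
    """Pop one buffered write TLP into the AXI write channel when AXI can accept
    it, advancing the advertised PH/PD credit limits by what that frees up."""
    if p_buffer and awready and wready:
        tlp = p_buffer.pop(0)
        ph_limit = (ph_limit + 1) % 256                                   # one header credit freed
        pd_limit = (pd_limit + (len(tlp["payload"]) + 15) // 16) % 4096   # 4DW data credits freed
    # If AWREADY/WREADY stay de-asserted, the limits never advance and the
    # link partner eventually runs out of PH/PD credits: backpressure crosses the link.
    return ph_limit, pd_limit
```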