Shane Colton - A collection of my personal engineering projects including small electric vehicles, motor controllers, robots, flying things, and other fun electromechanical stuff!

PCIe Deep Dive, Part 4: LTSSM (2024-01-22)

<p>The Link Training and Status State Machine (LTSSM) is a logic block that sits in the MAC layer of the <a href="https://scolton.blogspot.com/2023/06/pcie-deep-dive-part-2-stack-and.html">PCIe stack</a>. It configures the PHY and establishes the PCIe link by negotiating link width, speed, and equalization settings with the link partner. This is done primarily by exchanging Ordered Sets, easy-to-identify fixed-length packets of link configuration information transmitted on all lanes in parallel. The LTSSM must complete successfully before any real data can be exchanged over the PCIe link.<br /></p><p>Although somewhat complex, the LTSSM is a normal logic state machine. The controller executes a specific set of actions based on the current state and its role as either a downstream-facing port (host/root complex) or upstream-facing port (device/endpoint). 
These actions might include:</p><p></p><ul style="text-align: left;"><li>Detecting the presence of receiver termination on its link partner.</li><li>Transmitting Ordered Sets with specific link configuration information.</li><li>Receiving Ordered Sets from its link partner.</li><li>Comparing the information in received Ordered Sets to transmitted Ordered Sets.</li><li>Counting Ordered Sets transmitted and/or received that meet specific requirements.</li><li>Tracking how much time has elapsed in the state (for timeouts).</li><li>Reading or writing bits in PCIe Configuration Space registers, for software interaction.</li></ul><div>Each state also has conditions that trigger transitions to other states. All this is typically implemented in gate-level logic (HDL), not software, although there may be software hooks that can trigger state transitions manually. The top-level LTSSM diagram looks like this:</div><div><br /></div><div style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_03_000.svg" width="640" /></div><div><br /></div><div>The entry point after a reset is the Detect state and the normal progression is through Detect, Polling, and Configuration, to the golden state of L0, the Link Up state where application data can be exchanged. This happens first at Gen1 speed (2.5GT/s). If both link partners support a higher speed, they can enter the Recovery state, change speeds, then return to L0 at the higher speed.</div><div><br /></div><div>Each of these top-level states has a number of substates that define actions and conditions for transitioning between substates or moving to the next top-level state. The following sections detail the substates in the normal path from Detect to L0, including a speed change through Recovery. 
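The normal progression just described can be sketched as a toy transition table. This is a mental model only, nothing like production gate-level logic; the state names are the spec's, everything else is invented for illustration:

```python
# Toy model of the top-level LTSSM happy path: Detect -> Polling ->
# Configuration -> L0, with an optional Recovery round trip to change
# speed. Real implementations have many more states and arcs.

NORMAL_PATH = {
    "Detect": "Polling",
    "Polling": "Configuration",
    "Configuration": "L0",
}

def train(higher_rate_supported):
    """Return the sequence of top-level states visited on the happy path."""
    state = "Detect"
    trace = [state]
    while state != "L0":
        state = NORMAL_PATH[state]
        trace.append(state)
    if higher_rate_supported:
        # Both partners advertised a higher rate in their Training
        # Sequences: enter Recovery, change speed, return to L0.
        trace += ["Recovery", "L0"]
    return trace
```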
Not covered are side paths such as low-power states (L0s, L1, L2), since just the main path is complex enough for one post.</div><div><br /></div><h3 style="text-align: left;">Detect</h3><div style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_03_001.svg" width="640" /></div><div><br /></div><div>The Detect state is the only one that doesn't involve sending or receiving Ordered Sets. Its purpose is to periodically look for receiver termination, indicating the presence of a link partner. This is done with an implementation-specific analog mechanism built into the PHY.</div><h4 style="text-align: left;">Detect.Quiet</h4><div>This is the entry point of the LTSSM after a reset and the reset point after many timeout or fault conditions. Software can also force the LTSSM back into this state to retrain the link. The transmitter is set to electrical idle. In <a href="https://docs.xilinx.com/r/en-US/pg239-pcie-phy">PG239</a>, this is done by setting the <span style="color: #01ffff; font-family: courier;"><b>phy_txelecidle</b></span> bit for each lane. The LTSSM stays in this state until 12ms have elapsed or the receiver detects that any lane has exited electrical idle (<b><span style="color: #01ffff; font-family: courier;">phy_rxelecidle</span></b> goes low). Then, it will proceed to <span style="color: white;">Detect.Active</span>.</div><div><br /></div><div>In the absence of a link partner, the LTSSM will cycle between Detect.Quiet and Detect.Active with a period of approximately 12ms. This period, like other PCIe timeouts, is specified with a tolerance of (+50/-0)%, so it can really be anywhere from 12-18ms. This tolerance allows for efficient counter-comparison logic. 
For example, with a PHY clock of 125MHz, a count of 2^17 is 1.049ms, so a single 6-input LUT attached to <span style="color: #01ffff; font-family: courier;"><b>counter[22:17]</b></span> can just wait for 6'd12 and that will be an accurate-enough 12ms timeout trigger.</div><h4 style="text-align: left;">Detect.Active</h4><div>The transmitter for each lane attempts to detect receiver termination on that lane, indicating the presence of a link partner. This is done by measuring the time constant of the RC circuit created by the Tx AC-coupling capacitor and the Rx termination resistor. In <a href="https://docs.xilinx.com/r/en-US/pg239-pcie-phy">PG239</a>, the MAC sets the signal <b><span style="color: #01ffff; font-family: courier;">phy_txdetectrx</span></b> and monitors the result in <b><span style="color: #01ffff; font-family: courier;">phy_rxstatus</span></b> on each lane.</div><div><br /></div><div>There are three possible outcomes:</div><div><ol style="text-align: left;"><li>No receiver termination is detected on any lane. The LTSSM returns to <span style="color: white;">Detect.Quiet</span>.</li><li>Receiver termination is detected on all lanes. The LTSSM proceeds to <span style="color: white;">Polling</span> on all lanes.</li><li>Receiver termination is detected on some, but not all, lanes. In this case, the link partner may have fewer lanes. The transmitter waits 12ms, then repeats the receiver detection. If the result is the same, the LTSSM proceeds to <span style="color: white;">Polling</span> on only the detected lanes. 
Otherwise, it returns to <span style="color: white;">Detect.Quiet</span>.</li></ol><div style="text-align: left;"><br /></div><h3 style="text-align: left;">Polling</h3></div><div style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_03_002.svg" width="640" /></div><div><br /></div><div>In Polling and most other states, link partners exchange Ordered Sets, easy-to-identify fixed-length packets containing link configuration information. They are transmitted in parallel on all lanes that detected receiver termination, although the contents may vary per-lane in some states. The most important Ordered Sets for training are Training Sequence 1 (TS1) and Training Sequence 2 (TS2), 16-symbol packets with the following layouts:</div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_03_003.svg" style="margin-left: auto; margin-right: auto;" width="640" /></td></tr><tr><td class="tr-caption" style="text-align: center;">TS1 Ordered Set Structure<br /></td></tr></tbody></table><div style="text-align: left;"><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_03_004.svg" style="margin-left: auto; margin-right: auto;" width="640" /></td></tr><tr><td class="tr-caption" style="text-align: center;">TS2 Ordered Set Structure</td></tr></tbody></table><div style="text-align: left;"><br /></div><div style="text-align: left;">In the Link Number and Lane Number fields, a special symbol (PAD) is reserved for indicating that the field has not yet been configured. This symbol has a unique 8b/10b control code (K23.7) in Gen1/2, but is just defined as 8'hF7 in Gen3. 
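Many of the substate exit conditions that follow boil down to counting consecutive received Training Sequences that satisfy some check (e.g. eight consecutive TS1s with Link and Lane Numbers set to PAD). A minimal sketch of that per-lane qualification logic; the dictionary fields are my own simplification, not the spec's symbol layout:

```python
PAD = 0xF7  # Gen3 encoding of PAD; Gen1/2 use the K23.7 control code

class ConsecutiveCounter:
    """Count consecutive items matching a predicate, resetting on any miss."""

    def __init__(self, predicate, threshold):
        self.predicate = predicate
        self.threshold = threshold
        self.count = 0

    def feed(self, ts):
        """Feed one received Training Sequence; return True once satisfied."""
        self.count = self.count + 1 if self.predicate(ts) else 0
        return self.count >= self.threshold

# Example: one lane's receive condition in Polling.Active.
pad_ts = ConsecutiveCounter(
    lambda ts: ts["link"] == PAD and ts["lane"] == PAD, threshold=8)
```

In hardware this is just a small saturating counter per lane with a synchronous reset on mismatch.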
Polling always happens at Gen1 speed (2.5GT/s).</div><h4 style="text-align: left;">Polling.Active</h4><div>The transmitter sends TS1s with PAD for the Link Number and Lane Number. The receiver listens for TS1s or TS2s from the link partner.</div><div><br /></div><div>The LTSSM normally proceeds to <span style="color: white;">Polling.Configuration</span> when all of the following conditions are met:</div><div><ol style="text-align: left;"><li>Software is not commanding a transition to Polling.Compliance via the Enter Compliance bit in the Link Control 2 register.</li><li>At least 1024 TS1s have been transmitted.</li><li>Eight consecutive TS1s or TS2s have been received with Link Number and Lane Number set to PAD on all lanes, and not requesting Polling.Compliance unless also requesting Loopback (an unusual corner case).</li></ol><div>If the above three conditions are not met on all lanes after a 24ms timeout, the LTSSM proceeds to <span style="color: white;">Polling.Configuration</span> anyway if at least one lane received the necessary TS1s and enough lanes to form a valid link have exited electrical idle. Otherwise, it will assume it's connected to a passive test load and go to <span style="color: white;">Polling.Compliance</span>, a substate used to test compliance with the PCIe PHY specification by transmitting known sequences.</div></div><h4 style="text-align: left;">Polling.Configuration</h4><div>The transmitter sends TS2s with PAD for the Link Number and Lane Number. 
The receiver listens for TS2s (<i>not </i>TS1s) from the link partner.</div><div><br /></div><div>The LTSSM normally proceeds to <span style="color: white;">Configuration</span> when all of the following conditions are met:</div><div><ol style="text-align: left;"><li>At least 16 TS2s have been transmitted <i>after receiving one TS2.</i></li><li>Eight consecutive TS2s have been received with Link Number and Lane Number set to PAD on any lane.</li></ol><div>Unlike in Polling.Active, transmitted TS2s are only counted after at least one TS2 has been received from the link partner. This mechanism acts as a synchronization gate to ensure that both link partners receive more than enough TS2s to clear the state, regardless of which entered the state first.</div></div><div><br /></div><div>If the above two conditions are not met after a 48ms timeout, the LTSSM returns to <span style="color: white;">Detect</span> and starts over.</div><div><br /></div><h3 style="text-align: left;">Configuration</h3><div style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_03_005.svg" width="640" /></div><div style="text-align: left;"><span style="font-weight: 400;"><br /></span></div><div style="text-align: left;"><span style="font-weight: 400;">The downstream-facing (host/root complex) port leads configuration, proposing link and lane numbers based on the available lanes. The upstream-facing (device/endpoint) port echoes back configuration parameters, if they are accepted. The following diagram and description are from the point of view of the downstream-facing port.</span></div><h4 style="text-align: left;">Configuration.Linkwidth.Start</h4><div>The (downstream-facing) transmitter sends TS1s with a Link Number (arbitrary, 0-31) and PAD for the Lane Number. 
The receiver listens for matching TS1s.</div><div><br /></div><div>The LTSSM normally proceeds to <span style="color: white;">Configuration.Linkwidth.Accept</span> when the following condition is met:</div><div><ol style="text-align: left;"><li>Two consecutive TS1s are received with Link Number matching that of the transmitted TS1s, and PAD for the Lane Number, on any lane.</li></ol><div>If the above condition is not met after a 24ms timeout, the LTSSM returns to <span style="color: white;">Detect</span> and starts over.</div></div><h4 style="text-align: left;">Configuration.Linkwidth.Accept</h4><div>The downstream-facing port must decide if it can form a link using the lanes that are receiving a matching Link Number and PAD for the Lane Numbers. If it can, it assigns sequential Lane Numbers to those lanes. For example, an x4 link can be formed by assigning Lane Numbers 0-3.</div><div><br /></div><div>The LTSSM normally proceeds to <span style="color: white;">Configuration.Lanenum.Wait</span> when the following condition is met:</div><div><ol style="text-align: left;"><li>A link can be formed with a subset of the lanes that are responding with a matching Link Number and PAD for the Lane Numbers.</li></ol></div><div>An interesting question is how to handle a case where only some of the detected lanes have responded. Should the LTSSM wait at least long enough to handle a missed packet and/or lane-to-lane skew before exiting this state? (I don't actually know the answer, but to me it seems logical to wait for at least a few TS periods before proposing lane numbers.)</div><div><br /></div><div>If the above condition isn't met after a 2ms timeout, the LTSSM returns to <span style="color: white;">Detect</span> and starts over.</div><h4 style="text-align: left;">Configuration.Lanenum.Wait</h4><div>The transmitter sends TS1s with the Link Number and with each lane's proposed Lane Number. 
The receiver listens for TS1s with a matching Link Number and updated Lane Numbers.</div><div><br /></div><div>The LTSSM normally proceeds to <span style="color: white;">Configuration.Lanenum.Accept</span> when the following condition is met:</div><div><ol style="text-align: left;"><li>Two consecutive TS1s are received with Link Number matching that of the transmitted TS1s and with a Lane Number that has changed since entering the state, on any lane.</li></ol><div>Here the spec is more explicit that upstream-facing lanes may take up to 1ms to start echoing the lane numbers, to account for receiver errors or lane-to-lane skew. So (I think) the above condition is meant to be evaluated only after 1ms has elapsed in this state.</div></div><div><br /></div><div>If the above condition isn't met after a 2ms timeout, the LTSSM returns to <span style="color: white;">Detect</span> and starts over.</div><h4 style="text-align: left;">Configuration.Lanenum.Accept</h4><div>Here, there are three possibilities:</div><div><ol style="text-align: left;"><li>The updated Lane Numbers being received match those transmitted on all lanes, or the reverse (if supported). The LTSSM proceeds to <span style="color: white;">Configuration.Complete</span>.</li><li>The updated Lane Numbers don't match those transmitted, or the reverse (if supported). But, a subset of the responding lanes can be used to form a link. The downstream-facing port reassigns lane numbers for this new link and returns to <span style="color: white;">Configuration.Lanenum.Wait</span>.</li><li>No link can be formed. The LTSSM returns to <span style="color: white;">Detect</span> and starts over.</li></ol><div>Normally, lane reversal (e.g. 0-3 remapped to 3-0) would be handled by the device if it supports the feature, and its upstream-facing port will respond with matching Lane Numbers. 
However, if the device doesn't support lane reversal, it can respond with the reversed lane numbers to request the host do the reversal, if possible.</div></div><h4 style="text-align: left;">Configuration.Complete</h4><div>The transmitter sends TS2s with the agreed-upon Link and Lane Numbers. The receiver listens for TS2s with the same.</div><div><br /></div><div><div>The LTSSM normally proceeds to <span style="color: white;">Configuration.Idle</span> when all of the following conditions are met:</div><div><ol><li>At least 16 TS2s have been transmitted <i>after receiving one TS2</i>, on all lanes.</li><li>Eight consecutive TS2s have been received with the same Link and Lane Numbers as are being transmitted, on all lanes.</li></ol><div>If the above conditions aren't met after a 2ms timeout, the LTSSM returns to <span style="color: white;">Detect</span> and starts over.</div></div></div><h4 style="text-align: left;">Configuration.Idle</h4><div>The transmitter sends Idle data symbols (IDL) on all configured lanes. The receiver listens for the same. Unlike Training Sets, these symbols go through <a href="https://scolton.blogspot.com/2023/08/pcie-deep-dive-part-3-scramblers-crcs.html">scrambling</a>, so this state also confirms that scrambling is working properly in both directions.</div><div><br /></div><div><div>The LTSSM normally proceeds to <span style="color: white;">L0</span> when all of the following conditions are met:</div><div><ol><li>At least 16 consecutive IDL have been transmitted <i>after receiving one IDL</i>, on all lanes.</li><li>Eight consecutive IDL have been received, on all lanes.</li></ol><div>If the above conditions aren't met after a 2ms timeout, the LTSSM returns to <span style="color: white;">Detect</span> and starts over.</div></div></div><div><br /></div><h3 style="text-align: left;">L0</h3><div>This is the golden normal operational state where the host and device can exchange actual data packets. 
The LTSSM indicates Link Up status to the upper layers of the <a href="https://scolton.blogspot.com/2023/06/pcie-deep-dive-part-2-stack-and.html">stack</a>, and they begin to do their work. One of the first things that happens after Link Up is flow control initialization by the Data Link Layer partners. Flow control is itself a state machine with some interesting rules, but that'll be for another post.</div><div><br /></div><div>But wait...the link is still operating at 2.5GT/s at this point. If both link partners support higher data rates (as indicated in their Training Sets), they can try to switch to their highest mutually-supported data rate. This is done by transitioning to <span style="color: white;">Recovery</span>, running through the Recovery speed change substates, then returning to L0 at the new rate.</div><div><br /></div><h3 style="text-align: left;">Recovery</h3><div style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_03_006.svg" width="640" /></div><div style="text-align: center;"><br /></div><div style="text-align: left;">Recovery is in many ways the most complex LTSSM state, with many internal state variables that alter state transition rules and lead to circuitous paths through the substates, even for a nominal speed change. As with the other states, there are way too many edge cases to cover here, so I'll only focus on getting back to L0 at 8GT/s along the normal path. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">Also, since Configuration has been completed, it's assumed that Link and Lane Numbers will match in transmitted and received Training Sequences. If this condition is violated, the LTSSM may fail back to Configuration or Detect depending on the nature of the failure. 
For simplicity, I'm omitting these paths from the descriptions of each substate.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">When changing speeds to 8GT/s, the link must establish equalization settings during this state. In the simplest case, the downstream-facing port chooses a transmitter equalization preset for itself and requests a preset for the upstream-facing transmitter to use. The transmitter presets specify two parameters, de-emphasis and preshoot, that modify the shape of the transmitted waveform to counteract the low-pass nature of the physical channel. This can open the receiver eye even with lower overall voltage swing:</div><div style="text-align: left;"><br /></div><div style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_03_007.svg" width="640" /></div><h4 style="text-align: left;">Recovery.RcvrLock</h4><div>This substate is encountered (at least) three times.</div><div><br /></div><div>The first time this substate is entered is from L0 at 2.5GT/s. The transmitter sends TS1s (at 2.5GT/s) with the Speed Change bit set. It can also set the EQ bit and send a Transmitter Preset and Receiver Preset Hint in this state. These presets are requested values for the upstream transmitter to use after it switches to 8GT/s. The receiver listens for TS1s or TS2s that also have the Speed Change bit set.</div><div><br /></div><div>The first exit is normally to <span style="color: white;">Recovery.RcvrCfg</span> when the following condition is met:</div><div><ol style="text-align: left;"><li>Eight consecutive TS1s or TS2s are received with the Speed Change bit matching the transmitted value (1, in this case), on all lanes.</li></ol><div>The second time this substate is entered is from Recovery.Speed, after the speed has changed from 2.5GT/s to 8GT/s. Now, the link needs to be re-established at the higher data rate. 
Transitioning to 8GT/s always requires a trip through the equalization substate, so after setting its transmitter equalization, the LTSSM proceeds to <span style="color: white;">Recovery.Equalization</span> immediately.</div></div><div><br /></div><div>The third time this substate is entered is from Recovery.Equalization, after equalization has been completed. The transmitter sends TS1s (at 8GT/s) with the Speed Change bit cleared, the EC bits set to 2'b00, and the equalization fields reflecting the <i>downstream</i> transmitter's current equalization settings: Transmitter Preset and Cursor Coefficients. The receiver listens for TS1s or TS2s that also have the Speed Change and EC bits cleared.</div><div><br /></div><div>The third exit is normally to <span style="color: white;">Recovery.RcvrCfg</span> when the following condition is met:</div><div><ol><li>Eight consecutive TS1s or TS2s are received with the Speed Change bit matching the transmitted value (0, in this case), on all lanes.</li></ol></div><h4 style="text-align: left;">Recovery.RcvrCfg</h4><div>This substate is encountered (at least) twice.</div><div><br /></div><div>The first time this substate is entered is from Recovery.RcvrLock at 2.5GT/s. The transmitter sends TS2s (at 2.5GT/s) with the Speed Change bit set. It can also set the EQ bit and send a transmitter preset and receiver preset hint in this state. These presets are requested values for the upstream transmitter to use after it switches to 8GT/s. The receiver listens for TS2s that also have the Speed Change bit set.</div><div><br /></div><div><div>The first exit is normally to <span style="color: white;">Recovery.Speed</span> when the following condition is met:</div><div><ol><li>Eight consecutive TS2s are received with the Speed Change bit set, on all lanes.</li></ol><div>The second time this substate is entered is from Recovery.RcvrLock at 8GT/s. The transmitter sends TS2s (at 8GT/s) with the Speed Change bit cleared. 
The receiver listens for TS2s that also have the Speed Change bit cleared.</div></div></div><div><br /></div><div><div>The second exit is normally to <span style="color: white;">Recovery.Idle</span> when the following condition is met:</div><div><ol><li>Eight consecutive TS2s are received with the Speed Change bit cleared, on all lanes.</li></ol></div></div><h4 style="text-align: left;">Recovery.Speed</h4><div>In this substate, the transmitter enters electrical idle and the receiver waits for all lanes to be in electrical idle. At this point, the transmitter changes to the new higher speed and configures its equalization parameters. In <a href="https://docs.xilinx.com/r/en-US/pg239-pcie-phy">PG239</a>, this is done using the <b><span style="color: #01ffff; font-family: courier;">phy_rate</span></b> and <span style="color: #01ffff; font-family: courier;"><b>phy_txeq_X</b></span> signals.</div><div><br /></div><div>The LTSSM normally returns to <span style="color: white;">Recovery.RcvrLock</span> after waiting at least 800ns and not more than 1ms after all receiver lanes have entered electrical idle.</div><div><br /></div><div>This state may be re-entered if the link cannot be reestablished at the new speed. In that case, the data rate can be changed back to the last known-good speed.</div><h4 style="text-align: left;">Recovery.Equalization</h4><div>The Recovery.Equalization substate has phases, indicated by the Equalization Control (EC) bits of the TS1, that are themselves like sub-substates. From the point of view of the downstream-facing port, Phase 1 is always encountered, but Phase 2 and 3 may not be needed if the initially-chosen presets are acceptable.</div><div><br /></div><div>In Phase 1, the transmitter sends TS1s with EC = 2'b01 and the equalization fields indicating the <i>downstream</i> transmitter's equalization settings and capabilities: Transmitter Preset, Full Scale (FS), Low Frequency (LF), and Post-Cursor Coefficient. 
The FS and LF values indicate the range of voltage adjustments possible for transmitter equalization.</div><div><br /></div><div>The LTSSM normally returns to <span style="color: white;">Recovery.RcvrLock</span> when the following condition is met:</div><div><ol style="text-align: left;"><li>Two consecutive TS1s are received with EC = 2'b01.</li></ol><div>This essentially means that the presets chosen in the EQ TS1s and EQ TS2s sent at 2.5GT/s have been applied and are acceptable. If the above condition is not met after a 24ms timeout, the LTSSM returns to <span style="color: white;">Recovery.Speed</span> and changes back to the lower speed. From there, it could try again with different equalization presets, or accept that the link will run at a lower speed.</div><div><br /></div><div>It's also possible for the downstream port to request further equalization tuning: In Phase 2 and Phase 3 of this substate, link partners can iteratively request different equalization settings and evaluate (via some implementation-specific method) the link quality. In a completely "known" link, these steps can be skipped if one of the transmitter presets has already been validated.</div></div><h4 style="text-align: left;">Recovery.Idle</h4><div>This substate serves the same purpose as Configuration.Idle, but at the higher data rate (assuming the speed change was successful).</div><div><div><br /></div><div>The transmitter sends Idle data symbols (IDL, 8'h00) on all configured lanes. The receiver listens for the same. 
These symbols now go through 8GT/s <a href="https://scolton.blogspot.com/2023/08/pcie-deep-dive-part-3-scramblers-crcs.html">scrambling</a>, so this state also confirms that 8GT/s scrambling is working properly in both directions.</div><div><br /></div><div><div>The LTSSM normally returns to <span style="color: white;">L0</span> when all of the following conditions are met:</div><div><ol><li>At least 16 consecutive IDL have been transmitted <i>after receiving one IDL</i>, on all lanes.</li><li>Eight consecutive IDL have been received, on all lanes.</li></ol><div>If the above conditions aren't met after a 2ms timeout, the LTSSM returns to <span style="color: white;">Detect</span> and starts over.</div></div></div></div><div><br /></div><h3 style="text-align: left;">LTSSM Protocol Analyzer Captures</h3><div>There are lots of places for the LTSSM to go wrong, and since it's running near the very bottom of the stack, it's hard to troubleshoot without dedicated tools like a PCIe Protocol Analyzer. In my <a href="https://scolton.blogspot.com/2023/05/pcie-deep-dive-part-1-tool-hunt.html">tool hunt</a>, I managed to get a used U4301B, so let's put it to use and look at some LTSSM captures.</div><div><br /></div><div>Side note: Somebody just scored an insane deal on a dual U4301A <a href="https://www.ebay.com/itm/375182524628">listing</a> that included the unicorn U4322A probe. If you're that someone and you want to sell me just the probe, let me know! I will take it in any condition just for the spare pins. 
Also, there is a reasonably-priced <a href="https://www.ebay.com/itm/186111472517?hash=item2b551bb385">U4301B</a> up right now if anyone's looking for one.</div><div><br /></div><div>But anyway, my Frankenstein U4301B + M.2 interposer is still operational and can be used with the Keysight software to capture Training Sets and summarize LTSSM progress:</div><div><br /></div><div><img src="https://scolton-www.s3.amazonaws.com/img/pcie_03_008.png" width="640" /></div><div><br /></div><div>You can see the progression through Polling and Configuration, L0, Recovery, and back to L0. In Recovery, you can see the speed change and equalization loops, crossing through the base state of Recovery.RcvrLock three times as described above.</div><div><br /></div><div>Looking at the Training Sequence traffic itself, the entire LTSSM takes around 9ms to complete in this example, with the vast majority of the time spent in the Recovery state after the speed change. Zooming in shows the details of the earlier states, down to the level of individual Training Sequences.</div><div><br /></div><div><img src="https://scolton-www.s3.amazonaws.com/img/pcie_03_009.png" width="640" /></div><div><br /></div><div>If any of the state transitions don't go as expected, it's possible to look inside the individual Training Sequences to troubleshoot what conditions aren't being met. The exact timing and behavior vary a lot from device to device, though.<br /><br /></div><h3 style="text-align: left;">So you made it to L0...what next?</h3><div>L0 / Link Up means the physical link is established, so the upper layers of the PCIe stack can begin to communicate across the link. However, before any application data (memory transactions) can be transferred, the Data Link Layer must initialize flow control. 
PCIe flow control is itself an interesting topic that deserves a separate post, so I'll end here for now!</p><p></p>

PCIe Deep Dive, Part 3: Scramblers, CRCs, and the Parallel LFSR (2023-08-22)

<p>This post continues an exploration into the inner workings of PCIe. The <a href="https://scolton.blogspot.com/2023/06/pcie-deep-dive-part-2-stack-and.html" target="_blank">previous post</a> presented a top-level view of the PCIe Controller as a memory bus extension, with discussion of the various overheads associated with wrapping memory transfers into serial data packets. In this post, I want to go to the other extreme and look at one of the low-level logic mechanisms that PCIe depends on for reliable data transfer: the parallel <a href="https://en.wikipedia.org/wiki/Linear-feedback_shift_register" target="_blank">Linear-Feedback Shift Register</a> (LFSR). This mechanism efficiently generates the pseudo-randomness required to ensure DC-balanced serial data, and also underlies the <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">Cyclic Redundancy Check</a> (CRC) used to validate Transaction Layer Packets (TLPs).</p><h4 style="text-align: left;">PCIe 3.0 Scrambler</h4><p>PCIe signals are driven across AC-coupled differential pairs to increase immunity to noise. The transmitter and receiver may be on different boards, far apart from each other, with significant high-frequency ground offset between them. Adding series capacitors to the differential signal provides low-voltage level shifting capability to deal with this. 
But, this only works if the data coming across the link is DC-balanced over a data interval much shorter than the time constant formed by the AC coupling capacitor and termination resistor, which is typically 10⁴ to 10⁵ UI.</p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_02_000.svg" width="640" /></p><p>PCIe 1.0 and 2.0 use <a href="https://en.wikipedia.org/wiki/8b/10b_encoding" target="_blank">8b/10b encoding</a> to enforce DC balance. This encoding tracks the running disparity of the serial data stream and modifies 10b symbols (representing 8b data) to keep it in balance. This is also the encoding used in USB all the way up to USB 3.x Gen 1 (5Gbps), which is the same speed as PCIe 2.0. It's simple and deterministic, but it has a poor serial encoding efficiency of only 80% (8/10).</p><p>By contrast, PCIe 3.0 through PCIe 5.0 use 128b/130b encoding, where two sync bits are prepended to 128b data payloads to form 130b blocks. As discussed in the <a href="https://scolton.blogspot.com/2023/06/pcie-deep-dive-part-2-stack-and.html" target="_blank">previous post</a>, this has a much better serial encoding efficiency of 98.5% (128/130). However, the two sync bits are not sufficient to control running disparity with a 128b data payload. Instead, the data is sent through a scrambler, a Pseudo-Random Number Generator (PRNG) that remaps bits in a way that both the transmitter and receiver understand. The output stream is <i>statistically</i> DC-balanced for all real data.</p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_02_001.svg" width="640" /></p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_02_002.svg" width="640" /></p><p style="text-align: left;">The PCIe implementation of the PRNG for scrambling is as a Linear-Feedback Shift Register (LFSR). 
In the case of PCIe 3.0, the canonical implementation is a 23-bit shift register with strategically-placed XORs between some bits to instigate pseudo-randomness. The output of the shift register is then XORed with each data bit to generate the scrambled output. Each lane gets its own LFSR scrambler seeded with a different value.</p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_02_003.svg" width="640" /></p><p style="text-align: left;">This is simple logic, but it would need to run at 8GHz to be implemented in single-bit fashion like this. That's not really practical even in dedicated silicon, and is completely impossible using FPGA sequential logic. However, it's possible to parallelize the LFSR to any data width pretty easily. The key is in the name: the operation is linear, so the contributions of each bit of input data and the initial LFSR can be superimposed to generate each bit of output data and the final LFSR. This method of parallelizing the LFSR is covered very well at <a href="http://OutputLogic.com" target="_blank">OutputLogic.com</a>, with utilities to generate Verilog implementations of any LFSR and data width. I will only briefly describe the procedure here.</p><p style="text-align: left;">Using for example the 23-bit LFSR and a 32-bit data path (common for each PCIe 3.0 lane with a 250MHz PHY clock), there are a total of 23 + 32 = 55 bits that can contribute to the final LFSR and output data. Set each of those bits to one, and all other bits to zero, then run the LFSR forward by 32 steps, and record the contribution of each input bit to the output data and final LFSR. This creates a big table of bit contributions:<br /></p><div style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_02_004.svg" width="640" /></div><p></p><p style="text-align: left;">The full parallel operation is just the sum (in mod 2, so XOR) of contributions from each bit of input data and the initial LFSR. 
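To make this concrete, here's a small Python sketch of a single-bit scrambler LFSR and the superposition procedure. It models a Galois-form 23-bit LFSR using the Gen3 polynomial x²³ + x²¹ + x¹⁶ + x⁸ + x⁵ + x² + 1; the seed is just a placeholder, not one of the spec's per-lane seed values, and the spec draws the register in a different (but equivalent-length) arrangement.

```python
# Single-bit model of a Gen3-style scrambler LFSR (Galois form) and its
# parallelization by superposition. POLY holds the polynomial terms below
# x^23 as a bit mask; SEED is a placeholder, not a spec per-lane value.
WIDTH, MASK = 23, (1 << 23) - 1
POLY = 0x210125        # x^21 + x^16 + x^8 + x^5 + x^2 + 1
SEED = 0x1DBFBC        # placeholder seed

def scramble_serial(data_bits, state):
    """Reference bit-at-a-time scrambler: XOR data with the LFSR output."""
    out = []
    for d in data_bits:
        fb = (state >> (WIDTH - 1)) & 1              # bit shifted out
        state = ((state << 1) & MASK) ^ (POLY if fb else 0)
        out.append(d ^ fb)
    return out, state

def build_tables(n):
    """Superposition: run the serial model n steps once per unit input
    (23 initial-state bits, then n data bits), recording each input bit's
    contribution to the output data and the final LFSR state."""
    tables = []
    for i in range(WIDTH + n):
        st = (1 << i) if i < WIDTH else 0
        d = [1 if i == WIDTH + j else 0 for j in range(n)]
        tables.append(scramble_serial(d, st))
    return tables

def scramble_parallel(data_bits, state, tables):
    """One n-bit-wide step: XOR (mod-2 sum) the contributions of every
    input bit that is set. In hardware this is just a forest of wide XORs."""
    n = len(data_bits)
    out, fin = [0] * n, 0
    inputs = [(state >> i) & 1 for i in range(WIDTH)] + list(data_bits)
    for bit, (o, f) in zip(inputs, tables):
        if bit:
            out = [a ^ b for a, b in zip(out, o)]
            fin ^= f
    return out, fin
```

Because the PRNG stream doesn't depend on the data, scrambling the same bits twice with the same seed returns the original data, which is exactly how the receiver descrambles.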
Each bit of the output data and final LFSR is the XOR combination of a specific set of input bits, with at most 55 contributing bits. On a Xilinx Ultrascale+ FPGA, wide XORs like this are easy to build using nested six-input LUTs. With two levels, you get 6² = 36 inputs. With three levels, 6³ = 216 inputs. Each level has a propagation time on the order of 1ns, so even nested three deep it's capable of running at 250MHz.<br /></p><div style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_02_005.svg" width="640" /></div><p></p><h4 style="text-align: left;">Link CRC</h4><p style="text-align: left;">Another use for the parallel LFSR is in the generation and checking of the Link CRC, a 32-bit value used for error detection in TLPs. The LCRC acts like a signature for the bits in the TLP and the value received must match the value calculated by the receiver, or the TLP is rejected. The LCRC mechanism uses a 32-bit LFSR (with XOR positions described by the standard CRC-32 polynomial 0x04C11DB7), seeded to 0xFFFFFFFF at the start of each TLP. At the end of the TLP, the 32-bit LFSR value is mapped to the packet's LCRC through some additional bit manipulation.</p><p style="text-align: left;">The LCRC operation can be parallelized in the same way as the Scrambler. The main differences are that the data is unmodified by the LCRC operation and that the data <i>does</i> contribute to the XOR sum of the next LFSR value. (This is what drives the LCRC to a unique value for each TLP.) In table form, this just changes which quadrants have XOR contributions:</p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_02_006.svg" width="640" /></p><p style="text-align: left;">Although there are fewer rows to handle, there are now 128 + 32 = 160 columns. The LCRC is calculated on packets before they are striped across all lanes. 
So for a PCIe 3.0 x4 link, instead of four 32-bit data paths as in the Scrambler, there is just one 128b data path operating at 250MHz. Any of these bits and any of the 32 bits of the previous clock cycle's LFSR might contribute to the XOR for each bit of the new LFSR. This isn't a problem, though, since three levels of LUT6 can handle up to 216 XOR inputs at 250MHz, as described above.</p><p style="text-align: left;">Where things do get a little complicated is in data alignment. TLP lengths are multiples of one Double Word (DW), or 32b. So, even without considering framing, 3/4 of the possible lengths would not fit evenly into 128b data beats. Each TLP is also prepended with a 32-bit framing token (STP), the latter half of which is fed into the LCRC computation as well. So in fact all cases will involve a partial data beat.</p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_02_007.svg" width="640" /></p><p style="text-align: left;">To handle this with a 128b parallel LFSR, the LCRC mechanism must get clever. Based on the length of the packet (which is known once the STP is parsed), the 128b data window can be shifted such that the <i>last</i> data beat will be aligned with the end of the packet. This ensures that the final LFSR value can be used directly to generate the LCRC. Then, the <i>first</i> 128b data beat is padded with zeros up to the middle of the STP token, where the LCRC computation begins. (In the case of a 3DW header with no data, the first and last data beat are the same.) This creates four possible alignment cases that repeat based on the length of the TLP:</p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_02_008.svg" width="640" /></p><p style="text-align: left;">Depending on the alignment case, the LFSR is seeded with a different value that accounts for the extra {16, 48, 80, 112} zeros padded onto the first data beat. 
These seed values are derived by seeding the reference single-bit implementation of the LFSR with 0xFFFFFFFF, then running it <i>backwards</i> for {16, 48, 80, 112} steps with zero data bits. With these seeds, the 128b parallel LFSR can be run on the zero-padded data and give the same final result as the single-bit implementation on the original data.</p><p style="text-align: left;">An interesting follow-up issue is how to handle back-to-back TLPs. Padding the first LCRC beat with zeros potentially means more than a 1:1 bit rate for the LCRC engine compared to the packet data, if there is no idle time between packets. An easy workaround could be to run two LCRC engines that take turns processing packets, although this means twice the logic area. The details are likely to vary in every implementation, so it's not something I will get into here.</p><h4 style="text-align: left;">Conclusion</h4><p style="text-align: left;">The last couple of posts were setup and background for PCIe in general. This one was more of a microscopic view of a particular logic mechanism key to several aspects of PCIe, and how it can be implemented efficiently on modern FPGAs. There are many such interesting logic puzzles to solve in gateware implementations of PCIe, and I wanted to give just one example at the lowest level I understand. I may cover other logic-level tricks in future posts, but first I think it will be more interesting to introduce what might be the scariest part of PCIe: the Link Training and Status State Machine (LTSSM). 
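Before moving on, that backwards-seeding trick is easy to sanity-check in software. Below is a bitwise model of just the LCRC LFSR (CRC-32 polynomial 0x04C11DB7, seeded with 0xFFFFFFFF) with an exact inverse step; the final bit manipulation that maps the LFSR value to the transmitted LCRC is omitted, since it doesn't affect the seed derivation.

```python
# Bitwise model of the LCRC LFSR with forward and inverse steps. Running
# the LFSR backwards over N zero bits from 0xFFFFFFFF yields a seed that
# absorbs N bits of zero padding on the front of the data.
POLY, MASK = 0x04C11DB7, 0xFFFFFFFF

def fwd(state, bit):
    """One forward step; unlike the scrambler, the data bit feeds the LFSR."""
    fb = ((state >> 31) & 1) ^ bit
    return ((state << 1) & MASK) ^ (POLY if fb else 0)

def bwd(state, bit):
    """Exact inverse of fwd(): recover the previous state."""
    fb = state & 1                     # POLY has bit 0 set; (state << 1) doesn't
    t = state ^ (POLY if fb else 0)    # undo the feedback XOR
    return (t >> 1) | ((fb ^ bit) << 31)

def crc_lfsr(bits, state=0xFFFFFFFF):
    for b in bits:
        state = fwd(state, b)
    return state

data = [1, 0, 1, 1, 0, 0, 1, 0] * 8    # arbitrary packet bits
reference = crc_lfsr(data)              # single-bit result, no padding
for pad in (16, 48, 80, 112):
    seed = 0xFFFFFFFF
    for _ in range(pad):                # run backwards over zero data bits
        seed = bwd(seed, 0)
    # forward over the zero-padded data now matches the unpadded reference
    assert crc_lfsr([0] * pad + data, state=seed) == reference
```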
To be continued...</p>Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com0tag:blogger.com,1999:blog-8200098102909041178.post-88551258542974572602023-06-11T17:07:00.007-04:002023-06-11T19:58:29.722-04:00PCIe Deep Dive, Part 2: Stack and Efficiency<p>Before getting too caught up in the inner workings of PCIe, it's probably worth taking a look at the high-level architecture - how it's used in a system and what the PCIe controller stack looks like. PCIe is fundamentally a bi-directional memory bus extension: it allows the host to access memory on a device and a device to access memory on the host.</p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_01_000.svg" width="640" /></p><p style="text-align: left;">When a PCIe link is established between the host and a device, the host assigns address space(s) that it can use to access device memory. Optionally, it can also grant permission for the device to access portions of host system memory. In that way, the host and device memory buses are effectively connected. Each PCIe link is a point-to-point connection, but they can be combined with switches into a fabric with many devices (endpoints).</p><p style="text-align: left;">Different types of devices utilize the memory bus bridging capability of PCIe in different ways. For example, an NVMe storage device exposes only a small amount of device memory (the NVMe Controller Registers) that the host uses to configure the device and inform it when new commands have been submitted. All actual data transfer is done by the storage device reading from or writing to host memory. 
In this way, the NVMe storage device acts as a DMA controller between host memory and non-volatile storage.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_01_001.svg" style="margin-left: auto; margin-right: auto;" width="640" /></td></tr><tr><td class="tr-caption" style="text-align: center;">NVMe storage device usage of PCIe link (completion steps omitted).</td></tr></tbody></table><p style="text-align: left;">One might ask why the memory buses can't just be directly connected. For one, a native memory interface such as AXI is very wide: it might have 64-256b of data, 32-64b of address, and a bunch of control signals. This works fine inside a chip, but going from chip-to-chip, board-to-board, or across a cable, it's too many signals. The PCIe Controller encapsulates the data, address, and control signals from the memory bus into packets that can be sent across a fast serial link over a small number of differential pairs. This standard interface also allows bridging memory buses with different native interfaces, speeds, and latencies.</p><p style="text-align: left;">With that context in mind, we can look at the PCIe Controller stack, and what role each layer plays in bridging memory transactions between the host and device as efficiently and reliably as possible. The PCIe specification defines three layers: the Transaction Layer (TXL), the Data Link Layer (DLL), and the Physical Layer (PHY). These layers each have a transmit and a receive side. 
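Before walking the stack layer by layer, the encapsulation idea can be made concrete with a hypothetical sketch: packing a single memory write (address + data) into a 3DW Memory Write TLP. The requester ID and tag here are placeholder values, and the traffic class/attribute fields are left at zero for simplicity.

```python
import struct

# Hypothetical sketch: wrap one memory write into a 3DW Memory Write TLP,
# big-endian on the wire. Requester ID and tag are placeholders; TC/attr
# fields are left at zero.
def mwr32_tlp(addr, data, req_id=0x0100, tag=0):
    assert addr % 4 == 0 and len(data) % 4 == 0
    ndw = len(data) // 4
    byte_en = 0x0F if ndw == 1 else 0xFF       # last/first DW byte enables
    dw0 = (0b010 << 29) | ndw                  # Fmt=3DW+data, Type=MWr, Length in DWs
    dw1 = (req_id << 16) | (tag << 8) | byte_en
    dw2 = addr & 0xFFFFFFFC                    # DW-aligned 32-bit address
    return struct.pack(">III", dw0, dw1, dw2) + data

tlp = mwr32_tlp(0x10000000, b"\xde\xad\xbe\xef")
```

The lower layers then wrap this packet with framing and a CRC before it hits the serial link, as described in the next section.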
From the point of view of the host, the stack looks like this:</p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_01_002.svg" width="640" /></p><p style="text-align: left;">Memory transactions from the host to the device are packaged by the TXL into a Transaction Layer Packet (TLP) with a header containing the address and other control information. The DLL prepends a framing token (STP) and appends a CRC to the TLP to create a Link Packet. This is then split into lanes and serialized by the PHY. The process happens in reverse for memory transactions from device to host, to go from serialized Link Packets back to host memory transactions.</p><p style="text-align: left;">In practice, many architectures (including Ultrascale+) break the PHY into two parts: an upper Media Access Control (MAC) layer and a lower layer still called the PHY. These are connected by the standard PHY Interface for PCI Express (PIPE), <a href="https://www.intel.com/content/www/us/en/io/pci-express/pci-express-architecture-devnet-resources.html">published by Intel</a>. It's also useful to add an explicit AXI-PCIe bridge layer above the TXL when the native memory bus is AXI, as it is in the Ultrascale+ architecture. This would be an example of what some references call the Application Layer. Expanded this way, the stack looks like this:</p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_01_003.svg" width="640" /></p><p style="text-align: left;">Different Xilinx IPs cover different layers of the stack, as shown above. <a href="https://docs.xilinx.com/r/en-US/pg239-pcie-phy">PG239</a> (PCI Express PHY) is a low-level (PIPE down) PCIe PHY wrapper for the GTH/GTY serial transceivers. 
<a href="https://docs.xilinx.com/r/en-US/pg213-pcie4-ultrascale-plus">PG213</a> (UltraScale+ Devices Integrated Block for PCI Express) covers the PCIE4 hardware block that includes the TXL, DLL, and MAC layers, and interfaces to the PHY via PIPE. And <a href="https://docs.xilinx.com/r/en-US/pg194-axi-bridge-pcie-gen3">PG194</a> (AXI Bridge for PCI Express Gen3 Subsystem) includes the AXI-PCIe bridge layer on top of the PCIE4 hardware block and PHY. (For Ultrascale+, this is technically implemented as a configuration of <a href="https://docs.xilinx.com/r/en-US/pg195-pcie-dma">PG195</a>, but the relevant documentation is still in PG194.)</p><p style="text-align: left;">All of these Xilinx IPs are included in Vivado at no additional cost, but not every device has the PCIE4 block(s) needed to instantiate PG213 or PG194/PG195. For the Zynq Ultrascale+ line, the <a href="https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html#productTable">product tables</a> show how many PCIe lanes are supported by integrated PCIE4 blocks for each device. In general, the larger and more expensive chips have more available PCIe hardware. But there are exceptions like the ZU6xx, ZU9xx, and ZU15xx, which have none. These can still instantiate PG239, but require a PCIe soft IP to implement the rest of the stack.</p><p style="text-align: left;">Each layer communicates with the next through a data bus that's sized to match the speed of the link. The example above is for a Gen3 x4 link, which supports 32Gb/s of serial data in each direction. In the Ultrascale+ implementation, the 250MHz clock for the 128b internal datapath is derived from the PCIe reference clock, so all layer logic is synchronous with the PHY. This seems like a perfectly-balanced data pipeline, with 32Gb/s of data coming in and going out in each direction.
But in practice, overheads limit the maximum link efficiency.</p><p style="text-align: left;">First, PCIe Gen3 uses 128b/130b encoding: for each 128b serial data payload on each lane, a 2b sync header is prepended to create a 130b block. The sync bits tell the receiver whether the block is data or an Ordered Set (control sequence). In order to make room for the sync bits, PIPE requires one invalid data clock cycle in every 65-clock period.</p><p style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_01_004.svg" width="640" /></p><p style="text-align: left;">The period for skipping data on the 250MHz side of the PHY is 260ns, while the period for a 130b serial output block is only 16.25ns, so the PHY must implement buffering and a SERDES gearbox to make this work. The effect of the sync bits can be seen in the protocol analyzer raw data, where there are occasionally 1ns gaps in the timestamp. (The full serial data rate including sync bits would be exactly 4B/ns.) These leap-nanoseconds add up to an overall efficiency of 98.5% (64/65), as can be seen by plotting the starting timestamp of each block.</p><p style="text-align: center;"><a href="https://scolton-www.s3.amazonaws.com/img/pcie_01_005.png"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_01_005s.png" width="640" /></a></p><p style="text-align: left;">Next, transmitters are required to periodically stop transmitting data and send a SKP Ordered Set (SKP OS), which is used to compensate for clock drift. This should happen every 370-375 blocks, and the SKP OS takes one block to transmit. Stopping the data stream also requires sending an EDS token, which may require one additional block depending on packet alignment. But even in a worst-case scenario this still represents about 99.5% (368/370) efficiency.</p><p style="text-align: left;">We can see the EDS tokens and SKP OS at regular intervals in both directions on the protocol analyzer. 
Interestingly, the average interval in the Host-to-Device direction is on the short side (365 blocks). Maybe it's not accounting for the 64/65 PIPE TxDataValid efficiency described above. The interval is controlled by the MAC layer, which is in PG213 in this case, so I don't think it's something that can be adjusted. The Device-to-Host direction is spot-on in this case, with a 371-block interval.</p><p style="text-align: center;"><a href="https://scolton-www.s3.amazonaws.com/img/pcie_01_006.png"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_01_006s.png" width="640" /></a></p><p style="text-align: left;">DLLs also exchange Data Link Layer Packets (DLLPs) for Ack/Nak and flow control of TLPs. These packets are short (6B), but they must be transmitted with enough regularity to meet latency requirements and ensure receiver buffers don't overflow. There's no simple rule for when these are transmitted, only a set of constraints based on the link operating conditions. To get a feel for the typical link efficiency impact of DLLP traffic, we can look at a 100μs section of bulk data transfer and add up the combined contribution of all DLLPs:</p><p style="text-align: center;"><a href="https://scolton-www.s3.amazonaws.com/img/pcie_01_007.png"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_01_007s.png" width="640" /></a></p><p style="text-align: left;">In total, there were 237 DLLPs transmitted in the Host-to-Device direction. Since the packets must be lane-0-aligned on an x4 link, they actually occupy 8B each. This is 1896B of overhead for nearly 400000B of data, again around 99.5% efficiency. This example is mostly unidirectional data transfer from host to device, though. If the device was also sending data to the host, there would be far more Acks going in the Host-to-Device direction. 
If the Ack count were similar to that of the Device-to-Host direction in this example, the efficiency would drop to around 95%.</p><p style="text-align: left;">Lastly, the biggest overhead is usually for TLP packetization. The TLP header is either 12B or 16B. The DLL adds a 4B framing token (STP) and a 4B Link CRC (LCRC). The payload size can be as high as 4096B, although it's limited to 1024B in the Ultrascale+ implementation (PG213). It's also common for devices to limit the max payload size to 128B, 256B, or 512B, depending on the capability of their PCIe Controller. This gives a range of 84.2% (128/150) to 98.1% (1024/1044) for packetization efficiency with optimally-sized transfers on Ultrascale+ hardware.</p><p style="text-align: left;">In the example capture, data is transferred from host to device in 128B-payload TLPs:</p><p style="text-align: center;"><a href="https://scolton-www.s3.amazonaws.com/img/pcie_01_008.png"><img a="" src="https://scolton-www.s3.amazonaws.com/img/pcie_01_008.png" width="640" /></a></p><p style="text-align: left;">The packet has 20B of overhead for 128B of data, which would be an 86.5% efficiency. However, the host controller also inserts 12B of logical idle (zeros) to align the next STP token to the start of a block. This isn't required by the PCIe protocol, but may be inherent in the implementation of the controller. For this payload size, it drops the efficiency to 80% (128/160). </p><p style="text-align: left;">That packetization efficiency dominates the overall link efficiency, which hovers between 75% and 80% during periods of stable data transfer:<br /></p><div style="text-align: center;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_01_009.svg" width="640" /></div><p></p><p style="text-align: left;">In this case, increasing the max payload size would have the most positive impact on throughput. 
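Putting the numbers from this post together, here's a rough model of overall link efficiency. The 64/65 and 368/370 factors come from the sync-bit and SKP OS overheads above; the 0.995 DLLP factor is the observed value from this capture, not a spec constant, and the 16B padding models this particular host controller's STP alignment behavior.

```python
# Rough Gen3 link-efficiency model combining the factors from this post.
def link_efficiency(payload, overhead=20, pad_to_beat=True):
    """payload: TLP payload bytes; overhead: 4B STP + 12B header + 4B LCRC.
    pad_to_beat models the observed padding of the next STP token to a
    16B (128b datapath) boundary."""
    total = payload + overhead
    if pad_to_beat:
        total = -(-total // 16) * 16           # round up to 16B
    packetization = payload / total            # e.g. 128/160 = 0.80
    sync_bits = 64 / 65                        # 128b/130b encoding
    skp = 368 / 370                            # worst-case SKP OS insertion
    dllp = 0.995                               # observed DLLP overhead
    return packetization * sync_bits * skp * dllp

# A Gen3 x4 link carries 32Gb/s = 4GB/s of raw serial data per direction:
for mps in (128, 256, 512, 1024):
    print(f"{mps:5d}B payload: {4.0 * link_efficiency(mps):.2f} GB/s")
```

For a 128B max payload size this lands right around 78%, matching the 75-80% observed during stable data transfer.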
PG213 can go up to 1024B, but the device controller may be the limiting factor.</p><p style="text-align: left;">In PCIe 6.0, a big change will be introduced that removes sync bits and consolidates DLLPs, framing tokens, and the LCRC into a fixed 20B overhead in each 256B unit (called a FLIT, for Flow Control unIT). This implies a fixed 92.2% efficiency for everything other than the SKP OS and TLP header overhead, and also a fixed latency for Ack/Nak and flow control, a nice simplification.</p><p style="text-align: left;">But for now we're still in the realm of PCIe Gen3, where we can expect an overall link efficiency in the 75-95% range, depending on the variety of factors described above as well as details of the controller implementations.</p><p style="text-align: left;">The packetization and flow control functions described above are the domain of the Transaction Layer and Data Link Layer, but there are also some really interesting functions of the MAC and PHY layers that facilitate reliable serial data transfer across the physical link. These will have to be topics for one or more future posts, though.</p>Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com0tag:blogger.com,1999:blog-8200098102909041178.post-36707802985377862362023-05-07T20:20:00.010-04:002023-11-26T11:24:28.257-05:00PCIe Deep Dive, Part 1: Tool Hunt<p>Over the past few years, I've been <a href="https://scolton.blogspot.com/2019/11/zynq-ultrascale-fatfs-with-bare-metal.html">developing</a> and <a href="http://scolton.blogspot.com/2021/10/zynq-ultrascale-bare-metal-nvme-2gbs.html">improving</a> very fast standalone NVMe-based storage capability for the Zynq Ultrascale+ architecture, to keep up with the absurd speeds of modern SSDs. 
(Drives like the <a href="https://www.tomshardware.com/reviews/seagate-firecuda-530-m2-nvme-ssd-review/2">Seagate Firecuda 530</a> and <a href="https://www.tomshardware.com/reviews/sabrent-rocket-4-plus-g-ssd-review/2">Sabrent Rocket 4 Plus-G</a> can now hit 3GB/s+ <i>sustained</i> <a href="https://www.reddit.com/r/NewMaxx/wiki/basics/#wiki_number_of_levels_or_bits_per_cell">TLC</a> write speeds, with much higher <a href="https://www.reddit.com/r/NewMaxx/wiki/basics/#wiki_slc_cache_type">pSLC cache</a> peaks.) But my knowledge pretty much ended at the interface to the Xilinx DMA/Bridge Subsystem for PCI Express (<a href="https://docs.xilinx.com/r/en-US/pg194-axi-bridge-pcie-gen3/Introduction">PG194</a>/<a href="https://docs.xilinx.com/r/en-US/pg195-pcie-dma">PG195</a>). In the usual fashion, I'm now going to dive deeper to explore in more detail how the AXI-PCIe bridge works, and what the PCIe stack actually looks like.</p><p>Something I found interesting about PCIe in general is that there seems to be a pretty large barrier built up around the black box. Even just finding learning resources is much harder than it should be. The best I found was <a href="https://www.mindshare.com/Books/Titles/PCI_Express_Technology_3.0">PCI Express Technology 3.0</a> and some accompanying material by MindShare, but even that seems like a prose wrapper on top of the specification. There isn't anything that I would consider a beginner's guide, like you might find for USB or Ethernet.<br /></p><p><span style="color: #990000;">[Edit by Future Shane] There is a very good series of four articles from Simon Southwell starting <a href="https://www.linkedin.com/pulse/pci-express-primer-1-overview-physical-layer-simon-southwell/?trackingId=Ei0VpssCS3anOPhl0i1l6Q%3D%3D">here</a> that offers a thorough introduction to PCIe. Definitely check it out if you're going to be exploring PCIe.</span></p><p>For physical tools, the situation is even more bleak. 
The speeds in PCIe Gen3 (8GT/s) put it in the range where an oscilloscope that can actually measure the signal will cost more than a car. But for all but the lowest-level hardware debugging, a digital capture would suffice, and that's where a protocol analyzer would be nice. Unfortunately, there is no Wireshark equivalent for PCIe; protocol analyzers for it are dedicated hardware that only a few companies develop, and they are priced astronomically.</p><p>That is...unless you scout them on eBay for a year.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://scolton-www.s3.amazonaws.com/img/pcie_00_000.jpg" style="margin-left: auto; margin-right: auto;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_00_000s.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Biggest "that escalated quickly" of my test equipment stack (ref. PicoScopes below table).</td></tr></tbody></table><p>This is a used <a href="https://www.keysight.com/us/en/product/U4301B/pci-express-protocol-analyzer.html">U4301B</a> that I got in what has to be my second-best eBay score of all time, for less than it would have cost me to rent one for a month. There are only ever a handful of them up for auction at any given time, and the market is so small that the price is basically random, so if you're actually looking for one I can only wish you luck. This one goes up to Gen3 x8, which is fine for my purposes. If you only need Gen1/2 capability, the situation is much better.</p><p><span style="color: #990000;">[Edit by Future Shane] There is one <a href="https://www.ebay.com/itm/186111472517">listed on eBay</a> for a good price right now if anyone else is looking for one. 
(I'll remove this note after it's no longer available.)</span></p><p>The U4301B is actually just the instrument in the bottom slot of the <a href="https://www.keysight.com/us/en/product/M9505A/axie-5-slot-chassis.html">M9505A</a> AXIe Chassis. This is meant to connect to a PCIe slot on a host machine using an <a href="https://www.molex.com/en-us/products/connectors/high-speed-pluggable-io/ipass-connector-system">iPass</a> cable and <a href="https://onestopsystems.com/collections/pcie-cable-adapters/products/pcie-x4-gen2-host-cable-adapter">interface card</a>. Newer versions of the chassis controller have a laptop-friendly Thunderbolt connection instead. I "upgraded" mine using an eGPU enclosure, the smaller black box sitting on top.</p><p>I said that the U4301B was my second-best eBay score of all time, and that's because the number one is the <a href="https://www.keysight.com/us/en/product/U4322A/pcie-mid-bus-probe.html">U4322A</a> probe that I got to go with it, from a different auction. The protocol analyzer is useless without a probe or interposer, and those are even harder to find used. I have <i>never</i> seen a U4322A on eBay before or since the one I got, and all other online listings for them are dead-ends. So the fact that I got one for what might as well be free compared to the new cost is just plain luck.<br /></p><p>It was, however, a lot broken...</p><p>The probe has two rows of spring-loaded contacts that are meant to touch down on test pads for the PCIe signals. Unfortunately, mine was missing several pins and many others were bent or broken. It had been treated like a scrap cable, rather than a delicate probe. 
No problem, though, I can just replace the spring pins with some equivalent Mill-Max parts...<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://scolton-www.s3.amazonaws.com/img/pcie_00_001.jpg" style="margin-left: auto; margin-right: auto;"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_00_001s.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">...oh, well shit.</td></tr></tbody></table><p>This was one of the most ridiculous things I have ever seen under the microscope. Each spring pin has a <i>surface-mount resistor </i>soldered into its tip, and encased in epoxy. What the multi-GHz fuck is going on with these? Well, I suspect they each make up part of a passive probe, also called a Low-Z or Z0 probe. This <a href="https://www.youtube.com/watch?v=eZhMIR0l3xU">video</a> explains the concept in detail; it's forming a resistive divider with the 50Ω termination. But it must have extremely low capacitance on the input side of the resistor, hence the resistors embedded in the tips. The good news is that there are no amplifiers in the probe head, so there's not much else that can be broken.</p><p>There's no replacement for these pins, so the ones that were missing or broken were a lost cause. But luckily there were enough intact ones to make a full bidirectional x4 link, which is all I really needed. They weren't all in the right locations, so I had to carefully rearrange them with a soldering iron, taking care to use as little solder as possible while still making a strong connection. 
After making the x4 link, there are only a couple of spare pins remaining, so I need to be very careful with this probe.</p><p>Actually the U4322A was not my first choice; what I really wanted was a <a href="https://www.keysight.com/us/en/product/U4328A/m-2-pci-express-interposer-socket-3-m-key.html">U4328A</a> M.2 interposer, which taps off the signals at an M.2 connector bridge. But I can convert my basically free U4322A into that using a basically free circuit board. This board just has the test pad footprint for the U4322A in between a short M.2 extension. I carefully mounted the U4322A to the board with standoffs and don't really intend to ever take it off again.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://scolton-www.s3.amazonaws.com/img/pcie_00_002.png"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_00_002s.png" width="640" /></a></div><p style="text-align: center;"><a href="https://scolton-www.s3.amazonaws.com/img/pcie_00_003.jpg"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_00_003s.jpg" width="640" /></a></p><p>Somewhat to my surprise, this collection of parts actually does work. I was worried that there would be some license nonsense involved, but the instrument license seems to go with the instrument. The <a href="https://www.keysight.com/us/en/lib/software-detail/instrument-firmware-software/logic-and-protocol-analyzer-software-64bit-2511828.html">host software</a> doesn't require a separate license and worked right away, even through my weird Thunderbolt eGPU enclosure hack. And that's really where the value is. It wouldn't be hard to make an in-system <a href="https://cdrdv2.intel.com/v1/dl/getContent/643108">PIPE</a> traffic logger on a Zynq Ultrascale+, and I might do that anyway, but parsing and visualizing the data in a convenient way takes a lot of effort. 
With the LPA Software, you just get nice graph and packet views straight away:</p><p style="text-align: center;"><a href="https://scolton-www.s3.amazonaws.com/img/pcie_00_004.png"><img src="https://scolton-www.s3.amazonaws.com/img/pcie_00_004s.png" width="640" /></a></p><p>This all seems like a lot of effort for probing an interface that's now at least two generations old. All this equipment is outdated and could for sure be replaced with a single-board interposer based on a Zynq Ultrascale+. All it needs is two GTH quads, a bunch of RAM, and a high-speed interface to the outside world. But I don't think Keysight or Teledyne LeCroy are interested in that - Gen5 is where the money is. Interestingly, though, the new <a href="https://www.keysight.com/us/en/product/P5552A/p5552a-pcie-5-0-protocol-analyzer.html">Keysight Gen5 analyzer</a> <i>is</i> a single-board interposer.</p><p>But for now I have Gen3 protocol analysis capability, which is good enough for my purposes. I've used it a bunch in the past few months to explore the different layers of the PCIe stack and components within. There are some really interesting parts that I may cover in future posts. But I'll probably start with an overview of the whole stack, and where the available Xilinx IPs fit into it, since even that is a little confusing at first. There are hard and soft (i.e. HDL) components to it, and not every device has an out-of-the-box solution for making the whole stack. 
That's enough material for an entire post though, so I'll end this one here.</p>Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com0tag:blogger.com,1999:blog-8200098102909041178.post-60005885806858975482021-10-09T10:16:00.001-04:002021-11-23T16:32:05.400-05:00Zynq Ultrascale+ Bare Metal NVMe: 2GB/s with FatFs + exFAT<p>This is a quick follow-up to my <a href="https://scolton.blogspot.com/2019/11/zynq-ultrascale-fatfs-with-bare-metal.html">original post</a> on speed testing bare metal NVMe with the Zynq Ultrascale+ AXI-PCIe bridge. There, I demonstrated a lightweight NVMe driver running natively on one Cortex-A53 core of the ZU+ PS that could comfortably achieve >1GB/s write speeds to a suitable M.2 NVMe SSD, such as the Samsung 970 Evo Plus. That's without any hardware acceleration: the NVMe queues are maintained in external DDR4 RAM attached to the PS, by software running on the A53.</p><p>I was actually able to get to much higher write speeds, over 2.5GB/s, writing directly to the SSD (no file system) with block sizes of 64KiB or larger. But this only lasts as long as the SLC cache: Modern consumer SSDs use either TLC or QLC NAND flash, which stores three or four bits per cell. But it's slower to write than single-bit SLC, so drives allocate some of their free space as an SLC buffer to achieve higher peak write speeds. Once the SLC cache runs out, the drive drops down to a lower sustained write speed.</p><p>It's not easy to find good benchmarks for sustained sequential writing. The best I've seen are from <a href="https://www.tomshardware.com/">Tom's Hardware</a> and <a href="https://www.anandtech.com/">AnandTech</a>, but only as curated data sets in specific reviews, not as a global data set. For example, this <a href="https://www.tomshardware.com/reviews/sabrent-rocket-4-plus-m2-nvme-ssd-review/2">Tom's Hardware review of the Sabrent Rocket 4 Plus 4TB</a> has good sustained sequential write data for competing drives. 
And, this <a href="https://www.anandtech.com/show/16087/the-samsung-980-pro-pcie-4-ssd-review/3">AnandTech review of the Samsung 980 Pro</a> has some more good data for fast drives under the Cache Size Effects test. My own testing with some of these drives, using ZU+ bare metal NVMe, has largely aligned with these benchmarks.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFbCSBuIDiDMRj1HJHdfma8XoPLWgT4MY-gMq_2J6-sStmWhcFFt53UIT2uQunCR4QwXTANIeWdEQf7Vru85ANMKm36J3iioGbBjNMiDtu-9KZ8jcH2OHoswficSDORFi2s3q48fkLSXs/s1585/CompareFOB.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="893" data-original-width="1585" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFbCSBuIDiDMRj1HJHdfma8XoPLWgT4MY-gMq_2J6-sStmWhcFFt53UIT2uQunCR4QwXTANIeWdEQf7Vru85ANMKm36J3iioGbBjNMiDtu-9KZ8jcH2OHoswficSDORFi2s3q48fkLSXs/w640-h360/CompareFOB.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"></td></tr></tbody></table><p>The unfortunate trend is that, while peak write speeds have increased dramatically in the last few years, sustained sequential write speeds may have actually gotten <i>worse</i>. This trend can be seen globally as well as within specific lines. (It might even be true within <a href="https://www.tomshardware.com/news/samsung-is-swapping-ssd-parts-too">different date codes of the same drive</a>.) Take for example the <a href="http://images.anandtech.com/doci/16087/seq-fill-970pro-1024.png">Samsung 970 Pro</a>, an MLC (two bit per cell) drive released in 2018 that had no SLC cache but could write its full capacity (1TB) MLC at over 2.5GB/s. 
Its successor, the <a href="https://www.anandtech.com/show/16087/the-samsung-980-pro-pcie-4-ssd-review/3">980 Pro</a>, has much higher peak SLC cache write speeds, nearing 5GB/s with PCIe Gen4, but dips down to below 1.5GB/s at some points after the SLC cache runs out.</p><p>Things get more complicated when considering the allocation state of the SSD. The sustained write benchmarks are usually taken after the entire SSD has been deallocated, via a secure erase or whole-drive TRIM. This restores the SLC cache and resets garbage collection to some initial state. If instead the drive is left "full" and old blocks are overwritten, the SLC cache is not recovered. However, this may also result in faster and more steady sustained sequential writing, as it prevents the undershoot that happens when the SLC cache runs out and must be unloaded into TLC.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjEA2II-xL_12nxm8NlZvsMk-tTEaZYOCE9SfpaxZok8ZkZ5LXFI4_vmXI2q-RWxa39TiLXRAjls8gdN6iJAXfvTbUgVEmQXkSmOxYVfCQS6wr6Uw66XOfllb2hlLklfatBztdu-7Jiuw/s1601/CompareTRIM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="897" data-original-width="1601" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjEA2II-xL_12nxm8NlZvsMk-tTEaZYOCE9SfpaxZok8ZkZ5LXFI4_vmXI2q-RWxa39TiLXRAjls8gdN6iJAXfvTbUgVEmQXkSmOxYVfCQS6wr6Uw66XOfllb2hlLklfatBztdu-7Jiuw/w640-h358/CompareTRIM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"></td></tr></tbody></table><p>So in certain conditions and with the right SSD, it's just possible to get to sustained sequential write speeds of 2GB/s with raw disk access. But, what about with a file system? 
I originally tested <a href="http://elm-chan.org/fsw/ff/00index_e.html">FatFs</a> with the drive formatted as FAT32, reasoning (incorrectly) that an older file system would be simpler and have less overhead. But as it turns out, exFAT is a much better choice for fast sustained sequential writing.</p><p>The most important difference is how FAT32 and exFAT check for and update cluster allocation. Clusters are the unit of memory allocated for file storage - all files take up an integer number of clusters on the disk. The clusters don't have to be sequential, though, so the File Allocation Table (FAT) contains chained lists of clusters representing a file. For sequentially-written files, this list is contiguous. But the FAT allows for clusters to be chained together in any order for non-contiguous files. Each 32b entry in the FAT is just a pointer to the next cluster in the file.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCJOhHf2yz_c9_cQ6KNbSZnEYNmnfhOYBsY52Apos-zAa6yZ5-UOwUUYwDAic2GRwd7D1xPudAoNip0w_jxgyCvpqWYp6iAGcqfbUZmr9n_zzE2cPshQqEc2ONv_k5FSIpCi6gvCP-RHc/s1307/e74.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1014" data-original-width="1307" height="497" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCJOhHf2yz_c9_cQ6KNbSZnEYNmnfhOYBsY52Apos-zAa6yZ5-UOwUUYwDAic2GRwd7D1xPudAoNip0w_jxgyCvpqWYp6iAGcqfbUZmr9n_zzE2cPshQqEc2ONv_k5FSIpCi6gvCP-RHc/w640-h497/e74.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">FAT32 cluster allocation entirely based on 32b FAT entries.</td></tr></tbody></table><p>In FAT32, the cluster entries are mandatory and a sequential write must check and update them as it progresses. 
This means that for every cluster written (64KiB in maxed-out FAT32), 32b of read and write overhead is added. In FatFs, this gets buffered until a full LBA (512B) of FAT update is ready, but when this happens there's a big penalty for stopping the flow of sequential writing to check and update the FAT.</p><p>In exFAT, the cluster entries in the FAT are optional. Cluster allocation is handled by a bitmap, with one bit representing each cluster (0 = free, 1 = allocated). For a sequential file, this is all that's needed. Only non-contiguous files need to use the 32b cluster entries to create a chain in the FAT. As a result, sequential writing overhead is greatly reduced, since the allocation updates happen 32x less frequently.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7RT57TasiKoEUEDvcBYRvyubADyNZeU_KAFYorYvhK-nXdyFB4ubBf4wR9VSzCSTJ1LRutcY6aVZnUJlk8Ux1GZmuT3u76ORq7ESoiswn6eTnfln7BdEGHW6T2fNfLPZEdk43F1c7uJM/s1313/e75.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1204" data-original-width="1313" height="586" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7RT57TasiKoEUEDvcBYRvyubADyNZeU_KAFYorYvhK-nXdyFB4ubBf4wR9VSzCSTJ1LRutcY6aVZnUJlk8Ux1GZmuT3u76ORq7ESoiswn6eTnfln7BdEGHW6T2fNfLPZEdk43F1c7uJM/w640-h586/e75.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">exFAT cluster allocation using bitmap only for sequential files.</td></tr></tbody></table><p>The cluster size in exFAT is also not limited to 64KiB. Using larger clusters further reduces the allocation update frequency, at the expense of more dead space between files. If the plan is to write multi-GB files anyway, having 1MiB clusters really isn't a problem. 
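Roughly, the bookkeeping difference works out like this. Here's a back-of-the-envelope Python sketch (assuming 512B LBAs, 32-bit FAT entries, and one bitmap bit per cluster, with allocation metadata flushed one full sector at a time as described above; function names are mine, for illustration only):

```python
LBA = 512  # bytes per logical block


def fat32_metadata_flushes(file_bytes, cluster_bytes=64 * 1024):
    """Each cluster written adds one 32-bit FAT entry; a full 512B FAT
    sector gets flushed once 128 entries (512/4) have accumulated."""
    clusters = -(-file_bytes // cluster_bytes)   # ceiling division
    entries_per_sector = LBA // 4                # 128
    return -(-clusters // entries_per_sector)


def exfat_metadata_flushes(file_bytes, cluster_bytes=64 * 1024):
    """A contiguous exFAT file only touches the allocation bitmap:
    one bit per cluster, so one 512B sector covers 4096 clusters."""
    clusters = -(-file_bytes // cluster_bytes)
    bits_per_sector = LBA * 8                    # 4096
    return -(-clusters // bits_per_sector)


gib16 = 16 * 2**30
print(fat32_metadata_flushes(gib16))                      # 2048 FAT sector updates
print(exfat_metadata_flushes(gib16))                      # 64 bitmap updates (32x fewer)
print(exfat_metadata_flushes(gib16, cluster_bytes=2**20)) # 4 with 1MiB clusters
```

So for the same 16GiB sequential file, exFAT interrupts the write stream 32x less often than FAT32 at 64KiB clusters, and another 16x less with 1MiB clusters on top of that.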
And speaking of multi-GB files, exFAT doesn't have the 4GiB file size limit that FAT32 has, so the file creation overhead can also be reduced. This does put more data "at risk" if a power failure occurs before the file is closed. (Most of the data would probably still be in flash, but it would need to be recovered manually.)</p><p>All together, these features reduce the overhead of exFAT to be almost negligible:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghxUYlV4AANFtU9pMZIMLT6QLDdegR0qjV7AxsULmLDAsAbVNuoei5S7fHgohufFDmFAENEla-1PiVdD6mkTAYTUIAQ1ZUtwjKUaYkvtB9Y-ICFxz9pHNrfKLc25ZmeoADStG9Bn2WLDE/s1601/CompareFS.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="897" data-original-width="1601" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghxUYlV4AANFtU9pMZIMLT6QLDdegR0qjV7AxsULmLDAsAbVNuoei5S7fHgohufFDmFAENEla-1PiVdD6mkTAYTUIAQ1ZUtwjKUaYkvtB9Y-ICFxz9pHNrfKLc25ZmeoADStG9Bn2WLDE/w640-h358/CompareFS.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"></td></tr></tbody></table><p>With 1MiB clusters and 16GiB files, it's possible to get ~2GB/s of sustained sequential <i>file</i> writing onto a 980 Pro for its entire 2TB capacity. I think this is probably the fastest implementation of FatFs in existence right now. The data block size still needs to be at least 64KiB, to keep the driver overhead low. But if a reasonable amount of streaming data can be buffered in RAM, this isn't too much of a constraint. And of course you do have to keep the SSD cool.</p><p>I've updated the bare metal NVMe test project to Vivado/Vitis 2021.1 <a href="https://github.com/coltonshane/SSD_Test">here</a>. 
It would still require some effort to port to a different board, and I still make no claims about the suitability of this driver for any real purposes. But if you need to write massive amounts of data and don't want to mess around in Linux (or want to try something similar in Linux user space...) it might be a good reference.</p>Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com2tag:blogger.com,1999:blog-8200098102909041178.post-46088207142664736942021-09-12T23:04:00.003-04:002021-09-20T11:33:29.833-04:00TinyCross: New UI and Front Wheel Traction Control<p> In the <a href="http://scolton.blogspot.com/2021/08/tinycross-4wd-80a-data-logging.html">last post</a>, I finally did some actual data logging with TinyCross set up in 4WD, 80A peak per motor, which is the rated current. Based on <a href="http://scolton.blogspot.com/p/cap-kart.html#tinykart">tinyKart</a>, I know they can handle a bit more for short durations, maybe even up to 120A. But the data logs (and many instances of having rocks flung into my face) demonstrate that the front wheels reach their traction limit somewhere around 60A on asphalt.</p><p>The behavior of front wheel slip on a go-kart is something new to me. In a straight line, the initiation of the slip and the acceleration of the wheel actually isn't the biggest problem. It's when the wheel regains traction and slows down that bad things happen. The restored grip combines with the energy being dumped from the wheel's moment of inertia to generate a quick pulse of torque on that side, which creates a lot of torque steer.</p><p>To deal with this, I wanted to implement some form of traction control, at least for the front wheels, so that I could get as much torque out of them as possible without the steering disturbances and rock shooting. But first, I needed a way to easily configure both the motor currents and the traction control settings without having to drag around my laptop everywhere. 
So, I finally built out the steering wheel UI to include a bunch of settings:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZaDxcsaIpz4ijBS9ZGksAvbhjJVqzm7PVBxjBntWNA_4lHPtN1z2pONUNFMByo3tnfbuAANcWixlqZ2iidWCE5RZURevvhMOvJmtNjFQSBeemJQaQA-vAQ1ndv4EJwC5_aN5ZUcrtlds/s2048/tc98.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZaDxcsaIpz4ijBS9ZGksAvbhjJVqzm7PVBxjBntWNA_4lHPtN1z2pONUNFMByo3tnfbuAANcWixlqZ2iidWCE5RZURevvhMOvJmtNjFQSBeemJQaQA-vAQ1ndv4EJwC5_aN5ZUcrtlds/w640-h480/tc98.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Sorry for the exposure; it's the only way to capture the full OLED refresh period.</td></tr></tbody></table><p style="text-align: left;">Anyone familiar with the <a href="https://freeflysystems.com/movi-controller">MōVI Controller</a> might recognize the OLED display. I chose this for daylight visibility and responsiveness (~50Hz update rate). The menu interface is essentially the same as the one I built the day before NAB 2014... The left knob scrolls through the menu. The right knob adjusts settings and, by clicking or holding, performs actions.</p><p style="text-align: left;">In the four corners are three motor parameters for the corresponding motors: S for Status, which shows error codes; F for Forward peak current; and R for Reverse (braking, or actually reversing) peak current. Setting both to zero masks out the CAN command from that motor, triggering a timeout that turns off the gate drivers entirely. 
A click and hold on S triggers an encoder recalibration for that motor.</p><p style="text-align: left;">In the second column from the left, the first three settings relate to data logging: LS for Logger Status, FN for File Number (click to start a new file), and LT for Logger Time, the time in [ms] for a single row of the data log to be written. Then, there are two parameters for tuning traction control: TT for Traction Threshold, and TG for Traction Gain, which I will explain shortly.</p><p style="text-align: left;">The reason I wanted to be able to adjust peak currents from the steering wheel is because I agree with this early Tesla <a href="https://www.tesla.com/blog/spin-stops-here">blog post</a>: "...it's much safer to avoid wheelspin altogether than react to it." If I know the surface supports front wheel current around 60A, there's not much point in setting it higher than that. But, I want to be able to set it higher for testing, or adjust it for different surfaces.</p><p style="text-align: left;">As for the traction control itself, there are a lot of corner cases to think about in 4WD, but the main problem I'm trying to solve is front wheel slip. If I assume the rear wheels are not slipping, then I can use their average speed as a reference. From there, it's <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn1AoiTrgVGp20NRQykuhhg999lD3uTm_jY7MSHTQ7IdnIjp6U1ggplaSq0j3KBjGTNY2LnWpcvAAcZDXdxGz35qYiHxNJQbFyb40acbHuepXU3z84tindKb53HViEqX4Baa5UCoIgkd4/w399-h640/launch_4WD_80A_12V.png">easy to see</a> if a front wheel is running faster than that reference, and reduce the current to that motor if so. This only needs two settings: a Traction Threshold (TT) that sets how much wheel slip is allowed, and a Traction Gain (TG) that sets how much to reduce the current per unit slip above the threshold. 
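The two-parameter law described above can be sketched in a few lines. This is only an illustration, not the actual firmware: the function and variable names, and the mph/A units and default values, are mine.

```python
def traction_limited_current(i_cmd, front_speed, rear_avg_speed,
                             tt=2.0, tg=10.0):
    """Front-wheel traction control sketch (illustrative, not the real code).

    The rear-wheel average is used as the no-slip speed reference. Slip up
    to the Traction Threshold tt [mph] is allowed (covers the front/rear
    speed differential in a turn); beyond that, the current command i_cmd
    [A] is reduced by the Traction Gain tg [A per mph] of excess slip.
    """
    slip = front_speed - rear_avg_speed
    excess = slip - tt
    if excess <= 0.0:
        return i_cmd                      # within allowed slip: no action
    return max(0.0, i_cmd - tg * excess)  # proportional cut, floored at zero


print(traction_limited_current(60.0, 21.0, 20.0))  # 60.0: 1 mph slip, under threshold
print(traction_limited_current(60.0, 25.0, 20.0))  # 30.0: 3 mph excess slip
```

With something like this running per front wheel every control cycle, a spinning wheel gets its current pulled down before it can build up much momentum.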
The Traction Threshold prevents overactuation in normal conditions and allows for speed differential due to turning radius.</p><p style="text-align: left;">But what happens if a rear wheel does slip? Well, then the front wheel might slip too. At that point, I'm probably in some kind of a four wheel sideways drift anyway, so alternate control laws are going to apply. Being able to trigger some rear wheel slip with the throttle is part of the fun, too, so having complete 4WD traction control isn't something I necessarily need to solve.</p><p style="text-align: left;">With the new UI setup and the simple front wheel traction control in place, it was time to do some tuning...</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkTVDhMfdtneAJ-oxHkQih4OTOGI_exkprfMSNbxENdcJgBA-2ksvHU3Fv5z8JofVnOr2w7h9PTnksJul3JKESimV-quGWPMuILKMIyiR04bo9dFdOtfYAd74IJHNsom-9JDFoOMHZ8Es/s2048/tc95.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkTVDhMfdtneAJ-oxHkQih4OTOGI_exkprfMSNbxENdcJgBA-2ksvHU3Fv5z8JofVnOr2w7h9PTnksJul3JKESimV-quGWPMuILKMIyiR04bo9dFdOtfYAd74IJHNsom-9JDFoOMHZ8Es/w640-h480/tc95.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">...or not.</td></tr></tbody></table><p style="text-align: left;">At first, everything seemed to be going okay. I did a couple of runs at 60A front current and 80A rear current and the traction control seemed to be working as intended. But then during light regenerative braking at around 30mph, I heard the all-too-familiar sound of a FET popping, followed by some more bad noises and smells from the front drive. 
Upon inspection, only two FETs actually died, but they also took out many of the power traces, meaning this board was trash.</p><p style="text-align: left;">So what happened? Well, unfortunately, the data log was not very helpful in this case. It did show the speed (30mph) and current command (around -10A), but nothing out of the ordinary up until the point of failure. There is only one data point showing a Q-Axis current of 286A on the front left motor, followed by an undervoltage fault, which might have been the battery sagging or the power input traces getting blown up. So whatever happened, happened quick.</p><p style="text-align: left;">It's been a while since I've actually destroyed a motor controller, so I was a little disappointed. But after some thought, I didn't think this was due to the new traction control stuff. That's only applied during acceleration, and this failure definitely happened under braking. I think it's more likely that the front left motor just lost sync and the back EMF at 30mph was high enough to do damage. Up until now, I have only had a relatively slow overcurrent limit of 160A (or more) for 10ms. 
These FETs have a pretty insane Safe Operating Area (SOA), but that limit does leave room for exceeding it with currents above 400A:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiljq7fFCCpyVCo6V6K5Khy4JQ6RZO9YOUWOydWylawIPNHeAOVwTFAkk52FWXKu-N8UizhJ761FrCSQEi-NdqNY0DbC4qk6AspuddTl0XUUGZylWhvybM_Q7ZS1KF8ar2NYauDmCqsiEI/s1072/tc99.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="956" data-original-width="1072" height="356" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiljq7fFCCpyVCo6V6K5Khy4JQ6RZO9YOUWOydWylawIPNHeAOVwTFAkk52FWXKu-N8UizhJ761FrCSQEi-NdqNY0DbC4qk6AspuddTl0XUUGZylWhvybM_Q7ZS1KF8ar2NYauDmCqsiEI/w400-h356/tc99.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"></td></tr></tbody></table><p style="text-align: left;">This system could easily generate a 400A transient if a motor loses sync at 30mph. And the motor position and speed data <i>does</i> cut out at the same data point as the failure. But that's not enough to determine cause and effect. So for now I can only make changes that might help and hope for the best. I added in several more stages of faster overcurrent protection, up to 300A for a single ADC/PWM cycle (42.7μs). These overlap enough to cover the entire R_DS(on)-limited boundary of the SOA (up to the pulse rating of 1450A for 100μs!).</p><p style="text-align: left;">A faster overcurrent trip doesn't help with whatever caused the motor to lose sync in the first place (if that is what happened). I have seen at least a couple previous instances where the encoders, which supply emulated Hall effect sensor signals, have behaved as if they were completely reset. 
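The staged overcurrent protection mentioned above can be sketched as a set of per-threshold cycle counters. Only the 160A/10ms and 300A/single-cycle numbers come from this post; the intermediate stages, the class/variable names, and the reset-on-any-good-sample behavior are illustrative assumptions.

```python
# Each stage: (current threshold [A], consecutive cycles before trip).
# One "cycle" is one ADC/PWM period (42.7us here), so 10ms is ~234 cycles.
# The 200A and 250A stages are made up for illustration.
STAGES = [(160.0, 234), (200.0, 64), (250.0, 8), (300.0, 1)]


class OvercurrentMonitor:
    def __init__(self, stages=STAGES):
        self.stages = stages
        self.counts = [0] * len(stages)

    def step(self, i_meas):
        """Call once per ADC/PWM cycle; returns True if any stage trips."""
        tripped = False
        for n, (thresh, limit) in enumerate(self.stages):
            if abs(i_meas) > thresh:
                self.counts[n] += 1
                if self.counts[n] >= limit:
                    tripped = True       # fault: shut down the gate drivers
            else:
                self.counts[n] = 0       # reset stage on an in-bounds sample
        return tripped


ocp = OvercurrentMonitor()
print(ocp.step(310.0))  # True: a single cycle above 300A trips immediately
```

Overlapping the stages like this is what lets the trip curve hug the R_DS(on)-limited boundary of the SOA: slow faults still get caught by the long 160A window, while a hard transient is cut off within one cycle.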
Even though I only use the buffered and optically isolated virtual Hall effect sensor signals for commutation, I was still reading the SPI data anyway. Maybe a SPI read got corrupted by noise and turned into a write that either reconfigured or entirely reset the encoder mid-run? To protect against this, I now disabled the SPI transactions entirely other than during initialization and calibration.</p><p style="text-align: left;">So with these changes and my last and only spare drive, I went back out for another try. This time, I ran into no motor drive issues and was actually able to test and tune the front wheel traction control as I originally intended. The difference is immediately obvious while driving and in the data. First, a test at 80A front, 90A rear, with no traction control:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAxiD0tyB102XD4J4HqUJet6lb1_Ro033hW2ijFgSNvVz5VLpSGuz8MZGbcYpjjr7eg-DmXcdcNzcmmhE211F6piZUrc3n3DEQD9_XCfpKiXfiDPqJi9ZIq2MujYFyDVKfcXninDeoXrQ/s1741/tc00.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1741" data-original-width="1369" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAxiD0tyB102XD4J4HqUJet6lb1_Ro033hW2ijFgSNvVz5VLpSGuz8MZGbcYpjjr7eg-DmXcdcNzcmmhE211F6piZUrc3n3DEQD9_XCfpKiXfiDPqJi9ZIq2MujYFyDVKfcXninDeoXrQ/w504-h640/tc00.png" width="504" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Front wheel traction control off.</td></tr></tbody></table><p style="text-align: left;">As before, the front right wheel starts slipping at about 60A and spins up to 2-3x the actual ground speed. The front right always seems to lose grip first, a mystery to solve another day. 
When I let off the throttle and it regains traction, the torque pulse creates substantial torque steer, jerking the steering wheel almost 20º to the left, which I then have to counteract immediately to stay on course. Overall, it's impossible to sustain peak acceleration for more than a second or so before having to deal with the wheel spin and torque steer.</p><p style="text-align: left;">And now with the same currents, but front wheel traction control on:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEideQlOdqmAa1EWKA-uT3j_O9unV8W5QK1VwrEX0_IKyHNHj82C8K6u-ORRj7cycxhA2c1jj1HlQz4cRL4Mn1VIrCh-wgh6avaQLHxVhKh5iMkYAM5Dlf8e_2kSSobn83OcMVjRYILi-pk/s1731/tc01.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1731" data-original-width="1369" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEideQlOdqmAa1EWKA-uT3j_O9unV8W5QK1VwrEX0_IKyHNHj82C8K6u-ORRj7cycxhA2c1jj1HlQz4cRL4Mn1VIrCh-wgh6avaQLHxVhKh5iMkYAM5Dlf8e_2kSSobn83OcMVjRYILi-pk/w506-h640/tc01.png" width="506" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Front wheel traction control on.</td></tr></tbody></table><p style="text-align: left;">The front right (FR) current now averages a bit below 60A and its speed is held to just a small margin above the actual ground speed. It's never able to build up momentum and then "catch", inducing torque steer. This allows continuous acceleration up to and past 30mph. The front left (FL) also starts to slip in the 20-30mph range, but the traction control catches it too. 
The overall result is a much more controllable launch and far fewer rocks being thrown up by the front wheels.</p><p style="text-align: left;">After finding traction control settings that I liked, I switched back to current settings that more closely match the actual traction limits: 60A front and 100A rear. This still gives a reasonable 0.45g launch, but with less likelihood of triggering the traction control on asphalt. I'd like to push to >0.5g, to match tinyKart's most extreme configuration, but that'll either require 120A on the rear or changing the gear ratio a bit. At 60A / 100A, the front motors still share enough of the load that the rear motors stay at healthy temperature after some acceleration runs:</p><p style="text-align: left;"></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJYQ-19H8n1XDjK4_o979xE8Bd63OEeUhvRonZ-SwyZeOvLSLnzJo98s7Jbuw8tZNre98CA6aBBSJMl39VB4rJK5GeBOM49QZ-m5VeFarN9LSMv1_7vDZf7cgmfHSAr-GcyGzjCgcxQc4/s1440/tc97.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1080" data-original-width="1440" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJYQ-19H8n1XDjK4_o979xE8Bd63OEeUhvRonZ-SwyZeOvLSLnzJo98s7Jbuw8tZNre98CA6aBBSJMl39VB4rJK5GeBOM49QZ-m5VeFarN9LSMv1_7vDZf7cgmfHSAr-GcyGzjCgcxQc4/w640-h480/tc97.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Rear motors are doing most of the work, but...</td></tr></tbody></table><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEie37dGZ135or-3iC1gCtxzbXjd3zvQIamTUoZLWnv6RXzp0b4dKVp55StYYm_V4AZy4sgQ15OfzFTAyLC4G2vnuKo7X1GkmbNUooab6WigDK7V4J5Y5IeBbmAQvyhHAIYfGD-r3RxOutk/s1440/tc96.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1080" data-original-width="1440" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEie37dGZ135or-3iC1gCtxzbXjd3zvQIamTUoZLWnv6RXzp0b4dKVp55StYYm_V4AZy4sgQ15OfzFTAyLC4G2vnuKo7X1GkmbNUooab6WigDK7V4J5Y5IeBbmAQvyhHAIYfGD-r3RxOutk/w640-h480/tc96.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">...they are at a reasonable temperature.</td></tr></tbody></table><br />And finally I did some less structured testing by just driving through the gravel corner in my parking lot and intentionally adding throttle to induce slip. It behaves pretty well, slipping and oversteering about the right amount to be controllable but still fun:<div><br /></div><div><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/EdQ26vxMblk" title="YouTube video player" width="640"></iframe></div><div><br /></div><div>I think at this point most of the handling bottlenecks are back on the mechanical side. There's a small amount of backlash in the steering column that definitely exaggerates the residual torque steer, especially at high speeds. It's almost all coming from the U-joint, which I may try to shim or replace with one with tighter tolerances. Other than that, I need to do some suspension geometry tweaking to improve handling of lateral transients. Speaking of which, here's one last data capture. 
See if you can figure out what's going on here...</div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYJF2s4iZZ23vq67hn4LED8aOPskF0KZovDts46zO3zVx295cuENferNij8TaXBNtQSesqqJPvndMxRkZJFxwKJXsmRsnv8cTG87SsOuJmGqoUu8M5jR-M5g92CaMolNT_0KhIhyphenhyphen5v_lk/s1799/tc02.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1799" data-original-width="1384" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYJF2s4iZZ23vq67hn4LED8aOPskF0KZovDts46zO3zVx295cuENferNij8TaXBNtQSesqqJPvndMxRkZJFxwKJXsmRsnv8cTG87SsOuJmGqoUu8M5jR-M5g92CaMolNT_0KhIhyphenhyphen5v_lk/w492-h640/tc02.png" width="492" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Mystery data log.</td></tr></tbody></table>Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com2tag:blogger.com,1999:blog-8200098102909041178.post-48050271458694333842021-08-15T20:05:00.000-04:002021-09-20T11:33:23.190-04:00TinyCross: 4WD 80A Data Logging<p>It's been a long time since I did a proper test drive with TinyCross, although I've taken it out just for fun a few times. Since I completed the <a href="http://scolton.blogspot.com/2021/08/tinycross-weight-and-width-reduction.html">weight/width reduction pass</a> last week, I wanted to get it out again and do some proper data logging in 4WD, with the peak current set to 80A for all four motors. 
This is still below the ultimate target of 100-120A (for short bursts), but plenty for parking lot testing.</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh12KxTTTyG9aI5YuQI3DMFJxLUWkXY44dsriMHVLBHm3MpO8ht7ywQzf4PZmstQ2hCnNt-S6YKFTIF-rGs8OxMOljKY_K3on2Qgr3RoM3A6eQEnF_-s7eO6AVSOCPYCCYKVoLu9Dp8p_0/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="2048" data-original-width="1215" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh12KxTTTyG9aI5YuQI3DMFJxLUWkXY44dsriMHVLBHm3MpO8ht7ywQzf4PZmstQ2hCnNt-S6YKFTIF-rGs8OxMOljKY_K3on2Qgr3RoM3A6eQEnF_-s7eO6AVSOCPYCCYKVoLu9Dp8p_0/w379-h640/tc93.jpg" width="379" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Really enjoying the extra 2" of clearance - I can get through most of the "doors" in my building now.</td></tr></tbody></table><p></p><p>I had to inflate the tires, but amazingly the air shocks don't seem to have leaked at all after a year of neglect. And they still do a pretty impressive job of soaking up the awful topography of my parking lot.</p><p><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/XDjnYi42X1M" title="YouTube video player" width="640"></iframe></p><p>I wanted to do some more thorough data logging in 4WD to characterize some of the issues I've felt while just driving around for fun. The steering wheel PCB collects data from the front and rear motor drives over CAN, appends some of its own data, and writes the whole thing to a microSD card. When I first set this up, I just had it overwrite the existing data log every power cycle. 
But in the couple of years since I set that up, I've had to <a href="https://scolton.blogspot.com/2019/11/zynq-ultrascale-fatfs-with-bare-metal.html">master FatFs</a>. So setting it up to create new files on the fly without messing up any of the real-time stuff was an easy upgrade.</p><p>Here's what a 4x80A launch looks like:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn1AoiTrgVGp20NRQykuhhg999lD3uTm_jY7MSHTQ7IdnIjp6U1ggplaSq0j3KBjGTNY2LnWpcvAAcZDXdxGz35qYiHxNJQbFyb40acbHuepXU3z84tindKb53HViEqX4Baa5UCoIgkd4/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1939" data-original-width="1209" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhn1AoiTrgVGp20NRQykuhhg999lD3uTm_jY7MSHTQ7IdnIjp6U1ggplaSq0j3KBjGTNY2LnWpcvAAcZDXdxGz35qYiHxNJQbFyb40acbHuepXU3z84tindKb53HViEqX4Baa5UCoIgkd4/w399-h640/launch_4WD_80A_12V.png" width="399" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">4x80A launch (attempt).</td></tr></tbody></table><br />The main problem is pretty obvious from the data: the front wheels just don't have enough weight on them to support 80A. If there's even a little bit of a loose surface, one or both front wheels will lose grip. Excessive wheel slip is inefficient, so the peak acceleration isn't as high as it could be if all four wheels hugged their grip limit. But front wheel slip is especially bad because it results in massive torque steer. (I actually used this to make <a href="http://scolton.blogspot.com/2019/11/tinycross-4wd-and-servoless-rc-mode.html">remote-control TinyCross</a>.) 
It also has a habit of throwing rocks up into the driver's face.<p></p><p><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/z502fdFy-yo" title="YouTube video player" width="640"></iframe></p><p>I've even debated whether the front wheel drive on TinyCross is worth the extra weight and complexity. <a href="http://scolton.blogspot.com/p/cap-kart.html#tinykart">tinyKart</a> handled pretty well with RWD only: I could put in a controlled amount of oversteer with the throttle. In fact, I got a chance to test out how TinyCross feels with RWD only when I had - let's call it an 80/20 failure - on the front right upright:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYCRkEXaOWWCbNN61r3a1fcSchcpP1gHjec44NRyvGIW6-xFpKJ7jFwnWRcMg2PJslFACtz04h0N7OitiEYUgNkwKrvtI-yTn0S4uQB3V-H7jJuPbykC8cmmwqjfedKaduRzUZMcQaF6g/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYCRkEXaOWWCbNN61r3a1fcSchcpP1gHjec44NRyvGIW6-xFpKJ7jFwnWRcMg2PJslFACtz04h0N7OitiEYUgNkwKrvtI-yTn0S4uQB3V-H7jJuPbykC8cmmwqjfedKaduRzUZMcQaF6g/w640-h480/tc94.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Always check your T-nuts! The only real casualty was the encoder wire.</td></tr></tbody></table><br />Although I was able to fix the mechanicals with the single hex driver I always bring with me, a few crimps pulled out of the encoder wire and I didn't have the tools to fix it. 
I could probably add a failover to sensorless operation for individual motors, but I'm not sure how well it'd work on the front motors, again because of torque steer. (Both fronts would have to agree to not produce torque until the flux estimator converges on the sensorless motor.) For now, I just removed power from the front drive.<p></p><p>In terms of handling, RWD works fine. But the launch is a mere 0.25g at 2x80A. There's no slip, and even if there was, it wouldn't matter as much on the rear since it doesn't induce torque steer.</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7JJ9sSoz9bSNl5rmhSiyUBPs08irLXoqaIDsJ4OwGvlGszha5OFHmyvF0PXw9ryL-kESON8gOjhoHwxB1kRlPp0rBygRP9MGS3Vt_7Zhzcpj9s5q6FG9Pl4hkXAP8sY0sSaibp-l8dsY/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1929" data-original-width="1480" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7JJ9sSoz9bSNl5rmhSiyUBPs08irLXoqaIDsJ4OwGvlGszha5OFHmyvF0PXw9ryL-kESON8gOjhoHwxB1kRlPp0rBygRP9MGS3Vt_7Zhzcpj9s5q6FG9Pl4hkXAP8sY0sSaibp-l8dsY/w491-h640/launch_RWD_80A_12V.png" width="491" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">2x80A launch.</td></tr></tbody></table><br />Even at 120A, this would only be about a 0.4g launch. tinyKart, in its last and somewhat scary configuration, was hitting about 0.5-0.6g. Part of this is down to gearing: TinyCross, with 12.5" wheel, has to be geared for higher speeds. I could always ditch the front motors and switch to 80mm motors with more torque on the rear. But I think that goes against the spirit of TinyCross. Having full independent suspension and 4WD has always been the point.<p></p><p>So I think I'll finally have to dive in to writing some simple traction/launch control software. 
Just looking at the 4x80A launch data, it's easy to pick out the wheel that's slipping and imagine that the software could just fold back the current command to that wheel as its speed starts to diverge from the other three. But there are so many logical knots on the path to generalizing that to 4WD, where any subset of the four wheels could be slipping, that it makes my brain hurt to even think about.</p><p>There are some amazing technical <a href="https://www.tesla.com/blog/spin-stops-here">blog</a> <a href="https://www.tesla.com/blog/slip-sliding-away">posts</a> from the early days of Tesla (back when it was more of an engineering project than a consumer electronics device) where they talk about how it took months to go from a controller with excellent high-bandwidth torque control to functioning traction control, and even then a lot of it was subjective. One observation I really liked:</p><blockquote><p><span face=""Gotham Book", -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif" style="background-color: black; color: #757575; font-size: 14px;"><i>This type of feedforward traction control can be hugely beneficial; for instance, it's much safer to avoid wheelspin altogether than react to it.</i></span></p></blockquote><p>This was regarding a lateral G observer that was fed into the friction model that the traction control software used to help limit motor torque to what it thought the tires could reasonably handle. This way, wheel slip might be limited to cases where there truly is a sudden drop in friction at one wheel. I think that should be the goal for this as well. I might even be able to just do slip detection on the front wheels. 
It'll be an interesting experiment, at least.</p><p></p>Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com0tag:blogger.com,1999:blog-8200098102909041178.post-56920304292855114682021-08-07T11:16:00.001-04:002021-08-07T11:16:15.270-04:00TinyCross Weight and Width Reduction Pass<p>It's summer, which means it's time to work on go-karts. This round, it's a modification to <a href="http://scolton.blogspot.com/p/cap-kart.html#tinycross">TinyCross</a> that I've been wanting to make ever since I <a href="http://scolton.blogspot.com/2019/09/tinycross-first-test-drive-and.html">first got it together</a> about two years ago. The main issue is that I designed it around stock <a href="https://electricscooterparts.com/electricscooterwheels.html#12-1/2%20Wheels">rear 12.5" scooter wheels</a>. These are almost symmetric and have threading on both sides of the hub that are meant for mounting the drive sprocket and brake disk. But - and this is maybe my favorite bit of packaging on this project - I've got the brake and drive sprocket both mounted to the inboard side, with the brake caliper sitting right in the middle of the belt:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjZInTH7Snqi9kuVbO6_2aCw1FDj0hdQTw27QfmUDP2N0jCCsm92FNtfFDJExwTdHKuDH8gsFsE19Jc5jOjDzROevc_UQLN0XrVk2JkD9zqpcCClyPZyaA13hammEYLOayzNpoh05sJag/s2048/tc47.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1650" data-original-width="2048" height="516" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjZInTH7Snqi9kuVbO6_2aCw1FDj0hdQTw27QfmUDP2N0jCCsm92FNtfFDJExwTdHKuDH8gsFsE19Jc5jOjDzROevc_UQLN0XrVk2JkD9zqpcCClyPZyaA13hammEYLOayzNpoh05sJag/w640-h516/tc47.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: 
center;">The brake and drive sprocket are both mounted to the inboard side of the wheel, making the outboard side of the hub dead weight.</td></tr></tbody></table><p>This makes the extended length of the outboard side of the hub useless. But, I left it as stock for simplicity. I figured if I ever needed to replace the wheels, it would be easier to drop in a new stock 12.5" wheel. But, this drives the overall width of the kart up to about 35" for no good reason:</p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkOQvFSUcoMrxGRvio06ZFFrPInsCfakz_ECXhtylDnF5QB4eMR4K-KHx0GzMbJXU9xSpqtpJVZssmz4OJOQYh5qY2mCBx97-uyoAS3LJYacUPRdBmw7hk2uV-kP8ZlOsbhu10dVUinW0/s1829/tc17.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1120" data-original-width="1829" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkOQvFSUcoMrxGRvio06ZFFrPInsCfakz_ECXhtylDnF5QB4eMR4K-KHx0GzMbJXU9xSpqtpJVZssmz4OJOQYh5qY2mCBx97-uyoAS3LJYacUPRdBmw7hk2uV-kP8ZlOsbhu10dVUinW0/w640-h392/tc17.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The total width, about 35", is driven in part by the symmetric 12.5" wheel hubs.</td></tr></tbody></table><p></p><p>It's also unnecessary weight, especially factoring in the beefier 5"x5/8" hex standoffs I used to close the structural loop around each wheel. I figured I could eliminate 2" off the total width and about 1lb off the total weight if I just bit the bullet and re-machined the 12.5" wheel hubs. It still wouldn't fit through a 32" door frame, but it would be easier to wiggle through indoor spaces and fit in my car. 
It also would just look a lot nicer.</p><p>One of the reasons I put off this modification for so long is because I thought it would involve disassembling the entire wheel module, but it turns out that it's just barely possible to remove the wheel without removing the motor. I can take off the brake caliper and slip the belt off the pulley to give it just enough slack to pull the wheel off the spindle shaft. I don't remember intentionally designing it this way, but let's pretend I did. It'll be good for fixing flats, too. </p><p>The next obstacle to overcome was removing the outboard bearings. I didn't have a bearing puller on-hand, but I discovered that an 80/20 T-Nut (which I obviously have hundreds of...) is just about exactly the right size to push on the outer race of these bearings. So I came up with this improvised tool:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBJMA1LHCMXpd3kuqDEqNUBvdijE3OCpYAU9rC59KhbOOT0G7CCD7uj_OLEiz6jLysv0EIHNT7-BGpp__CUPchHWh2I5b2UX3jO_zTKWRQiPPa61SqjDELx1UQBhSB4xnrMCX3kn0vkM8/s2048/tc83.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBJMA1LHCMXpd3kuqDEqNUBvdijE3OCpYAU9rC59KhbOOT0G7CCD7uj_OLEiz6jLysv0EIHNT7-BGpp__CUPchHWh2I5b2UX3jO_zTKWRQiPPa61SqjDELx1UQBhSB4xnrMCX3kn0vkM8/w640-h480/tc83.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Improvised bearing pusher.</td></tr></tbody></table><p>The tool is built <i>inside</i> the hub by slipping the 80/20 T-Nut through the bearing, flipping it horizontal, then dropping in the hex standoff from the other side. After fastening it together with a 1/4-20, it's ready for the press. 
Luckily, I didn't Loctite these bearings in, so they pressed out pretty easily.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjo_Sl81susGkM5UhT7P3L-BKn9hqEg1ycV15AT4O3kim2PfklSEOO4-9baHmBznM3h0aci1e5pLjQ6d56nxt-qmwPtcB20BINyAp9_QVLuhQztJfi32U42FJWbYuQ1oNMVUMm9UbLNc4/s2048/tc84.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjo_Sl81susGkM5UhT7P3L-BKn9hqEg1ycV15AT4O3kim2PfklSEOO4-9baHmBznM3h0aci1e5pLjQ6d56nxt-qmwPtcB20BINyAp9_QVLuhQztJfi32U42FJWbYuQ1oNMVUMm9UbLNc4/w640-h480/tc84.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Pressing out the bearings using the makeshift pusher.</td></tr></tbody></table><p>The 12.5" wheels don't fit on my <a href="https://littlemachineshop.com/products/product_view.php?ProductID=5100&category=1271799306">mini lathe</a>, but they do just barely fit on my <a href="https://littlemachineshop.com/products/product_view.php?ProductID=4190&category=1387807683">mini-mill</a>. I knew this ahead of time, so I bought a 22mm end mill specifically for cutting the new bearing pocket. (One of the nice features of this mini-mill is its use of a regular R8 spindle, so it's possible to get large tools for it.) I did have to get a little creative with fixturing. The brake disk is bolted down to a piece of 80/20, which is clamped in the mill. 
But, to make things stiff enough, I also had to ground the rim itself directly to the bed with some long clamping screws.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKicqFmiQEjtEHfYw8BnnQWUnH2TsUFmxCJwXJWIOvTIaq2TxQq973xHLyhxCoB_FiiOuFtlgmNshXtZ18qrs2ekQMzT7ibLEYZ4C4uSHf9Nuf_ZXlJ-AJu0rLD8JulMxNgSoscFVGG-U/s2048/tc88.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKicqFmiQEjtEHfYw8BnnQWUnH2TsUFmxCJwXJWIOvTIaq2TxQq973xHLyhxCoB_FiiOuFtlgmNshXtZ18qrs2ekQMzT7ibLEYZ4C4uSHf9Nuf_ZXlJ-AJu0rLD8JulMxNgSoscFVGG-U/w640-h480/tc88.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Clamping situation: not great, not terrible.</td></tr></tbody></table><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqea5kNOWyWUt8_ONoTngZ__FuDk6xFoLHkNSDlfCaJxnjhKCQ7LJeSMNFDBP3zmhdAJm5YaSzpPlG3acHmEMElhPiRtluuZYSjYOEGN-gTAFInTtYQ7Zf1rhyphenhyphen05ATWjY9bJ3zvbiqctU/s2048/tc89.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqea5kNOWyWUt8_ONoTngZ__FuDk6xFoLHkNSDlfCaJxnjhKCQ7LJeSMNFDBP3zmhdAJm5YaSzpPlG3acHmEMElhPiRtluuZYSjYOEGN-gTAFInTtYQ7Zf1rhyphenhyphen05ATWjY9bJ3zvbiqctU/w640-h480/tc89.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Pretty sure this mill was never meant to hold a tool this big.</td></tr></tbody></table><p>I decided 
to extend the bearing pocket by 1.000" first, before machining down the hub by 1.000". I'm not sure if this was the best order of operations, but it all went pretty smoothly. Here's 7:45 of relaxing slow-motion bearing pocket cutting, captured at 4K 420fps with my <a href="https://freeflysystems.com/wave">Wave</a>:</p><div class="separator" style="clear: both; text-align: center;">
<iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/F2xeFx_fLQE" title="YouTube video player" width="640"></iframe>
</div><p>These hubs are cast aluminum, so it wasn't surprising to find that there were some voids in the newly-machined faces. They're nothing that I think would affect the structural integrity, but it's an interesting consequence of the manufacturing process.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-NsEF-4nz8HD18iz9HfXtuh1tfSQcAbmzGbqOBUlyiH7HZOJX0mdMCBanmBpzeBWncU3X3BlZSrMfQCO2qEs6iBZi8bdVvRYynKQaPCn1ypWZT3wzFTTylpPr2nRRFGEvkhgYBT6p3FQ/s2048/tc90.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-NsEF-4nz8HD18iz9HfXtuh1tfSQcAbmzGbqOBUlyiH7HZOJX0mdMCBanmBpzeBWncU3X3BlZSrMfQCO2qEs6iBZi8bdVvRYynKQaPCn1ypWZT3wzFTTylpPr2nRRFGEvkhgYBT6p3FQ/w640-h480/tc90.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Casting voids exposed by re-machining the hubs.</td></tr></tbody></table><p>One of the downsides of doing this operation on the mill is that I didn't have a choice of machining the new bearing pocket to an interference fit. But I was pleased to see that, with all the extra effort put into stiffening the fixture, it was still a nice slip fit. 
I can always add Loctite later if needed.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0TSOSKcAGQj76WECEDPrstczHTz18_5pp48ZawBtc9SMWWZ1uD19e-en1sETtNvWG7RoL28jEGhlJAswPhokhnU92aVP8kf3ZZvptmvjfgRt5fMn2e6nCyT61RJ1b-fdrbDu97j_BSbY/s2048/tc91.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0TSOSKcAGQj76WECEDPrstczHTz18_5pp48ZawBtc9SMWWZ1uD19e-en1sETtNvWG7RoL28jEGhlJAswPhokhnU92aVP8kf3ZZvptmvjfgRt5fMn2e6nCyT61RJ1b-fdrbDu97j_BSbY/w640-h480/tc91.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">After re-machining, the bearings are now a nice slip fit.</td></tr></tbody></table><p>That just leaves the 7075 spindle shafts, which also needed to be shortened by 1.000". Cutting off the extra length and extending the outboard mounting hole was a quick task for the mini-lathe. 
Then, it just needed to be re-tapped.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-FBSVwMPVUvyY6Uk8Z6qGfRrPZ1FZX36MGJR1TyxVrJQKrkHoo4muhReCxPtK1VFU3RjTJxlvcPpHoq2sC3pxqvUOP4yV4Gyq16w6X1ULAJwrtDpZubONFcLxG3XevpUZOB5Ux-SQj7o/s2048/tc85.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-FBSVwMPVUvyY6Uk8Z6qGfRrPZ1FZX36MGJR1TyxVrJQKrkHoo4muhReCxPtK1VFU3RjTJxlvcPpHoq2sC3pxqvUOP4yV4Gyq16w6X1ULAJwrtDpZubONFcLxG3XevpUZOB5Ux-SQj7o/w640-h480/tc85.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Shortening the 7075 spindle shafts...</td></tr></tbody></table><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgA7SbJRIdyF_eVsb_no0dJQQaHrSmWWlW-zQpG7Hn_9zx5yXUkSvm29mVBg00fWr8pEkzCf_ziJGXA5ifP-3yvP3pJ32l35b1cgr-7EqbGv3BlSTsgISi0GM5FrLL-bupCBNJt5hyCrhU/s2048/tc86.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgA7SbJRIdyF_eVsb_no0dJQQaHrSmWWlW-zQpG7Hn_9zx5yXUkSvm29mVBg00fWr8pEkzCf_ziJGXA5ifP-3yvP3pJ32l35b1cgr-7EqbGv3BlSTsgISi0GM5FrLL-bupCBNJt5hyCrhU/w640-h480/tc86.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">...and re-tapping.</td></tr></tbody></table><p>Finally, I put everything back together, substituting much lighter 4"x1/2" hex standoffs to span the gap at the top of each wheel 
module. The total process took only about two hours per wheel, including disassembly and reassembly. So something I have put off for two years was really only one day of work...typical. Anyway, the final result is a kart that's now 2" narrower and about 1lb lighter.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4nm8JCOnDD-OLfM6tq5NrsUIkOIt3LMw3B6_3nE5WC6oIVkepVLy93RlRtuZvgl-S7xpAQCxUliOfArEhMGSC0fya5Gf0MNyBDAClBWpTGyi5ScesU7TwzJQPlBhbB_G0cLdJLjRJ09M/s2048/tc92.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="2048" data-original-width="1955" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4nm8JCOnDD-OLfM6tq5NrsUIkOIt3LMw3B6_3nE5WC6oIVkepVLy93RlRtuZvgl-S7xpAQCxUliOfArEhMGSC0fya5Gf0MNyBDAClBWpTGyi5ScesU7TwzJQPlBhbB_G0cLdJLjRJ09M/w610-h640/tc92.jpg" width="610" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The pile at the front is roughly the weight saved. (5"x5/8" standoffs were replaced by 4"x1/2", but an equivalent amount of weight was taken out of each hub.)</td></tr></tbody></table><p>I have a few more tasks I want to do on this kart. It still needs to be fully weather-proofed. I have a plan for enclosing the motor drives, but need to figure out something for the steering wheel PCB. I may redesign that board from scratch since I don't think I'll ever get to using the battery balancing circuit on it. It can be much smaller and simpler without that. Lastly, there's always motor drive stuff to fiddle with to squeeze out more torque and/or speed.</p><p>For now, though, I'm glad it's a little lighter and a lot narrower. 
It'll make deployment that much easier, which ultimately means more actual testing and use.</p>Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com0tag:blogger.com,1999:blog-8200098102909041178.post-11412182198490830622020-04-18T16:44:00.000-04:002020-04-19T00:01:36.094-04:00Full-Speed CMV12000 Subsampled Readout: 1440fps 1080pNow that I've got a <a href="https://scolton.blogspot.com/2019/12/continuous-38gpxs-4k-400fps-image.html">continuous multi-Gpx/s image capture pipeline</a> running, it's time to rearrange some things to break the 1000fps barrier:<br />
<br />
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/DVDJMog8FxU" width="640"></iframe><br />
<br />
For this clip I'm using the CMV12000's X/Y subsampling mode to trade resolution for frame rate, hitting 1440fps at 2048x1088. The overall pixel rate is a little lower than in 4K (3.2Gpx/s vs. 3.8Gpx/s), so it's feasible to send this through the same Zynq Ultrascale+ capture pipeline, with some modifications, to record <span style="color: yellow;">continuously</span> to an NVMe SSD. With ~4:1 <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html">wavelet compression</a>, this writes about 1GB/s to the drive, up to 1000s (16.7min) for a 1TB drive. That would be 16.7 <i>hours</i> of playback at 24fps, though. I figured 30 seconds real-time and 30 minutes of playback was enough water droplet footage for now.<br />
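The throughput figures above fall out of simple arithmetic. A small sanity-check sketch (10-bit pixels and ~4:1 compression assumed, as stated):

```c
/* Cross-check of the data-rate figures above: pixel rate from frame
 * geometry and frame rate, then the compressed SSD write rate assuming
 * 10-bit pixels and ~4:1 wavelet compression, as in the text. */
static double gpx_per_s(int w, int h, int fps)
{
    return (double)w * h * fps / 1e9;   /* gigapixels per second */
}

static double ssd_write_gb_per_s(double gpx)
{
    return gpx * 10.0 / 8.0 / 4.0;      /* 10 b/px -> bytes, then 4:1 */
}
```

Plugging in: gpx_per_s(2048, 1088, 1440) ≈ 3.21 and gpx_per_s(4096, 2304, 400) ≈ 3.77, and the compressed write rate works out to ≈ 1.0 GB/s, hence roughly 1000 seconds per terabyte.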
<h4>
CMV12000 Subsampling</h4>
In a <a href="https://scolton.blogspot.com/2019/12/continuous-38gpxs-4k-400fps-image.html">previous post</a>, I covered the pipeline architecture for continuously recording 400fps 4K video from a CMV12000 image sensor to an NVMe SSD. That was a 4096x2304 (16:9) frame, slightly larger than 4K UHD. The sensor's native resolution is 4096x3072 (4:3), which it can read in at 300fps. By reading in fewer rows, the maximum frame rate is increased. Going wider than 16:9 would allow frame rates higher than 400fps, but since the sensor always reads in full 4096px-wide rows, the speed gain is only linear.<br />
<div>
<br /></div>
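Since full-width rows are always read, the row-count/frame-rate trade described above is linear and can be modeled in one line (the small fixed per-frame overhead is ignored here):

```c
/* Full-width readout frame rate as a function of rows read, anchored at
 * the sensor's native 300fps for the full 4096x3072 frame. Overhead per
 * frame is ignored, so this is an idealized model of the linear trade. */
static int full_width_fps(int rows)
{
    return 300 * 3072 / rows;
}
```

For the 4096x2304 (16:9) frame this gives back the 400fps figure from the previous post.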
<div>
To go much faster, it's necessary to read in fewer columns as well. Not all sensors can do this; reading whole rows may be baked into the hardware architecture. The CMV12000 doesn't support arbitrary readout width, but it <i>does</i> support 2x subsampling. In this mode, every other four-pixel square (<a href="https://en.wikipedia.org/wiki/Bayer_filter">Bayer</a> group) is skipped in both the X and Y directions. The remaining squares are transmitted on the LVDS channels using an alternate packing:<br />
<br /></div>
<div style="text-align: left;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUHR_JNG1oLfivmo_voboZejp7O5tye4u3WQU_3AjDujeNOz3C8spvC_V9zxxnKs9AygVPM7EHr7NUVM2OMyVT400Q4KaDbwr10Rs1qd5_ggSziNoXw2MKiMEVX0PSUPmHwtXoE3MyAxU/s1600/d31.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="571" data-original-width="1600" height="228" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUHR_JNG1oLfivmo_voboZejp7O5tye4u3WQU_3AjDujeNOz3C8spvC_V9zxxnKs9AygVPM7EHr7NUVM2OMyVT400Q4KaDbwr10Rs1qd5_ggSziNoXw2MKiMEVX0PSUPmHwtXoE3MyAxU/s640/d31.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">CMV12000 subsampled readout (color, X-flipped).</td></tr>
</tbody></table>
Each of the 64 LVDS channels alternates between two rows, with the lower 32 channels handling two even (G1/R1) rows and the upper 32 channels handling two odd (B1/G2) rows. This alternate data packing allows the subsampled image, with 1/4 as many total pixels, to be read out nearly 4x faster. There is a small amount of extra overhead time that makes the actual gain not quite 4x.</div>
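The skip pattern itself can be sketched in a few lines. Which phase of the Bayer groups the sensor actually keeps (even vs. odd group indices) is an assumption here, made only for illustration:

```c
#include <stdbool.h>

/* Sketch of the 2x subsampling pattern described above: every other 2x2
 * Bayer group is skipped in both X and Y, so 1/4 of the pixels survive.
 * Keeping the even-indexed groups is an assumption for illustration. */
static bool kept_by_subsampling(int x, int y)
{
    return ((x / 2) % 2 == 0) && ((y / 2) % 2 == 0);
}

/* Coordinate of a surviving full-res pixel in the packed half-size output
 * (two kept columns/rows out of every four). */
static int subsampled_x(int x) { return (x / 4) * 2 + (x % 2); }
static int subsampled_y(int y) { return (y / 4) * 2 + (y % 2); }
```

Counting survivors across a 4096px row gives 2048, consistent with the 2048-wide subsampled frame.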
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Subsampling drops the resolution from 4K to 2K but preserves the crop factor of the sensor, since the full width and height are still used. This is preferable to cropping a 2048px-wide image out of the middle. It doesn't give any increase in sensitivity though; to do that would require binning (averaging the larger 4x4 squares to generate the final 2x2). The CMV12000 does support binning, but the overhead is so bad that you might as well read out the 4K image and do it in post (assuming you have the data storage bandwidth, which I certainly do). So to go ~4x faster, I will need ~4x more light.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjapueHOJxckKJQkGpFka9rbmMEHa48pHPHydEPLfhhJwawwDDtR9RTlupYb6yJftKfFaToO6yQH3ml2CcKJiGaLwG_yD-2iZBXwdDrobl1koL7pKcwcYJum0xAVNDoNHxL6lBC8aQ2kk0/s1600/d33.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="461" data-original-width="1600" height="184" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjapueHOJxckKJQkGpFka9rbmMEHa48pHPHydEPLfhhJwawwDDtR9RTlupYb6yJftKfFaToO6yQH3ml2CcKJiGaLwG_yD-2iZBXwdDrobl1koL7pKcwcYJum0xAVNDoNHxL6lBC8aQ2kk0/s640/d33.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Light sensitivity of subsampling vs. binning.</td></tr>
</tbody></table>
</div>
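Binning in post, as suggested above, is just a same-color average over each 4x4 block of the full-resolution frame. A minimal sketch, assuming a row-major array with one value per photosite:

```c
#include <stdint.h>

/* Bayer-aware 2x2 binning done in post: each output photosite is the
 * average of the four same-color photosites in the corresponding 4x4
 * block of the full-res frame. Row-major uint16_t layout is an
 * assumption for illustration. */
static void bin2x2_bayer(const uint16_t *in, uint16_t *out, int w, int h)
{
    for (int y = 0; y < h / 2; y++) {
        for (int x = 0; x < w / 2; x++) {
            int sx = (x / 2) * 4 + (x % 2);    /* same-color neighbors  */
            int sy = (y / 2) * 4 + (y % 2);    /* sit two sites apart   */
            uint32_t sum = (uint32_t)in[sy * w + sx]
                         + in[sy * w + sx + 2]
                         + in[(sy + 2) * w + sx]
                         + in[(sy + 2) * w + sx + 2];
            out[y * (w / 2) + x] = (uint16_t)(sum / 4);
        }
    }
}
```

Averaging four same-color samples is what recovers the sensitivity that plain subsampling gives up, at the cost of reading (and storing) the full 4K frame first.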
<div style="text-align: left;">
Before worrying about a shortage of photons, though, I first need to deal with a shortage of programmable logic. To fit everything on the XCZU4, my main bottlenecks are BRAMs and LUTs. I managed to add the <a href="https://scolton.blogspot.com/2020/03/hdmi-hard-way.html">decoder for HDMI output</a> with no increase in either by sacrificing the third wavelet stage. But I've known for a long time that the day would come when I would need to add 128 more <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#h26">Stage 1 horizontal cores</a> to handle the subsampled inputs.<br />
<br />
It might seem odd that <i>more</i> cores are needed to process a smaller image. Even at the higher frame rate, the pixel input rate is lower than in 4K. Surely the existing horizontal cores could time-multiplex to handle the data? But, the wavelet cores must operate on groups of <span style="color: yellow;">adjacent</span> pixels. In this case, adjacency describes the nearest horizontal pixels of the same color, since applying a difference operation to pixels of different colors would not have the desired result. And whatever the color, pixels from another row are not horizontally adjacent. Since each LVDS channel now services two color fields <i>and</i> two rows, it must feed four independent wavelet cores.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVJVb8RhIu2_QCWmk24j_MC0vwdka419Ck2JwlA-pp0ZfR7JWjHK3O_8Mw5JhZara4gEZxVN3c4nZxL4Ujdo6yw8dY_dhcm6J2wUqv42c1yTreMCPLN7I96FZ7_P3K32pEdL9sTIFH1Pg/s1600/d34.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="604" data-original-width="1600" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVJVb8RhIu2_QCWmk24j_MC0vwdka419Ck2JwlA-pp0ZfR7JWjHK3O_8Mw5JhZara4gEZxVN3c4nZxL4Ujdo6yw8dY_dhcm6J2wUqv42c1yTreMCPLN7I96FZ7_P3K32pEdL9sTIFH1Pg/s640/d34.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">In 2K mode, each LVDS channel feeds four independent Stage 1 horizontal cores.</td></tr>
</tbody></table>
So, the total number of Stage 1 horizontal cores doubles from 128 to 256. This jump has been on my mind since the early stages of the design, and I tried to optimize the horizontal cores as much as possible. A big part of this was reducing the operating pixel width from 16-bit to 12-bit, which brought the per-core LUT count down from 107 to 83. As this is the first stage of the pipeline, it's easy to verify that it won't saturate on 10-bit inputs. The horizontal cores operate in-line with the input using only distributed memory, so no additional BRAMs are required. But there's no way around the additional 10,000 or so LUTs, and that will bring me right up to the limits of this chip.<br />
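The arithmetic behind that estimate, as a quick sanity check (using the per-core figures from above):

```python
# Back-of-the-envelope LUT cost of doubling the Stage 1 horizontal cores,
# at 83 LUTs per core after the 16-bit -> 12-bit optimization.
cores_added = 256 - 128
added_luts = cores_added * 83              # cost of the new 2K-mode cores
saved_by_12bit = cores_added * (107 - 83)  # what the 12-bit change avoids on the new cores

assert added_luts == 10624                 # the "10,000 or so" LUTs
assert saved_by_12bit == 3072
```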
<br />
Since I knew there would be very few LUTs remaining for switching modes, I originally thought the 4K and 2K modes might have to exist as entirely separate PL configurations, their bitstreams loaded as needed by software. I've seen other cameras do this; it looks like a software reset when changing capture formats. And while it only takes a few seconds, I really dislike the workflow and the idea of maintaining two configurations.<br />
<br />
So, I spent some time looking at the actual differences between modes at all stages of the pipeline and decided that I could and should build the switch. I had this mode change in mind early in the design, so I tried to minimize the number of touch points required in each of the modules to switch between 4K and 2K. Even so, there are a number of small changes needed in the Wavelet, Encoder, and HDMI modules. They are collectively driven by a master switch in each module's AXI slave registers. I'll go through them in pipeline order below.<br />
<h4>
Wavelet Stage 4K/2K Switch</h4>
<div>
First, no actual switching is required to distribute the inputs to the Stage 1 horizontal cores; each channel always connects to the same four cores. Instead, the cores are gated by a master pixel counter based on their color and, when in 2K mode, also their row. The 2K mode switch turns on this extra enable gate and offsets the counter that handles first/last row states by one bit, to account for the half-width rows. Miraculously, this did not add any LUTs to the horizontal cores. I assume the extra logic just got merged into existing smaller LUTs...I'll take it.</div>
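A rough sketch of that gating scheme, with the caveat that the bit positions here are my own illustration, not the actual HDL:

```python
# Toy model of the Stage 1 horizontal core enable gating. Each LVDS channel
# drives a fixed set of cores; only one is enabled per pixel clock, selected
# from the master pixel count. Bit assignments are illustrative.
def enabled_core(px_count: int, mode_2k: bool) -> int:
    color = px_count & 1         # channel alternates between two color fields
    if not mode_2k:
        return color             # 4K: two cores per channel
    row = (px_count >> 1) & 1    # 2K: channel also interleaves two rows
    return (row << 1) | color    # 2K: four cores per channel

# 4K mode ping-pongs between two cores; 2K mode cycles through four.
assert [enabled_core(i, False) for i in range(4)] == [0, 1, 0, 1]
assert [enabled_core(i, True) for i in range(4)] == [0, 1, 2, 3]
```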
<div>
<br /></div>
<div>
The most complicated part of the switch happens next, at the interface between the Stage 1 horizontal and vertical cores. Instead of distributing outputs from four adjacent horizontal cores into a single row of a vertical core BRAM, the 2K interface distributes outputs from eight horizontal cores into two rows of a vertical core BRAM. Since the rows are half as wide, this takes the same number of pixel clock cycles (128). So, as will be the case at many points in the pipeline, this just boils down to rearranging the bits of the BRAM write address:<br />
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZmxTjD9PM8y5UBTWaXQvH4oKDgvd5GGoB-59vnduTnfUBx0WLJ4_mHK5TxP4cEmi5TA0rrp91obWI9iPQMMCoe_hUFo6La787x5RDkE0xs9fyKPVial6UoSXnIgc96CbrWJHPF_XkEBM/s1600/d35.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="1431" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZmxTjD9PM8y5UBTWaXQvH4oKDgvd5GGoB-59vnduTnfUBx0WLJ4_mHK5TxP4cEmi5TA0rrp91obWI9iPQMMCoe_hUFo6La787x5RDkE0xs9fyKPVial6UoSXnIgc96CbrWJHPF_XkEBM/s640/d35.png" width="572" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Aspect ratio change and read/write addressing of the Stage 1 vertical core BRAMs in 4K vs. 2K mode.</td></tr>
</tbody></table>
Conceptually, the aspect ratio of the vertical core BRAM changes from 8 rows of 256px to 16 rows of 128px. The figure above shows where writes and reads occur in the BRAM at a given relative pixel count. Reads occur on half-counts since the Stage 1 vertical DWT operates at double px_clk frequency. The read address generator is also modified by the switch to account for the new aspect ratio. Only the eight most recent rows are actively written or read, so in 2K mode the BRAM is twice as big as it needs to be. The latency of the vertical core is also halved, since it's determined by the number of rows required to complete the vertical DWT operation. This will come into play later.</div>
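A minimal model of that write-address rearrangement, assuming a 2048-entry BRAM with a {row, column} address layout (the real bit order in the HDL may differ):

```python
# Vertical-core BRAM write address in 4K (8 rows x 256 px) vs. 2K
# (16 rows x 128 px) mode: same 2048 entries, different bit carving.
def wr_addr(row: int, col: int, mode_2k: bool) -> int:
    if not mode_2k:
        return ((row & 0x7) << 8) | (col & 0xFF)  # addr = {row[2:0], col[7:0]}
    return ((row & 0xF) << 7) | (col & 0x7F)      # addr = {row[3:0], col[6:0]}

# Both modes address the same 2048-entry space, with no collisions:
assert wr_addr(7, 255, False) == wr_addr(15, 127, True) == 2047
assert len({wr_addr(r, c, True) for r in range(16) for c in range(128)}) == 2048
```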
<div>
<br /></div>
<div>
The Stage 1 vertical core buffers the alternating-row 2K mode inputs into a single-row format that's compatible with the rest of the pipeline, so changes after this point are relatively minor. Each Stage 1 vertical core feeds its output row to a Stage 2 horizontal core. The only modification required there is to offset the counter that handles first/last row states by one bit, to account for the half-width rows. Then, the Stage 2 vertical core just needs some more BRAM address rearrangement:<br />
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhUTb6GlFcFE0thLkQEPaYU3mwiUBVyIFG9BapvdltSUk4wJrdxviqu7uqHuasQyqxFHVCHLtfTDrhBRCiUdFcZMT5zMs3XqOFDXm7olEYbCDspAxsYaHf27SkfDiWCWGVPZeU8ZuS5gI/s1600/d36.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="1430" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhUTb6GlFcFE0thLkQEPaYU3mwiUBVyIFG9BapvdltSUk4wJrdxviqu7uqHuasQyqxFHVCHLtfTDrhBRCiUdFcZMT5zMs3XqOFDXm7olEYbCDspAxsYaHf27SkfDiWCWGVPZeU8ZuS5gI/s640/d36.png" width="572" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Aspect ratio change and read/write addressing of the Stage 2 vertical core BRAMs in 4K vs. 2K mode.</td></tr>
</tbody></table>
</div>
<div>
Like the Stage 1 vertical core BRAM, the aspect ratio is changed from 8 rows of 256px to 16 rows of 128px. But since the first stage already rearranged things into single rows, the write addressing here is more straightforward: In both 4K and 2K mode, only a single row is filled at a time (by two adjacent Stage 2 horizontal cores). The row width is halved, but there's no write interleaving between the two rows. Ultimately, this is just a different arrangement of the write address bits. The read address generator is similarly modified to grab the right data for the Stage 2 vertical DWT. As with the Stage 1 vertical core, the BRAM is twice as big as it needs to be, and the latency is halved.<br />
<h4>
Encoder 4K/2K Switch</h4>
</div>
<div>
The compression stage doesn't care about the aspect ratio change, since the only context it uses for <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#encoder">variable-length encoding</a> is an immediate group of four pixels. However, it does need to know the adjusted latency of both wavelet stages, since the first pixel to be encoded will arrive sooner in 2K mode. For that, I just made all the latency offsets software-defined, through the encoder's AXI slave registers. And that should be the only change required here...</div>
<div>
<br /></div>
<div>
Except things are never that easy. I noticed that, after plugging in the expected latency values for 2K mode, two of the four color fields (R1 and G2) were actually dropping one pixel per row. It took a while to isolate this to the encoder, and then even more staring at this module to figure out what the problem was. Since the only change I made was to the latency offsets, I figured there had to be some fundamental difference between how the local pixel counter (px_count_e) drives the encoder states during row transitions with different offsets, and there was:<br />
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihI8urr1MktbJ59hVExn8RiBXKUJ2HyjISb_NZkSioQp4RO2OSqlMb-D9um3emvhW6A6Jb6fv4Z7Bnxbx6zWQxUO2aP6Z_7bF_Qyx3HaZddpzfDjx9ECM_FAIxO7Uteu7c4HmR2b1xGm0/s1600/d37.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="661" data-original-width="1600" height="264" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihI8urr1MktbJ59hVExn8RiBXKUJ2HyjISb_NZkSioQp4RO2OSqlMb-D9um3emvhW6A6Jb6fv4Z7Bnxbx6zWQxUO2aP6Z_7bF_Qyx3HaZddpzfDjx9ECM_FAIxO7Uteu7c4HmR2b1xGm0/s640/d37.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Encoder gating in 4K mode, showing the difference between sequential and combinational px_count_e_updated.</td></tr>
</tbody></table>
<div>
The above shows px_count_e at the first row overhead time (ROT) in 4K mode. It's negative since pixels haven't made it to the encoder yet, but the same behavior happens at all subsequent row transitions. During ROT, the sensor is not sending pixel data and all the pixel counters (including px_count_e) hold their previous values. A signal called px_count_e_updated is cleared, which gates the encoder from sending pixels to RAM (via an intermediate shift register called e_buffer). This signal was previously <span style="color: #bf9000;">sequential</span>, which would add one clock cycle delay between the ROT and when the encoder is gated. It should have been <span style="color: #76a5af;">combinational</span>, to line up correctly with the ROT.</div>
<div>
<br /></div>
<div>
But the write to e_buffer also only takes place every other group of four pixel clocks, for reasons discussed <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#encoder">here</a>. In 4K mode, the ROT happens to fall in a period where writes don't occur anyway. The <span style="color: #bf9000;">sequential</span> vs. <span style="color: #76a5af;">combinational</span> difference didn't matter to the final e_buffer_wr_en signal. But in 2K mode, the new latency offsets just happen to put the ROT one cycle before the start of a four-cycle write sequence, where the difference does matter:<br />
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq9d_tPpdApWawb4WQ_8rDniXIARHmWHPw9VoGfrBB_S7DOvJKlAFoe1WT1O-Q8JeM0G56R4Y24fCz2mMpHi7zRswXohsOw4-l7Jx3j7xPn2xerud5hfjRer1t_53hPnOgV2rm8kAXqME/s1600/d38.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="654" data-original-width="1600" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq9d_tPpdApWawb4WQ_8rDniXIARHmWHPw9VoGfrBB_S7DOvJKlAFoe1WT1O-Q8JeM0G56R4Y24fCz2mMpHi7zRswXohsOw4-l7Jx3j7xPn2xerud5hfjRer1t_53hPnOgV2rm8kAXqME/s640/d38.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Encoder gating in 2K mode, showing the difference between sequential and combinational px_count_e_updated.</td></tr>
</tbody></table>
<div>
After switching over to <span style="color: #76a5af;">combinational</span> logic for px_count_e_updated, the missing pixel returned, and things were almost happy again. It turns out there was a similar issue at the quantizer and encoder modules themselves, before the write to e_buffer. This was simply due to them not being enable-gated at all, though. (Again, it must have been working thanks to lucky latency offsets in 4K mode.) Gating each with the same <span style="color: #76a5af;">combinational </span>px_count_e_updated signal worked fine.</div>
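The difference between the two gate styles can be shown with a toy trace, where a registered copy of the "counter updated this cycle" flag lags the real condition by one clock:

```python
# Minimal illustration of the sequential-vs-combinational gating bug: the
# registered (sequential) gate opens and closes one cycle late relative to
# the row overhead time (ROT), while the combinational gate tracks it exactly.
def gate_traces(updated):
    comb = list(updated)          # combinational: follows the condition directly
    seq = comb[:1] + comb[:-1]    # sequential: one-cycle-delayed register
    return comb, seq

# Counter updates stop for a 3-cycle ROT, then resume:
comb, seq = gate_traces([1, 1, 0, 0, 0, 1, 1, 1])
assert comb != seq                        # the two gates disagree...
assert seq.index(0) == comb.index(0) + 1  # ...because the registered gate lags by one
```

Whether that one-cycle disagreement matters depends on where it lands relative to the e_buffer write windows, which is exactly why 4K mode got away with it and 2K mode didn't.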
<h4>
HDMI 4K/2K Switch</h4>
<div>
But wait, isn't the <a href="https://scolton.blogspot.com/2020/03/hdmi-hard-way.html">HDMI output</a> always 1080p? While that is true, it doesn't mean there's nothing to be done here. In 4K mode, only the Stage 2 wavelet compression is decoded, leaving a 2K preview image (really, four color fields that are each 1024px wide) to be output via HDMI. This greatly reduces the size of the HDMI module, since it only has to decode four of the sixteen codestreams and do one stage of inverse DWT. However, getting to the same preview size in 2K mode would mean complete decoding, requiring all sixteen codestreams and two inverse wavelet stages. I simply don't have room to do that, so I'm going to cheat.</div>
<div>
<br /></div>
<div>
The first step is to change how the viewport is mapped to a pixel count. To achieve <a href="https://scolton.blogspot.com/2020/03/hdmi-hard-way.html">arbitrary scaling of the preview image</a>, I first normalize the viewport to 16-bit, i.e. top-left (0, 0) to bottom-right (65535, 65535). The x and y components, vxNorm and vyNorm, are shifted around to create the pixel counter that drives the output pipeline. When switching from 4K to 2K, each component gets right-shifted by one and the split between x and y moves over by one bit in the final counter:</div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEkbR23hb4L8i0qTkHt8RbdVh6Aw-A2cUmI1yXffwigQYUnkUsOn9pWBFvdzN8Ug5XJ7R_fApDyQcRcIYBFBsKcg5ya5SvWk5QPmL880Opf14UWqC9Sb_rwEpdBG1E9dpnqU4pXZmsiPE/s1600/d39.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1494" data-original-width="1600" height="596" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEkbR23hb4L8i0qTkHt8RbdVh6Aw-A2cUmI1yXffwigQYUnkUsOn9pWBFvdzN8Ug5XJ7R_fApDyQcRcIYBFBsKcg5ya5SvWk5QPmL880Opf14UWqC9Sb_rwEpdBG1E9dpnqU4pXZmsiPE/s640/d39.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Mapping between 16-bit (vxNorm, vyNorm) coordinates and opx_count in 4K vs. 2K mode.</td></tr>
</tbody></table>
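Here's a quick model of that mapping, using the field widths from the text (1024px per color field in the 4K preview, 512px in 2K); the actual counter layout in the HDL may differ:

```python
# Mapping normalized 16-bit viewport coordinates to a pixel count. In 2K mode
# both components are effectively shifted right by one more bit, and the x/y
# split in the concatenated counter moves over by one bit.
def opx_count(vx_norm: int, vy_norm: int, mode_2k: bool) -> int:
    width_bits = 9 if mode_2k else 10   # 512 vs 1024 px per color field
    x = vx_norm >> (16 - width_bits)    # keep the top bits of each axis
    y = vy_norm >> (16 - width_bits)
    return (y << width_bits) | x        # counter = {y, x}

# Bottom-right of the viewport lands on the last pixel in either mode:
assert opx_count(0xFFFF, 0xFFFF, False) == (1 << 20) - 1
assert opx_count(0xFFFF, 0xFFFF, True) == (1 << 18) - 1
```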
This remapping means that the entire output pipeline operates at half resolution in 2K mode. The preview will actually just be scaled up from the four LL1 color fields, which are each 512px wide. There will still be bilinear interpolation to help smooth out the result, but it will be blurrier than the 1080p preview in 4K mode. But again there isn't really an alternative, at least not with the resources I have left on this chip.<br />
<br />
The output pixel counter (opx_count) drives all parts of the decoding process, starting with a RAM-reading FIFO through the HDMI module's AXI master. No changes are required there or in the decoder itself, other than modifying the latency offsets accordingly. These have always been software-defined, so I just added the expected values for 2K mode and they worked without any hassle. (There was no equivalent sequential vs. combinational bug, thankfully.)<br />
<br />
After this, the modifications to the Stage 2 inverse vertical wavelet cores are pretty simple and almost the same as in the forward direction. Each color field's IV2 core uses a single URAM for row storage. In 2K mode, the aspect ratio is changed from 16 rows of 1024px to 32 rows of 512px, by rearranging read and write address bits:<br />
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeQJMUR1VMC6IfHkE1mMJuX-3tVDYmPeQMMjB76hDmDY6XFdtD_Npm-diYnt9rRHHDW8s9IefcoZ5Ubcv60CLywzpcE7ngRH_R4-lVE-EYydISXl2ZUMN2EcLSDLW4y1ZV5OwU7AaTfms/s1600/d40.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1251" data-original-width="1600" height="500" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeQJMUR1VMC6IfHkE1mMJuX-3tVDYmPeQMMjB76hDmDY6XFdtD_Npm-diYnt9rRHHDW8s9IefcoZ5Ubcv60CLywzpcE7ngRH_R4-lVE-EYydISXl2ZUMN2EcLSDLW4y1ZV5OwU7AaTfms/s640/d40.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Aspect ratio change and read/write addressing of the Stage 2 inverse vertical core URAMs in 4K vs. 2K mode.</td></tr>
</tbody></table>
Unlike the forward direction, the Stage 2 inverse horizontal wavelet cores also use URAMs for row storage and these likewise need address bit rearrangement to change aspect ratios for 2K mode:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyu1eVYGwoVGI5k9opXNooIYAbLqd5pTCR-zbOacUsrdXqRDo0vgdBAG0MwEU__1BV9nKpFhzUSETi25rT0w2Ev4FV8e3f-dOp0dixxp4f1eQ_1NL401LkjZ2t3tjubsWHGyZUeKBwC4A/s1600/d41.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1143" data-original-width="1600" height="456" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyu1eVYGwoVGI5k9opXNooIYAbLqd5pTCR-zbOacUsrdXqRDo0vgdBAG0MwEU__1BV9nKpFhzUSETi25rT0w2Ev4FV8e3f-dOp0dixxp4f1eQ_1NL401LkjZ2t3tjubsWHGyZUeKBwC4A/s640/d41.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Aspect ratio change and read/write addressing of the Stage 2 inverse horizontal core URAMs in 4K vs. 2K mode.</td></tr>
</tbody></table>
And finally, the bilinear interpolation module needs to be adjusted to automatically scale up the preview image by 2x, so it can fill the viewport using the 512px-wide color field LL1 outputs. This can be done quickly by passing the shifted vxNorm and vyNorm values to the module, although this isn't <i>quite</i> correct, as will be discussed below. It's good enough for now, though.<br />
<h4>
Debayering</h4>
</div>
<div>
Applying an ordinary debayering algorithm, whatever it is, to the 2K subsampled raw data doesn't really work. This is because the physical spacing between pixels is no longer symmetric. For example, a red pixel is closer to its green and blue neighbors to the left and below than to the right and above. A proper bilinear interpolation needs to take this asymmetry into account, by modifying the location of pixel centers for each color field accordingly. More advanced algorithms are still built on the assumption of symmetric neighbors, so they'd all need modification to some degree.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFZSy6Xo_cGU0gVbQ_g4qYTTp36zpK9emQsNgJzjgqRHX28PuI9dNrzUGAi8V2-FtqOrZMhFbCvfKc2YcOGCT84KFq2sDbT_mLiyoMxY5AW7CVIYllu7eTSnRKNt_mZvWXQINBnsuzfdU/s1600/d42.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="735" data-original-width="1600" height="292" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFZSy6Xo_cGU0gVbQ_g4qYTTp36zpK9emQsNgJzjgqRHX28PuI9dNrzUGAi8V2-FtqOrZMhFbCvfKc2YcOGCT84KFq2sDbT_mLiyoMxY5AW7CVIYllu7eTSnRKNt_mZvWXQINBnsuzfdU/s640/d42.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Asymmetric neighboring pixels in subsampled mode can be handled by modifying interpolation pixel centers (left) or with an intermediate supersampling step (right).</td></tr>
</tbody></table>
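To make the asymmetry concrete, here's a small model that assumes the subsampling keeps one 2x2 Bayer quad and skips the next in each direction (my assumption about the readout pattern, not a statement of the CMV12000's actual mode):

```python
# After quad-based subsampling, the kept columns are 0,1,4,5,8,9,... so a
# red pixel's nearest green neighbors sit at unequal distances left vs. right.
def kept_columns(n: int):
    return [c for c in range(n) if (c % 4) < 2]

cols = kept_columns(12)                      # [0, 1, 4, 5, 8, 9]
red_col = 4                                  # a red pixel mid-strip
greens = [c for c in cols if c % 2 == 1]     # green on odd columns (assumed CFA phase)

right = min(c - red_col for c in greens if c > red_col)
left = min(red_col - c for c in greens if c < red_col)
assert (right, left) == (1, 3)               # 1 px to the right, 3 px to the left
```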
<div>
Alternatively, the subsampled data can be supersampled by 2x to estimate the missing pixels (G2' and G1' in the image above) and then run through the ordinary debayer algorithm in 4K. The final output can then be scaled back to 2K to reflect the true information content of the data. This path takes longer for what may be an equivalent result for simpler debayer algorithms, but it might have advantages for more complex algorithms. All this will probably be obsoleted by neural networks that upscale 240p images to 16K in a few years anyway, so I'm not going to worry about it.</div>
<div>
<br /></div>
<div>
It is important to adapt the debayer algorithm for the subsampled pixel locations somehow, though, or there will be significant artifacts. The following comparison shows three different algorithms, nearest-neighbor, bilinear, and a <a href="https://www.microsoft.com/en-us/research/publication/high-quality-linear-interpolation-for-demosaicing-of-bayer-patterned-color-images/">Microsoft 5x5 interpolator</a> that I like. For each, a reference 4K capture and 4K debayer is compared to a 2K subsampled capture with an <i>unmodified</i> 2K debayer and a 2K subsampled capture with a supersampled 4K debayer.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBE7JHWM1fqJppgCMOwMIEAkvdhHZCQuVAQg9WCKkUPZAKNlqMas87BRK99I6pGKJEGU2vje312HSN-kVRQTpG3aCTVLVx5qmLxLydzkDEHlcieLgeA5L6J_1bk8dgKzjcSgKBM2PEfrw/s1600/d32.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1170" data-original-width="1600" height="468" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBE7JHWM1fqJppgCMOwMIEAkvdhHZCQuVAQg9WCKkUPZAKNlqMas87BRK99I6pGKJEGU2vje312HSN-kVRQTpG3aCTVLVx5qmLxLydzkDEHlcieLgeA5L6J_1bk8dgKzjcSgKBM2PEfrw/s640/d32.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Comparison of three different interpolation algorithms with 4K capture/debayer, 2K subsampled capture with unmodified 2K debayer, and 2K subsampled capture with supersampled 4K debayer.</td></tr>
</tbody></table>
<div>
None of these simple algorithms can do much to recover resolution - for that I defer to the AI supersampling state of the art - but using an unmodified 2K debayer on subsampled raw data creates significant color checkerboarding artifacts on edges. Supersampling the data by 2x and running a simple 4K debayer at least bypasses the problem of neighboring pixel asymmetry.</div>
<h4>
Resource Utilization</h4>
<div>
Squeezing in the 4K/2K switch was beyond what I'd hoped to fit on the XCZU4, but it just barely works. The switch itself really only adds LUTs where BRAM/URAM address bits are remapped or where pixel counts are shifted to account for the aspect ratio change. The main addition is the 128 new Stage 1 horizontal wavelet cores, which really push the resource utilization to the limits.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkNtnW8eXRBSrpisokxZZ8hljLMPVYElK24ls4CD6YbB2t1bKBvqvb1VJacjb_R0JFfB6fiSx1We3tYAfJFCGOIbzhOj9p3kHoyFA7ouGGQSuaqRIO1OInR8L-tWFrLoJjGRVs3lGvXgE/s1600/d43.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1189" data-original-width="1600" height="474" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkNtnW8eXRBSrpisokxZZ8hljLMPVYElK24ls4CD6YbB2t1bKBvqvb1VJacjb_R0JFfB6fiSx1We3tYAfJFCGOIbzhOj9p3kHoyFA7ouGGQSuaqRIO1OInR8L-tWFrLoJjGRVs3lGvXgE/s640/d43.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The XCZU4 with everything crammed in.</td></tr>
</tbody></table>
<div>
At this point I'm at 77143 LUTs (<span style="color: red;">87.82%</span>), 93883 FFs (53.44%), 118 BRAMs (<span style="color: red;">92.20%</span>), 14 URAMs (29.17%) and 146 DSPs (20.05%). But, since most of my cores are running at px_clk (60MHz) or HDMI clock (74.25MHz) frequency, the timing constraints are not too difficult to meet. The exception seems to be things that interact with the 250MHz AXI clock, including the encoder and decoder BRAM FIFOs. These need some manual placement help to meet timing.</div>
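As a cross-check, those percentages are consistent with the XCZU4EV device totals (87,840 LUTs, 175,680 FFs, 128 BRAMs, 48 URAMs, 728 DSPs; those totals are my numbers from Xilinx's product tables, not from the report above):

```python
# Verify the reported utilization percentages against assumed XCZU4EV totals.
totals = {"LUT": 87840, "FF": 175680, "BRAM": 128, "URAM": 48, "DSP": 728}
used = {"LUT": 77143, "FF": 93883, "BRAM": 118, "URAM": 14, "DSP": 146}
reported = {"LUT": 87.82, "FF": 53.44, "BRAM": 92.20, "URAM": 29.17, "DSP": 20.05}

for k in totals:
    pct = 100.0 * used[k] / totals[k]
    assert abs(pct - reported[k]) < 0.05, (k, pct)  # matches to rounding
```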
<div>
<br /></div>
<div>
The good news is I don't really have much else to add to the programmable logic. I've already built in placeholder URAMs for UI overlays in the HDMI module, so those just need to be filled in by software. I might add some more color processing to the HDMI output, but that will mostly use DSPs, and possibly URAMs for color look-up tables, which should be no problem to add. I'm really happy that everything fits on the XCZU4, not just because the bigger chips are way more expensive, but because it's been a much better lesson in optimizing cores to fit resource constraints than if I had just switched to the XCZU7 early on.</div>
</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com3tag:blogger.com,1999:blog-8200098102909041178.post-66032159298419729402020-03-14T15:38:00.001-04:002020-03-15T16:39:23.606-04:00HDMI, the Hard Way<span style="font-weight: normal;">If I were to rank the components of this project in terms of the ratio of their actual vs. expected difficulty, the <a href="https://scolton.blogspot.com/2019/11/zynq-ultrascale-fatfs-with-bare-metal.html">NVMe interface</a> would probably be lowest, since it was nowhere near as hard as I thought it would be. The <a href="https://scolton.blogspot.com/2019/09/cmv12000-full-speed-384gbs-read-in-on.html">CMV12000 input</a> (easy, expected to be easy) and <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html">wavelet engine</a> (hard, expected to be hard) would be somewhere in the middle. And the new top of the list, the hardest module that should have been easy, would be the HDMI output.</span><br />
<h4>
HDMI</h4>
There seem to be two main reference designs for outputting an HDMI signal from a Zynq SoC. Zynq-7000 series boards such as the <a href="https://www.xilinx.com/products/boards-and-kits/device-family/nav-zynq-7000.html">ZC70x</a> and <a href="http://zedboard.org/product/zedboard">Zedboard</a> use an external HDMI transmitter, the <a href="https://www.analog.com/en/products/adv7511.html">ADV7511</a>, to convert a parallel RGB interface into serial HDMI TMDS outputs. Zynq Ultrascale+ boards such as the <a href="https://www.xilinx.com/products/boards-and-kits/device-family/nav-zynq-ultrascale-mpsoc.html">ZCU10x</a> and <a href="http://zedboard.org/product/ultrazed-ev-carrier-card">UltraZed-EV Carrier Card</a> use the built-in serial transceivers of the ZU+ to drive the TMDS outputs through a <a href="http://www.ti.com/product/SN65DP159">SN65DP159</a> HDMI retimer. The latter is a more modern approach, supporting up to 4K60 through the <a href="https://www.xilinx.com/products/intellectual-property/hdmi.html">HDMI TX Subsystem IP</a>. But, that IP is not included with Vivado. It also requires three free GTH transceiver channels, which I don't have on the XCZU4. (Its four available channels are in use for <a href="https://scolton.blogspot.com/2019/11/zynq-ultrascale-fatfs-with-bare-metal.html">PCIe Gen3 to the SSD</a>.)<br />
<br />
There's nothing wrong with using an external HDMI transmitter with the ZU+, though. I left a PL GPIO bank open specifically for a parallel RGB pixel bus, either for an LCD controller or an HDMI interface. I opted for the slightly newer <a href="https://www.analog.com/en/products/adv7513.html">ADV7513</a>, which supports up to 1080p60 at 8-bit. This is perfectly acceptable as a preview and local playback resolution. Outputting a full 4K frame over HDMI might be useful for interfacing with a RAW recorder, but that is out of the question anyway at 400fps. In fact, I only really need a 24-30fps HDMI output, which means a very manageable <span style="color: yellow;">74.25MHz</span> pixel clock, based on the <a href="https://en.wikipedia.org/wiki/Extended_Display_Identification_Data#EIA/CEA-861_extension_block">CEA-861</a> standard.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3vpSdKZEnpJGF_LcxBw6Yqs-70ryYoD7HVV-XLGsFzkvh-IHk_bm2zQtxxfpL5MUFMhcuTkgZdzIKfW7uHqSyyvHPB1I4YMKyzDPYIVGZKD5seScXo816maUFc5QS-Gm8FXL3vGDpYlY/s1600/d15.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="876" data-original-width="1600" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3vpSdKZEnpJGF_LcxBw6Yqs-70ryYoD7HVV-XLGsFzkvh-IHk_bm2zQtxxfpL5MUFMhcuTkgZdzIKfW7uHqSyyvHPB1I4YMKyzDPYIVGZKD5seScXo816maUFc5QS-Gm8FXL3vGDpYlY/s640/d15.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">HDMI timing parameters for 1920x1080p 24/25/30Hz with a 74.25MHz pixel clock.</td></tr>
</tbody></table>
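All three rates fall out of the same 74.25MHz pixel clock; only the total line length (horizontal blanking) changes between them, per the CEA-861 1080p timings:

```python
# Frame rate = pixel clock / (total pixels per line * total lines per frame).
# Totals from CEA-861 for 1080p24/25/30 (VICs 32/33/34).
PX_CLK = 74.25e6
TOTALS = {24: (2750, 1125), 25: (2640, 1125), 30: (2200, 1125)}  # (h_total, v_total)

for fps, (h, v) in TOTALS.items():
    assert abs(PX_CLK / (h * v) - fps) < 1e-9  # each works out exactly
```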
Generating the required pixel clock, sync, and dummy RGB signals in the ZU+ Programmable Logic (PL) is pretty simple; I set that up as a module on day one of playing with HDMI. Typically, you'd just point this module at a frame buffered in RAM and let it pull the real data. (There are video DMAs and drivers that will do this more-or-less automatically.) But here's where I run into a <strike>slight</strike> problem: <span style="color: yellow;">I don't actually have a frame buffered in RAM.</span><br />
<h4>
The Hard Way</h4>
<div>
While it <i>is</i> possible to <a href="https://scolton.blogspot.com/2019/09/cmv12000-full-speed-384gbs-read-in-on.html">write the full 3.8Gpx/s raw frame data to RAM</a> on the ZU+, it would be futile to try doing any significant processing on it there. Even if I used all three 128b AXI bus connections between the PL and the memory controller at 250MHz, that would allow for less than three accesses per pixel...including the initial write. The Processing System (PS) has a similar memory access constraint, although processing pixels serially on the ARM cores is much too slow anyway. So I made the decision early on to implement the <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html">wavelet compression engine</a> in PL hardware and write the ~5:1 compressed codestreams to RAM instead, on their way to the SSD.</div>
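The bandwidth math that rules out a raw frame buffer, assuming 10-bit packed pixels (my assumption for the arithmetic):

```python
# Total PL<->DDR4 bandwidth vs. the sensor's raw pixel rate.
axi_bw = 3 * (128 / 8) * 250e6       # three 128-bit AXI ports at 250 MHz, in B/s
pixel_bw = 3.8e9 * (10 / 8)          # 3.8 Gpx/s at 10 bits/px, in B/s

accesses_per_px = axi_bw / pixel_bw  # pixel-sized accesses available per pixel
assert axi_bw == 12e9                # 12 GB/s total
assert accesses_per_px < 3           # under three accesses, including the write
```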
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIq07duqPWJEDJEq6pYkHwZPG3eiPp6vT5ZGSH2Ur6qeCS4FFiS9K6mVe9zRPO52oDFPcpFqC08rdqbPbpUu-7aPBT99h2uJUwMoxE_Lp3ykM7YnED5Skr-7fY0N2rUD0SiMrM73MGTmM/s1600/d16.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="506" data-original-width="1600" height="202" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIq07duqPWJEDJEq6pYkHwZPG3eiPp6vT5ZGSH2Ur6qeCS4FFiS9K6mVe9zRPO52oDFPcpFqC08rdqbPbpUu-7aPBT99h2uJUwMoxE_Lp3ykM7YnED5Skr-7fY0N2rUD0SiMrM73MGTmM/s640/d16.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The capture pipeline, with the main data path highlighted and shown decreasing in width where compression occurs at the PL Encoder, before data is written to DDR4 RAM.</td></tr>
</tbody></table>
</div>
<div>
"No problem," you might say, "just split off raw data from the sensor and feed it to the HDMI module." Unfortunately, this doesn't quite work: In the time it takes the HDMI scan to complete one row, the capture pipeline has processed 50+ rows from the CMV12000. The input and output are just not in sync, and any attempt to buffer partial frames between them would require much more block RAM than I have available. It would also cause frame tearing that would ruin any attempt to preview periodic phenomena with the global shutter.<br />
<br />
The only real choice is to put the HDMI output module after the RAM buffer, which means decoding compressed frame data on the way out:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRl8y6_7KMAkeDPjgVN6eajdY2Yth3EiNmhOSR76rrYxrZ9TiSdsjFc3Z0Q-yrPuWDnNVybvcHdeoGUn2AyZHcrLyCR7rMScYD8-6WnhwxCrrxY9c2D4U5SiSTgMjMFFQEYi1lN1_nCHw/s1600/d17.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="414" data-original-width="1600" height="165" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRl8y6_7KMAkeDPjgVN6eajdY2Yth3EiNmhOSR76rrYxrZ9TiSdsjFc3Z0Q-yrPuWDnNVybvcHdeoGUn2AyZHcrLyCR7rMScYD8-6WnhwxCrrxY9c2D4U5SiSTgMjMFFQEYi1lN1_nCHw/s640/d17.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The only logical place to put the HDMI output, and not just because I left space for it there in the block diagram.</td></tr>
</tbody></table>
The HDMI module reads codestream data from RAM as an AXI Master, decodes the pixel values, and runs an Inverse <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#dwt">Discrete Wavelet Transform</a> (IDWT) to recover the raw image. While this is a lot more work, it pays off twofold because the same module can be used for playback by reading frames back out of the SSD into RAM and pointing the decoder at them.<br />
<br />
Notwithstanding the design effort, the actual resource utilization of this module <i>should</i> be pretty low. For one, only four of the <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#encoder">sixteen codestreams</a> need to be decoded to reconstruct a 2048px-wide image to use for the preview; there's no need to decode any LH1, HL1, or HH1 data. Also, the preview frame rate is at least 10x slower than the capture frame rate, so the amount of parallelism needed in the decoding and IDWT pipeline is much lower. Still, it's more logic on an already-crowded chip.<br />
<h4>
Kill Your Darlings</h4>
At this point I'm stubbornly committed to fitting this design on the XCZU4. With the capture pipeline complete, I was getting pretty close to <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikdQBoG5MfI1DHaQvQCQUamIizk6ojBcYeV_HQg25UOMJVD8jKQ1SHJyQLvcTmDhXsKYFGcKS-4ceZXHx6PKdp5HjMsP0KFZ5hA8meoNk3PZtrpX0nSE_QMIjEtsnmkj2JLmQnDQ7zuyk/s640/c90.png">maxing out this chip</a>, especially the LUTs (65593 / 87840) and BRAMs (122 / 128). And this was after a significant optimization pass on all the cores, including trimming pixel math operations from 16-bit to 12-bit where applicable and removing debug interfaces. These bottlenecks were already causing routing difficulty that was pushing up compile times, so I needed to make more room somehow. And then one day I woke up and decided to delete Wavelet Stage 3.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7XG1-7CGjFHmcQKD0ChNJMaQmH1JNOHYVo9n8fWkrehM8vRPspX7kwLppSwW4ijjQpHDWwc161mUH76E8gGzdNHNqE8hdXVlcIhjpo1KilPwNsrAiEpr577xNIsZm95W7fodREew8Td0/s1600/d18.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="741" data-original-width="1600" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7XG1-7CGjFHmcQKD0ChNJMaQmH1JNOHYVo9n8fWkrehM8vRPspX7kwLppSwW4ijjQpHDWwc161mUH76E8gGzdNHNqE8hdXVlcIhjpo1KilPwNsrAiEpr577xNIsZm95W7fodREew8Td0/s640/d18.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">An example showing the effect of deleting the third DWT stage without changing the target compression ratios of any other stages. The red bars are each sized proportionally to the compressed sub-band they represent.</td></tr>
</tbody></table>
<div style="text-align: left;">
Stage 3 only handles 1/16 of the total data throughput, but it is visually the most significant and thus uses the least amount of compression. In the example above, replacing Stage 3's output with a raw 1/4-scale average image (LL2) has a relatively small effect on the overall compression ratio. It's also not a complete loss, since the 1:1 LL2 will yield slightly better visual quality if the other subbands remain unchanged. The distribution of bandwidth that achieves the best image quality with an overall compression ratio of 5:1 is still an unknown, but ditching Stage 3 probably isn't restricting the search space too far.</div>
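To see why the overall ratio only moves modestly, here's the weighted-average arithmetic. The throughput fractions follow from the wavelet structure; the per-stage ratios are made up purely for illustration:

```python
# Effect of shipping LL2 raw instead of compressing it through Stage 3.
# Fractions of total throughput: Stage 1 subbands carry 12/16 of the data,
# Stage 2 subbands 3/16, and LL2 (formerly Stage 3's input) 1/16.
fractions = {"stage1": 12/16, "stage2": 3/16, "ll2": 1/16}
ratios_with_s3 = {"stage1": 8.0, "stage2": 4.0, "ll2": 2.0}  # hypothetical ratios
ratios_without = {"stage1": 8.0, "stage2": 4.0, "ll2": 1.0}  # LL2 stored raw, 1:1

def overall(ratios):
    # Overall ratio is the harmonic-style weighted average of the parts.
    return 1.0 / sum(f / ratios[k] for k, f in fractions.items())

print(round(overall(ratios_with_s3), 2))  # ~5.82:1 with Stage 3
print(round(overall(ratios_without), 2))  # ~4.92:1 without — a modest hit
```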
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Although Stage 3 is by far the smallest wavelet core, removing it also simplifies a lot of downstream logic. The "XX3" <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#encoder">encoder</a>, which previously handled all four Stage 3 subbands by cycling through different inputs and quantizer settings, now becomes a pass-through for raw LL2 data. It also now has the same latency as the HL2, LH2, and HH2 encoders. This latency is the new maximum and is significantly lower than the former XX3 latency. (It's no longer necessary to wait for six whole LL2 rows for the Stage 3 DWT.) There's a symmetric payoff on the decoder side as well.<br />
<br />
So while I'm sad to see it go, I think it's the right call for now. Having three stages probably does improve the compression performance (objectively, the PSNR at a given compression ratio), but I think I can still achieve good image quality at an overall ratio of 5:1 with only two. Not even including prospective decoder savings, the reduction in LUTs (-4575), FFs (-5320), and most crucially BRAMs (-8) is well worth it.<br />
<h4>
Working Backwards</h4>
<div>
In many ways, the HDMI output module is just a mirror image of the pixel input pipeline, from the deserialized CMV12000 input pixels to the AXI Master that writes encoded data to RAM. The 74.25MHz HDMI clock runs a master pixel counter that scans across and down the output frame. Whereas the CMV12000 clocks in 64 pixels in parallel, though, the HDMI only has to clock out one.<br />
<br />
Or does it? Each HDMI pixel (in RGB 4:4:4 format) consists of an 8-bit red, green, and blue value, whereas the <a href="https://en.wikipedia.org/wiki/Bayer_filter">Bayer-masked</a> sensor input is split into four interleaved color fields. Each color field's decoded LL1 image will only be 1024px wide. One option would be to center this in the HDMI frame and pull the 8-bit R, G, and B values directly from each color field's LL1:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguPUXagsqYHBbevgavv8DZtocyl9iQSZofJ22SA7F_dCFkYrCn8GeqVlK7MKz78mh4u37AxLM_t2pk4RdL5p0EAtjeFsuaFGCL9idLxa42imQ7YI7b9LuKRP9N8HWFIigbxKoehJiCSOw/s1600/d20.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="737" data-original-width="1600" height="294" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguPUXagsqYHBbevgavv8DZtocyl9iQSZofJ22SA7F_dCFkYrCn8GeqVlK7MKz78mh4u37AxLM_t2pk4RdL5p0EAtjeFsuaFGCL9idLxa42imQ7YI7b9LuKRP9N8HWFIigbxKoehJiCSOw/s640/d20.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">1:1 scaling from LL1 color field pixels to HDMI pixels.</td></tr>
</tbody></table>
In this case, each HDMI clock requires one pixel from each of the four color fields (the two greens are averaged). The logic couldn't really get any simpler. But, it makes poor use of the 1920x1080 HDMI frame, especially for widescreen aspect ratios. An alternative would be to scale everything up by a factor of two:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijWbg9bHJ-8W7y4Bo9mBb-jwnKJGlXp7omfD3aoacQSZfVY3adZur_1cH96LQevvaxLD3w6UppadFAGP309OgKOeXLfTfaBI7hRjf4nYD3T5wE6iB2fAUhwiyABNqoXWrgANVaH8TlUjI/s1600/d21.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="766" data-original-width="1600" height="306" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijWbg9bHJ-8W7y4Bo9mBb-jwnKJGlXp7omfD3aoacQSZfVY3adZur_1cH96LQevvaxLD3w6UppadFAGP309OgKOeXLfTfaBI7hRjf4nYD3T5wE6iB2fAUhwiyABNqoXWrgANVaH8TlUjI/s640/d21.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">2:1 scaling from LL1 color field pixels to HDMI pixels.</td></tr>
</tbody></table>
Now, a debayering method has to be used to reconstruct the missing color values at each pixel. For this application, a simple average of the neighboring pixels would be fine. (The off-line decoder uses a more complex, higher-quality method.) Each HDMI pixel now references as many as four pixels from each color field. But, these pixels don't all update at each HDMI clock. The average pixel consumption from each color field is actually only one per four HDMI clocks, as expected from the 2:1 scaling factor.<br />
<br />
But a 2:1 scaled preview doesn't fit in 1920x1080. The cropping isn't too bad for widescreen aspect ratios, but it's unusable for 4:3. Switching between 1:1 and 2:1 scaling depending on the aspect ratio would work, but adds a lot of conditional logic for a still-compromised result. An arbitrary software-controlled scaling between 1:1 and 2:1 would be so much better. So, time to break out the DSPs:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmiGAsJxaY99ZwhL30mOi41m2xT0uULL7vEuRGiM276EljBIZ_LjgmfnHRWr4SuJPgL0lTcQvmLfzd8HNsljAoc5Uw87l7vOmTMZ0RIh4MF1_YkTCWyIYQ1Dh61v37gu9ZSku0SMlwe1o/s1600/d22.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="738" data-original-width="1600" height="294" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgmiGAsJxaY99ZwhL30mOi41m2xT0uULL7vEuRGiM276EljBIZ_LjgmfnHRWr4SuJPgL0lTcQvmLfzd8HNsljAoc5Uw87l7vOmTMZ0RIh4MF1_YkTCWyIYQ1Dh61v37gu9ZSku0SMlwe1o/s640/d22.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Arbitrary scaling from LL1 color field pixels to HDMI pixels, using bilinear interpolation.</td></tr>
</tbody></table>
To achieve arbitrary scaling, the four 1024px-wide LL1 color fields are resampled onto a 65536px-wide grid, accounting for the offsets between the centers of pixels of each color. Then, a viewport is defined within the HDMI frame and normalized onto this 16-bit grid (<a href="https://github.com/coltonshane/WAVE-Vivado/blob/master/base_cmd.srcs/sources_1/ip/HDMI_1.0/src/viewport_normalize.v">using DSPs</a>). The four pixel centers of each color field that box in the normalized viewport coordinate are used for bilinear interpolation (<a href="https://github.com/coltonshane/WAVE-Vivado/blob/master/base_cmd.srcs/sources_1/ip/HDMI_1.0/src/bilinear_16b.v">using more DSPs</a>) to produce the R, G, and B values. This is also the debayer step, thanks to the pixel center offsets.<br />
<br />
One thing I actually do have plenty of is DSPs, and this seems like a great use for 14 of them. Being able to reposition and rescale the preview image from software makes life a lot easier. The downside is that sixteen LL1 pixels are required to generate a single HDMI pixel. But as with the 2:1 case, the input pixels don't all change with every HDMI clock. The average LL1 pixel consumption rate will depend on the scale, but if the viewport width is always at least 1024px, <span style="color: yellow;">it will never exceed one LL1 pixel per color field per HDMI clock</span>. All upstream logic in the decoder is designed with this constraint in mind.</div>
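The viewport normalization and bilinear interpolation can be sketched in software like this. The function names, the flat-field example, and the exact fixed-point arithmetic are illustrative only, not the actual DSP implementation:

```python
# Normalize an HDMI pixel position into the 16-bit (0..65535) viewport grid,
# then bilinearly interpolate one 1024px-wide LL1 color field.
def normalize(hdmi_x, vp_x0, vp_width):
    # Map an HDMI x coordinate inside the viewport onto the 65536-wide grid.
    return (hdmi_x - vp_x0) * 65536 // vp_width

def bilinear(field, gx, gy):
    # Each LL1 field pixel spans 64 grid units (65536 / 1024).
    x, fx = gx // 64, gx % 64
    y, fy = gy // 64, gy % 64
    p00, p01 = field[y][x], field[y][x + 1]
    p10, p11 = field[y + 1][x], field[y + 1][x + 1]
    top = p00 * (64 - fx) + p01 * fx
    bot = p10 * (64 - fx) + p11 * fx
    return (top * (64 - fy) + bot * fy) // (64 * 64)

# Sanity check: a flat field interpolates to its own value.
flat = [[100] * 1024 for _ in range(2)]
print(bilinear(flat, normalize(960, 0, 1920), 10))  # 100
```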
<div>
<h4>
Ultra-IDWT</h4>
</div>
<div>
Next upstream is the Inverse Discrete Wavelet Transform (IDWT). One of the most significant simplifications achieved by deleting Wavelet Stage 3 is that the HDMI output module only has to do one stage of IDWT: Stage 2. This stage recovers LL1 from the LL2, LH2, HL2, and HH2 subbands. The order of operations is reversed in the IDWT: vertical first, then horizontal. Since we're working backwards from the HDMI output, let's look at the horizontal core first.</div>
<div>
<br /></div>
<div>
The <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#h26">forward horizontal DWT core</a> is heavily optimized for speed and size using only FF-based distributed memory. In the inverse direction, there's a lot more breathing room. Only four cores are needed (one per color field) and they only need to process at most one pixel per HDMI clock. So, I am able to combine the horizontal IDWT with a block RAM buffer and output shift register pretty easily. I'm almost completely out of BRAMs, but I have plenty of UltraRAM (URAM) for this.</div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlVmwfD7l2pOyC-mZLKYt6IApt-eeXdPIpRNJVKgeQUVHa4FnzIrLeKKQEqB7ePHM2rRixn_3tx21P6k2yF6Ph1Q94Vkj8FM0WUSZyRhp0ThAjPZhGQtsuFgRD2NIneYtTc7JvssNQLzQ/s1600/d23.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="847" data-original-width="1600" height="338" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlVmwfD7l2pOyC-mZLKYt6IApt-eeXdPIpRNJVKgeQUVHa4FnzIrLeKKQEqB7ePHM2rRixn_3tx21P6k2yF6Ph1Q94Vkj8FM0WUSZyRhp0ThAjPZhGQtsuFgRD2NIneYtTc7JvssNQLzQ/s640/d23.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Horizontal IDWT and output buffer for one color field built around a single URAM.</td></tr>
</tbody></table>
Each URAM is 32KiB, enough to store 16 rows of LL1 data. The oldest two rows (N+0 and N+1) feed output shift registers that end in the four pixels the bilinear interpolator needs. The horizontal IDWT is performed on data from Row N+3, its result written back to Row N+2. As in the forward direction, pixels are processed in 64-bit groups of four: two interleaved pairs of low-pass and high-pass values become four LL1 outputs. Two half-speed shift registers unpack 64-bit URAM reads for the IDWT and pack the results into 64-bit writes. Running the IDWT as a single combinational step is not as efficient as using sequential <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#lifting">lifting steps</a>, as in the <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#h26">forward horizontal DWT</a>, but it's a bit simpler to do with shift registers. Meanwhile, new data from the vertical stage is fed in at Row N+6.</div>
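The multi-role row addressing above works out to simple modulo-16 offsets from a base pointer that advances as the frame scans down. A toy sketch (names are mine, not the HDL's):

```python
# The 16-row URAM line buffer: each role (output rows N+0/N+1, write-back
# N+2, IDWT read N+3, vertical-stage input N+6) is a fixed offset from a
# base row pointer that wraps modulo 16.
NUM_ROWS = 16

def row_addr(base, offset):
    return (base + offset) % NUM_ROWS

base = 14  # wraps naturally as the frame scans down
print([row_addr(base, k) for k in (0, 1, 2, 3, 6)])  # [14, 15, 0, 1, 4]
```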
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZjif9ZBaLCqDZVjwxmTKBAWxvF4rA4d1jiF-lol9jalzBQtICYbuyG5JaG2YfJBRLR_lrYYLuQF9WiQBqZBeBvqbzgNmAdPLEcIc6Jltqk0IrWIHuW5ZuaUvL9O1wyJ-PScTmiVM8HYI/s1600/d24.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="706" data-original-width="1600" height="282" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZjif9ZBaLCqDZVjwxmTKBAWxvF4rA4d1jiF-lol9jalzBQtICYbuyG5JaG2YfJBRLR_lrYYLuQF9WiQBqZBeBvqbzgNmAdPLEcIc6Jltqk0IrWIHuW5ZuaUvL9O1wyJ-PScTmiVM8HYI/s640/d24.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Vertical IDWT for one color field built around a single URAM.</td></tr>
</tbody></table>
<div>
The vertical IDWT cores are also each built around a single URAM. In this case, the URAM is split in half for low-pass (HL2/LL2) and high-pass (HH2/LH2) vertical data. Four pixels each from three rows of low-pass data (N+0 to N+2) and one row of high-pass data (N+9) are processed every four clocks to create two four-pixel outputs to write to horizontal core URAM. In a shameful waste of clock cycles, input rows are scanned twice and the output write alternates between the even and odd IDWT results. (There are other ways to deal with the 2:1 row scanning ratio, but I'm willing to trade power for simpler logic right now.) Meanwhile, raw interleaved LL2, LH2, HL2, and HH2 data are written in to rows somewhere just ahead of the IDWT read pointers.<br />
<h4>
Decompressor and Distributor</h4>
</div>
<div>
Each horizontal and vertical core operates on a single color field, but the four input codestreams are instead separated by subband (LL2, LH2, HL2, HH2), with all four color fields being present in each codestream. The codestreams also cycle through four different column positions in a given row, since the Stage 2 <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#v26">forward vertical DWT</a> uses four cores in parallel. A distributor remaps decoded subband data to the appropriate write address in one of the vertical IDWT cores. This is also a good place to interleave the high-pass and low-pass data, which facilitates the horizontal IDWT.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSbVuaTRHKqYE4zE0ksESMelRSsWlTLKqSIBNYVURwYyLnBi_XyqY3yz99229RPmQ12Mw8S-KI24GfyxQIxfBv1vZNK51pXsoTlLAAVFAoTEI3s9atUmBGYNvNsxGkZKYg0i4pjo5-5Kk/s1600/d25.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="429" data-original-width="1600" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSbVuaTRHKqYE4zE0ksESMelRSsWlTLKqSIBNYVURwYyLnBi_XyqY3yz99229RPmQ12Mw8S-KI24GfyxQIxfBv1vZNK51pXsoTlLAAVFAoTEI3s9atUmBGYNvNsxGkZKYg0i4pjo5-5Kk/s640/d25.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">After decoding, subband pixels are redistributed to the appropriate location in each color field's vertical IDWT buffer.</td></tr>
</tbody></table>
<div>
The distributor writes four pixels into one of the four vertical core URAMs at most once per HDMI clock, to satisfy the one pixel per color field per clock constraint discussed above. For viewport widths greater than 1024px, the distribution is gated by the master pixel counter, which only updates when the interpolators actually need new pixels.</div>
<div>
<br /></div>
<div>
Continuing upstream, the distributor receives 16-bit signed pixel values from the four codestream decompressors. Each one takes in codestream data from RAM as needed, decoding four pixels at a time by reversing the <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#encoder">variable length code</a> used by the encoder. The pixels are then multiplied by the inverse of the <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#quantizer">quantizer</a> multiplication factor, using more DSPs, to recover their full range.<br />
<br />
Raw codestream data is read in from RAM by an AXI Master into BRAM FIFOs at the entrance to each decompressor. I'm using precious BRAMs here, for the built-in FIFO functionality and to make the decoder RAM reader symmetric to the encoder RAM writer. A round-robin arbiter checks the FIFO levels to see when more data needs to be read. I'm only using a 64-bit AXI Master on the decoder, since the bandwidth already far exceeds the worst-case HDMI output requirement.<br />
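A round-robin check like this captures the idea; the refill threshold and return convention are assumptions, not the real core's values:

```python
# Round-robin refill arbiter sketch: scan the four codestream FIFOs starting
# after the last grant and pick the first one below a refill threshold.
def next_refill(fifo_levels, last_grant, threshold=64):
    n = len(fifo_levels)
    for i in range(1, n + 1):
        cand = (last_grant + i) % n
        if fifo_levels[cand] < threshold:
            return cand
    return None  # all FIFOs sufficiently full; no read issued this round

print(next_refill([100, 30, 100, 20], last_grant=1))  # 3 (first low FIFO after 1)
```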
<h4>
Start-Of-Frame Context</h4>
<span style="font-weight: 400;">So far, the HDMI output pipeline looks a lot like the sensor input pipeline in reverse. But one subtle way in which they differ is in Start-Of-Frame (SOF) context: the state of the pipeline at the beginning of each frame. In the interest of speed, the input pipeline is </span><i style="font-weight: 400;">not</i><span style="font-weight: 400;"> flushed between frames. Furthermore, codestream addresses for a given frame are updated during the Frame Overhead Time (FOT) interrupt, while some data is still in the pipeline, so the very bottom of Frame N-1 becomes the top of Frame N in memory.</span><br />
<span style="font-weight: 400;"><br /></span></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiszMho5Loj_K8ys4zmJTzo_FYvxHHJfofD_RV0-c_513wrL0mw7jKfRupBjW4fansiD9QOwZoRpUD79E-slqM5JcmwEqCU4FwmkjNmDZZwp-KOq5Zmpkhq1YfTKyGcJzoY2CKA7DdLr1U/s1600/d26.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="822" data-original-width="1600" height="328" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiszMho5Loj_K8ys4zmJTzo_FYvxHHJfofD_RV0-c_513wrL0mw7jKfRupBjW4fansiD9QOwZoRpUD79E-slqM5JcmwEqCU4FwmkjNmDZZwp-KOq5Zmpkhq1YfTKyGcJzoY2CKA7DdLr1U/s640/d26.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Overlap between Frame N-1 and Frame N in memory. SOF N marks the sector-aligned start of "Frame N" in RAM, set during the FOT interrupt from the CMV12000. The decoder seeks the actual start of Frame N data.</td></tr>
</tbody></table>
If the decoder processes every frame, this isn't a problem: it can wrap cleanly through the overlapping region to get the data it needs for both frames. But the HDMI output only processes a subset of the frames captured. It needs to be able to find the start of any individual frame and process it independently. This is needed for seeking in a playback context too. But I can't afford the time it would take to flush the input pipeline between each frame. So instead I need to completely capture the state of the pipeline at the SOF boundary.</div>
<div>
<br /></div>
<div>
As it turns out, this isn't too bad, since there are only a few places where data can remain in the input pipeline at the SOF: </div>
<div>
<ol>
<li>In the pre-encoder pixel memory: registers or BRAM buffers that are part of sensor input, DWT or quantizer operations. These have a fixed latency of <span style="color: yellow;">6336px for Stage 1+2</span>. The decoder can offset its pixel counter by this amount, essentially discarding the overlapping pixels into the space between VSYNC and the start of the viewport.<br /> </li>
<li>In the 128-bit <a href="https://github.com/coltonshane/WAVE-Vivado/blob/master/base_cmd.srcs/sources_1/ip/Encoder_1.0/src/compressor_16in.v#L94">e_buffer register</a> of each codestream that accumulates encoded data before writing it to that codestream's BRAM FIFO. The number of bits remaining in this register is neatly captured by its write index, <span style="color: #9fc5e8;">e_buffer_idx</span>.<br /> </li>
<li>In the codestream BRAM FIFO itself. This is captured by the <span style="color: #b6d7a8;">FIFO read level</span>, already used as the AXI write trigger. Since these FIFOs are 64-bit write and 128-bit read, care must be taken to keep track of the <span style="color: #d5a6bd;">write level LSB</span> as well, to know if there's an extra half-word in memory that can't be read yet.</li>
</ol>
<div>
The last two combine to give a number of bits to discard for each codestream: </div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><b><span style="color: #9fc5e8;">e_buffer_idx</span> + 128 * <span style="color: #b6d7a8;">fifo_rd_count</span> + 64 * <span style="color: #d5a6bd;">fifo_wr_count[0]</span> </b></span></div>
<div>
<br /></div>
</div>
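In software terms, the per-codestream discard count is exactly that expression (only the masking of the write level LSB is made explicit here):

```python
# SOF bit-discard count from the three captured context values, per codestream.
def bits_to_discard(e_buffer_idx, fifo_rd_count, fifo_wr_count):
    # fifo_wr_count[0] in the formula is the write level LSB: one extra
    # 64-bit half-word may be in memory that can't be read as 128 bits yet.
    return e_buffer_idx + 128 * fifo_rd_count + 64 * (fifo_wr_count & 1)

print(bits_to_discard(37, 5, 9))  # 37 + 640 + 64 = 741
```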
<div>
To fully capture the SOF context, these three values are written to the frame headers during the FOT interrupt. A VSYNC interrupt from the HDMI module prompts software to read the header of the next frame to be displayed, calculate the number of bits to discard for each codestream, and pass it to the decoder along with the codestream start addresses. That number of bits is then discarded by the decoders prior to attempting to decode any pixels.<br />
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_rQI5u2AfhMstpVqom6F5hR8N6R7H_4PDYG-BES-lCd_dzWwmvPMDQkLKb6HjL5413bKQ7LaXLVFsvL7r8SnXuuggcu6b-fJQwqqCSjk7K9ij7IpmiRRjjrajXD_0hcFig3JSW3w4RGA/s1600/d27.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="602" data-original-width="1600" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_rQI5u2AfhMstpVqom6F5hR8N6R7H_4PDYG-BES-lCd_dzWwmvPMDQkLKb6HjL5413bKQ7LaXLVFsvL7r8SnXuuggcu6b-fJQwqqCSjk7K9ij7IpmiRRjjrajXD_0hcFig3JSW3w4RGA/s640/d27.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">High-level architecture of the encoder and decoder interactions with the CPU and RAM.</td></tr>
</tbody></table>
<div>
In total, the HDMI output module (decoder and all) uses 4363 LUTs, 4227 FFs, and 4 BRAMs, less than what was saved by deleting Wavelet Stage 3. It adds 8 URAMs and 26 DSPs, but I'm not running short of those (yet). Except for the AXI Master, it runs entirely on the 74.25MHz HDMI clock, so it shouldn't be too much of a timing burden. There might be room for a bit more optimization, but I'm happy with the functionality it gives for its size.</div>
<h4>
Focus Assist</h4>
<div>
<div class="separator" style="clear: both; text-align: left;">
The main reason I wanted to get the HDMI module done now, ahead of some of the other remaining tasks, is so I can use the real-time preview for testing. It sucks to have to pull frames off one-by-one through USB to iterate on framing, exposure, and especially focus. Having a 1080p 30fps preview on something like an <a href="https://www.atomos.com/shinobi">Atomos Shinobi</a> on-camera monitor makes life a lot easier, and moves in the direction of standalone operation.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/uceEuX6Fj1Q" width="640"></iframe>
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
One neat trick you can do with wavelets is overmultiply the high-pass subbands (LH1, HL1, HH1) to highlight edges in the preview image. This effect is useful for focus assist. Most on-camera monitors can do this anyway (by running a high-pass filter on the HDMI data), but it's essentially free to do in the decoder since the subbands pass through a multiplier anyway to undo the quantization division. I'll take free features any day.</div>
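Folding the edge-emphasis gain into the dequantization multiply might look like this; the gain value and function shape are made up for illustration:

```python
# Focus-assist for free: apply an extra gain to high-pass subbands inside
# the dequantizer multiply that every subband already passes through.
def dequantize(value, inv_quant, subband, focus_assist=False):
    gain = inv_quant
    if focus_assist and subband in ("LH1", "HL1", "HH1"):
        gain *= 4  # overmultiply high-pass data to highlight edges
    return value * gain

print(dequantize(3, 8, "HH1", focus_assist=True))  # 96 (edges emphasized)
print(dequantize(3, 8, "LL2", focus_assist=True))  # 24 (low-pass untouched)
```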
<h4 style="clear: both; text-align: left;">
Macro Machining</h4>
</div>
<div>
With the newfound ability to actually focus the image in a reasonable amount of time, I'm finally able to play with a new lens: The <a href="https://www.irixusa.com/irix-cine-150mm-t3-0-macro-1-1">Irix Cine 150mm T3.0 Macro</a>. I started drooling over this lens for close-up high-speed stuff after watching <a href="https://www.youtube.com/watch?v=GsxybNZn22k">this review</a>. I'm no lens expert, but I feel like this lens series competes with ones 3x its price in terms of image quality. My first test was to attempt to get some macro shots of my mini mill:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_AkJ_UoOP1WlnCvyze6txyfz3e7AiGOB0IO6q1lsn_T00jRvUCeVZ70lVbfuNJUwyuoR3pqGpfyTS3C9QOlP8aJn-WSm_H8yWM2EJGLtphh3QKEDLzjk73r2foDrzzFCkZQfNlqP0Eow/s1600/d29.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_AkJ_UoOP1WlnCvyze6txyfz3e7AiGOB0IO6q1lsn_T00jRvUCeVZ70lVbfuNJUwyuoR3pqGpfyTS3C9QOlP8aJn-WSm_H8yWM2EJGLtphh3QKEDLzjk73r2foDrzzFCkZQfNlqP0Eow/s640/d29.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Shooting my mini-mill at 400fps with the Irix 150mm T3.0 Macro lens.</td></tr>
</tbody></table>
<div>
The HDMI output was crucial for this, since the lens has an insanely shallow depth-of-field at T3.0, less than the width of the cutting tool. The CMV12000 is not a particularly good low-light sensor, so with an exposure time of around 1.87ms, I needed to add a good deal of light. To make things more interesting, I threw in some cheap <a href="https://www.ikea.com/us/en/p/dioder-led-4-piece-light-strip-set-multicolor-50192365/">IKEA RGBs</a> as well. It took a while to get set up, but the result was promising:<br />
<br /></div>
<div>
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/qQ8LboNODIs" width="640"></iframe></div>
<div>
<br />
I'll probably repeat this with a more interesting subject (this was just a piece of scrap aluminum) and a more stable mount. If I can get more light, it might be good to close down to T5.6 or so as well, to get a bit more depth of field, and drop the exposure to 180º shutter for less motion blur on the cutter. But the lens is terrific and I'm happy with the quality of the two-stage wavelet compression so far. The above clip has an average compression ratio of right around 6:1, helped along by the ultra-shallow depth of field.<br />
<h4>
Next Up</h4>
</div>
<div>
The last major HDL task on this project is modifying the pipeline to accept 2K subsampled frames from the sensor at higher frame rates (up to around 1400fps at 1080p!). This will probably be a separate Vivado project and bitstream, since it requires substantial modifications to the input pipeline. It also needs twice as many Stage 1 horizontal cores, since four rows are being read in simultaneously instead of two.</div>
<div>
<br /></div>
<div>
But I may tackle some simpler but no less important usability tasks first. For one, I still don't have pass-through Mass Storage Device access to the SSD over USB C. This is necessary for getting footage off without opening the camera (or using RAM as intermediate storage). With that and a bit of on-camera UI work (record button, simple menus), I'll finally be able to run everything completely standalone.</div>
</div>
</div>
Shane Colton, 2019-12-19: Continuous 3.8Gpx/s (4K 400fps+) Image Capture Pipeline<div>
In the original <a href="https://scolton.blogspot.com/2019/06/freight-train-of-pixels.html">Freight Train of Pixels</a> post, I laid out three main technical challenges to building a <i>continuous recording </i>3.8Gpx/s imager. All three have now been dealt with, using a Zynq Ultrascale+ SoC as a hardware base. The detailed implementation of each one has its own post:</div>
<div>
<br /></div>
[<a href="https://scolton.blogspot.com/2019/09/cmv12000-full-speed-384gbs-read-in-on.html">The Source</a>] - Full-speed read-in of the CMV12000's 64 LVDS channels.<br />
<div>
[<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html">The Pipe</a>] - Hardware wavelet compression engine.</div>
<div>
[<a href="https://scolton.blogspot.com/2019/11/zynq-ultrascale-fatfs-with-bare-metal.html">The Sink</a>] - Sustained 1GB/s writing to an NVMe SSD.</div>
<div>
<br /></div>
<div>
Now it's time to put all three pieces together and run it as a full pipeline:</div>
<div>
<br /></div>
<div>
<div style="text-align: center;">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/j_u-m2xwpek" width="640"></iframe></div>
<div style="text-align: center;">
<span style="font-size: x-small;">Since YouTube uploads are at the mercy of H.264, here's a <a href="https://scolton-www.s3.amazonaws.com/video/wv37.png">PNG frame</a> as well.</span></div>
<br />
There are lots of technical details to dive into, but the first thing to point out is that this is 12000 frames of continuously-recorded 4K 400fps video. That's 30s in real-time and 500s of playback at 24fps, something very few existing high-speed imaging systems can do. And I can keep going. This clip is "only" 24GB of a 1TB SSD. To fill the entire 1TB would take about 20 minutes at this bit rate. That's 20 minutes of real-time, 5.5 hours of playback at 24fps.</div>
<div>
<br /></div>
<div>
This is made possible mostly by the insane speed of modern SSDs. High-speed cameras typically use RAM to buffer raw frame data in real-time, transferring short clips to non-volatile storage after the capture period. But with NVMe flash write speeds now well into the GB/s range, maybe continuous direct recording to a single drive (or a RAID 0 array) will catch on. Besides the ability to capture long clips, it also allows an alternative trigger-free user interface that would be familiar to anyone: push to record, push to stop.</div>
<div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEie6TC8wYZVLs7hcT_VLO219K5i9tB5_wa6dmB7ADx7ObOD-0Jgb9FEkcOBLR89XvEAbxnKs65GerU48aoV99JfYAqTFcot18W1uSsBv-6pqeMG2AQj6Iv328hIMcpOrX5Oih4nPY3NK8c/s1600/d02.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="826" data-original-width="1600" height="330" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEie6TC8wYZVLs7hcT_VLO219K5i9tB5_wa6dmB7ADx7ObOD-0Jgb9FEkcOBLR89XvEAbxnKs65GerU48aoV99JfYAqTFcot18W1uSsBv-6pqeMG2AQj6Iv328hIMcpOrX5Oih4nPY3NK8c/s640/d02.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The real MVP here is the 1TB Samsung 970 Evo Plus NVMe SSD, a good example of modern consumer electronics running laps around everything above it in the pro/industrial world.</td></tr>
</tbody></table>
Of course, the other enabling factor is the use of <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html">wavelet compression</a> to reduce the data rate by a ratio of at least 5:1. This might seem like cheating, but since a similar compression ratio is utilized by <a href="https://www.red.com/red-101/redcode-file-format">REDCODE</a>, <a href="https://www.blackmagicdesign.com/products/blackmagicraw">Blackmagic RAW</a>, <a href="https://support.apple.com/en-us/HT208671">Apple ProRes RAW</a>, and probably many other "raw" formats, I don't feel the least bit guilty. On a sensor like the CMV12000, a lightweight compression pass might actually <i>help</i> the image quality, since it'll inherently denoise the image somewhat.</div>
<div>
<br />
I also picked a good hardware platform for this project: the lower-end Zynq Ultrascale+ SoCs, like the one on the wonderful <a href="https://shop.trenz-electronic.de/en/TE0803-03-4AE11-A-MPSoC-Module-with-Xilinx-Zynq-UltraScale-ZU4CG-1E-2-GByte-DDR4-5.2-x-7.6-cm?c=452">XCZU4CG module</a> I've been using, have just barely enough resources to pull it off. I'm using 134 of the 144 LVDS-capable I/O, all four PCIe Gen3-capable transceivers, and most of the programmable logic. I was actually going to move up to the XCZU6, which has more than double the logic, but it seems to be on a different branch of the product tree that a) doesn't have PCIe hard blocks and b) isn't supported by <a href="https://www.xilinx.com/products/design-tools/vivado/vivado-webpack.html">Vivado HL WebPACK</a>. For now, I'll just try my best to optimize for the XCZU4. I still have the XCZU5 and XCZU7 available to me if needed, though.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-euC3DPK7TDJVDMQeAXwRbd6BwWAVMxbmd5W2Ca5D7gXEdE7cIMUyFAkl_Bw-_HlGI8t8CHET1nauDpPjeGS4zsLsyQIncKm_TUyYwWkkO3bXzsJKZmn8QwW1YiMj88ZsLMDMHCtqA-8/s1600/d03.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-euC3DPK7TDJVDMQeAXwRbd6BwWAVMxbmd5W2Ca5D7gXEdE7cIMUyFAkl_Bw-_HlGI8t8CHET1nauDpPjeGS4zsLsyQIncKm_TUyYwWkkO3bXzsJKZmn8QwW1YiMj88ZsLMDMHCtqA-8/s640/d03.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">XCZU4 module transferred over to the v0.2 carrier, which has some new connectors including a barrel jack for power in, a full-size HDMI out, and some isolated GPIO.</td></tr>
</tbody></table>
Although I'm sticking with the same ZU+ module, I did finally get around to building a <a href="http://scolton.blogspot.com/2019/11/zynq-ultrascale-superspeed-ram-dumping.html">new carrier</a>. The main addition is an HDMI transmitter, which I'll probably play with in the coming weeks. There's also a new barrel jack power input and an isolated GPIO connector. Other than a PCIe reference clock routing fix, I didn't touch any of the existing high-speed signals. I also (re)discovered good drag soldering technique, so there were zero issues with the 160-pin headers this time around.<br />
<br />
Most importantly, I felt confident enough in the design now to risk a color sensor on this board. Unlike my stockpile of monochrome CMV12000s from eBay, I bought this one new, at full price, and I do not want to break it. I haven't had any power supply issues in months of testing on the v0.1 board, so after a thorough multimeter check on the v0.2 board, I permanently soldered on the color sensor and crossed my fingers.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC88Wjh6bDo85EkZqPvgmzbxLt0t0CiI8Dwuv7OMpfye8v7xtJP3UxLR3yIUfav5uiFn0NGwbgnGlwMmvq2HtynmOsiu-YDo91KnxjQZAOHSOudhrxbQtBZHoL8oo-LtmFESg0r369Gow/s1600/d01.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1015" data-original-width="1600" height="406" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC88Wjh6bDo85EkZqPvgmzbxLt0t0CiI8Dwuv7OMpfye8v7xtJP3UxLR3yIUfav5uiFn0NGwbgnGlwMmvq2HtynmOsiu-YDo91KnxjQZAOHSOudhrxbQtBZHoL8oo-LtmFESg0r369Gow/s640/d01.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">My one and only CMV12000-2E5C1PA, now committed to this board.</td></tr>
</tbody></table>
The wavelet compression engine was designed for the <a href="https://en.wikipedia.org/wiki/Bayer_filter">Bayer-masked</a> color sensor, with independent encoding for each color field, so I didn't have to change any hardware or software to accommodate it. All the color processing happens off-board, a point of reckoning discussed below. But from an image capture point of view, it's 100% drop-in.<br />
<br />
Returning to the integration of the three main pieces of the image capture pipeline, there were a handful of small but important details to sort out regarding how frames are buffered in RAM and how they are written out to files on the SSD:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgR35A5fLNpd6vji6Qe1LZ9pbJJcUoHgCCbGWokiGqTLrwsw8bbzEjw6dlV8bGpGzN0_nUXCSBSXv34QU_TL4f2gxVSFymLP-ZgjKhDvVA36d8YnikLo0igtFC2EK8ZXclg_eHH_jF-x4s/s1600/d04.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="858" data-original-width="1600" height="342" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgR35A5fLNpd6vji6Qe1LZ9pbJJcUoHgCCbGWokiGqTLrwsw8bbzEjw6dlV8bGpGzN0_nUXCSBSXv34QU_TL4f2gxVSFymLP-ZgjKhDvVA36d8YnikLo0igtFC2EK8ZXclg_eHH_jF-x4s/s640/d04.png" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<br />
<ol>
<li>The AXI Master port on the <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#encoder">Encoder</a> module now transfers data from the 16 compressor codestream FIFOs to RAM in increments of 512B, one logical block / sector on disk. This greatly simplifies downstream file writing operations.<br /> </li>
<li>The (now sector-aligned) addresses of the next RAM write for each compressor codestream are presented to software via the Encoder module's AXI Slave. These addresses increment automatically as data is written, but can be overwritten by software.<br /> </li>
<li>The <a href="https://scolton.blogspot.com/2019/09/cmv12000-full-speed-384gbs-read-in-on.html">CMV Input</a> module generates an interrupt at the start of the Frame Overhead Time (FOT), a short (~20-30μs) period of deadtime between frame readouts where there shouldn't be any RAM writing happening.<br /> </li>
<li>During the FOT interrupt, software reads the Encoder RAM write addresses, resets them if necessary, and records the start address and size of each compressor codestream for a given frame as part of a 512B frame header. Frame headers are stored in a circular buffer in RAM.<br /> </li>
<li>In the main loop, frames are dequeued from RAM as needed by writing the frame header, followed by each of the 16 codestreams it references, to a file with FatFs. A new file is created every so often to prevent the file size from exceeding the 4GiB FAT32 limit.</li>
</ol>
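<div>Steps 2 and 4 above can be sketched in a few lines of C. This is only an illustration of the idea, with made-up names for the header fields and encoder pointers, not the project's actual layout:</div>

```c
#include <stdint.h>

#define NUM_STREAMS 16

/* One 512B frame header per frame, stored in a circular buffer in RAM.
 * Field names and layout are illustrative. */
typedef struct {
    uint64_t stream_addr[NUM_STREAMS]; /* start address of each codestream */
    uint32_t stream_size[NUM_STREAMS]; /* bytes written (sector-aligned)   */
    uint8_t  pad[512 - 12 * NUM_STREAMS]; /* pad to one 512B sector        */
} FrameHeader;

/* Called from the FOT interrupt: snapshot the encoder write addresses,
 * record each codestream's extent for the frame just read out, wrap any
 * pointer that has reached the top of its RAM range, and remember the
 * new addresses as the start of the next frame. */
void fot_record_frame(FrameHeader *hdr, uint64_t wr_addr[NUM_STREAMS],
                      uint64_t frame_start[NUM_STREAMS],
                      const uint64_t range_base[NUM_STREAMS],
                      const uint64_t range_top[NUM_STREAMS])
{
    for (int i = 0; i < NUM_STREAMS; i++) {
        hdr->stream_addr[i] = frame_start[i];
        hdr->stream_size[i] = (uint32_t)(wr_addr[i] - frame_start[i]);
        if (wr_addr[i] >= range_top[i])
            wr_addr[i] = range_base[i]; /* reset the circular buffer */
        frame_start[i] = wr_addr[i];
    }
}
```

<div>The real handler also reads the addresses from (and writes resets back to) the Encoder's AXI Slave registers; here they're passed in as plain arrays so the bookkeeping is visible.</div>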
<div>
This is all pretty easy to implement, at least compared to the three main hardware modules themselves. Most of it is ARM software, which can be iterated and debugged much faster than programmable logic. Also, having the codestream RAM address and size baked into the frame header helps with validating and analyzing the output:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0TuntkfrLI6elCBDrA24wU-85RdsP0VkREmY0lsg-0Bs_BTV9VYQN9L48NLVsGAp7s65MWueG6wVfWsay3smu6bRN1UUYJwI4ve6BzSPzq-g1I4z8gMARXQ-Vv9qanjVnihj0QXmSH6w/s1600/d05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="913" data-original-width="1600" height="364" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0TuntkfrLI6elCBDrA24wU-85RdsP0VkREmY0lsg-0Bs_BTV9VYQN9L48NLVsGAp7s65MWueG6wVfWsay3smu6bRN1UUYJwI4ve6BzSPzq-g1I4z8gMARXQ-Vv9qanjVnihj0QXmSH6w/s640/d05.png" width="640" /></a></div>
</div>
<div>
The codestream size per frame shows how bandwidth is being distributed to the 16 codestreams, with more bits per frame going to the low-frequency subbands. Compression ratios relative to raw 10-bit data are shown on the codestream size axis. Spikes can be seen during the portions of the clip where the steel wool burns brightest. There's also a gradual increase in codestream size over the 30 seconds, especially in the high-frequency subbands, which I believe is due to image sensor noise increasing with temperature. Feeding the codestream size back into the <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#quantizer">quantizer</a> settings to maintain a roughly constant bit rate will be a problem for another day.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi845mjezk9vrixjzFb_BslHpS5GJfpiDXW47F-fz23sFKHHZT1qZWEpb_7YtloNvvdJ7wdWopP9FeZCZ5GY5KXflspCFMbbd7OUw_AjvtxxIvETTdMtLTOm-YfgdGDTCRUCDW0g3U5KPQ/s1600/d06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1061" data-original-width="1600" height="424" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi845mjezk9vrixjzFb_BslHpS5GJfpiDXW47F-fz23sFKHHZT1qZWEpb_7YtloNvvdJ7wdWopP9FeZCZ5GY5KXflspCFMbbd7OUw_AjvtxxIvETTdMtLTOm-YfgdGDTCRUCDW0g3U5KPQ/s640/d06.png" width="640" /></a></div>
Each codestream is given a range of RAM addresses to use as a circular buffer for frame data. At the start of a clip, the encoder RAM write addresses are set to the bottom of their range. As data is written, the addresses increment automatically. When they reach the top of their range, software resets them during the FOT interrupt. Each codestream does this independently, but the overall frame buffering capability is limited by the most frequently reset stream. In this case, XX3 resets approximately every 400 frames, so a maximum of 1s can be buffered. Ideally, the RAM ranges would be sized proportionally to the codestream bit rates to maximize the number of frames that can be buffered.</div>
<div>
<br /></div>
<div>
The RAM frame buffer allows the pipeline to ride out NVMe writing delays that last for several frame intervals. Each frame is time stamped once when it is read into RAM and again when it is submitted for writing to the SSD. The two time stamps and the frame write backlog count are added to the frame header, and can be used to review the delay.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9N_CN69qJsa4yPlpgOkR6ZMlZ8nVFhWBoKiW_DBxY0YWJqA4d6cfr7n4STxQI3a_nU-JI7cZhz2qjtgJZVlM23y2TAZfYJTTOp13xFwA0eElbgFMYCfMGw990ollo2Z1FIY6RwFEJCjU/s1600/d07.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1224" data-original-width="1600" height="488" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9N_CN69qJsa4yPlpgOkR6ZMlZ8nVFhWBoKiW_DBxY0YWJqA4d6cfr7n4STxQI3a_nU-JI7cZhz2qjtgJZVlM23y2TAZfYJTTOp13xFwA0eElbgFMYCfMGw990ollo2Z1FIY6RwFEJCjU/s640/d07.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
Here, the read-to-read interval between frames is a steady 2.5ms, as it must be for 400fps. (A spike would indicate a dropped frame.) The read-to-write delay is typically 10ms, corresponding to a four frame backlog. This offset is set by software to allow for a small number of frames between the read and write pointer, which could be used to generate a low-latency local preview. There is a spike in read-to-write delay every 240 frames when a new file is created, since the file creation NVMe operations are <a href="https://scolton.blogspot.com/2019/11/zynq-ultrascale-fatfs-with-bare-metal.html">slower than streaming writes</a>. Most of these spikes are only one frame, but there were instances of higher delays, up to 20 total frames (50ms). This is still easily absorbed by the RAM buffer, although it would be good to understand where the extra delay comes from.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
So all 12000 frames (113.2Gpx) made it onto the SSD in 30 seconds. That's the last stop on the freight train of pixels - once they hit the flash, they're no longer volatile or time-critical. So in some sense this project is done. But from a practical standpoint, there's still an equal amount of computation that has to happen to decode the frames and run the inverse DWTs, not to mention the additional load of debayering, color correction, and eventual transcoding. What the ZU+ does at 400fps, my laptop CPU struggles to undo at 0.25fps. So implementing decode and inverse DWT in a GPU-accelerated way just became high-priority. Luckily, I remember that I do at least have an existing <a href="https://scolton.blogspot.com/2014/12/fun-with-pixel-shaders.html">GPU debayer and color correction solution</a>.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/2IHg6gl4_4c" width="640"></iframe></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I had to do a little work to modernize it, but since laptop hardware has also gotten way better since I wrote it, it can scrub through 4K raw just fine. Hopefully, I can implement the decode and IDWT there as well, to save the 4s per frame it currently takes to do those on the CPU. I'll also need the ZU+ to be able to do it, for preview and playback, but since it only has to run at 30fps that should be much easier than the forward direction.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Besides that, there are a few side-challenges I still have to take on:</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ol>
<li>The HDMI output, so I can see what the hell I'm doing. This will involve some amount of decoding and inverse DWT, at least of the 3rd and maybe 2nd wavelet stage, to generate a usable preview image with a menu and status overlay. I don't have the ability to output 4K, so a 1080p preview will have to suffice.<br /> </li>
<li>USB mass-storage device access to the SSD. As much as I love my <a href="https://www.amazon.com/Sabrent-EC-NVME-Aluminum-Enclosure-Nvme/dp/B07K4TZQ7D">Sabrent NVMe SSD enclosure</a>, I fear I am approaching the insertion/removal cycle limit on this drive. I have already demonstrated <a href="https://scolton.blogspot.com/2019/11/zynq-ultrascale-superspeed-ram-dumping.html">USB 3.0-speed mass storage device access to ZU+ RAM</a>, so this should be a simple SCSI bridge project. Bonus points if I can implement a few custom SCSI commands to start/stop recording or do other useful control tasks.<br /> </li>
<li>An alternative programmable logic configuration for the CMV12000's subsampled read-out mode. In this mode, the X and Y read-out skips every other 2x2 Bayer block, for a maximum resolution of 2048x1536 at a frame rate of 1050fps. Or an even more interesting 1472fps at 1080p. Because this mode reads in four rows in parallel, it will require more Stage 1 <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#h26">Horizontal DWT Cores</a>, which might be tough on the XCZU4.</li>
</ol>
<div>
But having the full capture pipeline in place is a good milestone, and I'm happy that it's pretty close to what I had in mind back in <a href="https://scolton.blogspot.com/2019/06/freight-train-of-pixels.html">June</a>. Some of the details changed as I learned about the capabilities and limitations of the ZU+, but the high-level architecture is as-planned. Being able to capture 3.8Gpx/s continuously (for less than 5nJ/px, too) is something that I think is new and only recently possible with the latest generation SSDs and FPGA hardware working together. Keeping an eye on new components and trying to figure out interesting corner cases where they might be useful is something I enjoy, so this was a fun challenge.</div>
</div>
</div>
Shane Colton, 2019-11-29: Zynq Ultrascale+ FatFs and Direct Speed Tests with Bare Metal NVMe via AXI-PCIe Bridge<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAsdsZd7roWIjycpWXHPF8M2p2Vnb86JDFHxi2-vYBcrG8Td4zn15hnbRMOJVYypUVOMX_dGF3TRiELwtkkjt94jDn_-yOVz9tlntT7Yepu0EGiX6fJEy9oa1xpmBjOIqyiLtqiqHZqMI/s1600/c99.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAsdsZd7roWIjycpWXHPF8M2p2Vnb86JDFHxi2-vYBcrG8Td4zn15hnbRMOJVYypUVOMX_dGF3TRiELwtkkjt94jDn_-yOVz9tlntT7Yepu0EGiX6fJEy9oa1xpmBjOIqyiLtqiqHZqMI/s640/c99.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Blue wire PCIe REFCLK still hanging in there...</td></tr>
</tbody></table>
It's time to return to the problem of <a href="https://scolton.blogspot.com/2019/06/freight-train-of-pixels.html">sinking 1GB/s</a> of data onto an NVMe drive from a Zynq Ultrascale+ SoC. Last time, I <a href="https://scolton.blogspot.com/2019/07/benchmarking-nvme-through-zynq.html">benchmarked</a> the Xilinx Linux drivers and found that they were fast, but not quite fast enough. In the comments of that post, there were many good suggestions for how to make up the difference without having to resort to a hardware accelerator. The consensus is that the hardware, namely the stock AXI-PCIe bridge, should be fast enough.<br />
<br />
While a lot of the suggestions were ways to speed up the data transfer in Linux, and I have no doubt those would work, I also just don't want or need to run Linux in this application. The <a href="https://scolton.blogspot.com/2019/09/cmv12000-full-speed-384gbs-read-in-on.html">sensor input</a> and <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html">wavelet compression</a> modules are entirely built in Programmable Logic (PL), with only a minimal interface to the Processing System (PS) for configuration and control. So, I'm able to keep my entire application in the 256KB On-Chip Memory (OCM), leaving the external DDR4 RAM bandwidth free for data transfer.<br />
<br />
After compression, the data is already in the DDR4 RAM where it should be visible to whatever DMA mechanism is responsible for transferring data to an NVMe drive. As <a href="https://www.blogger.com/profile/16491915174390340818">Ambivalent Engineer</a> points out in the comments:<br />
<blockquote class="tr_bq">
<span style="color: #f9cb9c;">It should be possible to issue commands directly to the NVMe from software by creating a command and completion queue pair and writing directly to the command queue.</span></blockquote>
In other words, write a bare metal NVMe driver to interface with the AXI-PCIe bridge directly for initiating and controlling data transfers. This seems like a good fit, both to this specific application and to my general proclivity, for better or worse, to move to lower-level code when I get stuck. A good place to start is by exploring the functionality of the AXI-PCIe bridge itself.<br />
<h4>
AXI-PCIe Bridge</h4>
<div>
Part of the reason it took me a while to understand the AXI-PCIe bridge is that it has many names. The version for Zynq-7000 is called <a href="https://www.xilinx.com/products/intellectual-property/axi_pcie.htm">AXI Memory Mapped to PCI Express (PCIe) Gen2</a>, and is covered in <a href="https://www.xilinx.com/products/intellectual-property/axi_pcie.html#documentation">PG055</a>. The version for Zynq Ultrascale is called <a href="https://www.xilinx.com/products/intellectual-property/axi_pcie_gen3.html">AXI PCI Express (PCIe) Gen 3 Subsystem</a>, and is covered in <a href="https://www.xilinx.com/products/intellectual-property/axi_pcie_gen3.html#documentation">PG194</a>. And the version for Zynq Ultrascale+ is called <a href="https://www.xilinx.com/products/intellectual-property/pcie-dma.html">DMA for PCI Express (PCIe) Subsystem</a>, and is nominally covered in <a href="https://www.xilinx.com/products/intellectual-property/pcie-dma.html#documentation">PG195</a>. But, when operated in bridge mode, as it will be here, it's actually still documented in <a href="https://www.xilinx.com/products/intellectual-property/axi_pcie_gen3.html#documentation">PG194</a>. I'll be focusing on this version. </div>
<div>
<br /></div>
<div>
Whatever the name, the block diagram looks like this:</div>
<div>
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtwnk5a4cLmOrKn8X9wgSp40DuhRA2CjpjpmL4k-d49tgS9-iD-Cgyg-u8hKgUZzuXazH6MvxIaUQyp2kFB9sGCH-8tPUGf8qta2m7GiN80jI3uoyT8h0bJysh6ZAbIojeoPvupf1ww_E/s1600/c91.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="940" data-original-width="1600" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtwnk5a4cLmOrKn8X9wgSp40DuhRA2CjpjpmL4k-d49tgS9-iD-Cgyg-u8hKgUZzuXazH6MvxIaUQyp2kFB9sGCH-8tPUGf8qta2m7GiN80jI3uoyT8h0bJysh6ZAbIojeoPvupf1ww_E/s640/c91.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">AXI-PCIe Bridge Root Port block diagram, adapted from <a href="https://www.xilinx.com/products/intellectual-property/axi_pcie_gen3.html#documentation">PG194</a> Figure 1.</td></tr>
</tbody></table>
</div>
<div style="text-align: left;">
The <span style="color: #9fc5e8;">AXI-Lite Slave Interface</span> is straightforward, allowing access to the bridge control and configuration registers. For example, the PHY Status/Control Register (offset <span style="color: #9fc5e8;">0x144</span>) has information on the PCIe link, such as speed and width, that can be useful for debugging. When the bridge is configured as a Root Port, as it must be to host an NVMe drive, this address space also provides access to the PCIe Configuration Space of both the Root Port itself, at offset <span style="color: #9fc5e8;">0x0</span>, and the enumerated Endpoint devices, at other offsets.</div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGf_rJTQbXh1O2mEVOqXc6wOj_LblHgR_a3K72jGi06atH91qKJ-ayQbbezbtzq_SE_Fodugxz9vyb9L3JEUHsdqH3OH_tP-PoJU-2RMSP0jg-KxAS7VijcUNSXxAAOAzJ3Jyj3pUHn_c/s1600/c92.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1087" data-original-width="1600" height="433" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGf_rJTQbXh1O2mEVOqXc6wOj_LblHgR_a3K72jGi06atH91qKJ-ayQbbezbtzq_SE_Fodugxz9vyb9L3JEUHsdqH3OH_tP-PoJU-2RMSP0jg-KxAS7VijcUNSXxAAOAzJ3Jyj3pUHn_c/s640/c92.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">PCIe Configuration Space layout, adapted from <a href="https://www.xilinx.com/products/intellectual-property/pcie4-ultrascale-plus.html#documentation">PG213</a> Table 2-35.</td></tr>
</tbody></table>
If the NVMe drive has successfully enumerated, its Endpoint PCIe Configuration Space will be mapped to some offset in the <span style="color: #9fc5e8;">AXI-Lite Slave</span> register space. In my case, with no switch involved, it shows up as Bus 1, Device 0, Function 0 at offset <span style="color: #9fc5e8;">0x1000</span>. Here, it's possible to check the Device ID, Vendor ID, and Class Codes. Most importantly, the BAR0 register holds the PCIe memory address assigned to the device. <span style="color: yellow;">The AXI address assigned to BAR0 in the Address Editor in Vivado is mapped to this PCIe address by the bridge.</span></div>
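<div>A quick sanity check of the enumerated endpoint through the <span style="color: #9fc5e8;">AXI-Lite Slave</span> might look like the sketch below. The 0x1000 offset matches the bus 1 / device 0 / function 0 mapping described above; the field layout is the standard PCIe Type 0 configuration header:</div>

```c
#include <stdint.h>

#define ECAM_EP_OFFSET 0x1000u /* endpoint config space in AXI-Lite space */

typedef struct { uint16_t vid, did; uint32_t class_code, bar0; } EpIds;

/* Read the endpoint's IDs and BAR0 from its PCIe Configuration Space.
 * bridge_base points at the bridge's AXI-Lite register space. */
EpIds read_endpoint_ids(const volatile uint32_t *bridge_base)
{
    const volatile uint32_t *cfg = bridge_base + ECAM_EP_OFFSET / 4;
    EpIds e;
    uint32_t id = cfg[0x00 / 4];       /* [15:0] Vendor ID, [31:16] Device ID */
    e.vid = (uint16_t)(id & 0xFFFF);
    e.did = (uint16_t)(id >> 16);
    e.class_code = cfg[0x08 / 4] >> 8; /* [31:8] Class Code; NVMe is 0x010802 */
    e.bar0 = cfg[0x10 / 4] & ~0xFu;    /* mask BAR type/prefetch flag bits    */
    return e;
}
```

<div>Checking that the Class Code reads back as 0x010802 (mass storage, NVM subsystem, NVMe interface) is a cheap way to confirm enumeration worked before going any further.</div>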
<div>
<br /></div>
<div>
Reads from and writes to the AXI BAR0 address are done through the <span style="color: #ffe599;">AXI Slave Interface</span>. This is a full AXI interface supporting burst transactions and a wide data bus. In another class of PCIe device, it might be responsible for transferring large amounts of data to the device through the BAR0 address range. But for an NVMe drive, BAR0 just provides access to the NVMe Controller Registers, which are used to set up the drive and inform it of pending data transfers.</div>
<div>
<br /></div>
<div>
The <span style="color: #c27ba0;">AXI Master Interface</span> is where <i>all</i> NVMe data transfer occurs, for both reads and writes. One way to look at it is that the drive itself contains the DMA engine, which issues memory reads and writes to the system (AXI) memory space through the bridge. The host requests that the drive perform these data transfers by submitting them to a queue, which is also contained in system memory and accessed through this interface.</div>
<h4>
Bare Metal NVMe</h4>
<div>
Fortunately, NVMe is an open standard. The <a href="https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf">specification</a> is about 400 pages, but it's fairly easy to follow, especially with help from <a href="https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2013/20130812_PreConfD_Marks.pdf">this tutorial</a>. The NVMe Controller, which is implemented on the drive itself, does most of the heavy lifting. The host only has to do some initialization and then maintain the queues and lists that control data transfers. It's worth looking at a high-level diagram of what should be happening before diving in to the details of how to do it:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMp4VLATTAXxAGJLZYnl9KcKh7y3AxcEx9IZsOXAsPt0dZVKiHQrjGjwxwfiPCeSQcURJ0JfHzfjTJIhUzBpw9Tqw9y2QhdNQ_mp5Dm95beMYVu3Q2A8J0iDFz-K2Heue4JWmr0nCnkhY/s1600/c93.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="984" data-original-width="1600" height="245" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMp4VLATTAXxAGJLZYnl9KcKh7y3AxcEx9IZsOXAsPt0dZVKiHQrjGjwxwfiPCeSQcURJ0JfHzfjTJIhUzBpw9Tqw9y2QhdNQ_mp5Dm95beMYVu3Q2A8J0iDFz-K2Heue4JWmr0nCnkhY/s400/c93.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">System-level look at NVMe data flow, with primary data streaming <i>from</i> a source <i>to</i> the drive.</td></tr>
</tbody></table>
<div style="text-align: left;">
After BAR0 is set, the host has access to the NVMe drive's Controller Registers through the bridge's <span style="color: #ffe599;">AXI Slave Interface</span>. They are just like any other device/peripheral control registers, used for low-level configuration, status, and control of the drive. The register map is defined in the <a href="https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf">NVMe Specification</a>, Section 2.</div>
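The very start of that register map can be sketched as a C struct; this is a minimal sketch using the offsets from the spec (the base address is whatever was programmed into BAR0, reached through the bridge's <span style="color: #ffe599;">AXI Slave Interface</span>, and the field names abbreviate the spec's register names):

```c
#include <stddef.h>
#include <stdint.h>

// Start of the NVMe Controller Register map (NVMe 1.4, Section 2).
// The doorbell registers follow at offset 0x1000.
typedef struct {
    volatile uint64_t CAP;   // 0x00: Controller Capabilities
    volatile uint32_t VS;    // 0x08: Version
    volatile uint32_t INTMS; // 0x0C: Interrupt Mask Set
    volatile uint32_t INTMC; // 0x10: Interrupt Mask Clear
    volatile uint32_t CC;    // 0x14: Controller Configuration
    volatile uint32_t RSVD0; // 0x18: Reserved
    volatile uint32_t CSTS;  // 0x1C: Controller Status
    volatile uint32_t NSSR;  // 0x20: NVM Subsystem Reset
    volatile uint32_t AQA;   // 0x24: Admin Queue Attributes
    volatile uint64_t ASQ;   // 0x28: Admin Submission Queue Base
    volatile uint64_t ACQ;   // 0x30: Admin Completion Queue Base
} nvme_regs;
```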
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
One of the first things the host has to do is allocate some memory for the <span style="color: orange;">Admin Submission Queue</span> and <span style="color: orange;">Admin Completion Queue</span>. A Submission Queue (SQ) is a circular buffer of commands submitted to the drive by the host. It's written by the host and read by the drive (via the bridge <span style="color: #c27ba0;">AXI Master Interface</span>). A Completion Queue (CQ) is a circular buffer of notifications of completed commands from the drive. It's written by the drive (via the bridge <span style="color: #c27ba0;">AXI Master Interface</span>) and read by the host. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The <span style="color: orange;">Admin SQ/CQ</span> are used to submit and complete commands relating to drive identification, setup, and control. They can be located anywhere in system memory, as long as the bridge has access to them, but in the diagram above they're shown in the external DDR4. The host software notifies the drive of their address and size by setting the relevant Controller Registers (via the bridge <span style="color: #ffe599;">AXI Slave Interface</span>). After that, the host can start to submit and complete admin commands:</div>
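Concretely, that notification might look like the following sketch; <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">nvme_aqa</span> is a helper name of my own, and the 16-entry queue sizes are my choice, not anything from the spec:

```c
#include <stdint.h>

// Pack the Admin Queue Attributes (AQA) register: ACQS (CQ size) in
// bits [27:16] and ASQS (SQ size) in bits [11:0], both zero-based.
static uint32_t nvme_aqa(uint32_t sq_entries, uint32_t cq_entries) {
    return ((cq_entries - 1) << 16) | ((sq_entries - 1) & 0xFFF);
}

// Usage against the Controller Registers (regs = the BAR0 mapping):
//   regs->AQA = nvme_aqa(16, 16);              // 16-entry queues
//   regs->ASQ = (uint64_t)(uintptr_t)admin_sq; // 4 KiB-aligned buffers
//   regs->ACQ = (uint64_t)(uintptr_t)admin_cq;
```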
<div style="text-align: left;">
</div>
<ol>
<li>The host software writes one or more commands to the <span style="color: orange;">Admin SQ</span>.</li>
<li>The host software notifies the drive of the new command(s) by updating the <span style="color: orange;">Admin SQ</span> doorbell in the Controller Registers through the bridge <span style="color: #ffe599;">AXI Slave Interface</span>.</li>
<li>The drive reads the command(s) from the Admin SQ through the bridge <span style="color: #c27ba0;">AXI Master Interface</span>.</li>
<li>The drive completes the command(s) and writes an entry to the <span style="color: orange;">Admin CQ</span> for each, through the bridge <span style="color: #c27ba0;">AXI Master Interface</span>. Optionally, an interrupt is triggered.</li>
<li>The host reads the completion(s) and updates the <span style="color: orange;">Admin CQ</span> doorbell in the Controller Registers, through the <span style="color: #ffe599;">AXI Slave Interface</span>, to tell the drive where to place the next completion.</li>
</ol>
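Steps 2 and 5 come down to 32-bit writes to doorbell registers, which sit at fixed offsets after the Controller Registers. A sketch of the arithmetic, with the offset formula from the spec (Section 3.1); the helper names are mine, and <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">dstrd</span> is the CAP.DSTRD doorbell stride field:

```c
#include <stdint.h>

// Doorbell offsets from the start of the Controller Registers, for
// queue ID qid: SQ tail doorbells at even slots, CQ head at odd.
static uint32_t sq_doorbell_offset(uint32_t qid, uint32_t dstrd) {
    return 0x1000 + (2 * qid) * (4u << dstrd);
}
static uint32_t cq_doorbell_offset(uint32_t qid, uint32_t dstrd) {
    return 0x1000 + (2 * qid + 1) * (4u << dstrd);
}

// After writing a command at the SQ tail, advance the tail around the
// circular buffer; the new tail value is what gets written to the
// SQ doorbell to notify the drive.
static uint32_t sq_advance(uint32_t sq_tail, uint32_t sq_entries) {
    return (sq_tail + 1) % sq_entries;
}
```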
<div>
In some cases, an admin command may request identification or capability data from the drive. If the data is too large to fit in the <span style="color: orange;">Admin CQ</span> entry, the command will also specify an address to which to write the requested data. For example, during initialization, the host software requests the Controller Identification and Namespace Identification structures, described in the <a href="https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf">NVMe Specification</a>, Section 5.15.2. These contain information about the capabilities, size, and low-level format (below the level of file systems or even partitions) of the drive. The space for these <span style="color: #3d85c6;">IDs</span> must also be allocated in system memory before they're requested.</div>
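As an example of such a command, an Identify Controller submission entry might be built like this. The 64-byte entry layout and opcode come from the spec (Sections 4.2 and 5.15); <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">nvme_sqe</span> and <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">build_identify_ctrl</span> are illustrative names of my own:

```c
#include <stdint.h>
#include <string.h>

// A Submission Queue Entry is 64 bytes (NVMe 1.4, Section 4.2).
typedef struct {
    uint32_t cdw0;       // [31:16] Command ID, [7:0] opcode
    uint32_t nsid;       // Namespace ID (unused for Identify Controller)
    uint32_t rsvd[2];
    uint64_t mptr;       // Metadata pointer (unused here)
    uint64_t prp1, prp2; // Data pointers
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
} nvme_sqe;

// Build an admin Identify Controller command (opcode 0x06, CNS = 1),
// pointing PRP1 at a pre-allocated 4 KiB ID buffer in system memory.
static void build_identify_ctrl(nvme_sqe *sqe, uint16_t cid, uint64_t id_buf) {
    memset(sqe, 0, sizeof(*sqe));
    sqe->cdw0  = ((uint32_t)cid << 16) | 0x06;
    sqe->prp1  = id_buf;
    sqe->cdw10 = 1;  // CNS = 1: Identify Controller data structure
}
```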
<div>
<br /></div>
<div>
Within the <span style="color: #3d85c6;">IDs</span> is information that indicates the Logical Block (LB) size, which is the minimum addressable memory unit in the non-volatile memory. 512B is typical, although some drives can also be formatted for 4KiB LBs. Many other variables are given in units of LBs, so it's important for the host to grab this value. There's also a maximum and minimum page size, defined in the Controller Registers themselves, which applies to system memory. It's up to the host software to configure the actual system memory page size in the Controller Registers, but it has to be between these two values. 4KiB is both the absolute minimum and the typical value. It's still possible to address system memory in smaller increments (down to 32-bit alignment); this value just affects how much can be read/written per page entry in an I/O command or PRP List (see below).</div>
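The page-size fields use a power-of-two encoding: CC.MPS (and the CAP.MPSMIN/MPSMAX bounds) give the page size as 2^(12+MPS), so the 4KiB minimum corresponds to MPS = 0. A one-line sketch (helper name mine):

```c
#include <stdint.h>

// Decode an MPS field value (CC.MPS, CAP.MPSMIN, or CAP.MPSMAX) to a
// memory page size in bytes: 2^(12 + MPS). MPS = 0 -> 4 KiB.
static uint32_t mps_to_bytes(uint32_t mps) {
    return 1u << (12 + mps);
}
```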
<div>
<br /></div>
<div>
Once all identification and configuration tasks are complete, the host software can then set up one or more I/O queue pairs. In my case, I just want one <span style="color: #76a5af;">I/O SQ</span> and one <span style="color: #76a5af;">I/O CQ</span>. These are allocated in system memory, then the drive is notified of their address and size via admin commands. The <span style="color: #76a5af;">I/O CQ</span> must be created first, since the <span style="color: #76a5af;">I/O SQ</span> creation references it. Once created, the host can start to submit and complete I/O commands, using a similar process as for admin commands.</div>
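A sketch of the two admin commands involved, laid out as 16 dwords with field positions from the spec (Section 5); the function names, queue ID, and base addresses are illustrative, not from my actual driver:

```c
#include <stdint.h>
#include <string.h>

// Admin command to create an I/O Completion Queue (opcode 0x05).
// PRP1 (dwords 6-7) holds the queue base; CDW10 holds the zero-based
// size [31:16] and queue ID [15:0]; CDW11 bit 0 = physically contiguous.
static void build_create_io_cq(uint32_t sqe[16], uint16_t cid,
                               uint64_t cq_base, uint16_t qid,
                               uint16_t entries) {
    memset(sqe, 0, 64);
    sqe[0]  = ((uint32_t)cid << 16) | 0x05;
    sqe[6]  = (uint32_t)cq_base;
    sqe[7]  = (uint32_t)(cq_base >> 32);
    sqe[10] = ((uint32_t)(entries - 1) << 16) | qid;
    sqe[11] = 0x1;  // PC; interrupts disabled (polling)
}

// Admin command to create an I/O Submission Queue (opcode 0x01); it
// must reference an existing CQ via CDW11 [31:16], hence the ordering.
static void build_create_io_sq(uint32_t sqe[16], uint16_t cid,
                               uint64_t sq_base, uint16_t qid,
                               uint16_t entries, uint16_t cqid) {
    memset(sqe, 0, 64);
    sqe[0]  = ((uint32_t)cid << 16) | 0x01;
    sqe[6]  = (uint32_t)sq_base;
    sqe[7]  = (uint32_t)(sq_base >> 32);
    sqe[10] = ((uint32_t)(entries - 1) << 16) | qid;
    sqe[11] = ((uint32_t)cqid << 16) | 0x1;  // CQID + PC
}
```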
<div>
<br /></div>
<div>
I/O commands perform general purpose writes (from system memory to non-volatile memory) or reads (from non-volatile memory to system memory) over the bridge's <span style="color: #c27ba0;">AXI Master Interface</span>. If the data to be transferred spans more than two memory pages (typically 4KiB each), then a Physical Region Page (PRP) List is created along with the command. For example, a write of 24 512B LBs starting in the middle of a 4KiB page might reference the data like this:</div>
<div>
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBKgxAtAOvgkyVwh-qT3i3fhtYWuXwTvo2B70-YxMKr7Qr0fqGmSVE1n8vOPK0zeTLnG-eDSgOzXfqNig_eEMRTBL2_qh80rY70PTzDdxowlTr2blgi9kiPklyGg0VOrW3Ny9BE4KDGoQ/s1600/c94.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="616" data-original-width="1600" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBKgxAtAOvgkyVwh-qT3i3fhtYWuXwTvo2B70-YxMKr7Qr0fqGmSVE1n8vOPK0zeTLnG-eDSgOzXfqNig_eEMRTBL2_qh80rY70PTzDdxowlTr2blgi9kiPklyGg0VOrW3Ny9BE4KDGoQ/s640/c94.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A PRP List is required for data transfers spanning more than two memory pages.</td></tr>
</tbody></table>
</div>
<div>
The first PRP Address in the I/O command can have any 32-bit-aligned offset within a page, but subsequent addresses must be page-aligned. The drive knows whether to expect a PRP Address or PRP List Pointer in the second PRP field of the I/O command based on the amount of data being transferred. It will also only pull as much data as is needed from the last page on the list to reach the final LB count. There is no requirement that the pages in the PRP List be contiguous, so it can also be used as a scatter-gather list with 4KiB granularity. The PRP List for a particular command must be kept in memory until the command completes, so some kind of PRP Heap is necessary if multiple commands can be in flight.<br />
<br />
Some (most?) drives also have a Volatile Write Cache (VWC) that buffers write data. In this case, an I/O write completion may not indicate that the data has been written to non-volatile memory. An I/O flush command forces this data to be written to non-volatile memory before a completion entry is written to the <span style="color: #76a5af;">I/O CQ</span> for that flush command.</div>
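The page accounting above can be sketched as follows, assuming 4KiB pages; <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">prp_pages_needed</span> and <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">build_prp_list</span> are illustrative helpers of my own:

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

// Count how many 4 KiB pages a transfer touches, given the byte offset
// of the first LB within its first page (32-bit-aligned, per the spec).
// A PRP List is needed when this exceeds 2; otherwise PRP1/PRP2 hold
// the page addresses directly.
static uint32_t prp_pages_needed(uint32_t first_offset, uint32_t nbytes) {
    return (first_offset + nbytes + PAGE_SIZE - 1) / PAGE_SIZE;
}

// Fill a PRP List with the page addresses for pages 2..N of a
// transfer; list entries must be page-aligned. The pages need not be
// contiguous, which is what enables 4 KiB-granularity scatter-gather.
static void build_prp_list(uint64_t *list, const uint64_t *pages,
                           uint32_t npages) {
    for (uint32_t i = 0; i < npages; i++)
        list[i] = pages[i] & ~((uint64_t)PAGE_SIZE - 1);
}
```

For the example above, a 24-LB (12KiB) write starting in the middle of a 4KiB page touches four pages, so a PRP List is required.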
<div>
<br /></div>
<div>
That's about it for things that are described explicitly in the specification. <span style="color: yellow;">Everything past this point is implementation detail that is much more application-specific.</span></div>
<div>
<br /></div>
<div>
A key question the host software NVMe driver needs to answer is whether or not to wait for a particular completion before issuing another command. For admin commands that run once during initialization and are often dependent on data from previous commands, it's fine to always wait. For I/O commands, though, it really depends. I'll be using write commands as an example, since that's my primary data direction, but there's a symmetric case for reads.</div>
<div>
<br /></div>
<div>
If the host software issues a write command referencing a range of data in system memory and then immediately changes the data, without waiting for the write command to be completed, then the write may be corrupted. To prevent this, the software could:</div>
<div>
<ol>
<li>Wait for completion before allowing the original data to be modified. (Maybe there are other tasks that can be done in parallel.)</li>
<li>Copy the data to an intermediate buffer and issue the write command referencing that buffer instead. The original data can then be modified without waiting for completion.</li>
</ol>
<div>
Both could have significant speed penalties. The copy option is pretty much out of the question for me. But usually I can satisfy the first constraint: If the data is from a stream that's being buffered in memory, the host software can issue NVMe write commands that consume one end of the stream while the data source is feeding in new data at the other end. With appropriate flow control, these write commands don't have to wait for completion.</div>
</div>
<div>
<br /></div>
<div>
My "solution" is just to push the decision up one layer: the driver <i>never</i> blocks on I/O commands, but it can inform the application of the I/O queue backlog as the slip between the queues, derived from sequentially-assigned command IDs. If a particular process thinks it can get away without waiting for completions, it can put more commands in flight (up to some slip threshold).<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHecbogWeVRhWoC_Adaxb2RlcqauE0kKnFEbhqxJnarSYeJXAh32g-kSAaaEWzqeMaH-av8FzBSTWU3CMNu5-kTaGqQJCUfbOOLqMK6bsHzibjVtvXNvaVPwMGjzG2mXyOcbfHTOHqMDU/s1600/c95.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="889" data-original-width="1600" height="354" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHecbogWeVRhWoC_Adaxb2RlcqauE0kKnFEbhqxJnarSYeJXAh32g-kSAaaEWzqeMaH-av8FzBSTWU3CMNu5-kTaGqQJCUfbOOLqMK6bsHzibjVtvXNvaVPwMGjzG2mXyOcbfHTOHqMDU/s640/c95.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">An example showing the driver ready to submit Command ID 72, with the latest completion being Command ID 67. The doorbells always point to the next free slot in the circular buffer, so the entry there has the oldest ID.</td></tr>
</tbody></table>
</div>
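A sketch of the slip computation, under my reading of the figure above (command IDs between the latest completion and the next submission are still in flight); <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">io_slip</span> is a made-up name, and plain unsigned 16-bit arithmetic absorbs wraparound of the ID counter:

```c
#include <stdint.h>

// Backlog ("slip") between the queues, from sequentially-assigned
// command IDs: everything after the latest completion and before the
// next submission is in flight. For the figure's example (ready to
// submit ID 72, latest completion ID 67), IDs 68-71 are in flight.
static uint16_t io_slip(uint16_t next_submit_id, uint16_t last_completed_id) {
    return (uint16_t)(next_submit_id - last_completed_id - 1);
}

// A process that can tolerate latency keeps submitting commands while
// io_slip(...) stays below its chosen threshold.
```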
<div>
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
<div>
I'm also totally fine with polling for completions, rather than waiting for interrupts. Having a general-purpose completion polling function that takes as an argument a maximum number of completions to process in one call seems like the way to go. <a href="https://github.com/nvmedirect/nvmedirect/blob/master/library/lib_nvmed.c#L246">NVMeDirect</a>, <a href="https://github.com/spdk/spdk/blob/9a25fc12bb7585c9cbcc8f642e38ecc879313a32/lib/nvme/nvme_pcie.c#L2066">SPDK</a>, and <a href="https://github.com/coreboot/depthcharge/blob/master/src/drivers/storage/nvme.c#L222">depthcharge</a> all take this approach. (All three are good open-source references for light and fast NVMe drivers.)</div>
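A minimal sketch of such a polling function, assuming the standard 16-byte completion entry with the phase tag in bit 16 of dword 3; the names are mine, not taken from any of those drivers:

```c
#include <stdint.h>

// One 16-byte Completion Queue Entry. Dword 3 holds the status field
// [31:17], phase tag [16], and command ID [15:0].
typedef struct {
    uint32_t dw0, dw1, dw2, dw3;
} nvme_cqe;

// Process up to 'budget' completions from a CQ. The drive flips the
// phase tag on each pass through the circular buffer, so an entry
// whose tag matches the expected phase is new.
static uint32_t poll_completions(nvme_cqe *cq, uint32_t entries,
                                 uint32_t *head, uint32_t *phase,
                                 uint32_t budget) {
    uint32_t n = 0;
    while (n < budget) {
        const nvme_cqe *cqe = &cq[*head];
        if (((cqe->dw3 >> 16) & 1u) != *phase)
            break;                      // no new completion at head
        // ... handle completion for command ID (cqe->dw3 & 0xFFFF) ...
        n++;
        if (++(*head) == entries) {     // wrap: expected phase flips
            *head = 0;
            *phase ^= 1u;
        }
        // Real driver: write *head to the CQ head doorbell afterwards.
    }
    return n;
}
```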
<div>
<br /></div>
<div>
With this set up, I can run a speed test that issues read/write commands for blocks of data as fast as possible while trying to keep the I/O slip at a constant value:</div>
<div>
<br /></div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC6CzDrS_LwqEz1KsgWYCvWRyaYmtgPvPH41MoQJN9cOH7eftSyVXCvG0oLp0XgTEVxINcSMeAVdf6GDKAzTxRJjQyf-jMyzeTix0c9x9WWL6Kl35pKxB_N8zY44FMCS7ENO4NvsyXxBM/s1600/c96.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="671" data-original-width="1600" height="268" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC6CzDrS_LwqEz1KsgWYCvWRyaYmtgPvPH41MoQJN9cOH7eftSyVXCvG0oLp0XgTEVxINcSMeAVdf6GDKAzTxRJjQyf-jMyzeTix0c9x9WWL6Kl35pKxB_N8zY44FMCS7ENO4NvsyXxBM/s640/c96.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Speed test for raw NVMe write/read on a 1TB Samsung 970 Evo Plus.</td></tr>
</tbody></table>
For smaller block transfers, the bottleneck is on my side, either in the driver itself or by hitting a limit on the throughput of bus transactions somewhere in the system. But for larger block transfers (32KiB and above) the read and write speeds split, suggesting that the drive becomes the bottleneck. And that's totally fine with me, since it's hitting <span style="color: yellow;">64% (write) and 80% (read) of the maximum theoretical PCIe Gen3 x4 bandwidth</span>.</div>
<div>
<br /></div>
<div>
Sustained write speeds begin to drop off after about 32GiB. The Samsung Evo SSDs have a feature called TurboWrite that uses some fraction of the non-volatile memory array as fast <a href="https://en.wikipedia.org/wiki/Multi-level_cell#Single-level_cell">Single-Level Cell</a> (SLC) memory to buffer writes. Unlike the VWC, this is still non-volatile memory, but it gets transferred to more compact <a href="https://en.wikipedia.org/wiki/Multi-level_cell">Multi-Level Cell</a> (MLC) memory later since it's slower to write multi-level cells. The 1TB drive that I'm using has around 42GB of TurboWrite capacity according to <a href="https://www.relaxedtech.com/reviews/samsung/970-evo-plus/">this review</a>, so a drop-off in sustained write speeds after 32GiB makes sense. Even the sustained write speed is 1.7GB/s, though, which is more than fast enough for my application.<br />
<br />
A bigger issue with sustained writes might be getting rid of heat. This drive draws about 7W during max speed writing, which nearly doubles the total dissipated power of the whole system, probably making a fan necessary. Then again, at these write speeds a 0.2kg chunk of aluminum would only heat up about 25ºC before the drive is full... In any case, the drive will also need a good conduction path to the rear enclosure, which will act as the heat sink.</div>
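That ~25ºC figure works out as a rough energy balance, assuming aluminum's specific heat of about 900 J/(kg·K) and the 1.7GB/s sustained rate filling the 1TB drive:

```latex
t \approx \frac{1\,\mathrm{TB}}{1.7\,\mathrm{GB/s}} \approx 590\,\mathrm{s},
\qquad
Q \approx 7\,\mathrm{W} \times 590\,\mathrm{s} \approx 4.1\,\mathrm{kJ},
\qquad
\Delta T = \frac{Q}{m c}
         \approx \frac{4100\,\mathrm{J}}{0.2\,\mathrm{kg} \times 900\,\mathrm{J/(kg \cdot K)}}
         \approx 23\,\mathrm{K}
```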
<h4>
FatFs</h4>
</div>
<div>
I am more than content with just dumping data to the SSD directly as described above and leaving the task of organizing it to some later, non-time-critical process. But, if I can have it arranged neatly into files on the way in, all the better. I don't have much overhead to spare for the file system operations, though. Luckily, ChaN gifted the world <a href="http://elm-chan.org/fsw/ff/00index_e.html">FatFs</a>, an ultralight FAT file system module written in C. It's both tiny and fast, since it's designed to run on small microcontrollers. An ARM Cortex-A53 running at 1.2GHz is <i>certainly not</i> the target hardware for it. But, I think it's still a good fit for a very fast bare metal application.<br />
<br />
FatFs supports exFAT, but using exFAT still requires a license from Microsoft. I think I can instead operate right on the limits of what FAT32 is capable of:<br />
<ul>
<li>A maximum of 2^32 LBs. For 512B LBs, this supports up to a 2TiB drive. This is fine for now.</li>
<li>A maximum cluster size (unit of file memory allocation and read/write operations) of 128 LBs. For 512B LBs, this means 64KiB clusters. This is right at the point where I hit maximum (drive-limited) write speeds, so that's a good value to use.</li>
<li>A maximum file size of 4GiB. This is the limit of my RAM buffer size anyway. I can break up clips into as many files as I want. One file per frame would be convenient, but not efficient.</li>
</ul>
Linking FatFs to NVMe couldn't really get much simpler: FatFs's diskio.c device interface functions already request reads and writes in units of LBs, a.k.a. sectors. There's also a sync function that matches up nicely to the NVMe flush command. The only potential issue is that FatFs can ask for byte-aligned transfers, whereas NVMe only allows 32-bit alignment. My tentative understanding is that this can only happen via calls to <a href="http://elm-chan.org/fsw/ff/doc/read.html">f_read()</a> or <a href="http://elm-chan.org/fsw/ff/doc/write.html">f_write()</a>, so the application can guard against it.<br />
<br />
For file system operations, FatFs reads and writes single sectors to and from a working buffer in system memory. It assumes that the read or write is complete when the <a href="http://elm-chan.org/fsw/ff/doc/dread.html">disk_read()</a> or <a href="http://elm-chan.org/fsw/ff/doc/dwrite.html">disk_write()</a> function returns, so the diskio.c interface layer has to wait for completion for NVMe commands issued as part of file system operations. To enforce this, but still allow high-speed sequential file writing from a data stream, I check the address of the <a href="http://elm-chan.org/fsw/ff/doc/dwrite.html">disk_write()</a> system memory buffer. If it's in OCM, I wait for completion. If it's in DDR4, I allow slip. For now, I wait for completion on all <a href="http://elm-chan.org/fsw/ff/doc/dread.html">disk_read()</a> calls, although a similar mechanism could work for high-speed stream reading. And of course, <a href="http://elm-chan.org/fsw/ff/doc/dioctl.html">disk_ioctl()</a> calls for CTRL_SYNC issue an NVMe flush command and wait for completion.<br />
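A sketch of that address-based policy in the diskio.c layer. The FatFs types are stubbed here so the fragment stands alone (real code uses ff.h/diskio.h); <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">nvme_write</span> and <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">nvme_poll_until_completed</span> are placeholders for my driver calls, and the DDR4 range is an assumption about where the stream buffer lives:

```c
#include <stdint.h>

// Stand-ins for FatFs diskio types (real code uses ff.h/diskio.h):
typedef int DRESULT;
enum { RES_OK = 0, RES_ERROR = 1 };
typedef uint8_t BYTE;
typedef uint32_t UINT;
typedef uint64_t LBA_t;

// Placeholder driver calls and assumed DDR4 stream-buffer range:
static int  nvme_write(uint64_t addr, uint64_t lba, uint32_t count);
static void nvme_poll_until_completed(void);
#define DDR4_BASE 0x20000000u
#define DDR4_SIZE 0x60000000u

static int in_ddr4(uintptr_t a) {
    return a >= DDR4_BASE && a - DDR4_BASE < DDR4_SIZE;
}

// FatFs sectors map 1:1 to NVMe LBs. Writes from the DDR4 stream
// buffer are allowed to slip; anything else (file system structures
// in OCM) waits for NVMe completion before returning.
DRESULT disk_write(BYTE pdrv, const BYTE *buff, LBA_t sector, UINT count) {
    (void)pdrv;
    if (nvme_write((uint64_t)(uintptr_t)buff, sector, count) != 0)
        return RES_ERROR;
    if (!in_ddr4((uintptr_t)buff))
        nvme_poll_until_completed();
    return RES_OK;
}

// Do-nothing stubs so this sketch compiles standalone:
static int  nvme_write(uint64_t a, uint64_t l, uint32_t c) {
    (void)a; (void)l; (void)c; return 0;
}
static void nvme_poll_until_completed(void) {}
```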
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimnXOwQDVmRL9FaM3X8rcyliJqcN9GleRACl91fhNLZujrMpzvXtcTI9IBlfu95ESbZvqShU71EJNUFgkguFyyFWJZrjcilnifBBvYzWb43uD05QELAdgppnRvgN-f-zgvqDwkeLYZDN0/s1600/c97.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1155" data-original-width="1600" height="287" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimnXOwQDVmRL9FaM3X8rcyliJqcN9GleRACl91fhNLZujrMpzvXtcTI9IBlfu95ESbZvqShU71EJNUFgkguFyyFWJZrjcilnifBBvYzWb43uD05QELAdgppnRvgN-f-zgvqDwkeLYZDN0/s400/c97.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Interface between FatFs and NVMe through diskio.c, allowing stream writes from DDR4.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: left;">
I also clear the queue prior to a read to avoid unnecessary read/write turnarounds in the middle of a streaming write. This logic obviously favors writes over reads. Eventually, I'd like to make a more symmetric and configurable diskio.c layer that allows fast stream reading and writing. It would be nice if the application could dynamically flag specific memory ranges as streamable for reads or writes. But for now this is good enough for some write speed testing:</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrSEBVOk7RBvydhIJ_5ueAQmLEk2Hasnzlav8dUjd7KusLg_FhJuiEh7Y4JxqCDRE0rl8ozCXoruFgpRJ3_RLL8o2FoX6cvWQ0UYZ-UUcFrG1Phn75BaZgjhAU-1p204CjDUH7d8EnppM/s1600/c98.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="668" data-original-width="1600" height="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrSEBVOk7RBvydhIJ_5ueAQmLEk2Hasnzlav8dUjd7KusLg_FhJuiEh7Y4JxqCDRE0rl8ozCXoruFgpRJ3_RLL8o2FoX6cvWQ0UYZ-UUcFrG1Phn75BaZgjhAU-1p204CjDUH7d8EnppM/s640/c98.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Speed test for FatFs NVMe write on a 1TB Samsung 970 Evo Plus.</td></tr>
</tbody></table>
<div style="text-align: left;">
There's a very clear penalty for creating and closing files, since the process involves file system operations, including reads and flushes, that will have to wait for NVMe completions. But for writing sequentially to large (1GiB) files, it's still exceeding my 1GB/s requirement, even for total transfer sizes beyond the TurboWrite limit. So I think I'll give it a try, with the knowledge that I can fall back to raw writing if I really need to.</div>
</div>
<h4>
Utilization Summary</h4>
<div>
The good news is that the NVMe driver (not including the Queues, PRP Heap, and IDs) and FatFs together take up only about 27KB of system memory, so they should easily run in OCM with the rest of the application. At some point, I'll need to move the .text section to flash, but for now I can even fit that in OCM. The <a href="https://github.com/coltonshane/SSD_Test">source</a> is here, but be aware that it's entirely a test implementation, not at all intended to be a drop-in driver for any other application.<br />
<br />
The bad news is that the XCZU4 is now pretty much completely full...<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikdQBoG5MfI1DHaQvQCQUamIizk6ojBcYeV_HQg25UOMJVD8jKQ1SHJyQLvcTmDhXsKYFGcKS-4ceZXHx6PKdp5HjMsP0KFZ5hA8meoNk3PZtrpX0nSE_QMIjEtsnmkj2JLmQnDQ7zuyk/s1600/c90.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1107" data-original-width="1361" height="520" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikdQBoG5MfI1DHaQvQCQUamIizk6ojBcYeV_HQg25UOMJVD8jKQ1SHJyQLvcTmDhXsKYFGcKS-4ceZXHx6PKdp5HjMsP0KFZ5hA8meoNk3PZtrpX0nSE_QMIjEtsnmkj2JLmQnDQ7zuyk/s640/c90.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">We're gonna need a bigger chip.</td></tr>
</tbody></table>
<div style="text-align: left;">
The AXI-PCIe bridge takes up 12960 LUTs, 17871 FFs, and 34 BRAMs. That's not even including the additional AXI Interconnects. The only real hope I can see for shrinking it would be to cut the <span style="color: #ffe599;">AXI Slave Interface</span> down to 32-bit, since it's only needed to access Controller Registers at BAR0. But I don't really want to go digging around in the bridge HDL if I can avoid it. I'd rather spend time optimizing my own cores, but I think no matter what I'll need more room for additional features, like decoding/HDMI preview and the subsampled 2048x1536 mode that might need double the number of Stage 1 Wavelet horizontal cores. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
So, I think now is the right time to switch over to the <a href="https://shop.trenz-electronic.de/en/TE0808-04-6BE21-A-UltraSOM-MPSoC-Module-with-Zynq-UltraScale-XCZU6EG-1FFVC900E-4-GB-DDR4">XCZU6</a>, along with the <a href="http://scolton.blogspot.com/2019/11/zynq-ultrascale-superspeed-ram-dumping.html">v0.2 Carrier</a>. It's pricey, but it's a big step up in capability, with twice the DDR4 and more than double the logic. And it's closer to impedance-matched to the cost of the sensor...if that's a thing. With the XCZU6, I think I'll have plenty of room to grow the design. It's also just generally easier to meet timing constraints with more room to move logic around, so the compile times will hopefully be lower.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Hopefully the next update will be with the whole 3.8Gpx/s continuous image capture pipeline working together for the first time!</div>
</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com68tag:blogger.com,1999:blog-8200098102909041178.post-89769829238129408352019-11-17T16:08:00.001-05:002019-11-25T12:16:32.573-05:00Zynq Ultrascale+ SuperSpeed RAM Dumping + v0.2 CarrierI've gotten a lot of mileage out of my v0.1 (very first version) camera PCB. Partly that's because there's not much to it; it's mostly just power supplies, connectors, and differential pairs. But I'm still surprised I haven't broken it yet, and it's only had some minor design issues. I also made a front enclosure for it with an E-mount flange stolen from a macro extension tube (<a href="https://www.amazon.com/gp/product/B01N7UB0KU/ref=ppx_od_dt_b_asin_title_s00?ie=UTF8&psc=1">Amazon's cheapest</a>, of course) and slots for some 1/4-20 T-nuts for tripod mounting.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLjgvsb6E-7Udc4-xDdPIKwXFy1MJl0NrkCrzUM2ix-aztNmq6QEuUPDC5nOhBVaUnUdILHQVCsUGYXRDJvhwpyqHy_HaI_eNUaCMskv-vZrT_scpNfDgISrqJoMPXgin-31RRPjwqW9E/s1600/c51.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLjgvsb6E-7Udc4-xDdPIKwXFy1MJl0NrkCrzUM2ix-aztNmq6QEuUPDC5nOhBVaUnUdILHQVCsUGYXRDJvhwpyqHy_HaI_eNUaCMskv-vZrT_scpNfDgISrqJoMPXgin-31RRPjwqW9E/s640/c51.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Stealing an E-mount flange from a macro extension tube is maybe my favorite "Amazon's cheapest" hack so far. I'm not even sure how else to do it. Getting a custom CNC flange machined wouldn't be too bad, but what about the leaf springs?</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1SC4AKV3QSahAmJKE5ouPSH_V09YZxbkgT8rAWFAgnOl2mE8UoftJoMQUMwTH_nOVeyvNt_2dq8plbJV5Ld3V5nGQ1V5EwbkdwdflgxLbE7A2EsR12LzxILNm-fCWIdo4ZKV151wfyE0/s1600/c89.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1SC4AKV3QSahAmJKE5ouPSH_V09YZxbkgT8rAWFAgnOl2mE8UoftJoMQUMwTH_nOVeyvNt_2dq8plbJV5Ld3V5nGQ1V5EwbkdwdflgxLbE7A2EsR12LzxILNm-fCWIdo4ZKV151wfyE0/s640/c89.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">There are some sensor alignment features, but mostly the board just bolts to the back of the front enclosure. The 1/4-20 T-nuts allow for quick and dirty tripod mounting without having to worry about aluminum threads or Heli-Coils.</td></tr>
</tbody></table>
<div style="text-align: left;">
No real thought was given to connector placement, user interface, battery wiring/charging, cooling, or anything else other than having something to constrain the sensor and lens the right distance from each other and deal with the <a href="http://scolton.blogspot.com/2019/06/freight-train-of-pixels.html">massive pixel throughput</a>. Still, it's been useful and reliable. At this point, I've tested most of the important hardware and am just about ready to make some functional improvements for v0.2.</div>
<br />
One important subsystem I hadn't tested yet, though, is the USB interface. It's not part of the capture pipeline, but it's important that it operate at USB 3.x speeds for reading image data off the SSD later. The Zynq Ultrascale+ has a built-in USB 3.0 PHY using PS-GTR transceivers at 5Gb/s. This isn't quite fast enough for 5:1 compressed image data at full frame rate, but it's more than fast enough for 30fps playback, or direct access for conversion and editing.<br />
<br />
At the moment, though, I'm mainly interested in USB 3.0 for reducing the amount of time it takes to get test image sequences out of the PS-side DDR4 RAM. I've so far been using XSCT commands to read blocks of RAM into a file (<span style="font-family: "courier new" , "courier" , monospace;">mrd -bin -file</span>) over JTAG, but this is limited by the 30MHz JTAG interface. That's a theoretical maximum, too. In practice, it takes several minutes to read out even a short image sequence, and up to <i>an hour</i> to dump the entire contents of the RAM (2GiB). This is all for mere <a href="https://www.youtube.com/watch?v=ep0a7_K0EIs">seconds of video</a>...<br />
<h4>
SuperSpeed RAM Dumping</h4>
To remedy this, I repurpose the <a href="https://github.com/Xilinx/embeddedsw/blob/master/XilinxProcessorIPLib/drivers/usbpsu/examples/xusb_poll_example.c">standalone ZU+ USB mass storage class example</a> to map most of the RAM as a virtual disk, then use a raw disk image reader (<a href="https://sourceforge.net/projects/win32diskimager/">Win32 Disk Imager</a>) to read it. This is pretty much what the example does anyway, so my modifications were very minor. So far, I've been able to run my application in On-Chip Memory (OCM), leaving the external DDR4 free for image capture. So, I have to explicitly place the virtual disk in DDR4 in the linker script:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeLpvWOc8A5YXCBiSHY4AUfKCC4XE7-mvi_I_bQgmHKjJSmQB9tq1T5evhFkeJeap9Z5G2r94FzLnPdPDH_WJNowgPEhjjQw5NjFulXwIMvDxPHRXmZ9YI5uoNdxlq4OR_TNPyTtxqkNk/s1600/c82.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1248" data-original-width="1062" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeLpvWOc8A5YXCBiSHY4AUfKCC4XE7-mvi_I_bQgmHKjJSmQB9tq1T5evhFkeJeap9Z5G2r94FzLnPdPDH_WJNowgPEhjjQw5NjFulXwIMvDxPHRXmZ9YI5uoNdxlq4OR_TNPyTtxqkNk/s640/c82.png" width="544" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: left;">
In the application, the virtual disk array also needs to be correctly sized and assigned to the dedicated memory section using an __attribute__:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihF80LW-aUuVGjBekF-WuKlc2g1tTShwrxV9jmhq_uPV4BObG2Cg36qm7ZDSGqpkJpttxb9O1PZw8Qi9ITazyyBiH98nDu-D8SvYgAdpY-QQnCTk95veOEXas868HsTjJIfjwsMlQA2PI/s1600/c83.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="403" data-original-width="1303" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihF80LW-aUuVGjBekF-WuKlc2g1tTShwrxV9jmhq_uPV4BObG2Cg36qm7ZDSGqpkJpttxb9O1PZw8Qi9ITazyyBiH98nDu-D8SvYgAdpY-QQnCTk95veOEXas868HsTjJIfjwsMlQA2PI/s640/c83.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: left;">
With that small modification, the application (including the mass storage device driver) runs in OCM RAM, but references a virtual disk array based in external DDR4 at 0x20000000, which is where the image capture data starts. As with the original example, when plugged in to a host, the device shows up as a blank drive of the defined size. Windows asks to format it, but for now I just click Cancel and use Win32 Disk Imager to read the entire 1.25GiB. This copies the raw contents of the "disk" into a binary file, a process I'm all too familiar with from having to recover files from SD cards with corrupted file systems.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
But at first I wasn't getting a SuperSpeed (5Gb/s) connection; it was falling back to High-Speed (480Mb/s) through the external <a href="https://www.microchip.com/wwwproducts/en/USB3320">USB3320 PHY</a>. (An external USB 2.0 PHY is required on the ZU+, even for SuperSpeed operation.) To further troubleshoot, I took a look at the <a href="https://www.xilinx.com/html_docs/registers/ug1087/ug1087-zynq-ultrascale-registers.html#usb3_xhci___dcfg.html">DCFG</a> and <a href="https://www.xilinx.com/html_docs/registers/ug1087/ug1087-zynq-ultrascale-registers.html#usb3_xhci___dsts.html">DSTS</a> registers in the USB module. DCFG indicated a Device Speed of 3'b100 (SuperSpeed), but DSTS indicated a Connection Speed of 3'b000 (High-Speed). I figured this meant the PS-GTR link to the host was failing, and after some more poking around I found that its reference clock source was configured with the wrong input pins <i>and</i> the wrong frequency. In my case, I'm feeding it a 100MHz reference clock on input 2, so I changed the configuration accordingly:</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrBaaiJnhvflDN5NgvLu9IiGSA5k9l3pYYVyyRIBVMiwiKSrMS9q0xLqHN6k2re5rghw4LlABnCzut59OKt3XzSOO8ANRQHG2o3j85iEJ0ezWvnz9-4NTgD_AKO3wEWa0T691dPKCPRe0/s1600/c84.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="955" data-original-width="1330" height="457" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrBaaiJnhvflDN5NgvLu9IiGSA5k9l3pYYVyyRIBVMiwiKSrMS9q0xLqHN6k2re5rghw4LlABnCzut59OKt3XzSOO8ANRQHG2o3j85iEJ0ezWvnz9-4NTgD_AKO3wEWa0T691dPKCPRe0/s640/c84.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
After that, I was able to get a SuperSpeed connection. As a formatted disk drive, I get sequential read speeds of around <span style="color: yellow;">300MB/s</span>. Through Win32 Disk Imager, I can read the entire 1.25GiB virtual disk in about seven seconds. So much better! To celebrate, I set off some steel wool fireworks with Bill Kerman. (Steel wool, especially the ultrafine variety, <a href="https://www.youtube.com/watch?v=oiWZYdr9Zvo">burns quite spectacularly</a>.)</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/TEHlCqnowWk" width="640"></iframe></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Since I've been putting off the task of NVMe writing, these are still just image sequences that can fit in the RAM buffer. In this case they're actually compressed about 11:1, well beyond my SSD writing requirement, mostly due to the relatively dark and low-contrast scene. The same <a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#quantizer">quantizer</a> settings in a brighter scene with more detail would yield a lower compression ratio. I did separate out the quantizer values for each subband, so I can experiment more with the quality/data rate trade-off.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The most noticeable defects aren't from wavelet compression, they're just the regular sensor defects. There's definitely some "black sun" artifact in the brightest sparks. There's also a rolling row offset that makes the dark background appear to flicker. I did switch to a different power supply for this test, which could be contributing more electrical noise. In any case, I definitely need to implement row noise correction. The combination of all-intraframe compression and a global shutter does make it pretty good for observing the sometimes crazy behavior of individual sparks, though:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhD_y_EexyI2XnEFhFTpEF9zC_om9OQiNGuSjq22VruoLEpzEDbxaunMACw7our0HPmQZ16j-euzKNmYcwhMIaMvWIvPlGUt9pfRfMXOQ70gq9z1_Hb_p_fJRwOW6B7tAbtBQ4DttlTSQ0/s1600/c87.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="720" data-original-width="1280" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhD_y_EexyI2XnEFhFTpEF9zC_om9OQiNGuSjq22VruoLEpzEDbxaunMACw7our0HPmQZ16j-euzKNmYcwhMIaMvWIvPlGUt9pfRfMXOQ70gq9z1_Hb_p_fJRwOW6B7tAbtBQ4DttlTSQ0/s640/c87.gif" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">This one was gently falling and then just decided to explode into a dozen pieces, shooting off at 20-30mph.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHqagvAxJU7Ez8zAnnlKYium8ksT1NSK9gczNgyATynMKtojRmGUUhdzP6aAYLCbxpZu-xPO2UdcLFY_SrXTwzeTwnnlmMRmSFQMV_md8-afslc8f3SDZbRky8ZhswfJplEi5okorX9qo/s1600/c88.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="720" data-original-width="1280" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHqagvAxJU7Ez8zAnnlKYium8ksT1NSK9gczNgyATynMKtojRmGUUhdzP6aAYLCbxpZu-xPO2UdcLFY_SrXTwzeTwnnlmMRmSFQMV_md8-afslc8f3SDZbRky8ZhswfJplEi5okorX9qo/s640/c88.gif" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">My favorite, though, is this spark that gets flung off like a pair of binary stars. After a while, they decide to part ways and one goes flying up behind Bill's helmet. The comet-like tails are a motion artifact of the multi-slope exposure mode.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
Another thing I learned from this is that I probably need an <a href="https://en.wikipedia.org/wiki/Infrared_cut-off_filter">IR-cut filter</a>. I neglected to record some normal footage of the steel wool burning, but it's nowhere near as bright as it looks here. Much of that is just how human visual perception works. I tried to mitigate it somewhat by using the CMV12000's multi-slope exposure mode to rein in the highlights. But I think there's also some near-infrared adding to the brightness here. I'll have to get an external IR-cut filter to test this theory.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Although the image sequence transfer is 100x faster now, it still takes time to adjust settings and trigger the capture over JTAG. I would very much like to do everything over USB in the near future, at least until I have some form of UI on the camera itself. But I also don't really want to write a custom driver. I might try to abuse the mass storage device driver, since it's already working, by adding in some custom SCSI codes for control. This is also the device class I intend to use eventually as the host interface to the SSD, so I should get to know it well.</div>
<h4 style="clear: both; text-align: left;">
v0.2 Carrier</h4>
<div class="separator" style="clear: both; text-align: left;">
Controlling the camera over USB is not the most user-friendly way of doing things, as I know from wrangling drivers and APIs for <a href="http://scolton.blogspot.com/p/video.html#gs3">previous camera projects</a>. I could <i>maybe</i> see an exception where a Pixel 2 (modern Pixels don't have USB 3.0 anymore, because smartphone progress makes no fucking sense) hosts the camera, presenting a nice preview image and dedicated touch interface. But that's a large chunk of Android development that I don't want or know how to do.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Instead, I think it makes sense to stick to something extremely simple for now: an HDMI output and some buttons. I would love to have a touchscreen LCD, but they're huge time, money, power, and reliability sinks. They're also never bright enough, or, if they are, they kill the power and thermal budget. Better to just move the problem off-board, where it can be solved more flexibly depending on the scenario. At least that's what I'll tell myself.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
It seems like there are two main ways to do HDMI out from a Zynq SoC. The more modern Zynq Ultrascale+ development boards, like the <a href="https://www.xilinx.com/products/boards-and-kits/zcu106.html">ZCU106</a>, use a PL-side GTH transceiver to directly drive a TMDS retimer. This supports HDMI 2.0 (4K), but would rule out the cheaper <a href="https://shop.trenz-electronic.de/en/TE0803-02-04CG-1EA-MPSoC-Module-with-Xilinx-Zynq-UltraScale-ZU4CG-1E-2-GByte-DDR4-5.2-x-7.6-cm">TE0803 XCZU4</a> board, since its four PL-side transceivers are already in use for the SSD. The second method uses a dedicated HDMI transmitter like the <a href="https://scolton.blogspot.com/2019/07/benchmarking-nvme-through-zynq.html">Analog Devices ADV7513</a> as an external PHY, which interfaces to the Zynq over a wide logic-level pixel bus. Even though it only goes up to 1080p, this sounds more like what I want for now. I just need a reasonable preview image.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_zNjBlWocMWRy99HitBnHIK7VyMAQmPwcJnCGP7HNt4b3_45ikZDkKXYYiPKccf4t1MUGfDWQCetxn_o_hwmg8rzzkMLbVDsGUtXzUxEt7jlV5KyZSFLIINeZjuR3WT6MZGT-ESXTMAc/s1600/c85.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1004" data-original-width="1600" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_zNjBlWocMWRy99HitBnHIK7VyMAQmPwcJnCGP7HNt4b3_45ikZDkKXYYiPKccf4t1MUGfDWQCetxn_o_hwmg8rzzkMLbVDsGUtXzUxEt7jlV5KyZSFLIINeZjuR3WT6MZGT-ESXTMAc/s640/c85.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">HDMI output subsystem based on the ADV7513.</td></tr>
</tbody></table>
I had left a bunch of unused pins in the top right corner expecting to need a wide logic-level pixel bus, either for an LCD or an HDMI transmitter. The tricky part was finding room for the connector and IC. I decided to ditch the microSD card holder, which had a bad footprint anyway, to make the space. Without growing the board, I can fit a full-size (Type A) HDMI connector on the top side and the ADV7513 plus supporting components on the bottom. The TMDS lines do have to change layers once, but they're short and length-matched so I think they'll be okay.<br />
<div>
<br /></div>
<div>
At the same time, I also rerouted a good portion of the right edge of the board. The port I've been using for UART terminal access is gone, replaced by a more general-purpose optically-isolated I/O connector. This can still be used for terminal access, or as a trigger/sync signal. I also added a barrel jack connector for power/charge input. Finally, a 0.1" header off the back of the board has the battery power input and some unprotected I/O for two buttons, a rotary encoder, and a red "recording" LED on a separate board. This UI board would be mounted to the top face, right-hand side, where such things would typically be on a camera.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjK_nyiva4J_ue4vRwUbO3A9Fo25_xjYJPTaeyPDN5yxt0DJaCBbpmk-unGJXt8w-Sxpacsjiwr2nf2dPzGf4-y6pKJ-ldvmyPtYDcZ3V5dFSdNi-SxlIkCJi8Gpt_5IBpvlCOFhAp6s9Y/s1600/c86.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1331" data-original-width="1600" height="532" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjK_nyiva4J_ue4vRwUbO3A9Fo25_xjYJPTaeyPDN5yxt0DJaCBbpmk-unGJXt8w-Sxpacsjiwr2nf2dPzGf4-y6pKJ-ldvmyPtYDcZ3V5dFSdNi-SxlIkCJi8Gpt_5IBpvlCOFhAp6s9Y/s640/c86.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">New right-edge connector layout and top face UI board header.</td></tr>
</tbody></table>
<div style="text-align: left;">
I consider this to be the bare minimum design for standalone functionality. It will need a simple menu and status overlay on the HDMI output. I'm also skipping any BMS or charge circuitry for now, so the battery must be self-contained (like this <a href="https://www.newegg.com/p/14R-008K-00136?item=9SIA2EY0ZU2243&source=region&nm_mc=knc-googlemkp-pc&cm_mmc=knc-googlemkp-pc-_-pla-tenergy-_-tools+-+batteries-_-9SIA2EY0ZU2243&gclid=CjwKCAiA_MPuBRB5EiwAHTTvMXIP10Q7hGOHyn7coa3zSFbn8DyuB8-qLV5kVlj3wlB9K4bK8iJy8xoCw2YQAvD_BwE&gclsrc=aw.ds">3-cell pack</a>) and charged by a CC/CV adapter. It's well within the power range of USB-C charging, so that could be an option in the future, but I don't think it's important enough for this revision.<br />
<br />
One of the reasons I don't mind doing more small iterations rather than trying to cram features into one big revision is that I have been able to get these boards relatively fast and cheap from <a href="https://jlcpcb.com/">JLCPCB</a>. Originally, I chose their service because they were the first and only place I found with a standard <a href="https://jlcpcb.com/client/index.html#/impedance">impedance-controlled stack-up</a>, complete with an online <a href="https://jlcpcb.com/client/index.html#/impedanceCalculation">calculator</a>. But it's also just the most economical way to get a six-layer impedance-controlled board in under two weeks. Each one is around $30. Even including all the power supplies and interfaces, the board is really a minor cost compared to the sensor and SoM it carries.<br />
<br />
Other than that, only one minor fix was needed, regarding the SSD's PCIe reference clock. I had mistakenly assumed this could be generated or forwarded by the ZU+ out of one of its GT clock pairs. But this doesn't seem to be standard practice. Instead, the external clock generator distributes matching reference clocks to both the ZU+ GT clock input and the SSD. I hacked this onto v0.1 with some twisted-pair blue wire surgery, but it was easy to reroute for v0.2. Aside from this, I didn't touch any of the differential pairs, or really any other part of the board. Well, I did add one more small component...but that'll be for much later.<br />
<br />
These boards should arrive in time for a Thanksgiving weekend soldering session. I plan to build up two this time: one monochrome and, if all goes well, finally, one <b><span style="color: red;">c</span><span style="color: lime;">o</span><span style="color: #0b5394;">l</span><span style="color: lime;">o</span><span style="color: red;">r</span></b>. Before then, I'd like to have at least some plan for the NVMe write...</div>
<h4 style="clear: both; text-align: left;">
TinyCross: 4WD and Servoless RC Mode</h4>
<div style="text-align: left;">
<i>Shane Colton, 2019-11-11</i></div>
I finished building up the second <a href="http://scolton.blogspot.com/2018/09/tinycross-electron-control-unit.html">dual motor drive</a> for TinyCross, which means that the electronics and wiring have finally caught up to the mechanical build and both are 100% complete!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOx5cWvu-JENhLKtK94kXJV5Mb_LE90NNd5ItoWsaRSM8vC0VdiiUaIxPJ4skINc0Tak4zlhUTu7ZkJtA-1enRjIjX3cF_neOwCoCgzAtvQMQSM-G1JsYkL50L4ml-PZKp0DeLKfdtddw/s1600/tc76.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOx5cWvu-JENhLKtK94kXJV5Mb_LE90NNd5ItoWsaRSM8vC0VdiiUaIxPJ4skINc0Tak4zlhUTu7ZkJtA-1enRjIjX3cF_neOwCoCgzAtvQMQSM-G1JsYkL50L4ml-PZKp0DeLKfdtddw/s640/tc76.jpg" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
That's not to say that the project is 100% complete; there's still some testing to be done to bring it all the way up to full power, as well as some weight reduction and weatherproofing tasks. But there are no more parts sitting in bins waiting for installation. It should be at peak mass, too, which is good because it's 86lb (39kg) <i>without </i>batteries. The original target was 75lb (34kg) without batteries, but I will settle for 80lb (36kg) if I can get there.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The second <a href="https://scolton.blogspot.com/2018/09/tinycross-electron-control-unit.html">TxDrive</a> went together with no issues, and the software is identical to the front wheel drive. I have both set at 80A right now, which gives a total force at the ground of 112lbf (51kgf). That's about the same peak force as the rebuilt <a href="https://scolton.blogspot.com/2013/10/tinykart-black-alien-power.html">"black" version of tinyKart</a>, which was <a href="https://www.youtube.com/watch?v=OuCMtIB5P0E">maybe too much</a> for that frame. But TinyCross is about 20% heavier (with the driver weight included) and 4WD, so it should be able to handle some more. I haven't seen any thermal issues at 4x80A; if anything, the motors run cooler now that all four are sharing the acceleration. Over the next few test drives, I'll work my way up toward the 120A target.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
But before that, there's something I've been wanting to try. I have an abundance of actuators and not that many degrees of freedom. I decided to borrow an idea from <a href="https://scolton.blogspot.com/2016/06/twitch-x-servoless-linkage-drive.html">Twitch X</a> to cash in some of this actuator surplus for one more degree of control, specifically automatic servoless steering. So, a free 1/3-scale RC car mode without adding any parts.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBcY_WJ0Upn1QfSDIJlmtMOZH-wlgnChqeLSuOH_ulqzyDSxjdsdUsRXGCZhygKYlp5W-UYCMtarUqYVJlJaFV6yLXlPJEQZVXd7y5PJmlaLSNTWcoscNhCTw_PDhogyf5G9nXoH_OYOM/s1600/tc77.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBcY_WJ0Upn1QfSDIJlmtMOZH-wlgnChqeLSuOH_ulqzyDSxjdsdUsRXGCZhygKYlp5W-UYCMtarUqYVJlJaFV6yLXlPJEQZVXd7y5PJmlaLSNTWcoscNhCTw_PDhogyf5G9nXoH_OYOM/s640/tc77.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Well okay, I do have to add the receiver.</td></tr>
</tbody></table>
<div style="text-align: left;">
The steering wheel board reads the throttle and steering PWM signals from a normal RC receiver. The throttle PWM gets directly mapped to a torque command for all four motors. The steering PWM sets a target angle for the steering wheel. The measured angle comes from an IMU, the secret part on the steering wheel board. (Yes, there are all sorts of issues with that...I honestly just don't want to run any more wires.) The angle error drives a feedback controller that outputs differential torque commands to the front motors. Not much to it, really.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgk4gYRTy0w8qeI1wbRTv2GmJQ7hIC39pheEX5cekVYXflrXIJVeyiNaTjXFz7LGP5Ixyf8GiEcl0S-mwBmZZ74Fd0HK4R-TXAxvoHBm8iXlv6WNAcOIulHqJ-35zyc67InEow-u-M3QxY/s1600/tc79.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1390" data-original-width="1203" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgk4gYRTy0w8qeI1wbRTv2GmJQ7hIC39pheEX5cekVYXflrXIJVeyiNaTjXFz7LGP5Ixyf8GiEcl0S-mwBmZZ74Fd0HK4R-TXAxvoHBm8iXlv6WNAcOIulHqJ-35zyc67InEow-u-M3QxY/s400/tc79.png" width="345" /></a></div>
I've also seen <i>so many</i> runaway robots and go-karts in my life that I consider it a must to have working failsafes for both radio loss of signal and receiver (PWM) disconnect. It's extra work but trust me, it's worth it! Anyway, time for a test drive:</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/LCgn0YJ2SII" width="640"></iframe></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I wasn't sure how tightly I could tune the steering control loop, since there's a long chain of mechanical mush between the torque output at the motors and the sensor input at the steering wheel. But it works just fine. After a minute I forgot it wasn't really an RC car and tried some curb jumping. Just like Twitch X, the wheels do need traction to be able to control the steering angle. But then again, that is a necessary condition for steering anyway.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I don't actually think there's much point in a go-kart-sized RC car. But it's a short jump from that to an autonomous platform. It might also be useful to adjust the "feel" of the steering during normal driving. Mostly, I just like to abide by the Twitch X philosophy of using your existing actuators to do as much as possible.</div>
<div style="text-align: left;">
<br /></div>
<h4 style="clear: both; text-align: left;">
Real-Time Wavelet Compression for High Speed Video</h4>
<div style="text-align: left;">
<i>Shane Colton, 2019-10-27</i></div>
<div class="separator" style="clear: both; text-align: left;">
The next stop on the <a href="http://scolton.blogspot.com/2019/06/freight-train-of-pixels.html">Freight Train of Pixels</a> is the wavelet compression engine. Previously, I built up the <a href="http://scolton.blogspot.com/2019/09/cmv12000-full-speed-384gbs-read-in-on.html">CMV12000 input module</a>, which turned out to be easier than I thought. The output of that module is a set of 64 10-bit pixels and one 10-bit control signal that update on a 60MHz pixel clock (px_clk). This is too much data to write directly to an NVMe SSD, so I want to compress it by about 5:1 in real-time on the <a href="https://shop.trenz-electronic.de/en/TE0803-02-04CG-1EA-MPSoC-Module-with-Xilinx-Zynq-UltraScale-ZU4CG-1E-2-GByte-DDR4-5.2-x-7.6-cm?c=452">XCZU4</a> Zynq Ultrascale+ SoC. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Wavelet compression seems like the right tool for the job, prioritizing speed and quality over compression ratio. It needs to run on the SoC's programmable logic (PL), where there's enough parallel computation and memory bandwidth for the task. This post will build up the theory and implementation of this wavelet compression engine, starting from the basics of the discrete wavelet transform and ending with encoded data streams being written to RAM. It's a bit of a long one, so I broke it into several sections:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
[<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#dwt">The Discrete Wavelet Transform</a>]</div>
<div class="separator" style="clear: both; text-align: left;">
[<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#lifting">The Lifting Scheme</a>]</div>
<div class="separator" style="clear: both; text-align: left;">
[<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#h26">Horizontal 2/6 DWT Core</a>]</div>
<div class="separator" style="clear: both; text-align: left;">
[<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#v26">Vertical 2/6 DWT Core</a>]</div>
<div class="separator" style="clear: both; text-align: left;">
[<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#extensions">Input Extensions?</a>]</div>
<div class="separator" style="clear: both; text-align: left;">
[<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#quantizer">Quantizer</a>]</div>
<div class="separator" style="clear: both; text-align: left;">
[<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#encoder">Variable-Length Encoder</a>]</div>
<div class="separator" style="clear: both; text-align: left;">
[<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#wrap">Wrapping Up</a>] - First test video here.</div>
<div class="separator" style="clear: both; text-align: left;">
[<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html#info">More Information</a>] - Source here, plus some references.</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h4 style="clear: both; text-align: left;">
<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html" name="dwt"></a>The Discrete Wavelet Transform</h4>
<div class="separator" style="clear: both; text-align: left;">
Suppose we want to store the row vector [150, 150, 150, 150, 150, 150, 150, 150]. All the values are less than 256, so eight bytes works. If this vector is truly from a random stream of bytes, that might be the best we can do. But real-world signals of interest are not random and a pattern of similar numbers reflects something of physical significance, like a stripe in a bar code. This fact can be leveraged to represent a structured signal more compactly. There are many ways to do this, but let's look specifically at the discrete wavelet transform (DWT).</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
A simple DWT could operate on adjacent pairs of data points, taking their sum (or average) and difference. This two-input/two-output operation would scan, without overlap, through the row vector to produce four averages and four differences, as shown below. Assuming no rounding has taken place, the resulting eight values <i>fully represent</i> the original data, since the process could be reversed. Furthermore, the process can be repeated in a binary fashion on just the averages:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwZpgLHcY7cj4HCx3hGCCTVf4eyPiMwhj-cCVDFYDz2FiIrxYzASqKbh7mluMQvawvkvHKUb680b5AobL_MrKerKz8PXwfKdP_3neuPwuXywuTEgfmi57uHpp2c4Vj5kmV24imZ8fnuGI/s1600/wv04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1498" data-original-width="860" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwZpgLHcY7cj4HCx3hGCCTVf4eyPiMwhj-cCVDFYDz2FiIrxYzASqKbh7mluMQvawvkvHKUb680b5AobL_MrKerKz8PXwfKdP_3neuPwuXywuTEgfmi57uHpp2c4Vj5kmV24imZ8fnuGI/s640/wv04.png" width="364" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
After three levels, all that remains is a single average and seven zeros, representing the lack of difference between adjacent data points at each stage above. This is an extreme example, but in general the DWT will concentrate the information content of a structured signal into fewer elements, paving the way for compression algorithms to follow.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
For image compression, it's possible to perform a 2D DWT by first transforming the data horizontally, then transforming the intermediate result vertically. This results in four outputs representing the average image as well as the high-frequency horizontal, vertical, and diagonal information. The entire process can then be repeated on the new 1/2-scale average image.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgATzF5QiQ3F45ftGOwZwsMa6gBSGla78IKHX-ER9S67YvHkEUve8dxa3twvEFMQqXtSCSn6hKBS48_YeuAYK2doombOu9wgQprp_hGPW3a3xnPgiEWe3NZyJ9Qx-VecRTYY36BdMoqRIE/s1600/wv15.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="733" data-original-width="1297" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgATzF5QiQ3F45ftGOwZwsMa6gBSGla78IKHX-ER9S67YvHkEUve8dxa3twvEFMQqXtSCSn6hKBS48_YeuAYK2doombOu9wgQprp_hGPW3a3xnPgiEWe3NZyJ9Qx-VecRTYY36BdMoqRIE/s640/wv15.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A three-stage 2D Haar DWT output. Green indicates zero difference outputs.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
The DWT discussed above uses the simplest possible wavelet, the <a href="https://en.wikipedia.org/wiki/Haar_wavelet">Haar wavelet</a>, which only looks at two adjacent values for both the sum/average and the difference calculation. While this is extremely fast, it has relatively low curve-fitting ability. Consider what happens if the Haar DWT is performed on a ramp signal:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRPda4t2EDw8OuijyuagDnU9xnGEGIspbPHa8O_Za4kvIPjSsvil2Qx8ktVi0G5JLSth7kFSgUNNO9j1FJopjwMSCS4f-vu_-aO1WfFzybI8N0YwneimQ-Io1wGJv1WvoWyPqxUYHFbMg/s1600/wv05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1022" data-original-width="1219" height="335" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRPda4t2EDw8OuijyuagDnU9xnGEGIspbPHa8O_Za4kvIPjSsvil2Qx8ktVi0G5JLSth7kFSgUNNO9j1FJopjwMSCS4f-vu_-aO1WfFzybI8N0YwneimQ-Io1wGJv1WvoWyPqxUYHFbMg/s400/wv05.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
Instead of zeros, the ramp input produces a constant difference output. It's still smaller than the original signal, but not as good for compression as all zeros. It's possible to use more complex wavelets to capture higher-order signal structure. For example, a more complex wavelet may compute the deviation from the local slope instead of the immediate difference, bringing the output back to zero for a ramp input:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPakDkVpgF3YAzOmPDPgnarm4xPM-QvFunPdDXUGVbR0cEm6qADJ4_LaLlW3nHKUww0q4-AOOVVtnhzskReuPTqL12ioVXO2CyS0fy9ZpbuuwSUcwEtTnY7H1FQEDZJBcWfBGdovVO2NA/s1600/wv06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="996" data-original-width="1216" height="327" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPakDkVpgF3YAzOmPDPgnarm4xPM-QvFunPdDXUGVbR0cEm6qADJ4_LaLlW3nHKUww0q4-AOOVVtnhzskReuPTqL12ioVXO2CyS0fy9ZpbuuwSUcwEtTnY7H1FQEDZJBcWfBGdovVO2NA/s400/wv06.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
To compute the "local slope", some number of data points before and after the pair being processed are needed. The sum/average operation may also be more complex and use more than two data points. One classification system for wavelets is based on how many points they use for the sum and difference operations, their <a href="https://en.wikipedia.org/wiki/Support_(mathematics)">support</a>. The Haar wavelet would be classified as 2/2 (for sum/difference), and is unique in that it doesn't require any data beyond the immediate pair being processed. A wavelet with larger support can usually fit higher-order signals with fewer non-zero coefficients.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The better signal-fitting ability of wavelets with larger support comes at a cost, since each individual operation requires more data and more computation. Now is a good time to introduce the 2/6 wavelet that will be the focus of most of this post. It uses two data points for the sum/average, just like the Haar wavelet, but six for the difference. One way to look at it is as a pair of weighted sums being slid across the data:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFIp4G1M8hAn8HgfzBXQfPt47-gRkupwMS7FO8WPEmapqk5Vu5v_GkacSlNUokARGdYH6kaVqZSqjgXV62Ui4HpfD5_gCxwhrgcKVPYKf4NoM18emrj_Qoqm_29Xp4Pts1Zy3vFa8IewE/s1600/wv08.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="796" data-original-width="917" height="276" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFIp4G1M8hAn8HgfzBXQfPt47-gRkupwMS7FO8WPEmapqk5Vu5v_GkacSlNUokARGdYH6kaVqZSqjgXV62Ui4HpfD5_gCxwhrgcKVPYKf4NoM18emrj_Qoqm_29Xp4Pts1Zy3vFa8IewE/s320/wv08.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
The two-point sum is a straightforward moving average. The six-point difference is a little more involved: the four outer points are used to establish the local slope, which is then subtracted from the immediate difference calculated from the two inner points. This results in zero difference for constant, ramp, and even second-order signals. Although it's more work than the Haar wavelet, the computational requirement is still relatively low thanks to weights that are all powers of two.</div>
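<div class="separator" style="clear: both; text-align: left;">
In direct form, the sliding weighted sums look something like this C sketch (one common sign convention; the helper names are mine, not from the CineForm SDK). The six-point difference really does zero out constant, ramp, and second-order inputs:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>

```c
#include <assert.h>

/* The 2/6 DWT as a pair of weighted sums slid across the data, one
 * input pair at a time. i is a pair index with valid neighbor pairs on
 * both sides. All weights are powers of two. */
static int sum2(const int *x, int i)    /* two-point sum (low-pass) */
{
    return x[2 * i] + x[2 * i + 1];
}

static int diff26(const int *x, int i)  /* six-point difference (high-pass) */
{
    int inner = x[2 * i + 1] - x[2 * i];  /* immediate difference */
    /* the four outer points estimate the local slope, weights +-1/8 */
    int outer = x[2 * i - 2] + x[2 * i - 1] - x[2 * i + 2] - x[2 * i + 3];
    return inner + outer / 8;  /* exact when outer is a multiple of 8 */
}
```
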
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The 2/6 wavelet is used by the <a href="https://gopro.github.io/cineform-sdk/">CineForm</a> codec, which was open-sourced by GoPro in 2017. The <a href="https://gopro.github.io/cineform-sdk/">GitHub page</a> has a great explanation of how wavelet compression works, along with the full SDK and example code. There's also a very easy-to-follow <a href="https://github.com/gopro/cineform-sdk/tree/master/Example/WaveletDemo">stripped-down C example</a> that compresses a single grayscale image. If you want to explore the entire history of CineForm, which overlaps in many ways with the history of wavelet-based video compression in general, it's definitely worth reading through the <a href="https://cineform.blogspot.com/">GoPro/CineForm Insider</a> blog.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Besides the added computation, wavelets other than the Haar also need data from outside the immediate input pair. The 2/6 wavelet, for example, needs two additional data points on each side of the input pair. This creates a problem at the first and last input pair of the vector, where two of the data points needed for the difference calculation aren't available. Typically, the vector is extended by padding it with data that is in some way an extension of the nearby signal. This adds a notion of state to the system that can be its own logical burden.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8OQT89xYrz6PYhw3flin8mVsaYNhdzQ4ImL9aT2U5msXr5F5NclQR90NR_mABNM0kQ0slYBArAOIchcxeQ1pLliRc7m0JLZznKiajaCHMXpPo4wkatQr97heDPu8Rz_NkbYHtcLFy_8U/s1600/wv09.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1063" data-original-width="1600" height="424" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8OQT89xYrz6PYhw3flin8mVsaYNhdzQ4ImL9aT2U5msXr5F5NclQR90NR_mABNM0kQ0slYBArAOIchcxeQ1pLliRc7m0JLZznKiajaCHMXpPo4wkatQr97heDPu8Rz_NkbYHtcLFy_8U/s640/wv09.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Different ways to extend data for the 2/6 DWT operation at the start and end of a vector.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
There are other subtle issues with the DWT in the context of compression. For one, if the data is <i>not</i> structured, the DWT actually increases the storage requirement. Consider all the possible sum and difference outputs for two random 8-bit data points under the Haar DWT. The sum range is [0:510] and the difference range is [-255:255], both seemingly requiring 9-bit results. The signal's entropy hasn't changed; the extra bits are disguising two new redundancies in the data: 1) The sum and difference are either both even or both odd. 2) A larger difference implies a sum closer to 255.</div>
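<div class="separator" style="clear: both; text-align: left;">
Both redundancies are easy to confirm exhaustively over all 65536 input pairs; a quick C check (the function name is mine):</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>

```c
#include <assert.h>
#include <stdlib.h>

/* Exhaustively verify the two redundancies hiding in the 9-bit Haar
 * outputs of two 8-bit inputs: (1) the sum and difference always share
 * parity; (2) a larger |difference| forces the sum toward the middle of
 * its range: |d| + |s - 255| <= 255. */
static int haar_redundancies_hold(void)
{
    for (int a = 0; a <= 255; a++)
        for (int b = 0; b <= 255; b++) {
            int s = a + b, d = a - b;
            if (((s ^ d) & 1) != 0)
                return 0;  /* parity mismatch */
            if (abs(d) + abs(s - 255) > 255)
                return 0;  /* range coupling violated */
        }
    return 1;
}
```
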
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Things get a bit worse still with the 2/6 DWT. If the difference weights are multiplied by 8 to give an integer result, the difference range for six random 8-bit values is [-2550:2550], which would require 13 bits to store. Since there are six inputs per two outputs in the difference operation, it's also harder to see the extra bits as representing some simple redundancies in the transformed data.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg73fM9NaBhFz9mrFwYsS2A6h8m_ew7HTQauELPwU7taID_aMg4gtN_NxtX9qEt460N-XcWAQEyHh6y3Rk7v_zXZZqKRMWC8Z872ioKDx9xodb0N0XH50rrgXg5ciugWtuoHmdKxpVS9k4/s1600/wv13.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1007" data-original-width="1600" height="402" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg73fM9NaBhFz9mrFwYsS2A6h8m_ew7HTQauELPwU7taID_aMg4gtN_NxtX9qEt460N-XcWAQEyHh6y3Rk7v_zXZZqKRMWC8Z872ioKDx9xodb0N0XH50rrgXg5ciugWtuoHmdKxpVS9k4/s640/wv13.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">65536 random vectors fed through the 2/2 and 2/6 DWT give an idea of the output range.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
The differences in a structured signal will still be concentrated heavily in the smaller values, so compression can still be effective. But it would be nice not to have to reserve so much memory for intermediate results, especially in a multi-stage DWT. Rounding or truncating the data after each sum and difference operation seems justified, since a lossy compression algorithm will wind up discarding least significant bits anyway. But it would be nice to keep the DWT itself lossless, deferring lossy compression to a later operation. In fact, there's an efficient algorithm for performing the DWT operations reversibly without introducing as much redundancy in the form of extra result bits.</div>
<h4 style="clear: both; text-align: left;">
<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html" name="lifting"></a>The Lifting Scheme</h4>
<div>
What follows are a couple of simple examples of an amazing piece of math that's relatively modern, with the <a href="https://cm-bell-labs.github.io/who/wim/papers/lift2.pdf">original publication</a> in 1995. There's a lot to the theory, but one core concept is that a DWT can be built from smaller steps, each inherently reversible. For example, the Haar (2/2) DWT could be built from two steps as follows:</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: center;">
<div style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjV2KWX4eJThUPIQQdLqCKWALhqMkV5kTv5P_iZyzvcYc1HoXot0zJthwDfVK2u_JjhVyYMB4kus3aoukMTN6oJH9rXTT5yHh65hMkLDvgJKm1JfpC2vEP4d5dEejh1ocpjkuw0LKqM4H0/s1600/wv10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1218" data-original-width="1549" height="313" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjV2KWX4eJThUPIQQdLqCKWALhqMkV5kTv5P_iZyzvcYc1HoXot0zJthwDfVK2u_JjhVyYMB4kus3aoukMTN6oJH9rXTT5yHh65hMkLDvgJKm1JfpC2vEP4d5dEejh1ocpjkuw0LKqM4H0/s400/wv10.png" width="400" /></a></div>
Included in the forward <span style="color: #ffe599;">Step 2</span> is a truncation of the difference by right shift. This is equivalent to dividing by two and rounding down, and turns the sum output into an integer average, which only requires as many bits as one input. But, remarkably, no information is really discarded. The C code makes the inherent reversibility pretty clear. Even if you truncated more bits of the difference, it would be reversible. In fact, you could skip the sum entirely and just forward x0 to s.</div>
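<div style="text-align: left;">
In case the figure is hard to read, the forward and inverse steps boil down to something like this (my transcription of the idea, not the figure's code verbatim; it assumes arithmetic right shift of negative values, as on typical targets):</div>
<div style="text-align: left;">
<br /></div>

```c
#include <assert.h>

/* Forward 2/2 (Haar) lifting: Step 1 takes the difference, Step 2 adds
 * half of it to x0, which turns s into floor((x0 + x1) / 2). Assumes
 * arithmetic right shift for negative values. */
static void haar_fwd(int x0, int x1, int *s, int *d)
{
    *d = x1 - x0;         /* Step 1 */
    *s = x0 + (*d >> 1);  /* Step 2 */
}

/* Inverse: run the same steps backwards. Nothing was lost, even though
 * s needs no more bits than one input. */
static void haar_inv(int s, int d, int *x0, int *x1)
{
    *x0 = s - (d >> 1);
    *x1 = *x0 + d;
}
```
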
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The mechanism of the lifting scheme has effectively eliminated one of the two redundancies that was adding bits to the results of the 2/2 DWT. Specifically, the fact that the sum and difference were either both even or both odd has been traded in for a one-bit reduction in the range of the sum output. Likewise, it can help reduce the range of the difference output of the 2/6 DWT without sacrificing reversibility:</div>
</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYA32Z2NmC27Exp8Js2fqwLBDSx8EJixPOx0xXX4Rh2fjd3dEj9K0fWRhFZ4apPQwH5thJ92E_xImwM7pH9fSsLlV4PzI7HRl7SDZxJSvNuTXt9CVKHCIWRaiiW0GbvR8QVjE-jhWOK4Y/s1600/wv14.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="659" data-original-width="1600" height="262" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYA32Z2NmC27Exp8Js2fqwLBDSx8EJixPOx0xXX4Rh2fjd3dEj9K0fWRhFZ4apPQwH5thJ92E_xImwM7pH9fSsLlV4PzI7HRl7SDZxJSvNuTXt9CVKHCIWRaiiW0GbvR8QVjE-jhWOK4Y/s640/wv14.png" width="640" /></a></div>
One additional step is added that calculates the local slope from two adjacent sum outputs. (The symbols z<sup>-1</sup> and z are standard notation for "one sample behind" and "one sample ahead".) After the subtraction, a round-down-bias-compensating constant of two is added in and then the result is right shifted by two, effectively dividing it by four. The result is similar to the four outer 1/8-weighted terms of the six-point difference, but with rounding. However, because this whole step is done identically in the forward and reverse direction, it's still fully reversible.</div>
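<div style="text-align: left;">
Putting Step 3 into C (again a sketch, with one sign convention chosen so that a ramp input zeroes out; applied only to interior pairs so the input-extension question can wait), the forward and inverse corrections are literally the same computation with the sign flipped:</div>
<div style="text-align: left;">
<br /></div>

```c
#include <assert.h>

/* Steps 1 and 2 (per pair), as in the Haar case. */
static void pair_fwd(int x0, int x1, int *s, int *d)
{
    *d = x1 - x0;
    *s = x0 + (*d >> 1);
}

/* Step 3, forward: correct each interior difference by the local slope
 * estimated from the two adjacent sums. The +2 compensates the
 * round-down bias before the right shift by 2 (divide by 4). */
static void lift26_fwd(const int *s, int *d, int n)  /* n = pair count */
{
    for (int i = 1; i <= n - 2; i++)
        d[i] += (s[i - 1] - s[i + 1] + 2) >> 2;
}

/* Step 3, inverse: the identical computation subtracted back out, which
 * is why the rounding inside costs nothing in reversibility. */
static void lift26_inv(const int *s, int *d, int n)
{
    for (int i = 1; i <= n - 2; i++)
        d[i] -= (s[i - 1] - s[i + 1] + 2) >> 2;
}
```
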
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The calculation of the local slope from two adjacent <i>outputs</i> instead of four adjacent <i>inputs</i> highlights another important efficiency feature of the lifting scheme: intermediate results get reused in later lifting steps. This reduces the total number of shift/add operations (or multiply/add operations, for more complex wavelets). It also means that once the immediate sum and difference steps have been performed on a particular pixel pair, that input pair is no longer needed and its memory can be used to store intermediate results instead. (Fully in-place computation is possible.)</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The 2/6 lifting scheme construction as described above will be the basis for the hardware implementation to follow. One important consideration is that the real implementation must be <a href="https://en.wikipedia.org/wiki/Causal_system">causal</a>, so the "one sample ahead" (z) component of the local slope calculation implies a latency of at least one pixel pair from input to output. This has different consequences for the horizontal DWT, which operates in the same dimension as the sensor readout scan, and the vertical DWT, which must wait for enough completed rows. For this and other reasons, the hardware implementations for each dimension can wind up being quite different.</div>
<h4 style="text-align: left;">
<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html" name="h26"></a>Horizontal 2/6 DWT Core</h4>
</div>
<div>
For running a multi-stage 2D DWT at 3.8Gpx/s on a Zynq Ultrascale+ SoC, the bottleneck isn't really computation, it's memory access. Writing raw frames to external (PS-side) DDR4 RAM and then doing an in-place 2D DWT on them is infeasible. Even the few Tb/s of BRAM bandwidth available on the PL-side needs to be carefully rationed! For that reason, I decided I wanted the Stage 1 horizontal cores to run directly on the 64 <a href="http://scolton.blogspot.com/2019/09/cmv12000-full-speed-384gbs-read-in-on.html">pixel input streams</a>, using only distributed memory. Something like this:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTO8zfheUZKbpGAjudaYxNocjfC059BeSFU2glJyaNED1-Wc0dn7EJNVeAnt6dgDrG2qJTiWIHa7c-yem2iwOL9wp0o0HQlN2f11gsIa04XEcvodHKDrlJGFVYxUIQ2NYeW4IwgEwBvBk/s1600/wv16.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="718" data-original-width="1600" height="284" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTO8zfheUZKbpGAjudaYxNocjfC059BeSFU2glJyaNED1-Wc0dn7EJNVeAnt6dgDrG2qJTiWIHa7c-yem2iwOL9wp0o0HQlN2f11gsIa04XEcvodHKDrlJGFVYxUIQ2NYeW4IwgEwBvBk/s640/wv16.png" width="640" /></a></div>
</div>
<div>
Because the DWT operates on pixel pairs, one register is required to store the even pixel. Then, all the action happens on the odd pixel's clock cycle:</div>
<div>
<ul>
<li>Combinational block A does <span style="color: #76a5af;">Step 1</span> and <span style="color: #ffe599;">Step 2</span> of the 2/6 DWT lifting scheme as described above, placing its results in registers <span style="color: #76a5af;">D_0</span> and <span style="color: #ffe599;">S_0</span>.</li>
<li><span style="color: #ffe599;">S_0</span> and <span style="color: #76a5af;">D_0</span>'s previous values are shifted into <span style="color: #ffe599;">S_1</span> and <span style="color: #76a5af;">D_1</span>.</li>
<li><span style="color: #ffe599;">S_1</span>'s previous value is shifted into <span style="color: yellow;">S_out</span>.</li>
<li>Combinational block B does <span style="color: #c27ba0;">Step 3</span> of the 2/6 DWT lifting scheme, using <span style="color: #ffe599;">S_0</span> and <span style="color: yellow;">S_out</span>'s previous values to compute the local slope and subtract it from <span style="color: #76a5af;">D_1</span>. The result is placed in <span style="color: magenta;">D_out</span>.</li>
</ul>
<div>
This is a tiny core: the seven 16-bit registers get implemented as 112 total Flip-Flops (FFs) and the combinational logic takes around 64 Look-Up Tables (LUTs), in the form of four 16-bit adders. And it <i>has</i> to be tiny, because not only are there 64 pixel input streams, but each stream services two color fields (in the <a href="https://en.wikipedia.org/wiki/Bayer_filter">Bayer-masked</a> color version of the sensor). So that's 128 Stage 1 horizontal cores running in parallel. The good news is that they only need to run at px_clk frequency, 60MHz, which isn't much of a timing burden.</div>
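<div>
A behavioral C model of my reading of that register pipeline (a sketch, not the HDL; the X_even register is folded into passing the whole pair at once). Block B reads the registers' pre-clock values, exactly as combinational logic driven by flip-flop outputs would, so the (S_out, D_out) results for input pair i appear two pair-clocks later:</div>
<div>
<br /></div>

```c
#include <assert.h>

/* One struct member per register in the diagram (minus X_even). */
typedef struct { int S_0, D_0, S_1, D_1, S_out, D_out; } hcore;

/* One call = one odd-pixel clock cycle. Assumes arithmetic right
 * shift for negative values, as elsewhere in this post. */
static void hcore_clk(hcore *c, int x0, int x1)
{
    /* Block A: Steps 1 and 2 on the incoming pair. */
    int d_new = x1 - x0;
    int s_new = x0 + (d_new >> 1);
    /* Block B: Step 3 on pair i-2, slope from sums i-3 and i-1,
     * using the registers' pre-clock values. */
    int d_out_new = c->D_1 + ((c->S_out - c->S_0 + 2) >> 2);
    /* Register updates, as if on one clock edge. */
    c->D_out = d_out_new;
    c->S_out = c->S_1;
    c->S_1   = c->S_0;  c->S_0 = s_new;
    c->D_1   = c->D_0;  c->D_0 = d_new;
}
```
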
</div>
<div>
<br /></div>
<div>
Unfortunately, that tiny core was a little too good to be true. The 64 pixel input streams from the CMV12000 are divided up into two rows and 32 columns. Within each column, there are only 128 adjacent pixels, and only 64 of a given color field. Remember the part about the 2/6 DWT requiring input extensions of two samples at the beginning and end of a stream? Now, each core would need logic to handle that. But unlike the actual beginning and end of a row, in most cases the required data actually does exist, just not in the same stream as the one being processed by the core. A better solution, then, is to tie adjacent cores together with the requisite edge pair data:</div>
<div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtvZAggxE9bx76j8GYdvMqCqT6NbeTFNS9xlhTYDaGxt4h2RiNCwFfPW6Q_iFN_605H8AAryVU2jZewKFOu2-5H5TmIaPg8JiK9MmGIbnUmIteu01gDcn-y9_qZBlKSRj9V4CQxZcen0g/s1600/wv17.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="586" data-original-width="1380" height="270" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtvZAggxE9bx76j8GYdvMqCqT6NbeTFNS9xlhTYDaGxt4h2RiNCwFfPW6Q_iFN_605H8AAryVU2jZewKFOu2-5H5TmIaPg8JiK9MmGIbnUmIteu01gDcn-y9_qZBlKSRj9V4CQxZcen0g/s640/wv17.gif" width="640" /></a></div>
</div>
<div>
Now, the horizontal core can be in one of three states: Normal, during which the core has access to everything it needs from within its own column; <span style="color: #f1c232;">Last Out</span>, where the core needs to borrow the sum of the first pixel pair from the core to its right (S_pp0_fromR); and <span style="color: #f1c232;">First Out</span>, where it needs to borrow the first two sums and the first difference from the core to its right (S_pp0_fromR, S_pp1_fromR, and D_pp0_fromR). The active state is driven by a global pixel counter.<br />
<br />
In addition to the extra switching logic (implemented as LUT-based multiplexers), the cores need to store their first and second pixel pair sums (<span style="color: #ffe599;">S_pp0</span>, <span style="color: #ffe599;">S_pp1</span>) and first pixel pair difference (<span style="color: #76a5af;">D_pp0</span>) for later use by the adjacent core. One more register, <span style="color: #ffe599;">S_2</span>, is also added as a dedicated local sum history to allow the output sum to be the target of one of the multiplexers. The total resource utilization of the Stage 1 horizontal core is 176 FFs (11x16b registers) and 107 LUTs. That's still pretty small, but 128 cores in parallel eats up about 15% of the available logic in the XCZU4.<br />
<br />
Luckily, things get a lot easier in Stage 2 and Stage 3, which only need 32 and 16 horizontal cores, respectively. They're also fed whole pixel pairs from the stage above them, so the X_even register isn't needed. Otherwise, they operate in a similar manner to the Stage 1 core shown above.<br />
<h4>
<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html" name="v26"></a>Vertical 2/6 DWT Core</h4>
While I can get away with using only distributed memory for the horizontal cores, the vertical cores need to store several entire rows. This means using Block RAM (BRAM), which is dedicated synchronous memory available on the Zynq Ultrascale+ PL side. The XCZU4 has 128 BRAMs, each with 36Kib of storage. Each color field is 2048px wide, so a single BRAM could store a single row of a single color field (2048 · 16b = 32Kib). I settled on storing eight rows, requiring a total of 32 BRAMs for the four color fields of Stage 1.<br />
<br />
Storing whole rows in each BRAM is the wrong way to do things, though. The data coming from the CMV12000 is primarily parallel by column, and preserving that parallelism for as long as possible is the key to going fast. If all 32 Stage 1 horizontal cores of a given color field had to access a single BRAM every time a new pixel pair was ready at their output (once every four px_clk), it would require eight write accesses per px_clk. While 480MHz is <i>theoretically</i> in-spec for the BRAM, it would make meeting timing constraints much harder. It's much better to divide up the BRAMs into column-parallel groups, each handling 1/8 of the total color field width and receiving data from only four horizontal cores (just one write access per px_clk, at 60MHz).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPbOym4LTpUSejjwJc-i137wnO_HKnQeQdoma8ahT560l_0O_ebDgixS7MwYtGLmRNCTkxSFPWAxX873ECKfBedtXvUNhBwrq8s4wbA2qoxQH96LcBcbC3Mkw9am7CWgtu34qOEaQgz9k/s1600/wv18.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="425" data-original-width="1600" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPbOym4LTpUSejjwJc-i137wnO_HKnQeQdoma8ahT560l_0O_ebDgixS7MwYtGLmRNCTkxSFPWAxX873ECKfBedtXvUNhBwrq8s4wbA2qoxQH96LcBcbC3Mkw9am7CWgtu34qOEaQgz9k/s640/wv18.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Stage 1 parallel architecture for a single color field. BRAM writes round-robin through four horizontal cores at 60MHz.</td></tr>
</tbody></table>
Now, each BRAM stores all eight rows for its column group. The vertical 2/6 DWT core can be built around the BRAM, with the DWT operations running on six older rows while two newer rows are being written with horizontal core data. Doing six reads per two writes (180MHz read access) isn't too bad, especially since the BRAM has independent read and write ports. But I can do even better by using a 64-bit read width and processing two pixel pairs at a time in the vertical core. To avoid dealing with a 3/2-speed clock, I decided to implement the reads as six of eight states in a state machine that runs at double px_clk (120MHz).<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3OQh2vJ2twGRAzQAFgDGp_tfohHHwHaaiSX2zNb44rWwDh82SZE-vQe7Yl1utD4sB8mEijLyHe8Ts-NZQlKpac45OzbKStIq8AqS9ubpiUEdoPIrH3B_oCS6bbyh0Wh6PmRXlWG_q0Xo/s1600/wv19.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="773" data-original-width="1430" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3OQh2vJ2twGRAzQAFgDGp_tfohHHwHaaiSX2zNb44rWwDh82SZE-vQe7Yl1utD4sB8mEijLyHe8Ts-NZQlKpac45OzbKStIq8AqS9ubpiUEdoPIrH3B_oCS6bbyh0Wh6PmRXlWG_q0Xo/s640/wv19.gif" width="640" /></a></div>
<div style="text-align: left;">
Since it has access to all the data it needs in the six oldest rows, the vertical core <i>can</i> actually be pretty simple. Input extensions are only needed at the top and bottom of a frame (sort-of, see below); no data is needed from adjacent cores. It's not as computationally efficient as the horizontal core: <span style="color: #76a5af;">Step 1</span> and <span style="color: #ffe599;">Step 2</span> are repeated (via combinational block A) three times on a given row pair as it gets pushed through. This is done intentionally to avoid having to write back local sums to the BRAM, so the vertical core only needs read access.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Since the vertical core operates on 64-bit registers (two pixel pairs at a time), all the multiplexers and adders are four times as wide, giving a total LUT count of 422, roughly four times that of the horizontal core. This seems justified, with a ratio of 4:1 for horizontal:vertical cores in Stage 1. Still, it means the complete Stage 1 uses about 30% of the total logic available on the XCZU4. I do get a bit of a break on the FF count, since this core only has six 64-bit registers (384 FFs, much less than four times the horizontal core FF count).</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The output of the vertical core is two 64-bit registers containing the vertical sum and difference outputs of two pixel pairs. Because of the way the horizontal sums and differences are interleaved in the BRAM rows, this handily creates four 32-bit pixel pairs to route to either the next wavelet stage (for the LLx pair) or the compression hardware (for the HLx, LHx, and HHx pairs). The LLx pair becomes the input for the next stage horizontal cores.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvmwjTxyMNWEpmZxjHb8MDCUpwHBbXyKrRH8CzR_cIl5zB-dmv7bvyt68AUA1SkohPhLTQp5bdzn0xbnhwGoqpd2SvTJBuGSJCYPa2Spr_8zr-r6XP7IReYaqyTy30L_FYBW7tB3yXMaM/s1600/wv20.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="497" data-original-width="1600" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvmwjTxyMNWEpmZxjHb8MDCUpwHBbXyKrRH8CzR_cIl5zB-dmv7bvyt68AUA1SkohPhLTQp5bdzn0xbnhwGoqpd2SvTJBuGSJCYPa2Spr_8zr-r6XP7IReYaqyTy30L_FYBW7tB3yXMaM/s640/wv20.png" width="640" /></a></div>
<div style="text-align: left;">
At each stage, the color field halves in width and height. But, the vertical cores need to store eight rows at all stages, so only the width reduction comes into play. Thus, Stage 2 needs 16 vertical cores (four per color field) and Stage 3 needs eight (two per color field). The extra factor of two is not lost completely, though: at each stage, the write and read access rates of the vertical core BRAM are also halved. In total, the three-stage architecture looks something like this:<br />
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLLq7B5l50lNHRrRx38_xELqcUQM7rCe11nzTId_1W57ddYs4R2VuSrhji79iyEZxN9WIiiEEGqLGIZL5JFhzXHE_Pc99dBCfEWUFcBEivBsa9opQWiNjIRNsFDJ5A3RZ50l5NBV75VSs/s1600/wv21.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1414" data-original-width="1600" height="564" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLLq7B5l50lNHRrRx38_xELqcUQM7rCe11nzTId_1W57ddYs4R2VuSrhji79iyEZxN9WIiiEEGqLGIZL5JFhzXHE_Pc99dBCfEWUFcBEivBsa9opQWiNjIRNsFDJ5A3RZ50l5NBV75VSs/s640/wv21.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Three-stage DWT architecture for a single color field. In total, 44 horizontal cores and 14 vertical cores are used per color field. Data rates are divided by four at each stage, since it only processes the LLx output from the previous stage.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: left;">
Vertical core outputs that don't go to a later stage (HHx, HLx, LHx, and LL3) are sent to the compression hardware, which is where we're heading next as well, after tying up some loose ends.</div>
<h4 style="text-align: left;">
<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html" name="extensions"></a>Input Extensions?</h4>
<div>
I mentioned that the horizontal cores are chained together so that the DWTs can access adjacent data as needed within a row, but what happens with the left-most and right-most cores, at the beginning and end of a row? And what happens at the start and end of a frame, in the vertical cores? Essentially, nothing:</div>
</div>
<div>
<ul>
<li>For the horizontal DWT, the left-most and right-most cores are just tied to each other as if they were adjacent, a free way to implement periodic extension.</li>
<li>For the vertical DWT, the BRAM contents are just allowed to roll over between frames. The first rows of frame N reference the last rows of frame N-1.</li>
</ul>
<div>
This isn't necessarily the optimal approach for an image; symmetric extension will produce smaller differences, which are easier to compress. But, it's fast and cheap. It's also reversible if the inverse DWT implements the same type of extension of its sum inputs. Even the first rows of the first frame can be recovered if the BRAM is known to have started empty (free zero padding).<br />
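<div>
For the horizontal Step 3, periodic extension amounts to indexing the adjacent sums modulo the row's pair count; a sketch (helper names are mine) showing that it stays fully reversible even at the edge pairs:</div>
<div>
<br /></div>

```c
#include <assert.h>

/* Step 3 with periodic extension: the first and last pairs of a row
 * borrow from the opposite end, just like tying the left-most and
 * right-most cores together. n = pair count. */
static void lift26_fwd_periodic(const int *s, int *d, int n)
{
    for (int i = 0; i < n; i++)
        d[i] += (s[(i - 1 + n) % n] - s[(i + 1) % n] + 2) >> 2;
}

/* Inverse: identical slope terms subtracted back out, so the wraparound
 * costs nothing in reversibility. */
static void lift26_inv_periodic(const int *s, int *d, int n)
{
    for (int i = 0; i < n; i++)
        d[i] -= (s[(i - 1 + n) % n] - s[(i + 1) % n] + 2) >> 2;
}
```
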
<h4>
<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html" name="quantizer"></a>Quantizer</h4>
</div>
</div>
<div>
If you take a typical raw image straight off a camera sensor, run it through a multi-stage 2D DWT, and then zip the result, you'll probably get a compression ratio of around 3:1. This isn't as much a metric of the compression method itself as of the image, or maybe even of our brains. We consider 50dB to be a good signal-to-noise ratio for an image. No surprise, this gives an RMS noise of just under 1.0 LSB in the standard 8-bit display depth of a single color channel (255 · 10<sup>-50/20</sup> ≈ 0.8). But a 10- or 12-bit sensor will probably have single-digit noise, which winds up on the DWT difference outputs. This single-digit noise is distributed randomly on the frame and <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">requires</a> a few bits per pixel to encode, limiting the lossless compression ratio to around 3:1.<br />
<br />
To get to 5:1 or more, it's necessary to introduce a lossy stage in the form of re-quantization to a lower bit depth. It's lossy in the sense of being mathematically irreversible, unlike the quantization in the lifting scheme. Information will be discarded, but we can be selective about what to discard. The goal is to reduce the entropy with as little effect on the subjective visual quality of the image as possible. Discarding bits from the difference channels, especially the highest-frequency ones (HH1, HL1, LH1), has the least visual impact on the final result. This is especially true if the least significant bits of those channels are lost in sensor noise anyway.<br />
<br /></div>
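To make the entropy argument concrete, here's a small Python check. The histogram values are hypothetical, just shaped like a typical difference channel: mostly zeros with a tail of small sensor-noise values.

```python
import math

def entropy_bits(hist):
    """Shannon entropy in bits/symbol: the lower bound on bits per
    pixel for any lossless code over this symbol distribution."""
    total = sum(hist.values())
    return -sum((c / total) * math.log2(c / total)
                for c in hist.values() if c)

# Hypothetical difference-channel histogram with a noise tail.
noisy = {0: 6000, 1: 1500, -1: 1500, 2: 400, -2: 400, 3: 100, -3: 100}

# Dropping one LSB (divide by 2, round toward zero) folds the tail in.
requantized = {0: 9000, 1: 500, -1: 500}
```

Here `entropy_bits(requantized)` comes out well below `entropy_bits(noisy)`, which is the whole point of re-quantization: fewer bits of noise to encode per pixel.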
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw7RnrPkPyd7B8mbalDRFs1iYE4CdjT_DHmdj29gdAm7p3A9sTlIeThezUoxxWpSMKxuk7GOBsxyZOiPM2ld0NrTYjGwGhGF-jvyn3E5J054vXiDv97n0ND3jzcnia2B0E_jh8KmpY7nA/s1600/wv03.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="187" data-original-width="372" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw7RnrPkPyd7B8mbalDRFs1iYE4CdjT_DHmdj29gdAm7p3A9sTlIeThezUoxxWpSMKxuk7GOBsxyZOiPM2ld0NrTYjGwGhGF-jvyn3E5J054vXiDv97n0ND3jzcnia2B0E_jh8KmpY7nA/s1600/wv03.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Bits can be discarded from the difference channels with far less visual impact than discarding bits from the average.</td></tr>
</tbody></table>
<div style="text-align: center;">
<div style="text-align: left;">
In a sense, the whole purpose of doing the multi-stage DWT was to separate out the information content into sub-bands that can be prioritized by visual importance:</div>
<div style="text-align: left;">
</div>
<ol>
<li style="text-align: left;">LL3. This is kept as raw 10- or 12-bit sensor data.</li>
<li style="text-align: left;">LH3, HL3, and HH3.</li>
<li style="text-align: left;">LH2, HL2, and HH2.</li>
<li style="text-align: left;">LH1, HL1, and HH1.</li>
</ol>
<div style="text-align: left;">
Arguably, the HHx sub-bands are of lower priority than the LHx and HLx sub-bands on the same level. This would prioritize the fidelity of horizontal and vertical edges over diagonal ones. With these priorities in mind, each sub-band can be configured for a different level of re-quantization in order to achieve a target compression ratio.</div>
<div>
<br /></div>
<div>
<div style="text-align: left;">
In C, the fastest way to implement a configurable quantization step would be a variable right shift, causing the signal to be divided by {2, 4, 8, ...} and then rounded down. But implementing a <a href="https://en.wikipedia.org/wiki/Barrel_shifter">barrel shifter</a> in LUT-based multiplexers eats resources quickly if you need to process a bunch of 16-bit variable right shifts in parallel. Fortunately, there are dedicated multipliers distributed around the PL-side of the Zynq Ultrascale+ SoC that can facilitate this task. These DSP slices have a 27-bit by 18-bit multiplier with a full 45-bit product register. To implement a configurable quantizer, the input can be multiplied by a software-set value and the output can be the product, right shifted by a <i>constant</i> number of bits (just a bit-select):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKWi2SiTMmBcn0a_mvICmwyEZrx77kNYPpO_tlMaJvNW1I9IeRL2Tav9fhTRQzLcH7Pt677tTSdMQp9P_p-E7LsyCqnV-6iQWUNCE9c0pWBLc4RY0qaLHEQUkXKHML_Bz_JSiGXcocPWU/s1600/wv22.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="479" data-original-width="1136" height="167" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKWi2SiTMmBcn0a_mvICmwyEZrx77kNYPpO_tlMaJvNW1I9IeRL2Tav9fhTRQzLcH7Pt677tTSdMQp9P_p-E7LsyCqnV-6iQWUNCE9c0pWBLc4RY0qaLHEQUkXKHML_Bz_JSiGXcocPWU/s400/wv22.png" width="400" /></a></div>
This is more flexible than a variable shift, since the multiplier can be any integer value. Effectively, it opens up division by non-powers-of-two. For example, 85/256 gives an approximate divide-by-3. The DSP slices also have post-multiply logic that can be used, among many other things, to implement different rounding strategies. An unbiased round-toward-zero can be implemented by adding copies of the input's sign bit to the product up to (but not including) the first output bit:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAWRMGeuazCAii5N4Mg10-9k8rqjPkDROxjxpGzHLmFHYTH7v203ULV5k90q6luIbvF39LTLjHCC0RRRr4DSz6Y3VK2WO896GCYfuy6C8Pq2Sh46fCi-AOsEyB9NE3cYyLccHfqqIThXc/s1600/wv23.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="670" data-original-width="1228" height="217" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAWRMGeuazCAii5N4Mg10-9k8rqjPkDROxjxpGzHLmFHYTH7v203ULV5k90q6luIbvF39LTLjHCC0RRRr4DSz6Y3VK2WO896GCYfuy6C8Pq2Sh46fCi-AOsEyB9NE3cYyLccHfqqIThXc/s400/wv23.png" width="400" /></a></div>
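As a behavioral model (Python, purely illustrative; the real thing is a DSP48 slice, and 85/256 and 32/256 are just example settings), the quantizer is a multiply followed by a constant bit-select, with the sign-bit trick providing the rounding:

```python
def quantize(x, mult, shift=8):
    """Model of the DSP-slice quantizer: multiply by a software-set
    integer, then keep only the bits above a constant position (a
    bit-select in hardware). Adding (2^shift - 1) when the input is
    negative replicates the sign bit into the discarded bits, turning
    the arithmetic shift's floor into an unbiased round toward zero."""
    p = x * mult
    if x < 0:
        p += (1 << shift) - 1
    return p >> shift
```

With `mult = 32` this divides by 8; with `mult = 85` it approximates a divide-by-3 (85/256 ≈ 0.332). Note that equal-magnitude positive and negative inputs now quantize symmetrically, avoiding a downward bias in the difference channels.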
The XCZU4 has 728 DSP slices distributed throughout its PL-side, so dedicating a bunch of these to the quantization task seems appropriate. The combined output of all the wavelet stages is 64 16-bit values per px_clk, so 64 DSPs would do the trick with very little timing burden. But there's a small catch that has to do with how the final quantized values get encoded.</div>
</div>
<div>
<h4 style="text-align: left;">
<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html" name="encoder"></a>Variable-Length Encoder</h4>
</div>
<div>
<div style="text-align: left;">
So far we've done a terrible job at reducing the bit rate: the sensor's 37.75Gb/s (4096 · 3072 · 10b · 300fps) has turned into 61.44Gb/s at the quantizer (64 · 16b · 60MHz). But, if the DWT and quantizer have worked, most of the 16-bit values in HHx, HLx, and LHx should be zeros. A lot will be ±1, fewer will be ±2, fewer ±3, and so on. There will be patches of larger numbers for actual edges, but the probability will be heavily skewed toward smaller values.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigLB4VApJPEkQF5_iM2sk9lHG_jxi_yCL_LmOg3CQxfXYm8hwn6KYt9CFg1La3ME3f2xQ6x_IZ2VddsApRL2gidQdSC3m91is4Ihs4qq0WYE4YQNzg6fogcEoFyHEa_x2yetraoANGd8M/s1600/wv24.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigLB4VApJPEkQF5_iM2sk9lHG_jxi_yCL_LmOg3CQxfXYm8hwn6KYt9CFg1La3ME3f2xQ6x_IZ2VddsApRL2gidQdSC3m91is4Ihs4qq0WYE4YQNzg6fogcEoFyHEa_x2yetraoANGd8M/s640/wv24.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Probability distribution for a typical set of DWT high-frequency outputs. Green indicates zero difference outputs.</td></tr>
</tbody></table>
</div>
</div>
<div>
<div style="text-align: left;">
If the new probability distribution has an <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a> of less than 2.00 bits, it's theoretically possible to achieve 5:1 compression. One way to do this is with a variable-length code, which maps inputs with higher probability to outputs with fewer bits. A known or measured probability distribution of the inputs can be used to create an efficient code, such as in <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman coding</a>. Adapting the code as the probability distribution changes is a little beyond the amount of logic I want to add at the moment, so I will just take an educated guess at a fixed code that will work okay given the expected distributions:</div>
<div style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
<div style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5Am3OHuC1cppxgWHui5UQ-HR_WBkdieg06wHVBanddxn-Sn3gAToA9ds-bLixWOfpbIAoiWlflmdOe45E7z1lsyznn92Atz6Uu-ghP3MHyJNfVTsUarRFVhxaNJe5GQ11w9mExmJ7ObE/s1600/wv25.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1009" data-original-width="1600" height="402" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5Am3OHuC1cppxgWHui5UQ-HR_WBkdieg06wHVBanddxn-Sn3gAToA9ds-bLixWOfpbIAoiWlflmdOe45E7z1lsyznn92Atz6Uu-ghP3MHyJNfVTsUarRFVhxaNJe5GQ11w9mExmJ7ObE/s640/wv25.png" width="640" /></a></div>
This encoder looks at four <i>sequential</i> quantized values and determines how many bits are required to store the largest of the four. It then encodes that bit requirement in a prefix and concatenates the reduced-width codes for each of the four values after that. This is all relatively fast and easy to do in discrete logic. Determining how many bits are required to store a value is similar to the <a href="https://en.wikipedia.org/wiki/Find_first_set">find first set</a> bit operation. Some bit bins are grouped to reduce multiplexer count for constructing the coded output. Applying this code to the three example probability distributions above gives okay results:<br />
<br />
<ul>
<li><span style="color: #b45f06;">LH1:</span> 1.38bpp (7.26:1) compared to 1.14bpp (8.77:1) theoretical.</li>
<li><span style="color: yellow;">HL1:</span> 1.39bpp (7.20:1) compared to 1.06bpp (9.43:1) theoretical.</li>
<li><span style="color: magenta;">HH1:</span> 2.41bpp (4.16:1) compared to 1.84bpp (5.44:1) theoretical.</li>
</ul>
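A simplified software model of this encoder (Python; the real core's prefix and bin grouping follow the figure above, while this sketch uses plain sign-magnitude for clarity):

```python
def encode_group(vals, prefix_bits=4):
    """Simplified model of the 4-value variable-length encoder: find
    the width needed for the largest magnitude (a find-first-set-style
    operation), store that width in a small prefix, then pack all four
    values at that width. The real core also groups some widths
    together to save multiplexers."""
    assert len(vals) == 4
    k = max(abs(v).bit_length() for v in vals)  # magnitude bits
    width = k + 1 if k else 0                   # +1 sign bit if nonzero
    codes = [((1 if v < 0 else 0) << k) | abs(v) for v in vals]
    return width, codes, prefix_bits + 4 * width
```

Four zeros cost only the prefix (4 bits standing in for 64 input bits), which is where most of the compression comes from.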
<div>
To get closer to the theoretical maximum compression, more logic can be added to revise the code based on observed probabilities. It might also be possible to squeeze out more efficiency by using local context to condition the encoder, almost like an extra mini wavelet stage. But since this has to go really fast, I'm willing to trade compression efficiency for simplicity for now.</div>
<div>
<br /></div>
<div>
I emphasized that the four quantized values need to be <i>sequential</i>, i.e. from a single continuous row of data. The probability distribution depends on this, and the parallelization strategy of the quantizers and encoders must enforce it. The vertical core outputs are all non-adjacent pixel pairs, either from different levels of the DWT, different color fields within a level, or different columns within a color field. So, while a single four-pixel quantizer/encoder input can round-robin through a number of vertical core outputs, the interface must store one previous pixel pair. I decided to group them in 128-bit compressor cores, each with two 4x16b quantizers and encoders:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixs0lhRuvtjU5TjYl6PbtyzS41zAcG0WirVPlXT8KZww1YY-6rHP4whbNU6oStCWzfkI3FgwPAXyruVTqYApL7t7Oq-DdVGppU-2SxyeS-IRZ5StK9t1Gb0DaWhdOJDE0D9PAS3DuPu5c/s1600/wv26.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="867" data-original-width="1600" height="346" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixs0lhRuvtjU5TjYl6PbtyzS41zAcG0WirVPlXT8KZww1YY-6rHP4whbNU6oStCWzfkI3FgwPAXyruVTqYApL7t7Oq-DdVGppU-2SxyeS-IRZ5StK9t1Gb0DaWhdOJDE0D9PAS3DuPu5c/s640/wv26.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Each input is a 32-bit pixel pair from a single vertical core output. During the "even" phase, previous pixel pair values are saved in the in_1[n] registers. During the "odd" phase, the quantizers and encoders round-robin through the inputs, encoding the current and previous pixel pairs of each. A final step merges the two codes and repacks them into a 192-bit buffer. When this fills past 128 bits, a write is triggered to a 128-bit FIFO (made with two BRAMs) that buffers data for DDR4 RAM writing, using a similar process as I previously used for <a href="http://scolton.blogspot.com/2019/09/cmv12000-full-speed-384gbs-read-in-on.html">raw sensor read-in</a>.</div>
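The repacking step behaves like a bit accumulator. A Python model (in hardware this is the 192-bit register described above, with a write strobe once 128 bits are queued):

```python
class BitPacker:
    """Model of the code repacker: append variable-length codes to an
    accumulator and emit a fixed-width word whenever enough bits are
    queued. In hardware this is the 192-bit buffer that triggers a
    write to the 128-bit BRAM FIFO when it fills past 128 bits."""
    def __init__(self, word_bits=128):
        self.word_bits = word_bits
        self.acc = 0      # packed bits, MSB-first
        self.count = 0    # number of valid bits in acc
        self.words = []   # completed FIFO words

    def push(self, code, width):
        self.acc = (self.acc << width) | (code & ((1 << width) - 1))
        self.count += width
        while self.count >= self.word_bits:
            self.count -= self.word_bits
            self.words.append(self.acc >> self.count)
            self.acc &= (1 << self.count) - 1
```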
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The eight-input compressor operates on two pixel pairs simultaneously at each px_clk, covering eight inputs in four px_clk. This matches up well to the eight vertical cores per color field of Stage 1, which update their outputs every fourth px_clk. In total, Stage 1 uses 12 compressor cores: four color fields each for HH1, HL1, and LH1 pixel pair outputs. Things get a little more confusing in Stage 2 and Stage 3, where the outputs are updated less frequently. To balance the load before it hits the RAM writer, it makes sense to increase the round-robin factor by 2x at each stage. When load-balanced, the overall mapping winds up using 16 compressor cores:</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9inwvuqUJ_-sw-MSA8vjX6HDJ3nC7qhccFdXZ4TBsENduzCZAf7-nkUaTy3NzjBe6CYd0qFsTNFlc0Ft024FOqBPh2kiURzlq3GljFfvy3Ze661iqwUmYyX1FZrd08Vp1X37KqWC2cOY/s1600/wv27.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1170" data-original-width="1600" height="466" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9inwvuqUJ_-sw-MSA8vjX6HDJ3nC7qhccFdXZ4TBsENduzCZAf7-nkUaTy3NzjBe6CYd0qFsTNFlc0Ft024FOqBPh2kiURzlq3GljFfvy3Ze661iqwUmYyX1FZrd08Vp1X37KqWC2cOY/s640/wv27.png" width="640" /></a></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Each core handles 1/16 of the 61.44Gb/s output from the DWT stages, writing compressed data into its BRAM FIFO. An AXI Master module checks the fill level of each FIFO and triggers burst writes to PS DDR4 RAM as needed. A single PL-PS AXI interface has a maximum theoretical bandwidth of 32Gb/s at 250MHz, which is why I needed two of them in parallel for <a href="http://scolton.blogspot.com/2019/09/cmv12000-full-speed-384gbs-read-in-on.html">writing raw images to RAM</a>. But since the target compression ratio is 5:1, I should be able to get away with a single AXI Master here. In that case, the BRAM FIFOs are critical for handling transient entropy bursts.</div>
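As a sanity check on the data rates quoted above (Python, all values from this post):

```python
# Raw sensor rate: 4096 x 3072 px, 10 bits/px, 300 fps
sensor_gbps = 4096 * 3072 * 10 * 300 / 1e9   # ~37.75 Gb/s
# Combined DWT/quantizer output: 64 x 16-bit values per 60 MHz px_clk
dwt_gbps = 64 * 16 * 60e6 / 1e9              # 61.44 Gb/s
# One 128-bit PL-PS AXI interface at 250 MHz
axi_gbps = 128 * 250e6 / 1e9                 # 32.0 Gb/s
# At the 5:1 target, the compressed stream fits one AXI master
compressed_gbps = dwt_gbps / 5               # ~12.3 Gb/s
assert compressed_gbps < axi_gbps
```

So the steady-state compressed rate has plenty of margin on a single interface, with the FIFOs absorbing transients.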
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
And that's it, we've reached the end of this subsystem. The 128b-wide AXI interface to the PS-side DDR controller is the final output of the wavelet compressor. What becomes of the data after that is a <a href="http://scolton.blogspot.com/2019/07/benchmarking-nvme-through-zynq.html">problem for another day</a>.</div>
<h4 style="text-align: left;">
<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html" name="wrap"></a>Wrapping Up</h4>
<div>
Even though most of the cores described above are tiny, there are just so many of them running in parallel that the combination of <a href="http://scolton.blogspot.com/2019/09/cmv12000-full-speed-384gbs-read-in-on.html">CMV12000 Input stage</a> and Wavelet stages described here already takes up 70% of the XCZU4's LUTs, 39% of its FFs, and 69% of its BRAMs.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBOd8HtI-gfA6MhyphenhyphenBue9snKnDnwwmxy2yelC2m3PtAWuKSW8lHG-FaSLAxv-F6izTK2wkOWqvKfyY3HZL81hpCjWWNHMcVsN4QvpH-EYlvr6VU8nYofcFfOmln-SxIZF4fZv2hqEE9j04/s1600/wv28.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1193" data-original-width="1546" height="491" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBOd8HtI-gfA6MhyphenhyphenBue9snKnDnwwmxy2yelC2m3PtAWuKSW8lHG-FaSLAxv-F6izTK2wkOWqvKfyY3HZL81hpCjWWNHMcVsN4QvpH-EYlvr6VU8nYofcFfOmln-SxIZF4fZv2hqEE9j04/s640/wv28.png" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div>
I'm sure there's still some room for optimization. Many of the modules are running well below their theoretical maximum frequencies, so there's a chance that some more time multiplexing of certain cores could be worthwhile. But I think I would rather stick to having many slow cores in parallel. The next step up would probably be the <a href="https://shop.trenz-electronic.de/en/TE0808-04-6BE21-A-UltraSOM-MPSoC-Module-with-Zynq-UltraScale-XCZU6EG-1FFVC900E-4-GB-DDR4">XCZU6</a>, which has more than double the logic resources. It's getting more expensive than I'd like, but I will probably need it to add more pieces:</div>
<div>
<ul>
<li>The PCIe bridge and, if needed, NVMe hardware acceleration.</li>
<li>Decoding and preview hardware, such as HDMI output generation. This can be much less parallel, since it only needs to run at 30fps. Maybe the ARMs can help.</li>
<li>128 <i>more</i> Stage 1 horizontal cores to deal with the sensor's 2048x1536 sub-sampling mode, where four whole rows are read in at once. This should run at above 1000fps.</li>
</ul>
<div>
For now, though, let's run this machine:<br />
<br />
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/ep0a7_K0EIs" width="640"></iframe>
</div>
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<br />
That's a quick test at 4096x2304, with all the quantizers (other than LL3) set to 32/256 and the frame rate maxed out (400fps at 16:9). This results in an overall compression ratio of about 6.4:1 for this clip. The second half of the clip shows the wavelet output at different stages, although the sensor noise wreaks havoc on the H.264. It's hard to tell anything from the YouTube render, but here's a PNG frame (<a href="https://scolton-www.s3.amazonaws.com/video/wv33.png">full size</a>):<br />
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1IpOjL06e6vP6QlSIJC4QRYeRxnTu8yQgw87S5rAJsPteQJ5Z1sk0OdCMpKJc5vrU2TxOPS97JRNjJpmb5qGFt1fnZo6__hrB6eyzaJBjnoRQaFjjfil9o77_bpZliPKJNfcUlfGZugI/s1600/wv33.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1IpOjL06e6vP6QlSIJC4QRYeRxnTu8yQgw87S5rAJsPteQJ5Z1sk0OdCMpKJc5vrU2TxOPS97JRNjJpmb5qGFt1fnZo6__hrB6eyzaJBjnoRQaFjjfil9o77_bpZliPKJNfcUlfGZugI/s640/wv33.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">No Kerbals were harmed in the making of this video.</td></tr>
</tbody></table>
<div>
The areas of high contrast look pretty good. I'm happy with the reconstruction of the text and QR code on the label. The smoke looks fine - it's mostly out of focus anyway. Wavelet compression is entirely intraframe, so things like smoke and fire don't produce any motion artifacts. There are a few places where the wavelet artifacts do show up, mostly in areas with contrast between two relatively dark features. For example on the color cube string:<br />
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRmrgoJ49SuJ8dXyy9CDPWIBCqE5fmuYbqGg3qCQmZnFNcMapAlU7JexsK6nLBg5uJaKJtKJ1BYjyjl5XDtTVUxLiaWJa61MewoIkvFxghpE4ztKxnRZ6m4YxJgI8j_c8Ubw8PrwjPxWY/s1600/wv30.png" imageanchor="1"><img border="0" data-original-height="1316" data-original-width="1587" height="530" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRmrgoJ49SuJ8dXyy9CDPWIBCqE5fmuYbqGg3qCQmZnFNcMapAlU7JexsK6nLBg5uJaKJtKJ1BYjyjl5XDtTVUxLiaWJa61MewoIkvFxghpE4ztKxnRZ6m4YxJgI8j_c8Ubw8PrwjPxWY/s640/wv30.png" width="640" /></a></div>
<div style="text-align: left;">
<br />
Probably, HH3, HL3, and LH3 should get back one or two bits (quantizer setting of 64-128). On the other hand, HH1 might be able to go down one more bit since it looks like it's still encoding mostly sensor noise. I'm not sure how much the quantizer settings will need to change from scene to scene, or even from frame to frame, but overall I think it'll be easy to maintain good image quality at a target ratio of 5:1. I also have a few possibilities for improving the compression efficiency without adding much more logic, such as adding some local context-based prediction to the quantizer DSP's adder port.</div>
<div style="text-align: left;">
<br />
I'll probably take a break from wavelets for now, though, and return to the last big component of this system: the NVMe SSD write. Now that the data is below 1GB/s, I need a place to put it. The RAM will still be used as a big FIFO frame buffer, but ultimately the goal is to continuously record. I also want to drop in a color sensor at some point, since this wavelet compression architecture is really meant for that (four independent color fields). Glad to have the most unknown subsystem completed though!<br />
<br />
That's way too much information for a single post, but oh well. I'll just end with some wavelet art from the incomplete last frame of the sequence above.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQlHNRZ5uFpEIzMpvykFg0ccJcvOUL6ktEEYmLBZMMWMY6-1mHHG_2JqNSIlTE-rGh4mAYi7OLse7lq4w9IXTA0X1hdh1S-XB2zmHiWvD24EbHunPmKhSIIQCQio27LJJ703ibgO0zx_w/s1600/360.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQlHNRZ5uFpEIzMpvykFg0ccJcvOUL6ktEEYmLBZMMWMY6-1mHHG_2JqNSIlTE-rGh4mAYi7OLse7lq4w9IXTA0X1hdh1S-XB2zmHiWvD24EbHunPmKhSIIQCQio27LJJ703ibgO0zx_w/s640/360.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">It's interesting to see what uninitialized RAM looks like when run through the decoder and inverse DWT.</td></tr>
</tbody></table>
<h4>
<a href="https://scolton.blogspot.com/2019/10/real-time-wavelet-compression-for-high.html" name="info"></a>More Information</h4>
<div>
I put the Verilog source for the modules described above <a href="https://github.com/coltonshane/WAVE-Vivado">here</a>. There isn't enough there to clone a working project from scratch, but you can take a look at the <a href="https://github.com/coltonshane/WAVE-Vivado/tree/master/base_cmd.srcs/sources_1/ip">individual cores</a>. Keep in mind that these are just my implementations and I am far from a Verilog master. I would love to see how somebody with 10,000 hours of Verilog experience would write the modules described above.</div>
<div>
<br /></div>
<div>
Here are some unordered references to other good wavelet compression resources:</div>
<div>
<ul>
<li>The <a href="https://github.com/gopro/cineform-sdk">GoPro CineForm SDK</a> and <a href="http://cineform.blogspot.com/">Insider Blog</a> has a ton of good discussion of wavelet compression, including a history of the CineForm codec going back to 2005.</li>
<li>This paper, by some of the big names, lays the foundation for reversible integer-to-integer wavelet transforms, including the one described above: R. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo, “<a href="https://www.sciencedirect.com/science/article/pii/S1063520397902384">Wavelet Transforms That Map Integers to Integers</a>,” <i>Applied and Computational Harmonic Analysis</i>, vol. 5, pp. 332-369, 1998.</li>
<li>The <a href="http://wavelets.pybytes.com/">Wavelet Browser</a>, by PyWavelets, has a catalog of wavelets to look at. Interestingly, what I've been calling the 2/6 wavelet is there called the <a href="http://wavelets.pybytes.com/wavelet/rbio1.3/">reverse biorthogonal 1.3</a> wavelet.</li>
<li>This <a href="http://www.cs.tut.fi/~tabus/course/SC/246pagesCourseonJPEG2000.pdf">huge set of course notes on JPEG2000</a>, which is also wavelet-based, has a lot of information on the 5/3 and 9/7 wavelets used there, as well as quantization and encoding strategies.</li>
</ul>
</div>
</div>
</div>
</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com2tag:blogger.com,1999:blog-8200098102909041178.post-48714731144985293872019-10-05T20:56:00.003-04:002019-10-06T10:19:58.344-04:00KSP: Laythe Colony Part 4, Drop Ships and Lonely RoversAfter the <a href="http://scolton.blogspot.com/2019/05/ksp-laythe-colony-part-3-colony-ships.html">second Jool launch window</a>, I still had 196 days to get a few extra ships off Kerbin before its <a href="http://scolton.blogspot.com/2017/08/ksp-laythe-colony-part-1.html">destruction</a> on Year 3, Day 0. They couldn't transfer to Jool until the third launch window - around Year 3, Day 260 - but they could still get out of harm's way. I hadn't specified exactly <i>how</i> Kerbin is destroyed, but since this entire scenario is based on <a href="https://www.goodreads.com/book/show/22816087-seveneves">Seveneves</a>, I think it was reasonable to say that these ships should not sit in cismunar orbit. So I decided to send them out to Minmus for parking.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghhfg9FndKweshyek5xmWIQgDmXFt6ZEtsNfeaDxK5sK-XcJ8Z4gjPfcP5myrIXz_6t5PLnwJNBAi81vnPVkxbO27KVRPEBD9IsHf6uOqUcAqJSoTmEHS0hhounQur8Ncd4CXV30kf1sg/s1600/ld30.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghhfg9FndKweshyek5xmWIQgDmXFt6ZEtsNfeaDxK5sK-XcJ8Z4gjPfcP5myrIXz_6t5PLnwJNBAi81vnPVkxbO27KVRPEBD9IsHf6uOqUcAqJSoTmEHS0hhounQur8Ncd4CXV30kf1sg/s640/ld30.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Colony ship #11 or #12 - I lost count. Parked at Minmus for a front row seat to the end of the world.</td></tr>
</tbody></table>
<div style="text-align: left;">
By this point I was getting pretty tired of building colony ships. Each one takes about a dozen launches to assemble, crew, and fuel in low Kerbin orbit. But I managed to get two more built and parked at Minmus. I also realized that there would be a little bit of a housing shortage on Laythe with the extra 72 Kerbals these colony ships carry, so I sent up one more <a href="http://scolton.blogspot.com/2018/11/ksp-laythe-colony-part-2-robotic-fleet.html">HAB1</a> transfer ship as well. But parking ships in Minmus orbit isn't exactly efficient, and I am running a pretty tight Δv budget. A perfect opportunity, then, to create one last piece of hardware for this mission.</div>
<h4 style="text-align: left;">
The Drop Ships</h4>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzV7mAMW1ddXxqmhMtY7VJQl5qvCJ8-TmfFOvfl9v31WfRzeSSoXe6dltSk0ORTPoEpG2yJfg86v9vS7FZE2hAR_vZ_DZiJELrun3KCBPM4Sh-9krP2x_wHQgN1BBgzF1JUtlgREf-Khw/s1600/ld23_crop.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="852" data-original-width="1600" height="340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzV7mAMW1ddXxqmhMtY7VJQl5qvCJ8-TmfFOvfl9v31WfRzeSSoXe6dltSk0ORTPoEpG2yJfg86v9vS7FZE2hAR_vZ_DZiJELrun3KCBPM4Sh-9krP2x_wHQgN1BBgzF1JUtlgREf-Khw/s640/ld23_crop.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The DS1 lander, a last-minute mining platform and fuel tanker for the fleet.</td></tr>
</tbody></table>
<div style="text-align: left;">
Until now, the only ships in my fleet with mining capabilities were the <a href="http://scolton.blogspot.com/2018/11/ksp-laythe-colony-part-2-robotic-fleet.html">LR1</a> rovers, which can refuel space planes on the surface of Laythe. The planes can then climb into Laythe orbit and transfer any spare fuel to the colony ships. But it would take quite a few launches to fully refuel the colony ships this way. Better to mine on a moon with a shallow gravity well, like Pol, and net a bunch more fuel. So I designed a drop ship mining platform/tanker to do just that. Refueling the ships parked at Minmus before the third Jool transfer window would be a good test.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I've done space planes and straightforward powered descent, but never a true VTOL in the sense of a ship that is designed to hover and translate horizontally looking for flat ground or good mining prospects. Most of my knowledge about drop ships comes from watching <a href="https://www.youtube.com/channel/UCqgsE2C9kx62D1zYWHyAEow">Cupcake Landers</a> videos. I just tried to make it symmetric, place the C.G. properly, and set up the fuel tanks so that the C.G. doesn't shift much as they drain.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijIIPp35bs3Uq1O8ET06HH63o5DRP9Hrt1ctLXgA3x_WvFoPALKKvMdZZsUPHyPDmw3kDp4OQtqFZIAH6U_K6k3_PNcb4AghSlqKyQ9xOKwhrPhltZnrnZM_ecFD6BDS5tJLZfbWo8XKk/s1600/ld20.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijIIPp35bs3Uq1O8ET06HH63o5DRP9Hrt1ctLXgA3x_WvFoPALKKvMdZZsUPHyPDmw3kDp4OQtqFZIAH6U_K6k3_PNcb4AghSlqKyQ9xOKwhrPhltZnrnZM_ecFD6BDS5tJLZfbWo8XKk/s640/ld20.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Drop ship mining practice on Minmus.</td></tr>
</tbody></table>
<div style="text-align: left;">
Even though they're essentially flying fuel tanks, cruising through mountain ranges in the low gravity of Minmus in them is easy and actually kind of fun. Normally I'm trying to time suicide burns just right or not stall out my space planes, both of which are more stressful technical tasks. Piloting a drop ship is closer to a <a href="https://www.youtube.com/watch?v=MllGbvFBP2k">sci-fi</a> <a href="https://www.youtube.com/watch?v=8EtnyKXacF0">landing</a> <a href="https://www.youtube.com/watch?v=rNpbZdEH2M4">experience</a>. Which reminds me: if you're looking for a quick diversion from the brutally technical challenge of KSP, <a href="http://outerwilds.com/">Outer Wilds</a> is a beautiful (and creepy) exploration/mystery game with some incredible open-world storytelling. Absolutely worth going in blind and playing through.<br />
<br />
Back to Minmus mining, though. I only had time to build two of these drop ships. They can operate autonomously, but they also have room for a pilot, for navigation in frontier areas with poor relay coverage, and an engineer, for more efficient mining. I realized while building these final few ships that I neglected to put relay antennas on the <a href="https://scolton.blogspot.com/2019/05/ksp-laythe-colony-part-3-colony-ships.html">colony ships</a>, something that is required for remote piloting rovers, space planes, or drop ships. Since I might need to do a lot of remote piloting in the Jool system, I decided to steal a couple relay satellites from Kerbin orbit.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitlk_bYJlL76pOQ-NJVhkJtF0ZPKIuz6VB1zQ064fmUkmfizUJzDKOBQQ7mIjgi2OWKrIwx7AES5p4AdwpcZMPOMRTeHPMrG2JKWmLu-bR0xZx58AN2fTcLQ3HPLkw0cs9g69QX7bXoSs/s1600/ld33.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1074" data-original-width="1600" height="428" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitlk_bYJlL76pOQ-NJVhkJtF0ZPKIuz6VB1zQ064fmUkmfizUJzDKOBQQ7mIjgi2OWKrIwx7AES5p4AdwpcZMPOMRTeHPMrG2JKWmLu-bR0xZx58AN2fTcLQ3HPLkw0cs9g69QX7bXoSs/s640/ld33.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Stealing a satellite with the grabby claw I knew would come in handy.</td></tr>
</tbody></table>
<div style="text-align: left;">
Once these ships leave Kerbin orbit, there won't be any need for a Kerbin <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2hOdKD1UWiovwfyp38DNwR89tdp3Nxa3kNh2_PfY6UNRl6G2CKifiX2TeFg1b7Mpp_JBclb3M2psc0psmVdpcb-QsFjFEZ2ayQcToAXwVJteD_e6sJK_Rd5LQcyV6l9eHBdLJBmTQA1o/s1600/lc00.png">comms network</a> anymore, so I (literally) grabbed some of Kerbin's relay satellites with the last two colony ships. It is possible to create a remote piloting connection through a relay satellite in a grabby claw, something I find satisfyingly appropriate for Kerbal-style mission "planning". Anyway, I made a few round trips to Minmus surface to refuel the ships of the third wave and then that was it for Kerbin.</div>
<h4 style="text-align: left;">
<span style="color: red;">0 Days Remain</span></h4>
<div>
On Year 3, Day 0, time was up for the Kerbal home planet. The remaining population (of 432 Kerbals) was in flight, either on the way to Jool or at Minmus awaiting the third transfer window. No more hardware would be launched and the roughly four kilotons of ship and propellant in the fleet would have to become the Laythe colony. But it would still be almost another two years before the first colony ship arrived in the Jool system. Before that, the <a href="http://scolton.blogspot.com/2018/11/ksp-laythe-colony-part-2-robotic-fleet.html">robotic fleet</a> would have to lay the groundwork.</div>
<h4>
The Lonely Rovers</h4>
</div>
<div style="text-align: left;">
The 18 ships of the first Jool launch window arrived at their destination during the second half of Year 3. I set up the transfers such that the relay satellites would arrive first, since having a working comms net in the Jool system would be crucial to the rest of the mission. The RS3 ships and especially the ion engine satellites themselves have plenty of Δv to spare, so I just brute-forced them into useful coverage orbits around Jool and Laythe.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg21ZEICOelps5ws9ys2wSALrQVqrTeSg1k8RS_TmmsN6viGYa4g640WtIdjfQqmpZ7lg_1YseLDfdVBWxB_kQjwH3lgX7rdHua6Hl62GayBQcbi7JdqkxdMI-J5DNbGE0FpK2rP1xZIXY/s1600/ld35.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg21ZEICOelps5ws9ys2wSALrQVqrTeSg1k8RS_TmmsN6viGYa4g640WtIdjfQqmpZ7lg_1YseLDfdVBWxB_kQjwH3lgX7rdHua6Hl62GayBQcbi7JdqkxdMI-J5DNbGE0FpK2rP1xZIXY/s640/ld35.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The first relay satellites arrive at Jool. I'm definitely guilty of setting up the WiFi before unpacking...</td></tr>
</tbody></table>
<div style="text-align: left;">
For the remainder of the ships, though, the Δv budget was tight enough that I definitely wanted to grab Tylo gravity assists on the way in. This created a bit of traffic as several ships would hit the Tylo gateway within days, or sometimes hours, of each other. To get captured using a gravity assist, I aimed to pass "in front of" Tylo, so that its gravity mostly pulls in a direction opposite my orbit and I feed it some of my kinetic energy. After some refinement, I also was able to target a captured orbit with a periapsis similar to the orbital radius of Laythe. From there, it's easy to get a low-energy intercept on the next orbit with just a couple small correction burns at periapsis and apoapsis.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjY_gWcOy9LNLN1GMiKe7EYLR6ZCKihpm_GL0z_qGeoItvyV3KNaRIZa0QG9pPc-lkVD-xe_dp-JRqQ6JDVQ7F9wuVZBFtIm-P7GTMmbt3VsKRQMfAL7oC6F-yxvlDDqY1cERxLEJ3Hgdg/s1600/ld37.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="913" data-original-width="1600" height="364" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjY_gWcOy9LNLN1GMiKe7EYLR6ZCKihpm_GL0z_qGeoItvyV3KNaRIZa0QG9pPc-lkVD-xe_dp-JRqQ6JDVQ7F9wuVZBFtIm-P7GTMmbt3VsKRQMfAL7oC6F-yxvlDDqY1cERxLEJ3Hgdg/s640/ld37.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Busy airspace (or, spacespace?) around the Tylo gateway.</td></tr>
</tbody></table>
<div style="text-align: left;">
Using gravity assist captures off Tylo, or in a few cases off Laythe itself, my average Δv from low Kerbin orbit to low Laythe orbit was about <span style="color: yellow;">3475m/s</span>, with a tolerance of about ±350m/s. This is quite a bit below the 4360m/s you get from the <a href="https://forum.kerbalspaceprogram.com/index.php?/topic/87463-13-community-delta-v-map-26/">subway map</a>, which would have been cutting it very close for some of my ships. As it is, all of the robotic fleet made it to low Laythe orbit with fuel to spare and without having to do any aerobraking. Assuming all the Δv saved went into accelerating Tylo (and it wasn't on rails), its apoapsis would be raised by about 1nm.<br />
<br />
Getting to Laythe is not the same as landing on Laythe, though. It's a water world with only a few islands to target. I've <a href="https://scolton.blogspot.com/2014/03/ksp-mission-to-laythe-and-back.html">landed there before</a>, using a <a href="http://scolton.blogspot.com/2013/12/ksp-ascent-and-deorbit-simulators.html">custom deorbit burn tool</a> to target the island on the equator with the <a href="https://wiki.kerbalspaceprogram.com/wiki/File:Laythe_Topo_compressed.png">flattest terrain</a>. To hit that island, it makes sense to burn over the small island that's about 90º west of there. I set up each ship in a near-circular 100km equatorial orbit and then start a burn just as the ship passes over the coast of that island:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj95XxgNCrOfsdm30o1N6Y66SgUa-w9lYfbaEKrr4ZDWPK5NgZ5QTf60iZ_WgQHpXx6eAqNr2kU2EageOovkTtU8jXpN4Qa7Q0chrWG06gKM1L77PUiWjkZ7hB81cvyCR2dvwtXbbqiJ4U/s1600/ld44.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj95XxgNCrOfsdm30o1N6Y66SgUa-w9lYfbaEKrr4ZDWPK5NgZ5QTf60iZ_WgQHpXx6eAqNr2kU2EageOovkTtU8jXpN4Qa7Q0chrWG06gKM1L77PUiWjkZ7hB81cvyCR2dvwtXbbqiJ4U/s640/ld44.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Laythe deorbit burn over the small island on the equator, to hit the flat island about 90º to the east.</td></tr>
</tbody></table>
<div style="text-align: left;">
After the burn, the lander can ditch its propulsion module (which is mostly empty now and will burn up separately) and prep for entry. For the first phase of the landing, an inflatable heat shield protects the descent package from the initial atmospheric heating.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEie9KaEcmxmphN_4Z7oFyX4y7Vr1Xmr9_UlyU8GdQoWNZMXTzOu_Qapz68bz3DXt5IpI_qyDzPFS6riTfm1RR5R23L6IRXBtqEidKa2h77ULPXg7Z3KGW61TS0VfYoBgLzE7P1mxjIs4nk/s1600/ld46.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEie9KaEcmxmphN_4Z7oFyX4y7Vr1Xmr9_UlyU8GdQoWNZMXTzOu_Qapz68bz3DXt5IpI_qyDzPFS6riTfm1RR5R23L6IRXBtqEidKa2h77ULPXg7Z3KGW61TS0VfYoBgLzE7P1mxjIs4nk/s640/ld46.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Landing Phase 1: Using an inflatable heat shield to protect the payload while bleeding off some speed.</td></tr>
</tbody></table>
<div style="text-align: left;">
As the air gets thicker, the drag on the heat shield overcomes the ability of the reaction wheels to keep it facing forward, so the lander flips around. The fairing still provides thermal and aerodynamic protection for the payload, and the heat shield now becomes more of an air brake, bleeding off even more speed in preparation for the final descent.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiN4_IgFn1bNstYGI8GNlEQon8C9YUcYNZAhyphenhyphenZTwOCEVrqDgLzMuMdne_ppQIimjkia8i0HD9DGb3AxD_qrQAhj5QExGPX475zNp9D600JinQ9zoYbfBLSid-Ctpj_DITurNaP9Rm1p0r0/s1600/ld47.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiN4_IgFn1bNstYGI8GNlEQon8C9YUcYNZAhyphenhyphenZTwOCEVrqDgLzMuMdne_ppQIimjkia8i0HD9DGb3AxD_qrQAhj5QExGPX475zNp9D600JinQ9zoYbfBLSid-Ctpj_DITurNaP9Rm1p0r0/s640/ld47.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Landing Phase 2: The craft flips around, with the heat shield now acting as an air brake.</td></tr>
</tbody></table>
<div style="text-align: left;">
At about 3km AGL, the speed is low enough to jettison the fairing and deploy the main parachutes. The heat shield stays attached until the main chutes deploy, at which point it can be jettisoned in a controlled orientation so it doesn't crash back into the ship.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_jv4yFfzLA_CtwszAKmzHGP8MIJ42Ikperln0ENUUDWo_6CJdYUTLXrEBlTgKIbD60idScJTBvIjTYj-28EkutK5npXbIGczcB1POiMjI8IJN1DYwyQES5nKxdSsjGuZYkPlmK7ETFNQ/s1600/ld56.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_jv4yFfzLA_CtwszAKmzHGP8MIJ42Ikperln0ENUUDWo_6CJdYUTLXrEBlTgKIbD60idScJTBvIjTYj-28EkutK5npXbIGczcB1POiMjI8IJN1DYwyQES5nKxdSsjGuZYkPlmK7ETFNQ/s640/ld56.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Landing Phase 3: Fairing jettisoned, main chutes deployed, heat shield dropped.</td></tr>
</tbody></table>
<div style="text-align: left;">
Finally, at about 300m AGL, the descent engines kick in and bleed off the final bit of vertical velocity. They don't have much fuel, so the burn has to be timed pretty well. I use the AeroGUI's AGL indicator and the lander's shadow to judge it.</div>
</div>
<div style="text-align: left;">
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOgtS4KMV8JGy85lbezkWlOIKo9npq0mk0vJBrRPngNd9V0yElvsDH6wlqvFJDL7UfNDTn74oEqNQETiWEguAKGfFi-3-0hxRHnJbNGFzFYWJu3odJ8vf8VERyo7EOmP0k-ikUEXXeGvs/s1600/ld63.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOgtS4KMV8JGy85lbezkWlOIKo9npq0mk0vJBrRPngNd9V0yElvsDH6wlqvFJDL7UfNDTn74oEqNQETiWEguAKGfFi-3-0hxRHnJbNGFzFYWJu3odJ8vf8VERyo7EOmP0k-ikUEXXeGvs/s640/ld63.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Landing Phase 4: Powered descent. Kicks up a good amount of sand.</td></tr>
</tbody></table>
<div style="text-align: left;">
That's how things <i>should</i> go. But the first two landings were not quite perfect. I nearly overshot the landing zone on the first try, coming down less than 1km from the eastern shore. This is almost exactly where I landed my <a href="https://scolton.blogspot.com/2014/03/ksp-mission-to-laythe-and-back.html">first Laythe mission</a>, and I knew it was on a major slope. In the process of preparing for a potentially harrowing post-landing slide into the ocean, I forgot a few steps of the landing checklist and the descent engines didn't start up. The resulting ~15m/s impact was enough to break off the mining rig and fuel tank from the first LR1 rover down. But the drivetrain survived, so it could still act as a scout if it could get up the hill.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxQ9KxeAGPWji-UrMfO98KqttDP2XvYvnRSZ0GW1GzDsAP2PI8MZx7-MRX8fFCkqcmMYFfpZTk6GU0poTcvfFZU1Z7kwJTtfV5CLFib1eylQJrPFGh254IAhO_5cerj2BkR99sMX67KOY/s1600/ld41.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxQ9KxeAGPWji-UrMfO98KqttDP2XvYvnRSZ0GW1GzDsAP2PI8MZx7-MRX8fFCkqcmMYFfpZTk6GU0poTcvfFZU1Z7kwJTtfV5CLFib1eylQJrPFGh254IAhO_5cerj2BkR99sMX67KOY/s640/ld41.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The first (hard) landing on Laythe in this mission, dangerously close to the shore.</td></tr>
</tbody></table>
<div style="text-align: left;">
Having nearly overshot the landing zone into the ocean, I tweaked the deorbit burn a little (from 104m/s to 110m/s). However, this was a little too much tweaking and lander #2 wound up heading straight for the lake in the middle of this island. Luckily, this was a HAB1 lander, which has a little more fuel on board for the powered final descent. I managed to just barely hover-translate to the cliff edge overlooking the lake's eastern shore with no fuel to spare.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjO4JWN1kPAGInozXCb9M2uFQNd28OSWXN94QLYFzGRsvktr5T7ilCNGpVGyA2YIU86ex7W1GN1LInGWJI-gc_4MXYHJ9S1JnjG7DI6CxOw_zc6hM7RhMVDvUHExmgkYp_l2lvlpVwjiMU/s1600/ld48.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjO4JWN1kPAGInozXCb9M2uFQNd28OSWXN94QLYFzGRsvktr5T7ilCNGpVGyA2YIU86ex7W1GN1LInGWJI-gc_4MXYHJ9S1JnjG7DI6CxOw_zc6hM7RhMVDvUHExmgkYp_l2lvlpVwjiMU/s640/ld48.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Landing #2 involved some last-second piloting to steer away from the lake to the edge of a cliff.</td></tr>
</tbody></table>
<div style="text-align: left;">
Those two landings gave me the upper and lower limits for the deorbit burn. I used 108m/s as the burn for the remaining 12 landers, and they all touched down safely on the relatively flat land between the lake and the eastern shore.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3mC7UP0pDPcrac8Iz3plukG610mBv4dJH0KhSoB0Y5FUAj5DHvFqmkonBeIB60BSVftd7zw5zrMWhftNWbukcR2uoP_p6-QHDO0ow1HXVqhZihEuw3SVZZASMKtaNIUD-SWjrLVZd-RM/s1600/ld68.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3mC7UP0pDPcrac8Iz3plukG610mBv4dJH0KhSoB0Y5FUAj5DHvFqmkonBeIB60BSVftd7zw5zrMWhftNWbukcR2uoP_p6-QHDO0ow1HXVqhZihEuw3SVZZASMKtaNIUD-SWjrLVZd-RM/s640/ld68.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Typical landing zone after dialing in the exact deorbit burn.</td></tr>
</tbody></table>
<div style="text-align: left;">
I say relatively flat because it's still filled with sand dunes. They're no problem for the 6- and 8-wheeled rovers, but I need a 1-2km stretch of actually flat terrain to use as a space plane runway. I scouted for a while before settling on the strip marked out by the pink markers in the landing photo above. It's about 1.5km long and 300m wide, near the equator, and aligned well for west-to-east landings. It's completely flat in the crosswind direction and slightly sloped upward in the "upwind" landing direction. I'd prefer something flat in all directions, but this is the next best thing.<br />
<br />
By the end of the first wave, I could place the landers with about ±2km accuracy from orbit. But they are rovers, so it's easy to reposition them as needed. The LR1s all grouped together to form the corners of the runway, acting as visible markers for the space planes on approach. They're needed at the runway for refueling anyway, so this seems like the best place for them. In order to avoid excessive part counts in one location, I decided to move the HAB1s, the colony habitats, away from the runway and toward the lake. There, they could be assembled into housing groups.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxzB8UkKMZW7LFjmK6TtT1Tj_QeDPpLbfr7txzUAEPI8zkm6W455eETNTOxXzYUcdeOr9Ixmu3JJFqkTi7q4LxjFgVgELyzN6kIKoAUdKA1n4nOgOaH69Ppz7T8y_wlNVmtItTCfYIyNc/s1600/ld72.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxzB8UkKMZW7LFjmK6TtT1Tj_QeDPpLbfr7txzUAEPI8zkm6W455eETNTOxXzYUcdeOr9Ixmu3JJFqkTi7q4LxjFgVgELyzN6kIKoAUdKA1n4nOgOaH69Ppz7T8y_wlNVmtItTCfYIyNc/s640/ld72.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Setting up some modular housing on the dunes.</td></tr>
</tbody></table>
<div style="text-align: left;">
It's not a metropolis, but having a mobile and reconfigurable colony seems ideal on the sand dunes of an otherwise pretty desolate water planet. In total, 13.5 of the 14 rovers in the robotic fleet made it to the surface, and all 14 were able to find their way to each other and remotely set up the infrastructure for a colony. It'll be another year before the colony ships arrive in the second wave, but when they do, they'll have a place to stay - with a nice view.</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgymGtE5vG0W0Uvui6kThqT5mG0jZDrXEsoRhSNvXjp5Kzx3ILcYiS6x52JQlTar4vqN58E2-T46dcJT69U5z6iMPVARk3MLEwE5sIStOv8nbKqVOTwZS9jYhJz84xMDtngqD0VcN-dObk/s1600/ld62.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgymGtE5vG0W0Uvui6kThqT5mG0jZDrXEsoRhSNvXjp5Kzx3ILcYiS6x52JQlTar4vqN58E2-T46dcJT69U5z6iMPVARk3MLEwE5sIStOv8nbKqVOTwZS9jYhJz84xMDtngqD0VcN-dObk/s640/ld62.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><br /></td></tr>
</tbody></table>
</div>
</div>
</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com0tag:blogger.com,1999:blog-8200098102909041178.post-59593079522671149692019-09-28T17:10:00.004-04:002019-09-29T11:49:01.004-04:00Fast atan2() alternative for three-phase angle measurement.Normally, to get the phase angle of a set of (assumed balanced) three-phase signals, I'd do a <a href="https://en.wikipedia.org/wiki/Alpha%E2%80%93beta_transformation">Clarke Transform</a> followed by an <a href="https://en.wikipedia.org/wiki/Atan2">atan2</a>(β,α). This could be atan2f(), for single-precision floating-point in C, or some other <a href="http://www-labs.iro.umontreal.ca/~mignotte/IFT2425/Documents/EfficientApproximationArctgFunction.pdf">approximation</a> that trades off accuracy for speed. The crudest (and fastest) of these is a first-order approximation <span style="color: yellow; font-family: "times" , "times new roman" , serif; font-size: medium;">atan(x) ≈ (π/4)·x</span> <span style="text-align: center;"><span style="font-family: inherit;">which has a maximum error of ±4.073º over the range {-1 ≤ x ≤ 1}</span>:</span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvw8qUwRAtPlaAZcH8JMjufmZ1d-pJPXQNuv2wG-aitOq3NWKhj2rsHnBRhATuArbg3EuPjh4D7LRZa0716QdWQF0EMdpXhANnhldUkGbNTb50BXgjloXi5LV-hl62wwjphsy2dS_VvDc/s1600/at00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1455" data-original-width="1377" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvw8qUwRAtPlaAZcH8JMjufmZ1d-pJPXQNuv2wG-aitOq3NWKhj2rsHnBRhATuArbg3EuPjh4D7LRZa0716QdWQF0EMdpXhANnhldUkGbNTb50BXgjloXi5LV-hl62wwjphsy2dS_VvDc/s400/at00.png" width="377" /></a></div>
Interestingly, this isn't the best (<span style="color: #76a5af;">minimax</span> or <span style="color: #8e7cc3;">least mean square</span>) linear fit over that range. But it's pretty good and has zero error on both ends, so it can be stitched together into a continuous four-quadrant approximation that covers all finite inputs to the two-argument atan2(β,α):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8Qvvfc7NXcKoElJBU5tuXGaQtNfK0QEj2qEYOpIYXJU1PRspMaRIZAXzM-oXTjyVee-rz3SpQDozIo1DPCuLhGJO2Ilwmo2lVf33dACTinpJn40QMXca-1woxbkxZ0ZVqymh-F1VMcww/s1600/at02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1287" data-original-width="1504" height="341" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8Qvvfc7NXcKoElJBU5tuXGaQtNfK0QEj2qEYOpIYXJU1PRspMaRIZAXzM-oXTjyVee-rz3SpQDozIo1DPCuLhGJO2Ilwmo2lVf33dACTinpJn40QMXca-1woxbkxZ0ZVqymh-F1VMcww/s400/at02.png" width="400" /></a></div>
One common implementation determines the quadrant based on α and β and then runs the linear approximation on either <span style="font-family: inherit;">x = <span style="color: #ea9999;">β/α</span> or x = <span style="color: #9fc5e8;">α/β</span></span>, whichever is in the range {-1 ≤ x ≤ 1} in that quadrant. The combination of a quadrant offset and the local linear approximation determines the final result.<br />
<br />
It's possible to extend this method to three inputs, a set of three-phase signals assumed to be balanced. Instead of quadrants, the input domain is split based on the six possible sorted orders of the three-phase signals. Within each sextant, the middle input (the one crossing zero) is divided by the difference of the other two to form a normalized input, analogous to selecting x = β/α or x = α/β in the atan2() implementation:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwENr2oXMNxm8YH40P7DZhpO_vuHfmWm9Z-8kNdPJSzR4u4D9AFzJMEd3q268Ol8ITjhhejeHF8OYn7pCNxbuYvdUsk36bvXbtEtAvouMep8KKqoVqE_a3cL-skYMpfOQcSidRyCer204/s1600/at03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="788" data-original-width="1600" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwENr2oXMNxm8YH40P7DZhpO_vuHfmWm9Z-8kNdPJSzR4u4D9AFzJMEd3q268Ol8ITjhhejeHF8OYn7pCNxbuYvdUsk36bvXbtEtAvouMep8KKqoVqE_a3cL-skYMpfOQcSidRyCer204/s400/at03.png" width="400" /></a></div>
This normalized input, which happens to range from -1/3 to 1/3, is multiplied by a linear fit constant to create the local approximation. To follow the pattern of the four-quadrant approximation, a constant of <span style="font-family: "times" , "times new roman" , serif; font-size: medium;"><span style="color: cyan;">π/2</span></span><span style="color: yellow; font-family: "times" , "times new roman" , serif; font-size: large;"> </span>gives a fit that's not (minimax or least mean square) optimal, but stitches together continuously at sextant boundaries. As with the atan2() implementation, the combination of a sextant offset and the local approximation determine the final result.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzhwuEOI4Y9nrqC6a6uaCdJ0ocH8ldrAkok-nRpErxvSPfaSBXz4dtnvQjapRwhCqNXZ6dbCJFB4fJn5-rK2OQD3p2_SmrNmK-0pKqrtd3bPavKFMRn0xodcpG-ntzq0RNq7AjQmoRzGY/s1600/at04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="802" data-original-width="1600" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzhwuEOI4Y9nrqC6a6uaCdJ0ocH8ldrAkok-nRpErxvSPfaSBXz4dtnvQjapRwhCqNXZ6dbCJFB4fJn5-rK2OQD3p2_SmrNmK-0pKqrtd3bPavKFMRn0xodcpG-ntzq0RNq7AjQmoRzGY/s400/at04.png" width="400" /></a></div>
For this three-phase approximation the maximum error is ±1.117º, significantly lower than the four-quadrant approximation. If starting from three-phase signals anyway, this method may also be faster, or at least nearly the same speed. The conditional section for selecting a sextant is more complex, but there are fewer intermediate math operations. (Both still have the single pesky floating-point divide for normalization.)<br />
<br />
To put this to the test, I tried directly computing the phase of the three <a href="https://scolton-www.s3.amazonaws.com/motordrive/sensorless_gen1_Rev1.pdf">flux observer</a> signals on <a href="https://scolton.blogspot.com/search/label/TinyCross">TinyCross</a>'s dual motor drive. This usually isn't the best way to derive sensorless rotor angle: An angle tracking observer or PLL-type method can do a better job at filtering out noise by enforcing physical bandwidth constraints. But for this test, I just compute the angle directly using either atan2f(β,α) or one of the two approximations above.<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigJqcxMinj6mdQRLP3sl7CLlvzqjOZJQ6bBcpej2on2vw49JofkijKuFiyDMsm-V-Uis5AUaBoU-qrd6yMP77A-UdGXWM1LJ4mWy9KSCZRNE_83pVXnrf8nsQH20c9h5UXWw-xKTz47R4/s1600/at06.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="137" data-original-width="706" height="77" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigJqcxMinj6mdQRLP3sl7CLlvzqjOZJQ6bBcpej2on2vw49JofkijKuFiyDMsm-V-Uis5AUaBoU-qrd6yMP77A-UdGXWM1LJ4mWy9KSCZRNE_83pVXnrf8nsQH20c9h5UXWw-xKTz47R4/s400/at06.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Computation times for different angle-deriving algorithms.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
The three-phase approximation does turn out to be a little faster in this case. To keep the comparison fair, I tried to use the same structure for both approximations: the quadrant/sextant selection conditional runs first, setting bits in a 2- or 3-bit code. That code is then used to look up the offset and the numerator/denominator for the local linear approximation. This is running on an STM32F303 at 72MHz. The PWM loop period is 42.67μs, so a 1.5-2.0μs calculation per motor isn't too bad, but every cycle counts. It's also a "free" accuracy improvement:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEid8xxBiqeaw7Yrp3zkiE00thJ27ef5PaEto6u1RA51X3zxJ6pQFhMnAw9ROwL_H4kaHDXfG1k6kXrO86q4AdKMHAI5YJzXxpcuLK7NYT6WiN8XVn8baGM5vhELUwwpUUDtgsV_iUtF9Po/s1600/at05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="990" data-original-width="1600" height="396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEid8xxBiqeaw7Yrp3zkiE00thJ27ef5PaEto6u1RA51X3zxJ6pQFhMnAw9ROwL_H4kaHDXfG1k6kXrO86q4AdKMHAI5YJzXxpcuLK7NYT6WiN8XVn8baGM5vhELUwwpUUDtgsV_iUtF9Po/s640/at05.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div style="text-align: left;">
The ±4º error ripple in the four-quadrant approximation shows up clearly in real data. The smaller error in the three-phase approximation is mostly lost in other noise. When the error is taken with respect to a post-computed atan2f(), the four-quadrant approximation looks less noisy. But I think that's just an artifact of the comparison. When considering error with respect to an independent angle measurement (from Hall sensor interpolation), they show similar amounts of noise.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I don't have an immediate use for this, since TinyCross is primarily sensored and the flux signals are already <a href="http://scolton.blogspot.com/2019/09/tinycross-first-test-drive-and.html">synchronously logged</a> (for diagnostics only). But clock cycle hunting is a fun hobby.</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com5tag:blogger.com,1999:blog-8200098102909041178.post-90971080505662875162019-09-24T01:34:00.000-04:002019-09-24T01:38:04.940-04:00TinyCross: First Test Drive and Synchronous Data Logging<div class="separator" style="clear: both; text-align: left;">
With the front wheel drive complete and the steering wheel control board working, it's finally time for a first test drive:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/LhocJXDgiWc" width="640"></iframe>
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I've been waiting over a year to see if this mountain bike air shock suspension setup would work, and it looks like it does! I haven't done any tuning on it besides setting the preload, but it handles my pretty beat up parking lot nicely, absorbing bumps that would have broken <a href="http://scolton.blogspot.com/p/cap-kart.html#tinykart">tinyKart</a> in minutes. The steering linkage also seems okay, with good travel and minimal bump steer. There are still some minor mechanical improvements I want to make, but it's nice to see the suspension concept in action after all this time.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I started with front wheel drive so I could see if the motor drive had any Flame Emitting Transistors, but happily it did not. It's the same <a href="https://scolton.blogspot.com/2018/09/tinycross-electron-control-unit.html">gate drive design</a> that I use on everything and it always just works, so I shouldn't be surprised anymore. But I am asking a lot of the <a href="https://www.onsemi.com/products/discretes-drivers/mosfets/fdmt80080dc">FDMT80080DC</a> FETs (just one per leg), so I'm working my way up to 120A (peak, line-to-neutral) phase current incrementally. The above test is at 80A and the FETs seem happy, although the motors do get pretty warm already. They might need some i²t thermal protection to handle 120A peaks.</div>
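For reference, the kind of i²t protection I have in mind is just a leaky integrator on current squared; a minimal sketch, with placeholder numbers rather than tuned values:

```c
/* Minimal i²t thermal limiter sketch: integrate (i² - i_cont²), clamp at
 * zero, and trip a current fold-back when the accumulator exceeds a
 * threshold. The continuous rating and threshold here are placeholders,
 * not tuned values for these motors. */
typedef struct {
    float acc;      /* accumulated A²·s above the continuous rating */
    float i_cont;   /* continuous current rating (A) */
    float limit;    /* trip threshold (A²·s) */
} i2t_t;

/* Call periodically with the present current; returns 1 if the peak
 * command should be folded back toward i_cont. */
int i2t_update(i2t_t *s, float i_amps, float dt)
{
    s->acc += (i_amps * i_amps - s->i_cont * s->i_cont) * dt;
    if (s->acc < 0.0f) s->acc = 0.0f;   /* can't bank "coldness" */
    return s->acc > s->limit;
}
```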
<h4 style="clear: both; text-align: left;">
Synchronous Data Logging</h4>
<div>
One of the early lessons I learned in building motor drives is to always log data. Nothing ever works perfectly on the first try, but having data logging built in from the start is the best way I know of to quickly diagnose problems. A lot of the stuff that happens in a motor drive is faster than typical data logging can capture, but a lot of it is also periodic. By synchronizing the data collected to the rotor electrical angle, it's possible to reveal detailed periodic signals even with relatively low frequency (50Hz) logging to an SD card. As a quick example, here's a standard data logger plot of motor phase currents over time:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixsxfaw_SDVR2xjCn_-aJO3m0tR-P74d-mY8kgu4JxJsYLSjYsg-AIgdfoFpNFlVF5kt8IPTB4fvZUwO8XWpe5FOMygd9IyJLh4FslU40_XnXKsuF1wlCET2uG4ga9hdMoz7o62jmrPvc/s1600/19-09-15_03.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="640" data-original-width="989" height="412" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixsxfaw_SDVR2xjCn_-aJO3m0tR-P74d-mY8kgu4JxJsYLSjYsg-AIgdfoFpNFlVF5kt8IPTB4fvZUwO8XWpe5FOMygd9IyJLh4FslU40_XnXKsuF1wlCET2uG4ga9hdMoz7o62jmrPvc/s640/19-09-15_03.gif" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Phase current vs. time, pretty boring.</td></tr>
</tbody></table>
<div style="text-align: left;">
This type of plot shows the drive cycle, with periods of high current during acceleration (or braking) and periods of near zero current when coasting or stopped. And it shows that phase currents sometimes exceed 100A even with an 80A command. But a plot of Q-axis (torque-producing) current, which is already synchronous, could give a better summary of this information. The time resolution (40ms) isn't fine enough to show the AC signals.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
However, each set of three phase currents is also stamped with a rotor electrical angle measured at the same time (within about 10 microseconds). Cross-plotting the phase currents against their angle stamp, instead of against time, reveals a much more interesting view of the data:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZMB65a0k1L5YGUURaIMXe53z9EyzJVokj6xGHlqSRXx2Wv_JjNpzeP5TBplPW2jPqyVq5m_w5jX88nhGEGV4_iA5GEH6lREr7PqGBJfpC6h1QFKW5KURatQxFoifVIb6DttF3AskSiJU/s1600/19-09-15_04.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="640" data-original-width="989" height="414" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZMB65a0k1L5YGUURaIMXe53z9EyzJVokj6xGHlqSRXx2Wv_JjNpzeP5TBplPW2jPqyVq5m_w5jX88nhGEGV4_iA5GEH6lREr7PqGBJfpC6h1QFKW5KURatQxFoifVIb6DttF3AskSiJU/s640/19-09-15_04.gif" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Same data, different meaning.</td></tr>
</tbody></table>
<div style="text-align: left;">
Now it's possible to see the three phase current waveforms separated by 120edeg. The peaks are at 0º (Phase A), 120º (Phase B), and -120º (Phase C). There are also negative peaks at the same angles, where braking is occurring. Most interestingly, the shape of the current waveforms at 80A peak is revealed to be asymmetric and far from sinusoidal.</div>
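The angle stamp itself adds almost nothing to the log record: each slow-rate sample set just carries the rotor electrical angle captured in the same fast-loop pass. A sketch (the field widths are my guess at a reasonable layout, not the actual logger format):

```c
#include <stdint.h>

/* One synchronous log record: phase currents plus the rotor electrical
 * angle captured in the same PWM-loop pass. Field widths are assumptions,
 * not the actual logger format. */
typedef struct {
    uint16_t angle_e;    /* electrical angle, [0, 65536) = [0°, 360°) */
    int16_t  ia, ib, ic; /* phase currents, fixed-point amps */
} log_rec_t;

/* Post-processing: cross-plot current vs. angle instead of vs. time by
 * converting the stamp to degrees in (-180, 180]. */
float angle_deg(uint16_t angle_e)
{
    float deg = (float)angle_e * (360.0f / 65536.0f);
    return (deg > 180.0f) ? deg - 360.0f : deg;
}
```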
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The angular resolution of this type of waveform capture is only limited by the angle measurement, regardless of logging frequency. By contrast, the fastest it would be possible to log a continuous waveform would be at the PWM frequency (23.4kHz, in this case), which gives a speed-dependent angular resolution of 11.3edeg per 1000rpm. It would become difficult to resolve the shape of the current waveform at high speeds. There's always a trade-off, though: Synchronizing low-speed log data with angle stamps is only able to show the average shape of long-term periodic signals. It would not catch a glitch in a single cycle of the phase currents.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
While the phase current shape is interesting, the position of the peaks is just a consequence of the current controller. Zero electrical degrees is defined (by me, arbitrarily) as the angle at which Phase A's back EMF is at a peak. The current controller <a href="http://scolton.blogspot.com/2009/11/everything-you-ever-wanted-to-know.html">aligns the Phase A current with the Phase A back EMF</a> for maximum torque per amp. So the phase current plot shows that the current controller is doing its job. This information is also captured by the already-synchronous Q-axis and D-axis current signals:</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQDYxQFBZ9e7pwu4marUZWAof4RxOUWNSfCEDdcxWJrPAFGekBYBm7Kf9rwit61dkBRiyQsireaiahMlWEa61WRTLw59heyUdre71ZyK_zkSHHN_enJHEQAzpesXox5v4Qb8qv9KF5NnU/s1600/19-09-15_05.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="973" data-original-width="1600" height="388" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQDYxQFBZ9e7pwu4marUZWAof4RxOUWNSfCEDdcxWJrPAFGekBYBm7Kf9rwit61dkBRiyQsireaiahMlWEa61WRTLw59heyUdre71ZyK_zkSHHN_enJHEQAzpesXox5v4Qb8qv9KF5NnU/s640/19-09-15_05.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Q-axis and D-axis current plotted against time.</td></tr>
</tbody></table>
<div style="text-align: left;">
The Q-axis current represents torque-producing current, aligned with the back EMF, and is the current being commanded by the throttle input. The D-axis current is field-augmenting (or weakening, if negative) current and doesn't contribute to torque production. In this case, the current controller seeks to hit the desired Q-axis current and keep the D-axis current at zero. It does this by varying the voltage vector applied to the motor. More on this later. The Q-axis and D-axis currents are rotor-synchronous values, so they already convey the magnitude and phase of the phase currents, just not the actual shape.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
All of this is based on the assumption that the measured rotor angle is correct, i.e. properly defined with respect to the permanent magnets. On this kart, I'm using magnetic rotary sensors mounted to the motor shafts that communicate the rotor angle to the motor controller via SPI and <a href="http://scolton.blogspot.com/2018/09/tinycross-electron-control-unit.html">optically-isolated emulated Hall sensor signals</a>. But it's also possible to measure the rotor angle with a <a href="http://scolton.blogspot.com/search/label/sensorless">flux observer</a>, as long as the motor is spinning sufficiently fast. I have this running in the background, logging flux estimates for each phase.<br />
<br />
Again, plotting flux against time doesn't give a whole lot of information. It's interesting to see the observer converge as speed increases from zero at the start, and the average amplitude of about 5mWb is consistent with the motor's rpm/V constant and measured back-EMF. But the real value of this data comes from cross-plotting against the sensor-derived rotor angle:</div>
<div style="text-align: left;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_Jsb_4YxJtafSB_i-7hNrGgaqj8zyfyF7h-yECBadBkJuEDuUhFiZCVujQk2f-7m_06ycEHDJbLc2hey7nsH03i2OzQLl0WJjv1UHYcPDcNM1I-gmkntlaU0L-aN_aSoyl1YEhn3YnVM/s1600/19-09-15_07.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="668" data-original-width="1002" height="426" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_Jsb_4YxJtafSB_i-7hNrGgaqj8zyfyF7h-yECBadBkJuEDuUhFiZCVujQk2f-7m_06ycEHDJbLc2hey7nsH03i2OzQLl0WJjv1UHYcPDcNM1I-gmkntlaU0L-aN_aSoyl1YEhn3YnVM/s640/19-09-15_07.gif" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Cross-plotting against sensor-derived electrical angle shows substantial offset between the two motors.</td></tr>
</tbody></table>
</div>
<div style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
<div style="text-align: left;">
The flux from Phase A should cross zero when its back EMF is at its peak, i.e. at an electrical angle of 0º in my arbitrarily-defined system. So, the front-right motor is more correct. The front-left is offset by about 30-45edeg, which is enough to start causing significant differences in torque. Indeed I had noticed some torque steer during the first test drives, which is what prompted me to do the sensor/sensorless angle comparison in the first place.</div>
<div style="text-align: left;">
<br />
Since I have all three phases of flux, I can estimate the flux vector angle with some math and compare it to the sensor-derived rotor angle:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilMWNbv2VKWFrRm_Fv6vNP6l3CUEpqNTyttqyV5PGIiZUkB8zjGxfW2vSkwgSlUryvg9HXIjgK54BIyPtg4HZ8snLR7_2sA9dDAlAZ0bJiClrr7VATWPbkGT16fErAMhqPQ29ZfHi2pYw/s1600/19-09-15_08.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1267" data-original-width="1600" height="506" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilMWNbv2VKWFrRm_Fv6vNP6l3CUEpqNTyttqyV5PGIiZUkB8zjGxfW2vSkwgSlUryvg9HXIjgK54BIyPtg4HZ8snLR7_2sA9dDAlAZ0bJiClrr7VATWPbkGT16fErAMhqPQ29ZfHi2pYw/s640/19-09-15_08.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Digging into flux angle offset of the front-left motor a little more.</td></tr>
</tbody></table>
Both motors have some variation in flux angle offset, but the front-left varies more and is further from the nominal 90º. Except...when it's not. There are two five-second intervals where the average offset of the front-left flux looks like it returns to nearly 90º, both occurring either during or just after applying negative current. However, there's one more negative current pulse, earlier in the test drive, that does not have a flux angle shift. My troubleshooting neural network has been trained over many project iterations to interpret this as the signature of a mechanical problem.<br />
<br />
Sure enough, I was able to grab the rotor of the front-left motor and twist with hand strength only (&lt; 5Nm) enough to make the shaft move relative to the rotor can. It only moved about 5º, but that's 35edeg, which is about the offset I had been seeing in the data. The press fit had failed and it was relying on back-up set screws on flats to keep from completely slipping. I suspect this won't be the last motor to fail in this way. I pressed out the shaft, roughed up the surface a little, and pressed it back in with some Loctite 609. I also drilled a hole in the back that can potentially be tapped as a back-up plan. And finally I recalibrated everything and marked the shaft so I'll know if it slips again.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkL9RaJJVv1iEASxGXyQ2DKQ5tAjyr4UhKjtQK_Is3sJWa02GJ-YDHCV97y61Hy0s9uMAfHu7LJCoQz-BNHHtoCOy2dybYt26TRw-E3T5OFoui0Im0M6ixqSnxJAyjZSFoXcOwAG0ra18/s1600/tc74.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkL9RaJJVv1iEASxGXyQ2DKQ5tAjyr4UhKjtQK_Is3sJWa02GJ-YDHCV97y61Hy0s9uMAfHu7LJCoQz-BNHHtoCOy2dybYt26TRw-E3T5OFoui0Im0M6ixqSnxJAyjZSFoXcOwAG0ra18/s640/tc74.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Reworked shaft, with a 1/4-20 tap drill (not going to tap it unless I absolutely have to), roughed surface, and press-fit augmented with Loctite 609, which should be good up to 25Nm for this surface area (4-5x margin).</td></tr>
</tbody></table>
<div style="text-align: left;">
After a few more test drives, it looks like it's holding. The front-left flux vs. sensor-derived angle looks much closer to the correct phase as well:</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVYOEUzUaEgFoi7LH0IjS5Sk54c0DGeHfKM6nprrQXp1C97rN_NpFFJpY5h58Tz4Ss84qDcQlo2wyDRHXlWQVOSjilfUEW5ALjblxCXoqpq_YXPRvdchN3ax6ac0UXskNPxbj0n1vFTAM/s1600/19-09-22_00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1110" data-original-width="1600" height="444" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVYOEUzUaEgFoi7LH0IjS5Sk54c0DGeHfKM6nprrQXp1C97rN_NpFFJpY5h58Tz4Ss84qDcQlo2wyDRHXlWQVOSjilfUEW5ALjblxCXoqpq_YXPRvdchN3ax6ac0UXskNPxbj0n1vFTAM/s640/19-09-22_00.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Phase A flux vs. sensor-derived angle after shaft rework.</td></tr>
</tbody></table>
<div style="text-align: left;">
There's still a +/-10edeg offset from nominal, which could be from calibration accuracy or static biases like normal shaft twisting. It might be worth investigating more, but it's not enough offset to create any noticeable torque steer on the front wheel drive, so I'm satisfied for now. I will preemptively do the same rework on the remaining three motor shafts.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
One other interesting cross-plot to look at is the Q- and D-axis voltage as a function of speed. I mentioned above that the current controller attempts to align the current vector with the back EMF vector by manipulating the voltage vector, the basis of field-oriented control. Due to the electrical time constant (L/R) of the motor, the voltage must lead the back EMF by a varying amount. This shows up as negative D-axis voltage increasing in magnitude with speed (and current).</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieyd05zY2tRt3PfER4qJxwqXySs7pi0LA1mUQL5zq_tBitikMfd2CvyYgSZRXYobtWdJ8fq8-x5yLY2GgFwhV2MI7oU0bdpu0voHb6tvZS8nt3eGl7La7CeHSsDsWPOwDgNmp9Xuj3Pk8/s1600/19-09-22_01.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="628" data-original-width="901" height="446" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieyd05zY2tRt3PfER4qJxwqXySs7pi0LA1mUQL5zq_tBitikMfd2CvyYgSZRXYobtWdJ8fq8-x5yLY2GgFwhV2MI7oU0bdpu0voHb6tvZS8nt3eGl7La7CeHSsDsWPOwDgNmp9Xuj3Pk8/s640/19-09-22_01.gif" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Jitter 3D plot of the voltage vector operating curve.</td></tr>
</tbody></table>
<div style="text-align: left;">
At 80A and 2500erad/s (~3400rpm and ~27mph), the voltage vector is already leading by 45º, with 12V on both axes. This gives me a rough estimate for the motor's synchronous inductance.</div>
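Specifically, with the D-axis current held at zero, the steady-state D-axis voltage is Vd = −ω·L·Iq, so the numbers above imply a synchronous inductance of roughly 60µH (a back-of-envelope sketch, neglecting transients):

```c
/* Back-of-envelope synchronous inductance from the FOC steady state:
 * with Id regulated to zero, Vd = -w*L*Iq, so L = |Vd| / (w * Iq).
 * Operating point from the data above: |Vd| = 12V at 2500 erad/s, 80A. */
float sync_inductance(float vd_mag, float w_e, float iq)
{
    return vd_mag / (w_e * iq);   /* henries */
}
/* sync_inductance(12.0f, 2500.0f, 80.0f) ≈ 60e-6, i.e. ~60 µH */
```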
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOzyfo1nbhQXFL2I1ww8tZ2R_Hmzx8QlAeDBph_OV0A6hyyN88vXn3eT3SzhEzlpsucHBFU4u1wNfsjOSkGOCXCTPU6JPlI4POh0Zbnx-US0kskJwVPVDxk1-kigY15ld0YrUa-oySc_o/s1600/19-09-22_01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="193" data-original-width="514" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOzyfo1nbhQXFL2I1ww8tZ2R_Hmzx8QlAeDBph_OV0A6hyyN88vXn3eT3SzhEzlpsucHBFU4u1wNfsjOSkGOCXCTPU6JPlI4POh0Zbnx-US0kskJwVPVDxk1-kigY15ld0YrUa-oySc_o/s320/19-09-22_01.png" width="320" /></a></div>
<div style="text-align: left;">
Along with the measured resistance (32mΩ) and flux amplitude (5mWb), this is all that's required for a first-order motor model, and thus a torque-speed curve. Running this through the gear ratio, the force-speed curve at the ground should look something like:</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3yD9wSn6oQjTUOJtNVwEjcZp83CIrl2onR1-xWDDw5HxbkWlW7rGR8J6AkLefAPh-BdOxi8aYAEl6rry2KcS49mL2RrRyMVY-jFV2EBEVi2HC-ZppP6z6amAI8jtarxp33AALz3kZu-Y/s1600/19-09-22_03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="718" data-original-width="1526" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3yD9wSn6oQjTUOJtNVwEjcZp83CIrl2onR1-xWDDw5HxbkWlW7rGR8J6AkLefAPh-BdOxi8aYAEl6rry2KcS49mL2RrRyMVY-jFV2EBEVi2HC-ZppP6z6amAI8jtarxp33AALz3kZu-Y/s640/19-09-22_03.png" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
The inductance has a large impact on the maximum speed at which 120A can be driven into the motor in-phase with the back EMF. This determines the maximum power, since above this speed the force drops off faster than the speed increases. The top speed is wherever on the curve the drag forces equal the motor force, probably in the 40-45mph range. This is all without using third harmonic injection, which gives an extra 15% voltage overhead (at the cost of higher peak battery power, of course). If I do turn that on, it will probably come with a gear ratio change to put that extra 15% toward more torque, not more speed.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
That's all I wanted to check before building up the second motor controller for the rear wheel drive. I'm very eager to see how it handles with 4WD, and how close to this force-speed curve I can actually get.</div>
</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com2tag:blogger.com,1999:blog-8200098102909041178.post-7612329123232298322019-09-04T22:57:00.000-04:002019-09-05T11:59:59.038-04:00CMV12000 Full-Speed (38.4Gb/s) Read-In on Zynq Ultrascale+In my original <a href="http://scolton.blogspot.com/2019/06/freight-train-of-pixels.html">Freight-Train-of-Pixels</a> post, I explored three main challenges of building a 3.8Gpx/s imager: the source, the pipe, and the sink. Working backwards, the sink is an NVMe SSD that (<a href="http://scolton.blogspot.com/2019/07/benchmarking-nvme-through-zynq.html">hopefully</a>) will be capable of 1GB/s writes. The pipe is a ~5:1 wavelet compression engine that has to make 3.8Gpx/s = 1GB/s in realtime, with minimal effect on image quality. And the source is the CMV12000 image sensor that relentlessly feeds pixel data into this machine. This post focuses on the source, and specifically the read-in mechanism implemented on a Zynq Ultrascale+ SoC for the 38.4Gb/s of LVDS data from the sensor.<br />
<h4>
Physical Interface</h4>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgz0zwfs81tIQ_BDScivkxC7-6gKKkSQjB3vKaglN6JqnAObD6-XptWiG-85OcBtQKeGWkvXeiEtymVce5064Pp6Xo6UHGlhnIyUA4HrPNK_FdU1LRu8xJNhKJGXONDFrOlh_qrrGNSpik/s1600/c04.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1098" data-original-width="1600" height="273" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgz0zwfs81tIQ_BDScivkxC7-6gKKkSQjB3vKaglN6JqnAObD6-XptWiG-85OcBtQKeGWkvXeiEtymVce5064Pp6Xo6UHGlhnIyUA4HrPNK_FdU1LRu8xJNhKJGXONDFrOlh_qrrGNSpik/s400/c04.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Source: Breaking out the CMV12000's 64 LVDS pairs was interesting.</td></tr>
</tbody></table>
<div style="text-align: left;">
The pixel data interface on the CMV12000 is 64 LVDS pairs, each operating at (up to) 300MHz DDR (600Mb/s). In the context of FPGA I/O interfaces, 300MHz DDR is really not that fast. It's just a lot of inputs. Most Zynq Ultrascale+ SoCs have enough LVDS-capable package pins to do this, but it took some searching to find a carrier board that breaks out enough of them to headers. I'm using the <a href="https://shop.trenz-electronic.de/en/Products/Trenz-Electronic/TE08XX-Zynq-UltraScale/TE0803-Zynq-UltraScale/">Trenz Electronic TE0803</a>, specifically the ZU4CG version, which breaks out a total of 72 HP LVDS pairs from the ZU+.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The physical interface for LVDS is a 100Ω differential pair. At 300MHz DDR, the length-matching requirements are not difficult. A bit is about 250mm long, so millimeter-scale mismatches due to an uneven number of 45º left and right turns are not a big deal; no meandering is really needed for intrapair matching. Likewise, I felt it was okay to break some routing rules by splitting a pair for a few millimeters to reach inner CMV12000 pins, rather than pushing trace/space limits to squeeze two traces between pins.<br />
<br /></div>
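The 250mm figure is just the bit period times the propagation speed, assuming roughly c/2 in FR4:

```c
/* Electrical length of one bit: propagation speed / bitrate, with the
 * speed assumed to be roughly c/2 in FR4. At 600Mb/s this is ~250mm, so
 * millimeter-scale intrapair mismatch costs well under 1% of a bit. */
double bit_length_mm(double bitrate_bps)
{
    const double v = 3.0e8 / 2.0;        /* m/s, ~c/2 assumption */
    return v / bitrate_bps * 1000.0;     /* mm */
}
```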
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg39HKLxniHQwqobJERmxEnJ_hbF_tEwGUqTZTeDmPn1VXbx6cXOykYPfvyApixuu2Pc5noVRxVXHxVVS-TTh9KfHejpQZQjSUVRFcUqtpgNgvnjMk5Ag_J6UXKK0De5wkSrn8BYd7wt0E/s1600/c65.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1259" data-original-width="1600" height="502" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg39HKLxniHQwqobJERmxEnJ_hbF_tEwGUqTZTeDmPn1VXbx6cXOykYPfvyApixuu2Pc5noVRxVXHxVVS-TTh9KfHejpQZQjSUVRFcUqtpgNgvnjMk5Ag_J6UXKK0De5wkSrn8BYd7wt0E/s640/c65.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Routing of the LVDS pairs to TE0803 headers. Pairs are length-matched to within ~1mm, but no interpair matching was attempted. The FPGA must deal with interpair length differences as well as the CMV12000's large internal skew.</td></tr>
</tbody></table>
<div style="text-align: left;">
Interpair skew <i>is</i> still an issue. For ease of routing, no interpair length matching was attempted, resulting in length differences of as much as 30% of a bit interval. But this isn't even the bad news. The CMV12000 has a ~150ps skew between <i>each</i> channel from 1-32 and 33-64. That means that channels 32 and 64 are ~4.7ns behind channels 1 and 33, a whopping 280% of a bit interval. It would be silly to try to compensate for this with length matching, since that's equivalent to about 700mm at c/2!</div>
<h4 style="text-align: left;">
Deserialization and Link Training</h4>
<div>
For a brief moment after reading about the CMV12000's massive interchannel skew, I thought I might be screwed. FPGA inputs deal with skew by using adjustable delay elements to add delay to the edges that arrive early, slowing them all down to align with the most fashionably late edge. But the delay elements in the Zynq Ultrascale+ are only guaranteed to provide up to 1.1ns of delay. It's possible to cascade the unused output delay elements with their associated input delay elements, but that's still only 2.2ns.<br />
<br />
But I don't need to account for the whole 4.7ns of interchannel skew; I only need to reach the same phase angle in the next bit. At 600Mb/s, that's only 1.67ns away. Delays larger than this can be done with bit slipping, as shown below. Since this still relies on the cascaded delay elements to span one full bit interval, an interesting consequence is that a <i>minimum</i> speed is imposed (about 450Mb/s for 2.2ns of available delay). So I guess it's go fast or go home...<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEia-nzY-LowYXjbqHLKGcqFoK3jUGI8KmlZwObzx5uxT_Ny73G5mRE2r1frIFBn_9Q0JNyfgHBO0Jg1KU-k2MzkLoStqALV8hS-e2i3zgldUwtBGoqOeBYtfFQh62mbtmRvpy2JQSjmrnM/s1600/c66.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="328" data-original-width="1600" height="129" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEia-nzY-LowYXjbqHLKGcqFoK3jUGI8KmlZwObzx5uxT_Ny73G5mRE2r1frIFBn_9Q0JNyfgHBO0Jg1KU-k2MzkLoStqALV8hS-e2i3zgldUwtBGoqOeBYtfFQh62mbtmRvpy2JQSjmrnM/s640/c66.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Channels are aligned using an adjustable delay of up to one bit period and integer bit slipping in the deserialized data.</td></tr>
</tbody></table>
The Ultrascale+ deserializer hardware supports up to 1:8 (byte) deserialization from DDR inputs. The bit slip logic selects a new byte from any position in the 16-bit concatenation of the current and previous raw deserialized bytes. The combination of the delay value and integer bit slip offset independently aligns each channel.</div>
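Behaviorally, the bit slip is just a byte-wide window mux over the concatenated bytes; a sketch (the bit ordering here is an assumption):

```c
#include <stdint.h>

/* Behavioral model of the bit slip mux: select an 8-bit window at offset
 * 'slip' (0-8) from the concatenation of the previous and current
 * deserialized bytes. LSB-first bit ordering is an assumption. */
uint8_t bit_slip(uint8_t prev_byte, uint8_t cur_byte, unsigned slip)
{
    uint16_t cat = ((uint16_t)cur_byte << 8) | prev_byte;
    return (uint8_t)(cat >> slip);
}
```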
<div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div style="text-align: left;">
A complication is that the CMV12000 has 8-, 10-, and 12-bit pixel modes, with the highest readout efficiency in the default 10-bit mode. To go from 8-bit deserialized data to 10-bit pixel data requires building a "gearbox", a nomenclature I really like. An 8:10 gearbox can be built pretty easily with just a few registers:<br />
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfMdjLxqDVQgOmeav2kvXX3fOBmIVIFBsXybNrRp_mRUSZOndH4SjJ25F8a9f-a8Aox0VSXdjh34kN2oa8Jn8qE1zbT2VIFqUpK_EFlHfaz_eNsK_oJMXvb8iROcY0s3ky6jFBfhSWe1o/s1600/c67.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="468" data-original-width="1555" height="192" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfMdjLxqDVQgOmeav2kvXX3fOBmIVIFBsXybNrRp_mRUSZOndH4SjJ25F8a9f-a8Aox0VSXdjh34kN2oa8Jn8qE1zbT2VIFqUpK_EFlHfaz_eNsK_oJMXvb8iROcY0s3ky6jFBfhSWe1o/s640/c67.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">An 8:10 gearbox, with four states corresponding to alignment of the 10-bit output within two adjacent 8-bit inputs.</td></tr>
</tbody></table>
<div style="text-align: left;">
The gearbox cycles through four states, registering a 10-bit output from an offset of {0, 2, 4, or 6} within two adjacent 8-bit inputs to pick out whole pixels from the data. This looks simple enough, but there's a subtlety in the fact that five bytes must cycle through the registers for every four pixels. In other words, the input clock (byte_clk) is running 5/4 as fast as the output clock (px_clk). The two clocks must be divided down from the same source (the LVDS clock in this case) to ensure that timing constraints can be evaluated. Additionally, to work as pictured above, the phase of the two clocks must be such that the "extra" byte shift occurs between states 3 and 0.</div>
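Ignoring the clock-domain details, the gearbox behavior can be modeled with a simple bit accumulator (LSB-first packing is an assumption; the real hardware uses registers and the 4-state offset mux):

```c
#include <stdint.h>

/* Behavioral 8:10 gearbox model: bytes in, 10-bit pixels out. Five bytes
 * in yields four pixels out, hence byte_clk = (5/4) * px_clk. LSB-first
 * packing is an assumption; the hardware uses a 4-state offset mux. */
typedef struct { uint32_t acc; unsigned nbits; } gearbox_t;

/* Push one byte; returns 1 if a 10-bit pixel was produced in *out. */
int gearbox_push(gearbox_t *g, uint8_t b, uint16_t *out)
{
    g->acc |= (uint32_t)b << g->nbits;
    g->nbits += 8;
    if (g->nbits >= 10) {
        *out = (uint16_t)(g->acc & 0x3FFu);
        g->acc >>= 10;
        g->nbits -= 10;
        return 1;
    }
    return 0;
}
```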
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The overall input module is pretty tiny, which is good because I have to instantiate 65 of them (64 pixel channels and one control channel). They're built into an AXI-Lite slave peripheral with all the per-channel tweakable parameters as well as the final 10-bit pixel outputs mapped for the ARM to play with. The CMV12000 outputs training data on the pixel channels any time they're not being used to send real data. So, my link training process is:</div>
<div style="text-align: left;">
</div>
<ol>
<li>Find the correct phase for the px_clk so that, as described above, the gearbox works properly. Incorrect phase will result in flickering pixel data as the byte shifts occur in the wrong place relative to the gearbox px_clk state machine. I'm not sure why this phase changes from reset to reset. It's the same value for all 65 channels, so I feel like there should be a way to have it start up deterministically. But for now it's easy enough to try all four values and see which one produces constant data.<br /> </li>
<li>On each channel, set the sampling point by sweeping through the adjustable delay values looking for an eye center. (Or, since it's not guaranteed that a complete eye will be contained in the available delay range, a sampling point sufficiently far from eye edges.)<br /> </li>
<li>On the control channel, set the bit slip offset to the value between 3 and 12 that produces the expected training value. This covers all ten possibilities for phasing of the pixel data relative to the deserializer. Note that this requires registering and concatenating three deserialized bytes, rather than two as pictured in the bit slip example above.<br /> </li>
<li>On each pixel channel, set the bit slip offset to the value closest to the control channel bit slip offset that produces the expected training value. It should be within ±3 of the control channel bit slip offset, since that's the maximum interchannel skew.</li>
</ol>
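Step 2 above is a classic eye-center search. As a sketch (a hypothetical helper, not the actual training code), it reduces to finding the midpoint of the longest run of passing delay taps:

```python
def pick_sample_delay(delay_ok):
    # delay_ok[i] = True if the training word decodes correctly at delay
    # tap i. Return the tap at the center of the longest passing run,
    # i.e. the sampling point farthest from the eye edges.
    best_start = best_len = 0
    start = None
    for i, ok in enumerate(delay_ok + [False]):   # sentinel closes the last run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_len == 0:
        raise ValueError("no passing delay taps found")
    return best_start + best_len // 2

print(pick_sample_delay([False, True, True, True, True, True, False, False]))  # 3
```

If the whole sweep passes (the eye edges are outside the available delay range), this degenerates to picking the middle tap, which is the best available guess.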
<div>
This only takes a fraction of a second, so it can easily be done on start-up or even in between captures to protect against temperature-dependent skew. By looking at the total delay represented by the delay tap values and bit slip offsets, it's clear that the CMV12000's interchannel skew is the dominant factor and that the trained delays roughly match the datasheet skew specification of 150ps per channel:<br />
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHfI5gzqJGPhFg46SmWR7u-T38k_I-qmvccsZzwXwk8khSi98WIwGbUxoBuBJW3J9oS-x1WzqKLUFwXkV-mpQQGpk6MKsX98mZ-Tgrj7SGQKU-VLaJx7qa200ACFVVkY9_CAksEhHrEes/s1600/c37.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="663" data-original-width="1001" height="263" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHfI5gzqJGPhFg46SmWR7u-T38k_I-qmvccsZzwXwk8khSi98WIwGbUxoBuBJW3J9oS-x1WzqKLUFwXkV-mpQQGpk6MKsX98mZ-Tgrj7SGQKU-VLaJx7qa200ACFVVkY9_CAksEhHrEes/s400/c37.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Total CMV12000 channel delays measured by training results.</td></tr>
</tbody></table>
<div style="text-align: left;">
That's the hard part of the source done, with less trouble than I expected. The output is a 60MHz px_clk and 65 10-bit values that update on that clock. This will be the interface to the middle of the pipeline, the wavelet engine. But I need to be able to test the sensor before that's complete, and with more than 64 pixels at a time. Without the compression stage, though, that means writing data at full rate to external DDR4 attached to the ZU+. Although it's a throwaway test, I will need to write to that RAM (at a lower rate) after the compression stage anyway, so this would be good practice.</div>
<h4 style="text-align: left;">
RAMMING SPEED</h4>
<div>
The ZU4CG version of the TE0803 has 2GB of 2400MT/s DDR4 configured as x64. That's over 150Gb/s of theoretical memory bandwidth, so the 38.4Gb/s CMV12000 data should be pretty easy. The DDR4 is attached to the PS side of the ZU+, though, and the dedicated DDR controller there is shared by many elements of the system, including the ARM cores. </div>
<div>
<br /></div>
<div>
The CMV12000 front-end described above exists on the PL side. The fastest interface between the PL and the PS is a set of 128-bit AXI memory buses, exposed as the slave ports S_AXI_HPx_FPD to the PL. There are four such slave ports, but only a maximum of three can be simultaneously routed to the DDR controller:<br />
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgohOSi3ye7nrK4JA6CzaEVkQ9bK2Cpn6ZYedbqu6soSIeLbrfbBI_JKIBYinTNdCjuO2inpb8Qml-3ykBsT7YASIazJAZ6Aq7ByfZuLhRW1sZtFzxADE1rfWo-HOKlIVTi4GzkOlH7PyQ/s1600/c68.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="1562" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgohOSi3ye7nrK4JA6CzaEVkQ9bK2Cpn6ZYedbqu6soSIeLbrfbBI_JKIBYinTNdCjuO2inpb8Qml-3ykBsT7YASIazJAZ6Aq7ByfZuLhRW1sZtFzxADE1rfWo-HOKlIVTi4GzkOlH7PyQ/s640/c68.png" width="624" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Up to three 128-bit AXI memory buses can be dedicated to direct PL-PS DDR access.</td></tr>
</tbody></table>
<div>
The Ultrascale+ AXI might be able to go up to 333MHz, according to the datasheet, but 250MHz is the more common setting. That's okay - that's still 96Gb/s of theoretical bus bandwidth. But you can start to see why it's infeasible to store intermediate compression data in external RAM. Even 2.5 accesses per pixel would saturate the bus.</div>
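The arithmetic behind that claim, using only numbers already quoted above:

```python
bus_bw = 3 * 128 * 250e6   # three 128-bit AXI ports at 250 MHz = 96 Gb/s
sensor = 64 * 60e6 * 10    # 64 channels, 60 MHz px_clk, 10-bit pixels = 38.4 Gb/s
print(bus_bw / sensor)     # 2.5 -- accesses per pixel that saturate the bus
```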
<div>
<br /></div>
<div style="text-align: left;">
For this test, I set up some custom BRAM FIFOs to use for buffering between the hard-timed pixel input and the more uncertain AXI write transfer. To keep things simple, four adjacent channels share one 64b-wide FIFO, aligning their pixel data to 16 bits. All FIFO writes happen on the px_clk when the control channel indicates valid pixel data.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The other side of the FIFO is a little more confusing. I split channels 1-32 and 33-64 (8 FIFOs each) into two write groups, each with its own AXI master port with 32Gb/s of theoretical bandwidth. The bottom channels drive S_AXI_HP0_FPD and the top drive S_AXI_HP1_FPD, and I rely on the DDR controller to sort out simultaneous write requests.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEissiwMNnQbpw_zRt5Jojeloy2RTD6isFDl8mD4uHko7CDs1ljsb3FxXufKaG5xaomig7HXOhQN0RHUhaoaFUJt_67oser_COMz6vIUeezZ4pdAQyiw2ThzDnrzpW2Sj0bLwzADE-J-H1s/s1600/c69.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="464" data-original-width="1600" height="184" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEissiwMNnQbpw_zRt5Jojeloy2RTD6isFDl8mD4uHko7CDs1ljsb3FxXufKaG5xaomig7HXOhQN0RHUhaoaFUJt_67oser_COMz6vIUeezZ4pdAQyiw2ThzDnrzpW2Sj0bLwzADE-J-H1s/s640/c69.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Bottom channel RAM writing test pipeline, through BRAM FIFO buffers. Top channels are similar.</td></tr>
</tbody></table>
<div style="text-align: left;">
When the FIFO levels reach a certain threshold, a write transaction is started. Each transaction is 16 bursts of 16 beats of 16 bytes, and the 16 bytes of a beat are views of the FIFO output data. For simplicity, I just alternate between views of the 8 MSBs of 16 pixels to fill each 128-bit beat. I may stick the 2 LSBs from all 64 channels in their own view at some point, but for now I can at least confirm sensor operation with the 8 MSBs.</div>
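For scale, each transaction as described moves 16 × 16 × 16 bytes, which in the 8-MSB view (one byte per pixel) is 4096 pixels per transaction:

```python
txn_bytes = 16 * 16 * 16   # bursts x beats x bytes-per-beat
print(txn_bytes)           # 4096 bytes (4 KiB) per AXI transaction
pixels_per_txn = txn_bytes # one byte (the 8 MSBs) per pixel in this view
```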
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Without further ado, the first full image off the sensor:<br />
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhsTYkjG5383iNXolpq-kK8HVWRvPlCUiPcRQY9EspBBNhmbfCh-q957wANNP_nVpWqHQmOByZuh5K9IMWHzwtfNIVUb8c-Dxx5cD8X5aXd9dJ3zomddtd3itVJd6X-GqlAMtdZR7BGmg/s1600/c54.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1165" data-original-width="1545" height="482" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhsTYkjG5383iNXolpq-kK8HVWRvPlCUiPcRQY9EspBBNhmbfCh-q957wANNP_nVpWqHQmOByZuh5K9IMWHzwtfNIVUb8c-Dxx5cD8X5aXd9dJ3zomddtd3itVJd6X-GqlAMtdZR7BGmg/s640/c54.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">What were you expecting?</td></tr>
</tbody></table>
<div style="text-align: left;">
It turned out better than I thought, even looking like a VHS tape on rewind as it does. There are both horizontal and vertical defects. The vertical defects were concentrated in one 128px-wide column, served by a single LVDS pair, so that was easily traceable to a marginal solder joint. The horizontal defects were more likely to be missing or corrupted RAM writes. They would change position every frame.<br />
<br />
At first I suspected the DDR controller might be struggling to arbitrate between the two PL-PS ports and the ARM. The ARM might try to read program data while the image capture front-end is writing, incurring both a read/write turnaround penalty and a page change penalty. But in that case the AXI slave ports should exert back-pressure on their PL masters by deasserting the AWREADY signal, and I didn't see this happening. To further rule out ARM contention, I moved the ARM program into on-chip memory and disabled all but the two slave ports being used to write data to the DDR controller...still no good.<br />
<br />
I also tried different combinations of pixel clock speed (down to 30MHz), AXI clock speed (down to 125MHz), burst size, and total transfer size with no real change. Even with only one port writing, the problem persisted. Then I tried replacing the image views with some FIFO debug info: input/output counters and the difference used to calculate the fill level. I had expected the difference to vary up and down by one or two address units since the counters run on different clocks, but what I saw were cases where the difference was entirely wrong, possibly triggering bad transfers.<br />
<br />
So what I had was a clock domain crossing problem. Rather than describe it in detail, I'll just link <a href="https://zipcpu.com/blog/2018/07/06/afifo.html">this article</a> that I wish I had read beforehand. The crux of it is that the <i>individual bits</i> of the counter can't be trusted to change together and if you catch them in mid-transition during an asynchronous clock overlap, you can get results that are complete nonsense, not just off-by-one. The article details a comprehensive bidirectional synchronization method using Gray code counters, but for now I just tried a simple one-way method where the input counter is driven across the clock domain with an alternating "pump" signal:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiVIzwKGQZRsjSiQSuqQiE8Mst2kuf3gwh5QN1JoB4GdJCwGyEPGODlBhzTRTKMxRmxCWl4M2L4Tay2_Lpz1n70-wo_8eVmqjuOaHFLRppcyMzvloEz8zVhMl9-aHKxNgdpTi3EcYCTXI/s1600/c70.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="366" data-original-width="1600" height="146" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiVIzwKGQZRsjSiQSuqQiE8Mst2kuf3gwh5QN1JoB4GdJCwGyEPGODlBhzTRTKMxRmxCWl4M2L4Tay2_Lpz1n70-wo_8eVmqjuOaHFLRppcyMzvloEz8zVhMl9-aHKxNgdpTi3EcYCTXI/s640/c70.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Synchronization pump for FIFO input counter.</td></tr>
</tbody></table>
<div style="text-align: left;">
The pump is driven by the LSB of the input-side counter and synchronized to the AXI clock domain through a series of flip-flops. This only works if the output-side clock is sufficiently faster than the input-side clock that it can always detect every edge of the pump signal. That's the case here, with a 250MHz axi_clk and a 60MHz px_clk. The value of <span style="font-family: "courier new" , "courier" , monospace;">in_cnt_axi</span>, the input counter pumped to the AXI clock domain, is what's compared to the output counter (which is already in the AXI clock domain) to evaluate the FIFO level and trigger AXI transfers. It's the right amount of simple for me, adding only a few flip-flops to the design.</div>
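Both halves of this story can be sketched in a few lines of Python: first, why a multi-bit binary counter sampled mid-transition can read back as complete nonsense, and then a behavioral model of the pump synchronizer (my approximation of the scheme, not the exact RTL), which sidesteps the problem by crossing only a single bit:

```python
def possible_samples(old, new, width=4):
    # Each counter bit crosses the clock domain independently: for every
    # subset of bits (mask), that subset resolves to the new value while
    # the rest still show the old one.
    return {(old & ~mask) | (new & mask) for mask in range(1 << width)}

# Binary 7 -> 8 (0111 -> 1000) flips all four bits, so a mid-transition
# sample can land on any value 0..15 -- not just off-by-one.
print(len(possible_samples(7, 8)))  # 16

def sync_pump(pump_samples):
    # axi_clk-domain side: two synchronizer flops plus an edge detector.
    # Every pump toggle (the px-side counter's LSB) adds one count, so
    # only one wire ever crosses the domain boundary.
    ff1 = ff2 = ff2_prev = 0
    count = 0
    for p in pump_samples:
        ff2_prev = ff2
        ff2 = ff1          # second synchronizer flop
        ff1 = p            # first flop samples the asynchronous pump
        if ff2 != ff2_prev:
            count += 1
    return count

# px-side counter 0..9; each px_clk period spans ~4 axi_clk samples
# (250 MHz vs 60 MHz), so no pump edge is ever missed.
lsb_stream = []
for n in range(10):
    lsb_stream += [n & 1] * 4
print(sync_pump(lsb_stream))  # 9 -- all nine increments recovered
```

The Gray-code method from the linked article is the general solution when neither clock is guaranteed faster; the pump only works because axi_clk comfortably outruns px_clk here.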
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYlvNFS1AbeYfTDjsnXn_bk7DTjIdnighJc-2VYyULG8ydnYHvz5xUN4o7TuFFVk4ogJYyytAN0s6IXpWhAFDmGxg8Jzk12CMaanz6bqGz8XxgdDNHLowsw0BW7oC_VESA0Lk5DR5GxvA/s1600/c60.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYlvNFS1AbeYfTDjsnXn_bk7DTjIdnighJc-2VYyULG8ydnYHvz5xUN4o7TuFFVk4ogJYyytAN0s6IXpWhAFDmGxg8Jzk12CMaanz6bqGz8XxgdDNHLowsw0BW7oC_VESA0Lk5DR5GxvA/s640/c60.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">And just like that, clean kerbal portraits.</td></tr>
</tbody></table>
<div style="text-align: left;">
In theory, I could read in about 170 frames this way (in 0.567s...). It currently takes me 30 seconds to get each frame off over JTAG, though, so I may want to get USB (SuperSpeed!) up and running first. More importantly, I can evaluate sensor stuff independent of the two other main challenges (wavelet pipeline and SSD sink). I'm actually surprised at the okay-ness of the raw image, but there is definitely some fixed pattern noise to contend with. I also want to try the multi-slope HDR mode, which should be great for fitting more dynamic range in the 10-bit data (with no processing on my end!).</div>
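Sanity-checking the 170-frame figure quoted above, assuming the CMV12000's full 4096×3072 resolution, one stored byte per pixel (the 8-MSB view), and its roughly 300 fps full-resolution rate:

```python
frame_bytes = 4096 * 3072            # full-res frame, one byte per pixel: 12 MiB
frames = (2 * 2**30) // frame_bytes  # frames that fit in 2 GiB of DDR4
print(frames, frames / 300)          # 170 frames, ~0.567 s of capture
```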
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I started with the source and sink because, even though they're the more known tasks, they represent external constraints that are actually show-stoppers if they don't work. Now I am confident in everything up to the pixel data hand-off on the source side. The sink side is still a mess, but the hardware has been checked at least. That leaves the more unknown challenge of the wavelet compression engine. But since it's entirely built from logic, with interfaces on both ends that I control, I'm actually less worried about it. In other words, it's nice to not have to think about whether or not to build something from scratch...</div>
</div>
</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com10tag:blogger.com,1999:blog-8200098102909041178.post-30074690607456220492019-08-28T23:53:00.002-04:002019-08-30T12:27:11.499-04:00TinyCross: Electronics UpdateWhere I left off, TinyCross was at the <a href="http://scolton.blogspot.com/2019/03/tinycross-chassis-build.html">rolling chassis</a> stage. Mechanically, it went together relatively smoothly, most of the issues having been worked out in CAD. There are a few minor tweaks I'd like to make to make it lighter and narrower, but they're low priority compared to getting a first test drive in. So, on to the electronics.<br />
<div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqE84F1dzGcyHOeEAfz093Wmx05Gmh41UVGbHY_oLFmfcVZj76GoZVj-ljc_YoXrsEIb6691ore3vt3tbaJL1UFMdWNTGk8_r3ZzpNUCQsyd2cb1GT6K1JXcIcoBu62HjkccjqrN9MEWo/s1600/tc55.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="487" data-original-width="1600" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqE84F1dzGcyHOeEAfz093Wmx05Gmh41UVGbHY_oLFmfcVZj76GoZVj-ljc_YoXrsEIb6691ore3vt3tbaJL1UFMdWNTGk8_r3ZzpNUCQsyd2cb1GT6K1JXcIcoBu62HjkccjqrN9MEWo/s640/tc55.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">It always looks so clean until you start adding wires.</td></tr>
</tbody></table>
<div style="text-align: left;">
I've already done a post on the <a href="http://scolton.blogspot.com/2018/09/tinycross-electron-control-unit.html">motor drive design</a>. Since the kart is four wheel drive, one will control the two front motors and one will control the two rear motors. For now I've only built up one, just in case there are any observations from the first build that would require changing parts on the second. Here's what the power side looks like:</div>
</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMRj-OXGPg0AjSFMVkvKy20R9-_Bcw4JOIbCrDj-xGqv7YfD6ywUsRTK9xDKtEQZrfzYr2OdRw4NdB5UW2WWxe3cO-WS0PMZw-9GoPZ3HdWpw6cYqFri6YHIGJQSZ3lSH6jOjHY4D3gSc/s1600/tc66.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMRj-OXGPg0AjSFMVkvKy20R9-_Bcw4JOIbCrDj-xGqv7YfD6ywUsRTK9xDKtEQZrfzYr2OdRw4NdB5UW2WWxe3cO-WS0PMZw-9GoPZ3HdWpw6cYqFri6YHIGJQSZ3lSH6jOjHY4D3gSc/s640/tc66.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">TxDrive, power side.</td></tr>
</tbody></table>
<div style="text-align: left;">
It's one of the weirdest power layouts I've done for a motor drive. The design supports two different FET configurations: one with a single <a href="http://ixapps.ixys.com/Datasheet/MTI200WX75GD.pdf">MTI200WX75GD</a> per motor and another with six <a href="http://www.mouser.com/ds/2/149/FDMT80080DC-760735.pdf">FDMT80080DC</a>s. Since the MTI200's are perpetually out of stock, I committed to the FDMT solution for this build. It really doesn't look like enough FET, but on paper they're almost identical to the MTI200. I especially like the 1453A pulsed current rating. The board is four layers but only 1oz copper, so I also reinforced some of the high current density paths with 1mm bus wire and copper braid.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The FDMT 8x8 SO8 package creates a few other advantages in this configuration. The parasitic inductance is lower and there's room for local ceramic capacitor decoupling near each half bridge, which will help contain switching transients. The entire power side is also at or below 1mm in height, so the whole surface area of the board, including the somewhat overworked 12V to 5V LDO, can be heat sunk to the chassis through some thermal pad:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheNjoiUZjtM9_nx10XZfcIBIV84MfaQ4UXvsTuOuIaau9H2WZV2mnogR6vkE254IWJD5yAn2NP6fX04X41XWo3G124XPmSVytWX5Qgbm72WC2jbZHc_1WWL2WgrFDQf9ASQo7Hmis5RZc/s1600/tc62.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheNjoiUZjtM9_nx10XZfcIBIV84MfaQ4UXvsTuOuIaau9H2WZV2mnogR6vkE254IWJD5yAn2NP6fX04X41XWo3G124XPmSVytWX5Qgbm72WC2jbZHc_1WWL2WgrFDQf9ASQo7Hmis5RZc/s640/tc62.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">TxDrive, signal side.</td></tr>
</tbody></table>
<div style="text-align: left;">
On the other side of the board, each half bridge gets a <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzVsW07cFFqEdOqz23OHXxmbq6V6M8z28fUkgCrfzKucUOpzXJ5QWuvQQ0dJ4APZL1WoABbt_7nPn_wWTGlwNuzwyCtNne3uPqY1jAKx4NNUn8-2fIO2zh7mkyfhD5vUTofzKiYp4_1Lw/s1600/layout09.png">12mm-wide vertical slice</a> with its phase wire exit, gate drive, current sense, and 2x47uF aluminum polymer bus capacitance. An additional 820uF of bulk capacitance per motor gets folded over into the unused volume. The signal board sits in the middle and carries the MCU, its power supply, a CAN transceiver, and the encoder interface.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Two pairs of 12AWG inputs, each with 4mm bullet connectors, support up to about 150A of peak battery current. I find it easier to deal with two 12AWG wires than one 8AWG wire. The six phase outputs are also 12AWG, so everything can pass through a common size grommet on the eventual enclosure. The only other connections are CAN (twisted pair) and the encoders (<a href="https://www.digikey.com/product-detail/en/3m/3517-9-100SF/MB09H-10-ND/1190677">9-conductor shielded ribbon cable</a>). </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The ribbon cables and phase wires run in parallel down each upper A-arm to the motors. This is the scariest run of wire for many reasons. Electrically, the phase wires are high-dV/dt EMI sources that will capacitively couple onto the encoder cable. This is the main motivation for using shielded cable and <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEher5ZXjLnQsB7UNH4-YfDcc1HvnHDFxpVLHVkrbtAdLiLGDoe-FFaoJMNk3R7eW6SiikXom7RaxHv4W8H7T2NM4uL8enZxvRruEyKWFOsMVWRZZmLwFlfrbtb8GYv1oYYbRceerpp8738/s1600/layout15.png">three-phase optoisolated</a> Hall signal configuration. Mechanically, these wires pass through several moving parts. (The encoder cable even passes <i>through</i> the drive belt loop!) They need enough slack to accommodate the entire steering and suspension travel, but the slack needs to be in the right places, with good strain relief everywhere else. The routing is actually pretty clean, and will get cleaner once the drives and encoders have their covers installed.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiDryVI1yYHNH0YyYKVx2eCSYKQAJCEHoRGDTS-ipHxu_51D2A_qNY2vgjcDktj6JC1t-8egjGR5VsnthkHT4vOUf3ehSiStbHNHsXQZeSM1Q6Tih1bI9ula8zFTZf2wnjwvicof6db8s/s1600/tc63.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiDryVI1yYHNH0YyYKVx2eCSYKQAJCEHoRGDTS-ipHxu_51D2A_qNY2vgjcDktj6JC1t-8egjGR5VsnthkHT4vOUf3ehSiStbHNHsXQZeSM1Q6Tih1bI9ula8zFTZf2wnjwvicof6db8s/s640/tc63.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Front wheel drive fully wired up.</td></tr>
</tbody></table>
<div style="text-align: center;">
<div style="text-align: left;">
That's all just for the front drive; everything will get repeated for the rear. That means that in total there are four pairs of 12AWG DC wire to route out from the central battery input, and up to 300A of total peak battery current to deal with. And this is where I get to spread the good word about MIDI fuses. They are by far the most power-dense fuse format. I've always used the car audio ones, with questionable voltage rating, but Littelfuse makes some <a href="https://www.littelfuse.com/products/fuses/automotive-passenger-car/high-current-fuses.aspx">serious ones</a> as well, up to a 75V / 200A model rated to break 2500A! Their triple fuse holder is also perfect for my circuit.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMrLca_iMZp_J6vbFVYI7qEs1927IcoL1OPhgpLQaTYbEhkSNmz3BQ2gPjyQYHADarj-p4Y4V2ibM-Xd1QvKrgrA7VWQxLrSXAvbjQBLFtR6uumSIb5C69oCUqSHynWehnn8OPgZskOtM/s1600/tc57.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMrLca_iMZp_J6vbFVYI7qEs1927IcoL1OPhgpLQaTYbEhkSNmz3BQ2gPjyQYHADarj-p4Y4V2ibM-Xd1QvKrgrA7VWQxLrSXAvbjQBLFtR6uumSIb5C69oCUqSHynWehnn8OPgZskOtM/s640/tc57.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Main power input.</td></tr>
</tbody></table>
<div style="text-align: left;">
The two battery inputs (each a series string of two <a href="https://www.getfpv.com/tattu-plus-10000mah-22-2v-25c-6s-lipo-smart-battery-pack.html?utm_source=google&utm_medium=cpc&adpos=1o1&scid=scplp4479&sc_intid=4479&gclid=CjwKCAjwqZPrBRBnEiwAmNJsNrJksAOdqpt46B8fclg9PyWJypYmRLmFpWXvAH12PwYAQGHeJPCdYRoC4vYQAvD_BwE">Tattu 6S 10Ah Smart packs</a>) get the same dual 12AWG treatment, with back-to-back thick #10 ring terminals (the wonderful <a href="https://www.mcmaster.com/7113k17">McMaster 7113K17</a>) bolted to individual fuses. These connect to a bus bar that feeds the full 4x12AWG positive group. This goes through a master power switch and then splits off two and two to the front and rear drives. Meanwhile, a separate 30A fuse feeds off the bus bar through a small switch to the charger and steering wheel board.</div>
<h4 style="text-align: left;">
The...steering wheel board?</h4>
<div style="text-align: left;">
I am really trying to minimize the number of microcontrollers (and also the number of firmware images) on this kart. Each drive has an <a href="https://www.st.com/en/microcontrollers-microprocessors/stm32f303.html">STM32F303</a> that's pretty busy running two motors and really shouldn't be doing anything else. But I can stuff every other process onto a single high-level controller. This controller needs to handle driver interface (including throttle read-in), CAN communication with the drives, and ideally battery management. This constrains it to be somewhere near the center of the kart, and the steering wheel seemed like a logical place.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEhNag74iS7jhzSbRY_SPzdeMybUSwYpt3ATmXfP-MLZR8p7Dd8zXgZqL9wehnvZqbsSz-PPNh_U-TX5lOsvhMnXGyim4XsWHDgD2khRQg8pfVPPt8JtePrOdcLnKA7hz3du38eF-xxA4/s1600/tc65.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEhNag74iS7jhzSbRY_SPzdeMybUSwYpt3ATmXfP-MLZR8p7Dd8zXgZqL9wehnvZqbsSz-PPNh_U-TX5lOsvhMnXGyim4XsWHDgD2khRQg8pfVPPt8JtePrOdcLnKA7hz3du38eF-xxA4/s640/tc65.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Why have I not grip-taped my steering wheels before?</td></tr>
</tbody></table>
I've also always wanted to have an OLED steering wheel display. Having live data will definitely help with troubleshooting. Although it's not absolutely necessary, I decided to use the <a href="https://www.st.com/en/microcontrollers-microprocessors/stm32f7x6.html">STM32F746</a> for this board since it has the DMA2D graphics driver. The OLED is 4-bit monochrome, which isn't a natively-supported output format for the DMA2D. But as long as you're blitting even numbers of pixels, you can still make it work. The interface between the OLED and the main board is a SPI variant, good enough for a 50-60Hz update rate. I was originally going to put it on headers, but for clam shell serviceability it was better to just use thin wires.<br />
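The even-pixel constraint falls out of the 4-bit format: two pixels share a byte, so any blit has to land on byte boundaries. A tiny packing sketch (the high-nibble-first ordering is my assumption; the actual display may use the opposite convention):

```python
def pack_4bpp(pixels):
    # Pack 4-bit grayscale pixels two-per-byte (even count required),
    # first pixel of each pair in the high nibble.
    assert len(pixels) % 2 == 0, "4bpp framebuffers need even pixel counts"
    return bytes(((a & 0xF) << 4) | (b & 0xF)
                 for a, b in zip(pixels[::2], pixels[1::2]))

print(pack_4bpp([0xF, 0x0, 0x3, 0xC]).hex())  # f03c
```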
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZPAIzk6EHP2BcgF8k-UteOumTpTz3kD46waLSeXrVhjTDZV5fQWF-FDNWug6fmHlOmfuPYdwp3LtETYc2qVjfV3GNaVMe3UDRHf-9VW8u9UgrRXwcBvax0XYVF5d1q-esEbpnfz2cSss/s1600/tc64.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZPAIzk6EHP2BcgF8k-UteOumTpTz3kD46waLSeXrVhjTDZV5fQWF-FDNWug6fmHlOmfuPYdwp3LtETYc2qVjfV3GNaVMe3UDRHf-9VW8u9UgrRXwcBvax0XYVF5d1q-esEbpnfz2cSss/s640/tc64.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Display interface and "hot" side of the BMS.</td></tr>
</tbody></table>
<div style="text-align: left;">
Also on that side of the board is the battery management system (BMS) cell balance circuitry. This got out of hand quickly since I left almost no room for it: the entire area under the display is pretty much off-limits. But I managed to cram 12 cells worth of balance circuit on each side with the resistors themselves sinking heat into the steering wheel metal. To facilitate routing, the circuit alternates FET/resistor placement for the odd and even cells:</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhR9-zjvTRtmhc6wtuLEAQQD2r6jqdhce7vfZUqCsBdRUNbV7_jdx238vOzurv10gyBM2CK9P-_RBcR9VJzTTGQtJOZ1_BYcXod819sbO53CElSxfZzLWphPKhsb21Y_Vi1bVc-69YTAYk/s1600/tc68.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="906" data-original-width="1499" height="385" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhR9-zjvTRtmhc6wtuLEAQQD2r6jqdhce7vfZUqCsBdRUNbV7_jdx238vOzurv10gyBM2CK9P-_RBcR9VJzTTGQtJOZ1_BYcXod819sbO53CElSxfZzLWphPKhsb21Y_Vi1bVc-69YTAYk/s640/tc68.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Cell balance group.</td></tr>
</tbody></table>
</div>
</div>
<div style="text-align: left;">
To discharge an individual cell, a square wave is driven onto its charge pump, which turns on its FET. This is done for the cell(s) with the highest voltage until all cells are evened out, usually during or after charging. During discharge, it's sufficient to just monitor the cell voltages and stop when any one cell reaches a low-voltage threshold.<br />
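The balance policy described above can be sketched in a few lines. This is my own illustrative pseudologic in Python, not the actual steering wheel firmware, and the deadband value is an assumption:

```python
# Hypothetical sketch of the balance policy: enable the discharge FET
# (by driving its charge pump) only for cells sitting above the pack
# minimum by more than a deadband, so the highest cells bleed down.
BALANCE_DEADBAND_V = 0.02  # assumed hysteresis, not from the real firmware

def cells_to_balance(cell_voltages, deadband=BALANCE_DEADBAND_V):
    """Return indices of cells whose balance FETs should be switched on."""
    v_min = min(cell_voltages)
    return [i for i, v in enumerate(cell_voltages) if v - v_min > deadband]
```

For example, with cell voltages [3.90, 3.91, 3.97, 3.90], only cell 2 would be discharged.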
<br />
Accurately measuring individual cell voltages is itself an interesting challenge. The main problem is that the cells are offset by up to 48V from the ADC ground. Of course, it's possible to use simple voltage dividers to bring the signals down to below 3.3V. But it would be better to have individual differential measurements of each cell. This means a lot of op-amps or...<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHQ339jy-V0sstUU101sucz8NnpVdZSsOBZNOoXVYUIWXeq-nxZcYmz-t3E5_A_GIPW2G19v1Y5JaR-SOOaR6Nfl0QeAzk9RWm2B0NojH97BS7u01USN1YjuH5pdZIenenqNRJ9preJ8A/s1600/tc69.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="515" data-original-width="1600" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHQ339jy-V0sstUU101sucz8NnpVdZSsOBZNOoXVYUIWXeq-nxZcYmz-t3E5_A_GIPW2G19v1Y5JaR-SOOaR6Nfl0QeAzk9RWm2B0NojH97BS7u01USN1YjuH5pdZIenenqNRJ9preJ8A/s640/tc69.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">One op-amp and a 72V analog multiplexer for cell voltage measurement.</td></tr>
</tbody></table>
<div style="text-align: left;">
I found some 72V analog multiplexers (<a href="https://datasheets.maximintegrated.com/en/ds/MAX14752-MAX14753.pdf">MAX14753</a>) that can feed the inputs of one nice op-amp. The muxes are dual 4-to-1 selectors cascaded and wired such that the two outputs are always adjacent cell nodes, which drive the inputs of a differential amplifier. This all fits in a pretty small footprint on the opposite side of the board from the cell balance circuitry. Also on this side of the board are all the connectors, the logic and analog power supplies, the charge cutoff FETs, buffers for driving the cell balance charge pumps, a very sad SD card holder with a reversed footprint, the STM32F7 itself, and a mystery component.<br />
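The net effect of the cascaded muxes is easy to model: the differential amplifier always sees two adjacent cell taps, so a full scan reduces to pairwise differences of the node voltages. A minimal Python sketch (node 0 taken as the pack's negative terminal):

```python
def scan_cells(node_voltages):
    """Differential cell voltages as seen by the diff amp while the muxes
    step through adjacent tap pairs (node_voltages[0] = pack negative)."""
    return [hi - lo for lo, hi in zip(node_voltages, node_voltages[1:])]
```

A pack with taps at 0.0V, 3.7V, 7.5V, and 11.1V scans as three cell readings of 3.7V, 3.8V, and 3.6V.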
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjso042IyQl1aa1VPEShpX5h1Bbdgls4hpMuP6FVRhdrY13OHqfDv7hx5q34i12a4UmJn_eqNT_NhDqc1oziFJMIgsCoHBdJLwvE2xcV2CTrJT8ROc8veyH1IX27wZdoZACoJQGn4Cvoh8/s1600/tc58.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjso042IyQl1aa1VPEShpX5h1Bbdgls4hpMuP6FVRhdrY13OHqfDv7hx5q34i12a4UmJn_eqNT_NhDqc1oziFJMIgsCoHBdJLwvE2xcV2CTrJT8ROc8veyH1IX27wZdoZACoJQGn4Cvoh8/s640/tc58.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The crowded side of the steering wheel board.</td></tr>
</tbody></table>
<div style="text-align: left;">
Right now the main purpose of this board is to act as the high-level controller for commanding torque and reading back data from the motor drives. The BMS functionality is a secondary objective, since I can still monitor pack voltage through the drives and charge off-board. The torque command comes from a nice trigger stolen from <a href="https://www.amazon.com/FS-GT2-Channel-Digital-Transmitter-Receiver/dp/B00KHLDCX4/ref=sr_1_4?keywords=RC+car+transmitter&qid=1567003056&s=gateway&sr=8-4">Amazon's second cheapest RC car transmitter</a>. Like <a href="http://scolton.blogspot.com/p/cap-kart.html#tinykart">tinyKart</a>, this means all the controls are on the steering wheel - no pedals. The trigger is bidirectional, so it can command positive and negative torque. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
All four motors receive a torque command over CAN at 1kHz that they apply to their current controllers. The motors then take turns replying with their crucial data (electrical angle, speed, voltage, current, and fault status) at 250Hz, and their less important data at 50Hz. This should allow for some fairly tight feedback loops through the central controller for things like speed control, traction control, and torque vectoring. There's also that mystery component, which is controls-related.<br />
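The reply schedule works out neatly at the 1kHz tick. Here is one workable slot assignment, sketched in Python; it's my guess at the arbitration, not necessarily what the firmware actually does:

```python
def bus_slot(tick):
    """Which drive replies at this 1 kHz tick, and whether it also sends
    its low-priority data. Four drives taking turns gives each one a
    crucial-data slot at 250 Hz; attaching the low-priority data to every
    5th turn gives each drive 50 Hz for the rest."""
    motor = tick % 4
    send_slow = (tick // 4) % 5 == 0
    return motor, send_slow
```

Over one second (1000 ticks), each drive gets exactly 250 crucial-data slots and 50 low-priority slots.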
<br />
For now, I'm just starting to test the power system with the front drive only, quite honestly so I can see the fire when it happens. The two motors will just get the same torque command, ramping up slowly to full voltage/current. I did as much testing as I could on power supplies, but it's finally time for batteries. Here's the first batteries-in test, at a very easy 6V/20A (peak line-to-neutral quantities) limit:<br />
<br />
<div style="text-align: center;">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/WfxM44ryrbA" width="640"></iframe><br /></div>
<br />
It's nothing exciting, but the first batteries-in test is always a bit scary, since there's no longer a CC/CV supply keeping things from getting out of hand. After I do some wiring and software cleanup and make sure the data logging is working, I'll ramp up from there toward the full 24V/120A, and then full four-wheel drive. I've learned to expect smoke at some point during this process, though, so I'm holding off on building the second drive until I see what fails on the first...</div>
</div>
</div>
</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com10tag:blogger.com,1999:blog-8200098102909041178.post-54257744882861858472019-07-15T14:43:00.000-04:002019-09-04T22:58:14.680-04:00Benchmarking NVMe through the Zynq Ultrascale+ PL PCIe Linux Root Port DriverI want to be able to <a href="http://scolton.blogspot.com/2019/06/freight-train-of-pixels.html">sink 1GB/s</a> into an NVMe SSD from a Zynq Ultrascale+ device, something I know is technically possible but I haven't seen demonstrated without proprietary hardware accelerators. The software approach - through Linux and the Xilinx drivers - has enough documentation scattered around to make work, if you have a lot of patience. But the only speed reference I could find for it is this <a href="http://www.fpgadeveloper.com/2016/07/measuring-the-speed-of-an-nvme-pcie-ssd-in-petalinux.html">Z-7030 benchmark</a> of 84.7MB/s. I found nothing for the newer ZU+, with the XDMA PCIe Bridge driver. I wasn't expecting it to be fast enough, but it seemed worth the effort to do a speed test.<br />
<div>
<br /></div>
<div>
For hardware, I have my carrier board with the <a href="https://shop.trenz-electronic.de/en/TE0803-02-04CG-1EA-MPSoC-Module-with-Xilinx-Zynq-UltraScale-ZU4CG-1E-2-GByte-DDR4-5.2-x-7.6-cm">ZU4CG version of the TE0803</a>. All I need for this test is JTAG, UART, and the four PL-side GT transceivers for PCIe Gen3 x4. I made a JTAG + UART cable out of the <a href="https://www.digikey.com/product-detail/en/digilent-inc/410-357-B/1286-1196-ND/8605091">Digilent combo part</a>, which is directly supported in Vivado and saves a separate USB port for the terminal. Using Trenz's bare board files, it was pretty quick to set up.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiB8kNE6wkE61Xydq45DiHttb50qkdQ43kJVxwWpNpw1hH7qa8MCuXzgx3XGeFNADZ_vWtvY8-oK_P-Y4cV9BLC2zf_clKrhUWNFaEmPgXUu5pD9lDv5vqs9X3ed91jQlAlC7JCF5rxo1A/s1600/c19.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiB8kNE6wkE61Xydq45DiHttb50qkdQ43kJVxwWpNpw1hH7qa8MCuXzgx3XGeFNADZ_vWtvY8-oK_P-Y4cV9BLC2zf_clKrhUWNFaEmPgXUu5pD9lDv5vqs9X3ed91jQlAlC7JCF5rxo1A/s640/c19.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">TE0803 Carrier + Dead-Bugged Digilent JTAG+USB Adapter.</td></tr>
</tbody></table>
<div style="text-align: left;">
Next, I wanted to validate the PCIe routing with a loopback test, following <a href="https://www.youtube.com/watch?v=qqi5ohBa-EY">this video</a> as a guide. I made my own loopback out of the <a href="https://www.amazon.com/EXPLOMOS-NGFF-Adapter-Power-Cable/dp/B074Z5YKXJ">cheapest M.2 to PCIe x4 adapter Amazon has to offer</a> by desoldering the PCIe x4 connector and putting in twisted pairs. This worked out nicely since I could intentionally mismatch the length of one pair to get a negative result, confirming I wasn't in some internal loopback mode.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8yuMY7jouxtyw6Ed2j-XMWIVZc5JMkkMF1cmGypC60qCrtWBAQTq9PX545y4z-bcP_Tvi1dvBLs5F4he7HNmPSi-4bjPd6hE7zozn-7uWaIhvHfi9XsItpDSqfIM_-FyTEYAMDQE_bJk/s1600/c23.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8yuMY7jouxtyw6Ed2j-XMWIVZc5JMkkMF1cmGypC60qCrtWBAQTq9PX545y4z-bcP_Tvi1dvBLs5F4he7HNmPSi-4bjPd6hE7zozn-7uWaIhvHfi9XsItpDSqfIM_-FyTEYAMDQE_bJk/s640/c23.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The three-eyed...something.</td></tr>
</tbody></table>
<div style="text-align: left;">
For most of the rest of this test, I'm roughly following the script from the <a href="https://github.com/fpgadeveloper/fpga-drive-aximm-pcie">FPGA Drive example design</a> readme files, with deviations for my custom board and for Vivado 2019.1 support. The scripts there generate a Vivado project and block design with the Processing System and the XDMA PCIe Bridge. I had a few hardware differences that had to be taken care of manually (EMIO UART, inverted SSD reset signal), but having a reference design to start from was great.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The example design includes a standalone application for simply checking that a PCIe drive enumerates on the bus, but it isn't built for the ZU+. As the readme mentions, there had been no standalone driver for the XDMA PCIe Bridge. Well, as of Vivado 2019.1, there is! In SDK, the standalone project for <b>xdmapcie_rc_enumerate_example.c</b> can be imported directly from the peripheral driver list in <b>system.mss</b> from the exported hardware.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCiL9H_aBZoG4tnYsIkr6MZoljEzQoT_Vc9gAWn171L6HwD-0uyY9TjB9JmKXMTeiFWAT6OdGgt92H94w16hXOOfbY0i5dzBpVFl48HaI4x_VP-bppxxLFMvV_vJcqMfErIPtdthtfghE/s1600/c27.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="647" data-original-width="1366" height="302" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCiL9H_aBZoG4tnYsIkr6MZoljEzQoT_Vc9gAWn171L6HwD-0uyY9TjB9JmKXMTeiFWAT6OdGgt92H94w16hXOOfbY0i5dzBpVFl48HaI4x_VP-bppxxLFMvV_vJcqMfErIPtdthtfghE/s640/c27.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">XDMA standalone driver example is available as of Vivado 2019.1!</td></tr>
</tbody></table>
<div style="text-align: left;">
I installed an SSD and ran this project and much to my amazement, the enumeration succeeded. By looking at the PHY Status/Control register at offset 0x144 from the Bridge Register Memory Map base address (0x400000000 here), I was also able to confirm that link training had finished and the link was Gen3 x4. (Documentation for this is in <a href="https://www.xilinx.com/support/documentation/ip_documentation/axi_pcie3/v3_0/pg194-axi-bridge-pcie-gen3.pdf">PG194</a>.) Off to a good start, then.</div>
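When poking at registers like that one from a debug script, a tiny bitfield helper keeps things readable. This is a generic sketch; the actual shift and width of each field (link rate, link width, and so on) must be looked up in the PG194 register table, so none are hard-coded here:

```python
def field(reg_value, shift, width):
    """Extract a bitfield from a 32-bit register read: shift it down,
    then mask off `width` bits."""
    return (reg_value >> shift) & ((1 << width) - 1)
```

For example, after reading the raw 32-bit word at base + 0x144, `field(raw, shift, 2)` would pull out a two-bit field once its shift is taken from PG194.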
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvhI7MqNy1asaAt7JP0qE8M5uDKuZ1iQyJDHmhylX4JevSs0C2LAHEcDPMtr4v_x_1RhzHcx3KGqtb6D6Gnv_859Fg9fcLYnlXk3xEy-0jk0ZXHoYIsGrRT5B4abE7HzZEZXTGF2YsGCA/s1600/c32.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvhI7MqNy1asaAt7JP0qE8M5uDKuZ1iQyJDHmhylX4JevSs0C2LAHEcDPMtr4v_x_1RhzHcx3KGqtb6D6Gnv_859Fg9fcLYnlXk3xEy-0jk0ZXHoYIsGrRT5B4abE7HzZEZXTGF2YsGCA/s640/c32.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Installed a 1TB Samsung 970 EVO Plus.</td></tr>
</tbody></table>
Unfortunately, that's where the road seems to end in terms of quick and easy setup. The next stage involves PetaLinux, which is a toolchain for building the Xilinx Linux kernel. I don't know about other people, but every time the words "Linux" and "toolchain" cross my path, I automatically lose a week of time to setup and debugging. This was no exception.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Unsurprisingly, PetaLinux tools run in Linux. I went off on a bit of a tangent trying to see if they would run in <a href="https://devblogs.microsoft.com/commandline/wsl-2-is-now-available-in-windows-insiders/">WSL2</a>. They do, if you keep your project in the Linux file system: I couldn't get it to work on /mnt/c/... but it worked fine with the project in ~/home/... But WSL2 is still a bit bleeding-edge, and there's no USB support as of now. So you can build, but not JTAG boot. If you boot from an SD card, though, it might work for you.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
So I started over with a <a href="https://www.virtualbox.org/">VirtualBox VM</a> running Ubuntu 18.04, which was mercifully easy to set up. For reasons I cannot at all come to terms with, you need at least 100GB of VM disk space for the PetaLinux build environment, all to generate a boot image that measures in the 10s of MB. I understand that tools like this tend to clone in entire repositories of dependencies, but seriously?! It's larger than all of my other development tools combined. I don't need the entire history of every tool involved in the build...</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSy2kEfBDczPUY8nIWd7QIrD3qwLp4_4veY21HuACNkQuZu8a8OcsO9VcD94WdS-d6Z-3L0encvTHHomwzaMP3xy1HVjS46Ylf2iOKHLs_AWFi9aApF94MtZ5MeNq6cRfno7kFUeRQvxc/s1600/c28.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="652" data-original-width="1600" height="260" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSy2kEfBDczPUY8nIWd7QIrD3qwLp4_4veY21HuACNkQuZu8a8OcsO9VcD94WdS-d6Z-3L0encvTHHomwzaMP3xy1HVjS46Ylf2iOKHLs_AWFi9aApF94MtZ5MeNq6cRfno7kFUeRQvxc/s640/c28.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">And here I thought Xilinx was a disk space hog...</td></tr>
</tbody></table>
<div style="text-align: left;">
The build process, even not including the initial creation of this giant environment, is also painfully slow. If you are running it in a VM, throw as many cores at it as you can and then still plan to go do something else for an hour. I started from the build script in the FPGA Drive example design, making sure it targeted <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">cpu_type="zynqMP"</span> and <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">pcie_ip="xdma"</span>. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
This <i>should</i> set up the kernel properly, but some of the config options in PetaLinux 2019.1 might not exactly match the packaged configs. There's a reference <a href="http://www.fpgadeveloper.com/2016/04/conne">here</a> explaining how to manually configure the kernel for PCIe and NVMe hosting on the Z-7030. I went through that, subbing in what I thought were correct ZU+ and XDMA equivalents where necessary. Specifically:</div>
<div style="text-align: left;">
</div>
<ul>
<li>It seems like as of PetaLinux 2019.1 (v4.19.0), there's an entire new menu item under <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">Bus support</span> for <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">PCI Express Port Bus support</span>. Including this expands the menu with other PCI Express-specific items, which I left at whatever their default state was.</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_-JlH6Qu5hwLeCvVDSKi65iTffSyMh-mNkr2e0oGLYCN-5kVDFWVCeClGRmEvfMZD12ycJ6RflzzquNYntPCUE8ecELdjsxJh1BQoxQYbxLkyhgG6-sa66Q518KiJe1WIkSN1672qtiw/s1600/c29.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="423" data-original-width="710" height="237" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_-JlH6Qu5hwLeCvVDSKi65iTffSyMh-mNkr2e0oGLYCN-5kVDFWVCeClGRmEvfMZD12ycJ6RflzzquNYntPCUE8ecELdjsxJh1BQoxQYbxLkyhgG6-sa66Q518KiJe1WIkSN1672qtiw/s400/c29.png" width="400" /></a></div>
<ul>
<li>Under <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">Bus support > PCI controller drivers</span>, <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">Xilinx XDMA PL PCIe host bridge support</span> has to be included. I don't actually know if the <span style="color: #b6d7a8;"><span style="font-family: "courier new" , "courier" , monospace;">NWL PCIe Core</span></span> is also required, but I left it in since it was enabled by default. It might be the driver for the PS-side PCIe.</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm0kyHvFL70KH9EqHG5yGdefMtB-RDIMsFjAHF4sMY51o46vzl14ldDRkfjN8wz4nQiakd97WqiTjQTjwoSIuGCIrNQN-xfhOJF735iuVjEtBohe3e3OJieoFn44yR95PF3fShePVncg0/s1600/c30.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="428" data-original-width="712" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm0kyHvFL70KH9EqHG5yGdefMtB-RDIMsFjAHF4sMY51o46vzl14ldDRkfjN8wz4nQiakd97WqiTjQTjwoSIuGCIrNQN-xfhOJF735iuVjEtBohe3e3OJieoFn44yR95PF3fShePVncg0/s400/c30.png" width="400" /></a></div>
<div>
<ul>
<li>Some things related to NVMe are in slightly different places. There's an item called <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">Enable the block layer</span> on the main config page that I assume should be included. Under <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">Device Drivers</span>, <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">Block devices</span> should be included. And under <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">Device Drivers > NVME Support</span>, <span style="color: #b6d7a8; font-family: "courier new" , "courier" , monospace;">NVM Express block device</span> should also be included.</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhemQNOWI0zBCOipiUtV-w_pIf_-12Fd53zxSwXncu1AlswaWbz7J1Xr4Uckc5gZGmqT7TLycMqxCQ_A8bZ99EJJhAJeTE9a2an5aHF5HfODqRZgVFBLUdh-wW3TmKC5Qo4tXcEurIEM2A/s1600/c31.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="424" data-original-width="718" height="235" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhemQNOWI0zBCOipiUtV-w_pIf_-12Fd53zxSwXncu1AlswaWbz7J1Xr4Uckc5gZGmqT7TLycMqxCQ_A8bZ99EJJhAJeTE9a2an5aHF5HfODqRZgVFBLUdh-wW3TmKC5Qo4tXcEurIEM2A/s400/c31.png" width="400" /></a></div>
<div style="text-align: center;">
<br /></div>
</div>
<div style="text-align: left;">
The rest of the kernel and rootfs config seems to match the Z-7030 setup linked above pretty closely. But I will admit it took me three attempts to create a build that worked, and I don't know exactly what trial-and-error steps I did between each one. Even once the correct controller driver (<b>pcie-xdma-pl.c</b>) was being included in the build, I couldn't get it to compile successfully without <a href="https://github.com/Xilinx/linux-xlnx/commit/bc110c1e7da48439835265a0fbe9f8fc57cad752#diff-df396aeff136bd2a390eaa1fe1be8639">this patch</a>. I don't know the story behind that, but with the patch applied I finally got a build that would enumerate the SSD on the PCIe bus:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg54JEc_WGiHJY0fI6CA9av67TfVK_Se9gLdIc5scKE1i-s6hqXOonlzSrHsZg_WaFnIHUb9CwlcZQTvC9ALSu3fSuFczLGex87hIKPmRvfRVhsm9GU-2cO2p5y12QNTuU9tdlhk7ra4fc/s1600/c33.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="414" data-original-width="877" height="302" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg54JEc_WGiHJY0fI6CA9av67TfVK_Se9gLdIc5scKE1i-s6hqXOonlzSrHsZg_WaFnIHUb9CwlcZQTvC9ALSu3fSuFczLGex87hIKPmRvfRVhsm9GU-2cO2p5y12QNTuU9tdlhk7ra4fc/s640/c33.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Output from <b>lspci -vv </b>confirms link speed and width.</td></tr>
</tbody></table>
<div style="text-align: left;">
I had already partitioned the drive off-board, so I skipped over those steps and went straight to the speed tests as described <a href="http://www.fpgadeveloper.com/2016/07/measuring-the-speed-of-an-nvme-pcie-ssd-in-petalinux.html">here</a>. I tested a few different block sizes and counts with pretty consistent results: about <span style="color: yellow;">460MB/s write</span> and <span style="color: yellow;">630MB/s read</span>.</div>
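The conversion from dd-style block transfers to those throughput figures is just arithmetic (decimal MB, matching how dd reports):

```python
def throughput_mb_s(block_size_bytes, count, seconds):
    """MB/s (decimal megabytes) for a dd-style transfer of `count` blocks
    of `block_size_bytes` each, completed in `seconds`."""
    return block_size_bytes * count / seconds / 1e6

# For instance, 1 GiB (1024 blocks of 1 MiB) taking about 2.33 s works
# out to roughly the ~460 MB/s write figure above.
```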
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_evwsZPn5FJWsAHL2dcct2l5ERD8DJukiU7jIGDaYWE0C1loY9E_kQ-H3WlhRr_25m5kjYI6-IRM5ycplxNfuisjnhA0QGpaOLb9mKXOt1oxVBYitR-ud163zmCUg2ZrJHjhNr0j9980/s1600/c35.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="574" data-original-width="1275" height="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_evwsZPn5FJWsAHL2dcct2l5ERD8DJukiU7jIGDaYWE0C1loY9E_kQ-H3WlhRr_25m5kjYI6-IRM5ycplxNfuisjnhA0QGpaOLb9mKXOt1oxVBYitR-ud163zmCUg2ZrJHjhNr0j9980/s640/c35.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Not sure about those correctable errors. I guess it's better than un-correctable errors.</td></tr>
</tbody></table>
<div style="text-align: left;">
That <i>is</i> actually pretty fast compared to the Z-7030 benchmark; the ZU+ and the new driver seem able to make much better use of the SSD. But it's still about a factor of two below what I want. There could be some extra performance to squeeze out through driver optimization, but at this point I feel the effort will be better spent looking into hardware acceleration, which <a href="https://www.youtube.com/watch?time_continue=34&v=ivcm2nwGsQM">has been demonstrated</a> to reach 1GB/s write speeds, even on older hardware.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Since there's no published datasheet or pricing information for that or any other NVMe hardware accelerator, I'm not inclined to even consider it as an option. At the very least, I plan to read through the open specification and see what is actually required of an NVMe host. If it's feasible, I'd definitely prefer an ultralight custom core to a black-box IP...but that's just me. In the meantime, I have some parallel development paths to work on.</div>
</div>
</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com26tag:blogger.com,1999:blog-8200098102909041178.post-14893074330469405112019-06-13T01:28:00.000-04:002020-04-18T22:44:30.864-04:00Freight Train of PixelsI have a problem. After any amount of time at any level of development of anything, I feel the urge to move down one layer into a place where I really shouldn't be. Thus, after spending time implementing capture software for my <a href="http://scolton.blogspot.com/p/video.html#gs3"><strike>Point Grey</strike> FLIR block cameras</a>, I am now tired of dealing with USB cables and drivers and firmware and settings.<br />
<br />
What I want is image data. Pixels straight from a sensor. As many as I can get, as fast as I can get them. To quote Jeremy Clarkson from The Great Train Race (<a href="https://www.topgear.com/videos/top-gear-tv/great-train-race-part-14-series-13-episode-1">Top Gear S13E1</a>), "Make millions of coals go in there." Except instead of coals, pixels. And instead of millions, trillions. It doesn't matter how. I mean, it does, but I want the only real constraints to be where the pixels are coming from and where they are going. So let's see what's in this rabbit hole.<br />
<h4>
The Source</h4>
<div>
The image sensor feeding this monster will be an ams (formerly CMOSIS) <a href="https://ams.com/cmv12000">CMV12000</a>. It's got lots of pros and a few cons for this type of project, which I'll get into in more detail. But the main reason for the choice is entirely non-technical: This is a sensor that I can get a <a href="https://ams.com/cmv12000#tab/documents">full datasheet</a> for and <a href="https://ams.com/cmv12000#tab/shop-now">purchase</a> without any fucking around. This was true even back in the CMOSIS days, but as an active ams part it's now documented and distributed the same way as their $1 ICs.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioCGko5my065A3wUT8ydaEnf7TovVid40ES864RwE1S-74UjaxQxlwcgjTJ7xvD1RiIsg048BWP4ZVnyTii2RwjDUbCJFqmN1MN3GdmKjFOtQiZWAM6q-aGiYJjIVbVKLhISgBPUgpbrE/s1600/c17.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1026" data-original-width="1600" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioCGko5my065A3wUT8ydaEnf7TovVid40ES864RwE1S-74UjaxQxlwcgjTJ7xvD1RiIsg048BWP4ZVnyTii2RwjDUbCJFqmN1MN3GdmKjFOtQiZWAM6q-aGiYJjIVbVKLhISgBPUgpbrE/s640/c17.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The CMV12000 is not $1, sadly, but you can Buy It Now if you really want. For prototyping, I have two monochrome ones that came from a heavily-discounted surplus listing. Hopefully they turn on.</td></tr>
</tbody></table>
</div>
<div>
This is a case, then, where the available component drives the design. The CMV12000 is <i>not</i> going to win an image quality shootout with a 4K Sony sensor, but it <i>is</i> remarkably fast for its resolution: up to <span style="color: yellow;">300fps at 4096x3072</span>. That's 3.8Gpx/s, somewhere between the total camera interface on a <a href="https://youtu.be/Ucp0TTmvqOE?t=4859">Tesla Full Self Driving Chip</a> (2.5Gpx/s) and the imaging rate of a <a href="https://www.phantomhighspeed.com/products/cameras/4kmedia/flex4k">Phantom Flex 4K</a> (8.9Gpx/s). There was a <a href="https://ams.com/documents/20143/36005/CMV12000_AN000462_1-00.pdf">version jump</a> on this sensor that I think moved it into a different category of speed, and that's where I'm placing the lever for pushing this design.</div>
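For the record, the arithmetic behind that figure:

```python
# 4096 x 3072 pixels per frame, 300 frames per second:
PIXEL_RATE = 4096 * 3072 * 300  # pixels per second
print(PIXEL_RATE / 1e9)         # ~3.77 Gpx/s, rounded to 3.8 in the text
```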
<div>
<br /></div>
<div>
The CMV12000 is also a global shutter CMOS sensor, something more common in industrial and machine vision applications than consumer cameras. The entire frame is sampled at once, instead of row-by-row as in rolling shutter CMOS. (The <a href="https://www.youtube.com/watch?v=nP1elMR5qjc">standupmaths</a> video on the topic is my favorite.) The advantage is that moving objects and camera panning don't create distortion, which is arguably just correct behavior for an image sensor... But although a few pro cameras with global shutter have existed, even those have mostly died out. This is due to an interlinked set of trade-offs that give rolling shutter designs the advantage in cost and/or dynamic range.</div>
<div>
<br /></div>
<div>
For engineering applications, though, a global shutter sensor with an external trigger is essentially a visual oscilloscope, and can be useful beyond just creating normal video. By synchronizing the exposure to a periodic event, you can measure frequencies or visualize oscillations well beyond the frame rate of the sensor. Here's an example of my global shutter Grasshopper 3 camera capturing the cycle of a pixel shifting DLP projector. Each state is 1s/720 in duration, but the trigger can be set to any multiple of that period, plus or minus a tiny bit, to capture the sequence with an effective frame rate much higher than 720fps.</div>
<div>
<br /></div>
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/YtBDJjSPTlo" width="640"></iframe>
<br />
<div>
<br /></div>
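The strobe-trigger arithmetic behind that capture can be sketched as follows. The numbers here are illustrative, not the exact settings used in the video:

```python
# Each DLP state lasts 1/720 s. Triggering every n states plus a small
# offset `delta` advances the capture phase by delta per frame, so the
# reconstructed sequence has an effective frame time of delta even
# though the camera itself runs far slower.
STATE_T = 1 / 720  # seconds per projector state

def trigger_period(n, delta):
    """Actual trigger period: n whole states plus a small phase step."""
    return n * STATE_T + delta

def effective_fps(delta):
    """Apparent sample rate of the reconstructed sequence."""
    return 1 / delta
```

With n = 10 and delta = 10&nbsp;&micro;s, the camera triggers at about 72 fps, but the reconstructed motion is effectively sampled at 100,000 fps.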
<div>
Whether a consequence of the global shutter or not, the main on-paper shortcoming of the CMV12000 is the relatively high dark noise of <span style="color: red;">13e-</span>. For comparison, the <a href="https://www.sony-semicon.co.jp/products_en/new_pro/may_2017/imx294cjk_e.html">Sony IMX294CJK</a>, the 4K sensor in some new cameras with very good low-light capability, is below 2e-. That's a rolling shutter sensor, though. Sony also makes low-noise global shutter CMOS sensors like the <a href="https://www.sony-semicon.co.jp/products_en/IS/sensor0/img/product/cmos/IMX253_255LLR_LQR_Flyer.pdf">IMX253</a>, at around 2.5e-. The extra noise on the CMV12000 will mean that it needs more light for the same image quality compared to these sensors.</div>
<div>
<br /></div>
<div>
Even given adequate light, the higher noise also eats into the dynamic range of the sensor. The signal-to-noise ratio for a given saturation depth will be lower. This means either noisy shadows or blown-out highlights. But the CMV12000 has a feature I haven't seen on any other commercially-available sensor: a per-pixel stepped partial reset. The theory is to temporarily stop accumulating charge on bright pixels when they hit intermediate voltages, while allowing dark pixels to keep integrating. Section 4.5.1 in <a href="https://ora.ox.ac.uk/objects/uuid:a97cabab-5058-4267-9a0c-559d40af300a">this thesis</a> has more on this method.<br />
<br />
In the example below, the charge reading is simulated for 16 stops of contrast. With baseline lighting, the bottom four stops are lost in the noise and the top four are blown out. Increasing the illumination by 4x recovers two stops on the bottom, but loses two on top. The partial reset capability slows down the brightest pixels, recovering several more stops on top without affecting the dark pixels. The extra light is still needed to overcome the dark noise, but it's less of an issue in terms of dynamic range.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg6ol4BHTWgXQ7Fmmr8LvrtZeCcvhMiZMvsMwO420yy8QCjGipO05e4PK-wMFOQa21KKf1LSuysZ4-Y4v7433zZCtFLm0x_hsvG3xDG5CvrosojeclkOPi1dbniVFve9RuUusG0alYeZU/s1600/c08.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1079" data-original-width="1600" height="430" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg6ol4BHTWgXQ7Fmmr8LvrtZeCcvhMiZMvsMwO420yy8QCjGipO05e4PK-wMFOQa21KKf1LSuysZ4-Y4v7433zZCtFLm0x_hsvG3xDG5CvrosojeclkOPi1dbniVFve9RuUusG0alYeZU/s640/c08.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Dynamic range recovery using 3-stage partial reset.</td></tr>
</tbody></table>
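A minimal model of the stepped partial reset looks like this: charge integrates linearly, but at each reset point it's clamped down to that stage's level if it has overshot, compressing only the bright end of the response. The reset times and levels here are invented for illustration; see the thesis linked above for the real scheme:

```python
def pixel_response(rate, resets, full_well=1.0, t_end=1.0):
    """Simulate one pixel with stepped partial reset. Charge integrates
    linearly at `rate` (full wells per exposure); at each reset time it
    is clamped to that stage's level if it has overshot. Dark pixels are
    never clamped and stay linear; bright pixels get a knee-shaped,
    compressed response. Sketch only, with made-up reset parameters."""
    q, t = 0.0, 0.0
    for t_k, v_k in resets:
        q = min(q + rate * (t_k - t), v_k)
        t = t_k
    return min(q + rate * (t_end - t), full_well)

# Three invented reset stages: (time, clamp level), both as fractions.
resets = [(0.6, 0.45), (0.85, 0.7), (0.96, 0.88)]

# A dark pixel (0.5 full wells per exposure) reads out linearly...
print(round(pixel_response(0.5, resets), 6))  # 0.5
# ...while a pixel charging at 2x the full-well rate avoids saturation:
print(round(pixel_response(2.0, resets), 6))  # 0.96
```

With these numbers, pixels up to roughly 3x the full-well rate stay below saturation, which is the extra headroom on the bright side that the plot above shows.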
<div style="text-align: left;">
The end result of partial reset is a non-linear pixel response to illumination. This is often done anyway, after the ADC conversion, to create log formats that compress more dynamic range into fewer bits per pixel. Having hardware that does something similar in-pixel, before the ADC, is a powerful feature that's not at all common.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Another aspect of the CMV12000 that helps with implementation is the pixel data interface: the data is spread out on 64 parallel LVDS output pairs that each serve a group of pixel columns. This extra-wide bus means more reasonable clock speeds: 300MHz DDR (600Mb/s) for full rate. A half-meter wavelength means wide intra-pair routing tolerances. There is still a massive 4.8ns inter-channel skew that has to be dealt with, but it would be futile to try to length match that. The sensor does put out training data meant for synchronizing the individual channels at the receiver, which is a headache I plan to have in the future.</div>
<div style="text-align: left;">
<h4>
The Sink</h4>
</div>
<div style="text-align: left;">
I'm starting from the assumption that it's impossible to really do anything permanent with 38Gb/s of data if you're working with hardware at or below the level of a laptop PC. In an early concept, I was planning to just route the data to a PCIe x4 output and send it into something like an Intel NUC for further processing. But even that isn't fast enough for the CMV12000. (Also, you can buy <a href="https://www.ximea.com/en/products/xilab-application-specific-custom-oem/embedded-vision-and-multi-camera-setup-xix/cmosis-cmv12000-color-4k-embedded-camera">something like that</a> already. No fun.) And even if you could set up a 40Gb/s link to a host PC through something like Thunderbolt 3, it's really just kicking the problem down the road to more and more general hardware, which probably means more watts per bit per second.<br />
<br />
Ultimately, unless the data is consumed immediately (as with a machine vision algorithm that uses one frame and then discards it), or buffered into RAM as a short clip (as with circular buffers in high-speed cameras), the only way to sink this much data reasonably is to compress it. <span style="color: white;">And this is where this project goes off the rails a little.</span><br />
<br />
For starters, I'm choosing <span style="color: yellow;">1GB/s</span> as a reasonable sink rate for the data. This is within reach of NVMe SSD write speeds, and makes for completely reasonable recording times of 17min/TB (at maximum frame rate). This is very light compression, as far as video goes - less than <span style="color: yellow;">5:1</span>. I think the best tool for the job is probably wavelet compression, rather than something like h.265. It's intra-frame and uses relatively simple logic, which means fast and cheap. But putting aside the question of how fast and how cheap for now, I first just want to make sure the quality would be acceptable.<br />
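The arithmetic behind those numbers, assuming the CMV12000's full-resolution 10-bit readout (4096&#215;3072 at 300fps):

```python
# CMV12000 at full rate: 4096 x 3072 pixels, 10 bits/pixel, 300 fps.
pixels = 4096 * 3072
source_gbps = pixels * 10 * 300 / 1e9   # raw rate off the sensor
sink_gBps = 1.0                          # target SSD write rate, GB/s
ratio = source_gbps / 8 / sink_gBps      # required compression ratio
minutes_per_tb = 1000 / sink_gBps / 60   # recording time per TB

print(f"source: {source_gbps:.1f} Gb/s")          # ~37.7 Gb/s
print(f"compression needed: {ratio:.1f}:1")       # ~4.7:1
print(f"recording: {minutes_per_tb:.0f} min/TB")  # ~17 min/TB
```

So "less than 5:1" is exactly the gap between the sensor's raw output and a 1GB/s SSD.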
<br />
There are several good examples of wavelet compression already in use: <a href="http://www.cs.tut.fi/~tabus/course/SC/246pagesCourseonJPEG2000.pdf">JPEG2000</a> uses different variants for lossless and lossy image compression. <a href="https://www.red.com/red-101/redcode-file-format">REDCODE</a> is wavelet-based and 5:1 is a standard setting described as "visually lossless". <a href="https://github.com/gopro/cineform-sdk">CineForm</a> is a wavelet codec recently open-sourced by GoPro. The SDK for CineForm includes a lightweight example project that just compresses a monochrome image with different settings. Running a test image through that with settings close to 5:1 produces good results:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXl9eA5ZoIjN73M8fRmmM-SMtcYw6lU89Kx2CfdjSWgMa1d9T9AFf1Vh5DRfGQfPRDL6Tv1MhyphenhyphenrMs2QMbKvVyx5FDSa38LFJzw8Nk8blt-_VLuw03F-SfniDy_D7YgSGKgy5NMpW0ixAQ/s1600/c12.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="720" data-original-width="1280" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXl9eA5ZoIjN73M8fRmmM-SMtcYw6lU89Kx2CfdjSWgMa1d9T9AFf1Vh5DRfGQfPRDL6Tv1MhyphenhyphenrMs2QMbKvVyx5FDSa38LFJzw8Nk8blt-_VLuw03F-SfniDy_D7YgSGKgy5NMpW0ixAQ/s640/c12.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The original monochrome image.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-s-tSoQ8me77sJs0pONI0LIb77zr5ash1kkkSSZiqgpVn6VSK7lN4BOTY6zAHs2Evd4So5BF4Wjv7FExnpnsh01GtyiVN2xvmnKruRspFSDj50Tm6_9Fd1fWFCNuYotALqKfwsrhpJS0/s1600/c13.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="720" data-original-width="1280" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-s-tSoQ8me77sJs0pONI0LIb77zr5ash1kkkSSZiqgpVn6VSK7lN4BOTY6zAHs2Evd4So5BF4Wjv7FExnpnsh01GtyiVN2xvmnKruRspFSDj50Tm6_9Fd1fWFCNuYotALqKfwsrhpJS0/s640/c13.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The wavelet transform outputs a 1/8-scale low-frequency thumbnail and three stages of quantized high-frequency blocks, which are sparse and easy to compress. I just zipped this image as a test and got a 5.7:1 ratio with these settings.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguyoN7Y1SB1yx3IIUe2unr1VCyCk8TDq1vocvmPHrI_Osc9xy2InftgIcYoEiBnmoukdKsUR9AaBeY69OKKj6W3LAs-YTyuEePuzc4kizJPIiGnbcgQgrgxIPlXe0ULdCF0vf0_CqNcTo/s1600/c14.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="720" data-original-width="1280" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguyoN7Y1SB1yx3IIUe2unr1VCyCk8TDq1vocvmPHrI_Osc9xy2InftgIcYoEiBnmoukdKsUR9AaBeY69OKKj6W3LAs-YTyuEePuzc4kizJPIiGnbcgQgrgxIPlXe0ULdCF0vf0_CqNcTo/s640/c14.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The recovered image.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUmgdnxS2amoYUG1lxKVRiVXf2a78bdAPKsif23oYucbHm9SwC9BXAkKzzIfk9MMO5BNJJYR1Bip6arH0sjV9NAacyDCED2gWGju7YKuODGMmOpnyLe6oG1htmzpbzIbl8nGVBoXHtgVM/s1600/c09.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="825" data-original-width="1166" height="452" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUmgdnxS2amoYUG1lxKVRiVXf2a78bdAPKsif23oYucbHm9SwC9BXAkKzzIfk9MMO5BNJJYR1Bip6arH0sjV9NAacyDCED2gWGju7YKuODGMmOpnyLe6oG1htmzpbzIbl8nGVBoXHtgVM/s640/c09.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Since these images are going to be destroyed by rescaling anyway, here's a 400% zoom of some high-contrast features.</td></tr>
</tbody></table>
<div style="text-align: center;">
<br /></div>
The choice of wavelet type does matter, but I think the quantization strategy is even more important. The wavelet transform doesn't reduce the size of the data; it just splits it into low-frequency and high-frequency blocks. In fact, for all but the simplest wavelets, the blocks require more bits to store than the original pixels:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibzOEzCS3XjYcnUdOPCNwEv6zNbrM6w8x4_NfRC2eTISI81HnSz6NVzYMK2ZHipCG89Exl6H5KRoWJFSNxQF9l8zimdwmEHvkMl9uBk0J0dKMuRARSbGEigtnFzJAgW2cpkaVsg83WPB0/s1600/wv02.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1154" data-original-width="1430" height="515" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibzOEzCS3XjYcnUdOPCNwEv6zNbrM6w8x4_NfRC2eTISI81HnSz6NVzYMK2ZHipCG89Exl6H5KRoWJFSNxQF9l8zimdwmEHvkMl9uBk0J0dKMuRARSbGEigtnFzJAgW2cpkaVsg83WPB0/s640/wv02.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Output range maps for different wavelets. All but the simplest wavelets (Haar, Bilinear) have corner cases of low-frequency or high-frequency outputs that require one extra bit to store.</td></tr>
</tbody></table>
<div style="text-align: center;">
<div style="text-align: left;">
<div style="text-align: left;">
Take the CineForm 2/6 wavelet (a.k.a. <a href="http://wavelets.pybytes.com/wavelet/rbio1.3/">reverse biorthogonal 1.3</a>?) as an example: the low-frequency block is just an average of two adjacent pixels, so it doesn't need any more bits than the source data. But the high-frequency blocks look at six adjacent pixels and could, for some corner cases, give a result that's larger than the maximum pixel amplitude. It needs one extra bit to store the result without clipping. Seems like we're going in the wrong direction!<br />
<br /></div>
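As a concrete sketch, here is one level of a 2/6 forward transform as I understand it: a 2-tap low-pass (sum of each pixel pair) and the high-pass written in the usual (-1,-1,8,-8,1,1)/8 form, with simple edge mirroring. This is my own reading of the filter, not code from the CineForm SDK:

```python
def wavelet_26_level(x):
    """One level of a 2/6 wavelet. Low-pass: sum of each pixel pair.
    High-pass: 6 taps (-1,-1,8,-8,1,1)/8, which reduces to the pair
    difference plus a small correction from the neighboring low-pass
    sums. Edges handled by mirroring. Sketch only, not CineForm code."""
    assert len(x) % 2 == 0
    low = [x[2 * i] + x[2 * i + 1] for i in range(len(x) // 2)]
    n = len(low)
    high = []
    for i in range(n):
        lm = low[i - 1] if i > 0 else low[0]          # mirror at left edge
        lp = low[i + 1] if i < n - 1 else low[n - 1]  # mirror at right edge
        high.append((x[2 * i] - x[2 * i + 1]) + (lp - lm) / 8.0)
    return low, high

# The high-pass output vanishes on smooth regions...
_, h_flat = wavelet_26_level([7] * 8)
print(h_flat)  # [0.0, 0.0, 0.0, 0.0]

# ...but a worst-case edge in 10-bit data (0..1023) overshoots the
# input range, which is why the coefficients need one extra bit:
_, h_edge = wavelet_26_level([0, 0, 1023, 0, 1023, 1023])
print(max(h_edge))  # 1278.75
```

The overshoot only happens for pathological corner cases; typical image content keeps the high-pass coefficients small, which is exactly what the quantizer exploits.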
<div style="text-align: left;">
</div>
<div style="text-align: left;">
As with most image compression techniques, the key fact is that the high-frequency information is less valuable, and can be manipulated or even discarded with less visual penalty. By applying a deadband and quantization step to the high-frequency blocks, the data becomes sparser and easier to compress. Since this is the lossy part of the algorithm, the details are hugely important. I have a little sandbox program that I use to play with different wavelet and quantization settings on test images. In most cases, 5:1 compression is very reasonable.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqb714dnLd4qKZw6Zif5ovZyOxLxM8dm_hqrvPB1gDHmHZSConeGdvRB1R27PIXNoVHTl8ASetYM-zY3AFIkP2Hqu1CiSlktBpxDTVW3hPfy1IgLZRN3jLY9K8WlSvaU9tWiJHfMkQZ2s/s1600/c15.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="802" data-original-width="1424" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqb714dnLd4qKZw6Zif5ovZyOxLxM8dm_hqrvPB1gDHmHZSConeGdvRB1R27PIXNoVHTl8ASetYM-zY3AFIkP2Hqu1CiSlktBpxDTVW3hPfy1IgLZRN3jLY9K8WlSvaU9tWiJHfMkQZ2s/s640/c15.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Different wavelets and quantizer settings can be compared quickly in this software sandbox.</td></tr>
</tbody></table>
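The deadband-plus-uniform-step idea fits in a few lines. The deadband width and step size below are arbitrary example parameters, not the sandbox's actual settings:

```python
def quantize(coeffs, deadband, step):
    """Deadband quantizer for high-frequency wavelet coefficients:
    values inside +/-deadband collapse to zero; the rest are uniformly
    quantized with the given step size."""
    out = []
    for v in coeffs:
        if abs(v) <= deadband:
            out.append(0)
        else:
            sign = 1 if v > 0 else -1
            out.append(sign * int((abs(v) - deadband) // step + 1))
    return out

coeffs = [0.4, -1.2, 7.0, -0.8, 15.5, 0.1, -6.0, 2.0]
print(quantize(coeffs, deadband=1.0, step=4.0))
# -> [0, -1, 2, 0, 4, 0, -2, 1]
```

The runs of zeros are what make the quantized blocks so easy for a generic entropy coder (or even zip, as in the experiment above) to compress.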
That's enough evidence for me that wavelet compression is a completely acceptable trade-off for opening up the possibility of sinking to a normal 1TB SSD instead of an absurd amount of RAM. A very fast RAM buffer is still needed to smooth things out, but it can be limited in size to just as many frames as are needed to ride out pipeline transients. Now, with the source and sink constraints defined, what the hell kind of hardware sits in the middle?</div>
<h4 style="text-align: left;">
The Pipe</h4>
<div>
There was never any doubt that the entrance to this pipeline had to be an FPGA. Nothing else can deal with 64 LVDS channels. But instead of just repackaging the data for PCIe and passing it along to some poor single board computer to deal with, I'm now asking the FPGA to do everything: read in the data, perform the wavelet compression, and write it out to an SSD. This will ultimately be smaller and cheaper, since there's no need for a host computer, but it means a much fancier FPGA.</div>
<div>
<br /></div>
<div>
I'm starting from scratch here, so all of this is just an educated guess, but I think a viable solution lies somewhere in the spectrum of <a href="https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html#productTable">Xilinx Zynq Ultrascale+</a> devices. They're FPGA fabric bolted to ARM cores in a single chip. Based on the source and sink requirements, I can narrow down further to something between the ZU4 and ZU7. (Anything below the ZU4 lacks the transceivers needed for PCIe Gen3 x4 to the SSD, and anything above the ZU7 is prohibitively expensive.) Within each ZU number, there are also three categories: CG has no extra hardware, EG has a GPU, and EV has a GPU and h.264/h.265 codec.</div>
<div>
<br /></div>
<div>
In the interest of keeping development cost down, I'm starting with the bottom of this window, the ZU4CG. The GPU and video codec might be useful down the road for 30fps previews or making proxies, but they're too slow to be part of the main pipeline. Since they're fairly sideways-compatible, I think it's reasonable to start small and move up the line if necessary.</div>
<div>
<br /></div>
<div>
I really want to avoid laying out a board for the bare chip, its RAM, and its other local power supplies and accessories. The <a href="http://zedboard.org/product/ultrazed-ev">UltraZed-EV</a> <i>almost</i> works, but it doesn't break out enough of the available LVDS pins. It's also only available with the ZU7EV, the very top of my window. The <a href="https://shop.trenz-electronic.de/de/Produkte/Trenz-Electronic/TE08XX-Zynq-UltraScale/">TE08xx Series</a> of boards from Trenz Electronic is perfect, though, covering a wider range of the parts and breaking out enough IO. I picked up the <a href="https://shop.trenz-electronic.de/de/TE0803-02-04CG-1EA-MPSoC-Modul-mit-Xilinx-Zynq-UltraScale-ZU4CG-1E-2-GByte-DDR4-5-2-x-7-6-cm">ZU4CG version</a> for less than the cost of just the <a href="https://www.digikey.com/product-detail/en/xilinx-inc/XCZU4CG-1SFVC784I/122-2262-ND/7034579">ZU4CG on Digi-Key</a>.</div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKCD6QeOUGwqXsHnPUSHQa0MPnbgIlHJU6fFBwnxZ3Cdhf135O6uX4VvSFMvTtqBTxjbcRu5HJwCYKSgDG6vAwgv_ypv_nydfl3GwXOoejAnDdo5FPCkafoAUyB-WEcmqv0ImVazhATNU/s1600/c18.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="960" data-original-width="1600" height="382" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKCD6QeOUGwqXsHnPUSHQa0MPnbgIlHJU6fFBwnxZ3Cdhf135O6uX4VvSFMvTtqBTxjbcRu5HJwCYKSgDG6vAwgv_ypv_nydfl3GwXOoejAnDdo5FPCkafoAUyB-WEcmqv0ImVazhATNU/s640/c18.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Credit card-sized TE0803 board with the ZU4CG and 2GB of RAM. Not counting the FPGA, the processing power is actually a good deal less than what's on a modern smartphone.</td></tr>
</tbody></table>
</div>
<div>
One small detail I really like about the TE0803 is that the RAM is wired up as 64-bit wide. Assuming the memory controller can handle it, that would be over 150Gb/s for DDR4-2400, which dwarfs even the CMV12000's data rate. I <i>think</i> the RAM buffer will wind up on the compressed side of the pipeline, but it's good to know that it has the bandwidth to handle uncompressed sensor data too, if necessary.</div>
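That bandwidth figure checks out with simple arithmetic (peak numbers, ignoring refresh and bus-turnaround overhead):

```python
# DDR4-2400 transfers 2400 MT/s per pin; the TE0803 wires up 64 pins.
bus_width_bits = 64
transfer_rate_mts = 2400
ram_gbps = bus_width_bits * transfer_rate_mts / 1000  # peak bandwidth
sensor_gbps = 38.4                                    # CMV12000 full rate

print(f"RAM peak: {ram_gbps:.1f} Gb/s")                    # 153.6 Gb/s
print(f"headroom: {ram_gbps / sensor_gbps:.1f}x sensor")   # 4.0x
```

Even at realistic controller efficiency, that leaves comfortable margin for buffering uncompressed frames if the pipeline ends up needing it.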
<div>
<br /></div>
<div>
Time for a motherboard:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTLjwnOwBJyOYMkOktU47tc5uow2UrCey8sN7mDISVWlxBBdCmdonQXSehxhYjgjj2hDZ6tbilfVmPxAh0KK92q0i8so8yTr0Fyqxk-vEyu7496VEOego-KEqrwfUezwhRvi5C4rJlk5Y/s1600/c07.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="861" data-original-width="1600" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTLjwnOwBJyOYMkOktU47tc5uow2UrCey8sN7mDISVWlxBBdCmdonQXSehxhYjgjj2hDZ6tbilfVmPxAh0KK92q0i8so8yTr0Fyqxk-vEyu7496VEOego-KEqrwfUezwhRvi5C4rJlk5Y/s640/c07.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The "tall" side has the TE0803 headers, an M.2 connector, USB-C, a microSD slot, power supplies, and an STM32F0 to act as a sort-of power/configuration supervisor. Sensor pins are soldered on this side.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTk6LjCPyFnFnd8NsinfqvDSsP_v-1J9Z360MrIOl7KZtu-PPPf4SbZnjvFaEpRgIvzKp4npm_mbALS5h_u31nRt_bTtgHr3es0Te-Q5B-BeXjvE-zdQBTfRhwkbfwKmwe1JhGUiO-mgE/s1600/c06.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="932" data-original-width="1600" height="372" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTk6LjCPyFnFnd8NsinfqvDSsP_v-1J9Z360MrIOl7KZtu-PPPf4SbZnjvFaEpRgIvzKp4npm_mbALS5h_u31nRt_bTtgHr3es0Te-Q5B-BeXjvE-zdQBTfRhwkbfwKmwe1JhGUiO-mgE/s640/c06.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The "short" side has just the sensor and some straggler passives that are under 1mm tall.</td></tr>
</tbody></table>
<div>
Aside from the power supplies, this board is really just a breakout for the TE0803, and the placement of everything is driven by where the LVDS- and PCIe-capable pins are. Everything is a differential pair, pretty much. There are a bunch of different target impedances: 100Ω for LVDS, 85Ω for PCIe Gen3, 90Ω for USB. I was happy to find that <a href="https://jlcpcb.com/">JLCPCB</a> offers a standard 6-layer controlled-impedance stackup. They even have their own <a href="https://jlcpcb.com/client/index.html#/impedanceCalculation">online calculator</a>. I probably still fucked up somehow, but hopefully at least some of it is right so I can start prototyping the software.</div>
<div>
<br /></div>
<div>
Software? Hardware? What do you call FPGA logic? There are a bunch of somewhat independent tasks to deal with on the chip. At the input side, the pixel data needs to be synchronized using training data to deal with the massive 4.8ns inter-channel skew. The FPGA inputs have a built-in delay tap, but it maxes out at 1.25ns. You can, in theory, cascade these with the adjacent unused output delays, to reach 2.5ns. That's obviously not enough to directly cancel the skew, but it <i>is</i> enough to reach the next 300MHz clock edge. So, possibly some combination of cascaded hardware delays and intentional bit slipping can cover the full range. It's going to be a nightmare.</div>
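The bookkeeping for that combination can be sketched as follows: slip whole bit periods first, then absorb the remainder with the delay taps (the 1.25ns single / 2.5ns cascaded limits are from the text above; everything else is illustrative, not the actual deskew logic):

```python
def deskew_plan(skew_ps, bit_period_ps=1667, cascaded_delay_ps=2500):
    """Split a channel's required alignment into whole-bit slips plus a
    fine delay. The fine part is always under one bit period (1667 ps at
    600 Mb/s), so it can exceed a single 1250 ps tap but always fits the
    cascaded 2500 ps delay. Illustrative sketch only."""
    slips, fine = divmod(skew_ps, bit_period_ps)
    assert fine <= cascaded_delay_ps  # always true: fine < bit period
    return slips, fine

# Worst-case 4.8 ns inter-channel skew, plus some intermediate values:
for skew in (0, 1200, 2400, 3600, 4800):
    slips, fine = deskew_plan(skew)
    print(f"skew {skew:>4} ps -> slip {slips} bits + {fine:>4} ps tap delay")
```

The point is that no single channel ever needs more than one bit period of fine delay, so cascaded taps plus bit slipping really do cover the full 4.8ns range.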
<div>
<br /></div>
<div>
The output side might be even worse. Just look at the number of differential pairs going into the TE0803 headers vs. the number coming out. That's the ratio of how much tighter the timing tolerance is on the PCIe outputs. The edge of one bit won't hit the M.2 connector until a couple more have already left the FPGA. In this case, I have taken the effort to length-match the pairs themselves. I won't know how close I am until I can do a loopback test.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0eng7sfvk8Kf_2snSIDcpuA2f3wphR6qGoxGZyeFDMLL51bLAxpXBllXyBK2WtS-HKAMDd58YB-V8mK5OOHrTfIBL-0dg7mV2vXFBleKU9Z_tUvLyCpvAefaG9RQXdkEsHAZnjdAUlks/s1600/c16.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1028" data-original-width="1600" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0eng7sfvk8Kf_2snSIDcpuA2f3wphR6qGoxGZyeFDMLL51bLAxpXBllXyBK2WtS-HKAMDd58YB-V8mK5OOHrTfIBL-0dg7mV2vXFBleKU9Z_tUvLyCpvAefaG9RQXdkEsHAZnjdAUlks/s640/c16.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Length matching the PCIe differential pairs to make up for the left turns and TE0803 routing.</td></tr>
</tbody></table>
<div style="text-align: left;">
Even assuming the routing is okay, there's the problem of NVMe. NVMe is an open specification for what lives on top of the PCIe protocol stack to control communication with the SSD. It's built into Linux, including versions that can run on the ZU4CG's ARM cores. But that puts the operating system in the pipeline, which sounds like a disaster. I haven't seen any examples of that running at anywhere near 1GB/s. I think hardware-accelerated NVMe might work, but as far as I can tell there are no license-free NVMe cores in existence. I don't have a solution to this problem yet, but I will happily sink hours into anything that prevents me from having to deal with IP vendors.<br />
<br />
Sitting right in the middle, between these input and output constraints, is the complete mystery that is the wavelet core. This has to be done in hardware. The ARM cores and even the GPU are just not fast enough, and even if they were, accessing intermediate results would quickly eat the RAM bus. The math operations involved are so compact, though, that it seems natural to implement them in tiny logic/memory cores and then put as many of them in parallel as possible.<br />
<br />
The wavelet cores are the most interesting part of this pipeline and require a separate post to cover in enough detail to be meaningful. I have a ton of references on the theory and a rough concept for how to turn it into lightweight hardware. As it stands, I know only enough to have some confidence that it will fit on the ZU4CG, in terms of both logic elements and distributed memory for storing intermediate results. (The memory requirement is much less than a full frame, since the wavelets only look ahead/behind a few pixels at a time.) But there is an immense amount of implementation detail to fill in, and I hope to make a small dent in that while these boards are in flight.<br />
<br />
To summarize, I still have no clue if, how, or when any of this will work. My philosophy on this project is to send the pixels as fast as they want to go and try to remove anything that gets in the way. It's not really a plan - more of a series of challenges.</div>
</div>
</div>
</div>
</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com5tag:blogger.com,1999:blog-8200098102909041178.post-29378777427546067062019-05-27T23:26:00.002-04:002019-10-05T20:57:18.195-04:00KSP: Laythe Colony Part 3, The Colony ShipsJool Launch Window #2 is all about getting as many Kerbals in transit to Laythe as possible, and that means building a fleet of colony ships. This was actually the first ship designed for this mission, but I only built one as a proof-of-concept before committing to the <a href="http://scolton.blogspot.com/2018/11/ksp-laythe-colony-part-2-robotic-fleet.html">Robotic Fleet</a> for Launch Window #1. Those habitats, rovers, and relays will arrive first to pave the way for the colony ships.<br />
<br />
The colony ships are built in orbit, with each part launched separately on the same <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0HF2uYfyZFqsWJ72y3RzPB4hek8Ykmj5evXv4J-b6TLbrKcSPAFqAIW86br6wp7_mBmHGVTBQ8W0XgBy0K3mxXImlCNhQ5nl4Gb5Lyq3dL3UPwWyA9H4_KeM4IUlTDZcwUbIfYCiXRGE/s1600/lc25.png">heavy-lift boosters</a> that sent up the Robotic Fleet. The core of each ship, around which the rest of the ship is built, is a <span style="color: yellow;">Passenger Module</span>:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXCPG84QM_KOMLMJJ5U-chM8o7DNxZ5cPrUuvo7XWa570JewbsnHzIRLj86D1M6XvahhfoLuPfaG-s-756ajkM1o4czMQoe7l6IRvSclE3GSncs-JC1ekdb51w4nP-BIe6FHrS87KwsHE/s1600/lc58.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXCPG84QM_KOMLMJJ5U-chM8o7DNxZ5cPrUuvo7XWa570JewbsnHzIRLj86D1M6XvahhfoLuPfaG-s-756ajkM1o4czMQoe7l6IRvSclE3GSncs-JC1ekdb51w4nP-BIe6FHrS87KwsHE/s640/lc58.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Passenger Module</td></tr>
</tbody></table>
<div style="text-align: left;">
The Passenger Module has room for 18 Kerbals (half the crew of each ship), with two main living compartments on each end, a central stack of general purpose seating, and two observation domes. It's meant to be the "comfortable" portion of the ship, to make the multi-year journey more bearable than would be possible in a lander cockpit. Not that Kerbals really care.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
One of the main quality of life considerations for the colony ships is the ability to spin to generate artificial gravity in some of the living quarters. For this reason, the rest of the ship is built along an axis passing through the center of the passenger module. Forward, the next part is the <span style="color: yellow;">Docking Module</span>:</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQhb_SPAY-escXHvvrlUbifR78AGtPCvkOck8WmTazmGnq7hC-vb4UqtnP3PCvLMOzTmXAV_MMaMhqtfianlD41s9YOcX0qtqYKcRka-Rwt83ZDA4Mo43n4H-6U0TfoQ2Do6cQ7TYoKWg/s1600/lc71.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQhb_SPAY-escXHvvrlUbifR78AGtPCvkOck8WmTazmGnq7hC-vb4UqtnP3PCvLMOzTmXAV_MMaMhqtfianlD41s9YOcX0qtqYKcRka-Rwt83ZDA4Mo43n4H-6U0TfoQ2Do6cQ7TYoKWg/s640/lc71.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Docking Module</td></tr>
</tbody></table>
<div style="text-align: left;">
While it has space for another six Kerbals, the Docking Module is more of a working space than a living space. Since it's on the central axis, there's no artificial gravity. But it has a large science lab and common area for the crew. Most importantly, it serves as the docking interface for the <span style="color: yellow;">Space Planes</span>, which shuttle crew to and from the colony ships.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLgmwxEh5U0LXuKVQHnbEoCM6SmAJ8EiLtNRlxtbny8MW90noG_P1VBr1W0177QcHBJX-F7cVwfZZ7s0_qaYSkfVrnWP7QYp2_BQZmTJ-OSAzAackCxgG7Ahw57liGopndVKxexJjDRug/s1600/lc23.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLgmwxEh5U0LXuKVQHnbEoCM6SmAJ8EiLtNRlxtbny8MW90noG_P1VBr1W0177QcHBJX-F7cVwfZZ7s0_qaYSkfVrnWP7QYp2_BQZmTJ-OSAzAackCxgG7Ahw57liGopndVKxexJjDRug/s640/lc23.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Space Planes</td></tr>
</tbody></table>
<div style="text-align: left;">
The Space Planes are really the key to this entire mission, providing a way to get hundreds of Kerbals down to Laythe without having to <a href="http://scolton.blogspot.com/2014/03/ksp-mission-to-laythe-and-back.html">exactly target flat landing sites from orbit</a>. I tweaked and tested the design to the point where getting to orbit, docking with a colony ship, and returning to Kerbin for a runway landing was utterly routine. Each colony ship required four Space Plane round-trips and two one-way trips to fully crew. The two one-ways go with the ship to Laythe, where they will be used to ferry Kerbals down to the surface.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The back-to-back docking configuration for the Space Planes minimizes the moment of inertia along the spin axis. The two planes have to be exactly symmetric, so each interfaces with two medium-size docking ports for alignment. It is possible, with careful flying, to get both ports to engage at the same time. In addition to enforcing symmetry, this makes the final structure much more rigid. Finding parts that are exactly the right spacing on both sides to make this possible was the trickiest part of the design.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
I could write an entire post about the Space Plane design, but I think I'll just post some pictures and videos of it <b>kicking ass</b> instead:</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdMm1YUKpk3sL68KTFWYfU7hnnh-yChS1G3jf1DfcrzNslZOCvKs0PYULcYrnByAL8L5VIbLcDSO38B9uUMpH0LvKtCpulmx7mMkASwSDoL4D97MGqZHV7cKlkfzxdnOI3eodNvztP3tE/s1600/lc98.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdMm1YUKpk3sL68KTFWYfU7hnnh-yChS1G3jf1DfcrzNslZOCvKs0PYULcYrnByAL8L5VIbLcDSO38B9uUMpH0LvKtCpulmx7mMkASwSDoL4D97MGqZHV7cKlkfzxdnOI3eodNvztP3tE/s640/lc98.png" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiKQsO7hflybcc-EcUQxujnb0PLxT4GVxERmscAV-I-N-fElQ9qjSz_8nYGCOUco2xDAT9csaWlgrZW5kHn_8-TXUJ6SejnCYypselGvWodLVZsEWufu_c3QEH5N_0fDtbu3PToEviphE/s1600/ld25.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiKQsO7hflybcc-EcUQxujnb0PLxT4GVxERmscAV-I-N-fElQ9qjSz_8nYGCOUco2xDAT9csaWlgrZW5kHn_8-TXUJ6SejnCYypselGvWodLVZsEWufu_c3QEH5N_0fDtbu3PToEviphE/s640/ld25.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/TF1-GfccDro" width="640"></iframe></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0U14FefxFo_9iiorcsCOyOwez5ZyKSJRGiehcXqC9JdpqfDawWmrfCf1oHKJ6-6uyWUze4jmZPUk69Q6jymJwX2FBZg8B5mFtu0zjPBjlzoV_eM0gbWYHIOjFPhS-yKiY_X_j992Y0d0/s1600/lc66.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0U14FefxFo_9iiorcsCOyOwez5ZyKSJRGiehcXqC9JdpqfDawWmrfCf1oHKJ6-6uyWUze4jmZPUk69Q6jymJwX2FBZg8B5mFtu0zjPBjlzoV_eM0gbWYHIOjFPhS-yKiY_X_j992Y0d0/s640/lc66.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWYkhfYrr4Bmu8o5zDA-4JxxQhLqUGVeApJUOi5xxvRkI8MEaZt-XG5iUtcVWs20CBsK3qwmZFvzF5IVJEqdg2OAXuTDmSqFhWt6UkgKeUblX0W84MYiof0khZTPm-5lmWpkkGAKYXoI0/s1600/ld03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWYkhfYrr4Bmu8o5zDA-4JxxQhLqUGVeApJUOi5xxvRkI8MEaZt-XG5iUtcVWs20CBsK3qwmZFvzF5IVJEqdg2OAXuTDmSqFhWt6UkgKeUblX0W84MYiof0khZTPm-5lmWpkkGAKYXoI0/s640/ld03.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieabK_0HJiXtXs6m2gG7e76OQlx2F2k0FRp_kuBamGRnPJZrPukVbD7CIBGIPpxFgzCh9NoHxuATUP4cHmAKeJ67iyWO7i0rTq2UdbnPkt1PISvd3ULk2yVaK3IndDsO8OGT6uCyMUdzk/s1600/ld04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieabK_0HJiXtXs6m2gG7e76OQlx2F2k0FRp_kuBamGRnPJZrPukVbD7CIBGIPpxFgzCh9NoHxuATUP4cHmAKeJ67iyWO7i0rTq2UdbnPkt1PISvd3ULk2yVaK3IndDsO8OGT6uCyMUdzk/s640/ld04.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The last picture has a story that goes with it: For some reason, after dozens of clean flights, I botched a take-off and slammed back into the KSC runway with the gear still down, breaking off both wings, the outer engines and fuel tanks, the vertical stabilizers, and all but the two inner horizontal control surfaces. The fraction of a plane that was left was somehow still able to gain altitude, do a wide 180° turn, and make a water landing just off shore.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Anyway, back to the colony ships. Behind the Passenger Module is a truss structure I just call <span style="color: yellow;">The Node</span>:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPcJKj2AvhgZmedj2P7EXa_A-yAmkue4JJysl_NnxuUGbvCqOkQAkQEvPAnoXp7cP4jBIC1-2-1U81nos8kYqXScWXG-avLBHnGqJ6x-2ZqtXGuL7QBMvNOpaM1fEPF3J9uC-uN9G5Ugs/s1600/lc20.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPcJKj2AvhgZmedj2P7EXa_A-yAmkue4JJysl_NnxuUGbvCqOkQAkQEvPAnoXp7cP4jBIC1-2-1U81nos8kYqXScWXG-avLBHnGqJ6x-2ZqtXGuL7QBMvNOpaM1fEPF3J9uC-uN9G5Ugs/s640/lc20.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Node</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
This is the lightest and simplest part of the colony ship, primarily serving as a connector between the crew stack and the propulsion. It also carries the large solar panels, some battery storage, extra reaction control systems, and side ports for docking other modules, such as for refueling. Altogether a small but important building block.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
To push all this, three <span style="color: yellow;">Propulsion Modules</span> are launched separately and docked to the back of the Node. These are the full four-engine versions of the <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmET2hwOfsPHswgNSmmru0DsAKFfhZpfk_bfHC28nTUjj41WH0oAOZjZe7QYbsPbRBth5vtD_q-VNvHXy_Eyv310UBGOcb8GdYjZzuFHlFZha3YeYxO17BP-VYVYQYN5V4ZFBMqjEqmG8/s1600/lc40.png">propulsion modules used for the Robotic Fleet</a>.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIrt9LEaEPQ7IreEwP66n_rlNqgMSvNM9CmXbz1GdoGu7mHVNpF6YV90qgVRXC49gKiR5I2aT0RsWk2pAS3kjf9BRbD_PaCo8Sk8YNATcrFG7ldMLZuUcp9cnrrJDOBnnzqaui_fF64Ec/s1600/lc61.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIrt9LEaEPQ7IreEwP66n_rlNqgMSvNM9CmXbz1GdoGu7mHVNpF6YV90qgVRXC49gKiR5I2aT0RsWk2pAS3kjf9BRbD_PaCo8Sk8YNATcrFG7ldMLZuUcp9cnrrJDOBnnzqaui_fF64Ec/s640/lc61.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The Propulsion Modules</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
With 12 engines in total, the colony ships actually have a higher thrust-to-weight ratio than the robotic landers. The entire colony ship, including the two Space Planes, comes in at just under <b>300 tons</b> and has a fully-fueled Delta-V of about <b>4400m/s</b>, which should be enough to get to Laythe orbit with just a tiny bit of help from gravity assists off Tylo (or Laythe itself).</div>
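As a sanity check on that budget, the rocket equation gives the propellant fraction implied by 4400m/s on a 300-ton ship. The 800s specific impulse below is an assumed value typical of a KSP nuclear-style engine, not read from the actual craft:

```python
import math

# Tsiolkovsky sanity check on the colony ship's delta-v budget.
# The 800 s Isp is an assumed KSP-nuclear-engine-style value, not
# taken from the actual craft file.
G0 = 9.81          # m/s^2, standard gravity in the rocket equation
ISP = 800.0        # s, assumed vacuum specific impulse
WET_MASS = 300.0   # tons, fully-assembled colony ship
DV = 4400.0        # m/s, quoted delta-v

# dv = Isp * g0 * ln(m0/mf)  =>  m0/mf = exp(dv / (Isp * g0))
mass_ratio = math.exp(DV / (ISP * G0))
dry_mass = WET_MASS / mass_ratio
propellant = WET_MASS - dry_mass

print(f"mass ratio: {mass_ratio:.2f}")
print(f"propellant: {propellant:.0f} of {WET_MASS:.0f} tons")
```

Under that assumption, a bit under half the ship's wet mass would be propellant, which is at least in the plausible range for an orbital assembly of this size.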
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The process of building a single colony ship takes 13 separate launches: six for assembly, six crew shuttles (including two permanent ones), and one refueling run. (While the propulsion modules get to orbit fully-fueled, the total Delta-V counts on topping off the two Space Planes.) It's without a doubt the most ambitious in-orbit construction project I've attempted in KSP.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="360" src="https://www.youtube.com/embed/rycBSWTirDU" width="640"></iframe></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Oh, and I built 10 of these for Launch Window #2. That's about <b>3,000 tons of hardware, including 20 Space Planes and 360 Kerbals</b>, on the way to Laythe.</div>
<h4 style="clear: both; text-align: left;">
196 Days Remain</h4>
<div class="separator" style="clear: both; text-align: left;">
I had intended for the second launch window to be the last ships out, but my arbitrary deadline of Year 3, Day 0 for the destruction of Kerbin leaves some time to send up a few more. They'll have to wait for the third launch window, possibly in the relative safety of a Minmus orbit, but I can think of a few extra pieces of hardware that would be useful to the colony.</div>
Shane Coltonhttp://www.blogger.com/profile/10603406287033587039noreply@blogger.com0tag:blogger.com,1999:blog-8200098102909041178.post-80758865438078300752019-03-31T12:03:00.000-04:002019-03-31T12:04:53.626-04:00TinyCross: Chassis Build<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqTjkPqS-UnyOMW4A-N_Tz75VgSmBe34Xu0CXs1jGy3vF1tdxccVDYgUA1-KUjyMvqSe8GatsRTYmDBjxCHzNAqHArPI2L1BehpVYr0Wh2mGkPC37enyqvrpgVACvXDv6GvOAOjTsIfuM/s1600/tc55.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="487" data-original-width="1600" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqTjkPqS-UnyOMW4A-N_Tz75VgSmBe34Xu0CXs1jGy3vF1tdxccVDYgUA1-KUjyMvqSe8GatsRTYmDBjxCHzNAqHArPI2L1BehpVYr0Wh2mGkPC37enyqvrpgVACvXDv6GvOAOjTsIfuM/s640/tc55.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Just hit print...</td></tr>
</tbody></table>
<div style="text-align: left;">
This winter build season is coming to a close: almost all of the mechanical work for <a href="http://scolton.blogspot.com/2018/08/tinycross-ultralight-electric-crosskart.html">TinyCross</a> is done! Here's a recap of how the rolling chassis came together:<br />
<br /></div>
<div style="text-align: left;">
The frame and suspension are mostly a kit of aluminum plates and 80/20 extrusion, with almost no post-machining required, so it went together very quickly. After the main box frame assembly and seat mounting, I started with a test build of a single corner of A-arm geometry to make sure I hadn't missed any clearance issues. I also wanted to get a first impression of the stiffness in real life, since that's probably the biggest risk of this new design.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEighStyanaMJDLr8KvkMwf_Ad0MgOaVq1qD5qu1y_u6NSqrZfRNkNFeJKeatUngI25U2aVZdKruu9XwdgIn6DMtg85vhdwr12HOifeXuhvka9UzEzhcVoQiEyp6zdvIsvLtlHADM4WwFX0/s1600/tc24.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEighStyanaMJDLr8KvkMwf_Ad0MgOaVq1qD5qu1y_u6NSqrZfRNkNFeJKeatUngI25U2aVZdKruu9XwdgIn6DMtg85vhdwr12HOifeXuhvka9UzEzhcVoQiEyp6zdvIsvLtlHADM4WwFX0/s640/tc24.jpg" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
I had no trouble at all with the front-right corner. Everything fit together as planned and the stiffness felt adequate, largely thanks to the zero-slop <a href="https://www.summitracing.com/parts/qa1-cmr4ts/overview/">QA1 ball joints</a>. I was actually a little surprised at how well-behaved it felt. Once all six ball joints and the <a href="https://www.amazon.com/DNM-Mountain-Shock-Lockout-190mm/dp/B00FLTZ47O/ref=asc_df_B00FLTZ47O/?tag=hyprod-20&linkCode=df0&hvadid=309768114180&hvpos=1o2&hvnetw=g&hvrand=6007427895477526084&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9002012&hvtargid=pla-569284672209&psc=1">air shock</a> pins were tightened down, it really did have only the degrees of freedom it should have: one for steering and one for suspension travel. There's no rattling or play at all. With high confidence from the test corner, I went into production line mode for the other three.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgE7XuYAhB6uiQq60pb3EeOkXRYHm9zlQDulIXvVRVxhdj-a-126Vaxikmal0K6FYYF1oDZ1xU2ompmKLEObUdw1LXRdRXgEJOAL9FflcgMipxceHsaT7AeWC_sNcvDpH772tinEN_9Lmk/s1600/tc26.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgE7XuYAhB6uiQq60pb3EeOkXRYHm9zlQDulIXvVRVxhdj-a-126Vaxikmal0K6FYYF1oDZ1xU2ompmKLEObUdw1LXRdRXgEJOAL9FflcgMipxceHsaT7AeWC_sNcvDpH772tinEN_9Lmk/s640/tc26.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">PTFE-lined QA1 1/4-28 rod ends (<a href="https://www.summitracing.com/parts/qa1-cmr4ts/overview/">CMR4TS</a>) are the real stars of this build. It would not be possible with McMaster's selection of ball joints, which are either cheap and overly loose or expensive and overly tight.</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhA6eFRfR9-RuER57jnWSKUv-AbmUZiqDVNi87gr3GtkkDovOphTBn8FM88WFLb_kDKQEwinZwRkWMXcC9z26hLpX4cj_f7aSGstd-V1KCpnRamYN-KkPS8aqovPnEuyNe2xfJDxJNhnSk/s1600/tc31.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhA6eFRfR9-RuER57jnWSKUv-AbmUZiqDVNi87gr3GtkkDovOphTBn8FM88WFLb_kDKQEwinZwRkWMXcC9z26hLpX4cj_f7aSGstd-V1KCpnRamYN-KkPS8aqovPnEuyNe2xfJDxJNhnSk/s640/tc31.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">I say there was almost no post-machining, but just tapping all the 80/20 ends was a whole day of work.</td></tr>
</tbody></table>
<div style="text-align: center;">
<div style="text-align: left;">
About here is where the perfect build ended, though, because when I went to attach the front-left corner, I discovered that there was a slight interference between the A-arms and the air shock valve stems. The parts I designed, all 2D plates, are 100% symmetric, so I didn't bother to model the other corners. But the shocks themselves are not symmetric, so it wasn't exactly correct to assume that things would fit together the same in the mirrored configuration. Since the interference was minimal, I debated cutting notches in the A-arms for the valve stems. But I was able to find a more satisfying solution.</div>
</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMFNJl6hNWWmYw4C-a_j6eR2IOlTVYgoqCh1_8mdfQ-xZoI5lRLkeNyzT3R1co62a_YHYCKJz76N_csJhtIZtAf5A7vt9U-ymGgFr4iryvAzCjCd7RPOYklx2SeAoRy0JRWF5sITvqI6s/s1600/tc25.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMFNJl6hNWWmYw4C-a_j6eR2IOlTVYgoqCh1_8mdfQ-xZoI5lRLkeNyzT3R1co62a_YHYCKJz76N_csJhtIZtAf5A7vt9U-ymGgFr4iryvAzCjCd7RPOYklx2SeAoRy0JRWF5sITvqI6s/s640/tc25.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Can you spot it?</td></tr>
</tbody></table>
Not modeling the mirrored parts was a semi-legitimate time-saving strategy, but not modeling the rear corners was just laziness. And of course there was a major interference there: the brake calipers would not have cleared the corners of the frame. (In the front, it's no problem since there is extra space for steering travel.) It was nothing that couldn't be solved with a hacksaw and some improvisation, though. I actually like the final outcome better than the original design...<br />
<div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQwfmWIlg3K0Ul_5vFVmZV1ig-RGKEE5Uf50PoFX1yeaFp_nBpUSEG674ro-ojFlYHem3xQQOF320bZypYHKNowEUVDua2yJP4WJjo0Ddy2pk44mmpI3b5lLF68ueHj8HP168QMGoj8S8/s1600/tc36.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQwfmWIlg3K0Ul_5vFVmZV1ig-RGKEE5Uf50PoFX1yeaFp_nBpUSEG674ro-ojFlYHem3xQQOF320bZypYHKNowEUVDua2yJP4WJjo0Ddy2pk44mmpI3b5lLF68ueHj8HP168QMGoj8S8/s640/tc36.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">...he justifies, in post.</td></tr>
</tbody></table>
<div style="text-align: left;">
Minor issues aside, I am pleased with how the chassis turned out. It's much stiffer than <a href="http://scolton.blogspot.com/p/cap-kart.html#tinykart">tinyKart</a>, thanks to a slight excursion into the third dimension, but still very light. And I went from 50% to 99% confidence on the suspension design after getting hands on the assembled corners.</div>
</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHKJoXeMa_IIDDFUDAM8QFB5P2Pc_fbmairJt_Mjs60DUrGJ0tV3KUCcfOYYk5CX92v4AAcqAxxBFcDI_Lgu2PGc2InX3XcAbb1LOH2XG2AT4IiDghZGBJMwz-GN_tUMdH_aoS3-lP_gE/s1600/tc33.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1600" data-original-width="1200" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHKJoXeMa_IIDDFUDAM8QFB5P2Pc_fbmairJt_Mjs60DUrGJ0tV3KUCcfOYYk5CX92v4AAcqAxxBFcDI_Lgu2PGc2InX3XcAbb1LOH2XG2AT4IiDghZGBJMwz-GN_tUMdH_aoS3-lP_gE/s640/tc33.jpg" width="480" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">So far, so light.</td></tr>
</tbody></table>
<div style="text-align: left;">
Most of the machining for this build was for turned parts within the four drive modules. The spindle shafts for the wheels were made from 7075 aluminum and support the wheel bearings (6902-2RS). Unlike on tinyKart, the shafts are doubly-supported within a box structure built around the wheel, which should be much more impact tolerant. The large drive pulleys got some weight reduction and a custom bolt pattern to interface with the wheel hubs.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDxvAHBZLi3UKRLT8J6rTSA5cfMsd2UYBQkd2K11XXIvU1m51jgfVK-_HpN9FsYFN1NCZ0-kEQvWcjz7mn7PvGg0I7eZk0EfzVwaC5WeXWC54tdEvFewVAZw5YcaeVhrH5mEd_ZFKYVC0/s1600/tc39.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="586" data-original-width="1600" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDxvAHBZLi3UKRLT8J6rTSA5cfMsd2UYBQkd2K11XXIvU1m51jgfVK-_HpN9FsYFN1NCZ0-kEQvWcjz7mn7PvGg0I7eZk0EfzVwaC5WeXWC54tdEvFewVAZw5YcaeVhrH5mEd_ZFKYVC0/s640/tc39.jpg" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
My favorite bit of packaging is the brake caliper occupying the volume inside the belt loop, with the brake disk flush against one side of the wheel pulley. Torque is sourced and sunk from the same side of the wheel - in fact from the same metal plate.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHlv06ht8dWKMU1OJaMn7kL0ihYD7FNoiw5jppYan25VJz3hm9HdQFQe0MpLTD5hyScbgnubMXq94Y4tKE1UQJJxsTPTobp91hEmZejWKOU-YN9mjDNoDfXNIdlCgb8_rF27adhopUUDY/s1600/tc47.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1290" data-original-width="1600" height="514" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHlv06ht8dWKMU1OJaMn7kL0ihYD7FNoiw5jppYan25VJz3hm9HdQFQe0MpLTD5hyScbgnubMXq94Y4tKE1UQJJxsTPTobp91hEmZejWKOU-YN9mjDNoDfXNIdlCgb8_rF27adhopUUDY/s640/tc47.jpg" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
The motor shaft and motor pulley also required some custom machining. This was a weak link on tinyKart: The original design used set screws on shaft flats, but these were prone to loosening over time (or, in one case, the 10mm shaft sheared off completely at the flat). After switching to keyed shafts (via <a href="https://alienpowersystem.com/shop/brushless-motors/aps-6374hev-outrunner-brushless-motor-170kv-3300w/">Alien Power Systems</a> motors), those problems mostly went away. But there was still axial play, and the torque was still being transmitted through a 3mm key into an aluminum keyway.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
For TinyCross, I wanted to have a clamping and keyed shaft adapter so the torque would be primarily transmitted through friction, with the key as back-up. There's not a lot of room to work within the 15-tooth drive pulley, so that just gets bored out as much as possible and then pressed like hell, with retaining compound, onto a 7075 adapter. This adapter then gets the 10mm bore with a 3mm keyway. But it also gets slotted, turning it into a clamp. Finally, an off-the-shelf 0.75in aluminum clamping collar tightens the whole assembly down onto the motor shaft, with the key in place.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJu4cxOrl_lHJJmUsiZe1pYUSPuuOfXgxvgNW2FTnuLJguAWx49p419CoP_8j5dTIAu7Eo9fq7MTD4NhJClI1uY6fXKrKsVg-vRKhuzURwjemwldluqfBNypdCtt0eK77ETMcTRimE5Is/s1600/tc56.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="874" data-original-width="1600" height="347" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJu4cxOrl_lHJJmUsiZe1pYUSPuuOfXgxvgNW2FTnuLJguAWx49p419CoP_8j5dTIAu7Eo9fq7MTD4NhJClI1uY6fXKrKsVg-vRKhuzURwjemwldluqfBNypdCtt0eK77ETMcTRimE5Is/s640/tc56.png" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
Additionally, the outboard side of the shaft interfaces with another bearing, for double support, and has a pocket for a shaft rotation sense magnet, to be picked up by a rotary encoder IC.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilJISp60tT798cW6-aGZnZHR8PoSP5MKn86tuNp9h_P99HNafLkd93JbP3iP7q_95gUmp7sh3bWQCCELo4Y5TSBpK8LplH8jk2IQwVzOIAfwiBHKRBajjda_ncLvM-RlF3cA6qMRTZY_Q/s1600/tc53.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilJISp60tT798cW6-aGZnZHR8PoSP5MKn86tuNp9h_P99HNafLkd93JbP3iP7q_95gUmp7sh3bWQCCELo4Y5TSBpK8LplH8jk2IQwVzOIAfwiBHKRBajjda_ncLvM-RlF3cA6qMRTZY_Q/s640/tc53.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Not messing around.</td></tr>
</tbody></table>
<div style="text-align: left;">
For brakes, I opted for the same disks, calipers, cables, and levers as on tinyKart. I briefly debated going hydraulic, but the plumbing for four-wheel disk brakes seemed like an unnecessary nightmare. tinyKart never had a problem with braking torque; it could easily lock up both front wheels. It just had so little weight on the front wheels that braking and steering were often mutually exclusive activities. With four-wheel disk brakes, TinyCross should be much more controllable under braking. The <a href="https://www.amazon.com/TerraTrike-Dual-Control-Brake-Lever/dp/B001FYEOIE">TerraTrike dual pull levers</a> are key to making this work: they have a fulcrum between the lever and two cable ends that ensures both cables get pulled with equal force. I have one such lever for the two front disks and one for the rears.</div>
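The equalizing behavior of those dual-pull levers can be sketched as a floating-bar (whiffletree) force balance: because the fulcrum bar can pivot freely, static equilibrium forces the two cable tensions to be equal no matter how much the travel differs. The 4:1 lever ratio below is an assumed number, not a TerraTrike spec:

```python
# Force balance for a dual-pull brake lever: the hand lever pulls on a
# floating bar whose ends carry the two cables. A moment balance about
# the bar's free pivot forces the two tensions to be equal.
# The 4:1 lever ratio is an assumed illustrative value.
LEVER_RATIO = 4.0

def cable_tensions(hand_force_n):
    """Return the two cable tensions for a given hand force, in newtons."""
    bar_force = hand_force_n * LEVER_RATIO  # force into the floating bar
    return bar_force / 2, bar_force / 2    # equal split from the pivot

left, right = cable_tensions(100.0)
print(left, right)
```

This is the same trick that lets one lever actuate both front (or both rear) calipers without any hydraulic balancing.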
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4kN9LtwWdPH5lQSl4HCXReI1YzhVM7rtIFTSs8IO3NojQPK6hAcFzQwfMWSg42z4ZI0chHA0rD2yuvXrsDs8nYLuvYVcivEy0BuaE5UKL42_fNClpHDQM87F14M8OROULbWn-ESLmoa8/s1600/tc52.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4kN9LtwWdPH5lQSl4HCXReI1YzhVM7rtIFTSs8IO3NojQPK6hAcFzQwfMWSg42z4ZI0chHA0rD2yuvXrsDs8nYLuvYVcivEy0BuaE5UKL42_fNClpHDQM87F14M8OROULbWn-ESLmoa8/s640/tc52.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Independent rear brake lever...what could go wrong?</td></tr>
</tbody></table>
<div style="text-align: left;">
The last piece of the mechanical puzzle is the steering. At each wheel, there's a steering arm that terminates in a place to mount yet another ball joint, using a T-nut. This is driven by a link comprising a threaded rod with an aluminum stiffener, another trick carried over from tinyKart. The aluminum stiffener is compressed and the threaded rod is stretched by two nuts, creating a link that's stiffer than either part by itself.</div>
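The reason the preloaded link beats either part alone: as long as the preload isn't overcome, the rod and the stiffener carry load in parallel, so their axial stiffnesses (k = EA/L) add. A minimal sketch, with all dimensions and materials assumed rather than measured from the kart:

```python
import math

# Parallel-stiffness sketch of the preloaded steering link. While the
# preload holds, the threaded rod and the aluminum stiffener share load,
# so their axial stiffnesses k = E*A/L add. All dimensions and material
# properties below are assumed illustrative values.
E_STEEL = 200e9  # Pa, threaded rod
E_ALUM = 69e9    # Pa, tube stiffener
LENGTH = 0.30    # m, link length

rod_area = math.pi / 4 * 0.00635**2               # 1/4 in rod, solid
tube_area = math.pi / 4 * (0.0127**2 - 0.007**2)  # tube OD/ID annulus

k_rod = E_STEEL * rod_area / LENGTH
k_tube = E_ALUM * tube_area / LENGTH
k_link = k_rod + k_tube  # parallel load paths while preloaded

print(f"rod {k_rod/1e6:.1f}, tube {k_tube/1e6:.1f}, "
      f"link {k_link/1e6:.1f} MN/m")
```

With these assumed numbers the two members contribute comparably, so the preloaded link ends up roughly twice as stiff as either part by itself.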
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj24K4XOrY3EPl0cMNQ7lFITdCF0HcmGOW5NERqZqVWzJdHmmzT1ydJGoeqfJ8WNknNGKyRbn0lx_8tShT_jvkZmxZk9qFEtbZ3G5cWX2T1UgNH3NKQHNSLFn4cGADnRwEbQci1l1HfCjc/s1600/tc48.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1561" data-original-width="1600" height="624" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj24K4XOrY3EPl0cMNQ7lFITdCF0HcmGOW5NERqZqVWzJdHmmzT1ydJGoeqfJ8WNknNGKyRbn0lx_8tShT_jvkZmxZk9qFEtbZ3G5cWX2T1UgNH3NKQHNSLFn4cGADnRwEbQci1l1HfCjc/s640/tc48.jpg" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWaEeNBYcxBZC-uMBPHevqhXISkbXvpsRi5QiZkJvfDpMwdP4exhVYN_RcsQbXfdAbMU-VYL0lecwx3hVXk2BpeZxlgYv0SPC2T1xHEPb3mYlHYFiQjqGuv986EYPAbdjzfbcLYOJfeYE/s1600/tc54.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="447" data-original-width="1600" height="178" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWaEeNBYcxBZC-uMBPHevqhXISkbXvpsRi5QiZkJvfDpMwdP4exhVYN_RcsQbXfdAbMU-VYL0lecwx3hVXk2BpeZxlgYv0SPC2T1xHEPb3mYlHYFiQjqGuv986EYPAbdjzfbcLYOJfeYE/s640/tc54.jpg" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div style="text-align: left;">
For the rear wheels, all that's required is two fixed mounting points for the other end of these rods. Rear toe angle is set by adjustment with the threaded rod.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGu51FBr8F8w36-8VamMqMfjmx291MdmBLCm-emh2Act6MCwnC4A9TagdRThC6ZWGW4wnw2uo_ijSG78MBSifGDeuWnPJlHmwWnRP4ZYOp43RnWMszny7eZxz4EfVyPpCFkXFm6Bu0Jgs/s1600/tc50.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGu51FBr8F8w36-8VamMqMfjmx291MdmBLCm-emh2Act6MCwnC4A9TagdRThC6ZWGW4wnw2uo_ijSG78MBSifGDeuWnPJlHmwWnRP4ZYOp43RnWMszny7eZxz4EfVyPpCFkXFm6Bu0Jgs/s640/tc50.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The toe-setting plate doubles as the TinyCross badge.</td></tr>
</tbody></table>
<div style="text-align: left;">
The front requires an actual steering mechanism. In lieu of a rack-and-pinion, I used a simple four-bar, driven by the steering column through a universal joint buried in the middle of the front suspension support tower. Each link in the four-bar has its own set of thrust bearings and radial bushings, to minimize the extra linkage slop.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDpVEzv2iRVspk6MUZ3XRSVRdyxIXXdSrF813Scqo4NChG188BaoEYnJsL07ZAH7cCBZX1nbda-0iaQQt4iNfzHxno31M-1m96v25umgKzIWcpg_PdVWcPnKonQmq3LIxczL2WFVPfeoI/s1600/tc45.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDpVEzv2iRVspk6MUZ3XRSVRdyxIXXdSrF813Scqo4NChG188BaoEYnJsL07ZAH7cCBZX1nbda-0iaQQt4iNfzHxno31M-1m96v25umgKzIWcpg_PdVWcPnKonQmq3LIxczL2WFVPfeoI/s640/tc45.jpg" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6lnViKA2o-yAv8T6jRbtikc7d3RSlqFPnoIxZW-8Z9yKc0yP7rVgjXUNyftSna27_7jV1ULefQ0ZM12dqHO7eRc97w-tT2g8lrFw2sQ7a7O-HAoYK9PP1m8xdLMQ4VQhzKTQAh8csL7o/s1600/tc46.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="979" data-original-width="1600" height="390" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6lnViKA2o-yAv8T6jRbtikc7d3RSlqFPnoIxZW-8Z9yKc0yP7rVgjXUNyftSna27_7jV1ULefQ0ZM12dqHO7eRc97w-tT2g8lrFw2sQ7a7O-HAoYK9PP1m8xdLMQ4VQhzKTQAh8csL7o/s640/tc46.jpg" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
This sort of setup works since the steering throw is very short: ±45º of travel is all it needs. Amazingly, it all clears over the full suspension travel and there doesn't seem to be much bump steer. (I shouldn't be amazed, since it works in CAD, but I am anyway.)</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
And just like that, it rolls.</div>
<div style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPZmBcN7w4teCtsPqW3ywdwVEZa0-LFKWRgRN5WXlwIk869IY6tfJ_GOBJ0drJRRr3sKZvMyyarDatb2cr-GONZiis7A5j1kZPFet55wcJRFqQa0wsIB-LZUfYmI14JoiBrBvZ51ZAxaM/s1600/tc51.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPZmBcN7w4teCtsPqW3ywdwVEZa0-LFKWRgRN5WXlwIk869IY6tfJ_GOBJ0drJRRr3sKZvMyyarDatb2cr-GONZiis7A5j1kZPFet55wcJRFqQa0wsIB-LZUfYmI14JoiBrBvZ51ZAxaM/s640/tc51.jpg" width="640" /></a></div>
<div style="text-align: center;">
<br /></div>
<div style="text-align: left;">
Unfortunately, although the frame and suspension were on-target, the four power modules are way over weight budget. The wheels themselves are annoyingly heavy. I can't do much about that, but I can probably take some weight out of the surrounding assembly. A lot of the design is driven by the off-the-shelf cast aluminum rims. If I am willing to chop down the rim and re-machine its outer bearing bore, I can probably save a little weight and a lot of width. I can maybe even get it below 34in, which would help with getting through doorways.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
But for now I will shift focus to the electronics. Preliminary bring-up of the motor drives has been uneventful (that's a good thing), but I'll need to actually hook them up to LiPos next, which could always become interesting in fiery ways.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhosdlK0nRKyTuFDanpia_6VG0nrgrd1PskQTe7PsL9kQ20jc8rHNQI6hA0a58AL4Wc4rpyuKCdXtfSHDN6URmGFNClXyOQkIAhIFxHYEI0A5MzaqJVg_06O2qyYa6MkCw27S7lGi8JFHU/s1600/tc55.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhosdlK0nRKyTuFDanpia_6VG0nrgrd1PskQTe7PsL9kQ20jc8rHNQI6hA0a58AL4Wc4rpyuKCdXtfSHDN6URmGFNClXyOQkIAhIFxHYEI0A5MzaqJVg_06O2qyYa6MkCw27S7lGi8JFHU/s640/tc55.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">To be continued...</td></tr>
</tbody></table>
<h4>KSP: Laythe Colony Part 2, The Robotic Fleet and Launch Window #1 (2018-11-29)</h4>
In honor of the successful <a href="https://mars.nasa.gov/insight/">Mars InSight landing</a> this week, I thought I'd do a progress report on my long-term KSP mission to get as many Kerbals off Kerbin by Year 2, Day 0 as possible. <a href="http://scolton.blogspot.com/2017/08/ksp-laythe-colony-part-1.html">Part 1</a> sets up the premise and the main strategy. In this Part 2, I throw about 1,000 tons of robotic hardware at Jool during the first available launch window, with hopes that at least some of it winds up in a single spot on the surface of Laythe as the seed for a colony.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR9DIkhR88jlmSZkFn5bp6ZJyXekNiwi4Twj3ZULsR8vHnzjan93L6QW1lpKA9URa6BEDmgtjsPUpks6WxhNMt6sO2A_K0fOcqIElALCQXFoRHhB76TK-dmGbC0AkYyCgu6Kq1tUks3Pg/s1600/lc35.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="932" data-original-width="1600" height="372" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR9DIkhR88jlmSZkFn5bp6ZJyXekNiwi4Twj3ZULsR8vHnzjan93L6QW1lpKA9URa6BEDmgtjsPUpks6WxhNMt6sO2A_K0fOcqIElALCQXFoRHhB76TK-dmGbC0AkYyCgu6Kq1tUks3Pg/s640/lc35.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The busy 1000m/s on-ramp to Jool Transfer Orbit.</td></tr>
</tbody></table>
<div style="text-align: left;">
<h4>
The Robotic Fleet</h4>
For the first launch window, I decided to send only uncrewed vehicles to feel out the Jool transfer orbit, the details of maneuvering within the Jool system, and the landing procedures at Laythe. The robotic fleet consists of three types of ship: Triple Relay Satellites (RS3), Laythe Rovers (LR1), and Habitats (HAB1). Each one has a different function crucial to settling a remote colony.</div>
<h4 style="text-align: left;">
RS3</h4>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTGAejoOhW50xfB_laBGHDCvZOtJEJclZCYKQeX_KHILCy2GkQlu1LB_qoPt8lZqxSzrPPvEzNGrT9Xo3LyMtM4luOI_nQhHIOTgp5Y9vU7L3Ask8p5W7wAsT1vN8690K8su-f9IuuQoY/s1600/lc10.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTGAejoOhW50xfB_laBGHDCvZOtJEJclZCYKQeX_KHILCy2GkQlu1LB_qoPt8lZqxSzrPPvEzNGrT9Xo3LyMtM4luOI_nQhHIOTgp5Y9vU7L3Ask8p5W7wAsT1vN8690K8su-f9IuuQoY/s640/lc10.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Does this thing get HBO?</td></tr>
</tbody></table>
<div>
These are the smallest and lightest ships, but critical to this first remote-controlled mission phase. One of the relatively new realism additions to KSP is that uncrewed vehicles need to have a line-of-sight communication path back to Kerbin, or to a ship with a crew, in order to maneuver. Achieving this requires lining up a bunch of relay satellites around Kerbin and at other useful locations in the system. </div>
<div>
<br /></div>
<div>
Each RS3 assembly carries three small relay satellites with their own ion drives. In addition to the ones already parked around Kerbin, two sets of three are on their way out to Kerbin <a href="https://en.wikipedia.org/wiki/Lagrangian_point#L4_and_L5">L4 and L5</a> stations and then eventually other equally spaced points in the orbit. Three more sets are heading out to an intermediate orbit between Kerbin and Jool. And four sets of three are in the fleet heading for Jool, to set up a network around Laythe.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2hOdKD1UWiovwfyp38DNwR89tdp3Nxa3kNh2_PfY6UNRl6G2CKifiX2TeFg1b7Mpp_JBclb3M2psc0psmVdpcb-QsFjFEZ2ayQcToAXwVJteD_e6sJK_Rd5LQcyV6l9eHBdLJBmTQA1o/s1600/lc00.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2hOdKD1UWiovwfyp38DNwR89tdp3Nxa3kNh2_PfY6UNRl6G2CKifiX2TeFg1b7Mpp_JBclb3M2psc0psmVdpcb-QsFjFEZ2ayQcToAXwVJteD_e6sJK_Rd5LQcyV6l9eHBdLJBmTQA1o/s640/lc00.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The start of the mission's comms network.</td></tr>
</tbody></table>
<h4 style="text-align: left;">
LR1</h4>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZWmFzg5Z4huWtRLWFtkb-2jTLMgM-hOXPmovRiX14UNBjnRUbLX6zLqpMmQQw0bSUPFM7XVa1fTtUV76s88rhvugpNAGWSiUkrO2YUVFlqLAA1YaQplTRGy1qYM-OHiDpEa-uOt30jeI/s1600/lc37.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="724" data-original-width="1600" height="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZWmFzg5Z4huWtRLWFtkb-2jTLMgM-hOXPmovRiX14UNBjnRUbLX6zLqpMmQQw0bSUPFM7XVa1fTtUV76s88rhvugpNAGWSiUkrO2YUVFlqLAA1YaQplTRGy1qYM-OHiDpEa-uOt30jeI/s640/lc37.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Practice driving on Kerbin.</td></tr>
</tbody></table>
<div style="text-align: left;">
The Laythe Rovers are giant 30 ton workhorses. The main function of these 8WD crawlers is to seek out ore to mine and make fuel on Laythe. They each have two large drills, a refinery, and a huge fuel storage tank. They can dock with a parked space plane to refuel it, which is critical for sustaining a link between the Laythe surface and hardware/habitats in orbit.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Landing the rovers is a four-step process. They come packaged in an aero shell with a heat shield, so the initial descent involves just surviving with the heat shield pointed in the right direction. After some time, the drag on the large heat shield flips the package around and the heat shield itself becomes a supersonic air brake, with the aero shell protecting the rover. Once subsonic, the heat shield and fairing are discarded and a set of parachutes further slows and rights the rover. Lastly, a set of four rockets slows it to a safe velocity just in time for touchdown.</div>
<div style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0t3jBiScmsSVbob7NmADLEGCpl9Egipfv42vaA_BzBmPYa1_jeY2PkC5IzsqFn6Qeo9MVLg7oqdjpZvG1g5MnpxVzuvVY2nshXvhp4XVa-0JOAWioNM4Uq30e4-1cBOT6I_L8xobPYO4/s1600/lc38.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="454" data-original-width="1600" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0t3jBiScmsSVbob7NmADLEGCpl9Egipfv42vaA_BzBmPYa1_jeY2PkC5IzsqFn6Qeo9MVLg7oqdjpZvG1g5MnpxVzuvVY2nshXvhp4XVa-0JOAWioNM4Uq30e4-1cBOT6I_L8xobPYO4/s640/lc38.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Step 5 is to quickly deploy the solar panels and drive out of the way of falling fairing debris.</td></tr>
</tbody></table>
<h4 style="text-align: left;">
HAB1</h4>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0LP1y9_1wMiGj9b6ezxSbOhmX4RmCXZcwvJpAx9MvYsUg3IQI1R3hQwxpvkDnc76Mz4z2YlxmQ-y7NBPm57lPrXpmCUJX_psfKuB8Ut_ONDPW8WMIl-z5n6jfjHag4-rKtAgZ8L71Pxw/s1600/lc13.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0LP1y9_1wMiGj9b6ezxSbOhmX4RmCXZcwvJpAx9MvYsUg3IQI1R3hQwxpvkDnc76Mz4z2YlxmQ-y7NBPm57lPrXpmCUJX_psfKuB8Ut_ONDPW8WMIl-z5n6jfjHag4-rKtAgZ8L71Pxw/s640/lc13.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Home, sweet home.</td></tr>
</tbody></table>
<div style="text-align: left;">
The Kerbals can live for extended periods of time in orbit, but having a home base on the Laythe surface will be important for long-term survival. In order to facilitate construction, the surface habitats are themselves rovers with roughly the same chassis as the LR1s. They land the same way and, once on the surface, can drive to each other. This will be important, since the landing target might span hundreds of square kilometers.</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
The habitats are extremely modular. They can be individual homes for a single Kerbal family, including a single-passenger mini-rover parked in front. Or, they can be docked together indefinitely to form a larger base, thanks to a central hallway section with docking ports on either end. The slight angle of the hallways allows them to fit inside the aero shell.</div>
<h4 style="text-align: left;">
How much fuel to bring?</h4>
<div style="text-align: left;">
The LR1 and HAB1 landed payloads are both around 28 tons, about half the mass of my <a href="https://scolton.blogspot.com/2014/03/ksp-mission-to-laythe-and-back.html">first Laythe lander</a>. (That lander had to be heavy in order to have enough fuel to get back off of Laythe, a task to be handled by space planes this time around.) In that mission, two identical ships flew independently to Laythe with an average of about 2500m/s of fuel-burning Δv. But, they also made heavy use of <a href="https://en.wikipedia.org/wiki/Aerocapture">aerocapture</a> at both Jool and Laythe. Without that, it would take something more like 4360m/s to get from low Kerbin orbit to low Laythe orbit, according to the amazing <a href="https://forum.kerbalspaceprogram.com/index.php?/topic/87463-13-community-delta-v-map-26-sep-29th/">KSP Subway Map</a>.<br />
<br />
With a known Δv requirement, figuring out how much fuel to bring is simple. With the 800s <a href="https://en.wikipedia.org/wiki/Specific_impulse">specific impulse</a> of the <a href="https://wiki.kerbalspaceprogram.com/wiki/LV-N_%22Nerv%22_Atomic_Rocket_Motor">LV-Ns</a>, the minimum wet to dry mass ratio is:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEho9EDfmsZ_AQz49TaSwuFy4JZqu3CzBnJXhX2bYzbgoGF2xyrPXooCqR0zqaK9-g_cw88TVuY_KFKYo0oHT_8onCR2Hxxya-b5cn17laq_LzjefR7SCWbfHH_cenIgoCPKmxBTtaAxbv0/s1600/lc39.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="171" data-original-width="345" height="99" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEho9EDfmsZ_AQz49TaSwuFy4JZqu3CzBnJXhX2bYzbgoGF2xyrPXooCqR0zqaK9-g_cw88TVuY_KFKYo0oHT_8onCR2Hxxya-b5cn17laq_LzjefR7SCWbfHH_cenIgoCPKmxBTtaAxbv0/s200/lc39.png" width="200" /></a></div>
<div style="text-align: left;">
So, the ships need to carry about 3/4 ton of fuel for every one ton of dry mass. Not too bad, since the same heavy lifter that carries up the packaged landers can carry up an equivalent mass of LV-N engine, fuel tank, and liquid fuel. Thus, each robotic lander requires two separate launches:<br />
<br />
First, a heavy lift booster hauls the lander payload in its aero shell into orbit.</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0HF2uYfyZFqsWJ72y3RzPB4hek8Ykmj5evXv4J-b6TLbrKcSPAFqAIW86br6wp7_mBmHGVTBQ8W0XgBy0K3mxXImlCNhQ5nl4Gb5Lyq3dL3UPwWyA9H4_KeM4IUlTDZcwUbIfYCiXRGE/s1600/lc25.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0HF2uYfyZFqsWJ72y3RzPB4hek8Ykmj5evXv4J-b6TLbrKcSPAFqAIW86br6wp7_mBmHGVTBQ8W0XgBy0K3mxXImlCNhQ5nl4Gb5Lyq3dL3UPwWyA9H4_KeM4IUlTDZcwUbIfYCiXRGE/s640/lc25.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">These are the unsung heroes of the mission, relentlessly hauling all the more exciting hardware into orbit.</td></tr>
</tbody></table>
Next, a second booster brings up a propulsion module, with LV-Ns and a lot of liquid fuel.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8BCli-DO2fLVpCyQkzq1HhPYy9lnIOhP9gIdW8Xvs9AYRetJSQeqner4Peru0hbsORPFtPXBnyBkjUbIB2NtRLpGVfm-UWgotmXIszHh9EMxwtxdPAl0WDCYdhwp82QEd6YFvl7VgunU/s1600/lc14.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="900" data-original-width="1600" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8BCli-DO2fLVpCyQkzq1HhPYy9lnIOhP9gIdW8Xvs9AYRetJSQeqner4Peru0hbsORPFtPXBnyBkjUbIB2NtRLpGVfm-UWgotmXIszHh9EMxwtxdPAl0WDCYdhwp82QEd6YFvl7VgunU/s640/lc14.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Just remember to check yo' staging...</td></tr>
</tbody></table>
The two meet in orbit and create a transport ship with a wet to dry mass ratio of about 1.775, for a Δv of about 4500m/s.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmET2hwOfsPHswgNSmmru0DsAKFfhZpfk_bfHC28nTUjj41WH0oAOZjZe7QYbsPbRBth5vtD_q-VNvHXy_Eyv310UBGOcb8GdYjZzuFHlFZha3YeYxO17BP-VYVYQYN5V4ZFBMqjEqmG8/s1600/lc40.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="873" data-original-width="1600" height="348" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmET2hwOfsPHswgNSmmru0DsAKFfhZpfk_bfHC28nTUjj41WH0oAOZjZe7QYbsPbRBth5vtD_q-VNvHXy_Eyv310UBGOcb8GdYjZzuFHlFZha3YeYxO17BP-VYVYQYN5V4ZFBMqjEqmG8/s640/lc40.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rendezvous between an LR1 lander package and its propulsion module.</td></tr>
</tbody></table>
<div style="text-align: left;">
A total Δv of 4500m/s <i>is</i> cutting it a bit close, but they would only need a small amount of aerobraking or <a href="https://wiki.kerbalspaceprogram.com/wiki/Tylo">Tylo</a> gravity assist to gain back a comfortable margin. There's also a good amount of RCS fuel on board that can be dumped (in a prograde or retrograde fashion) near the end of the trip if it's not needed. Additionally, the RS3 ships have a lot of fuel to spare if their propulsion modules can be swapped onto the more thirsty landers nearer to Laythe. The wet to dry mass ratio of the fleet as a whole has a comfortable margin.</div>
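Both mass-ratio figures above can be sanity-checked against the Tsiolkovsky rocket equation. A minimal sketch in Python (assuming KSP's standard g<sub>0</sub> of 9.80665 m/s² and the LV-N's 800s specific impulse):

```python
import math

G0 = 9.80665       # m/s^2, standard gravity used in KSP's delta-v math
ISP = 800.0        # s, LV-N "Nerv" specific impulse
VE = ISP * G0      # effective exhaust velocity, m/s

def mass_ratio(dv):
    """Wet/dry mass ratio needed for a given delta-v (Tsiolkovsky)."""
    return math.exp(dv / VE)

def delta_v(ratio):
    """Delta-v available from a given wet/dry mass ratio."""
    return VE * math.log(ratio)

# LKO to low Laythe orbit without aerocapture: ~4360 m/s
print(round(mass_ratio(4360), 3))   # -> 1.743, i.e. ~3/4 ton of fuel per ton of dry mass
# The assembled transport ship, at a ratio of ~1.775:
print(round(delta_v(1.775)))        # ~4500 m/s, matching the figure above
```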
<h4>
Launch Window #1</h4>
<div>
The first Jool launch window happens around Day 190 in-game. (The <a href="http://ksp.olex.biz/">simple</a> and <a href="https://alexmoon.github.io/ksp/">more complex</a> online calculators both agree to within a few days). Up to that point, I spent time refining the landers and practicing the landings on Kerbin. But once the designs were locked, the push began to assemble the fleet in orbit. </div>
<div>
<br /></div>
<div>
The practical limit on fleet size is how many ships can be juggled during the actual launch window. In order to boost the wet to dry mass ratio, these ships have the two-engine version of the propulsion module, which gives them a somewhat low thrust to weight ratio. The ~2000m/s ejection burn had to be split into two parts: one into a 10-day elliptical orbit and a second to escape onto the final Jool transfer. Even still, the burns were 10 minutes each, so the ships had to be spaced out so they would reach their final periapsis burn at reasonable intervals.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZqGW1lLqzAHC7w55DDQrmUGN2dbCweLGURdAZT56Tciol3HX4D3UbsUsfX9egEPopqisJtFNhe6GuIiRfeDm0tp4J7MiXaa1RyJzj08P2FSpXVfnYk1XOjp83B3XYnW-tp0GFkAX7PK8/s1600/lc41.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="861" data-original-width="1600" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZqGW1lLqzAHC7w55DDQrmUGN2dbCweLGURdAZT56Tciol3HX4D3UbsUsfX9egEPopqisJtFNhe6GuIiRfeDm0tp4J7MiXaa1RyJzj08P2FSpXVfnYk1XOjp83B3XYnW-tp0GFkAX7PK8/s640/lc41.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Also, it would be nice if they didn't hit the Mun on their way in.</td></tr>
</tbody></table>
<div>
After the final burn, the transfer takes over two Kerbin years, meaning Kerbin will have been destroyed by the time the first ship even arrives at Jool. The lander designs can never be tweaked, and the crewed fleet will have to set out with no guarantee that there will be a base waiting. No pressure.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij9Btli8kpy50fjrZKZAf_9zGUlLYsn82eKKKuygyc_cuW-zaRzbJpClgeMzrAxLxhpcBtIjHnJtR5O-U-ibWeZGmQGIDhxv_JrLhcUc8z7ywjNpHrPOfPg3JIKxAL9q39WHw991ojUgo/s1600/lc31.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1038" data-original-width="1600" height="415" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij9Btli8kpy50fjrZKZAf_9zGUlLYsn82eKKKuygyc_cuW-zaRzbJpClgeMzrAxLxhpcBtIjHnJtR5O-U-ibWeZGmQGIDhxv_JrLhcUc8z7ywjNpHrPOfPg3JIKxAL9q39WHw991ojUgo/s640/lc31.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Lots of empty space to cross now.</td></tr>
</tbody></table>
<h4>
661 Days Remain</h4>
<div>
With the first 18 ships on their way to Jool and then Laythe to set up base, the priority shifts to getting Kerbals off-planet. This means mass-producing and then filling the immense colony ships, which are the most intricate builds I have attempted in KSP yet.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6TgZmLWSJ_fdeI6it8LgXYXxsNlrLrrOq7WzIzLQVqUFWAdnyPengINsvZ5AFqC1zIbGFDV8_qDdBo7PbosTLBV_jRerOPCHY8Ukg0Z5xmvYMlsSgIFg26mRvrAFo_Iy6NEpzihFnhr4/s1600/lc42.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="715" data-original-width="1600" height="286" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6TgZmLWSJ_fdeI6it8LgXYXxsNlrLrrOq7WzIzLQVqUFWAdnyPengINsvZ5AFqC1zIbGFDV8_qDdBo7PbosTLBV_jRerOPCHY8Ukg0Z5xmvYMlsSgIFg26mRvrAFo_Iy6NEpzihFnhr4/s640/lc42.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">More to come...</td></tr>
</tbody></table>
</div>