Wednesday, June 12, 2019

Freight Train of Pixels

I have a problem. After any amount of time at any level of development of anything, I feel the urge to move down one layer into a place where I really shouldn't be. Thus, after spending time implementing capture software for my Point Grey (now FLIR) cameras, I am now tired of dealing with USB cables and drivers and firmware and settings.

What I want is image data. Pixels straight from a sensor. As many as I can get, as fast as I can get them. To quote Jeremy Clarkson from The Great Train Race (Top Gear S13E1), "Make millions of coals go in there." Except instead of coals, pixels. And instead of millions, trillions. It doesn't matter how. I mean, it does, but really I want the only real constraints to be where the pixels are coming from and where they are going. So let's see what's in this rabbit hole.

The Source

The image sensor feeding this monster will be an ams (formerly CMOSIS) CMV12000. It's got lots of pros and a few cons for this type of project, which I'll get into in more detail. But the main reason for the choice is entirely non-technical: This is a sensor that I can get a full datasheet for and purchase without any fucking around. This was true even back in the CMOSIS days, but as an active ams part it's now documented and distributed the same way as their $1 ICs.
The CMV12000 is not $1, sadly, but you can Buy It Now if you really want. For prototyping, I have two monochrome ones that came from a heavily-discounted surplus listing. Hopefully they turn on.
This is a case, then, where the available component drives the design. The CMV12000 is not going to win an image quality shootout with a 4K Sony sensor, but it is remarkably fast for its resolution: up to 300fps at 4096x3072. That's 3.8Gpx/s, somewhere between the total camera interface bandwidth of a Tesla Full Self-Driving Chip (2.5Gpx/s) and the imaging rate of a Phantom Flex 4K (8.9Gpx/s). A version jump somewhere in this sensor's history moved it, I think, into a different category of speed, and that speed is the capability I'm leveraging to push this design.
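To put some numbers on this freight train (assuming the 10-bit readout mode, which is the one that reaches 300fps):

```python
# Back-of-envelope data rates for the CMV12000 at full resolution and frame
# rate, assuming 10-bit readout (the mode that reaches 300fps).
width, height, fps, bits = 4096, 3072, 300, 10

px_rate = width * height * fps            # ~3.8 Gpx/s
bit_rate = px_rate * bits                 # ~37.7 Gb/s total
per_channel = bit_rate / 64               # ~590 Mb/s on each LVDS pair

print(f"{px_rate / 1e9:.1f} Gpx/s, {bit_rate / 1e9:.1f} Gb/s, "
      f"{per_channel / 1e6:.0f} Mb/s per channel")
```

That per-channel number lines up with the 600Mb/s LVDS links described below.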

The CMV12000 is also a global shutter CMOS sensor, something more common in industrial and machine vision applications than consumer cameras. The entire frame is sampled at once, instead of row-by-row as in rolling shutter CMOS. (The standupmaths video on the topic is my favorite.) The advantage is that moving objects and camera panning don't create distortion, which is arguably just correct behavior for an image sensor... But although a few pro cameras with global shutter have existed, even those have mostly died out. This is due to an interlinked set of trade-offs that give rolling shutter designs the advantage in cost and/or dynamic range.

For engineering applications, though, a global shutter sensor with an external trigger is essentially a visual oscilloscope, and can be useful beyond just creating normal video. By synchronizing the exposure to a periodic event, you can measure frequencies or visualize oscillations well beyond the frame rate of the sensor. Here's an example of my global shutter Grasshopper 3 camera capturing the cycle of a pixel-shifting DLP projector. Each state is 1/720 s in duration, but the trigger can be set to any multiple of that period, plus or minus a tiny bit, to capture the sequence with an effective frame rate much higher than 720fps.
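This is just equivalent-time sampling, the same trick sampling oscilloscopes use. A quick sketch of the trigger math, where everything except the 1/720 s state period is a made-up example value:

```python
# Equivalent-time sampling with a triggered global shutter camera: trigger at
# a multiple of the event period plus a small offset, so successive frames
# walk through the cycle far faster than the real frame rate could follow.
event_period = 1 / 720          # projector state duration, from above
n = 8                           # any multiple the camera can keep up with
offset = 10e-6                  # 10us step through the cycle per frame

trigger_period = n * event_period + offset
actual_fps = 1 / trigger_period     # ~90 real frames per second
effective_fps = 1 / offset          # 100,000 fps equivalent time resolution

print(f"trigger every {trigger_period * 1e3:.3f} ms: "
      f"{actual_fps:.0f} real fps, {effective_fps:.0f} effective fps")
```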



Whether a consequence of the global shutter or not, the main on-paper shortcoming of the CMV12000 is the relatively high dark noise of 13e-. For comparison, the Sony IMX294CJK, the 4K sensor in some new cameras with very good low-light capability, is below 2e-. That's a rolling shutter sensor, though. Sony also makes low-noise global shutter CMOS sensors like the IMX253, at around 2.5e-. The extra noise on the CMV12000 will mean that it needs more light for the same image quality compared to these sensors.

Even given adequate light, the higher noise also eats into the dynamic range of the sensor. The signal-to-noise ratio for a given saturation depth will be lower. This means either noisy shadows or blown-out highlights. But the CMV12000 has a feature I haven't seen on any other commercially-available sensor: a per-pixel stepped partial reset. The theory is to temporarily stop accumulating charge on bright pixels when they hit intermediate voltages, while allowing dark pixels to keep integrating. Section 4.5.1 in this thesis has more on this method.

In the example below, the charge reading is simulated for 16 stops of contrast. With baseline lighting, the bottom four stops are lost in the noise and the top four are blown out. Increasing the illumination by 4x recovers two stops on the bottom, but loses two on top. The partial reset capability slows down the brightest pixels, recovering several more stops on top without affecting the dark pixels. The extra light is still needed to overcome the dark noise, but it's less of an issue in terms of dynamic range.
Dynamic range recovery using 3-stage partial reset.
The end result of partial reset is a non-linear pixel response to illumination. This is often done anyway, after the ADC conversion, to create log formats that compress more dynamic range into fewer bits per pixel. Having hardware that does something similar in-pixel, before the ADC, is a powerful feature that's not at all common.
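To make the mechanism concrete, here's a toy model of a 3-stage partial reset. All the reset times and levels are invented for illustration, not taken from the sensor:

```python
# Toy model of per-pixel stepped partial reset: partway through the exposure,
# any pixel above a threshold is knocked back down to it, while dimmer pixels
# keep integrating undisturbed. All numbers are illustrative only.
FULL_WELL = 1.0

def pixel_response(flux, t_exp=1.0,
                   resets=((0.5, 0.6), (0.8, 0.8), (0.99, 0.92))):
    """Final charge for a pixel collecting `flux` units of charge per unit time."""
    t, q = 0.0, 0.0
    for t_reset, level in resets:
        q = min(q + flux * (t_reset - t), FULL_WELL)
        q = min(q, level)            # the partial reset: clamp bright pixels
        t = t_reset
    return min(q + flux * (t_exp - t), FULL_WELL)

# Dim pixels respond linearly; bright ones get compressed instead of clipping.
for flux in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(f"flux {flux:4.1f} -> charge {pixel_response(flux):.3f}")
```

A pixel at flux 4.0 would have saturated a quarter of the way through a plain exposure; here it comes out below full well, still carrying usable information.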

Another aspect of the CMV12000 that helps with implementation is the pixel data interface: the data is spread out on 64 parallel LVDS output pairs that each serve a group of pixel columns. This extra-wide bus means more reasonable clock speeds: 300MHz DDR (600Mb/s) for full rate. A half-meter wavelength means wide intra-pair routing tolerances. There is still a massive 4.8ns inter-channel skew that has to be dealt with, but it would be futile to try to length match that. The sensor does put out training data meant for synchronizing the individual channels at the receiver, which is a headache I plan to have in the future.

The Sink

I'm starting from the assumption that it's impossible to really do anything permanent with 38Gb/s of data, if you're working with hardware at or below the level of a laptop PC. In an early concept, I was planning to just route the data to a PCIe x4 output and send it into something like an Intel NUC for further processing. But even that isn't fast enough for the CMV12000. (Also, you can buy something like that already. No fun.) And even if you could set up a 40Gb/s link to a host PC through something like Thunderbolt 3, it's really just kicking the problem down the road to more and more general hardware, which probably means more watts per bit per second.

Ultimately, unless the data is consumed immediately (as with a machine vision algorithm that uses one frame and then discards it), or buffered into RAM as a short clip (as with circular buffers in high-speed cameras), the only way to sink this much data reasonably is to compress it. And this is where this project goes off the rails a little.

For starters, I'm choosing 1GB/s as a reasonable sink rate for the data. This is within reach of NVMe SSD write speeds, and makes for completely reasonable recording times of 17min/TB (at maximum frame rate). This is very light compression, as far as video goes - less than 5:1. I think the best tool for the job is probably wavelet compression, rather than something like h.265. It's intra-frame and uses relatively simple logic, which means fast and cheap. But putting aside the question of how fast and how cheap for now, I first just want to make sure the quality would be acceptable.
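The arithmetic behind that target, using the full-rate numbers from earlier:

```python
# How much compression does a 1 GB/s sink target actually imply?
sensor_gbps = 4096 * 3072 * 300 * 10 / 1e9    # ~37.7 Gb/s off the sensor
sink_gbps = 8.0                               # 1 GB/s NVMe write target

ratio = sensor_gbps / sink_gbps               # ~4.7:1 - "less than 5:1"
minutes_per_tb = (1e12 / 1e9) / 60            # ~16.7 min to fill a 1TB drive

print(f"{ratio:.1f}:1 compression, {minutes_per_tb:.1f} min/TB at full rate")
```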

There are several good examples of wavelet compression already in use: JPEG2000 uses different variants for lossless and lossy image compression. REDCODE is wavelet-based and 5:1 is a standard setting described as "visually lossless". CineForm is a wavelet codec recently open-sourced by GoPro. The SDK for CineForm includes a lightweight example project that just compresses a monochrome image with different settings. Running a test image through that with settings close to 5:1 produces good results:
The original monochrome image.
The wavelet transform outputs a 1/8-scale low-frequency thumbnail and three stages of quantized high-frequency blocks, which are sparse and easy to compress. I just zipped this image as a test and got a 5.7:1 ratio with these settings.
The recovered image.
Since these images are going to be destroyed by rescaling anyway, here's a 400% zoom of some high-contrast features.

The choice of wavelet type does matter, but I think the quantization strategy is even more important. The wavelet transform doesn't reduce the size of the data; it just splits it into low-frequency and high-frequency blocks. In fact, for all but the simplest wavelets, the blocks require more bits to store than the original pixels:
Output range maps for different wavelets. All but the simplest wavelets (Haar, Bilinear) have corner cases of low-frequency or high-frequency outputs that require one extra bit to store.
Take the CineForm 2/6 wavelet (a.k.a. reverse biorthogonal 1.3?) as an example: the low-frequency block is just an average of two adjacent pixels, so it doesn't need any more bits than the source data. But the high-frequency blocks look at six adjacent pixels and could, for some corner cases, give a result that's larger than the maximum pixel amplitude. They need one extra bit to store the result without clipping. Seems like we're going in the wrong direction!
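Here's my reading of the 2/6 filter pair, as a quick sketch. The real CineForm integer math scales and rounds differently, so treat this as illustrative only:

```python
import numpy as np

def wavelet_26(x):
    """One level of a 2/6-style wavelet split on a 1D signal of even length.
    Sketch only - not CineForm's exact integer arithmetic."""
    lo = (x[0::2] + x[1::2]) / 2        # 2-tap low-pass: pair averages
    d = x[0::2] - x[1::2]               # pair differences
    lp = np.pad(lo, 1, mode='edge')     # repeat edges for the 6-tap filter
    hi = d + (lp[2:] - lp[:-2]) / 4     # 6-tap high-pass via neighboring lows
    return lo, hi

# A worst-case 10-bit input pushes the high-pass output past the input range:
x = np.array([0, 0, 1023, 0, 1023, 1023], dtype=float)
lo, hi = wavelet_26(x)
print(lo)    # [0, 511.5, 1023] - stays within the source range
print(hi)    # middle sample is ~1278.75, bigger than any possible input pixel
```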

As with most image compression techniques, the key fact is that the high-frequency information is less valuable, and can be manipulated or even discarded without as much visual penalty. By applying a deadband and quantization step to the high-frequency blocks, the data becomes sparser and easier to compress (a toy version of this quantizer is sketched below, after the image). Since this is the lossy part of the algorithm, the details are hugely important. I have a little sandbox program that I use to play with different wavelet and quantization settings on test images. In most cases, 5:1 compression is very reasonable.
Different wavelets and quantizer settings can be compared quickly in this software sandbox.
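Here's the toy version of the quantizer promised above. The deadband and step sizes are made up, and this is exactly the knob that trades quality for compression ratio:

```python
import numpy as np

def quantize_highpass(hi, deadband=4, step=8):
    """Deadband + uniform quantizer for high-frequency wavelet coefficients.
    Small values collapse to zero (sparsity!); the rest get coarser steps.
    Illustrative settings, not CineForm's."""
    mag = np.maximum(np.abs(hi) - deadband, 0)    # zero inside the deadband
    return np.sign(hi) * (mag // step)

hi = np.array([-90, -5, 0, 2, 3, 47, 130])
print(quantize_highpass(hi))    # [-10  0  0  0  0  5  15] - mostly zeros now
```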
That's enough evidence for me that wavelet compression is a completely acceptable trade-off for opening up the possibility of sinking to a normal 1TB SSD instead of an absurd amount of RAM. A very fast RAM buffer is still needed to smooth things out, but it can be limited in size to just as many frames as are needed to ride out pipeline transients. Now, with the source and sink constraints defined, what the hell kind of hardware sits in the middle?

The Pipe

There was never any doubt that the entrance to this pipeline had to be an FPGA. Nothing else can deal with 64 LVDS channels. But instead of just repackaging the data for PCIe and passing it along to some poor single board computer to deal with, I'm now asking the FPGA to do everything: read in the data, perform the wavelet compression, and write it out to an SSD. This will ultimately be smaller and cheaper, since there's no need for a host computer, but it means a much fancier FPGA.

I'm starting from scratch here, so all of this is just an educated guess, but I think a viable solution lies somewhere in the spectrum of Xilinx Zynq UltraScale+ devices. They are FPGA fabric bolted to ARM cores in a single chip. Based on the source and sink requirements, I can narrow down further to something between the ZU4 and ZU7. (Anything below the ZU4 lacks the transceivers needed for PCIe Gen3 x4 to the SSD, and anything above the ZU7 is prohibitively expensive.) Within each ZU number, there are also three categories: CG has no extra hardware, EG has a GPU, and EV has a GPU and an h.264/h.265 codec.

In the interest of keeping development cost down, I'm starting with the bottom of this window, the ZU4CG. The GPU and video codec might be useful down the road for 30fps previews or making proxies, but they're too slow to be part of the main pipeline. Since they're fairly sideways-compatible, I think it's reasonable to start small and move up the line if necessary.

I really want to avoid laying out a board for the bare chip, its RAM, and its other local power supplies and accessories. The UltraZed-EV almost works, but it doesn't break out enough of the available LVDS pins. It's also only available with the ZU7EV, the very top of my window. The TE08xx Series of boards from Trenz Electronic is perfect, though, covering a wider range of the parts and breaking out enough IO. I picked up the ZU4CG version for less than the cost of just the ZU4CG on Digi-Key.
Credit card-sized TE0803 board with the ZU4CG and 2GB of RAM. Not counting the FPGA, the processing power is actually a good deal less than what's on a modern smartphone.
One small detail I really like about the TE0803 is that the RAM is wired up as 64-bit wide. Assuming the memory controller can handle it, that would be over 150Gb/s for DDR4-2400, which dwarfs even the CMV12000's data rate. I think the RAM buffer will wind up on the compressed side of the pipeline, but it's good to know that it has the bandwidth to handle uncompressed sensor data too, if necessary.

Time for a motherboard:
The "tall" side has the TE0803 headers, an M.2 connector, USB-C, a microSD slot, power supplies, and an STM32F0 to act as a sort-of power/configuration supervisor. Sensor pins are soldered on this side.
The "short" side has just the sensor and some straggler passives that are under 1mm tall.
Aside from the power supplies, this board is really just a breakout for the TE0803, and the placement of everything is driven by where the LVDS- and PCIe-capable pins are. Everything is a differential pair, pretty much. There are a bunch of different target impedances: 100Ω for LVDS, 85Ω for PCIe Gen3, 90Ω for USB. I was happy to find that JLCPCB offers a standard 6-layer controlled-impedance stackup. They even have their own online calculator. I probably still fucked up somehow, but hopefully at least some of it is right so I can start prototyping the software.

Software? Hardware? What do you call FPGA logic? There are a bunch of somewhat independent tasks to deal with on the chip. At the input side, the pixel data needs to be synchronized using training data to deal with the massive 4.8ns inter-channel skew. The FPGA inputs have a built-in delay tap, but it maxes out at 1.25ns. You can, in theory, cascade these with the adjacent unused output delays, to reach 2.5ns. That's obviously not enough to directly cancel the skew, but it is enough to reach the next 300MHz clock edge. So, possibly some combination of cascaded hardware delays and intentional bit slipping can cover the full range. It's going to be a nightmare.
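The search I have in mind looks something like this, modeled in Python instead of HDL. The training word, tap range, and fabric-poking callbacks are all placeholders, not datasheet or toolchain values:

```python
# Per-channel link training model: sweep the input delay taps, and at each
# setting try bit-slip positions until the deserialized parallel word matches
# the training pattern the sensor drives between frames.
TRAINING_WORD = 0x155   # placeholder; the real pattern is a sensor setting
MAX_TAPS = 64           # placeholder for the cascaded delay range
WORD_BITS = 10

def sync_channel(set_delay_taps, bitslip, read_word):
    """Align one LVDS channel. The three callbacks stand in for whatever
    mechanism actually reaches into the FPGA fabric."""
    for taps in range(MAX_TAPS):
        set_delay_taps(taps)
        for slips in range(WORD_BITS):
            if read_word() == TRAINING_WORD:
                return taps, slips      # channel locked
            bitslip()
    raise RuntimeError("channel failed to align")
```

Multiply that by 64 channels and it's easy to see where the nightmare comes from.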

The output side might be even worse. Just look at the number of differential pairs going into the TE0803 headers vs. the number coming out. That's the ratio of how much tighter the timing tolerance is on the PCIe outputs. The edge of one bit won't hit the M.2 connector until a couple more have already left the FPGA. In this case, I have taken the effort to length match the pairs themselves. I won't know how close I am until I can do a loopback test.
Length matching the PCIe differential pairs to make up for the left turns and TE0803 routing.
Even assuming the routing is okay, there's the problem of NVMe. NVMe is an open specification for what lives on top of the PCIe PHY to control communication with the SSD. It's built into Linux, including versions that can run on the ZU4CG's ARM cores. But that puts the operating system in the pipeline, which sounds like a disaster. I haven't seen any examples of that running at anywhere near 1GB/s. I think hardware-accelerated NVMe might work, but as far as I can tell there are no license-free NVMe cores in existence. I don't have a solution to this problem yet, but I will happily sink hours into anything that prevents me from having to deal with IP vendors.

Sitting right in the middle, between these input and output constraints, is the complete mystery that is the wavelet core. This has to be done in hardware. The ARM cores and even the GPU are just not fast enough, and even if they were, accessing intermediate results would quickly eat the RAM bus. The math operations involved are so compact, though, that it seems natural to implement them in tiny logic/memory cores and then put as many of them in parallel as possible.

The wavelet cores are the most interesting part of this pipeline and require a separate post to cover in enough detail to be meaningful. I have a ton of references on the theory and a rough concept for how to turn it into lightweight hardware. As it stands, I know only enough to have some confidence that it will fit on the ZU4CG, in terms of both logic elements and distributed memory for storing intermediate results. (The memory requirement is much less than a full frame, since the wavelets only look ahead/behind a few pixels at a time.) But there is an immense amount of implementation detail to fill in, and I hope to make a small dent in that while these boards are in flight.

To summarize, I still have no clue if, how, or when any of this will work. My philosophy on this project is to send the pixels as fast as they want to go and try to remove anything that gets in the way. It's not really a plan - more of a series of challenges.

Monday, May 27, 2019

KSP: Laythe Colony Part 3, The Colony Ships

Jool Launch Window #2 is all about getting as many Kerbals in transit to Laythe as possible, and that means building a fleet of colony ships. This was actually the first ship designed for this mission, but I only built one as a proof-of-concept before committing to the Robotic Fleet for Launch Window #1. Those habitats, rovers, and relays will arrive first to pave the way for the colony ships.

The colony ships are built in orbit, with each part launched separately on the same heavy-lift boosters that sent up the Robotic Fleet. The core of each ship, around which the rest of the ship is built, is a Passenger Module:

The Passenger Module
The Passenger Module has room for 18 Kerbals (half the crew of each ship), with two main living compartments on each end, a central stack of general purpose seating, and two observation domes. It's meant to be the "comfortable" portion of the ship, to make the multi-year journey more bearable than would be possible in a lander cockpit. Not that Kerbals really care.

One of the main quality of life considerations for the colony ships is the ability to spin to generate artificial gravity in some of the living quarters. For this reason, the rest of the ship is built along an axis passing through the center of the passenger module. Forward, the next part is the Docking Module:

The Docking Module
While it has space for another six Kerbals, the Docking Module is more of a working space than a living space. Since it's on the central axis, there's no artificial gravity. But it has a large science lab and common area for the crew. Most importantly, it serves as the docking interface for the Space Planes, which shuttle crew to and from the colony ships.

The Space Planes
The Space Planes are really the key to this entire mission, providing a way to get hundreds of Kerbals down to Laythe without having to exactly target flat landing sites from orbit. I tweaked and tested the design to the point where getting to orbit, docking with a colony ship, and returning to Kerbin for a runway landing was utterly routine. Each colony ship required four Space Plane round-trips and two one-way trips to fully crew. The two one-ways go with the ship to Laythe, where they will be used to ferry Kerbals down to the surface.

The back-to-back docking configuration for the Space Planes minimizes the moment of inertia along the spin axis. The two planes have to be exactly symmetric, so each interfaces with two medium-size docking ports for alignment. It is possible, with careful flying, to get both ports to engage at the same time. In addition to enforcing symmetry, this makes the final structure much more rigid. Finding parts that are exactly the right spacing on both sides to make this possible was the trickiest part of the design.

I could write an entire post about the Space Plane design, but I think I'll just post some pictures and videos of it kicking ass instead:







The last picture has a story that goes with it: For some reason, after dozens of clean flights, I botched a take-off and slammed back into the KSC runway with the gear still down, breaking off both wings, the outer engines and fuel tanks, the vertical stabilizers, and all but the two inner horizontal control surfaces. The fraction of a plane that was left was somehow still able to gain altitude, do a wide 180° turn, and make a water landing just off shore.

Anyway, back to the colony ships. Behind the Passenger Module is a truss structure I just call The Node:

The Node
This is the lightest and simplest part of the colony ship, primarily serving as a connector between the crew stack and the propulsion. It also carries the large solar panels, some battery storage, extra reaction control systems, and side ports for docking other modules, such as for refueling. Altogether a small but important building block.

To push all this, three Propulsion Modules are launched separately and docked to the back of the Node. These are the full four-engine versions of the propulsion modules used for the Robotic Fleet.

The Propulsion Modules
With 12 engines in total, the colony ships actually have a higher thrust-to-weight ratio than the robotic landers. The entire colony ship, including the two Space Planes, comes in at just under 300 tons and has a fully-fueled Delta-V of about 4400m/s, which should be enough to get to Laythe orbit with just a tiny bit of help from gravity assists off Tylo (or Laythe itself).

The process of building a single colony ship takes 13 separate launches: six for assembly, six crew shuttles (including two permanent ones), and one refueling run. (While the propulsion modules get to orbit fully-fueled, the total Delta-V counts on topping off the two Space Planes.) It's without a doubt the most ambitious in-orbit construction project I've attempted in KSP.


Oh, and I built 10 of these for Launch Window #2. That's about 3,000 tons of hardware, including 20 Space Planes and 360 Kerbals, on the way to Laythe.

196 Days Remain

I had intended for the second launch window to be the last ships out, but my arbitrary deadline of Year 3, Day 0 for the destruction of Kerbin leaves some time to send up a few more. They'll have to wait for the third launch window, possibly in the relative safety of a Minmus orbit, but I can think of a few extra pieces of hardware that would be useful to the colony.

Sunday, March 31, 2019

TinyCross: Chassis Build

Just hit print...
This winter build season is coming to a close: almost all of the mechanical work for TinyCross is done! Here's a recap of how the rolling chassis came together:

The frame and suspension are mostly a kit of aluminum plates and 80/20 extrusion, with almost no post-machining required, so they went together very quickly. After the main box frame assembly and seat mounting, I started with a test build of a single corner of A-arm geometry to make sure I hadn't missed any clearance issues. I also wanted to get a first impression of the stiffness in real life, since that's probably the biggest risk of this new design.


I had no trouble at all with the front-right corner. Everything fit together as planned and the stiffness felt adequate, largely thanks to the zero-slop QA1 ball joints. I was actually a little surprised at how well-behaved it felt. Once all six ball joints and the air shock pins were tightened down, it really did have only the degrees of freedom it should have: one for steering and one for suspension travel. There's no rattling or play at all. With high confidence from the test corner, I went into production line mode for the other three.

PTFE-lined QA1 1/4-28 rod ends (CMR4TS) are the real stars of this build. It would not have been possible with McMaster's selection of ball joints, which are either cheap and overly loose or expensive and overly tight.
I say there was almost no post-machining, but just tapping all the 80/20 ends was a whole day of work.
About here is where the perfect build ended, though, because when I went to attach the front-left corner, I discovered that there was a slight interference between the A-arms and the air shock valve stems. The parts I designed, all 2D plates, are 100% symmetric, so I didn't bother to model the other corners. But the shocks themselves are not symmetric, so it wasn't exactly correct to assume that things would fit together the same in the mirrored configuration. Since the interference was minimal, I debated cutting notches in the A-arms for the valve stems. But I was able to find a more satisfying solution.

Can you spot it?
Not modeling the mirrored parts was a semi-legitimate time-saving strategy, but not modeling the rear corners was just laziness. And of course there was a major interference there: the brake calipers would not have cleared the corners of the frame. (In the front, it's no problem since there is extra space for steering travel.) It was nothing that couldn't be solved with a hacksaw and some improvisation, though. I actually like the final outcome better than the original design...

...he justifies, in post.
Minor issues aside, I am pleased with how the chassis turned out. It's much stiffer than tinyKart, thanks to a slight excursion into the third dimension, but still very light. And I went from 50% to 99% confidence on the suspension design after getting hands on the assembled corners.

So far, so light.
Most of the machining for this build was for turned parts within the four drive modules. The spindle shafts for the wheels were made from 7075 aluminum and support the wheel bearings (6902-2RS). Unlike on tinyKart, the shafts are doubly-supported within a box structure built around the wheel, which should be much more impact tolerant. The large drive pulleys got some weight reduction and a custom bolt pattern to interface with the wheel hubs.


My favorite bit of packaging is the brake caliper occupying the volume inside the belt loop, with the brake disk flush against one side of the wheel pulley. Torque is sourced and sunk from the same side of the wheel - in fact from the same metal plate.


The motor shaft and motor pulley also required some custom machining. This was a weak link on tinyKart: the original design used set screws on shaft flats, which were prone to loosening over time (or, in one case, completely shearing off the 10mm shaft at the flat). After switching to keyed shafts (via Alien Power Systems motors), those problems mostly went away. But there was still axial play, and the torque was still being transmitted through a 3mm key into an aluminum keyway.

For TinyCross, I wanted to have a clamping and keyed shaft adapter so the torque would be primarily transmitted through friction, with the key as back-up. There's not a lot of room to work within the 15-tooth drive pulley, so that just gets bored out as much as possible and then pressed like hell, with retaining compound, onto a 7075 adapter. This adapter then gets the 10mm bore with a 3mm keyway. But it also gets slotted, turning it into a clamp. Finally, an off-the-shelf 0.75in aluminum clamping collar tightens the whole assembly down onto the motor shaft, with the key in place.


Additionally, the outboard side of the shaft interfaces with another bearing, for double support, and has a pocket for a shaft rotation sense magnet, to be picked up by a rotary encoder IC.

Not messing around.
For brakes, I opted for the same disks, calipers, cables, and levers as on tinyKart. I briefly debated going hydraulic, but the plumbing for four-wheel disk brakes seemed like an unnecessary nightmare. tinyKart never had a problem with braking torque; it could easily lock up both front wheels. It just had so little weight on the front wheels that braking and steering were often mutually exclusive activities. With four-wheel disk brakes, TinyCross should be much more controllable under braking. The TerraTrike dual pull levers are key to making this work: they have a fulcrum between the lever and two cable ends that ensures both cables get pulled with equal force. I have one such lever for the two front discs and one for the rears.

Independent rear brake lever...what could go wrong?
The last piece of the mechanical puzzle is the steering. At each wheel, there's a steering arm that terminates in a place to mount yet another ball joint, using a T-nut. This is driven by a link comprising a threaded rod with an aluminum stiffener, another trick carried over from tinyKart. The aluminum stiffener is compressed and the threaded rod is stretched by two nuts, creating a link that's stiffer than either part by itself.



For the rear wheels, all that's required are two fixed mounting points for the other end of these rods. Rear toe angle is set by adjustment with the threaded rod.

The toe-setting plate doubles as the TinyCross badge.
The front requires an actual steering mechanism. In lieu of a rack-and-pinion, I used a simple four-bar, driven by the steering column through a universal joint buried in the middle of the front suspension support tower. Each link in the four-bar has its own set of thrust bearings and radial bushings, to minimize the extra linkage slop.



This sort of setup works since the steering throw is very short: ±45° of travel is all it needs. Amazingly, it all clears over the full suspension travel and there doesn't seem to be much bump steer. (I shouldn't be amazed, since it works in CAD, but I am anyway.)

And just like that, it rolls.


Unfortunately, although the frame and suspension were on-target, the four power modules are way over weight budget. The wheels themselves are annoyingly heavy. I can't do much about that, but I can probably take some weight out of the surrounding assembly. A lot of the design is driven around the off-the-shelf cast aluminum rims. If I am willing to chop down the rim and re-machine its outer bearing bore, I can probably save a little weight and a lot of width. I can maybe even get it below 34in, which would help with getting through doorways.

But for now I will shift focus to the electronics. Preliminary bring-up of the motor drives has been uneventful (that's a good thing), but I'll need to actually hook them up to LiPos next, which could always become interesting in fiery ways.

To be continued...

Thursday, November 29, 2018

KSP: Laythe Colony Part 2, The Robotic Fleet and Launch Window #1

In honor of the successful Mars InSight landing this week, I thought I'd do a progress report on my long-term KSP mission to get as many Kerbals off Kerbin by Year 2, Day 0 as possible. Part 1 sets up the premise and the main strategy. In this Part 2, I throw about 1,000 tons of robotic hardware at Jool during the first available launch window, with hopes that at least some of it winds up in a single spot on the surface of Laythe as the seed for a colony.

The busy 1000m/s on-ramp to Jool Transfer Orbit.

The Robotic Fleet

For the first launch window, I decided to send only uncrewed vehicles to feel out the Jool transfer orbit, the details of maneuvering within the Jool system, and the landing procedures at Laythe. The robotic fleet consists of three types of ship: Triple Relay Satellites (RS3), Laythe Rovers (LR1), and Habitats (HAB1). Each one has a different function crucial to settling a remote colony.

RS3

Does this thing get HBO?
These are the smallest and lightest ships, but critical to this first remote-controlled mission phase. One of the relatively new realism additions to KSP is that uncrewed vehicles need to have a line-of-sight communication path back to Kerbin, or to a ship with a crew, in order to maneuver. Achieving this requires lining up a bunch of relay satellites around Kerbin and at other useful locations in the system.

Each RS3 assembly carries three small relay satellites with their own ion drives. In addition to the ones already parked around Kerbin, two sets of three are on their way out to Kerbin L4 and L5 stations and then eventually other equally spaced points in the orbit. Three more sets are heading out to an intermediate orbit between Kerbin and Jool. And four sets of three are in the fleet heading for Jool, to set up a network around Laythe.

The start of the mission's comms network.

LR1

Practice driving on Kerbin.
The Laythe Rovers are giant 30 ton workhorses. The main function of these 8WD crawlers is to seek out ore to mine and make fuel on Laythe. They each have two large drills, a refinery, and a huge fuel storage tank. They can dock with a parked space plane to refuel it, which is critical for sustaining a link between the Laythe surface and hardware/habitats in orbit.

Landing the rovers is a four-step process. They come packaged in an aero shell with a heat shield, so the initial descent involves just surviving with the heat shield pointed in the right direction. After some time, the drag on the large heat shield flips the package around and the heat shield itself becomes a supersonic air brake, with the aero shell protecting the rover. Once subsonic, the heat shield and fairing are discarded and a set of parachutes further slows and rights the rover. Lastly, a set of four rockets slows it to a safe velocity just in time for touchdown.

Step 5 is to quickly deploy the solar panels and drive out of the way of falling fairing debris.

HAB1

Home, sweet home.
The Kerbals can live for extended periods of time in orbit, but having a home base on the Laythe surface will be important for long-term survival. In order to facilitate construction, the surface habitats are themselves rovers with roughly the same chassis as the LR1s. They land the same way and, once on the surface, can drive to each other. This will be important, since the landing target might span hundreds of square kilometers.

The habitats are extremely modular. They can be individual homes for a single Kerbal family, including a single-passenger mini-rover parked in front. Or, they can be docked together indefinitely to form a larger base, thanks to a central hallway section with docking ports on either end. The slight angle of the hallways allows them to fit inside the aero shell.

How much fuel to bring?

The LR1 and HAB1 landed payloads are both around 28 tons, about half the mass of my first Laythe lander. (That lander had to be heavy in order to have enough fuel to get back off of Laythe, a task to be handled by space planes this time around.) In that mission, two identical ships flew independently to Laythe with an average of about 2500m/s of fuel-burning Δv. But, they also made heavy use of aerocapture at both Jool and Laythe. Without that, it would take something more like 4360m/s to get from low Kerbin orbit to low Laythe orbit, according to the amazing KSP Subway Map.

With a known Δv requirement, figuring out how much fuel to bring is simple. With the 800s specific impulse of the LV-Ns, the rocket equation gives the minimum wet-to-dry mass ratio:

$$\frac{m_{wet}}{m_{dry}} = e^{\Delta v / (I_{sp} \, g_0)} = e^{4360 / (800 \times 9.81)} \approx 1.74$$
So, the ships need to carry about 3/4 ton of fuel for every one ton of dry mass. Not too bad, since the same heavy lifter that carries up the packaged landers can carry up an equivalent mass of LV-N engine, fuel tank, and liquid fuel. Thus, each robotic lander requires two separate launches:

First, a heavy lift booster hauls the lander payload in its aero shell into orbit.
These are the unsung heroes of the mission, relentlessly hauling all the more exciting hardware into orbit.
Next, a second booster brings up a propulsion module, with LV-Ns and a lot of liquid fuel.
Just remember to check yo' staging...
The two meet in orbit and create a transport ship with a wet to dry mass ratio of about 1.775, for a Δv of about 4500m/s.
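As a sanity check, running that ratio back through the rocket equation:

$$\Delta v = I_{sp} \, g_0 \ln\left(\frac{m_{wet}}{m_{dry}}\right) = 800 \times 9.81 \times \ln(1.775) \approx 4500 \, \mathrm{m/s}$$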

Rendezvous between an LR1 lander package and its propulsion module.
A total Δv of 4500m/s is cutting it a bit close, but they would only need a small amount of aerobraking or Tylo gravity assist to gain back a comfortable margin. There's also a good amount of RCS fuel on board that can be dumped (in a prograde or retrograde fashion) near the end of the trip if it's not needed. Additionally, the RS3 ships have a lot of fuel to spare if their propulsion modules can be swapped onto the more thirsty landers nearer to Laythe. The wet to dry mass ratio of the fleet as a whole has a comfortable margin.

Launch Window #1

The first Jool launch window happens around Day 190 in-game. (The simple and more complex online calculators both agree to within a few days.) Up to that point, I spent time refining the landers and practicing the landings on Kerbin. But once the designs were locked, the push began to assemble the fleet in orbit.

The practical limit on fleet size is how many ships can be juggled during the actual launch window. In order to boost the wet to dry mass ratio, these ships have the two-engine version of the propulsion module, which gives them a somewhat low thrust to weight ratio. The ~2000m/s ejection burn had to be split into two parts: one into a 10-day elliptical orbit and a second to escape onto the final Jool transfer. Even still, the burns were 10 minutes each, so the ships had to be spaced out so they would reach their final periapsis burn at reasonable intervals.

Also, it would be nice if they didn't hit the Mun on their way in.
After the final burn, the transfer takes over two Kerbin years, meaning Kerbin will have been destroyed by the time the first ship even arrives at Jool. The lander designs can never be tweaked, and the crewed fleet will have to set out with no guarantee that there will be a base waiting. No pressure.

Lots of empty space to cross now.

661 Days Remain

With the first 18 ships on their way to Jool and then Laythe to set up base, the priority shifts to getting Kerbals off-planet. This means mass-producing and then filling the immense colony ships, which are the most intricate builds I have attempted in KSP yet.

More to come...