Real-time multi-agent mission-control simulator orchestrating 8 LLM-driven console operators under a Flight Director command loop — with first-principles propulsive landing physics underneath.
Propulsive landing — a reusable rocket coming down on its tail, optionally onto an actively-station-keeping droneship — collapses eight classically-decoupled engineering problems into one tightly-coupled real-time loop: guidance, control, structures, combustion stability, weather / sea-state, thermal margins, recovery hardware, and comms-latency to the platform. A modern flight team handles this by partitioning across console operators (Propulsion, GNC, Structures, Weather, Thermal, Recovery, Inspection, ASDS) under a Flight Director. The interesting question is not "can we simulate the rocket" — that's solved physics. It's whether we can simulate the operators: distributed agents reasoning under partial information, surfacing anomalies, coordinating timing, and converging on a HOLD / RESUME / SCRUB / GO-NO-GO call.
I built T-MINUS to find out how far an LLM-orchestrated mission-control loop can be pushed against first-principles physics — and whether the conversations the agents produce are useful artifacts, not theatre.
Eight LLM-driven console operators, each with their own state, telemetry feed, decision scope, and customer-tuned personality, run concurrently under a Flight Director command loop. The Flight Director can issue HOLD / RESUME / SCRUB / GO-NO-GO at any cadence; the operators speak when their console state changes or when a monitor escalates a condition. Underneath them, a deterministic physics-and-systems substrate runs the actual flight. One rule governs the split between the two layers:
> "The monitor decides WHEN, the LLM decides WHAT and HOW."
That sentence is the load-bearing architectural decision. The monitor — a small, deterministic state machine watching telemetry — is responsible for when a console agent gets invoked, what subset of state it sees, and what prompt frame it receives. The LLM is responsible only for the what and how of the response: anomaly framing, recommendation phrasing, escalation choice. This separation keeps timing deterministic, keeps LLM cost bounded, and keeps the agents from hallucinating themselves into the wrong moment of the flight.
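A minimal sketch of that gating pattern, assuming a hypothetical `Monitor` class with illustrative thresholds and rate limits (not the project's actual implementation): the deterministic layer classifies every telemetry tick, and only escalated events ever reach a model.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ConsoleEvent(Enum):
    NOMINAL_TICK = auto()
    THRESHOLD_CROSSED = auto()
    FAULT_ESCALATED = auto()

@dataclass
class Monitor:
    """Deterministic gate: decides WHEN a console agent runs and
    WHAT it sees. Thresholds and rate limits are illustrative."""
    rate_limit_s: float = 5.0       # min spacing between WATCH-level LLM calls
    watch_frac: float = 0.8         # fraction of redline that arms a WATCH
    last_call: float = float("-inf")

    def classify(self, value: float, redline: float) -> ConsoleEvent:
        if value >= redline:
            return ConsoleEvent.FAULT_ESCALATED
        if value >= self.watch_frac * redline:
            return ConsoleEvent.THRESHOLD_CROSSED
        return ConsoleEvent.NOMINAL_TICK

    def should_invoke_llm(self, event: ConsoleEvent, now: float) -> bool:
        # Faults always go through; watch items are rate-limited;
        # nominal ticks never burn an LLM call.
        if event is ConsoleEvent.FAULT_ESCALATED:
            self.last_call = now
            return True
        if (event is ConsoleEvent.THRESHOLD_CROSSED
                and now - self.last_call >= self.rate_limit_s):
            self.last_call = now
            return True
        return False
```

The timing logic never consults the model, which is what keeps LLM cost bounded and the flight clock deterministic.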
T-MINUS was validated through a campaign of eight full flights across five distinct mission profiles, with customer-specific operator personalities (e.g., a NASA crew mission's Propulsion console runs more conservatively than a Starlink stack's). The campaign totaled 448 LLM calls and averaged an A− against an internal rubric scoring anomaly-detection accuracy, escalation timeliness, and call quality under HOLD / SCRUB pressure.
Each profile drives different operator personalities, different go-criteria, and different acceptable risk envelopes:
| Profile | Driving Constraint |
|---|---|
| STARLINK | High-cadence commodity launches; tolerate marginal aborts in favor of throughput. |
| NASA_CREW | Human-rated; conservative scrub bias; emphasis on margin and traceability. |
| USSF_CLASSIFIED | Restricted disclosure paths; operator commentary stays inside need-to-know. |
| COMMERCIAL_GEO | High-energy trajectory; tighter thermal / structural margins on ascent. |
| UNIVERSITY_CUBESAT | Low-cost; primary launches; operator persona is leaner, more pedagogical. |
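Profile-driven risk envelopes reduce to plain configuration; the field names and values below are assumptions for illustration, not T-MINUS's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MissionProfile:
    """Illustrative profile knobs (assumed, not the project's schema)."""
    scrub_bias: float           # 0 = throughput-first, 1 = scrub-happy
    thermal_margin_frac: float  # extra margin required on thermal redlines
    disclosure: str             # how freely operators may narrate

PROFILES = {
    "STARLINK":           MissionProfile(0.2, 0.05, "open"),
    "NASA_CREW":          MissionProfile(0.9, 0.20, "open"),
    "USSF_CLASSIFIED":    MissionProfile(0.5, 0.10, "need_to_know"),
    "COMMERCIAL_GEO":     MissionProfile(0.6, 0.15, "open"),
    "UNIVERSITY_CUBESAT": MissionProfile(0.4, 0.08, "open"),
}
```

Because the envelope is data, the same operator code serves every customer; only the persona prompt and these thresholds change between flights.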
Each operator receives (a) a narrow telemetry slice routed by the monitor, (b) a rolling state digest of its own console (open items, pending acknowledgements, flag history), and (c) a customer-specific persona shaping its phrasing and risk tolerance. Its output is a typed call (e.g., NOMINAL, WATCH, HOLD-RECOMMEND, NO-GO) plus a natural-language justification that lands on the Flight Director's panel.
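The typed-call contract can be sketched as follows; `Call`, `OperatorOutput`, and the fail-safe default are illustrative assumptions, not the project's exact types.

```python
from dataclasses import dataclass
from enum import Enum

class Call(Enum):
    NOMINAL = "NOMINAL"
    WATCH = "WATCH"
    HOLD_RECOMMEND = "HOLD-RECOMMEND"
    NO_GO = "NO-GO"

@dataclass(frozen=True)
class OperatorOutput:
    console: str
    call: Call
    justification: str

def parse_operator_output(console: str, raw: dict) -> OperatorOutput:
    """Coerce a model response into the typed schema. A malformed or
    unrecognized call degrades to HOLD-RECOMMEND rather than being
    silently dropped (a fail-safe choice assumed here)."""
    try:
        call = Call(raw["call"])
    except (KeyError, ValueError):
        call = Call.HOLD_RECOMMEND
    return OperatorOutput(console, call, str(raw.get("justification", "")))
```

Typing the call at the boundary means the Flight Director panel never has to parse free-form model text to know where a console stands.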
Hardware failures are emergent, not scripted. Each modeled part has a degradation process — IMU bias walk, GPS-receiver thermal drift, RCS thruster duty-cycle wear, radar-altimeter sea-spike clutter — and the failures interact (e.g., RCS wear shows up first as an attitude-rate error that the IMU's noise floor partially masks). This is the part the operator agents actually have to diagnose; the monitor only knows there's a fault somewhere downstream.
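As one concrete example of such a degradation process, an IMU bias walk can be sketched as a zero-mean Gaussian random walk; the function name and parameters are illustrative.

```python
import random

def imu_bias_walk(steps: int, sigma: float = 1e-4, seed: int = 7) -> list:
    """Gyro bias as a zero-mean Gaussian random walk (rad/s per tick):
    the bias drifts slowly, so a small RCS-induced attitude-rate error
    can hide inside it for a while. Parameters are illustrative."""
    rng = random.Random(seed)
    bias, trace = 0.0, []
    for _ in range(steps):
        bias += rng.gauss(0.0, sigma)
        trace.append(bias)
    return trace
```

Seeding the walk keeps individual runs reproducible while the interaction between processes (bias walk masking thruster wear) stays emergent.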
The Flight Director is the human (or, in unattended runs, a separate scripted operator) that issues HOLD / RESUME / SCRUB / GO-NO-GO. The command loop is the only mechanism that can override an operator's recommendation. This keeps the agents subordinate — they recommend, the FD decides — which is the right shape of authority for a mission-control architecture.
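That authority shape can be sketched as a precedence rule; the enum, state strings, and exact ordering are assumptions for illustration, not the project's actual resolution logic.

```python
from enum import Enum

class FDCommand(Enum):
    GO = "GO"
    HOLD = "HOLD"
    RESUME = "RESUME"
    SCRUB = "SCRUB"

def resolve(fd_command, operator_calls):
    """Precedence sketch: an explicit FD command is authoritative;
    absent one, any open NO-GO / HOLD-RECOMMEND pauses the count."""
    if fd_command is FDCommand.SCRUB:
        return "SCRUBBED"
    if fd_command is FDCommand.HOLD:
        return "HOLDING"
    if fd_command in (FDCommand.RESUME, FDCommand.GO):
        return "COUNTING"   # FD overrides any open recommendation
    if any(c in ("NO-GO", "HOLD-RECOMMEND") for c in operator_calls):
        return "HOLDING"
    return "COUNTING"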
T-MINUS is the project where I most clearly saw where LLMs fit in a high-reliability loop and where they don't. They are excellent at framing — taking a telemetry slice and producing a high-signal, situation-aware sentence. They are bad at owning timing or numerical thresholds. The monitor / LLM split that fell out of the build is the same pattern I'd argue for in any AI-augmented mission-control or production-systems context: deterministic infrastructure decides when and what state; the model decides language and recommendation. Keeping that line clean is how you get an A− campaign instead of a confidently-wrong one.