LLM-Driven Integration · Federal Legacy Air-Gap Capable

Universal Interface

LLM-driven middleware that discovers the I/O of two systems, generates a Python connector, tests itself with auto-synthesized cases, refines until it converges, and routes the LLM provider by data sensitivity — Grok 4.3 for unclassified, Qwen3 via Ollama (local / air-gapped) for restricted / classified.

The Problem

Two numbers explain why this exists:

  • 80% of federal IT spend goes to legacy maintenance (GAO-25-107795).
  • 53% of federal agencies still rely on manual data transfer between systems (Federal News Network, 2026-04). Ten critical federal legacy systems alone cost $337M/year.

The bureaucratic friction this represents is the worst kind of integration work: systems talk past each other through mismatched APIs, paper handoffs, CSV exports, PDF-form-and-fax loops, and screen-scrape RPA. Each pairing is bespoke. Each bespoke connector is fragile. The cost compounds.

Universal Interface is a prototype answer: instead of one fragile bespoke connector per system pair, an LLM-driven harness that builds, tests, and operates them on demand — and, critically, routes the LLM provider by data sensitivity so the same pipeline can do restricted / classified work air-gapped on local hardware without code changes.

What I Built

The system is seven cooperating layers, each independently testable:

  • LLM Router (llm/): Pluggable providers: Grok 4.3 (cloud), Qwen3 via Ollama (offline / air-gapped), Anthropic Claude, and a deterministic Mock. Auto-routes by sensitivity tag (restricted → local-only) and task class.
  • Discovery (discovery/): Introspects REST (OpenAPI / heuristic crawl), CSV / TSV, JSON, PDF AcroForm, HTML forms, SQL databases (any SQLAlchemy dialect), and a screen-scrape stub. Emits a canonical SystemDescriptor.
  • Schema Unifier (schema/): Heuristic + LLM-assisted field mapping, with confidence scoring, type compatibility, and PII tagging baked in.
  • Generator (generator/): The LLM writes a single-file Python module exposing def transform(record) -> dict. Falls back to a deterministic mapping-driven connector if the LLM output is unusable.
  • Sandbox (sandbox/): Static AST policy check (blocks os, subprocess, socket, open, etc.) plus isolated subprocess execution with a timeout.
  • Test Synthesizer + Refiner: Generates pytest-style cases from sample records, schema-derived synthetic data, edge cases, and PII-aware adversarial inputs. Feeds test failures back to the LLM until convergence.
  • Telemetry: SQLite store for sessions, LLM calls (tokens + latency), connector versions, test runs, and executions. An analyzer produces health and cost summaries.

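The Generator's contract is just a plain function. A minimal sketch of what a generated connector module can look like; every field name and mapping below is invented for illustration, since the real module is written per system pair:

```python
# Illustrative shape of a generated connector. The LLM writes one of these
# per system pair; all field names here are hypothetical.
def transform(record: dict) -> dict:
    """Map one source record into the target schema."""
    return {
        "patient_id": str(record.get("PatientID", "")).strip(),
        "dob": record.get("DateOfBirth"),
        "ssn_last4": str(record.get("SSN", ""))[-4:],  # PII-tagged in the mapping
    }
```

Because every connector reduces to this one signature, the sandbox, test synthesizer, and run step can treat all connectors interchangeably.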
Core Principle

Don't write the connector. Write the harness that writes, tests, and refines the connector — and route the LLM by data sensitivity so the same harness handles unclassified and air-gapped classified work without code changes.

The Air-Gap Routing Decision

The provider router is the most consequential design choice in the project. Every federal legacy-system problem worth solving touches data that is at least controlled unclassified information (CUI), and frequently restricted or classified. A cloud LLM is not an option for that bucket — but the same problem still needs the same solution shape, and rewriting the pipeline for each sensitivity tier would defeat the point.

So the router auto-selects:

  • Unclassified → Grok 4.3 (cloud, 256K context, OpenAPI-compatible). Best capability per dollar for the open-data case.
  • Restricted / classified / sensitivity-tagged → Qwen3 via Ollama, a dense Apache-2.0-licensed model running locally. No network egress at inference time. Same prompt frame, same output shape, same downstream sandbox.
  • Cloud unreachable → automatic fallback to the local model. Air-gap doesn't have to be a deployment posture — it can be a graceful degradation mode.

The whole router is a config swap. Drop in a different open-weight model (Qwen3.6-35B-A3B MoE, Llama 3.x, Mistral, Phi-4-mini), or a different cloud provider, and the rest of the pipeline doesn't know it happened.
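The routing rule itself is small enough to read in one screen. A sketch of the selection logic, where the provider identifiers and sensitivity labels are placeholders rather than the project's actual config keys:

```python
# Hypothetical provider identifiers; the real router lives in llm/ and is
# driven by config rather than hard-coded strings.
LOCAL_PROVIDER = "qwen3-ollama"   # no network egress at inference time
CLOUD_PROVIDER = "grok-4.3"

RESTRICTED_TAGS = {"restricted", "classified", "cui"}

def route_provider(sensitivity: str, cloud_available: bool = True) -> str:
    """Pick a provider by data sensitivity, degrading to local on outage."""
    if sensitivity.lower() in RESTRICTED_TAGS:
        return LOCAL_PROVIDER     # restricted data never leaves the box
    if not cloud_available:
        return LOCAL_PROVIDER     # air-gap as graceful degradation
    return CLOUD_PROVIDER
```

The two rules are ordered deliberately: sensitivity wins over availability, so a misdetected cloud outage can never cause restricted data to be routed out.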

Why AST-Policy Sandboxing Matters

Every connector the LLM writes is untrusted Python. The sandbox enforces two layers before any code runs against a real system:

  • Static AST policy check — parse the generated module, walk the AST, and reject anything that imports or calls into a blocked namespace (os, subprocess, socket, open, dynamic exec/eval, etc.).
  • Subprocess isolation — execute in a separate process with a timeout and a constrained environment, so even if the policy check is wrong, the blast radius is bounded.

Both are necessary because both fail differently. AST policy is fast and gives clean rejection messages back to the refiner loop. Subprocess isolation catches things AST analysis can't (e.g., resource exhaustion, runaway loops, anything dynamic).
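A minimal version of the static check is a single pass with ast.walk. The blocked-name sets below are a small subset of the real policy, shown only to make the mechanism concrete:

```python
import ast

# Subset of the policy for illustration; the real sandbox blocks more.
BLOCKED_IMPORTS = {"os", "subprocess", "socket"}
BLOCKED_CALLS = {"open", "exec", "eval", "__import__"}

def check_policy(source: str) -> list:
    """Return policy violations found in a generated module's source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_IMPORTS:
                    violations.append(f"blocked import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in BLOCKED_IMPORTS:
                violations.append(f"blocked import: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"blocked call: {node.func.id}")
    return violations
```

A non-empty violation list is exactly the kind of clean, specific rejection message the refiner loop can hand back to the LLM.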

The Refiner Loop

When tests fail, the failures plus the generated code go back to the LLM with a targeted prompt: here's what you wrote, here's what failed, here's why. The LLM produces a new version, the sandbox re-runs the suite. A configurable cap stops the loop before it can cost real money. Every iteration is versioned in SQLite so you can audit which connector version is in production and what test history it has.
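The loop reduces to a few lines once the LLM calls and the sandboxed test run are abstracted behind callables. In this sketch, generate, run_tests, and repair are stand-ins, not the project's actual function names:

```python
def refine(generate, run_tests, repair, max_iterations: int = 5):
    """Generate -> test -> repair until the suite passes or the cap hits.

    generate()            returns initial connector source
    run_tests(code)       returns a list of failures (empty = converged)
    repair(code, fails)   returns a new version from the targeted prompt
    """
    code = generate()
    for version in range(1, max_iterations + 1):
        failures = run_tests(code)      # sandboxed suite run
        if not failures:
            return code, version        # converged; version goes to SQLite
        code = repair(code, failures)   # code + failures back to the LLM
    raise RuntimeError(f"no convergence after {max_iterations} iterations")
```

The iteration cap is what keeps a pathological system pair from burning tokens indefinitely, and returning the version number is what makes each converged connector auditable.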

CLI Surface

ui doctor       # provider availability + config sanity
ui discover     # introspect a source / target system
ui connect      # full pipeline: discover → unify → generate → test
ui run          # apply a saved connector to new records
ui evaluate     # rerun test suite against a versioned connector
ui telemetry    # health + cost summary from SQLite store
ui demo         # offline-safe demo via Mock provider

What I Learned

Universal Interface is where I learned to treat sensitivity tagging as a first-class architectural concept rather than an afterthought. Once you accept that a record's classification level is a routing primitive — the same way HTTP method or content-type is — the rest of the design falls out: pluggable providers, identical prompt frames across providers, identical sandbox enforcement, and a telemetry layer that tracks both providers under the same schema so cost and quality are comparable. The lesson transfers directly to any AI system that's going to operate inside defense / regulated / financial perimeters.