LLM-driven middleware that discovers the I/O of two systems, generates a Python connector, tests it with auto-synthesized cases, refines it until it converges, and routes the LLM provider by data sensitivity — Grok 4.3 for unclassified, Qwen3 via Ollama (local / air-gapped) for restricted / classified.
Two numbers explain why this exists:
The bureaucratic friction this represents is the worst kind of integration work: systems talk past each other through mismatched APIs, paper handoffs, CSV exports, PDF-form-and-fax loops, and screen-scrape RPA. Each pairing is bespoke. Each bespoke connector is fragile. The cost compounds.
Universal Interface is a prototype answer: instead of one fragile bespoke connector per system pair, an LLM-driven harness that builds, tests, and operates them on demand — and, critically, routes the LLM provider by data sensitivity so the same pipeline can do restricted / classified work air-gapped on local hardware without code changes.
The system is seven cooperating layers, each independently testable:
| Layer | Role |
|---|---|
| LLM Router (llm/) | Pluggable providers: Grok 4.3 (cloud), Qwen3 via Ollama (offline / air-gapped), Anthropic Claude, deterministic Mock. Auto-routes by sensitivity tag (restricted → local-only) and task class. |
| Discovery (discovery/) | Introspects REST (OpenAPI / heuristic crawl), CSV / TSV, JSON, PDF AcroForm, HTML form, SQL DB (any SQLAlchemy dialect), screen-scrape stub. Emits a canonical SystemDescriptor. |
| Schema Unifier (schema/) | Heuristic + LLM-assisted field mapping. Confidence scoring, type compatibility, PII tagging baked in. |
| Generator (generator/) | LLM writes a single-file Python module exposing def transform(record) -> dict. Falls back to a deterministic mapping-driven connector if LLM output is unusable. |
| Sandbox (sandbox/) | Static AST policy check (blocks os, subprocess, socket, open, etc.) + isolated subprocess execution with timeout. |
| Test Synthesizer + Refiner | Generates pytest-style cases from sample records, schema-derived synthetic data, edge cases, and PII-aware adversarial inputs. Feeds test failures back to the LLM until convergence. |
| Telemetry | SQLite store: sessions, LLM calls (tokens + latency), connector versions, test runs, executions. Analyzer produces health and cost summaries. |
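To make the Generator's contract concrete, here is a minimal sketch of the kind of single-file connector it targets. Only the `def transform(record) -> dict` entry point comes from the design above; the source fields, target fields, and normalization rules below are hypothetical.

```python
# Hypothetical example of a generated connector module.
# Only the transform(record) -> dict contract comes from the design;
# the specific source/target fields and formats are illustrative.
from datetime import datetime


def transform(record: dict) -> dict:
    """Map one source record (e.g., a legacy CSV row) to the target schema."""
    return {
        "case_id": str(record["CASE_NUM"]).strip(),
        "applicant_name": f'{record.get("FIRST_NM", "").strip()} {record.get("LAST_NM", "").strip()}'.strip(),
        # Legacy system stores MM/DD/YYYY; target expects ISO 8601.
        "filed_date": datetime.strptime(record["FILE_DT"], "%m/%d/%Y").date().isoformat(),
        # Y/N flag in the source becomes a boolean in the target.
        "is_priority": record.get("PRIORITY_FLG", "N").upper() == "Y",
    }
```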
Don't write the connector. Write the harness that writes, tests, and refines the connector — and route the LLM by data sensitivity so the same harness handles unclassified and air-gapped classified work without code changes.
The provider router is the most consequential design choice in the project. Every federal legacy-system problem worth solving touches data that is at least controlled unclassified information (CUI), and frequently restricted or classified. A cloud LLM is not an option for that bucket — but the same problem still needs the same solution shape, and rewriting the pipeline for each sensitivity tier would defeat the point.
So the router auto-selects by sensitivity tag: Grok 4.3 (cloud) for unclassified work, Qwen3 via Ollama on local hardware for anything tagged restricted or classified, and the deterministic Mock provider for offline demos and tests.
The whole router is a config swap. Drop in a different open-weight model (Qwen3.6-35B-A3B MoE, Llama 3.x, Mistral, Phi-4-mini), or a different cloud provider, and the rest of the pipeline doesn't know it happened.
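A minimal sketch of that routing decision, assuming a plain-string sensitivity tag and a small provider registry; the class and field names here are illustrative, not the project's actual llm/ API:

```python
# Illustrative sketch of sensitivity-based provider routing.
# Class names, tags, and the provider registry are assumptions,
# not the project's actual internal API.
from dataclasses import dataclass

LOCAL_ONLY_TAGS = {"restricted", "classified"}


@dataclass
class ProviderConfig:
    name: str    # e.g. "grok-4.3", "qwen3-ollama", "mock"
    local: bool  # True if the model runs on local / air-gapped hardware


PROVIDERS = {
    "cloud": ProviderConfig(name="grok-4.3", local=False),
    "local": ProviderConfig(name="qwen3-ollama", local=True),
    "mock": ProviderConfig(name="mock", local=True),
}


def route(sensitivity: str, offline_demo: bool = False) -> ProviderConfig:
    """Pick a provider: restricted/classified data never leaves local hardware."""
    if offline_demo:
        return PROVIDERS["mock"]
    if sensitivity.lower() in LOCAL_ONLY_TAGS:
        return PROVIDERS["local"]
    return PROVIDERS["cloud"]
```

Because the decision is data-driven, swapping Qwen3 for another open-weight model is a registry edit, which is the config-swap property claimed above.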
Every connector the LLM writes is untrusted Python. The sandbox enforces two layers before any code runs against a real system:
1. A static AST policy check that rejects disallowed imports and calls (os, subprocess, socket, open, dynamic exec/eval, etc.).
2. Isolated subprocess execution with a hard timeout.

Both are necessary because both fail differently. The AST policy is fast and gives clean rejection messages back to the refiner loop. Subprocess isolation catches things AST analysis can't (e.g., resource exhaustion, runaway loops, anything dynamic).
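A rough sketch of the static half of that enforcement, using Python's ast module; the blocklists and message format are assumptions, and the project's actual policy is likely broader:

```python
# Rough sketch of a static AST policy check over generated connector code.
# The blocklists and violation format are assumptions; the project's actual
# policy may differ.
import ast

BLOCKED_IMPORTS = {"os", "subprocess", "socket", "sys", "shutil"}
BLOCKED_CALLS = {"open", "exec", "eval", "__import__", "compile"}


def check_policy(source: str) -> list[str]:
    """Return human-readable violations; an empty list means the code passes."""
    violations = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # Collect the imported module names for either import form.
            names = [a.name for a in node.names] if isinstance(node, ast.Import) else [node.module or ""]
            for name in names:
                if name.split(".")[0] in BLOCKED_IMPORTS:
                    violations.append(f"line {node.lineno}: import of blocked module '{name}'")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"line {node.lineno}: call to blocked builtin '{node.func.id}'")
    return violations
```

Line-referenced violations like these are exactly the kind of clean rejection message that can be fed straight back to the refiner.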
When tests fail, the failures plus the generated code go back to the LLM with a targeted prompt: here's what you wrote, here's what failed, here's why. The LLM produces a new version, the sandbox re-runs the suite. A configurable cap stops the loop before it can cost real money. Every iteration is versioned in SQLite so you can audit which connector version is in production and what test history it has.
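In outline, that loop looks something like the following; the llm, sandbox, and store objects and their method names are placeholders for illustration, not the project's real interfaces:

```python
# Outline of the generate → test → refine loop. The llm, sandbox, and store
# objects and their method names are illustrative placeholders.
def refine_connector(llm, sandbox, tests, store, max_iterations: int = 5) -> str:
    """Return connector source that passes the suite, or raise once the cap is hit."""
    code = llm.generate_connector()                      # first draft from the mapping prompt
    for version in range(1, max_iterations + 1):
        violations = sandbox.check_policy(code)          # static AST gate (fast, clean messages)
        failures = [] if violations else sandbox.run_tests(code, tests)
        store.record_version(code, version, violations, failures)  # audit trail in SQLite
        if not violations and not failures:
            return code                                  # converged: policy clean, all tests pass
        # Targeted feedback: what was written, what failed, and why.
        code = llm.refine(code, feedback=violations + failures)
    raise RuntimeError(f"no convergence after {max_iterations} iterations")
```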
```
ui doctor     # provider availability + config sanity
ui discover   # introspect a source / target system
ui connect    # full pipeline: discover → unify → generate → test
ui run        # apply a saved connector to new records
ui evaluate   # rerun test suite against a versioned connector
ui telemetry  # health + cost summary from SQLite store
ui demo       # offline-safe demo via Mock provider
```
Universal Interface is where I learned to treat sensitivity tagging as a first-class architectural concept rather than an afterthought. Once you accept that a record's classification level is a routing primitive — the same way HTTP method or content-type is — the rest of the design falls out: pluggable providers, identical prompt frames across providers, identical sandbox enforcement, and a telemetry layer that tracks both providers under the same schema so cost and quality are comparable. The lesson transfers directly to any AI system that's going to operate inside defense / regulated / financial perimeters.