An extensible platform for testing AI agents against configurable tools, failure conditions, and safety policies — before they reach production. This demo uses a refund workflow as the example.
Each layer has a specific responsibility. Together they form a testable pipeline that separates "the model wants to do X" from "X is actually safe" — regardless of what the tool or workflow is.
A GPT-backed agent receives a task and autonomously decides when and how to call the configured tool. The model controls timing and idempotency key generation — or fails to. In this demo, the tool is issue_refund.
Rule Engine
Deterministic application code sits between the agent and the tool. It enforces amount limits, eligibility windows, and duplicate-prevention rules regardless of what the model requested.
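As a rough illustration of what such a deterministic check could look like, here is a minimal sketch. The names `RefundRequest` and `check_refund`, the 100.00 limit, and the 30-day window are all assumptions made for the example, not the project's actual values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from decimal import Decimal

# Hypothetical policy values; the real limits are not stated in the source.
MAX_REFUND = Decimal("100.00")
ELIGIBILITY_WINDOW = timedelta(days=30)


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount: Decimal
    purchased_at: datetime
    idempotency_key: str


def check_refund(request, already_refunded, now):
    """Return the list of violated rules; an empty list means the call may proceed."""
    violations = []
    if request.amount <= 0 or request.amount > MAX_REFUND:
        violations.append("amount_out_of_range")
    if now - request.purchased_at > ELIGIBILITY_WINDOW:
        violations.append("outside_eligibility_window")
    if request.order_id in already_refunded:
        violations.append("duplicate_refund")
    return violations
```

The function never consults the model: it sees only the structured request and the current state, so the same input always yields the same verdict.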
Fault Injection
The platform simulates real-world conditions: transient timeouts, partial failures where the operation succeeded but the response was lost, and network errors. Agents must handle these gracefully.
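One way to model two of these conditions is a wrapper around the tool, sketched below. `FaultInjector`, `ToolTimeout`, and the rate parameters are hypothetical names for illustration. The sketch covers only the plain timeout and the lost-response case, and uses a seeded random generator on the assumption that runs should be reproducible.

```python
import random


class ToolTimeout(Exception):
    """Raised when the caller never receives a response."""


class FaultInjector:
    """Wraps a tool and injects failures from a seeded RNG so runs are reproducible."""

    def __init__(self, tool, seed, timeout_rate=0.0, lost_response_rate=0.0):
        self._tool = tool
        self._rng = random.Random(seed)
        self._timeout_rate = timeout_rate
        self._lost_response_rate = lost_response_rate

    def call(self, **kwargs):
        roll = self._rng.random()
        if roll < self._timeout_rate:
            # Transient timeout: the tool is never invoked.
            raise ToolTimeout("timed out before execution")
        result = self._tool(**kwargs)
        if roll < self._timeout_rate + self._lost_response_rate:
            # Partial failure: the side effect happened, but the response is dropped.
            raise ToolTimeout("response lost after execution")
        return result
```

Both failure modes raise the same exception on purpose: from the caller's side they are indistinguishable, which is exactly what makes blind retries dangerous.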
Idempotency Keys
A timeout doesn't prove failure: the operation may have already completed. Safe agents supply a stable idempotency key so retries return the original result instead of repeating the side effect.
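The retry behavior described here can be sketched with an in-memory stand-in for the tool. `RefundService` and `stable_key` are hypothetical names, and deriving the key from a task ID plus an order ID is only one possible scheme, assumed for the example.

```python
class RefundService:
    """In-memory stand-in for the refund tool; stores one result per idempotency key."""

    def __init__(self):
        self._results = {}
        self.refunds_issued = 0

    def issue_refund(self, order_id, amount, idempotency_key):
        if idempotency_key in self._results:
            # Replay: return the original result without repeating the side effect.
            return self._results[idempotency_key]
        self.refunds_issued += 1
        result = {
            "refund_id": f"rf_{self.refunds_issued}",
            "order_id": order_id,
            "amount": amount,
        }
        self._results[idempotency_key] = result
        return result


def stable_key(task_id, order_id):
    """Derive the key from the task, not from the attempt, so every retry reuses it."""
    return f"{task_id}:{order_id}"
```

An agent that generates a fresh key on each attempt defeats this protection: every retry then looks like a new operation to the service.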
Invariant Checking
After each run, invariants inspect actual system state, not the model's claims. They detect duplicate operations, excess totals, and policy violations. The verdict comes from code, never from the model's self-report.
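A minimal sketch of such a post-run check follows. The ledger shape (a list of `(order_id, amount_cents)` tuples) and the rule that an order may be refunded at most once are assumptions for the example; `check_invariants` is a hypothetical name.

```python
from collections import Counter, defaultdict


def check_invariants(ledger, order_totals):
    """Inspect recorded refunds (system state) and return violation strings.

    ledger: list of (order_id, amount_cents) tuples actually written by the tool.
    order_totals: dict of order_id -> amount_cents originally paid.
    """
    violations = []
    counts = Counter(order_id for order_id, _ in ledger)
    refunded = defaultdict(int)
    for order_id, amount in ledger:
        refunded[order_id] += amount
    for order_id in sorted(counts):
        if counts[order_id] > 1:
            violations.append(f"duplicate_refund:{order_id}")
        if refunded[order_id] > order_totals.get(order_id, 0):
            violations.append(f"excess_total:{order_id}")
    return violations
```

Nothing the agent says about its own run enters the function; only the ledger and the order records do.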
Observability
Every model request, tool call, failure injection, and invariant check produces a sequenced, typed event. Traces are stored in PostgreSQL for replay, diffing, and regression detection across agent versions.
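To make "sequenced, typed event" and "diffing" concrete, here is an in-memory sketch. It leaves out PostgreSQL storage entirely, and `EventType`, `Event`, `Trace`, and `first_divergence` are hypothetical names rather than the project's schema.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import count


class EventType(Enum):
    MODEL_REQUEST = "model_request"
    TOOL_CALL = "tool_call"
    FAULT_INJECTED = "fault_injected"
    INVARIANT_CHECK = "invariant_check"


@dataclass(frozen=True)
class Event:
    seq: int
    type: EventType
    payload: dict


class Trace:
    """Append-only list of events with monotonically increasing sequence numbers."""

    def __init__(self):
        self._seq = count(1)
        self.events = []

    def record(self, event_type, payload):
        event = Event(next(self._seq), event_type, payload)
        self.events.append(event)
        return event


def first_divergence(a, b):
    """Return the sequence number where two traces first differ, or None if identical."""
    for left, right in zip(a.events, b.events):
        if (left.type, left.payload) != (right.type, right.payload):
            return left.seq
    if len(a.events) != len(b.events):
        return min(len(a.events), len(b.events)) + 1
    return None
```

Comparing a baseline trace against a trace from a newer agent version with `first_divergence` points at the first step where behavior changed, which is the starting point for regression detection.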
The full CI/CD pipeline from commit to live container, with zero EC2 servers managed.
Auth: GitHub OIDC → temporary AWS credentials via STS — no long-lived secrets stored anywhere. Infrastructure provisioned with Terraform. Logs shipped to CloudWatch.
Each technology earns its place — this isn't a demo with cloud logos slapped on.
Software Engineer at Electroimpact (aerospace manufacturing automation) and Computer Science graduate of the University of Washington Seattle. I built Agent Flight Simulator to explore the infrastructure side of AI reliability — specifically: how do you make agentic systems safe when you can't trust what the model reports about itself?
This project reflects my interest in AI infrastructure, distributed systems, and the operational concerns that come with putting LLMs in production loops.