Go · AWS · AI Infrastructure

Test autonomous agents
before they go wrong.

An extensible platform for testing AI agents against configurable tools, failure conditions, and safety policies — before they reach production. This demo uses a refund workflow as the example.

agent-flight-simulator/cmd/simulate

Five layers that keep AI agents honest

Each layer has a specific responsibility. Together they form a testable pipeline that separates "the model wants to do X" from "X is actually safe" — regardless of what the tool or workflow is.

🤖

LLM Agent

A GPT-backed agent receives a task and autonomously decides when and how to call the configured tool. The model controls call timing and is responsible for generating the idempotency key, a step it can get wrong or skip entirely. In this demo, the tool is issue_refund.

OpenAI Responses API
🛡️

Policy Guard

Deterministic application code sits between the agent and the tool. It enforces amount limits, eligibility windows, and duplicate-prevention rules regardless of what the model requested.

Rule Engine
💥

Failure Injection

The platform simulates real-world conditions: transient timeouts, partial failures where the operation succeeded but the response was lost, and network errors. Agents must handle these gracefully.

Fault Injection
🔑

Idempotency

A timeout doesn't prove failure — the operation may have already completed. Safe agents supply a stable idempotency key so retries return the original result instead of repeating the side effect.

Idempotency Keys
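
The retry behavior described above can be sketched as an in-memory ledger that performs the side effect at most once per key and replays the stored result afterwards. Ledger, Refund, and IssueRefund are names assumed for this example; a real implementation would persist keys in a database rather than a map.

```go
package main

import (
	"fmt"
	"sync"
)

// Refund is a hypothetical record of one completed side effect.
type Refund struct {
	ID          string
	OrderID     string
	AmountCents int64
}

// Ledger stores refunds and remembers which idempotency key produced each one.
type Ledger struct {
	mu      sync.Mutex
	byKey   map[string]Refund
	refunds []Refund
}

func NewLedger() *Ledger {
	return &Ledger{byKey: make(map[string]Refund)}
}

// IssueRefund creates a refund the first time a key is seen. A repeated
// key returns the original refund and replayed=true, with no new side effect.
func (l *Ledger) IssueRefund(key, orderID string, amountCents int64) (refund Refund, replayed bool) {
	l.mu.Lock()
	defer l.mu.Unlock()

	if prev, ok := l.byKey[key]; ok {
		return prev, true
	}
	r := Refund{
		ID:          fmt.Sprintf("rf_%d", len(l.refunds)+1),
		OrderID:     orderID,
		AmountCents: amountCents,
	}
	l.byKey[key] = r
	l.refunds = append(l.refunds, r)
	return r, false
}

// Count reports how many refunds were actually issued.
func (l *Ledger) Count() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.refunds)
}
```

An agent that retries after a timeout with the same key gets the first refund back; an agent that invents a fresh key on each retry issues a second refund, which is exactly the failure the simulator is meant to surface.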

Deterministic Invariants

After each run, invariants inspect actual system state — not the model's claims. They detect duplicate operations, excess totals, and policy violations. The verdict comes from code, never from the model's self-report.

Invariant Checking
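
To make the idea concrete, here is a small sketch of a post-run check that looks only at recorded refunds and order totals. The invariant names, the Violation type, and the CheckInvariants function are assumptions for illustration, not the project's actual checks.

```go
package main

import "fmt"

// Refund is a hypothetical record of a refund found in system state.
type Refund struct {
	OrderID     string
	AmountCents int64
}

// Violation describes one failed invariant.
type Violation struct {
	Invariant string
	Detail    string
}

// CheckInvariants inspects recorded refunds, not anything the model said.
// It flags orders refunded more than once and orders refunded for more
// than their total. Orders missing from orderTotals are treated as 0.
func CheckInvariants(refunds []Refund, orderTotals map[string]int64) []Violation {
	count := make(map[string]int)
	refunded := make(map[string]int64)
	var seen []string // order IDs in first-seen order, for stable output

	for _, r := range refunds {
		if count[r.OrderID] == 0 {
			seen = append(seen, r.OrderID)
		}
		count[r.OrderID]++
		refunded[r.OrderID] += r.AmountCents
	}

	var violations []Violation
	for _, id := range seen {
		if count[id] > 1 {
			violations = append(violations, Violation{
				Invariant: "no_duplicate_refunds",
				Detail:    fmt.Sprintf("order %s refunded %d times", id, count[id]),
			})
		}
		if refunded[id] > orderTotals[id] {
			violations = append(violations, Violation{
				Invariant: "refund_within_order_total",
				Detail:    fmt.Sprintf("order %s refunded %d of %d cents", id, refunded[id], orderTotals[id]),
			})
		}
	}
	return violations
}
```

An empty result means the run passed; any entry fails the run regardless of how the agent summarized its own actions.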
📋

Structured Traces

Every model request, tool call, failure injection, and invariant check produces a sequenced, typed event. Traces are stored in PostgreSQL for replay, diffing, and regression detection across agent versions.

Observability
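
A sequenced, typed event might look something like the sketch below. The four event type names, the Event fields, and the in-memory Recorder are assumptions made for this example; the project's own event types and its PostgreSQL schema are not shown here.

```go
package main

import (
	"encoding/json"
	"time"
)

// EventType names the kind of trace event. Only a few illustrative
// values are listed; they are not the project's actual type names.
type EventType string

const (
	EventModelRequest   EventType = "model_request"
	EventToolCall       EventType = "tool_call"
	EventFaultInjected  EventType = "fault_injected"
	EventInvariantCheck EventType = "invariant_check"
)

// Event is one entry in a run's trace. Seq gives a total order within
// the run, so traces can be replayed or diffed step by step.
type Event struct {
	RunID   string          `json:"run_id"`
	Seq     int             `json:"seq"`
	Type    EventType       `json:"type"`
	At      time.Time       `json:"at"`
	Payload json.RawMessage `json:"payload"`
}

// Recorder collects events in memory; a real store would write rows instead.
type Recorder struct {
	runID  string
	events []Event
}

func NewRecorder(runID string) *Recorder {
	return &Recorder{runID: runID}
}

// Record appends an event with the next sequence number and a JSON payload.
func (r *Recorder) Record(t EventType, payload any) (Event, error) {
	raw, err := json.Marshal(payload)
	if err != nil {
		return Event{}, err
	}
	ev := Event{
		RunID:   r.runID,
		Seq:     len(r.events) + 1,
		Type:    t,
		At:      time.Now().UTC(),
		Payload: raw,
	}
	r.events = append(r.events, ev)
	return ev, nil
}

// Events returns the trace recorded so far, in sequence order.
func (r *Recorder) Events() []Event {
	return r.events
}
```

Keeping the payload as raw JSON lets each event type carry its own fields while the envelope stays uniform enough to compare across agent versions.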

Production-grade infrastructure

The full CI/CD pipeline, from commit to live container, with no EC2 instances to manage.

📝 Commit · GitHub
🔬 Test · GitHub Actions
📦 Build · Docker
🗄️ Push · Amazon ECR
🚀 Deploy · ECS Fargate
🌐 Route · ALB
🗃️ Persist · RDS Postgres

Auth: GitHub OIDC → temporary AWS credentials via STS, so no long-lived AWS keys are stored in CI. Infrastructure is provisioned with Terraform, and logs are shipped to CloudWatch.

What's under the hood

Each technology earns its place — this isn't a demo with cloud logos slapped on.

Go · Application & simulation logic
PostgreSQL · Run and trace persistence
OpenAI API · Live LLM agent (function calling)
Docker · Reproducible container packaging
Amazon ECR · Versioned image registry
ECS Fargate · Serverless container runtime
RDS · Managed Postgres on AWS
ALB · Public entry point & health checks
Terraform · Infrastructure as code
GitHub Actions · CI/CD pipeline automation
OIDC · Temporary AWS credentials for CI
CloudWatch · Centralized log storage

Devin Smith

Software Engineer at Electroimpact (aerospace manufacturing automation) and Computer Science graduate of the University of Washington, Seattle. I built Agent Flight Simulator to explore the infrastructure side of AI reliability, and one question in particular: how do you make agentic systems safe when you can't trust what the model reports about itself?

This project reflects my interest in AI infrastructure, distributed systems, and the operational concerns that come with putting LLMs in production loops.

5 · Core safety layers
9 · AWS services
13 · Trace event types
0 · Long-lived secrets in CI