Agent Flight Simulator

How it works

Five layers that keep AI agents honest

Each layer has a specific responsibility. Together they form a testable pipeline that separates "the model wants to do X" from "X is actually safe" — regardless of what the tool or workflow is.

🤖

LLM Agent

A GPT-backed agent receives a task and autonomously decides when and how to call the configured tool. The model controls timing and idempotency key generation — or fails to. In this demo, the tool is issue_refund.

OpenAI Responses API
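
As a rough illustration, the tool could be described to the model with a function-tool definition like the one below, using the flat shape (type, name, description, parameters) that function tools take in the Responses API. The parameter names order_id, amount_cents, and idempotency_key are assumptions for this sketch, as is the missing_arguments helper; neither is taken from the project's actual schema.

```python
# Hypothetical function-tool definition for issue_refund.
# The property names below are assumptions made for illustration.
ISSUE_REFUND_TOOL = {
    "type": "function",
    "name": "issue_refund",
    "description": "Refund an order. Reuse the same idempotency_key when retrying.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "amount_cents": {"type": "integer"},
            "idempotency_key": {"type": "string"},
        },
        "required": ["order_id", "amount_cents", "idempotency_key"],
        "additionalProperties": False,
    },
}


def missing_arguments(arguments: dict) -> list:
    """Return the required fields the model left out of a tool call."""
    required = ISSUE_REFUND_TOOL["parameters"]["required"]
    return [name for name in required if name not in arguments]
```

A call that arrives without an idempotency key is exactly the "or fails to" case: missing_arguments({"order_id": "o1", "amount_cents": 500}) reports the omission before anything executes.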

🛡️

Policy Guard

Deterministic application code sits between the agent and the tool. It enforces amount limits, eligibility windows, and duplicate-prevention rules regardless of what the model requested.

Rule Engine
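
A minimal sketch of what such a guard might look like, assuming a refund workflow. The Order and RefundRequest types, the 30-day window, and the per-refund cap are all invented for this example and are not the project's real rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_REFUND_CENTS = 50_000            # assumed per-refund cap
REFUND_WINDOW = timedelta(days=30)   # assumed eligibility window


@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int
    purchased_at: datetime
    refunded_cents: int = 0


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int
    requested_at: datetime


def check_policy(req: RefundRequest, order: Order) -> list:
    """Return the names of violated rules; an empty list means allowed."""
    violations = []
    if req.amount_cents <= 0 or req.amount_cents > MAX_REFUND_CENTS:
        violations.append("amount_limit")
    if req.requested_at - order.purchased_at > REFUND_WINDOW:
        violations.append("eligibility_window")
    if order.refunded_cents + req.amount_cents > order.total_cents:
        violations.append("duplicate_or_excess_refund")
    return violations
```

The guard never looks at the model's reasoning, only at the requested arguments and the stored order, so the same request always gets the same answer.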

💥

Failure Injection

The platform simulates real-world conditions: transient timeouts, partial failures where the operation succeeded but the response was lost, and network errors. Agents must handle these gracefully.

Fault Injection
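
One way to simulate these conditions is a wrapper that decides, per call, whether to fail before or after the real tool runs. This is a sketch under assumed names (with_faults, ToolTimeout) and assumed probabilities, not the platform's actual injector.

```python
import random


class ToolTimeout(Exception):
    """Raised when the caller never receives a response."""


def with_faults(tool, rng, p_fail_before=0.2, p_lost_response=0.2):
    """Wrap `tool` so a call may time out before or after the side effect."""

    def wrapped(*args, **kwargs):
        roll = rng.random()
        if roll < p_fail_before:
            # Transient failure: the operation never ran.
            raise ToolTimeout("timeout before execution")
        result = tool(*args, **kwargs)
        if roll < p_fail_before + p_lost_response:
            # Partial failure: the side effect happened, the response was lost.
            raise ToolTimeout("timeout after execution")
        return result

    return wrapped


# Seeding the generator makes a simulated run repeatable.
flaky_tool = with_faults(lambda amount: {"refunded": amount}, random.Random(7))
```

From the caller's side both failure modes raise the same exception, which is the point: the agent cannot tell from the error alone whether the operation ran.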

🔑

Idempotency

A timeout doesn't prove failure — the operation may have already completed. Safe agents supply a stable idempotency key so retries return the original result instead of repeating the side effect.

Idempotency Keys
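
The replay behaviour can be sketched with an in-memory stand-in for the tool backend. RefundService and its fields are hypothetical names for this example; a real backend would persist the key-to-result mapping.

```python
class RefundService:
    """In-memory stand-in for a tool backend that honours idempotency keys."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first call
        self.ledger = []    # every refund that actually executed

    def issue_refund(self, order_id, amount_cents, idempotency_key):
        if idempotency_key in self._results:
            # Retry of a call that already completed: replay, do not repeat.
            return self._results[idempotency_key]
        self.ledger.append((order_id, amount_cents))
        result = {
            "refund_id": f"rf_{len(self.ledger)}",
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result
```

An agent that retries after a timeout with the same key gets the original result and the ledger still holds one entry. An agent that generates a fresh key on each retry produces a second refund.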

✅

Deterministic Invariants

After each run, invariants inspect actual system state — not the model's claims. They detect duplicate operations, excess totals, and policy violations. The verdict comes from code, never from the model's self-report.

Invariant Checking
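
A simple invariant pass over the executed operations might look like the following. The ledger shape and the rule that an order may be refunded at most once are assumptions for this sketch; a workflow that allows partial refunds would need a different duplicate rule.

```python
from collections import defaultdict


def check_invariants(ledger, order_totals):
    """Inspect executed refunds and return a list of violation strings.

    ledger: list of (order_id, amount_cents) refunds that actually executed.
    order_totals: dict mapping order_id to the amount originally charged.
    """
    refunded = defaultdict(int)
    counts = defaultdict(int)
    for order_id, amount_cents in ledger:
        refunded[order_id] += amount_cents
        counts[order_id] += 1

    violations = []
    for order_id in sorted(refunded):
        if counts[order_id] > 1:
            violations.append(f"duplicate_refund:{order_id}")
        if refunded[order_id] > order_totals.get(order_id, 0):
            violations.append(f"excess_total:{order_id}")
    return violations
```

Nothing here reads the model's final message. A run passes only if the recorded state is clean.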

📋

Structured Traces

Every model request, tool call, failure injection, and invariant check produces a sequenced, typed event. Traces are stored in PostgreSQL for replay, diffing, and regression detection across agent versions.

Observability
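
To make "sequenced, typed event" concrete, here is a small in-memory sketch. The four event types mirror the sources named above, but the identifiers (EventType, TraceEvent, Trace, first_divergence) are assumptions for illustration, and PostgreSQL storage is left out.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    MODEL_REQUEST = "model_request"
    TOOL_CALL = "tool_call"
    FAILURE_INJECTED = "failure_injected"
    INVARIANT_CHECK = "invariant_check"


@dataclass(frozen=True)
class TraceEvent:
    seq: int          # 1-based position within the run
    type: EventType
    payload: dict


class Trace:
    def __init__(self):
        self.events = []

    def record(self, type, payload):
        event = TraceEvent(len(self.events) + 1, type, payload)
        self.events.append(event)
        return event


def first_divergence(a, b):
    """Return the seq of the first event where two traces differ, else None."""
    for ea, eb in zip(a.events, b.events):
        if (ea.type, ea.payload) != (eb.type, eb.payload):
            return ea.seq
    if len(a.events) != len(b.events):
        # One trace is a strict prefix of the other.
        return min(len(a.events), len(b.events)) + 1
    return None
```

Comparing the trace of a new agent version against a known-good run with first_divergence gives the earliest step at which behaviour changed, which is a starting point for regression checks.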

Deployment

Production-grade infrastructure

The full CI/CD pipeline, from commit to live container, with no EC2 instances to manage.

📝

Commit

GitHub

🔬

Test

GitHub Actions

📦

Build

Docker

🗄️

Push

Amazon ECR

🚀

Deploy

ECS Fargate

🌐

Route

ALB

🗃️

Persist

RDS Postgres

Auth: GitHub Actions authenticates to AWS through GitHub OIDC and receives temporary credentials from STS, so no long-lived AWS secrets are stored in CI. Infrastructure is provisioned with Terraform. Logs are shipped to CloudWatch.

Tech stack

What's under the hood

Each technology earns its place — this isn't a demo with cloud logos slapped on.

Application & simulation logic

PostgreSQL

Run and trace persistence

OpenAI API

Live LLM agent (function calling)

Docker

Reproducible container packaging

Amazon ECR

Versioned image registry

ECS Fargate

Serverless container runtime

RDS

Managed Postgres on AWS

ALB

Public entry point & health checks

Terraform

Infrastructure as code

GitHub Actions

CI/CD pipeline automation

OIDC

Temporary AWS credentials for CI

CloudWatch

Centralized log storage

Builder

Devin Smith

Software Engineer at Electroimpact (aerospace manufacturing automation) and Computer Science graduate of the University of Washington, Seattle. I built Agent Flight Simulator to explore the infrastructure side of AI reliability, specifically: how do you make agentic systems safe when you can't trust what the model reports about itself?

This project reflects my interest in AI infrastructure, distributed systems, and the operational concerns that come with putting LLMs in production loops.

⬡ github.com/drsmith5/agent-flight-simulator ⬡ github.com/drsmith5 ⬡ linkedin.com/in/devin-smith

Test autonomous agents before they go wrong.