AI test output can burn more context than the code your agent just wrote.
When an AI coding agent runs a large test suite, the expensive part can be the raw terminal output that gets sent back into the model. A 5,000-line log can become about 100,000 input tokens, while a structured test summary can carry the same decision signal in a few hundred tokens.
AI test output is eating the context window
Running tests is the right habit. The problem starts after the command finishes.
If your agent runs a suite that prints 5,000 lines of output and the CLI sends all of it back to the model, the model has to read the noise before it can decide what happened. At an average of 80 characters per line, that is about 400,000 characters. OpenAI’s token guidance says one English token is roughly four characters, so the raw test output is near 100,000 tokens.
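The arithmetic above can be checked with a minimal sketch. The function name `estimate_tokens` and its defaults are illustrative, and the four-characters-per-token figure is a rule of thumb rather than a real tokenizer.

```python
def estimate_tokens(lines: int, avg_chars_per_line: int = 80,
                    chars_per_token: float = 4.0) -> int:
    """Rough token estimate from line count, using the ~4 chars/token heuristic."""
    total_chars = lines * avg_chars_per_line
    return round(total_chars / chars_per_token)


# 5,000 lines at 80 characters each, as in the example above.
raw_log_tokens = estimate_tokens(5000)
print(raw_log_tokens)
```

Actual counts depend on the tokenizer and on how much of the log is paths, hashes, or stack frames, so treat the result as an order-of-magnitude estimate.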
That can be more than the code change, the instructions, and the agent’s own reasoning combined.
The cost is not only pricing. Context is working memory. Once verbose logs enter the session, they compete with the code, the spec, the conventions, and the failure the agent needs to fix.
Why passing tests still cost tokens
Passing test output is weak evidence for the next decision.
A passing test line says one thing: this check did not fail. Multiply that by thousands and the agent receives a wall of confirmations it does not need. The useful signal for the next action is smaller: total tests, passed count, failed count, skipped count, duration, and the first failing assertion or runtime error.
Claude Code’s documentation is direct about this class of problem: command output enters the context window, and verbose output can consume thousands of tokens in a single turn. That includes test results, logs, and error messages.
| Test result | Raw output tells the agent | Compact result tells the agent |
|---|---|---|
| All pass | Thousands of pass confirmations | `passed=5000`, `failed=0`, `errored=0` |
| One failure | Every pass plus the failure | Summary plus failing test id and stack slice |
| Parser issue | Mixed logs and broken formatting | Parse warning plus raw excerpt request |
| Rerun | Same wall repeated again | Delta from previous result |
This is the hidden tax. A good workflow validates the code. A noisy workflow makes the validation evidence too large to reason about.
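The "Rerun" row in the table can be sketched as a set comparison between two runs. This is an illustration only: `rerun_delta` and the test ids are hypothetical names, not part of any runner's API.

```python
def rerun_delta(previous_failed, current_failed):
    """Compare failing test ids between two runs and report only what changed."""
    prev, curr = set(previous_failed), set(current_failed)
    return {
        "newly_failing": sorted(curr - prev),
        "fixed": sorted(prev - curr),
        "still_failing": sorted(curr & prev),
    }


# Hypothetical test ids from two consecutive runs.
delta = rerun_delta(
    previous_failed=["test_cart::test_total", "test_auth::test_login"],
    current_failed=["test_cart::test_total", "test_api::test_timeout"],
)
print(delta)
```

The agent then reads three short lists instead of the same wall of output a second time.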
What the agent should see instead
The agent does not need the whole room. It needs the evidence.
For a clean run, the ideal payload is small enough to scan in one glance. For a failing run, the payload should expand only around the failure. That means the default result is a summary, and raw output becomes an escalation path, not the first thing sent into the model.
A 10-line JSON summary might be 600 to 900 characters, depending on the runner and failure detail. Using the same token rule, that is roughly 150 to 225 tokens. Compared with a 100,000-token raw log, the agent can preserve more context for source files, requirements, and the actual fix.
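A minimal sketch of such a payload follows, assuming each test result is already available as a dictionary with `id`, `status`, and optional `message` and `stack` keys. Those key names, `build_payload`, and its limits are assumptions for illustration, not a runner's real output format.

```python
import json


def build_payload(results, max_failures=3, stack_lines=5):
    """Build a compact summary that expands only around failures.

    Assumes every result has a status of 'passed', 'failed', or 'skipped'.
    """
    counts = {"total": len(results), "passed": 0, "failed": 0, "skipped": 0}
    failures = []
    for result in results:
        counts[result["status"]] += 1
        if result["status"] == "failed" and len(failures) < max_failures:
            failures.append({
                "id": result["id"],
                "message": result.get("message", ""),
                # Keep only a slice of the stack, not the full trace.
                "stack": result.get("stack", "").splitlines()[:stack_lines],
            })
    payload = {"summary": counts}
    if failures:
        payload["failures"] = failures
    return payload


# 4,999 passing tests and one failure (hypothetical data).
results = [{"id": f"test_{i}", "status": "passed"} for i in range(4999)]
results.append({
    "id": "test_cart::test_total",
    "status": "failed",
    "message": "assert 41 == 42",
    "stack": "cart.py:12 in total\ntest_cart.py:30 in test_total",
})

payload = build_payload(results)
encoded = json.dumps(payload)
print(len(encoded), "characters, roughly", len(encoded) // 4, "tokens")
```

A clean run produces only the `summary` object. The `failures` list appears when something breaks, and the raw log stays on disk until the agent asks for it.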
How paqad-ai keeps the proof small
paqad-ai treats test output as evidence, not conversation filler.
The framework already has structured test-output parsing for runner formats developers use in real projects: Jest and Vitest-style JSON, JUnit XML, pytest JSON, Go’s `go test -json`, RSpec JSON, and TAP. Its health check verifies whether structured parsing is ready for the active project stack before the workflow relies on it.
The important detail is the contract. paqad-ai stores a result with `summary`, `failures`, `errors`, `warnings`, and `parse_metadata`. Verification gates can then decide whether tests passed, failed, errored, or became inconclusive because parsing degraded.
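To make the idea of structured parsing concrete, here is a minimal sketch that reduces a JUnit-style XML report to counts and failure messages using Python's standard library. It is not paqad-ai's parser: `summarize_junit`, the sample report, and the output shape are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical JUnit-style report with one pass, one failure, one skip.
SAMPLE = """<testsuite name="checkout" tests="3">
  <testcase classname="test_cart" name="test_total" time="0.01"/>
  <testcase classname="test_cart" name="test_discount" time="0.02">
    <failure message="assert 41 == 42">Traceback ...</failure>
  </testcase>
  <testcase classname="test_cart" name="test_legacy">
    <skipped/>
  </testcase>
</testsuite>"""


def summarize_junit(xml_text):
    """Count outcomes per <testcase> and keep only failure/error messages."""
    root = ET.fromstring(xml_text)
    summary = {"total": 0, "passed": 0, "failed": 0, "errored": 0, "skipped": 0}
    failures, errors = [], []
    # iter() handles both a <testsuites> wrapper and a bare <testsuite> root.
    for case in root.iter("testcase"):
        summary["total"] += 1
        test_id = f'{case.get("classname", "")}::{case.get("name", "")}'
        failure = case.find("failure")
        error = case.find("error")
        if failure is not None:
            summary["failed"] += 1
            failures.append({"id": test_id, "message": failure.get("message", "")})
        elif error is not None:
            summary["errored"] += 1
            errors.append({"id": test_id, "message": error.get("message", "")})
        elif case.find("skipped") is not None:
            summary["skipped"] += 1
        else:
            summary["passed"] += 1
    return {"summary": summary, "failures": failures, "errors": errors}


report = summarize_junit(SAMPLE)
print(report["summary"])
print(report["failures"])
```

Real reports vary by runner, for example in nested suites or in how attributes are spelled, which is why a readiness check before relying on the parser matters.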
That is the difference between asking an agent to read a terminal transcript and giving it a verification result.
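A hedged sketch of what such a contract and gate could look like is below. The five field names come from the description above; the field types, the `degraded` key inside `parse_metadata`, and the `gate` function are assumptions, not paqad-ai's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class VerificationResult:
    """Illustrative container using the five fields named in the text."""
    summary: dict
    failures: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)
    parse_metadata: dict = field(default_factory=dict)


def gate(result: VerificationResult) -> str:
    """Return 'errored', 'failed', 'inconclusive', or 'passed'."""
    if result.errors or result.summary.get("errored", 0) > 0:
        return "errored"
    if result.failures or result.summary.get("failed", 0) > 0:
        return "failed"
    # Assumed flag: a degraded parse must never be reported as a pass.
    if result.parse_metadata.get("degraded", False):
        return "inconclusive"
    return "passed"


clean = VerificationResult(summary={"passed": 5000, "failed": 0, "errored": 0})
degraded = VerificationResult(
    summary={"passed": 120, "failed": 0, "errored": 0},
    warnings=["output truncated"],
    parse_metadata={"degraded": True},
)
print(gate(clean), gate(degraded))
```

In this sketch, known failures and errors take precedence, and a degraded parse only blocks the "passed" verdict. That ordering is a design choice: evidence of a failure is still useful even when the rest of the report could not be read.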
The failure mode nobody prices
Token pricing makes the waste visible, but context loss is the real damage.
OpenAI’s current pricing page lists GPT-5 input tokens at $1.25 per million and output tokens at $10 per million. At the input rate, a single 100,000-token test log costs about $0.125, which is not financially dramatic by itself. Repeated across agent loops, retries, CI failures, and long sessions, it starts to crowd out the evidence the model needs.
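The pricing arithmetic can be sketched as follows. The input price is the figure quoted above; the count of 40 reruns is a hypothetical used only to show how repetition scales the cost.

```python
GPT5_INPUT_USD_PER_MILLION = 1.25  # input price quoted in the text


def input_cost_usd(tokens, usd_per_million=GPT5_INPUT_USD_PER_MILLION):
    """Cost of sending `tokens` input tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


single_log = input_cost_usd(100_000)
# Hypothetical: the same raw log re-sent on 40 reruns in one long session.
session = input_cost_usd(100_000 * 40)
print(f"single log: ${single_log:.3f}, 40 reruns: ${session:.2f}")
```

The dollar amounts stay small either way. The 4,000,000 tokens of repeated log in the hypothetical session are the more important number, because they occupy context that source files and requirements could have used.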
This is where teams misread the problem. They think the agent is getting worse at the code. Sometimes it is reading too much irrelevant proof.
The test suite did its job. The workflow sent the wrong evidence.
You still want tests. You want them more, not less. The fix is to make the test result agent-readable before it enters the context window.
What next?
If your agents run tests by dumping full logs back into the model, the workflow is spending context on the least useful part of validation. paqad-ai gives agents a smaller verification contract: counts first, failures next, raw output only when needed.
Install paqad-ai today and stop paying context for passing test lines.
