
EvanFlow: TDD Feedback Loop for Claude Code

Open-source tool EvanFlow creates a test-driven development feedback loop optimized for Claude Code, helping developers improve code quality and accelerate iteration cycles.

April 27, 2026


TL;DR

EvanFlow is a free, open-source TDD harness built specifically for Claude Code. It runs a tight test-write-verify loop so the model gets failing test output as feedback before it tries to fix anything. Setup takes about five minutes. The tradeoff is that it only makes sense if you already have a test suite or are willing to write one first.

Claude Code's context window holds around 200,000 tokens, but in practice the model starts drifting on multi-file edits well before that ceiling. EvanFlow's premise is that the drift is not a context problem - it is a feedback problem. The model does not know when it is wrong until you tell it, and most workflows tell it too late.

How to set up EvanFlow in a working project

The EvanFlow repository is structured around a single core idea: run your test suite, pipe the failures directly to Claude Code, and let the model iterate against a real signal rather than a prompt description of what you want.

[Image: EvanFlow: TDD Feedback Loop for Claude Code. Source: Hacker News]
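The loop itself is simple enough to sketch. Below is a minimal Python illustration of what EvanFlow automates, not its actual source: the claude -p call assumes Claude Code's non-interactive print mode, the prompt wording is invented for the example, and permission flags are omitted for brevity.

```python
# Minimal sketch of the loop EvanFlow automates (illustrative, not EvanFlow's code).
# Assumes pytest and the Claude Code CLI are installed, and that `claude -p`
# runs a single non-interactive turn against the current directory.
import subprocess

TEST_COMMAND = ["pytest", "-x", "--tb=short"]  # whatever runs your suite
MAX_ITERATIONS = 5                             # hard cap; see "The iteration limit matters"

def run_tests():
    result = subprocess.run(TEST_COMMAND, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr

for i in range(MAX_ITERATIONS):
    code, output = run_tests()
    if code == 0:
        print(f"suite green after {i} fix iteration(s)")
        break
    # The failing test output becomes the prompt, not a prose description of the task.
    prompt = f"These tests are failing. Fix the code, not the tests:\n\n{output}"
    subprocess.run(["claude", "-p", prompt])
else:
    print("iteration limit reached with tests still failing")
```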

Here is the setup sequence for a typical Node or Python project:

  1. Clone the repo and install dependencies: git clone https://github.com/evanklem/evanflow && cd evanflow && npm install
  2. Point EvanFlow at your test runner. In the config file, set testCommand to whatever runs your suite - pytest, jest --watchAll=false, go test ./..., or similar. A sample config is sketched after this list.
  3. Set model to claude-code and confirm your Anthropic API key is available in the environment as ANTHROPIC_API_KEY.
  4. Write a failing test for the feature you want built. This is the non-negotiable step. If you skip it, EvanFlow has nothing to drive the loop.
  5. Run npm start or the equivalent entry point. EvanFlow executes the test suite, captures stdout and stderr from the failures, and passes that output to Claude Code as the first message in the session.
  6. Watch the loop. Claude Code writes or edits code, EvanFlow re-runs the tests, and the cycle continues until the suite passes or you hit a configured iteration limit.
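For reference, here is roughly what that config might look like. The article only names testCommand and model as keys; the file name and the remaining fields are guesses at plausible options, so check the repo's README for the real schema. Shown as JSONC so the assumptions can be annotated inline.

```jsonc
// Hypothetical evanflow.config.json. Only testCommand and model are named in
// this article; maxIterations and apiKeyEnv are assumed keys for illustration.
{
  "testCommand": "pytest -x --tb=short",
  "model": "claude-code",
  "maxIterations": 5,
  "apiKeyEnv": "ANTHROPIC_API_KEY"
}
```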

Verification checklist before you hand anything off to production:

  • Confirm the test that was originally failing is now green
  • Run the full suite, not just the target test - Claude Code will sometimes fix one test by breaking an adjacent one
  • Check the diff manually. The loop can produce working code that passes tests through a path you did not intend
  • Review any new files created during the session. The model occasionally introduces helper files that are not wired into the project correctly

The iteration limit matters

Set a hard cap on loop iterations in the config - 5 to 8 is reasonable for most tasks. Without it, a stubborn test failure can burn through API credits on circular attempts.

EvanFlow against the alternatives

There are several ways to run a TDD-style loop with an AI coding tool. They differ in how much you configure, how tightly they integrate with your test runner, and whether they work with Claude Code specifically.

| Tool | TDD loop | Claude Code support | Setup time | Cost | Best for |
|------|----------|---------------------|------------|------|----------|
| EvanFlow | Native, automated | First-class | ~5 min | Free (you pay API) | Claude Code users who already write tests |
| Cursor | Manual (paste output) | Via OpenRouter or direct | ~2 min | $20/mo + API | Developers who want an IDE with AI built in |
| Goose | Shell tool, semi-automated | Yes, via API | ~10 min | Free (you pay API) | Teams who want a broader agent with file and shell access |
| Manual Claude Code | None - paste and iterate by hand | Native | 0 min | API only | One-off tasks, exploratory work |

The honest assessment of Cursor is that its TDD story depends entirely on you copying test output and pasting it back into the chat. That works fine for a single failing test. It breaks down when you have 12 interdependent failures and you want the model to work through them systematically. Cursor and Claude Code serve different workflows - Cursor is an IDE replacement, EvanFlow is a loop harness. They are not really competing.

Goose is closer in spirit to EvanFlow - it is a headless agent that can run shell commands and observe output. But Goose is a general-purpose agent, not a TDD-specific tool, and configuring it to behave like EvanFlow requires more setup work than EvanFlow itself requires.

A concrete scenario where this changes the output

Say you are building a data validation module. You have a function validate_transaction that needs to reject negative amounts, amounts above a configurable ceiling, and malformed currency codes. You write three tests, all failing, before any implementation exists.
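To make the scenario concrete, here is roughly what those three failing tests could look like. The module name, exception type, and function signature are illustrative assumptions; only the three rules come from the scenario.

```python
# Hypothetical tests written before any implementation exists. The module name
# (validation), exception type (ValidationError), and the signature of
# validate_transaction are all assumptions for this example.
import pytest
from validation import ValidationError, validate_transaction

def test_rejects_negative_amount():
    with pytest.raises(ValidationError):
        validate_transaction(amount=-10.00, currency="USD")

def test_rejects_amount_above_ceiling():
    with pytest.raises(ValidationError):
        validate_transaction(amount=50_000.00, currency="USD", ceiling=10_000.00)

def test_rejects_malformed_currency_code():
    with pytest.raises(ValidationError):
        validate_transaction(amount=25.00, currency="usd$")
```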

Without EvanFlow, the typical workflow is: describe the requirements in a prompt, let Claude Code write the function, run the tests yourself, paste the failures back if something breaks, repeat. Each round-trip takes 2 to 4 minutes if you are moving quickly. Three or four iterations to get all three tests green is not unusual.

With EvanFlow, that round-trip is automated. Claude Code sees the raw pytest output - including the assertion error, the line number, and the actual vs expected values - before it writes a single line. The failure message is more precise than any prompt description you would write by hand. In this kind of bounded, well-specified task, the model tends to get all three tests green in one or two iterations rather than three or four.
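For reference, one implementation that would turn all three tests above green looks like this. The code Claude Code actually produces will vary between runs; this is a plausible end state, not the tool's output.

```python
# One implementation satisfying the three hypothetical tests above.
class ValidationError(Exception):
    pass

def validate_transaction(amount, currency, ceiling=10_000.00):
    if amount < 0:
        raise ValidationError("amount must be non-negative")
    if amount > ceiling:
        raise ValidationError(f"amount {amount} exceeds ceiling {ceiling}")
    # Accept only three-letter uppercase codes, ISO 4217 style.
    if len(currency) != 3 or not currency.isalpha() or not currency.isupper():
        raise ValidationError(f"malformed currency code: {currency!r}")
    return True
```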

The scenario where this matters most is not a small validation function. It is a 300-line service class with 20 unit tests, where you are refactoring a database access layer and you cannot afford to break the existing contract. Running that manually through Claude Code is possible. Running it through EvanFlow means the model is working against a live definition of "correct" rather than a description of one.

It also changes how you write prompts. When the test output is the first message, you spend less time explaining edge cases and more time on architecture decisions. The model can infer the edge cases from the failing assertions.

Claude Code API costs per session and per team

EvanFlow itself is free. The cost is Claude Code API usage, and that scales directly with loop iterations and context size.

A single loop iteration on a medium-complexity task - say, one file edited, three failing tests, moderate context - costs roughly $0.03 to $0.08 using Claude 3.5 Sonnet at current pricing. That means a 5-iteration loop runs $0.15 to $0.40. A more complex session with 10 iterations and a large codebase context can reach $1.50 to $2.00.

$0.03

approximate cost per loop iteration on a medium-complexity task with Claude 3.5 Sonnet

Those numbers are manageable for individual developers. For teams running EvanFlow across multiple developers on multiple repos, the math changes. If five developers each run three sessions per day at an average of $0.60 per session, you are looking at $9 per day or about $270 per month in API costs before any other Claude usage.
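The arithmetic behind those figures, spelled out:

```python
# Back-of-envelope math for the per-session and per-team figures quoted above.
iter_low, iter_high = 0.03, 0.08        # USD per loop iteration, medium task
iterations = 5
print(f"5-iteration session: ${iterations * iter_low:.2f} to ${iterations * iter_high:.2f}")
# -> $0.15 to $0.40

developers, sessions_per_day, avg_session = 5, 3, 0.60
daily = developers * sessions_per_day * avg_session
print(f"team: ${daily:.2f}/day, ~${daily * 30:.0f}/month")
# -> $9.00/day, ~$270/month
```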

Compare that to GitHub Copilot at $19 per seat per month - Copilot does not have a TDD loop, but the cost structure is predictable. Claude Code API usage is not. The EvanFlow iteration cap is not optional for teams watching a budget.

There is no free tier buffer here the way there is with some hosted tools. Every iteration hits the API. If you are on Anthropic's free tier for personal use, you will hit rate limits quickly during an active loop session. The tool is realistically aimed at developers with a paid API key.

One cost that is easy to miss: failed loops. When EvanFlow cannot resolve the failing tests within your iteration limit, you have paid for all those attempts and still have a broken test. That is not a flaw in EvanFlow - it is how LLM-based iteration works - but it means you should set conservative iteration limits early and raise them only when you understand the tool's behavior on your specific test suite.

Which setup is right for your situation

| Your situation | Best option | Why |
|----------------|-------------|-----|
| You write tests first and use Claude Code already | EvanFlow | Automates exactly what you are already doing manually |
| You want an IDE with AI code completion and no test harness setup | Cursor | Better editing experience, no loop configuration required |
| You want a general-purpose agent that can also run tests | Goose | More flexible, handles tasks outside the TDD loop |
| You do not write tests or cannot write them first | Manual Claude Code | EvanFlow needs failing tests to drive the loop - without them it adds no value |
| You are on a tight API budget | Manual Claude Code with paste-back workflow | Each iteration costs money - manual control keeps you from burning credits on bad loops |
| You are refactoring a large module with an existing test suite | EvanFlow | The automated feedback is most valuable when there are many interdependent tests to hold green |
| You want to compare Claude Code against other AI coding tools | See the Cursor vs Claude comparison | The choice of model matters as much as the loop harness |

