Forge Boosts Local Model Agentic Task Accuracy to 99%

Forge is an open-source reliability layer that adds guardrails to self-hosted LLM tool-calling, improving an 8B model's performance from 53% to 99% on agentic tasks through retry logic, error recovery, and context management.

You paste `retry_nudge: true` into a config file, restart your local inference server, and watch a Llama 8B model that was failing two out of every three tool calls start completing them. No model swap. No fine-tuning run. Just a wrapper that tells the model what went wrong and asks it to try again with tighter constraints. That is the core of Forge, an open-source project Antoine Zambelli, AI Director at Texas Instruments, published to Hacker News this week. The headline number is stark: 53% task completion on agentic benchmarks without the guardrail layer, 99% with it. The repo is public on GitHub.

Why small models break on tool calls and why retries alone do not fix it

The failure mode Forge is targeting is specific. When you run a local 8B model in an agentic loop, the model is being asked to do several things at once: decide which tool to call, format the call correctly, interpret the result, and decide what to do next. Larger models have enough capacity to carry context across those steps even when earlier steps go slightly wrong. Smaller models do not. A malformed JSON response in step two does not get corrected in step three. It cascades. The naive fix is to add retries. If the tool call fails, try again. This works sometimes, but it misses the actual problem. The model did not fail because it ran out of attempts. It failed because it lost track of what it was doing, produced output that violated the expected schema, or ran into a context window that had silently been truncated because VRAM ran out. A plain retry hands the model the same broken context and asks it to do better. Usually it does not. What Forge adds is structured failure recovery. When a step breaks, the system tells the model specifically what broke, reframes the instruction to constrain the next attempt, and enforces that the model is still on the step it should be on rather than having jumped ahead or looped back to an earlier one. The VRAM-aware context management piece is the one that usually gets skipped in lighter-weight implementations. If your local inference server is running on a 16GB GPU and the context has grown past what fits, the model starts hallucinating rather than refusing. Forge tracks that and trims context before the model reaches the edge. The result is that you are not actually squeezing more out of the model's weights. You are removing the conditions under which small models reliably fail.

How VRAM-aware context trimming works

Think of the model's context window as a whiteboard. The model can only act on what is currently written on it. When the whiteboard fills up, something has to get erased. The question is: what? Most inference servers handle this with a simple sliding window. They drop the oldest tokens first. In a conversation, that is usually fine. In an agentic task, it is often catastrophic, because the oldest tokens frequently contain the original task description, the list of available tools, and the schema the model is supposed to follow. Erasing those and keeping the most recent failed tool call is exactly backwards. Forge's approach is to monitor available VRAM before each inference call and, when space is tight, apply a context trimming strategy that preserves high-priority tokens: the system prompt, the tool definitions, and the current step's instruction. Recent but lower-priority context, like intermediate scratchpad reasoning, gets dropped first. The model keeps the scaffolding and loses the notes. This is not a new idea in LLM systems, but it is the piece that engineers commonly skip when they are standing up a quick local agent. The result of skipping it is that the model works fine on short tasks and degrades on long ones in a way that looks random until you correlate it with context length.

Setting up Forge on a local inference stack

Forge is designed to sit in front of whatever local inference server you are already running, whether that is Ollama, llama.cpp, or a local vLLM instance.

Clone the repository: git clone https://github.com/antoinezambelli/forge
Install dependencies: pip install -r requirements.txt
Copy the example config and edit it for your setup: cp config.example.yaml config.yaml. Set your inference server URL, the model name, and your VRAM ceiling in GB.
Enable the guardrail modules you want in config.yaml. The defaults are: retry_nudge: true, step_enforcement: true, error_recovery: true, vram_context_management: true.
Point Forge at your tool definitions. These are standard JSON schema files. Drop them in the tools/ directory Forge expects, or set the path in config.
Run the Forge server: python forge/server.py --config config.yaml
Update your agent code to send requests to Forge's endpoint instead of directly to your inference server. The API surface is a drop-in replacement for a standard chat completions endpoint.

Verification test: run one of the included benchmark tasks with and without the guardrails flag. The command is python forge/bench.py --task file_ops --guardrails off and then python forge/bench.py --task file_ops --guardrails on. If the pass rate on file_ops does not improve materially with guardrails on, your tool definitions are malformed and that is where to look first.

The case that the 53%-to-99% number does not mean what it appears to mean

Benchmark numbers for agentic systems are easier to manipulate than most, and not always intentionally. The 99% figure is measured against a specific benchmark on a specific model with specific tasks. The tasks are probably the tasks Forge was designed for. The baseline 53% is probably what that model scores without any scaffolding at all, which is a low bar. A fair comparison would also include a well-implemented existing agent framework doing the same tasks on the same model. The gap is also almost certainly not 46 percentage points in a production setting. Production agentic tasks involve ambiguous instructions, tools that return unexpected schemas, and user inputs that sit outside whatever distribution the benchmark covers. Guardrails built for known tool schemas and well-defined steps become friction when the task structure is novel. There is also a category question. An 8B model with a 99% pass rate on structured tool-calling benchmarks is not the same as a model that is ready for production agentic work. The tasks where 8B models fail are not usually the structured ones. They are the tasks that require judgment under ambiguity, multi-step reasoning with incomplete information, and knowing when to stop and ask rather than proceed. Forge does not help with those. It helps with the plumbing. Which means the right framing is not "Forge makes small models production-ready." It is "Forge removes a layer of avoidable failure that was making benchmark scores misleadingly low." That is still useful. It is just a more modest claim. If you are evaluating small models for local deployment, the benchmark without Forge is probably not measuring what you care about. But the benchmark with Forge is also not measuring what you care about. The number worth tracking is performance on your specific tasks, with your specific tools, over a long enough run to catch edge cases. Forge is a reasonable starting point for that infrastructure, not a replacement for it.

TL;DR

Forge is an open-source guardrail layer for local LLM tool-calling that lifts an 8B model from 53% to 99% on agentic benchmarks by adding structured retry logic, step enforcement, and VRAM-aware context trimming. The number is benchmark-specific and should not be read as a production guarantee, but the underlying problems Forge solves are real and worth addressing before assuming a small model is too weak for agentic tasks.