Developer Tests LLM Hacking Skills With $1,500 Budget

A developer built a vulnerable app and spent $1,500 testing whether AI language models could successfully exploit real security flaws. The experiment reveals practical insights into LLM capabilities for security tasks.

You are choosing between two approaches to application security testing: hire a human penetration tester, or run the same test with an LLM and track what it costs. The question is not whether LLMs can find vulnerabilities. The question is which class of vulnerabilities they reliably find, which they miss, and what you actually pay per finding. Kasra's experiment spent $1,500 trying to answer this with a purpose-built vulnerable app and real API calls. The results are worth examining carefully.

How LLMs approach a vulnerable target

A human pentester works in loops: reconnaissance, hypothesis, probe, observe, refine. They carry context forward between steps. An LLM does something that looks similar but is structurally different. Each call to the model is, in isolation, a text completion. The "loop" is something the surrounding code builds, not something the model natively maintains.

Think of it like giving a security consultant a notepad but erasing it between every conversation. They can be brilliant in each individual session, but they cannot remember that the endpoint they found in session three is related to the parameter injection they spotted in session seven. You have to hand them the connection explicitly.

This matters because most real vulnerabilities are compositional. SQL injection in a search field is easy to find. But the critical path is often: unauthenticated endpoint reveals internal user IDs, those IDs are accepted by an admin function that trusts the session cookie format, and that cookie format has a predictable HMAC. Each piece is low severity. Together they are a full account takeover. LLMs, without careful prompt engineering to carry this chain forward, tend to report the pieces without assembling the chain.

What they do well is pattern matching at scale. Give a model a code dump and ask it to find SQL injection candidates, and it will produce a list faster than any junior analyst. Claude and ChatGPT both perform well at this specific subtask. The gap appears when the task requires multi-step exploitation rather than single-step identification.
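The "notepad" idea can be sketched in a few lines of Python. This is an illustrative sketch, not the harness used in the experiment: `Notepad`, `build_prompt`, `run_steps`, and `fake_model` are hypothetical names, and `fake_model` is a stand-in for a real API call so the example runs offline. The point it shows is that earlier findings only reach the model if the surrounding code writes them back into each prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Finding:
    step: int
    summary: str


@dataclass
class Notepad:
    """Findings carried forward by our code; the model keeps nothing between calls."""
    findings: List[Finding] = field(default_factory=list)

    def render(self) -> str:
        if not self.findings:
            return "No prior findings."
        return "\n".join(f"[step {f.step}] {f.summary}" for f in self.findings)


def build_prompt(notepad: Notepad, task: str) -> str:
    # Every prompt restates earlier findings so the model can try to chain them.
    return (
        "Prior findings:\n"
        f"{notepad.render()}\n\n"
        f"Current task: {task}\n"
        "State whether any prior findings combine with new ones into an attack chain."
    )


def run_steps(tasks: List[str], call_model: Callable[[str], str]) -> Notepad:
    notepad = Notepad()
    for step, task in enumerate(tasks, start=1):
        reply = call_model(build_prompt(notepad, task))
        notepad.findings.append(Finding(step, reply))
    return notepad


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for an API call: reports how much context it was handed.
    return f"saw {prompt.count('[step ')} prior finding(s)"


pad = run_steps(
    ["map endpoints", "inspect admin function", "inspect cookie format"],
    fake_model,
)
for f in pad.findings:
    print(f.step, f.summary)
```

Each later step sees one more prior finding than the step before it. Drop the `notepad.render()` call from the prompt and every step starts from zero, which is the failure mode described above.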

Deciding which tool fits which test

The decision tree here is straightforward once you accept that LLMs are better at some security tasks than others.

If you are auditing your own code for known vulnerability classes (SQLi, XSS, insecure deserialization, hardcoded secrets), run a model pass first. It is faster and cheaper than a human review for initial triage. Cursor or Claude Code pointed at a repository will surface candidates in minutes. Treat the output as a lead list, not a final report.

If your goal is to understand the actual exploitability of a finding, a human needs to be in the loop. The LLM will tell you that the parameter is injectable. It will not reliably tell you whether the injection bypasses the WAF sitting in front of it, whether the database user has permissions that make it meaningful, or whether a timing attack is actually feasible given network latency.

If you are building an automated security pipeline, budget for two things: the cost of false positives (developer time spent chasing findings that do not pan out) and the cost of false negatives (vulnerabilities the model skipped because they required chained reasoning). Neither cost shows up in the API bill.

If your codebase is mostly third-party dependencies, skip LLM-based code review entirely for the dependency surface. Use dedicated SCA tooling. LLMs do not reliably reason about the CVE history of specific package versions.

If you are deciding between Cursor and GitHub Copilot for security-adjacent code review, the comparison matters less than the prompting strategy you wrap around either one.
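The routing rules above are simple enough to write down as a function. This is a minimal sketch under assumptions: the `Route` enum, the `route` function, and the string labels for `surface` and `goal` are all hypothetical names chosen for illustration, not part of any tool mentioned here.

```python
from enum import Enum


class Route(Enum):
    MODEL_PASS = "model pass first; treat output as a lead list"
    HUMAN_REVIEW = "human in the loop"
    SCA_TOOL = "dedicated SCA tooling"


def route(surface: str, goal: str) -> Route:
    """Pick a first-line reviewer.

    surface: "first_party" or "dependency" (hypothetical labels)
    goal:    "triage" or "exploitability"  (hypothetical labels)
    """
    if surface not in {"first_party", "dependency"}:
        raise ValueError(f"unknown surface: {surface!r}")
    if goal not in {"triage", "exploitability"}:
        raise ValueError(f"unknown goal: {goal!r}")

    if surface == "dependency":
        # Third-party packages: skip LLM code review for this surface.
        return Route.SCA_TOOL
    if goal == "exploitability":
        # WAF behavior, database permissions, and timing need a person.
        return Route.HUMAN_REVIEW
    # Known vulnerability classes in your own code: cheap model triage.
    return Route.MODEL_PASS


print(route("first_party", "triage").value)
print(route("first_party", "exploitability").value)
print(route("dependency", "triage").value)
```

The dependency check comes first on purpose: for that surface the goal does not change the answer.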

A concrete walkthrough: one finding, real costs

Take a simple scenario. You have a Flask app with a user search endpoint. The query parameter is passed unsanitized to a SQLite query. Classic textbook injection.

You paste the relevant route handler into a model and ask it to find security issues. The model identifies the injection in under 30 seconds. Cost: roughly $0.002 at current GPT-4o pricing for that payload size. Good.

Now you want to know if it is exploitable. You ask the model to generate a proof-of-concept payload. It produces something like ' OR '1'='1 and a UNION-based extraction attempt. Still cheap. Still fast.

Now you want to know whether the app's input length validation (it truncates at 50 characters) defeats the UNION approach. You ask the model. It hedges. It says "it depends on the database version and the specific truncation behavior." That hedge is correct, but it is not actionable. A human tester would fire up sqlmap, confirm the truncation, and find that a time-based blind injection works within the character limit in about 20 minutes.

At $1,500 spread across hundreds of similar decision points, the finding rate per dollar looks reasonable until you account for the triage cost. Each "it depends" answer that a developer has to manually verify is roughly 30 minutes of engineer time. At a $100/hour fully-loaded rate, ten uncertain findings cost $500 in follow-up. The API bill is not the whole picture.
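To make the flaw concrete, here is a minimal sketch of the query logic using Python's built-in sqlite3 module, without the Flask layer so it runs on its own. The table, column, and function names are hypothetical, and the 50-character truncation mirrors the scenario above; this is not the code from the experiment's app.

```python
import sqlite3

MAX_LEN = 50  # the scenario's truncation limit


def make_db() -> sqlite3.Connection:
    # Hypothetical schema for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def search_vulnerable(conn: sqlite3.Connection, q: str) -> list:
    q = q[:MAX_LEN]  # truncation alone does not stop a short payload
    # User input is interpolated straight into the SQL text.
    sql = f"SELECT name FROM users WHERE name = '{q}' ORDER BY name"
    return [row[0] for row in conn.execute(sql)]


def search_safe(conn: sqlite3.Connection, q: str) -> list:
    q = q[:MAX_LEN]
    # The ? placeholder binds q as data, so quotes in q cannot change the query.
    sql = "SELECT name FROM users WHERE name = ? ORDER BY name"
    return [row[0] for row in conn.execute(sql, (q,))]


conn = make_db()
payload = "' OR '1'='1"
print(search_vulnerable(conn, payload))  # ['alice', 'bob']
print(search_safe(conn, payload))        # []
print(search_safe(conn, "alice"))        # ['alice']
```

The tautology payload is 11 characters, well under the limit, so the length check does nothing against it. Whether a longer UNION or time-based payload also fits is the open question the model hedged on, and this sketch does not settle it.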

The number that reframes the experiment

$1,500 spent on API calls to test one intentionally vulnerable app

$1,500 sounds like a reasonable experiment budget. It is also roughly the cost of four to five hours of a senior penetration tester's time at standard consulting rates ($300-$400/hour is typical for experienced external pentesters in 2024). That block of time would produce a scoped findings report with exploitability ratings, reproduction steps, and remediation priority. The LLM run produced raw findings that still required human interpretation.

If that number were $150, the calculus would shift. At one-tenth the cost, even a 60% false positive rate is acceptable for early-stage triage. You run the model pass on every pull request, accept the noise, and escalate the plausible findings to a human. That workflow makes sense.

If that number were $15,000, you would be approaching the cost of a full external pentest engagement, and the comparison becomes unflattering. A proper engagement includes authenticated testing, business logic review, and a remediation call. An LLM at $15,000 in API costs is burning money on repeated prompts without the cumulative reasoning a human tester builds across a multi-day engagement.

At $1,500, the honest position is: this is useful for code-level pattern matching and faster than nothing for initial triage, but it is not a substitute for a scoped pentest. The experiment confirms that. What it also suggests is that the tooling around the model (how context is passed between steps, how findings are aggregated, how the human reviewer is looped in) matters more than which model you pick. The Claude vs ChatGPT question is less important than whether your pipeline actually chains the findings into exploitable scenarios rather than dumping a flat list of candidates.

The security community has been here before with automated scanners. Burp Suite Professional does not replace a pentester. Neither does an LLM. The question is always where in the workflow the automation earns its cost, and where it creates work instead of eliminating it.
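The arithmetic behind these comparisons is short enough to check directly. The sketch below uses only figures quoted in this article ($1,500 budget, $300-$400/hour pentester rates, 30 minutes of follow-up per uncertain finding, $100/hour engineer rate); the function names are hypothetical, and the total for a run with ten uncertain findings is an illustration, not a measured result from the experiment.

```python
def pentester_hours(budget: float, hourly_rate: float) -> float:
    """Hours of pentester time the same budget would buy."""
    return budget / hourly_rate


def triage_cost(uncertain_findings: int,
                hours_each: float = 0.5,
                engineer_rate: float = 100.0) -> float:
    """Follow-up cost of manually verifying uncertain model findings."""
    return uncertain_findings * hours_each * engineer_rate


def total_cost(api_spend: float, uncertain_findings: int) -> float:
    """API bill plus the triage time that never appears on it."""
    return api_spend + triage_cost(uncertain_findings)


print(pentester_hours(1500, 400))  # 3.75
print(pentester_hours(1500, 300))  # 5.0
print(triage_cost(10))             # 500.0
print(total_cost(1500, 10))        # 2000.0
```

Swapping in $150 or $15,000 for `api_spend` reproduces the two alternative scenarios: the triage term stays the same size while the API term shrinks below it or dwarfs it.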

TL;DR

LLMs find known vulnerability patterns fast and cheaply, but struggle with the chained reasoning that makes findings exploitable. At $1,500 for a single app, the cost looks competitive only until you account for the human triage time the model's uncertain outputs generate.