Many CISOs are familiar with the age-old security testing dilemma: choosing between the speed of a scanner and the depth of a human penetration tester.
When we set out to build our pentest platform, we aimed to bridge that gap with RedVeil: an agentic AI-powered penetration testing solution that is capable of genuine reasoning and can deliver thorough results in just a few hours.
Today, we are validating that approach with data.
We recently put RedVeil's core engine to the test against a widely recognized benchmark for autonomous web application penetration testing. The results were clear: RedVeil outperformed the current industry leader by 7 points.
While we represent a new generation of offensive security, these results confirm that our agents are already setting a new standard for depth, accuracy, and reasoning capabilities.

The XBEN benchmark is a series of web applications that aim to replicate a wide range of real-world vulnerability scenarios. These applications cover common web vulnerabilities like cross-site scripting (XSS), SQL injection (SQLi), and server-side template injection (SSTI), along with more complex multi-step attack chains that involve deserialization, server-side request forgery (SSRF), and blind injection.
Beyond the Score: Why Reasoning Matters
In the world of autonomous security, a high benchmark score is not just a vanity metric; it is a proxy for how well an AI agent can emulate a human adversary.
Legacy scanners operate on a "see match, alert user" basis, finding open ports or outdated libraries. But real attackers operate on attack-path reasoning. They observe, orient, decide, and act. They chain a minor information leak in one area with a misconfiguration in another to achieve a critical breach.
RedVeil's lead in this benchmark is driven by our agents' ability to execute these complex, multi-step attack chains. The benchmark results highlighted three key areas where RedVeil distinguishes itself from other solutions:
1. Validation Over Volume
The defining characteristic of our performance was the elimination of false positives. In the benchmark scenarios, points are awarded for proven exploits.
RedVeil's agents do not simply flag theoretical risks. They attempt safe exploitation to generate Proof-of-Concept (PoC) evidence. If our agent says you are vulnerable, it provides the evidence to prove it. This "validation over volume" approach allowed us to secure points where other tools might only generate noise.
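To make the idea concrete, here is a minimal sketch of what "validation over volume" can look like at the reporting stage. It is illustrative only and does not reflect RedVeil's actual implementation: the `Finding` class, its `evidence` field, and the sample findings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    """Hypothetical record of a suspected vulnerability."""
    title: str
    # Hypothetical PoC artifact, e.g. a captured response. None means unproven.
    evidence: Optional[str] = None


def validated(findings: List[Finding]) -> List[Finding]:
    """Keep only findings backed by non-empty proof-of-concept evidence."""
    return [f for f in findings if f.evidence and f.evidence.strip()]


findings = [
    Finding("Reflected XSS in search", evidence="payload echoed unescaped in response"),
    Finding("Possible SQL injection in login"),  # flagged but never proven
    Finding("SSTI in profile template", evidence="{{7*7}} rendered as 49"),
]

report = validated(findings)
print([f.title for f in report])
```

Only the two findings with evidence reach the report; the unproven SQL injection lead is held back rather than passed on as noise.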
2. State-Aware Navigation
Many automated tools struggle with "stateful" applications: workflows that require logging in, creating a record, and then manipulating that specific record.
The benchmark tested heavily for this. RedVeil's agents maintained context across the test, effectively navigating complex business logic that typically stumps automated scanners. This ability to maintain "state" allowed our agents to find vulnerabilities deep within the application logic, contributing significantly to our score.
3. The Agent Ops Advantage
Our performance is also tied to our Agent Ops architecture. Because we measure effort in discrete units of AI work rather than in wall-clock time or a fixed test scope, our agents have the freedom to be thorough, and our customers have the flexibility to allocate their testing as they see fit.
In the benchmark, this meant our agents did not time out or give up superficially. They continued to analyze targets, adjusting their strategies dynamically until they exhausted the search space or achieved the objective.
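As a rough illustration of budgeting by units of work rather than by the clock, the sketch below tries strategies in turn until one succeeds or the budget is spent. Everything in it is an assumption made for illustration (the function name, the strategy tuples, and the unit costs); it is not a description of how Agent Ops is actually built.

```python
from typing import Callable, List, Tuple

# Hypothetical strategy: (name, cost in work units, attempt function returning success)
Strategy = Tuple[str, int, Callable[[], bool]]


def run_with_budget(strategies: List[Strategy], budget_units: int):
    """Try strategies in order until one succeeds or no affordable strategy remains.

    Progress is limited by work units spent, not by elapsed time.
    """
    spent = 0
    log = []
    for name, cost, attempt in strategies:
        if spent + cost > budget_units:
            log.append((name, "skipped: over budget"))
            continue  # a cheaper strategy later in the list may still fit
        spent += cost
        if attempt():
            log.append((name, "objective achieved"))
            return True, spent, log
        log.append((name, "no result, trying next strategy"))
    return False, spent, log


strategies = [
    ("default credentials", 1, lambda: False),
    ("blind injection", 3, lambda: False),
    ("deserialization chain", 4, lambda: True),
]

ok, spent, log = run_with_budget(strategies, budget_units=10)
print(ok, spent)
```

With a budget of 10 units, the third strategy is reached and succeeds after 8 units. With a budget of 5, it is skipped and the run ends unsuccessfully after 4 units, however much time has elapsed.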
Democratizing Elite Security
We are proud of these results, but benchmarks are simulated environments. The real value lies in what this capability brings to your production environment.
By outperforming the leading standard, we are demonstrating that autonomous testing is no longer "experimental." It is ready for the enterprise.
- For the developer: It means receiving a report that reads as if a human wrote it, delivered in hours rather than weeks. Remediation testing is even quicker with one-click rechecking.
- For the CISO: It means obtaining the depth of a manual consulting engagement, but with the ability to run it on-demand, overnight, for a fraction of the cost.
- For the compliance officer: It means audit-ready reports that satisfy SOC 2, ISO 27001, CMMC, and other compliance framework requirements, backed by a testing engine that is demonstrably superior to standard market tools.
The Future is Autonomous
Security is a moving target. As software development accelerates, the window for manual security testing shrinks.
We believe that the future belongs to defensive AI that can out-reason offensive AI. This benchmark result is a milestone in that journey, proving that RedVeil is not just keeping up with the industry standard; we are pushing it forward.
Ready to see what a verified, high-scoring autonomous pentest looks like?
Book a demo with a RedVeil expert today.