2025-08-04, Siena
Logic-based vulnerabilities remain the hardest to detect with automated application security tools. Our work examines how AI-based hackbots can be trained to discover such complex vulnerabilities. In this talk, we'll discuss our approach to training and evaluating these systems.
We demonstrate how we train a reinforcement learning agent to navigate applications, model state transitions, and identify logic flaws. These agents observe user roles, session tokens, and application responses to iteratively craft requests that reveal vulnerabilities.
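To make the training loop described above concrete, here is a minimal sketch using tabular Q-learning on a toy state machine. It is not our agent or the Marvin environment: the states, the actions (`login_as_user`, `get_own_order`, `get_other_users_order`), the reward of 1.0 for an access-control violation, and the hyperparameters are all assumptions invented for illustration.

```python
import random

# Hypothetical toy application: states, actions, and rewards are invented
# for illustration and do not come from Marvin or the agent in this talk.
ACTIONS = ["login_as_user", "get_own_order", "get_other_users_order"]
STATES = ("anonymous", "user_session")


def step(state, action):
    """Return (next_state, reward, done) for one request against the toy app."""
    if state == "anonymous":
        if action == "login_as_user":
            return "user_session", 0.0, False
        # Unauthenticated requests are rejected and the state does not change.
        return "anonymous", 0.0, False
    if action == "get_other_users_order":
        # Planted logic flaw: the server never checks who owns the order.
        return "leaked_order", 1.0, True
    return "user_session", 0.0, False


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        state = "anonymous"
        for _ in range(10):  # cap the number of requests per episode
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            future = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = next_state
            if done:
                break
    return q


def greedy_path(q):
    """Replay the learned policy from an anonymous session."""
    state, path = "anonymous", []
    for _ in range(10):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        path.append(action)
        state, _, done = step(state, action)
        if done:
            break
    return path


if __name__ == "__main__":
    print(greedy_path(train()))
```

In this toy setting the learned policy logs in and then requests another user's order. A real application has a far larger state and action space, which is why the agent has to model state transitions rather than enumerate them.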
Then, we evaluate this agent using Marvin, our open-source research framework that provides environments with vulnerable REST and GraphQL APIs that accurately mirror real-world application logic. By open-sourcing Marvin, we aim to give the hacker community a standard way to evaluate new hackbots.
We discuss the capabilities and limitations of these systems and point toward what we need to make AI practically useful for security research.
The content of this talk originated from a research project Dvir Lazar and I developed at Carnegie Mellon this past year. Following our research, Dvir and I co-founded Alkonos, an AI-based Dynamic Application Security Testing (DAST) startup.
The fundamental problem we're addressing is that the DAST tools widely adopted by industry and the hacker community rely on pattern matching for known vulnerabilities or on fuzzing without contextual insight. This approach renders them ineffective against some of the most critical web application vulnerabilities, including insecure direct object references (IDORs), other access control flaws, and account takeovers. Broken Access Control ranks #1 in the OWASP Top 10 (2021), yet traditional tools consistently fail to detect it.
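A small sketch shows why context matters for this class of bug. The in-memory "API" below is hypothetical (the `INVOICES` data and the handler and function names are invented for illustration). The vulnerable handler returns a well-formed 200 response, so there is no error string or payload signature to match. The flaw only becomes visible once the tester knows which user owns which object.

```python
# Hypothetical in-memory API; data and function names are illustrative only.
INVOICES = {
    101: {"owner": "alice", "total": 40},
    102: {"owner": "bob", "total": 75},
}


def get_invoice_vulnerable(session_user, invoice_id):
    """Serves any invoice to any logged-in user (missing ownership check)."""
    if session_user is None:
        return 401, None
    invoice = INVOICES.get(invoice_id)
    if invoice is None:
        return 404, None
    return 200, invoice


def get_invoice_fixed(session_user, invoice_id):
    """Same handler with the ownership check in place."""
    status, invoice = get_invoice_vulnerable(session_user, invoice_id)
    if status == 200 and invoice["owner"] != session_user:
        return 403, None
    return status, invoice


def find_idor(handler, users, invoice_ids):
    """List (user, invoice_id) pairs where a user reads someone else's invoice."""
    findings = []
    for user in users:
        for invoice_id in invoice_ids:
            status, body = handler(user, invoice_id)
            if status == 200 and body["owner"] != user:
                findings.append((user, invoice_id))
    return findings


if __name__ == "__main__":
    print(find_idor(get_invoice_vulnerable, ["alice", "bob"], [101, 102]))
    print(find_idor(get_invoice_fixed, ["alice", "bob"], [101, 102]))
```

The check works here only because ownership is written down in the test data. In a real target, that knowledge has to be inferred from roles, sessions, and responses.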
Recent advancements in AI offer the potential to automate the detection of these complex vulnerabilities. However, as with any emerging technology, significant challenges remain. Our research revealed that while multiple companies and academic research efforts are tackling this field, there's no standardized way to measure the success of these tools. We argue that without proper benchmarks, the hacker community cannot effectively assess these solutions, and the industry lacks direction for developing robust automation tools.
To address this gap, we've developed Marvin, an MIT-licensed benchmark suite specifically designed to evaluate whether autonomous agents can discover logic bugs in realistic environments. Marvin provides standardized vulnerability scenarios with ground-truth labels, focusing on business logic flaws where AI systems traditionally struggle to understand application context and business rules.
Our framework features diverse vulnerable-application corpora across multiple API paradigms (REST and GraphQL), controlled noise elements for measuring false positive rates, varied authentication mechanisms, and progressive difficulty tiers. We'll show how reinforcement learning-based hackbots can be trained on Marvin to identify these vulnerabilities, and present a live demonstration of our RL agent navigating complex API structures and exploiting business logic flaws that traditional security tools miss.
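As a rough illustration of how ground-truth labels and noise elements combine into a measurable result, the sketch below scores a set of reported findings against a labeled set. It is not Marvin's scoring code or label format: the `score` function and the `(endpoint, flaw_type)` finding tuples are assumptions made for this example.

```python
def score(reported, ground_truth):
    """Compare reported findings with ground-truth labels.

    Findings are hashable identifiers; this sketch assumes hypothetical
    (endpoint, flaw_type) tuples. A reported finding with no matching label,
    such as a decoy endpoint planted as noise, counts as a false positive.
    """
    reported, ground_truth = set(reported), set(ground_truth)
    true_pos = len(reported & ground_truth)
    false_pos = len(reported - ground_truth)
    false_neg = len(ground_truth - reported)
    precision = true_pos / (true_pos + false_pos) if reported else 0.0
    recall = true_pos / (true_pos + false_neg) if ground_truth else 0.0
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    labels = [
        ("/orders/{id}", "idor"),
        ("/admin/users", "missing_role_check"),
        ("/password/reset", "account_takeover"),
        ("/coupons/apply", "business_logic"),
    ]
    findings = [
        ("/orders/{id}", "idor"),
        ("/coupons/apply", "business_logic"),
        ("/health", "idor"),  # decoy endpoint: looks suspicious, is not vulnerable
    ]
    print(score(findings, labels))
```

Reporting precision alongside recall is what lets a benchmark penalize an agent that flags every endpoint it touches.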
This talk will cover our approach to training and evaluating AI-based security testing systems, introduce the Marvin framework to the hacker community, and present a roadmap for advancing automated detection of logic-based vulnerabilities. We'll also discuss how the community can contribute to Marvin and use it to evaluate vendor claims about AI-based security tools.
Taha Biyikli is Co-Founder & CEO of Alkonos, developing AI solutions for complex vulnerability detection. Previously, Taha led cybersecurity assessment teams and has been acknowledged by major organizations including Apple and the U.S. Department of Defense for discovering critical vulnerabilities. A member of Carnegie Mellon's Plaid Parliament of Pwning (PPP), Taha won the MITRE Embedded CTF 2025 with his team and specializes in application security and reverse engineering.