Security BSides Las Vegas 2025

Indexing the Chaos: Extract PII from Ransomware Leaks
2025-08-05, Siena

Modern ransomware attacks no longer just encrypt files—they exfiltrate and leak terabytes of internal corporate documents. These leaks contain unstructured chaos: scanned passports, HR forms, insurance records, and other sensitive data. Yet most breach-checking tools ignore them completely.

This talk presents Have I Been Ransomed? (HIBR), a toolchain and public search engine designed to extract meaningful PII from this mess using OCR and Large Language Models (LLMs). We’ll explore how we crawl these leaks, how we safely extract identifiers without exposing PII, and how LLMs allow us to detect personal data buried deep inside PDFs and image scans. We'll also address the ethical landmines, legal constraints (e.g., GDPR), and our design decisions to avoid becoming a privacy nightmare.

Attendees will walk away with a practical understanding of how to process complex ransomware dump data and build awareness tools responsibly—while seeing live examples of HIBR in action.


The tool was developed as a response to a growing blind spot in breach awareness: unstructured data dumped by ransomware gangs. Traditional tools focus on structured email/password leaks. In contrast, ransomware leaks are a dumpster fire of scanned ID cards, tax records, and resumes, usually dropped on .onion sites or mirror dumps. No one wants to parse that—so I did.

This talk breaks down how I built:

A crawler (breach.house) that collects dump data (ransomware leaks, normal breaches, stealer logs, leads)

A backend pipeline that:

    Ingests mixed-format files (PDF, DOC, images, databases, etc.)

    Uses OCR to extract text from image-based leaks

    Feeds results into a fine-tuned LLM that recognizes contextual PII

A frontend search engine (haveibeenransom.com) that shows only metadata, not PII, and flags where data might have been exposed.
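
The backend steps listed above (ingest mixed formats, OCR image-based files, detect PII) can be sketched roughly as follows. This is a minimal illustration and not the actual HIBR code: the names `route`, `detect_pii`, `process`, and `Finding` are hypothetical, the OCR engine is passed in as a plain callable, treating every PDF as a scan is a simplification, and a simple email regex stands in for the fine-tuned LLM described in the talk.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Assumption: these extension sets are illustrative, not the real pipeline's list.
OCR_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}  # PDFs treated as scans here
TEXT_EXTS = {".txt", ".csv", ".sql"}

# Stand-in detector: the talk uses a fine-tuned LLM; a regex is used here
# only so the sketch runs without any model or external service.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class Finding:
    source: str  # file the identifier was found in
    kind: str    # identifier type, e.g. "email"
    value: str   # normalized identifier


def route(path: Path) -> str:
    """Decide how a file's text should be obtained, based on its extension."""
    ext = path.suffix.lower()
    if ext in OCR_EXTS:
        return "ocr"
    if ext in TEXT_EXTS:
        return "text"
    return "unsupported"


def detect_pii(text: str, source: str) -> list[Finding]:
    """Return the email addresses found in text (placeholder for the LLM step)."""
    return [Finding(source, "email", m.group(0).lower())
            for m in EMAIL_RE.finditer(text)]


def process(path: Path, ocr: Callable[[Path], str]) -> list[Finding]:
    """Run one file through routing, text extraction, and PII detection."""
    kind = route(path)
    if kind == "ocr":
        text = ocr(path)  # hypothetical OCR backend supplied by the caller
    elif kind == "text":
        text = path.read_text(encoding="utf-8", errors="replace")
    else:
        return []
    return detect_pii(text, path.name)
```

Passing the OCR engine in as a callable keeps the sketch independent of any particular OCR library; a real pipeline would also need handlers for DOC files and database dumps, which are omitted here.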

This talk will explain how I implemented protections to comply with privacy law (GDPR Article 6) and prevent misuse. No PII is shown. Users can only search for identifiers (email, passport number) and see where they may have appeared, without downloading any leak.
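
One way a metadata-only lookup like this could work is to index a keyed hash of each identifier instead of the identifier itself. The sketch below assumes that design; the abstract does not say how HIBR stores identifiers, so `MetadataIndex`, `LeakRef`, and the use of HMAC-SHA256 are all illustrative assumptions.

```python
import hashlib
import hmac
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LeakRef:
    leak_name: str  # which dump the identifier appeared in
    file_type: str  # e.g. "pdf", "image"; no document content is kept


class MetadataIndex:
    """Maps keyed hashes of identifiers to leak metadata (hypothetical design)."""

    def __init__(self, key: bytes) -> None:
        self._key = key  # server-side secret; assumed to be kept out of the index
        self._hits: dict[str, set[LeakRef]] = defaultdict(set)

    def _token(self, identifier: str) -> str:
        # Normalize so "Jane@Example.com " and "jane@example.com" match.
        normalized = identifier.strip().lower()
        return hmac.new(self._key, normalized.encode("utf-8"),
                        hashlib.sha256).hexdigest()

    def add(self, identifier: str, ref: LeakRef) -> None:
        """Record that an identifier was seen, storing only its hash."""
        self._hits[self._token(identifier)].add(ref)

    def search(self, identifier: str) -> list[LeakRef]:
        """Return metadata for leaks where the identifier may have appeared."""
        refs = self._hits.get(self._token(identifier), set())
        return sorted(refs, key=lambda r: (r.leak_name, r.file_type))
```

For example, after `index.add("jane@example.com", LeakRef("example-leak", "pdf"))`, a search for the same address returns that `LeakRef`, while the index itself holds only hex digests. A keyed hash is used instead of a plain one so that someone who obtains the index cannot simply hash guessed emails offline without also having the key.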

This tool is open-source (in part) and still under active development. It’s a blend of OSINT, NLP, ethical grey zones, and threat intelligence, all rolled into one live system.

Juanma is a security researcher and developer focused on threat intel tooling and dark web data analysis. He builds open-source tools that turn leaked chaos into structured awareness, with a strong focus on privacy, legality, and responsible disclosure. His current project, Have I Been Ransomed?, is part of a broader mission to make ransomware leak awareness accessible and useful—without exposing the data that bad actors already dumped.
