BSidesLuxembourg 2026

When LLMs Summarize Security Findings: The Tradeoffs You Can’t Ignore
2026-05-07, IFEN room 2, Workshops and AI Security Village (Building D)

LLMs are often presented as a shortcut from “hundreds of findings” to “actionable summary.” In reality, getting useful and trustworthy output is less about a single prompt and more about understanding the knobs you can turn - and what typically happens when you turn them.

This talk uses vulnerability assessment results analysis as a concrete example task, but the goal is broader: a research-style exploration of the design space for LLM-assisted summarization. We’ll map the main control surfaces - goal definition, output constraints, input shaping, model selection, evaluation methods, and cost/latency budgets - and show how changing each one affects faithfulness, specificity, consistency, and failure modes.

The session offers a practical framework for experimenting safely: define measurable requirements, run iterative comparisons, and use structured judging to learn which combinations of knobs move you toward “useful” versus “confidently wrong.” Attendees leave with a repeatable way to reason about tradeoffs and a set of patterns they can apply to other security summarization problems.


Security teams routinely face large vulnerability assessment reports that are rich in detail but hard to operationalize. LLMs look promising for making this information accessible, yet outcomes vary wildly: some summaries are crisp and helpful; others are vague, incomplete, or subtly inaccurate. This session is a research-driven tour of why that happens and what you can control.

The talk is not a “ship this to production tomorrow” story. It is a guide to the experimentation landscape - using vulnerability findings as an illustrative workload - focused on the knobs you can tune and the behaviors you should expect.

The core idea: treat LLM summarization as a system with controllable parameters

We’ll explore six major knob categories:

  1. Task framing (what “good” means)

If you don’t specify the purpose (e.g., executive risk overview vs. remediation triage vs. compliance-oriented highlights), the model will invent its own. We’ll discuss how tight vs. broad goals change output specificity and risk of omission.
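A minimal sketch of what "specifying the purpose" can look like in practice. The framing texts and function names here are illustrative, not a recommended wording:

```python
# Illustrative goal framings (hypothetical wording): making the purpose
# explicit is the first knob -- the same findings produce very different
# summaries depending on which framing the prompt selects.
FRAMINGS = {
    "executive": "Summarize overall risk posture for leadership; no CVE-level detail.",
    "triage": "Rank findings by remediation priority; include affected assets and fixes.",
    "compliance": "Highlight findings relevant to audit controls; cite evidence fields.",
}

def build_prompt(purpose: str, findings_text: str) -> str:
    # Prepend the chosen framing so the model never has to guess the goal.
    return f"{FRAMINGS[purpose]}\n\nFindings:\n{findings_text}"
```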

  2. Output constraints (how the answer must behave)

Word limits, required sections, citation/evidence requirements, and “no new facts” rules are not cosmetics - they change error rates and the model’s tendency to hedge or hallucinate.
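Some constraints can even be enforced deterministically after generation. A sketch, assuming findings carry CVE IDs (the section names and limits are arbitrary examples):

```python
import re

# Hypothetical post-hoc constraint check: flag summaries that blow the word
# budget, skip required sections, or cite CVE IDs absent from the source
# findings ("no new facts", enforced for identifiers we can verify).
def check_constraints(summary: str, source_cves: set[str],
                      max_words: int = 200,
                      required_sections=("Top Risks", "Recommended Actions")):
    violations = []
    if len(summary.split()) > max_words:
        violations.append("over word budget")
    for section in required_sections:
        if section not in summary:
            violations.append(f"missing section: {section}")
    cited = set(re.findall(r"CVE-\d{4}-\d{4,7}", summary))
    invented = cited - source_cves
    if invented:
        violations.append(f"cites CVEs not in source: {sorted(invented)}")
    return violations
```

Checks like these turn "no new facts" from a polite request into a measurable pass/fail signal for at least the identifiers you can match mechanically.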

  3. Input shaping (what the model actually sees)

The strongest lever is often preprocessing: deduplicating repetitive data, normalizing fields, extracting key evidence, compressing large reports into context-friendly representations, and moving deterministic operations (like counting/grouping) outside the model. This reduces failure modes and makes evaluation meaningful.
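A minimal sketch of this kind of preprocessing, with illustrative field names (`cve`, `asset`, `severity` are assumptions about the findings schema):

```python
from collections import Counter

# Preprocessing before the LLM sees anything: deduplicate findings by
# (CVE, asset) and compute severity counts deterministically, so the
# model is never asked to count or group on its own.
def shape_input(findings: list[dict]) -> dict:
    seen = set()
    unique = []
    for f in findings:
        key = (f["cve"], f["asset"])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    severity_counts = Counter(f["severity"] for f in unique)
    return {
        "counts": dict(severity_counts),  # computed, not model-estimated
        "findings": unique,               # compact, deduplicated evidence
    }
```

Because the counts are computed outside the model, a summary that contradicts them is immediately detectable - which is exactly what makes evaluation meaningful.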

  4. Model selection (speed, cost, and capability)

Different models fail in different ways. We’ll cover the practical implications of choosing “fast enough” versus “best possible” and what quality typically degrades first when you optimize for latency/cost.

  5. Evaluation and judging (how you know it improved)

“Looks good to me” does not scale. We’ll outline a lightweight evaluation harness: a rubric that scores faithfulness, completeness, specificity, and usefulness; repeated runs to check consistency; and a structured judging approach to compare variants.
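One possible shape for such a harness, assuming each run has already been scored 1-5 on the four rubric axes (how those scores are produced - human or LLM judge - is left open):

```python
from statistics import mean, pstdev

# Lightweight rubric aggregation sketch: average the four axes per run,
# then use the spread across repeated runs of the same variant as a
# rough consistency signal.
AXES = ("faithfulness", "completeness", "specificity", "usefulness")

def score_variant(runs: list[dict]) -> dict:
    per_run = [mean(r[a] for a in AXES) for r in runs]
    return {
        "mean_score": round(mean(per_run), 2),
        "consistency": round(pstdev(per_run), 2),  # lower = more consistent
    }
```

Comparing variants then becomes comparing two small numbers instead of re-reading two piles of summaries.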

  6. Iteration strategy (how you converge)

Prompt iteration works best when grounded in measurements. We’ll show a “vibe coding” loop that’s still research-minded: change one knob, rerun tests, observe shifts in failure modes, then decide whether the tradeoff is acceptable for the goal.
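The one-knob-at-a-time discipline is simple enough to sketch directly. Here `run_and_score` is a hypothetical stand-in for "run the pipeline at this setting and return its rubric score":

```python
# One-knob sweep sketch: vary a single parameter (e.g. a word budget),
# keep everything else fixed, and record the evaluation score at each
# setting so improvements and regressions are both visible.
def sweep_knob(values, run_and_score):
    results = {}
    for v in values:
        results[v] = run_and_score(v)  # pipeline + rubric at this setting
    best = max(results, key=results.get)
    return best, results
```

Keeping the full `results` dict, not just the winner, is what lets you see *how* the failure modes shifted as the knob moved.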

What attendees will take away

  • A mental model of the main knobs available when applying LLMs to security summarization tasks
  • Predictable “what happens when you turn it” patterns (which tweaks usually help, which create new failure modes)
  • A repeatable experimentation framework for comparing prompts/models/input formats under real constraints
  • A clear tradeoff map: reliability vs. speed vs. cost, plus the engineering consequences of tighter coupling to input structures

While vulnerability assessment results are the running example, the approach generalizes to other security contexts: incident write-ups, alert triage digests, control evidence summaries, and executive reporting.



Andrey Lukashenkov handles all things revenue, product, and marketing at Vulners - a bootstrapped, profitable company committed to providing an all-in-one vulnerability intelligence platform to the cybersecurity community.

Being naturally curious and having a technical background, he leverages unlimited access to the Vulners database to explore various topics related to vulnerability management, prioritization, exploitation, and scoring.