2025-06-19, Secondary stage
In this session, we’ll navigate the intricate landscape of distributed systems and discuss how Chaos Engineering offers a hands-on approach to gaining deeper insights into system behavior. We’ll examine how teams leverage failure injection and error simulation to proactively identify weaknesses and strengthen resilience. From there, we’ll dive into Gameday exercises, where teams deliberately push their systems to the limit to expose hidden resilience gaps. Finally, we’ll reflect on the current challenges of distributed systems and the realities teams face in maintaining resilience at scale.
What if the best way to build resilient systems is to break them, intentionally? This 40-minute talk will challenge our conventional thinking about resilience by diving into the hands-on execution of Chaos Engineering through Gamedays: structured, high-impact events where teams deliberately inject failure to uncover weaknesses before they manifest in production and impact customers. Despite its playful name, a Gameday is anything but a game; it’s a methodical, collaborative, and sometimes nerve-racking exercise designed to stretch the limits of our systems and validate our assumptions.
We’ll explore what it takes to run an effective Gameday: from selecting the right applications and environments to defining steady states and executing controlled, high-value chaos experiments. Attendees will gain insight into how Datadog orchestrates Gamedays, curating participants based on system architecture, aligning on steady-state definitions, and incrementally scaling failure scenarios from isolated latency injections to full-scale zonal disruptions.
Gamedays are a shift in mindset, from preventing failure to preparing for it. We’ll also discuss a crucial evolution: transitioning from manual, ad-hoc failure injection to a scalable, automated platform that enables safe, rapid, and transparent execution. The ultimate goal? To move from fearing failure to leveraging it as a catalyst for resilience. By embracing Chaos Engineering through Gamedays, teams don’t just prevent outages; they gain deep, actionable insights, foster cross-team collaboration, and build a culture where failure isn’t a setback but a stepping stone to resilience.
Aspiring to protect engineering organizations against the downside of unexpected failures and the lack of graceful degradation, I lead a Chaos Engineering team that builds scalable tooling to safely inject failures at scale and facilitates Gamedays to validate resilience hypotheses against reality.