2025-09-27, Venue
In distributed systems, how we handle failure is often more important than how we handle success. This talk challenges the "never give up" mentality by demonstrating why intelligent retreat strategies consistently outperform blind persistence when systems are under stress.
We'll explore the powerful combination of exponential back-off with jitter for managing retries, and explain why quickly "giving up" through strategic load shedding often leads to better overall system health than dogged persistence. We'll show how these complementary approaches can prevent cascading failures, improve user experience during degraded conditions, and help systems recover faster.
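As a rough illustration of the two ideas above, here is a minimal Python sketch rather than the talk's own implementation. It assumes the "full jitter" variant of exponential back-off, where each delay is drawn uniformly between zero and a capped exponential ceiling, and a very simple in-flight counter for load shedding. The names `call_with_retries` and `LoadShedder`, and every parameter and default value, are hypothetical and chosen only for this example.

```python
import random
import time


def call_with_retries(operation, attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on any exception with jittered back-off.

    Assumption: 'full jitter', i.e. each delay is uniform in
    [0, min(cap, base * 2**attempt)). All names and defaults are illustrative.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:  # a real client would retry only retryable errors
            last_error = error
            if attempt == attempts - 1:
                break  # out of attempts: give up instead of sleeping again
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)  # randomised so clients do not retry in lockstep
    raise last_error


class LoadShedder:
    """Reject new work immediately once too many requests are in flight.

    Deliberately simplistic and not thread-safe; a sketch of the idea only.
    """

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self):
        # Fail fast: a quick rejection is cheaper than a slow timeout.
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1
```

In this sketch the client side spreads its retries out over time while the server side refuses work it cannot finish promptly, so each approach reduces the pressure that the other has to absorb. A caller whose `try_acquire()` returns `False` would typically return an "overloaded" response at once rather than queueing the request.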
Moshe has been a DevOps/SRE since before those terms existed, caring deeply about software reliability, build reproducibility, and other such things. They have worked in companies as small as three people and as big as tens of thousands—usually someplace around where software meets system administration.