2023-05-30, Music Hall
What causes website outages? Even with all the best practices, your site may fail due to bad assumptions from leaky abstractions. This talk looks at what an outage taught us about leaky abstractions.
Video: https://youtu.be/aQSe5QjLGuo
Introduction
Django developers like their websites to stay up. But despite our best efforts, outages do sometimes happen. In this talk I'll look at an actual production outage, and identify the separate elements that contributed to the outage. I'll find the faulty assumptions — leaky abstractions — behind those elements. Then I'll look at what we did about our leaky abstractions.
Abstractions
Abstraction is one of the most powerful concepts in computing. We use layers of abstractions to limit the amount of detail we need to consider at once. Django uses abstractions everywhere: the ORM, the request/response cycle, and so on. But abstractions are inevitably leaky: they conceal details that actually turn out to be relevant. And that can lead to outages.
Outages and the Swiss cheese model
The Swiss cheese model of accident causation tells us that accidents (or outages) are rarely caused by a single failure: it takes a sequence of events to cause an accident. Furthermore, if any one of those events had not happened, the accident wouldn't have resulted.
In modern computing systems, critical events leading to outages are usually at the software level, rather than involving physical hardware. Good development practices help to prevent risky code from getting to production. But relying on leaky abstractions in how we think about systems may lead to trouble.
An outage case study
I'll describe an actual outage that occurred, and identify which contributing issues could be attributed to leaky abstractions. As we explore those abstractions, we'll dive deeply (but briefly) into:
* Django database routing,
* locking and transactions in PostgreSQL,
* TCP networking, and
* process termination.
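As a rough illustration of the first item, here is a minimal sketch of a Django database router. It is not the router from the outage described in the talk: the class name `PrimaryReplicaRouter` and the `"replica"` alias are hypothetical, and the sketch assumes a project with one primary database (`"default"`) and one read replica. The method names and signatures are those Django expects of a router.

```python
class PrimaryReplicaRouter:
    """Hypothetical router: send reads to a replica and writes to the primary."""

    def db_for_read(self, model, **hints):
        # Assumption: a database alias named "replica" exists in settings.DATABASES.
        return "replica"

    def db_for_write(self, model, **hints):
        # All writes go to the primary database.
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases hold the same data, so relations between objects are allowed.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Run migrations only against the primary; the replica receives
        # schema changes through replication.
        return db == "default"
```

A router like this would be enabled by listing its dotted path in the `DATABASE_ROUTERS` setting. The abstraction it offers ("reads and writes just go to the right place") can leak: with asynchronous replication, for example, a read issued straight after a write may be served by a replica that has not yet caught up.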
Filling in the holes in the Swiss cheese
I'll look at how we prevented the outage occurring again by addressing the contributing issues. The first step was to acknowledge the leakiness of an abstraction. Then we made changes to prevent anyone else from falling victim to the leaky abstraction.
What you should take away from this talk
What leaky abstractions might affect your code? Thinking about leaky abstractions may help prevent outages, so let's do it!
Tim has been a system administrator and, for the past 8 years, a Python/Django developer. He works for Kraken Technologies Australia, part of the Octopus Energy group.
Picture: Mark Hawkins for PyCon UK (adapted); CC BY 2.0