2024-11-14, Auditorium
“You’re not Google, and neither are we!” is a common response when someone mentions SRE, or site reliability engineering. The term has gained a reputation as an unattainable practice reserved for hyperscale software operations, but we’re going to rectify that! In this talk, we walk through a more practical approach to keeping your software services running, rain or shine.
The goal of this talk is to give the audience practical insight into ways of increasing and ensuring the reliability of their software systems. We introduce the concept of SRE, along with its history and common criticisms of it. We then walk through some practical examples of reliability practices, including:
What to monitor, why, and how
What service level agreements are and how to use them
How to look at your service from an on-call perspective (what can break and how we can guard against it)
How load testing can be used both continuously and as a way to learn how your software behaves in different scenarios
Each of these topics will be illustrated with my real-life experiences and incidents I have seen while working as a backend engineer and doing on-call shifts.
Magdalena Stenius is a backend engineer specializing in machine learning use cases. She has a long track record of both building tooling for ML engineering and working with large-scale recommender models, and most of her career has been spent in the Python ecosystem. She is currently a backend engineer and on-call engineer at Wolt.