Predicting the Lifespans of Internet Services: Falling down the ML Rabbit Hole, and What We Learned From The Thud, Security BSides Las Vegas 2025

2025-08-05 17:00–17:45, Siena

Last year, we learned a key truth: not everything on the Internet is forever, and there is far more variability in host lifespan across different ports, protocols, and networks than we initially thought. Today, we’re going to focus on how we moved beyond descriptive analyses to ask the next natural question: given all this variability, how can we actually predict the lifespan of a host?

In this talk, I invite participants to dive down the ML rabbit hole with me. I’ll walk through how our research questions evolved, where our early methods failed, and what we learned from those failures to finally arrive at a practical solution. While ML has improved many aspects of our lives, applying it to problems in niche, high-noise areas like security and Internet-wide measurement is not always straightforward. With the right tweaks and persistence, we found a path forward. I hope that audience members walk away with a better understanding of some of these ML pitfalls, as well as a way to think about applying ML to their own similarly gnarly problems, using our case study as an example.

A key question in Internet-wide scanning research is “When should I scan this entity again?” I explain why descriptive analyses (presented last year!) are insufficient for finding trends at Internet scale, and why a more methodical approach using ML techniques is a better way to tackle this question. I go over the promises of ML and what we faced in reality at each step of the way. While we were ultimately successful in applying ML techniques to our use case, the experience shows that you can’t always just throw an ML model at a problem naively, especially when there are so many contextual aspects to account for and you need to rework your outputs and expectations to match a more realistic model. Specifically, my talk will cover the following:

1) How did we get here?
- Last year we were like WOAH, lots of differences! But applying those findings in practice meant shifting the question to “can we predict the lifespan of a service?”, so that we can predict when to scan it again.
2) What were the promises of ML?
- ML models would help with prediction, and also surface interesting facets such as feature importance (should we be scanning based on port, or port and some other variable?).
- We tried some straightforward methods based on our inputs and outputs and immediately ran into some crazy and gnarly problems.
3) Taking a step back – what do we need, and what do we have?
- We have a highly multidimensional categorical dataset that we really cannot change.
- We really want to know when we should rescan something, or at least get a gradient of “scan these more, scan these other ones less”.
4) Reframing the question and recognizing the aspects we couldn’t change led us down a new path
- Can we predict ephemerality? That would let us bucket hosts into those we need to rescan more frequently and those we don’t.
- Yes!! We can.
5) Now that we found a model that worked for us, we discuss evaluation and metrics
- Typically you focus on things like precision, recall, and F1 scores, and we see some variance in those that is not unexpected given the output data (walk through this example)
- In practical settings, we might want to reframe our metrics to be
- We also show which features are most important to the prediction, which turned out slightly different from what we hypothesized going into the problem, though not wholly unexpected
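
As a rough illustration of items 4 and 5 (not the model, data, or threshold used in the talk), the reframing from "predict a lifespan" to "predict ephemerality" can be sketched as turning each observed lifespan into a yes/no label and then scoring predictions with precision, recall, and F1. The one-day cutoff, the lifespans, and the predictions below are all made-up assumptions, and the helper names are hypothetical.

```python
from collections import Counter

# Assumption for illustration only: a service observed for less than
# one day counts as "ephemeral". The talk does not state a cutoff.
EPHEMERAL_MAX_DAYS = 1.0


def label_ephemeral(lifespan_days, threshold=EPHEMERAL_MAX_DAYS):
    """Turn an observed lifespan (in days) into a binary ephemerality label."""
    return lifespan_days < threshold


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (ephemeral) class."""
    counts = Counter(zip(y_true, y_pred))
    tp = counts[(True, True)]    # ephemeral, predicted ephemeral
    fp = counts[(False, True)]   # long-lived, predicted ephemeral
    fn = counts[(True, False)]   # ephemeral, predicted long-lived
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up lifespans (days) and made-up model outputs, for illustration.
observed_lifespans = [0.2, 0.5, 30.0, 0.9, 12.0, 0.1, 45.0, 3.0]
y_true = [label_ephemeral(days) for days in observed_lifespans]
y_pred = [True, False, False, True, True, True, False, False]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a rescanning setting, the two error types cost different things: a false positive means an unnecessary extra scan, while a false negative means a short-lived service may be gone before the next scan. That asymmetry is one reason the headline metrics might be weighted differently in practice.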

Ariana Mirian

Ariana Mirian currently works as a senior security researcher at Censys, where she uses Internet Measurement to answer interesting security questions. Prior to Censys, she received her PhD from UCSD, where her thesis focused on answering the question: how can we use large scale measurement and analysis to better prioritize security processes? When not geeking out about Internet Measurement and security, Ariana is also an avid aerialist and birder.
