2025-10-02, Robert Faure Amphitheater
Language: English
Sampling lies at the heart of many statistical methods and procedures. Beyond the traditional sampling from finite populations available in StatsBase.jl, there are many other techniques that become relevant in simulation scenarios where performance and memory consumption are critical. In such cases, packages implementing sampling methods for streaming data can be particularly advantageous for scalable and efficient simulations.
For applications that handle high-velocity data, such as real-time monitoring of network traffic or sensor feeds in IoT deployments, the classical paradigm of loading all observations into memory before sampling can become infeasible. Streaming algorithms overcome this bottleneck by maintaining only a compact summary of the data seen so far and updating the sample incrementally as each new element arrives.
We will present a package that extends Julia's sampling tools in this direction: StreamSampling.jl. The package focuses on techniques for sampling from large or unbounded data streams that may be too large to store in memory, providing methods such as reservoir sampling and other streaming algorithms that ensure efficient and representative sampling in this setting.
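To illustrate the idea behind the reservoir sampling technique mentioned above, here is a minimal sketch of the classic Algorithm R in plain Julia. This is an illustration of the general approach, not the API or implementation of StreamSampling.jl itself: it keeps a fixed-size reservoir of `k` elements while consuming the stream once, using O(k) memory regardless of stream length.

```julia
using Random

# Classic reservoir sampling (Algorithm R): a single pass over `itr`,
# keeping each element in the final sample with probability k/n.
function reservoir_sample(itr, k::Int; rng=Random.default_rng())
    reservoir = Vector{eltype(itr)}()
    for (i, x) in enumerate(itr)
        if i <= k
            push!(reservoir, x)        # fill the reservoir with the first k items
        else
            j = rand(rng, 1:i)         # replace a random slot with probability k/i
            if j <= k
                reservoir[j] = x
            end
        end
    end
    return reservoir
end

# Draw 5 elements from a stream of a million values without storing the stream.
s = reservoir_sample(1:1_000_000, 5)
```

The production algorithms in the package are more sophisticated (e.g. skipping ahead instead of drawing a random number per element), but the one-pass, bounded-memory structure is the same.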
Interestingly, by eliminating the need for an initial memory-allocating pass to collect the elements, some of these methods have also shown performance benefits when sampling from generic iterators, which would otherwise need to be materialized in memory under a traditional approach.
I’m a Junior Researcher at the CENTAI Institute, specializing in statistical and simulation methodologies.