Sampling From Streams: Reservoir Sampling & VIRBs
Jonathan Arfa (~jonathan) |
Those of us who use Python to build systems that ingest constant streams of incoming data (which includes anybody whose code touches the internet) often need algorithms that keep a fixed-size sample from the stream for on-the-fly analytics. At Magnetic we are engaged in real-time bidding for online advertising, which also requires extremely low-latency Python code.
Reservoir sampling is a commonly used algorithm for keeping a sample that's unbiased over all events in the stream. But what if you want your algorithm to keep a representative sample from the past 10 minutes instead of the entire stream?
In this talk I'll start with a gentle introduction to reservoir sampling for those who are not familiar with it, and then discuss several variants that keep samples which are biased towards more recent events. One of these is the VIRB (Variable Incoming Rate Biased) sampler, which we have developed in-house at Magnetic.
The audience should be able to read and understand basic Python code. No background is required in Statistics or sampling algorithms, but it might make the talk more interesting.
Jonathan Arfa is a Data Scientist at Magnetic, an advertising technology firm in New York City. He has an M.S. in Statistics from UCLA.