Tagged: algorithms

Probabilistic Reactive Extensions: ProbablyDistinct

Reactive Extensions has an operator called Distinct. As data streams through, it filters out any item it has already seen, allowing only unique items to pass to OnNext. The problem is that Distinct stores every unique item in a HashSet, which means it consumes a lot of memory when the number of unique items is large. The solution is to implement a ProbablyDistinct operator that uses a Bloom filter to track the items it has seen. A Bloom filter is a very compact data structure that tests each item and replies "item is probably in the set" or "item is definitely not in the set". This means there will be some false positives (it says an item is in the set when it actually isn't), so a few genuinely unique items will be dropped. But that's a tradeoff some applications might accept, particularly long-running server apps that see lots of unique data items.
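To make the tradeoff concrete, here is a minimal Bloom filter sketch in Python (not the Rx implementation). The k hash functions are simulated by salting Python's built-in hash(); the class and method names are illustrative, not from any library.

```python
class BloomFilter:
    """Minimal Bloom filter: compact set membership with false positives."""

    def __init__(self, size_bits=1024, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        return [hash((salt, item)) % self.size for salt in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def probably_contains(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("alice")
print(bf.probably_contains("alice"))  # True (it was definitely added)
print(bf.probably_contains("bob"))    # False, barring a very rare false positive
```

Note the asymmetry: an item that was added always tests positive, but an item that was never added can (rarely) collide on all k bit positions and test positive too. That asymmetry is exactly what ProbablyDistinct would inherit.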

More broadly, it would be useful to have a few Rx operators built on probabilistic data structures. For example, HyperLogLog can estimate the number of distinct items in a huge stream using very little storage. Another useful operator is the opposite of Distinct (Indistinct? SeenItBefore?), which passes along only items that have probably been seen before. That is how most people use a Bloom filter: rather than doing an expensive DB lookup for every key, the application first checks the filter, and only runs the query if the key is probably in the DB.
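The check-the-filter-before-the-lookup pattern can be sketched like this. Assume the filter has been pre-loaded with every key known to be in the store; a negative answer lets us skip the lookup entirely. The `db` dict, the one-byte-array filter, and all function names here are illustrative stand-ins, not a real database API.

```python
# A tiny bit-array filter (three salted hashes per key, as a sketch).
SIZE = 1 << 16
bits = bytearray(SIZE // 8)

def _pos(key, salt):
    return hash((salt, key)) % SIZE

def filter_add(key):
    for salt in range(3):
        p = _pos(key, salt)
        bits[p // 8] |= 1 << (p % 8)

def maybe_present(key):
    # False means "definitely not stored"; True means "probably stored".
    return all((bits[_pos(key, s) // 8] >> (_pos(key, s) % 8)) & 1
               for s in range(3))

db = {"user:1": "Ada", "user:2": "Grace"}  # stand-in for the real store

def get(key):
    if not maybe_present(key):
        return None          # definite miss: skip the expensive query
    return db.get(key)       # probable hit: do the real query

for k in db:
    filter_add(k)

print(get("user:1"))   # Ada
print(get("user:99"))  # None, and almost always without touching the DB
```

A false positive here only costs one wasted query; a correct negative saves one, which is the whole point for lookup-heavy workloads.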

Here’s the source for Distinct. All you have to do is replace the HashSet with a Bloom filter (plus minor code tweaks).

    private static IObservable<TSource> Distinct_<TSource, TKey>(IObservable<TSource> source,
        Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer)
    {
        return new AnonymousObservable<TSource>(observer =>
        {
            var hashSet = new HashSet<TKey>(comparer);
            return source.Subscribe(
                x =>
                {
                    var key = default(TKey);
                    var hasAdded = false;
                    try
                    {
                        key = keySelector(x);
                        hasAdded = hashSet.Add(key);
                    }
                    catch (Exception exception)
                    {
                        observer.OnError(exception);
                        return;
                    }

                    if (hasAdded)
                        observer.OnNext(x);
                },
                observer.OnError,
                observer.OnCompleted);
        });
    }
Cache oblivious data structures for free

Databases lay out tables on disk carefully because disk I/O is insanely slow relative to memory. Scientific programmers organize data structures (DTs) carefully to maximize cache utilization because memory is insanely slow relative to the cache. Since writing cache-conscious (CC) DTs is hard, the folks in MIT’s Cilk group came up with cache-oblivious (CO) DTs. This post explains the idea, albeit verbosely. The basic idea is to pack data into arrays so that nearby items load into the cache together. These guys show that CC still beats CO by a mile, but CC is soooo difficult to implement. Some dudes at IBM show that a garbage collector can improve performance on some benchmarks simply by improving memory locality. It seems to me that if the GC documentation told me it will usually group objects in a particular order, then I could write my DT code to take advantage of that locality; the GC would be constructing CO DTs for free. But since I don’t know how the GC groups things, I could inadvertently be jumping all over main memory and suffering zillions of cache misses. That ain’t cool.
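The "pack it into an array" idea in miniature: a complete binary search tree stored implicitly in a list, with the children of slot i at 2i+1 and 2i+2, so a search walks a dense array instead of chasing heap pointers. This is an illustrative sketch only; a genuinely cache-oblivious layout would use the van Emde Boas ordering rather than this simple breadth-first one.

```python
def search(tree, target):
    """Search an implicit (breadth-first, array-packed) binary search tree."""
    i = 0
    while i < len(tree):
        if tree[i] == target:
            return True
        i = 2 * i + 1 if target < tree[i] else 2 * i + 2
    return False


# Breadth-first layout of a BST over 1..7:      4
#                                            2     6
#                                           1 3   5 7
tree = [4, 2, 6, 1, 3, 5, 7]
print(search(tree, 5))  # True
print(search(tree, 8))  # False
```

The point is that the entire structure lives in one contiguous allocation, so the layout, and therefore the cache behavior, is under the programmer's control rather than the allocator's or the GC's.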