Bloom Filters: A probabilistic data structure for testing set membership

Bloom Filter

Bloom Filter

Bloom Filters

Bloom filters have been around for a while, 1970 to be exact. They’re not exactly the star of any university data structures class. There are much “sexier” (yeah, i know.) data structures like self balancing search trees, hash tables, and graphs. But hidden away in the depths of Algorithms and Data Structures books are fleeting mentions of “Bloom Filters”, described as a “probabilistic data structure” (more on this later). And this exotic and lesser known data structure finds its self being used by some of the most commonly employed software of the day.

Google Search makes heavy use of bloom filters, one area is filtering out “bad actors” – pages serving up malicious content. You see bloom filters are really good at doing one thing QUICKLY – set membership. Kind of.

Probabilistic Data Structures

Bloom Filters fall into that sketchy, not-well-defined area called “probabilistic data structures”. I mean not well defined, because depending on the data structure in question, the definition of “probabilistic” changes. Skip Lists are considered a probabilisitic data structure, but for an entirely different reason

What’s meant by probabilistic in regards to bloom filters, is that they will tell you if an item is a member of a set. Probably. You see, bloom filters forsake accuracy for speed, albeit by erring on the safe side. They will tell you with some degree of probability if an item is in the set, or it will tell you if the given item is DEFINITELY not in the set.

Due to the nature of how they’re applied, these “false positives but hard negatives” justifies their use. Often they are operated in time constrained environments. Being able to determine if a website is on the “bad actor” list before the user is forwarded to the page for example. There is one more caveat to bloom filters: Once an item is added to the filter, there is no realistic way of removing it without possibly effecting other items in the filter.

How?

At its most basic, a bloom filter works by using a “kind-of” hash table. They rely on the ability to hash values QUICKLY, thus not just any hash function will do in a production an environment. They also rely on multiple hash functions. They define a boolean array with all values set to false. When an item is added to the bloom filter list, a hash key (several of them, actually) is computed and the array slot at the index calculated is set to “true”. What makes this a probabilistic data structure is two fold albeit related issue: It’s mapping keys to boolean values, so it cant efficiently verify if the key that maps to that value is a true match or a hash collision, as there is no hash collision strategy. This where the false positives come from, but also why it can guarantee negative results.

By manipulating the size of the table compared to the number of entries, as well as employing multiple hash functions – the key must return true for every index it hashes to for a positive – we can keep the probability of a false negative acceptably small.

Implementation

Now that we understand how a bloom filter works, it’s time to see one in action. For this simple demonstration i’ll be using two hash functions and a vector of type bool.

As mentioned above the probability of a false positive is determined by the likelihood of a hash collision. The more hash functions utilized combined with a table of sufficient size will keep this possibility acceptably low. Another related data structure utilizes the same concept as a Bloom Filter is the Cuckoo Table, which uses Cuckoo hashing instead.

It is important to remember that whatever hash functions you choose must meet the following criteria: it must be FAST, and it must keep the probability of hash collisions as low as possible. When utilizing cyclic hashing like i do with hashf1() below, the choice of the shift can effect the number of collisions by a HUGE margin. The hash functions i use below are NOT suitable for a production environment and only used for examining the bloom filter concept.

I consider the above code to be practically self documenting. It is very clear to see what is going on. Insert works by taking in a string and calculating array indexes through the two hash functions, and setting those array positions to true. checkFilter() works similarly except instead of setting the value of the array, it returns the value. If both indexes calculated are true, it returns true – possibly present. Otherwise it returns false – Definitely NOT present.

Test Driving our Bloom Filter

For testing the filter we will add a number of strings to the filter, and then run check filter on a list of strings containing some of the values we added, and some of the values which we did not – it’s these values that may possibly produce false positives. If everything went well, strings we know we added should never return “not present”

[maxs-MacBook-Pro]% ./bloomFilter 
Algorithms Added to the bloom filter
Bloom Filter Added to the bloom filter
Donald Knuth Added to the bloom filter
Darth Vader Added to the bloom filter
Object Oriented Programming Added to the bloom filter
The Singularity Added to the bloom filter
LispMachine Added to the bloom filter
NASA Jet Propulsion Lab Added to the bloom filter
plattypus Added to the bloom filter
Dark Wing Duck Added to the bloom filter

10 strings added to bloom filter.

maxgcoding.com is definitely NOT in the filtered list.

C++ is definitely NOT in the filtered list.

Data Structures is definitely NOT in the filtered list.

Algorithms is most likely in the filtered list.

Bloom Filter is most likely in the filtered list.

Donald Knuth is most likely in the filtered list.

Darth Vader is most likely in the filtered list.

Object Oriented Programming is most likely in the filtered list.

Captain Picard is definitely NOT in the filtered list.

Artificial Super Intelligence is definitely NOT in the filtered list.

The Singularity is most likely in the filtered list.

The Gettysburgh Address is definitely NOT in the filtered list.

LispMachine is most likely in the filtered list.

NASA Jet Propulsion Lab is most likely in the filtered list.

XKCD is definitely NOT in the filtered list.

Two Programmers, One Compiler is definitely NOT in the filtered list.

Death Star is definitely NOT in the filtered list.

Lord Voldermort is definitely NOT in the filtered list.

plattypus is most likely in the filtered list.

Dark Wing Duck is most likely in the filtered list.
[maxs-MacBook-Pro]%

Leave a Reply

Your email address will not be published. Required fields are marked *