Last updated May 25, 2019

What Is Bloom Filter And How Useful It Is For Data Science

Share

Published on May 25, 2019

by Ram Sagar

Bloom filter, an ingenious data structure invented 50 years ago by Howard Bloom to test whether a certain elements belongs to a certain set.

There are many applications of bloom filters such as detecting malicious URLs on browsers, wallet synchronisation for cryptocurrencies, reducing datacenter workloads, recommendation system.

Knowledge of hash function prior to knowing Bloom Filter would be of great use here. But, in short, hash functions can be thought of as code given to each data item. The data item can be anything. It can be a string saying Hello World and there will be a corresponding code. This code is used as an index to share the information in a secure way and to retrieve information from memory core. This code tells the computer what to look for and where to.

Bloom filters have a number of hash functions, which are used to set bits in the bit array. The bit positions are determined by the hash functions.

As stated earlier, bloom filter was invented to check whether a set contains certain elements.

When a question is asked to check whether certain data item exists, then the hashed index or code(unique ID) corresponding to that item is checked. If the filter contains all the parts of the code with that of item then it can be concluded that the data item exists.

There can be false positives from time to time, and if you don’t need to remove items, it’s a great choice. A Bloom filter has constant time complexity for both adding items and asking whether they are present, and it requires very little space relative to the size of the items needed to store and check. It gets these properties in large part because it is based on hash functions.

Bloom Filter For Recommendation System

The use of machine learning for recommending movies based on user’s history and techniques like collaborative filtering has been widely documented and companies like Netflix and Amazon employ them on a large scale. In the case of Medium, one of the widely popular publishing platforms which has more than 30 million users every month.

For Medium, a simple method for suggesting a story might be “because you follow the author.” A more complex one might be collaborative filtering, which can be described as “people who recommended lots of the stories that you recommended also recommended this story”.

Medium has thousands of blog posts and each of these has an ID and is stored in a database table, but the table is so big, and accessed so frequently, that it can’t fit on one machine. The database takes the ID, hashes it to a unique value, and uses that to jump straight to the record location like we discussed earlier.

So, when a certain story is recommended to a user, the algorithm needs to be sure whether it has been suggested before or had already been read. read story is stored as a data item along with many other data and it is inefficient to call all these to verify quick actions. This is where Bloom filter comes into play. It is a mechanism that is particularly good at answering the question “has this been seen before?”.