On his site Sherif Ramadan has posted an interesting tutorial covering implementing bloom filters in PHP. Bloom filters are data structures that make it easier to determine if something is a member of a set.

Let's imagine you have built a music app like Spotify. You've finally grown this thing to sizeable amount of users and you have a decent number of titles in your content library. Let's also say this app has social elements to it so your users can connect with their facebook friends or twitter followers. Now, let's say each time your users play a song in your app you want to ask the question Which of this user's friends have NOT listened to this song yet? The intention being that you may recommend that song to them if they haven't listened to it.

One solution to this problem is to use a data structure known as a bloom filter. A bloom filter is basically a very space-efficient hash set with probabilistic tendency. If you aren't familiar with a hash set or sets in general, let's do a quick review of what they mean.

He goes on to explain what a bloom filter is and how it differs from normal sets, hash sets and hash maps. He then introduces some of the basic concepts involved in creating and using bloom filters. To help make things clearer, he provides a "contrived example" using lightbulbs and the probably that they've been turned on. From there he starts to get into something more practical, something more in the world of PHP. He includes a basic Bloomfilter class example and some of the results (performance) of using it over something like in_array (especially for large data sets).

In this new post to his blog Joshua Thijssen describes something that can help when processing large amounts of data (like, in his example, the text of a book) to search through the information and find if a certain piece of data is in the set - a bloom filter.

Most of my co-workers never really heard of bloom filters, and I’m continuously need to explain what they are, what their purpose is and why it’s a better solution than other ones. So let’s do an introduction on bloom filters. [...] Bloom filters have the property of being exceptionally fast AND exceptionally small compared to other structures but it comes with a price: it MIGHT be possible that our bloom filter thinks that an element is inside our set, when it really isn’t. Luckily, the reverse is not possible: when a bloom filter says something is NOT in the set, you are 100% sure that it isn’t part of the set.

He explains how the filter works, noting how it's better for memory consumption and how it's possible for it to give a "maybe" response instead of ab absolute "yes" or "no". He also points out a PHP extension, bloomy that takes the hard work out of it for you.

A Bloom filter is a probabilistic data structure that can be used to answer a simple question, is the given element a member of a set? Now, this question can be answered via other means, such as hash table or binary search trees. But the thing about Bloom filters is that they are incredibly space-efficient when the number of potential elements in the set is large.

The filters allow false positives with a defined error rate - it gives the "yes" or "no" answer based on the content and you, the developer, decide if that answer falls within a rate that's okay for you and your app. The filters also take the same amount of time to look up items no matter how many are in the set.

He includes an example of the extension in use - defining the number of elements, the false positive allowance and adding/searching data and how the responses would come back from the checks.