Powering social feeds and timelines with Elasticsearch

Campus Discounts founder and CTO Don Omondi talks about how and why the company evolved from MongoDB and Redis to Elasticsearch to power its user recommendation feeds.

Campus Discounts is a social network where students find and recommend discounts posted by businesses near campus. We have a worldwide list of campuses with their geographic location. Businesses create pages and post discounts tagging campuses near them. Students can then view their campus page and find discounts nearby.

If students sign up (it’s free), they can select product categories of interest and connect with fellow students through the buddy system. When a student’s friend recommends a discount that falls within the student’s categories of interest, the student is notified and sees it in their feed.

Our data is classified into two types. Primary data is our core data: users, pages, apps, discounts, countries, recommendations and campuses. It's stored in a MariaDB RDBMS. Secondary data is derived from actions on primary data, such as likes, comments, follows, ratings, reviews and friendships, and is stored in a MongoDB database.

Our MariaDB tables typically look like this:
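The original schema isn't reproduced here, so below is a hedged, simplified sketch of the two tables discussed next (the production tables are MariaDB; this uses SQLite via Python's stdlib purely for illustration, and every column name is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical simplified schema: the recommendations table holds a
# (type, recommendation_id) pointer into one of three per-type tables,
# of which discount_recommendations is shown here.
conn.executescript("""
CREATE TABLE discount_recommendations (
    id          INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,  -- the original recommender
    discount_id INTEGER NOT NULL,
    created_at  TEXT NOT NULL
);

CREATE TABLE recommendations (
    id                INTEGER PRIMARY KEY,
    sharer_id         INTEGER NOT NULL,  -- equals the recommender for originals
    type              INTEGER NOT NULL,  -- 1=discount, 2=page, 3=app
    recommendation_id INTEGER NOT NULL,  -- points into the per-type table
    created_at        TEXT NOT NULL
);
""")
```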

The recommendations table column "type" can have a value of 1 to represent a discount_recommendation, 2 for a page_recommendation or 3 for an app_recommendation.

When User A recommends a discount (e.g. "Save Big, 25% Off Men's Leather Shoes"), it’s saved in the discount_recommendations table and pointers to it are saved in the recommendations table. This will appear in the feeds as, for example, "User A recommends Save Big, 25% Off Men's Leather Shoes."

If User B sees this recommendation and shares it, no new entry is made in the discount_recommendations table. Instead, a new entry in the recommendations table is made with the sharer as User B, but with data pointing to the exact recommendation made by User A. So, all comments and likes are tied to the original discount_recommendation.

When building the UI, if the recommendation's sharer ID does not match the recommender ID, the feed item will appear as "User B shared User A's recommendation Save Big, 25% Off Men's Leather Shoes".
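That display rule can be sketched as a tiny helper (the function name and signature are illustrative, not our actual UI code):

```python
def feed_item_text(sharer: str, recommender: str, title: str) -> str:
    """Render a feed entry; a share is any recommendations row whose
    sharer differs from the original recommender."""
    if sharer == recommender:
        return f"{sharer} recommends {title}"
    return f"{sharer} shared {recommender}'s recommendation {title}"

print(feed_item_text("User B", "User A", "Save Big, 25% Off Men's Leather Shoes"))
# → User B shared User A's recommendation Save Big, 25% Off Men's Leather Shoes
```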

Normalization in RDBMS – The Good

This model highlights the strength of normalization in relational databases. For example, a discount recommendation is joined to a discount, which is joined to a category and a page, the latter of which is joined to a country. The same discount recommendation is also joined to a user (the recommender), who is joined to a country as well as a campus. A campus is, in turn, also joined to a country.

This means that a user who has set "Men's Shoes" as a category of interest could have a feed entry such as:

discount recommendations for "men's shoes" – a category join

by my friends – a join on user_id

in my campus – a join on campus

by businesses – a join on page

in my country – a join on country

near my campus – a join by location.

That's 6 joins and it easily could be more.
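In SQL terms, hydrating a single feed item might look roughly like the query below. This is only a sketch of the shape of such a query; every table and column name is an assumption, not our production schema:

```python
# Hypothetical per-item hydration query with the six joins listed above.
FEED_ITEM_SQL = """
SELECT d.title, u.name, p.name, c.name
FROM discount_recommendations dr
JOIN discounts  d   ON d.id   = dr.discount_id    -- the discount itself
JOIN categories cat ON cat.id = d.category_id     -- category of interest
JOIN pages      p   ON p.id   = d.page_id         -- the business
JOIN users      u   ON u.id   = dr.user_id        -- the recommender
JOIN campuses   cmp ON cmp.id = u.campus_id       -- the recommender's campus
JOIN countries  c   ON c.id   = cmp.country_id    -- the campus's country
WHERE dr.id = ?
"""
```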

Normalization in RDBMS – The Bad

RDBMS joins are among the biggest performance killers at scale, especially when multiple joins are chained. In our case, the user feed contains 3 types of recommendations arranged in a single chronological order. The queries span 3 different tables, so the items are fetched one by one.

A typical feed fetch of this nature would look like this:

Find my friend ids (1 query)

Find my interests i.e. category ids (1 query)

Find my latest 20 friend recommendations of interest (1 query)

Populate feed by fetching each of the 20 recommendations one by one i.e. (20 queries of 6 joins each!)
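The read path above can be sketched as follows; `db.query` is a stand-in for whatever data-access layer issues one SQL statement, not a real API:

```python
def fetch_feed(db, user_id, limit=20):
    """Sketch of the 23-query feed read path described above."""
    friend_ids = db.query("friend ids", user_id)        # query 1
    category_ids = db.query("interest ids", user_id)    # query 2
    pointers = db.query("latest recommendations",       # query 3
                        friend_ids, category_ids, limit)
    # One 6-join hydration query per pointer: 20 more queries.
    return [db.query("hydrate", p) for p in pointers]
```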

Almost all of our data is persisted asynchronously using RabbitMQ, so users are oblivious to whether a task took 1 second, 15 seconds, or 5 minutes. Hence, writing to the DB was not the issue; reads were.

MongoDB - Redis Hybrid

We initially tried to solve this problem by caching, so we created a temporary store for each user's timeline and feed in MongoDB and kept this store in RAM for fast retrieval.

We have two collections: user_feeds and user_timelines. Each collection stores one document per user, with _id set to the user's id. When a user makes a recommendation, that data is cached as an embedded document in the user's timeline.

Simultaneously, we update the feeds of those friends whose interests match the discount category of the new recommendation; their MongoDB documents get the new entries as embedded documents. This is a Push-on-Change strategy, where each document contains a cache of the recommendations table and is capped at a maximum of 200 embedded documents.
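A capped Push-on-Change fan-out maps naturally onto MongoDB's `$push` with `$each`/`$sort`/`$slice`. This is a hedged sketch of the update body (field names are assumptions), shown as the plain dict one would pass to an update call:

```python
# Hypothetical fan-out update for one friend's feed document: append the
# new entry, sort newest-first, and trim to the newest 200 embedded docs.
new_entry = {"recommendation_id": 987, "type": 1,
             "datetime": "2015-06-01T12:00:00Z"}

update = {
    "$push": {
        "feed": {
            "$each": [new_entry],
            "$sort": {"datetime": -1},  # newest first
            "$slice": 200,              # keep at most 200 embedded docs
        }
    }
}
```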

Still, there were expensive queries on the discount_recommendations, page_recommendations, and app_recommendations tables. Ideally, each query result would be embedded along with the recommendations table data, but this would lead to unsustainable data duplication.

Instead, we cached each result of the 6-join query in Redis, setting the key to a hash of the respective ID. A user_feed document looked like this:
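The original document isn't reproduced here, but its shape was roughly the following: one _id per user, a capped list of lightweight pointers, and each heavy 6-join result cached in Redis under a hash of the recommendation's ID. Field names and the key scheme are illustrative assumptions:

```python
import hashlib

# Illustrative user_feeds document: pointers only, no heavy joined data.
user_feed_doc = {
    "_id": 42,  # the user's id
    "feed": [
        {"recommendation_id": 987, "type": 1, "datetime": "2015-06-01T12:00:00Z"},
        {"recommendation_id": 655, "type": 2, "datetime": "2015-05-30T09:30:00Z"},
    ],
}

def redis_key(recommendation_id: int, rec_type: int) -> str:
    """Assumed key scheme: a hash of the per-type recommendation id,
    under which the cached 6-join result is stored."""
    raw = f"{rec_type}:{recommendation_id}".encode()
    return "rec:" + hashlib.md5(raw).hexdigest()
```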

This approach performed really well. A MongoDB document was fetched, and the individual recommendations were then fetched from Redis in a loop. A user got his feed in under 30ms for 20 items at a time, after 21 queries.

But if it was all rosy, why did we abandon this approach?

The disadvantages:

A user’s friends could fill his 200-item feed cache with only one type of recommendation, making a filter for any other type yield nothing.

Whenever a user changed his interests by adding or removing a category, or when a user made or removed friends, his existing feed document would have to be destroyed and regenerated.

The biggest disadvantage was that a user could not filter his feed on the fly (e.g. a user viewing his feed could not select a single category, say phones, and get only his friends' recommendations for phones). The only way to do so was to change profile settings, which leads to disadvantage #2.

Elasticsearch

Therefore, a more flexible approach was required: a model that could keep pace with the reactive requirements of modern apps. Using Elasticsearch, we changed our feed generation strategy from Push-on-Change to Pull-on-Demand.

So how does Elasticsearch make things more flexible?

With Elasticsearch, a single query can easily and quickly fetch different documents across the entire dataset. Elasticsearch also has types, an interesting but sometimes misused feature that lets you save several types of data in the same index. Being in the same index means a query across several types will normally perform better than the same query over a similar dataset split across several indices, unless the index holding the multiple types is really large.

The performance boost is nice, but the best part of types is that they represent classes of similar documents with similar mappings. So we created a recommendations_type_index and saved our data as discount_recommendations_type, page_recommendations_type and app_recommendations_type.

Elasticsearch relevance scores use index-wide statistics, and since our recommendations share fields like datetime, totallikes, totalcomments, totalshares and so forth, we can now use types to power a feed display algorithm other than the purely chronological method of our previous MongoDB – Redis setup.
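As one illustration of a non-chronological ranking, engagement fields can be weighted through a `function_score` query. This body is a sketch with invented weights, not our production scoring:

```python
# Sketch of a function_score body that boosts highly engaged-with
# recommendations; the field names mirror those mentioned above,
# the factors are made up for illustration.
ranked_feed_query = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "functions": [
                {"field_value_factor": {"field": "totallikes",
                                        "factor": 1.2, "missing": 0}},
                {"field_value_factor": {"field": "totalcomments",
                                        "factor": 1.5, "missing": 0}},
            ],
            "boost_mode": "sum",  # add engagement boosts to the base score
        }
    },
    "sort": ["_score"],
}
```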

The main advantages over previous setup are:

This reduced our queries from 23 to just 3.

No need to cache a list of 200 recommendations per user.

We are now able to query recommendations by deeply nested fields on the fly allowing us to introduce new, real-time, filter-like features to our users.
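For example, the on-the-fly filter ("just my friends' phone recommendations") collapses into a single bool query. Everything here is a sketch with assumed field names:

```python
def feed_query(friend_ids, category="phones", size=20):
    """Build one Elasticsearch query replacing the old multi-query path.
    Field names (recommender_id, discount.category, datetime) are
    assumptions, not our actual mapping."""
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"recommender_id": friend_ids}},   # my friends
                    {"term": {"discount.category": category}},   # one category
                ]
            }
        },
        "sort": [{"datetime": {"order": "desc"}}],
    }
```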

The main disadvantages over previous setup are:

Index size. We de-normalized our data to fit it in an all-in-one document. This increased query performance, but led to a much larger size on disk due to data duplication.

Tedious updates. A change to shared data requires a scan and update across all documents that embed it.

Varying query times. The MongoDB – Redis setup's pre-cached feeds gave a more predictable spread of feed generation times; with Elasticsearch, a feed's query time depends on the number of filters used.

Conclusion

So there you have it: a fully functional and elastic user feed system. I hope this exposed some interesting uses of the various databases we rely on, and inspires you to use each one as it's meant to be used, as a tool to help you accomplish a task.