Real-Time Feature Engineering for Machine Learning with DocumentDB

Ever wanted to take advantage of your data stored in DocumentDB for machine learning solutions? This blog post demonstrates how to get started with event modeling, featurizing, and maintaining feature data for machine learning applications in Azure DocumentDB.

Machine Learning and RFM

The field of machine learning is pervasive – it is difficult to pinpoint all the ways in which machine learning affects our day-to-day lives. From the recommendation engines that power streaming music services to the models that forecast crop yields, machine learning is employed all around us to make predictions. Machine learning, a method for teaching computers how to think and recognize patterns in data, is increasingly being used to help garner insights from colossal datasets – feats humans do not have the memory capacity and computational power to perform.

In the world of event modeling and machine learning, RFM is no strange concept. Driven by three dimensions (Recency, Frequency, Monetary), RFM is a simple yet powerful method for segmenting customers used often in machine learning models. The reasoning behind RFM is intuitive and consistent across most scenarios: a customer who bought something yesterday is more likely to make another purchase than a customer who has not bought anything in a year. In addition, spendy customers who frequently make purchases are also categorized as valuable using the RFM technique.

Properties of RFM features:

RFM feature values can be calculated using basic database operations.

Raw values can be updated online as new events arrive.

RFM features are valuable in machine learning models.

Because insights drawn from raw data become less useful over time, being able to calculate RFM features in near real-time to aid in decision-making is important [1]. Thus, a general solution that enables one to send event logs and automatically featurize them in near real-time so the RFM features can be employed in a variety of problems is ideal.

Thanks to these capabilities, DocumentDB is able to meet the aforementioned time constraints and fill in the missing piece between gathering event logs and arriving at a dataset composed of RFM features in a format suitable for training a machine learning model that accurately segments customers. Because we implemented the featurization logic and probabilistic data structures used to aid the calculation of the RFM features with JavaScript stored procedures, this logic is shipped and executed directly on the database storage partitions. The rest of this post will demonstrate how to get started with event modeling and maintaining feature data in DocumentDB for a churn prediction scenario.

The end-to-end code sample of how to upload and featurize a list of documents to DocumentDB and update RFM feature metadata is hosted on our GitHub.

Scenario

The first scenario we chose to tackle to begin our dive into the machine learning and event modeling space is the problem from the 2015 KDD Cup, an annual Data Mining and Knowledge Discovery competition. The goal of the competition was to predict whether a student will drop out of a course based on his or her prior activities on XuetangX, one of the largest massive open course (MOOC) platforms in China.

The dataset is structured as follows:

Figure 1. We would like to gratefully acknowledge the organizers of KDD Cup 2015 as well as XuetangX for making the datasets available.

Each event details an action a student completed. Examples include watching a video or answering a particular question. All events consist of a timestamp, a course ID (cid), student ID (uid), and enrollment ID (eid) which is unique for each course-student pair.

Approach

Modeling Event Logs

The first step was to determine how to model the event logs as documents in DocumentDB. We considered two main approaches. In the first approach, we used the combination of <entity name, entity value, feature name> as the primary key for each document. An example primary key with this strategy is <”eid”, 1, “cat”>. This means that we created a separate document for each feature we wanted to keep track of when the student enrollment id is 1. In the case of a large number of features, this can result in a multitude of documents to insert. We took a bulk approach in the second iteration, using <entity name, entity value> instead as the primary key. An example primary key with this strategy is <”eid”, 1>. In this approach, we used a single document to keep track of all the feature data when the student enrollment id is 1.

The first approach minimizes the number of conflicts during insertion because there is the additional feature name attribute, making the primary key more unique. The resulting throughput is not optimal, however, in the case of a large number of features because an additional document needs to be inserted for each feature. The second approach maximizes throughput by featurizing and inserting event logs in a bulk manner, increasing the probability of conflicts. For this blog post, we chose to walk through the first approach, which provides for simpler code and fewer conflits.

Step 2

Featurize each event. In this example, each student action expands into 12 rows of the form { entity: { name: “ “, value: …}, feature: { name: “ “, value: …} } that must be inserted in your DocumentDB collection with the previously created stored procedure. We did this process in batches, the size of which can be configured.

The stored procedure takes as input a row of the form { entity: { name: “ ”, value: …}, feature: { name: “ ”, value: …} } and updates the relevant feature metadata to produce a document of the form { entity: { name: "", value: "" }, feature: { name: "", value: ...}, isMetadata: true, aggregates: { "count": ..., "min": ... } }. Depending on the name of the feature in the document that is being inserted into DocumentDB, a subset of predefined aggregates is updated. For example, if the feature name of the document is “cat” (category), the count_unique_hll aggregate is employed to keep track of the unique count of categories. Alternatively, if the feature name of the document is “time”, the minimum and maximum aggregates are utilized. The following code snippet demonstrates how the distinct count and minimum aggregates are updated. See the next section for a more detailed description of the data structures that we are using to maintain these aggregates.

Probabilistic Data Structures

We implemented the following three probabilistic data structures in JavaScript, each of which can be updated conditionally as part of the stored procedure created in the previous section.

HyperLogLog

Approximates the number of unique elements in a multiset by applying a hash function to each element in the multiset (obtaining a new multiset of uniformly distributed random numbers with the same cardinality as the original set) and calculating the maximum number of leading zeros in the binary representation of each number in the new set n. The estimated cardinality is 2^n [2].

BloomFilter

Tests whether an element is a member of a set. While false positives are possible, false negatives are not. Rather, a bloom filter either returns maybe in the set or definitely not in the set when asked if an element is a member of a set. To add an element to a bloom filter, the element is fed into k hash functions to arrive at k array positions. The bits at each of those positions are set to 1. To test whether an element is in the set, the element is again fed to each of the k hash functions to arrive at k array positions. If any one of the bits is 0, the element is definitely not in the set [3].

Count-Min Sketch

Ingests a stream of events and counts the frequency of distinct members in the set. The sketch may be queried for the frequency of a specific event type. Similar to the bloom filter, this data structure uses some number of hash functions to map events to values – however, it uses these hash functions to keep track of event frequencies instead of whether or not the event exists in the dataset [4].

Each of the above data structures returns an estimate within a certain range of the true value, with a certain probability. These probabilities are tunable, depending on how much memory you are willing to sacrifice. The following snippet shows how to retrieve the HyperLogLog approximation for the number of unique objects for the student with eid = 1.

Conclusion

The range of scenarios where RFM features can have a positive impact extends far beyond churn prediction. Time and time again, a small number of RFM features have proven to be successful when used in a wide variety of machine learning competitions and customer scenarios.

Combining the power of RFM with DocumentDB’s server-side programming capabilities produces a synergistic effect. In this post, we demonstrate how to get started with event modeling and maintaining feature data with DocumentDB stored procedures. It is our hope that developers are now equipped with the tools to add functionality to our samples hosted on GitHub to maintain additional feature metadata on a case by case basis. Stay tuned for a future post that details how to integrate this type of solution with Azure Machine Learning where you can experiment with a wide variety of machine learning models on your data featurized by DocumentDB.