Data science – Schibsted Bytes
http://bytes.schibsted.com
Insights from products & tech

Comparing three solutions for estimating population sizes
http://bytes.schibsted.com/1732486-2/

In an earlier post we discussed a service to estimate the number of unique visitors to a website. In this article we explore the algorithms we used to build the system: HLL (HyperLogLog) and KMV (Kth Minimal Value), and evaluate each.

When you set up an advertising campaign, it’s crucial to know how many users the campaign will reach. We built an estimation subsystem inside Schibsted’s Audience Targeting Engine (ATE) which can answer questions like:

“How many males from Oslo who are interested in sports visit Schibsted websites during a week?”

“How many users in Helsinki age 25-35 visit Schibsted websites in a day?”

[Figure: Example segment – males from Norway who are interested in sports]

The Problem

To estimate the number of unique users for the next week, we can look at historical data for the previous weeks and assume that we will get a similar number of users.

From a technical point of view this is a count-distinct problem. We don’t need the exact cardinality, but we do need a reasonably accurate approximation.

As we discussed in our earlier article, we have several limitations:

We should provide the estimate in real time for any future targeting campaigns. This allows advertising campaign managers to try multiple combinations of targeting parameters and select the best one.

In the ATE, we have many values for our targeting parameters. For example, we have more than 100,000 locations, more than 100,000 interests and more than a million search terms, so it’s impossible to calculate in advance all combinations of targeting parameters.

HLL approach

The most popular approach to solve the count-distinct problem is to use the HyperLogLog (HLL) algorithm, which allows us to estimate the cardinality with a single iteration over the set of users, using constant memory.

The popular databases Redis and Redshift as well as the Spark computing framework already have a built-in implementation of this algorithm.

HLL in Spark

Our original approach was to use a Spark cluster. The Spark Jobserver lets us execute queries on Spark through a REST API. These queries run on data that is already cached in memory and on disk, so no additional time is needed to load the data from the file system and distribute it across the nodes.

To get an estimate, Spark executes queries using the countApproxDistinct function. For example: to find all males in Oslo of age 25-35, the query is the following.
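The original code snippet did not survive the export; a minimal PySpark sketch of such a query might look like the following (the dataset path and column names are illustrative, not the actual ATE schema):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reach-estimation").getOrCreate()

# Hypothetical user-profile dataset, assumed to be cached by the Spark job server.
users = spark.read.parquet("s3://example-bucket/user_profiles").cache()

# Filter the whole dataset first, then estimate the distinct-user count with HLL.
estimate = (
    users
    .filter((users.gender == "male") &
            (users.city == "Oslo") &
            (users.age.between(25, 35)))
    .select("user_id")
    .rdd
    .countApproxDistinct(relativeSD=0.01)
)
print(estimate)
```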

This approach has an issue: Spark has to filter the whole dataset and only then it can create the corresponding HLL. These queries are very fast on small datasets but take up to several seconds on large datasets like ours. That is usually suitable for ad-hoc analytics, but it is not acceptable in our case, because our goal is to get a response within no more than a couple of seconds.

HLL in Redis

Redis uses a slightly different approach. Rather than calculate an HLL at query time, it stores pre-calculated HLLs as a value in its key-value storage. At query time, it only calculates an estimate based on the HLL. So the query executes within milliseconds.

Also, Redis supports unions of several HLLs. So in order to select people who have been in Oslo or Helsinki, we can build two HLLs for these cities and compute their union. But for intersections, for example when we need to find people who were in both Oslo and Helsinki, it performs poorly in certain cases, such as when we need to intersect multiple sets or sets with very different sizes. To understand why this happens we need to look at how the HLL algorithm works.
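For reference (this is generic Redis usage, not the actual ATE code), the commands involved look like this with the redis-py client; the key names are made up for the example:

```python
import redis

r = redis.Redis()

# Pre-calculation: add every observed user to the HLL for the attribute value.
r.pfadd("hll:city:oslo", "user-1", "user-2", "user-3")
r.pfadd("hll:city:helsinki", "user-2", "user-4")

# Query time: cardinality of a single pre-built HLL, answered in milliseconds.
print(r.pfcount("hll:city:oslo"))

# Unions are supported natively: users seen in Oslo OR Helsinki.
r.pfmerge("hll:city:oslo_or_helsinki", "hll:city:oslo", "hll:city:helsinki")
print(r.pfcount("hll:city:oslo_or_helsinki"))

# There is no PFINTERSECT command; intersections have to be derived separately.
```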

HLL Algorithm

The algorithm consists of two parts:

In the first part we create a basic HLL array which can be modified if we need to add more users. This is the most time-consuming part.

The second part is the actual evaluation, which is very fast and can be performed at any time.

Calculating the base array

Internally, an HLL is represented as an array M of length 2^b. In order to add an element to the array we need to perform 3 steps:

Compute the hashcode of the element.

Split the binary hash value in 2 parts.

The first part, of length b, represents the position in the array M; we call it i.

The second part (we call it w) is used to derive the position of the leftmost “1” in its binary representation; we call this position leftmostPos(w).

Insert the value into the array using the formula M[i] = max(M[i], leftmostPos(w))

In this example we take b=4, so the array M has 16 elements. We start with an empty array M:

M = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Let’s say we insert an element x where hash(x) = 0011000110.

Split the hash value into 2 parts: the first b=4 bits give i = 0011 (binary) = 3, and the remaining bits give w = 000110, so leftmostPos(w) = 4.

Insert the leftmost-1 position (4) at index 3 of the array M (array indexes start from zero):

M = [0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Estimation

To get the cardinality estimate we first calculate the harmonic-mean term

Z = 1 / ( Σ 2^(–M[j]) ), where the sum runs over all m = 2^b positions j,

and then finally determine the count:

E = α * m^2 * Z

where α is a constant that corrects for hash-collision bias (it depends only on m).

These formulas may not look very intuitive; for the mathematical proof of the algorithm you can check the original article.
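To make the two parts concrete, here is a deliberately simplified Python sketch of the algorithm described above; it omits the bias and small/large-range corrections that a real HLL implementation applies on top of the raw estimate.

```python
import hashlib

B = 4                 # number of index bits
M_LEN = 2 ** B        # array length m
ALPHA = 0.673         # correction constant for m = 16
HASH_BITS = 32

M = [0] * M_LEN

def leftmost_pos(w: int, width: int) -> int:
    """1-based position of the leftmost 1-bit in a width-bit value."""
    for pos in range(1, width + 1):
        if w & (1 << (width - pos)):
            return pos
    return width + 1  # w is all zeros

def add(element: str) -> None:
    h = int(hashlib.md5(element.encode()).hexdigest(), 16) & ((1 << HASH_BITS) - 1)
    i = h >> (HASH_BITS - B)                 # first b bits: bucket index
    w = h & ((1 << (HASH_BITS - B)) - 1)     # remaining bits
    M[i] = max(M[i], leftmost_pos(w, HASH_BITS - B))

def estimate() -> float:
    z = 1.0 / sum(2.0 ** -x for x in M)      # harmonic-mean term Z
    return ALPHA * M_LEN * M_LEN * z         # E = alpha * m^2 * Z

for n in range(1000):
    add(f"user-{n}")
print(estimate())   # roughly 1000, within the (large) error of a 16-bucket HLL
```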

HLL Unions

To calculate a union, we take two arrays M1 and M2 with their leftmostPos(w) values already calculated. Based on these two arrays, we calculate a new array M by applying, for each element, a formula similar to the one in step 3: M[i] = max(M1[i], M2[i]). This gives us a new base array on which we can perform estimations.

Intersections in HLL

There are two approaches to building intersections on top of HLLs:

The most common approach is the inclusion-exclusion principle. It allows us to derive the intersection from the original estimates and the estimate of their union. The idea is based on the fact that the size of the union of 2 sets is the sum of the set sizes, minus the size of their intersection.

|A ∪ B| = |A| + |B| – |A ∩ B|

This idea can be extended to any number of sets. For example, for three sets the formula looks slightly more complex:

|A ∪ B ∪ C| = |A| + |B| + |C| – |A ∩ B| – |A ∩ C| – |B ∩ C| + |A ∩ B ∩ C|

The intersection we are after is then obtained by rearranging these equations.
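As a small illustration (not from the original post), the two-set case translates directly into code. Note that a small intersection is obtained by subtracting two large, noisy estimates, which is exactly why the error blows up in that situation:

```python
def intersection_estimate(est_a: float, est_b: float, est_union: float) -> float:
    """|A ∩ B| derived from individual and union estimates via inclusion-exclusion."""
    return est_a + est_b - est_union

# Toy numbers: two sets of ~1,000,000 users with a true overlap of 1,000.
# A small relative error on each estimate can dwarf the quantity we want to measure.
print(intersection_estimate(1_000_000, 1_000_000, 1_999_000))   # exact inputs -> 1000
print(intersection_estimate(1_010_000,   995_000, 1_999_000))   # noisy inputs -> 6000
```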

Another way of calculating intersections is with the Minhash approach. It suggests building an additional minhash data structure which helps to estimate intersections.

As we will see in the evaluations, the first approach has issues when we need to intersect multiple HLLs, and the second one requires additional time to build the minhash structure.

Kth Minimal Value approach

We wanted to find an algorithm that has the same speed and precision as HLL but supports intersections by design.

This type of data structure exists in the datasketches framework and is called a theta sketch. It was developed fairly recently at Yahoo and to our knowledge only the Druid database is using it at the moment. The core of a theta sketch is based on the KMV (Kth Minimal Value) algorithm.

The KMV Algorithm

This algorithm has two steps, similar to HLL.

From each incoming element, calculate a hash value as a floating-point number between 0 and 1, and keep only the K minimal values.

Based on the value of the Kth minimal element, it is possible to make an estimate:

Count = (K – 1) / (value of the Kth minimal element)

For example, suppose we have only 10 elements and we select K=3.

Hash codes of our elements:

0.02, 0.07, 0.18, 0.20, 0.21, 0.31, 0.56, 0.59, 0.81, 0.96

Because we selected K=3, we keep only 3 elements.

0.02, 0.07, 0.18, …

The estimate in this case is

Count = (3 – 1) / 0.18 ≈ 11.1

This estimate is pretty close, but in this case we are just lucky because the value of K is too small for practical purposes.

The best way to pick the value of K is to find a suitable tradeoff between accuracy and sketch size. For example, if we select K=2^12 then the error will be ~3% and the size of the sketch will be no more than 61Kb.
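As an illustration only (the production theta sketches in the DataSketches library are considerably more sophisticated), a minimal KMV can be written in a few lines of Python:

```python
import hashlib

def hash01(element: str) -> float:
    """Hash an element to a pseudo-uniform float in [0, 1)."""
    h = int(hashlib.md5(element.encode()).hexdigest(), 16)
    return (h % 10**12) / 10**12

class KMV:
    def __init__(self, k: int):
        self.k = k
        self.values = set()                       # the k smallest hash values seen so far

    def add(self, element: str) -> None:
        self.values.add(hash01(element))
        if len(self.values) > self.k:
            self.values.remove(max(self.values))  # keep only the k minimal values

    def estimate(self) -> float:
        if len(self.values) < self.k:
            return float(len(self.values))        # fewer than k distinct items seen
        return (self.k - 1) / max(self.values)    # Count = (K - 1) / Kth minimal value

sketch = KMV(k=2**12)
for n in range(100_000):
    sketch.add(f"user-{n}")
print(sketch.estimate())                          # close to 100,000
```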

Unions

In order to calculate a union we need to have several sets with K (or less) minimal values in each. Then we can combine them into a single set that will contain the K minimal values of all the lists.

For example, we can take the set with 10 elements from the previous example and union it with another 10-element set that has 5 elements in common with it, so the expected value of the union is 15.

Set 1: 0.02, 0.07, 0.18, 0.20, 0.21, 0.31, 0.56, 0.59, 0.81, 0.96

Set 2: 0.02, 0.12, 0.18, 0.20, 0.21, 0.46, 0.61, 0.66, 0.81, 0.82

We use the same value of K=3 as in the previous example, so we keep only the 3 minimal values from each set.

Set 1: 0.02, 0.07, 0.18, …

Set 2: 0.02, 0.12, 0.18, …

Now we can select 3 minimal unique numbers out of these 2 sets.

Union set: 0.02, 0.07, 0.12, …

Count = (3 – 1) / 0.12 ≈ 16.7, where the expected value was 15.

Intersections

Intersections can be calculated as an intersection of two arrays with K minimal values. In this case K may become smaller, but if the intersection is big enough this is not a problem.

For example, we can take the same two sets of 10 elements each, with 5 elements in common, from the previous example. In this case, the expected estimate is 5.

We have the same value of K=3 as in the previous example, so we keep only the 3 minimal values from each set.

Set 1: 0.02, 0.07, 0.18, …

Set 2: 0.02, 0.12, 0.18, …

After the intersection, we keep only the elements that appear in both sets. In this case there are only two of them (0.02 and 0.18), so our K reduces to 2 and the estimate becomes Count = (2 – 1) / 0.18 ≈ 5.6, close to the expected value of 5.
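Continuing the toy KMV sketch from above, the union and the (simplified) intersection described in this section translate into a few lines of Python; on the K=3 example they reproduce the hand calculations above:

```python
def kmv_union(a: KMV, b: KMV) -> KMV:
    """Union: keep the k smallest hash values across both sketches."""
    out = KMV(k=min(a.k, b.k))
    out.values = set(sorted(a.values | b.values)[:out.k])
    return out

def kmv_intersection_estimate(a: KMV, b: KMV) -> float:
    """Intersection as described above: keep only the hash values present in both
    sketches; the effective K shrinks to the number of common values."""
    common = a.values & b.values
    k = len(common)
    if k < 2:
        return float(k)               # too few samples for the (K - 1) / max formula
    return (k - 1) / max(common)
```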

Metrics

Our goal was to build an accurate, high performance system, so the most important metrics for us were:

Accuracy

Time to compute an estimate (query time).

Accuracy in our case means relative error in percent, which can be described as

error = (measured – truth) / truth

HLLs and theta sketches come with theoretical bounds on this metric, which depend on the size of the data structure. The more accurate the estimate, the more memory is used. However, in some cases, especially those involving intersections, there are no theoretical guarantees, so it is useful to get experimental results.

The second important metric is the response time for a user, i.e. how quickly we can compute an estimate from HLLs or Theta sketches. This includes the time that is required to perform a union or an intersection of several sketches.

Memory usage is less important because all solutions are constant in memory, and accuracy is usually a tradeoff with memory consumption. Time to build the initial data structure is also not very important because we can always precalculate that data structure.

Experiments

When an advertising campaign is set up to target only males, it reaches millions of users. If it targets people with a specific interest, it may reach just a couple of thousand users. In both cases it’s important to get fast and precise estimates. We conducted a large number of experiments with HLLs and datasketches; the experiments below are the most representative.

To make sure we could compare relative error and execution time of the algorithms we set up their internal parameters to have a 1% theoretical error.

To make the experiments more descriptive and easier to set up, the algorithms were run on various sets of randomly generated data. The experiments were executed on AWS c4.xlarge instances.

Experiment 1: One category

Some advertising campaigns are very simple and target just one category. In this case, we build an HLL/datasketch with anywhere from 100 to several million elements and query it. All experimental errors were below the theoretical 1%, and the execution time for all algorithms was less than 0.1 ms.

Experiment 2: Unions of multiple groups of users

In the first article, we explained how we use a grid which can have up to several thousand points to target users by geographical coordinates. Each point has a relatively small number of users, but to get a final estimate we need to union all of them.

Unions of 1000 groups of 1000 unique users:

Algorithm                      Relative Error   Execution time
HLL with inclusion/exclusion   0.25 %           60 ms
HLL with minhash               0.05 %           400 ms
Theta sketches                 0.20 %           30 ms

As we can see from the table, the estimates are very precise and very fast in most cases.

Experiment 3: Intersection of several large groups of users

Another common case is when we want to target users with several targeting parameters. For example, it can be a popular website in a country where we target males of a specific age who are interested in sports. So we get an intersection of several groups with many users.

[Figure: Execution time of intersections]

Intersection of 20 groups, 1 million users each:

Algorithm                      Relative Error   Execution time
HLL with inclusion/exclusion   0.35 %           30 seconds
HLL with minhash               0.60 %           700 ms
Theta sketches                 0.20 %           50 ms

The precision of all algorithms is very good, but in this case HLL with inclusion/exclusion struggles with execution time.

Experiment 4: Intersection of a small and a large group of users

This represents the case when we want to target a relatively small group of users with several targeting parameters. For example, females in a Norwegian town. The group “females” will have millions of users, but a Norwegian town may have only a few thousand people.

[Figure: Precision of intersections]

Intersection of a group of 1,000 users with a group of 1 million users:

Algorithm                      Relative Error   Execution time
HLL with inclusion/exclusion   34 %             4.4 ms
HLL with minhash               11 %             78 ms
Theta sketches                 13 %             3.2 ms

In this experiment, in contrast to Experiment 3, all implementations turn out to be very fast, but the estimates in these corner cases are not precise enough.

Conclusions

The results of our experiments show that, in general, all algorithms give good results both for accuracy and execution time. In simple use cases, when we have only one category or a union of several categories, there is no significant difference between the algorithms. However, in certain cases, especially intersections, execution time and accuracy were not ideal for the HLL-based solutions.

For our service, we chose the theta framework from Yahoo’s Datasketches library because it shows the most stable results both for accuracy and execution in different complex cases.

How we used Machine Learning to increase telemarketing conversion rates by 540%

As our publishers become increasingly dependent on revenue from digital users, Schibsted is investing in technology and data tools to help us grow our digital subscriber base. One such solution is our Subscription Purchase Prediction Model. We were really grateful to be awarded first prize for “Best Use of Data Analytics” at INMA World Congress 2017 for this project.

We have previously shared how we have used this model to optimize marketing on Facebook, with experiments showing significantly improved conversion rates and ROI. In this post we will focus on how we have used the model to target users on our own sites and how we have improved telemarketing conversion rates by 540%.

How the prediction model works

The model predicts the likelihood of an individual user purchasing a subscription, based on their behaviour on our websites and apps. To do this, we train a machine learning algorithm on a dataset of all logged-in users from a given observation period during which they do not have an active subscription, but some of them do go on to subscribe in the following target period. The algorithm learns the difference in behaviour patterns between those that do not purchase and those that do purchase during the target period.

[Figure: How we use data from different time periods to create our dataset]

The algorithm crunches its way through many variables. Some of the most useful variables which emerged are somewhat obvious, such as recency (how long since we have seen you on our site?), frequency (how many days did we see you during the observation period?) and volume of content consumed. There are also some less obvious signals, such as the proportion of days visited that are weekend days, and the number of devices used to visit our site.

We can test the model on historical data to ensure it performs well, and then we can use that model to make prediction scores for all of our logged-in users today based on their recent behaviour. The output scores can then be used to optimize our sales initiatives across channels, by targeting users who are most likely to purchase a subscription.

The prediction model was originally developed for, and in close collaboration with, Aftenposten but in a way that made it easily scaled to our publishers in Schibsted. We currently have the model running in production on a weekly basis for four of our publishers, with three more planned for roll-out in the coming quarter.

[Figure: The machine learning pipeline is site-agnostic and scales easily to more publishers]

Six times higher conversion in telemarketing

At Aftenposten, we have been carrying out telemarketing to registered users for some time. Previously we’ve done this in a mostly unsegmented fashion, with users chosen randomly to be called. Over time we have seen a stable average conversion rate of 1%. In other words, out of all the users we contact, 1% of them purchase a subscription.

Earlier this year we carried out an experiment to see if targeting users for phone calls based on their Subscription Purchase score from our model would yield a higher conversion rate. There were two groups prepared for the experiment: one group taken randomly per our usual practice, and another selected from the top 10% of users with the highest Subscription Purchase score. Both groups were contacted in the same way and over the same time period. We were pleased to see that the targeted group converted directly from the telemarketing call at a rate of 6%, six times higher than the randomly selected group, who converted at the expected baseline rate of 1%.

Since the experiment, we have started targeting our calls based on Subscription Purchase scores on a weekly basis, increasing the volume of users selected and still maintaining a good average conversion rate of 5.4%. The results seen at Aftenposten encouraged us to try a similar approach for another of our newspapers, Faedrelandsvennen, where we have not tried telemarketing before. In this case we have seen a conversion rate of 8.8%.

Doubling click-through rates of in-app ads

Another use case for our Subscription Purchase Prediction Model is to tailor users’ news experience and our in-product communication based on the model. To verify this we carried out an experiment on Bergens Tidende’s (BT’s) mobile app to see if targeting an ad for a BT subscription based on the scores would result in higher engagement and conversion.

Two groups were prepared accordingly: one control group randomly selected, and another group selected because they had high Subscription Purchase scores. The experiment ran for one week, during which both groups of users would see the ad at the top of their news feed when and if they opened the BT app on their mobile.

In that time, the number of subscriptions sold via this channel was too small to draw strong conclusions from, so we also looked at two other metrics: Impression rate and Click-through rate. Impression rates were 6.3% and 28.2% for the random and targeted groups respectively, whilst the click-through rates were 0.7% and 1.3%, with statistical significance at 95% confidence.
[Figure: Targeting advertisement on BT based on propensity scores gives higher impression and click-through rates]

The experiment proved that the users most likely to buy a subscription according to the model were both more likely to open the app to see the ad and more likely to click the ad if they saw it, and has thus demonstrated the potential of using such scores within the product.

Next steps: Experiment, automate, monetize

Going forward we will scale the model to more of our sites, continue to experiment with new ways of using the model and automate successful use cases to improve day-to-day operations. Some ideas we are looking into are how we can use the scores in dynamic paywalls, to determine the prominence of paid vs. free articles in users’ personalized news feeds and as a signal of intent in our audience targeting ads offering.


We built a service to estimate the number of website visitors reached by new audience segments in real time, for queries with any combination of user attributes. Here’s how we did it.

By Manuel Weiss, with help from the ATE team

Audience targeting and segments

Schibsted’s Audience Targeting Engine (ATE) allows us to target advertising based on a user’s attributes, like their age, gender, search history, location, etc.

ATE’s Segment Manager allows the creation of audience “segments”. A segment represents a group of users and is defined by a set of attributes that an advertiser is interested in (e.g. users aged 40-45, based in London). These segments can then be attached to an advertisement campaign.

When an ad campaign manager sets up a segment to target a certain group of users, it is important to know the approximate size of that user group so that the number of impressions can be estimated.

[Figure: Example of a segment: users interested in business and technology, aged 40-45, based in London]

The problem

To calculate the number of users matching a segment, we can look at historical data and compute how many unique users we have observed over a given time period (last week, for example) that would fall into this segment. (This is also known as the “count-distinct problem”.)

This is fairly easy to do for existing segments. But the estimate is most crucial when a new segment is created or an existing one changed. The problem is that it takes a while to go through the historical data and count all users matching the newly created segment. And we cannot precompute all possible segments, as they can consist of any combination of attributes. Additionally, it’s also possible to upload lists of user IDs, which can then also be combined (i.e. intersected) with other targeting attributes. Obviously, this custom list is unknown until it is uploaded.

[Figure: Example of unions and intersections for a segment]

The solution to that problem is to record the unique set of users separately for every criterion and then do the unions and intersections as needed when a new segment is created or changed. To give an example, let’s assume the following segment:

Gender: Male

Age groups: 36-45, 46-55

Location: Oslo, Bergen, Stavanger

Interests: Porsche, Tesla

We will have 8 sets of users: all male users, users aged 36 to 45 or 46 to 55, users in Oslo, Bergen or Stavanger and users interested in Porsche or Tesla. Now we’ll do a union of all sets within a targeting category and then an intersection between categories. This gives us the count of unique users for our segment.

In order to do this for any new segment, we need to store a set of unique user IDs for each possible value of each criterion:

3 for gender: Male, Female, Unknown

6 age groups

> 100,000 for location by district/region/country

Several hundred thousand for interest categories (separate for each supported website)

Several million for location by latitude/longitude

Several million for search terms

This creates a new problem: how do we efficiently store millions of user IDs for each of these sets, while still being able to do unions and intersections?

To the rescue: sketches!

Luckily, smart people have been thinking about this for a while and have come up with a smart solution: sketches! They are probabilistic data structures, like, for example, the slightly better known Bloom filters. The basic idea is similar to lossy compression (as we know it from JPEGs and MP3s): you don’t get quite the same thing back, but it is good enough for your purposes, using only a fraction of the original size.

There are many different versions of these data sketches, as they're also called, but they all build on the following observation: if I store some information about specific patterns in the incoming data, I can estimate how many distinct items I have observed so far. As a very simplified example, if I flip a coin many times and only store the largest number of heads in a row, I can estimate how many times the coin was flipped. For a more detailed explanation of sketches, see here.

All of these sketches rely on a uniform distribution of the incoming values, which can easily be achieved by hashing the original user IDs.

So we can construct a data structure that allows us to add a very large number of values, and then ask how many distinct values were observed. But even better – we can take two such data structures and do a union (i.e. how many distinct values show up in either of the two: “lives in Oslo or Bergen”) or an intersection (i.e. how many distinct values show up in both: “is male and 46-55 years old”).

For a more detailed introduction to these concepts, have a look at these excellent blogposts.

In an upcoming blog post, we will discuss some data sketch implementations and the benchmarks we did. Here, we want to focus on how we used data sketches to solve the problem outlined in the introduction and how we implemented this system in production.

Commercial importance of population estimates

For our campaign managers, seeing the population estimates in real time as they work on their segments is absolutely crucial. A segment’s reach needs to be greater than a certain number of unique users to be commercially viable. What this size is, depends completely on the type of segment. The segments that make up the standard offering in the product portfolio need to be very large and reach a cross section of our users. Custom segments for a specific advertiser can be very small in comparison, but still very attractive for a particular campaign. And, the ad sales team always needs to know the size of each segment to make the call: should I sell this? Will the campaign be able to deliver?

[Figure: The segment creation user interface, with the estimated weekly unique users in the top right]

Using data sketches in a production system

Our previous solution

When we first started estimating reach, we had only a couple of targeting attributes, and needed to support internal users only. For each query, we would kick off a job via Spark Jobserver that would calculate the estimate on the fly based on HyperLogLog data structures kept in memory. This approach did not scale with the number of targeting attributes, the amount of data and the number of concurrent estimation requests. When there was an issue with Spark, the system would stop responding. The new design overcomes all these issues and can respond to many simultaneous requests.

Calculating the sketches

[Figure: Data flow in our population estimation system]

Every night, we run a batch job on Spark which reads all the user profiles recorded on the previous day and calculates a data sketch for each attribute. The attributes in these profiles have been inferred by our User Modelling data science team based on browsing behaviour on our sites. Once all the sketches are calculated, we load them from S3 into a PostgreSQL RDS (serialised as a byte array).

This might seem like an odd choice, given there’s nothing relational about our list of sketches. But we are relying on two very convenient PostgreSQL features: its geo indexing for the location sketches and its table partitioning.

The geo location sketches come in two different flavours, as there are two ways to define locations in a segment: by postcode/district code/region code and by latitude/longitude + radius. The latter allows the precise definition of target areas; e.g. for a new shopping mall which wants to target everyone living within 20km of their location or a coffee shop targeting mobile users within 500m of any of their branches.

To allow targeting by latitude/longitude, we divide the world into a grid of points on the surface and calculate a data sketch for every quadrant in this grid for which we have seen any users. For efficiency reasons, we calculate these sketches at three different resolutions. Obviously, the user defining a segment in the UI simply clicks on a point in a map.

[Figure: Lowest-resolution coverage for Norway, Sweden, Finland and Belarus. Each dot represents one data sketch. Areas without dots don’t have any active users.]

Partitioning in PostgreSQL

Another nice feature of PostgreSQL that we rely on is partitioning. Partitioning refers to splitting what is logically one large table into smaller physical pieces. We are loading new data sketches into the database every day and want to query the last seven days of data. At the same time, we want to remove older data when it is not in use anymore. As a precaution, we always keep 14 days of data in case there is some issue with loading new data.

To achieve this, we have a parent table, for example low_res_geo_sketches, and one partition for each day. When we add a new partition, we remove the oldest from the parent table, but keep it around for another week. For selects on the parent table, PostgreSQL automatically includes data from all partitions associated with it.

Another optimisation is to run the CLUSTER command after loading a new partition and creating the geo index. This reorganises the data physically on disk to match the index, so that retrieval is as fast as possible. It can take quite a while, depending on the amount of data, but it only needs to be done once and happens in the middle of the night.
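A rough sketch of what the nightly maintenance could look like with PostgreSQL 10+ declarative partitioning (column names, index type and partition naming are illustrative; the production setup may use a different partitioning mechanism):

```python
import psycopg2

conn = psycopg2.connect("dbname=ate_estimates")   # connection details are illustrative
conn.autocommit = True
cur = conn.cursor()

# Parent table: one row per (day, grid point), with the sketch stored as a byte array.
cur.execute("""
    CREATE TABLE IF NOT EXISTS low_res_geo_sketches (
        day       date  NOT NULL,
        geo_point point NOT NULL,
        sketch    bytea NOT NULL
    ) PARTITION BY LIST (day)
""")

# Nightly: attach a fresh partition for the day that was just loaded ...
cur.execute("""
    CREATE TABLE low_res_geo_sketches_20171128
        PARTITION OF low_res_geo_sketches FOR VALUES IN ('2017-11-28')
""")

# ... and detach (but keep for another week) the partition that left the 7-day window.
cur.execute("ALTER TABLE low_res_geo_sketches DETACH PARTITION low_res_geo_sketches_20171121")

# Create the geo index on the new partition, then CLUSTER the data around it on disk.
cur.execute("CREATE INDEX ON low_res_geo_sketches_20171128 USING gist (geo_point)")
cur.execute("CLUSTER low_res_geo_sketches_20171128 USING low_res_geo_sketches_20171128_geo_point_idx")
```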

Computing population estimates on demand

Now that we have all the data sketches ready in our database, how do we actually compute the population estimates for a given segment? When someone creates or edits a segment, the browser sends a request to the population estimation API. This request contains the list of targeting values, like in our example at the beginning.

The server will fetch all 8 required sketches from the db, deserialise them, do a union of all sketches within a targeting category and then an intersection between categories. This resulting sketch is then queried for its estimated count of unique users. Depending on the number of sketches that need to be retrieved, this typically takes less than 1 second.
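As an illustration of the union-then-intersection step (using the Apache DataSketches Python bindings rather than the Java library used in production; class names follow the `datasketches` package and may differ between versions):

```python
from datasketches import update_theta_sketch, theta_union, theta_intersection

# One sketch per targeting value. In production these are deserialised from
# PostgreSQL rather than built on the fly; the user IDs here are synthetic.
oslo, bergen, males = update_theta_sketch(), update_theta_sketch(), update_theta_sketch()
for uid in range(0, 50_000):
    oslo.update(f"user-{uid}")
for uid in range(40_000, 90_000):
    bergen.update(f"user-{uid}")
for uid in range(0, 60_000):
    males.update(f"user-{uid}")

# Union within a targeting category: users seen in Oslo OR Bergen.
union = theta_union()
union.update(oslo)
union.update(bergen)
location = union.get_result()

# Intersection between categories: (Oslo OR Bergen) AND male.
intersection = theta_intersection()
intersection.update(location)
intersection.update(males)
segment = intersection.get_result()

print(segment.get_estimate())   # roughly 60,000 unique users in this synthetic data
```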

If there are many geo radius criteria defined in a segment, up to several thousand sketches may need to be loaded from the database. This can sometimes take up to 20 seconds, which is still acceptable as these segments are not created or changed very often. The limiting factor is not the speed of computing the unions/intersections, which is very fast, but the time it takes to load all the bytes from the database. There are many ways to optimise this further, for example by precomputing the union per attribute across the lookback period of seven days, or even by doing the unions/intersections in the database. But for our current needs, it is good enough.

Conclusion

Even estimating unique counts for complex queries is not rocket science and can be done in real time, provided enough effort has been spent to compute things upfront. There are a number of existing robust implementations of the required data structures, or “sketches”, which can just be used as a library, with results that come close to magic!


Matching a job description with a candidate's experience is not an easy task, even for humans.

An HR professional's workload includes lots of data-heavy tasks (like sifting through tons of candidate experiences), which can be very time-consuming. With the great AI awakening, can we expect machines to help HR with these repetitive and complex tasks?

At Schibsted, we operate classified ad services around the world, including several leading job search sites, and we are continuously working to improve the efficiency of our marketplaces to provide a good match between offer and demand. We see AI and deep learning as an enabler to further improve our user experience.

In the future, we expect AI in the job classified ads business to both help HR professionals source good candidates and enable candidates to find their next job. This post explores how the latest progress in NLP (Natural Language Processing) could reshape the job / candidate matching problem by using good representations (embeddings) of the content.

Challenges to overcome

The simplest way to match a candidate's skills with a job description is to use keywords in a term-based search. That means, for instance, in the IT field, if a Java specialist is looking for a new position, then “Java” would obviously be a good keyword to match. A term-based search will then match all the job offers where the term “Java” appears. However, no listings with other relevant keywords (like, say, J2EE) would appear, because they do not contain the keyword “Java”. Therefore, the candidate may miss seeing some relevant job offers.

This is why not only precision, but also recall (the fraction of relevant listings actually matched) does matter.

Among the challenges to good matching, some are particularly crucial:

Synonyms: Depending on the corporate culture, similar job positions and skills might be described using different words. A good model has to be able to catch synonyms and related words.

Polysemy: On the other hand, one specific keyword can have very different meanings in different contexts. E.g., in French, “chef” can refer both to a kitchen chef and to a team leader.

Embeddings trained on classified ads content

NLP (Natural Language Processing) has seen huge progress lately. One of the most popular recent techniques is the use of word embeddings (a.k.a., distributional semantic models). It has gained a lot of popularity by showing that we can represent a word with a vector of doubles that will “encode” the word from a semantic point of view.

The most famous word embeddings model was introduced in 2013 (Mikolov et al. 2013) with the word2vec model. Word2vec has been analysed by many researchers, for instance Goldberg and Levy. The idea behind this model is that a word’s representation can be inferred from its context. So if two words appear frequently in the same contexts, they should be represented by two nearby vectors. Internally, given a word (its representation, in fact), the model tries to predict the surrounding words’ representations. In practice, this means that word2vec will assign high similarities to words that are used in the same manner.

Here is the example developed in the word2vec article:

Say we represent each word of a vocabulary with a vector (e.g., in dimension 100). Each word is then represented by a distribution of numbers across 100 dimensions. The word “king” is represented like this:

W(“king”) = (0.92, 0.81, -0.2, … )

This representation is a projection into a 100-dimensional space. This projection can be seen as a compression. This list of 100 elements does not mean anything in itself, just like a zip file is un-readable in itself, but compresses the information.

Other words of the vocabulary, like “man” or “woman” have their own representation:

W(“man”) = (0.11, 0.85, -0.18, … )

W(“woman”) = (0.12, -0.25, 0.91, … )

We can then compose these vectors by adding or subtracting them, for example:

W(“king”) - W(“man”) + W(“woman”) = (0.93, -0.29, 0.89, …)

This resulting vector is a point in our 100-dimensional space. What happens if we look for the word whose representation in that space is closest to (0.93, -0.29, 0.89, …)? The answer is the word “queen”. So, to recap:

W(“king”) - W(“man”) + W(“woman”) ≈ W(“queen”)

This kind of result seems very powerful. The input of the model is only a corpus of text, and absolutely no notion of gender or royalty has been explicitly given to the model (through a dictionary or a relationship network).

Applying this model, if we feed an open-source Python NLP library (namely gensim) with job offer content from Leboncoin, Schibsted's French marketplace, we can compute which vectors are most similar to a given input.
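The code snippets originally embedded here did not survive the export. As a placeholder, the kind of query involved looks like this with gensim (the toy corpus below stands in for the Leboncoin job ads, so the outputs are obviously not the ones discussed in the examples; parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Each "sentence" is the tokenised text of one job ad (title + description).
ads = [
    ["développeur", "java", "j2ee", "spring"],
    ["développeur", "python", "django"],
    ["cuisinier", "restaurant", "cuisine", "service"],
    # ... in practice, hundreds of thousands of ads
]

model = Word2Vec(sentences=ads, vector_size=100, window=5, min_count=1, workers=4)

# Terms most similar to a given input term.
print(model.wv.most_similar("java", topn=3))

# Analogies ("king" - "man" + "woman" ≈ "queen") use positive/negative terms;
# they only give sensible answers on a large corpus, so this call is commented out:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
```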

Example 1 (IT techno)

Even if Leboncoin is not a niche specialist in the IT field, the similarities returned for IT-related terms seem pretty good.

Example 2 (car make)

From the king - man + woman = queen example, which can be re-formulated as

“man” is to “woman” what “king” is to ?

we can also apply the same logic to typos:

“cuisinier” (the French word for ‘cook’) is to “cuisnier” (a typo of it) as “chauffeur” is to ?

(i.e., if “cuisnier” is a typo of “cuisinier”, what would be the corresponding typo of “chauffeur”?)

Example 3 (typos)

As we see with these examples, the most similar vectors (in the 100-dimensional space) are also quite close in terms of semantics. That is why some describe the similarity between two vectors as a ‘semantic distance’. However, while this performs well for positive similarity, it does not really handle negative similarity (antonyms).

This type of vector analogy is not specific to embeddings, but embeddings do a remarkable job at preserving the semantic relationship in low dimensions. On top of this, word2vec comes with computational optimizations that make it possible to train on very large corpuses very efficiently (in particular using a ‘negative’ sampling technique).

Applications

Assuming that we have a good representation of words, we will have a base on which to build more advanced applications.

Query expansion: In an experiment conducted in Schibsted’s Norwegian marketplace, Finn, we used the word vectors as an under-the-hood query expansion, with good customer feedback. From a small search query of typically 5 words, 4-5 times as many similar words were retrieved, using the word2vec model to form the expanded search query. This expanded query was used to broaden the search and retrieve more candidates to pool from.

Ad2Vec: from word embeddings to classified ads embeddings

Getting a good representation of a classified ad job offer would enable us to estimate its similarity to other offers based on their descriptions (to improve recommendations for instance).

Generally speaking, a classified ad can be seen as three elements: a title, a description and some pictures (when relevant). For job offers, we can focus on the first two elements.

Our first approach was to simply average out all the word embeddings of the ad content to get a representation of the whole ad. However, this did not appear to be the best aggregation method. Instead, we experimented with a couple of alternatives, using a tf-idf weighted sum of the underlying embeddings or projecting the ad into a higher-dimension space to get better representations.
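A rough sketch of those two aggregation strategies, reusing a trained gensim model's KeyedVectors (`wv`); this is an illustration of the idea rather than our production code:

```python
from collections import Counter
from math import log
import numpy as np

def average_embedding(tokens, wv):
    """Plain average of the word vectors of an ad's tokens."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

def tfidf_embedding(tokens, wv, doc_freq, n_docs):
    """tf-idf weighted average: rare, informative words contribute more.
    doc_freq maps a term to the number of ads containing it."""
    counts = Counter(t for t in tokens if t in wv)
    weighted = np.zeros(wv.vector_size)
    total = 0.0
    for term, tf in counts.items():
        weight = tf * log(n_docs / (1 + doc_freq.get(term, 0)))
        weighted += weight * wv[term]
        total += weight
    return weighted / total if total else weighted
```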

From that, not only can one-to-one similarity be computed, but we can also try to get a more global view of the embeddings. We used the Google Tensorboard Embeddings Visualization tool, with a 3d T-SNE projection on a sub-sample of the ads. Each point is a job offer classified ad, colored according to its field. In a plot of a sub-sample of 7 fields, the ad embeddings coupled with the T-SNE projection very clearly split the different fields.

Zooming in on the IT position ads, we also got a very homogeneous word cloud (in French).

Conclusion

The word embeddings model and its derivatives are a very promising technique for qualifying and enriching classified ads content. The Natural Language Processing (NLP) techniques used here bring us closer to automating Natural Language Understanding (NLU).

We focused here on job descriptions, but, for more general purposes, a global content embedding will have to synthesize the text, pictures and all the meta-data attached to an ad.

What about deep learning? Word2vec does not actually use a deep network (it’s more part of the shallow machine learning models). Deep learning applied to text is still a huge research area. If you want to help us dig deeper on this topic, we are hiring!

Deep learning is changing everything – and it's here to stay. Just as electronics and computers transformed all economic activities, artificial intelligence will reshape retailing, transport, manufacturing, medicine, telecommunications, heavy industry…even data science itself. And that list of applications is still growing, as is the list of complex tasks where AI does better than humans. Here at Schibsted we see the opportunities deep learning offers, and we're excited to contribute. At the latest conference on Neural Information Processing Systems (NIPS 2016), Andrew Ng shared some ideas about deep learning. Let me share them with you.

The first great advantage of deep learning is its scale. Andrew summarized it in the following chart: deep learning models perform better when the amount of data is increased. Not only that, the larger the neural network, the better it works for larger datasets, unlike traditional models, where once performance reaches a certain level, adding data or complexity to the model does not necessarily lead to better performance.

Another reason deep learning models are so powerful is their capacity to learn in an end-to-end fashion. Traditional models usually need significant feature engineering. For example, a model able to transcribe the voice of a person may need to do many intermediate steps with the inputs, e.g., finding the phonemes, correctly chaining them and assigning a word to each chain.

Deep learning models do not usually need that kind of feature engineering. You train them end-to-end, i.e. by showing the model a large number of examples. However, the engineering effort, instead of being applied to transforming the features, goes into the architecture of the model. The data scientist will need to decide on and experiment with the types of neurons, the number of layers, how to connect them, etc.

Challenges in model construction

Deep learning models have their own challenges. Many decisions have to be taken during their construction process in order to make the model successful. If a wrong path is taken, much time and money will be wasted, so how can data scientists make informed decisions on what to do next to improve their model? Andrew showed us his classical decision-making framework used to develop models, but this time he extended it to other useful cases.

Let’s start with the basics: in a classification task (for example, making a diagnosis from a scan), we should have a good idea of the errors from:

Human experts

Training sets

Cross-validation (CV) set (also called development or dev set)

Once we have these errors, a data scientist can follow a basic workflow to make sound decisions while constructing the model. First ask: is your training error high? If so, the model is not good enough; it may need to be richer (e.g., a larger neural network), have a different architecture, or be trained for longer. Repeat the process until the bias is reduced.

Once the training set error is reduced, a low CV set error is needed. Otherwise, the variance is high, meaning more data, more regularization or a new model architecture is required. Repeat until the model performs well on both the training and the CV set.

Nothing new there. However, deep learning is already changing this process. If your model is not good enough there is always a “way out”: increase your data or make your model larger. In traditional models regularization is used to tune this trade-off, or new features are generated - which isn’t always easy. But with deep learning we have better tools to reduce both errors.
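To make the workflow tangible, here is a schematic helper; the threshold value and the error names are illustrative assumptions, not figures from the talk.

```python
def next_step(train_error, cv_error, human_error, tolerance=0.02):
    """Schematic version of the basic bias/variance workflow described above."""
    if train_error - human_error > tolerance:
        # High bias: the model underfits even the training data.
        return "Use a richer model (e.g. a larger network), change the architecture, or train longer."
    if cv_error - train_error > tolerance:
        # High variance: the model does not generalize to the CV set.
        return "Get more data, add regularization, or try a new architecture."
    return "Both errors are low; the model performs well on the training and CV sets."

print(next_step(train_error=0.10, cv_error=0.18, human_error=0.08))
```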

Refining the bias/variance process for artificial data sets

But access to a vast amount of data isn’t always possible; the alternative is to build your own training data. A good example is training a speech recognition system, where artificial training samples can be created by adding noise to the same voice recordings. However, that does not mean that the training set will have the same distribution as the real data. For these cases the bias/variance trade-off needs to be framed differently.

Imagine that, for a speech recognition model, we have 50,000 hours of generated data but only 100 hours of real data. In such a case the best advice is to take the CV set and the test set from the same distribution: the generated data becomes the training set, and the real data is split into CV and test sets. Otherwise, the CV and test sets would have different distributions, which would only be noticed once the model is “completed”. The problem is specified by the CV set, hence it should be as close to the real data as possible.

In practice, Andrew recommended splitting the artificial data into two parts: a large training set and a small held-out portion of it, which we will call the “train/CV set”. With that, we will measure the following errors:

(1) Human-level error

(2) Training set error

(3) Train/CV set error (held-out generated data)

(4) CV set error (real data)

(5) Test set error (real data)

So, the gap between (1) and (2) is the bias, the gap between (2) and (3) is the variance, the gap between (3) and (4) is due to the distribution mismatch, and the gap between (4) and (5) is due to overfitting.

With this in mind the previous workflow should be modified like this:

If the distribution error is high, modify the training data distribution to make it as similar as possible to the test data. A proper understanding of the bias/variance problem allows faster progress in applying machine learning.
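Given the five error measurements listed above, the gaps can be computed directly; the numbers below are made up purely for illustration.

```python
errors = {
    "human": 0.01,     # (1) human-level error
    "train": 0.03,     # (2) training set error
    "train_cv": 0.05,  # (3) train/CV set error (held-out generated data)
    "cv": 0.10,        # (4) CV set error (real data)
    "test": 0.11,      # (5) test set error (real data)
}

gaps = {
    "bias": errors["train"] - errors["human"],
    "variance": errors["train_cv"] - errors["train"],
    "distribution_mismatch": errors["cv"] - errors["train_cv"],
    "overfitting": errors["test"] - errors["cv"],
}

# The largest gap suggests what to fix next; here the distribution mismatch dominates,
# so the training data should be made more similar to the real (CV/test) data.
print(max(gaps, key=gaps.get), gaps)
```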

Human-level performance

Knowing the performance level of humans is very important, as this will guide decisions. It turns out that once a model surpasses human performance, it is usually much harder to improve, because we are getting closer to the "perfect model", i.e. the point where no model can do better (the "Bayes rate"). This was not a problem with traditional models, where it was hard to perform at super-human levels, but it is becoming increasingly common in the realm of deep learning.

So, when building a model, take the error of the most expert group of humans as the reference; this will be a proxy for the "Bayes rate". For example, if a team of doctors does better than one expert doctor, use the error measured by the team of doctors.

How can I become a better data scientist?

Reading many papers and replicating results is the best and most reliable path towards becoming a better data scientist. It’s a pattern Andrew has seen across his students, and one that I personally believe in.

Even if almost all you do is "dirty work" - cleaning data, tuning parameters, debugging, optimizing the database, etc., - don’t stop reading papers and replicating models, because replication eventually leads to original ideas.

The new Schibsted

Schibsted is changing. Heroic efforts are underway to turn us into a global technology company. One such effort is the consolidation of our user base of some 200 million unique users a month. A user base of 200 million can provide a wealth of insights into users' behavior, activities and interests, and a wealth of opportunities - from enabling customized interactions to sharing best practice, all to ensure an excellent user experience.

Given the importance of taking advantage of such opportunities, the efficient and consistent acquisition, dissemination and analysis of data is essential. Without these analyses, we will not be able to inform the wide range of operations and decision-making processes that depend on them. Pulse, Schibsted's internal tracking solution, is the technology that enables the acquisition and dissemination of data, while PulseMonitor ensures that we can efficiently and consistently analyse that data - by securing high-quality data.

The purpose of PulseMonitor

Pulse provides us at Schibsted with the ability to record a user's behavior as they interact with our sites through the collection of discrete events. We track our users' behavior for a multitude of reasons: to reveal usage patterns, determine trends in activity, augment the user experience and much more. Our users trust us with their data, so it is vital that we take privacy seriously, and we do. At Schibsted we work hard to be transparent - because we recognize that our users’ trust is our most important asset.

With Pulse we aim to enable the creation of unique and valuable experiences, so each user that interacts with our sites and applications can achieve their goals; be that reading informative and thoughtful news articles, or buying that Swedish designer chair they have spent five years searching for. Tracking is necessary to enable a good user experience, but that doesn't mean we don't understand and take seriously the implications of users trusting us with their data.

Collecting data for the mere purpose of collecting data doesn't make much sense. What is important is that we collect the right data, and that it is of sufficient quality for our data scientists or the sites themselves. Maintaining a sufficient quality of data is therefore vital, and this is the purpose of PulseMonitor.

More than just a dashboard for showing the status of a tracking integration, PulseMonitor aims to ensure that sites across Schibsted are consistently integrated. To extract value from the data sites provide, we need to be able to trust the information it contains, and capture enough of the overall picture to correctly analyse it. We need to be able to trust the conclusions drawn from those analyses.

PulseMonitor ensures that sites are aware of their tracking integration, how they can improve their integration, how they compare to other sites, and most importantly, how well the data they collect meets the expectations of the data consumers.

Schemas, we have them for a reason

Each interaction a user has with a site, or each action performed by a backend system on their behalf, can result in a collected event. Each of these events has an associated type with an associated schema that it needs to conform to (see figure above for an example). The schema represents a contract, or blueprint, between the dispatcher and receiver of the event. The schema is important because it adds predictability, allowing us to validate an incoming event. This implies that we can sort events based on their type and version, and process them accordingly.

A schema also allows us to make some assumptions about the intrinsic quality of an event, where intrinsic quality refers to the data inherent to that event. For example, if an event has a field called address of type string, and the value contained within that field is indeed a string, the field at least has a correct type (even though it tells us nothing about the semantic correctness of the content, it is a hint).

PulseMonitor can then cover the introspection of schemas by providing a simple user interface to easily discover event types, connections between event types, required fields, etc.
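As a toy illustration of the intrinsic check mentioned above (an address field that must be a string), here is a sketch using the jsonschema library; the event type and fields are hypothetical and not the actual Pulse schema.

```python
from jsonschema import ValidationError, validate

# Hypothetical schema for a page-view event; the real Pulse schemas are richer.
page_view_schema = {
    "type": "object",
    "required": ["type", "version", "address"],
    "properties": {
        "type": {"type": "string"},
        "version": {"type": "string"},
        "address": {"type": "string"},  # the intrinsic type check from the example above
    },
}

event = {"type": "PageView", "version": "1.0", "address": "http://example.com/ad/123"}

try:
    validate(instance=event, schema=page_view_schema)
    print("event conforms to its schema")
except ValidationError as err:
    print("invalid event:", err.message)
```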

Collection pipeline

These events are collected by the Pulse software development kits (SDKs), and subsequently dispatched to the data collection pipeline. The current focus is on collecting client side events, and as such, we have SDKs for JavaScript, iOS and Android.

Schibsted uses Amazon Web Services (AWS) for running a number of services, and this is also true for the data collection pipeline, depicted in the figure above. Events dispatched by the SDKs are received by an endpoint called the Collector, which is a very simple service that quickly pushes the received events into a persistent queue. We use Kinesis as our persistent queue; it’s a sliding-window queue that provides certain guarantees about reliability and retention. It’s called a sliding-window queue because all events stored in Kinesis can be read as many times as required during the retention period - seven days in our case. It is up to the readers to keep track of processed events. Once the seven-day retention period has passed, events are removed automatically by Kinesis.
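To make the sliding-window idea concrete, here is a minimal boto3 reader; the region, stream and shard names are placeholders, and a real consumer would checkpoint its position instead of always starting from the oldest record.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")  # placeholder region

# Start at the oldest record still inside the retention window (seven days here).
iterator = kinesis.get_shard_iterator(
    StreamName="pulse-events",          # placeholder stream name
    ShardId="shardId-000000000000",     # placeholder shard
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    # The reader is responsible for remembering the last processed sequence number;
    # the same records can be re-read as often as needed during the retention period.
    print(record["SequenceNumber"], record["Data"][:80])
```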

There are two paths in the collection pipeline: one for fast processing or streaming, and one for slow or bulk processing, i.e., where the volume of data is more important than low latency. For PulseMonitor we rely on the fast processing path. In the collection pipeline, a component called Piper can forward a subset of events, or all of them, to any dependent client, depending on filters and other criteria.

PulseMonitor receives all events, since we are interested in providing a global view of Pulse's status. Furthermore, we hook onto the fast path, because we want the lowest possible feedback-loop latency for our users.

Immediate feedback on modifications a site makes to its tracking integration relates to timeliness and is very important.

PulseMonitor pipeline

Piper then forwards all the events to our own Kinesis stream, from where we read the events and push them into ElasticSearch (the component called Slurp in the above figure). Between Kinesis and ElasticSearch, Slurp allows us to modify events if required, with some basic hacking, slashing, validations or aggregations. It also enables us to push events into one or more ElasticSearch indices, depending on requirements.

PulseMonitor is a web service running queries towards ElasticSearch through data providers written using GraphQL, a powerful client-side friendly query language. With GraphQL, our webview is only one potential client; anyone else who wants to extract a subset of the metrics can do so through the GraphQL API and publish them to their own services, such as DataDog, or another ElasticSearch cluster.
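As an illustration of what such a client could look like, here is a sketch that posts a GraphQL query with requests; the endpoint URL and the field names are hypothetical, not the actual PulseMonitor API.

```python
import requests

# Hypothetical query: fetch per-event-type counts and SDK versions for one site.
query = """
{
  eventMetrics(site: "example-site", lastHours: 24) {
    eventType
    count
    sdkVersions { name count }
  }
}
"""

response = requests.post("https://pulsemonitor.example.com/graphql", json={"query": query})
print(response.json())
```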

So why not publish these metrics directly to DataDog? One thing we want to achieve with PulseMonitor is to provide suggestions or actions for improving an integration, which is not easily done with DataDog.

Also, we only want to provide insights into recent metric data with PulseMonitor. With approximately 700 million events ingested per day we don’t want to keep more than a small window of events available - perhaps a day or two. For long term insights, and trends over weeks, months or even years, we have separate teams.

PulseMonitor displays metrics and suggestions based on simple data sets; we are not in the business of creating complex analyses, for example for retention - that is the job of our very competent data scientists.

What is data quality?

Quality cannot be assessed independently of the consumers who are the users of a product. Similarly, data quality cannot be assessed independently of those who use the data, i.e., the data consumers. At Schibsted, those consumers are primarily our data scientists. With improved data quality, Schibsted's data scientists can increase the accuracy of their models and extract more insights. More accurate models, capturing a broader and more accurate picture of the state of affairs, enable the sites to improve their products more effectively. This is a mutually beneficial relationship.

Quality categories and dimensions

To understand how data quality can be measured, there are a number of commonly used categories and dimensions, as shown in Table 1. Given these definitions, we can invert the problem and say that data quality problems are those caused by collected data being unfit for use in one or more of these quality dimensions.

The intrinsic category refers to the quality the data has in its own right, i.e., whether it is accurate or inaccurate. These are dimensions related to the quality of the data itself, not how it is accessed or used. For example, average user reading times for an article might be accurate, but of no use for determining a user’s gender.

Accessibility relates to the ease of access and understanding of data. Are there processes in place to access certain sets of data? Is a particular set of credentials required? Are we able to easily find datasets that are useful for an analysis? Are we able to interpret the data without expert assistance? With accessibility we want to remove any problems that conceal or make data inaccessible.

Contextual refers to the quality of data in the context of the task being performed. If our analysis relies on real-time data, data transmitted with significant delay will provide little value for the context it is processed in. Completeness is another example: we might be sending a number of events, all of which are correct, but only half of the event types we should be sending, and the lack of such events might not be known or communicated. Being able to flag this as an issue, and have it easily communicated to all data producers, will form a vital piece of functionality in PulseMonitor.

Representational data quality means data is easily interpreted, concise and consistent. Accessing large amounts of data is time-consuming, so introducing columnar data storage can reduce computational time for some kinds of analysis. Are we able to access the stored data in a consistent format? For example, metadata extracted from video, audio or images should be available in the same format as events collected from the frontend or backend. Having meaningful, concise and consistent representations of data is vital.

What can we do to ensure quality?

For batch jobs, as mentioned with columnar storage, this may simply involve access to a data format that doesn't make it cumbersome to process large amounts of data. An overview of the storage formats available for various datasets, and advice on the situations in which different formats are useful, can be very valuable to data consumers. Similarly, a notification framework for the availability of bulk data sets for processing can be of great value, and provide statistics about general dataset availability.

Since processing resources are closely linked to data accessibility, we could also display the resources dedicated to shared clusters, average job running times, and tips and tricks on how to optimize jobs. Such simple insights could then make it possible for data consumers to alert resource managers if jobs are piling up, or to request assistance from experts in optimizing their work.

With regard to proper interpretation and understanding, we can ensure the data is accompanied by descriptions and examples of use, to highlight how it can be interpreted without specialist aid. If specialists are required to interpret the data, contacts or channels of communication can be provided to address relevant questions.

Concise and consistent representation (e.g., metadata extracted from images, queryable in combination with text or other objects) could also be supported via PulseMonitor as part of a documentation set, perhaps just by providing a link to an API that can be used.

Ultimately, we want to increase data utilization with PulseMonitor, and the best way to achieve this is by removing common sources of faults.

Where are we now and where do we want to be?

Within the described quality categories and dimensions there are a number of high reward features we can integrate into PulseMonitor. The goal is to provide a real-time live-aggregation dashboard where a visualisation of metrics from the events currently flowing through the system can be seen. We already have a prototype for this (see figure above). Key metrics are event type distribution, number of events, distribution across different SDKs, etc. More importantly, this will provide valuable input for sites to increase the quality of their integration to extract further value from tracking.

There are different types of quality signals. For instance, a lack of event diversity across comparable sites, e.g., marketplaces, can be an indication that one marketplace is not collecting a sufficiently diverse set of events. Adding tracking for an additional set of event types, might enable them to collect more complete profiles of user behaviors, increasing the value of these user interactions with the site. In the long term, the ability to compare your integration with others in comparable markets is a powerful one. It enables sites to adapt to positive changes on other sites through awareness that one site can collect location data for more events. At the very least it creates an incentive to reach out and see if there are lessons to be learnt.

Other quality signals could be simple statistics on what SDK versions sites are shipping events from. If no events are being sent from iOS, perhaps a potential market is not being exploited; the numbers for other sites on that platform might indicate a potential benefit. We can also provide simple insights, such as percentages of users who opt out, percentage of events containing certain fields, e.g., location.

Such a dashboard can provide insights, such as the first time an event type was seen, or the percentage of those events received within a time period. It could also be a one-stop-shop for integrating tracking into a site.

We can then provide snippets of code for integrating the missing tracking SDKs, or documentation on how to track more types of events. It can also be used to display the schema, and show a distribution of events that are valid or conform to the given schema. Events that don't conform can then be listed, with hints about what is wrong with the event (and potential fixes) provided to the site.

Conclusion

Pulse is Schibsted's tracking solution, providing a unified view of Schibsted's user base, with the aim of extracting value from the behavioral data created by our users.

The efficient collection, dissemination and analysis of data is becoming increasingly important. To support our data consumers, who are primarily Schibsted data scientists, we have to ensure that data utilization is high and that there are a minimum of problems with data quality. This is important because the mutually beneficial relationship that results from more accurate models can lead to more users or better retention, which means more data, and so on.

PulseMonitor is a tool for sites to efficiently examine the status of their tracking integration, and for data scientists to highlight issues related to data quality problems, and have them removed.

What’s the most important thing for sellers on a marketplace? Setting a good price. We take a look at the dynamics of pricing and how Schibsted can help users set prices for cars.

Introduction

Millions of cars are sold every year across Schibsted’s 42 worldwide marketplaces, on sites such as Coches.net in Spain, Finn.no in Norway or Blocket.se in Sweden. What more can we do to help people sell their cars - or, indeed, anything?

Setting an asking price is probably the most important action for any seller in a marketplace. The asking price directly affects the final price, and (because every seller wants the best price) also directly impacts the user experience on Schibsted classified sites.

However, the price has less obvious effects, as well. Too low and the seller may be flooded with offers and regret the price. Too high and they might receive no interest in the item - or many haggling requests, and be forced to lower the price. By supporting users in setting realistic prices, we believe we can improve their marketplace experience.

But how exactly can we support them? Classified advertising sites typically require only a photo, title, description, category and price, making price recommendation difficult, because we are forced to find similarities between classified ad features before applying pricing logic.

However, ads for some product types have more structured information - cars are a good example. Here collecting information on the car make, model and release year is very common. This makes it much simpler to recommend prices, because we can apply conventional regression techniques without thinking about complex similarity measures to unveil the dynamics of pricing.

Let’s shine a light on what drives car prices and how well we can predict them...

Data

One privilege of working as data scientists at Schibsted is the incredible amount of data available for studies like this one. The dataset used in this post comes from one of Schibsted’s classified sites. It comprises a few million car ads with the following nine features, in addition to the asking price, which is our target variable.

Table 1: Data Model

The dataset consists of 39 car brands, with the following distribution:

Looking at the age distribution, 25% of cars are younger than five years, whereas about 60% are less than ten years old.

How does age impact your car’s value?

Car owners can easily estimate the cost of insurance, fuel and service per year, but they may have a harder time figuring out how much their car will be worth in the future.

What will be the value of your car in, say, four years? And will the depreciation change if your car is a BMW or a Ford? Without data, these questions are hard to answer. But with the amounts of historical data we have at Schibsted, the depreciation of a car is more easily determined.

We decided to test this by plotting the depreciation curves for the 6 most popular car brands, in Figure 3 below. For each of the six car brands, we calculated the median price of the cars at each age level, and scaled the values by setting the price to 100% at age zero. Hence, in Figure 3, we can think of the vertical axis as the relative price of a car with respect to its initial value.

As it turns out, we can say that the price of your car has most likely halved after four years, no matter if it’s a BMW or a Ford. That seems a good rule of thumb. In fact, the price decreases by approximately 15% every year in an exponential fashion. At year 4 it is about 50% of the initial price, at year 8 about 25%, and so on. If you don’t believe us, do the maths: 85% x 85% x 85% x 85% = 52%. This is what in finance is called “the magic of compounding”.
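The compounding is easy to check in a few lines, taking the 15% yearly depreciation quoted above as given.

```python
yearly_depreciation = 0.15  # approximate figure read from the depreciation curves above

for age in range(0, 9):
    relative_price = (1 - yearly_depreciation) ** age
    print(f"age {age}: {relative_price:.0%} of initial price")

# age 4 -> ~52% and age 8 -> ~27%: the price roughly halves every four years.
```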

Mileage

Mileage in our dataset is grouped in buckets of 5,000 km. From Figure 4, we can see that the largest group is in fact cars with less than 5,000 km, which comprises about 5.5% of the cars in the dataset. Note that in the graph below, the rightmost bar is the group of all cars driven 300,000 km or more.

It’s a known phenomenon that as the mileage of a car grows, its value falls. It’s common sense. However, can we derive a rule of thumb for the depreciation effect of mileage, like we did for the age? Let’s have a look.

How does mileage impact your car’s value?

Just as we did with the age of the car, we plot the evolution of its price over mileage, but now directly in relative terms, i.e. as a percentage of the price at zero mileage:

As expected, we can see that all car brands lose value quickly. However, it’s also apparent that Volkswagens and Fords depreciate more quickly in the first 25,000 km, and in Ford’s case, it never catches up with the other brands. Alongside the data from the 6 brands we have plotted the exponential decay curve y = c * e^(a * x), fitted on the medians of car prices from all 6 brands.

It’s apparent that the exponential fits the curve of the car brands quite nicely, except for the case of Ford. While the price of a car roughly halves after 85,000 km, for some cars it halves quicker, like the Fords, which lose 50% of their value just after 65,000 km.
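A fit like this can be reproduced with a few lines of SciPy. The mileage and relative-price arrays below are illustrative placeholders, not the actual medians from our dataset; they are chosen only to be consistent with the roughly 85,000 km half-life mentioned above.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative data: mileage bucket midpoints (km) and median price relative
# to the zero-mileage price, pooled over the six brands (made-up values).
mileage = np.array([0, 25_000, 50_000, 75_000, 100_000, 150_000, 200_000])
rel_price = np.array([1.00, 0.80, 0.66, 0.55, 0.45, 0.31, 0.21])

def exp_decay(x, c, a):
    return c * np.exp(a * x)

(c, a), _ = curve_fit(exp_decay, mileage, rel_price, p0=(1.0, -1e-5))
half_life_km = np.log(0.5) / a  # mileage at which the fitted price has halved
print(f"c={c:.2f}, a={a:.2e}, price halves after ~{half_life_km:,.0f} km")
```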

Figure 5 also allows us to establish some equivalences. We can see that the first 15% of the price loss, which is roughly equivalent to one year as per Figure 3, happens at around 16,000 km. The next 15% is at approximately 32,000 km. As a rule of thumb, a mileage of 16,000 km (which is about 10,000 miles) has the same effect as one year.

There is of course a flaw in the charts above. We’ve plotted the median price for each car brand evolving with age or mileage, but including all the different models. It is likely that some car models depreciate differently from the median. Figure 6 shows the variation in price for Volvos, by plotting the interpercentile ranges of prices for different mileages.

Figure 6: Interpercentile ranges of prices of Volvos grouped by mileage

It is clear that there is a lot of variance in the price that mileage alone doesn’t explain, and that the depreciation will quite possibly differ between car models. For example, if we look at the area between the 25th and 75th percentiles for cars with 50,000 km, the prices range from 50% to 80% of the initial price. The median, however, lies at around 65%. This shows how different prices can be within the same brand, and this may be due to many other factors not related to mileage.

Price Predictions

In this section, we will discuss some basic approaches to predicting prices for vehicles. For this purpose, we divide the dataset into a training set, which we use to train our models, and a test set, on which we test how accurately they can predict a price. The split is based on time, so that we use the newest 20% of the observations for testing.
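A time-based split like this might be implemented as follows; published_date is the column from Table 1, and the function name is ours.

```python
import pandas as pd

def time_based_split(ads: pd.DataFrame, test_fraction: float = 0.2):
    """Use the newest `test_fraction` of observations as the test set."""
    ads = ads.sort_values("published_date")
    cutoff = int(len(ads) * (1 - test_fraction))
    return ads.iloc[:cutoff], ads.iloc[cutoff:]  # train, test
```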

In order to measure the accuracy of our model, we use the mean and the median of the absolute percentage error (APE). Although it sounds complicated, it is not. Bear with us one moment.

The absolute percentage error is given by |(truth - prediction) / truth| for each observation. For example:

We have two cars A and B with their “true prices” €10,000 and €20,000 respectively. For car A, we predict a price of €12,000, and for car B we predict €15,000. The absolute percentage error for car A is |10000 - 12000| / 10000 = 20%. For car B, the absolute percentage error is |20000 - 15000| / 20000 = 25%. The mean absolute percentage error is therefore (20% + 25%) / 2 = 22.5%.
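In code, the metric and the worked example above look like this:

```python
import numpy as np

def ape(truth: np.ndarray, prediction: np.ndarray) -> np.ndarray:
    """Absolute percentage error per observation."""
    return np.abs((truth - prediction) / truth)

errors = ape(np.array([10_000, 20_000]), np.array([12_000, 15_000]))
print(errors)             # [0.2  0.25]
print(errors.mean())      # 0.225 -> Mean APE of 22.5%
print(np.median(errors))  # 0.225 -> Median APE (identical here, with only two cars)
```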

Of course the lower the Mean APE of the model, the better it is. Now that we have a way to measure how good a model is, we can build a very simple one by grouping the training set by different fields and calculating the mean (or some other number) of the price based on the observations in that group. Then, for any car that matches on those fields, we will use the computed number, i.e. the mean in this example, as the prediction for this car.

For example, if we group only by “make_name”, the prediction of this simple model for an Audi A4 Avant from 2010 will be just the average price of all cars of the make Audi. We dub this benchmark approach the GroupByRegressor. In Table 2 we report the Median and Mean APE on the test set:

Table 2: Results obtained by using group by and mean calculation

Interestingly, with this simple model we are on average about 15% away from the observed asking price - which actually isn’t that bad! Note that although the GroupByRegressor seems to work here, it will do a very bad job in the event that you have an observation in the test set that doesn’t match any observation in the training set on the grouped features - hence, this model won’t generalize to completely new observations.
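A minimal sketch of such a benchmark could look like the following; the class name GroupByRegressor comes from the article, while the fallback to the global mean for unseen groups is our own simplification.

```python
import pandas as pd

class GroupByRegressor:
    """Predict the mean training price of ads sharing the same group keys."""

    def __init__(self, keys):
        self.keys = keys

    def fit(self, train: pd.DataFrame):
        self.group_means_ = train.groupby(self.keys)["price"].mean()
        self.global_mean_ = train["price"].mean()  # fallback for unseen groups
        return self

    def predict(self, test: pd.DataFrame) -> pd.Series:
        preds = test.join(self.group_means_.rename("pred"), on=self.keys)["pred"]
        return preds.fillna(self.global_mean_)

# Example: group only by brand, as in the Audi example above.
# predictions = GroupByRegressor(["make_name"]).fit(train).predict(test)
```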

We can do better.

A slightly more sophisticated approach is to train a K Nearest Neighbors (KNN) model. For every data point in the test set, a KNN model identifies, given some distance metric and the number K, the K closest data points in the training set. We then report the average price of those K neighbors as the prediction. This methodology mimics the behavior of users in our marketplaces, where they search for cars similar to the one they want to sell and use the prices of cars in the search results as the basis for setting the price of their own ad.

Using the nine features in Table 1, we define a simple pipeline where we transform the discrete features make_name, model_name, model_version_name, gearbox and chassis into multiple binary vectors in a procedure known as one-hot encoding. The model_year and published_date are transformed into the age of a car by computing year(published_date) - model_year; the age is then, along with engine_size and mileage, scaled to zero mean and unit variance.
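In scikit-learn, such a pipeline might be sketched as follows (we use K = 10, the value discussed below; the exact hyperparameters are not listed in this post):

```python
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["make_name", "model_name", "model_version_name", "gearbox", "chassis"]
numeric = ["age", "engine_size", "mileage"]  # age = year(published_date) - model_year

knn_pipeline = Pipeline([
    ("prepare", ColumnTransformer([
        ("one_hot", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("scale", StandardScaler(), numeric),
    ])),
    ("knn", KNeighborsRegressor(n_neighbors=10)),
])

# knn_pipeline.fit(train[categorical + numeric], train["price"])
# predictions = knn_pipeline.predict(test[categorical + numeric])
```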

We present the results of the KNN model in Table 3. We see an improvement of the Mean APE from 15.0% of the GroupByRegressor to 11.9% for the KNN. Not a bad improvement.

We also trained a gradient boosting machine model, using the XGBoost implementation. For this case we tried something different: we trained one model per unique car model (denoted OPMM below). We do this to avoid having to model interaction effects, i.e. the fact that mileage affects the price of a Ford differently than the price of an Audi, as we observed before. Hence our model uses only the following five features: gearbox, chassis, age, engine_size and mileage. Also included in Table 3 are the scores of an XGBoost regression trained on the entire training set, using all features, with all parameters set to the defaults of the xgboost sklearn interface, but without interaction effects.

Table 3: Regression model results

It’s immediately clear from Table 3 that both KNN and XGBoost were able to beat the results of the benchmark model, the GroupByRegressor. However, the method performing best on this test set, XGBoost, only manages to beat it by 3.5 percentage points in terms of Mean APE. Interestingly, recommending the mean price of the 10 nearest neighbors works surprisingly well, being only 0.5 and 0.06 percentage points behind XGBoost in median and mean absolute percentage error, respectively.
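As a rough illustration of the one-model-per-car-model idea, a sketch might look like this. Grouping on model_name, the dictionary-of-models structure and the dummy-encoding of gearbox and chassis are our assumptions; only the five features and the default xgboost parameters come from the text.

```python
import pandas as pd
from xgboost import XGBRegressor

FEATURES = ["gearbox", "chassis", "age", "engine_size", "mileage"]

def fit_opmm(train: pd.DataFrame) -> dict:
    """Train one default-parameter XGBoost regressor per unique car model."""
    models = {}
    for car_model, group in train.groupby("model_name"):
        X = pd.get_dummies(group[FEATURES], dtype=float)  # encode gearbox/chassis
        models[car_model] = (XGBRegressor().fit(X, group["price"]), X.columns)
    return models

def predict_opmm(models: dict, test: pd.DataFrame) -> pd.Series:
    preds = pd.Series(index=test.index, dtype=float)
    for car_model, group in test.groupby("model_name"):
        if car_model not in models:
            continue  # car model unseen in training: no OPMM prediction
        regressor, columns = models[car_model]
        X = pd.get_dummies(group[FEATURES], dtype=float).reindex(columns=columns, fill_value=0)
        preds.loc[group.index] = regressor.predict(X)
    return preds
```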

Some final words

We now have a rule of thumb for the fall in a car’s price by age and mileage, namely that it halves every four years or after every 85,000 km driven. We also built a model that can, with reasonable accuracy, estimate a car’s price based on that of similar cars.

This price prediction helps users choose the right price for their item, but also helps them figure out how much they should pay for something second-hand. The best news is, we're only scratching the surface in doing this with cars - there's much more work to be done!

Disclosure:

The data in this article is taken from one of Schibsted’s marketplaces, and was extracted in summer 2016. Prices in this article are listing prices, not actual sales prices.


Your age can be accurately predicted from your behavior on classified sites. Here's how...

As Data Scientists at Schibsted, one of the many questions we’re trying to answer is: What does your classified browsing history tell us about you as a person?

The good news is: we have an estimated 200 million unique visitors per month. The bad news is: we don't always know very much about these visitors.

There are two simple reasons for this. First, not all of our visitors have user accounts. Second, not every user logs in when visiting our sites. We know very little about these “anonymous” visitors, so let's call them Strangers.

Some visitors we do know quite a bit about though: those who have logged in with a user account containing information about themselves, e.g., age, gender, and home location. Let’s call these visitors Users.

In our User Modeling team, the data we collect about Users is leveraged to build models which can predict things like age and gender for Strangers. Ideally, all of our Strangers will eventually become Users. Until that happens, these models help us predict the information we’re missing, effectively creating an inferred User Profile.

Ultimately, this inferred information helps us:

Better understand our visitors (“what’s the demographic breakdown of readers of this article?”)

Serve more relevant content (“what content should we show people browsing our sites?”)

Age and Classifieds

Now, let’s dive right into the data. What is immediately visible in the data we collect today, and how can this be used to predict age?

By analyzing the millions of page views generated by users of Finn (our classified site in Norway), we can see how interest in classified categories changes over time:

The proportion of pages visited in each category is represented by the height of the line (the y-axis). This is indexed at 1 so we can compare across categories. Think of the lines as measures of user interest in these different categories, sorted by age.

For instance, the graph shows people in their early 20’s have a high interest in house or flat rentals. It also shows that people in their 50’s are more interested in holiday houses.
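Curves like these can be computed directly from the page-view log. The sketch below assumes a hypothetical table with one row per page view, carrying the visitor's age and the classified category, and interprets "indexed at 1" as dividing each category's curve by its own average share.

```python
import pandas as pd

def indexed_interest(page_views: pd.DataFrame) -> pd.DataFrame:
    """Share of page views per category at each age, indexed so each category averages 1."""
    counts = page_views.groupby(["age", "category"]).size().unstack("category", fill_value=0)
    share = counts.div(counts.sum(axis=1), axis=0)  # proportion of views per category at each age
    return share / share.mean()                     # rescale each category's curve around 1
```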

Intuitively, the graph makes sense, and is explained by underlying facts like this one:

Holiday houses become more interesting almost linearly as you accumulate wealth and approach retirement.

The cool thing is that the data gives us all this information. A machine doesn’t need to know about Norwegian home ownership rates or their linearly increasing need to migrate to warmer temperatures - it can just look at our data!

And this is basically how we build models: just as in the graph above, where we know the ground truth (in this case the age of the users), machine-learning algorithms look at rough data representations to find relationships between interests and age, resulting in a model that can predict age.

If you give the algorithm a visitor’s page view history (a proxy for their interests), it can predict their age with some degree of error.
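A bare-bones version of such a model could be trained like this; the feature table (one column per classified category, holding each User's share of page views in that category) and the choice of a gradient-boosting regressor are our assumptions, made only to illustrate the idea.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def train_age_model(user_features: pd.DataFrame, user_age: pd.Series):
    """Fit a regressor mapping category-view shares of logged-in Users to their age."""
    X_train, X_val, y_train, y_val = train_test_split(
        user_features, user_age, test_size=0.2, random_state=0
    )
    model = GradientBoostingRegressor().fit(X_train, y_train)
    print("validation MAE (years):", mean_absolute_error(y_val, model.predict(X_val)))
    return model

# Strangers get an inferred age from the same feature representation:
# inferred_ages = model.predict(stranger_features)
```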

Comparing Countries

One of the things I love about Schibsted is that we have classified sites all over the world. This means we can compare users in different countries doing the same things: browsing, buying, and selling stuff online.

Let’s compare the interest in “Home for Rent” of Norwegian Finn users to that of French Leboncoin users.

The curves are fascinatingly similar. Both peak in your early twenties, stay high for about three years, then rapidly decline. The French home-renting interest starts just a little bit later - one year, to be exact. This is more or less consistent with the aforementioned statistics (these stats indicate the interest should start two years later though).


VG’s premium digital product, VG+, recently celebrated its fifth anniversary with over 70,000 paying subscribers. VG+ is a five-year-old publication published by a tabloid newspaper with no history of subscriptions. VG+ is now the fourth largest newspaper in Norway, with ambitions to surpass the 100,000 subscriber milestone in 2016. In a country with a population of just over 5 million people, the subscription numbers are starting to be substantial.

VG+ began as an iPad-only product at a time when the iPad was anticipated to save the media industry. “The real game changer didn't come in 2010 with the iPad but came three years earlier with the launch of the smartphone. We just didn't realise it then,” says the chief editor for VG+, Espen Olsen Langfeldt. Today VG+ is more popular on mobile and desktop than on the iPad.

Behind the success of VG+ lies a high-functioning cross-departmental collaboration between the technology, journalism and commercial departments. At the core of the success is the use of data insight to help drive decisions across all departments.

At VG, breaking news is available for free and is ad-supported on our popular website www.vg.no. Premium content is also promoted through the same site and is interspersed among free and open content. Data insight helps determine what content we promote, taking both the commercial and editorial requirements into consideration.

Live data displayed on dashboards used by the editorial department helps determine how long to promote a story, which row on the front page will have the most impact, which stories convert to new subscribers, which stories have high churn, and engagement metrics like session length and time spent consuming content.

We have also used insight to determine which older stories we can promote again and again. A high-converting old story that is not time-limited can be promoted multiple times, as long as it’s still relevant.

Funnel analysis and A/B testing of features help to optimise the product and improve the performance of our authentication system and paywall.

Earlier this year VG reorganised the editorial department: the VG+ and VG Helg (weekend magazine) departments were merged. We now have dedicated reporters who create quality content for all our paid products, both digital and print. One of the key turning points was getting journalists who traditionally only published to print to think about how their story will be published digitally from the moment they begin planning a new story.

Titillating stories may sell well but often result in high churn. Insight into what content creates engaged, loyal customers and what doesn’t work has helped both journalists and editors define the type of content to create and promote, ultimately strengthening the brand and the quality of the product. The most engaging content tends to be stories that trigger an emotional response in our readers or in some way help them improve their lives.

Our commercial department has run an aggressive series of campaigns, learning from and optimising each campaign to determine what campaign period, price range and sales pitch create the greatest number of subscribers.

Our latest 24-hour campaign gave users a 50% reduction on their first subscription and resulted in around 4,000 new subscribers, increasing our subscription base by 5% in one day! Many users took full advantage of the 50% offer and paid for a yearly subscription.

The campaign periods are clearly visible on the graph below as intermittent spikes in subscriber numbers. We do experience a small percentage of new subscribers who churn, immediately turning off auto-renewal after taking advantage of a campaign offer. The subscriber base, however, always continues to grow after an initial period of campaign-related churn.

Campaigns have been advertised through banner advertisements on our own site, Facebook, programmatic buying, and below each of the pluss articles promoted on our free website. The most effective advertisement position has been a text link displayed at the top of our free and open website pushing the campaign. 25% of all summer campaign sales were generated through this top strip text link!

Here is a short video from last year to give you an idea of how data insight is being used across departments working with VG+.

https://www.youtube.com/watch?v=QpNgZ2XAWNA