Oracle Blog

Thoughts, Tips, Rationale

Thursday Nov 17, 2011

Accounting for seasonality presents a challenge for the
accurate prediction of events. Examples of seasonality include:

·Boxed cosmetics sets are more popular during Christmas. They sell at other times of the year, but their sales rise higher than those of other products during the holiday season.

·Interest in a promotion rises around the time its TV advertising airs.

·Interest in the Sports section of a newspaper rises when there is a big football match.

There are several ways of dealing with seasonality in predictions.

Time Windows

If the length of the model time window is short enough relative to the seasonality effect, then the models will see only in-season data, and therefore will be accurate in their predictions. For example, a model with a weekly time window may adapt quickly enough during the holiday season.

In order for time windows to be useful in dealing with
seasonality it is necessary that:

·The time window is significantly shorter than the seasonal changes

·There is enough volume of data in the short time windows to produce an accurate model

An additional issue to consider is that sometimes the season
may have an abrupt end, for example the day after Christmas.

Input Data

If available, it is possible to include the seasonality
effect in the input data for the model. For example the customer record may
include a list of all the promotions advertised in the area of residence.

A model with these inputs will have to learn the effect of the input. It is possible to learn the effect specific to each promotion (and, incidentally, learn about cross-promotion influence) by leaving the list of ads as it is; or it is possible to learn the general effect by having a single flag that indicates whether the promotion is being advertised.
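As a sketch, the two encodings described above might look like the following; the promotion and field names here are hypothetical, for illustration only:

```python
# Hypothetical data: which promotions are currently advertised in the
# customer's area of residence, and the full promotion catalog.
ads_running = ["spring_sale", "new_card_offer"]
all_promotions = ["spring_sale", "new_card_offer", "summer_travel"]

# Option 1: one input per promotion. The model can learn
# promotion-specific effects, including cross-promotion influence.
per_promotion = {"ad_" + p: p in ads_running for p in all_promotions}

# Option 2: a single flag. The model learns only the general effect
# of "the promotion being considered is currently advertised".
def is_advertised(promotion):
    return promotion in ads_running

print(per_promotion)
print(is_advertised("summer_travel"))
```

Option 1 needs more data to converge (one effect per promotion); option 2 converges faster but cannot distinguish between promotions.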

For inputs to properly represent the effect in the model it
is necessary that:

·The model sees enough events with the input present, for example by virtue of the model lifetime (or time window) being long enough to see several “seasons”, or by having enough volume for the model to learn seasonality quickly.

Proportional Frequency

If we create a model that ignores seasonality, it is possible to use that model to predict how a specific person's likelihood differs from the average. If we have a divergence from average, then we can transfer that divergence proportionally to the observed frequency at the time of the prediction.

Definitions:

Ft = trailing average frequency of the event at time “t”. The average is taken over a period long enough to achieve a statistically significant estimate.

F = average frequency as seen by the model.

L = likelihood predicted by the model for a specific person

Lt = predicted likelihood proportionally scaled for time “t”.

If the model is good at predicting deviation from average,
and this holds over the interesting range of seasons, then we can estimate Lt as:

Lt = L * (Ft / F)

Considering that:

L = (L – F) + F

Substituting we get:

Lt = [(L – F) + F] * (Ft / F)

Which simplifies to:

(i) Lt = (L – F) * (Ft / F) + Ft

This last expression can be read as: “The adjusted likelihood at time t is the average likelihood at time t plus the effect from the model, which is calculated as the difference from average times the proportion of frequencies”.

The formula above assumes a linear translation of the
proportion. It is possible to generalize the formula using a factor which we
will call “a” as follows:

(ii) Lt = (L – F) * (Ft / F) * a + Ft

It is also possible to use a formula that does not scale the
difference, like:

(iii) Lt = (L – F) * a + Ft

While these formulas seem reasonable, they should be taken as hypotheses to be proven with empirical data. A theoretical analysis provides the following insights:

·The Cumulative Gains Chart (lift) should stay the same, as at any given time the order of the likelihoods for different customers is preserved

·If Ft is equal to F then the formula reverts to “L”

·If Ft = 0 then Lt in (i) and (ii) is 0

·It is possible for Lt to be above 1
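Formulas (i) through (iii) and the properties just listed can be written down directly; the symbols follow the definitions in the text, and the numeric values below are illustrative only:

```python
import math

def adjust_proportional(L, F, Ft):
    """Formula (i): Lt = (L - F) * (Ft / F) + Ft"""
    return (L - F) * (Ft / F) + Ft

def adjust_scaled(L, F, Ft, a):
    """Formula (ii): Lt = (L - F) * (Ft / F) * a + Ft"""
    return (L - F) * (Ft / F) * a + Ft

def adjust_unscaled(L, F, Ft, a):
    """Formula (iii): Lt = (L - F) * a + Ft"""
    return (L - F) * a + Ft

# If Ft equals F, the formula reverts to L:
assert math.isclose(adjust_proportional(0.05, 0.03, 0.03), 0.05)
# If Ft is 0, Lt in (i) and (ii) is 0:
assert adjust_proportional(0.05, 0.03, 0.0) == 0.0
assert adjust_scaled(0.05, 0.03, 0.0, 0.5) == 0.0
# Lt can exceed 1 when the seasonal frequency is much higher than average:
assert adjust_proportional(0.9, 0.1, 0.5) > 1.0
```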

If it is desired to avoid going over 1 for relatively high base frequencies, it is possible to use a relative interpretation of the multiplicative factor.

For example, if we say that Y is twice as likely as X, then
we can interpret this sentence as:

·If X is 3%, then Y is 6%

·If X is 11%, then Y is 22%

·If X is 70%, then Y is 85%; in this case we interpret “twice as likely” as “half as likely not to happen”
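One possible way to encode this relative interpretation is sketched below; the switch point chosen here (when the scaled value would exceed 1) is an assumption for illustration, not prescribed by the examples above:

```python
import math

def scale_likelihood(x, m):
    """Interpret "m times as likely" so the result stays in [0, 1].

    For low base likelihoods the factor applies directly (m * x); when
    that would exceed 1 it is applied to the complement instead, so
    "twice as likely" becomes "half as likely not to happen". The
    switch point used here (m * x > 1) is one possible choice.
    """
    if m * x <= 1.0:
        return m * x
    return 1.0 - (1.0 - x) / m

# The three examples from the text, with m = 2:
assert math.isclose(scale_likelihood(0.03, 2), 0.06)
assert math.isclose(scale_likelihood(0.11, 2), 0.22)
assert math.isclose(scale_likelihood(0.70, 2), 0.85)
```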

Thursday May 27, 2010

Given enough data that represents the domain well, and models that reflect exactly the decision being optimized, models usually provide good predictions that ensure lift. Nevertheless, sometimes the modeling situation is less than ideal. In this blog entry we explore the problems found in a few such situations and how to avoid them.

1 - The Model does not reflect the problem you are trying to solve

For example, you may be trying to solve the problem "What product should I recommend to this customer?" but your model learns on the problem "Given that a customer has acquired our products, what is the likelihood of each product?". In this case the model you built may be too distant a proxy for the problem you are really trying to solve. What you could do in this case is build a model based on the results of actual recommendations of products to customers. If there is not enough data from actual recommendations, you could use a hybrid approach in which you use the [bad] proxy model until the recommendation model converges.
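A minimal sketch of such a hybrid, assuming a simple linear ramp-up and using the count of positive examples seen by the recommendation model as the convergence signal (both assumptions for illustration, not RTD behavior):

```python
def blended_score(proxy_score, reco_score, reco_positive_count, ramp_up=200):
    """Shift weight from the proxy model to the recommendation model
    as the latter accumulates positive examples and converges.
    ramp_up=200 echoes the rule of thumb of 200 positive examples."""
    w = min(1.0, reco_positive_count / ramp_up)
    return w * reco_score + (1.0 - w) * proxy_score

print(blended_score(0.20, 0.60, 0))    # no recommendation data yet: proxy only
print(blended_score(0.20, 0.60, 100))  # halfway through the ramp-up
print(blended_score(0.20, 0.60, 500))  # recommendation model fully trusted
```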

2 - Data is not predictive enough

If the inputs are not correlated with the output, then the models may be unable to provide good predictions. For example, if the inputs are the phase of the moon and the weather, and the output is which car the customer bought, there may be no correlations to find. In this case you should see a low quality model.

The solution in this case is to include more relevant inputs.

3 - Not enough cases seen

If the training data does not include enough cases, at least 200 positive examples for each output, then the quality of the recommendations may be low.

The obvious solution is to include more data records. If this is not possible, then it may be possible to build a model based on the characteristics of the output choices rather than the choices themselves. For example, instead of using products as output, use the product category, price and brand name, and then combine these models.

4 - Inputs not available at decision time

If the training data includes input values that have changed, or that are available only because the output happened, then you will find strong correlations between the input and the output, but these correlations do not reflect the data you will actually have available at decision (prediction) time. For example, suppose you are building a model to predict whether a web site visitor will succeed in registering, and the inputs include the variable DaysSinceRegistration. If you learn from records where this variable has already been set, you will probably see a big correlation between this variable being zero (or one) and the registration being successful.

The solution is to remove these variables from the input, or to make sure they reflect the value as of the time of the decision, not after the result is known.
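One way to screen for such variables is to look for inputs that are populated almost exclusively when the outcome is positive. The heuristic, thresholds, and field names below are illustrative assumptions, not a complete leakage test:

```python
def suspicious_inputs(rows, outcome_key):
    """Flag fields that are present almost only when the outcome is positive."""
    pos = [r for r in rows if r[outcome_key]]
    neg = [r for r in rows if not r[outcome_key]]
    fields = sorted({k for row in rows for k in row if k != outcome_key})
    flagged = []
    for f in fields:
        present_pos = sum(r.get(f) is not None for r in pos) / max(len(pos), 1)
        present_neg = sum(r.get(f) is not None for r in neg) / max(len(neg), 1)
        # Present for >90% of positives but <10% of negatives: likely
        # set only because the output happened.
        if present_pos > 0.9 and present_neg < 0.1:
            flagged.append(f)
    return flagged

# days_since_registration is set only once registration succeeded,
# so it is flagged; page_views is always available and is not.
rows = [
    {"registered": True,  "days_since_registration": 0,    "page_views": 3},
    {"registered": True,  "days_since_registration": 2,    "page_views": 5},
    {"registered": False, "days_since_registration": None, "page_views": 2},
    {"registered": False, "days_since_registration": None, "page_views": 7},
]
print(suspicious_inputs(rows, "registered"))
```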

Saturday Dec 12, 2009

It has become quite common in RTD implementations to utilize different models to predict the same kind of value in different situations. For example, an RTD application may be used to optimize the presentation of Creatives, where Creatives belong to Offers, which in turn belong to Campaigns, which belong to Products, which belong to Product Lines. In such a case it may be desirable to predict at the different levels and use the models in a waterfall fashion as they converge and become more precise.

Another example is when using more than one Model or algorithm, whether internal to RTD or external.

In all these cases it is interesting to determine which of the models or algorithms is better at predicting the output. While RTD Decision Center provides good Model Quality reports that can be used to evaluate the RTD internal models, the same may not exist for external models. Furthermore, it may be desirable to evaluate the different models on a level playing field, using a single metric to select the "best" algorithm.

One method of achieving this goal is to use an RTD model to perform the evaluation. This pattern is commonly used in Data Mining to "blend" models or create an "ensemble" of models. The idea is to use the predictors as inputs and the normal positive event as the output. When doing this in RTD, the Decision Center Predictiveness report sorts the different predictors by their predictiveness.

To demonstrate this I have created an Inline Service (ILS) whose sole purpose is to evaluate predictors which represent different levels of noise over a basic "perfect" predictor. The attached image represents the result of this ILS.

The "Perfect" predictor is just a normally distributed variable centered at 3% with a standard deviation of 7%, limited to the range 0 to 1. The output variable follows exactly the probability given by the predictor. For example, if the predictor is 13% there is a 13% probability of the positive output.

The other predictors are defined by taking the perfect predictor and adding a noise component. The noise is also normally distributed and has a standard deviation that determines the amount of noise.

For example, the "Noise 1/5" predictor has noise with a standard deviation of 20% (1/5) of the value of the perfect predictor.
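This setup can be reproduced outside RTD with a small simulation. In the sketch below a plain correlation with the outcome stands in for the Decision Center predictiveness measure, which is an assumption for illustration:

```python
import random

random.seed(0)

def clip01(x):
    return min(1.0, max(0.0, x))

# "Perfect" predictor: normally distributed, centered at 3% with a
# standard deviation of 7%, limited to the range 0 to 1.
n = 20000
perfect = [clip01(random.gauss(0.03, 0.07)) for _ in range(n)]
# The output follows exactly the probability given by the predictor.
outcome = [1 if random.random() < p else 0 for p in perfect]

def noisy(predictor, rel_sd):
    # Noise sd is a fraction of the perfect predictor's value,
    # as in the "Noise 1/5" predictor described above.
    return [clip01(p + random.gauss(0.0, rel_sd * p)) for p in predictor]

def corr(xs, ys):
    """Pearson correlation, used here as a stand-in for predictiveness."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

results = {}
for label, rel_sd in [("Perfect", 0.0), ("Noise 1/5", 0.2), ("Noise 1/1", 1.0)]:
    results[label] = corr(noisy(perfect, rel_sd), outcome)
    print(label, round(results[label], 3))
```

As in the Decision Center report, the correlation drops as the noise component grows.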

You can see that the RTD blended model nicely discovers that the more noise there is in the predictor, the less predictive it is.

This kind of blended model can also be used to create a combined model that has the potential of being better than each of the individual models. This is particularly interesting when the different models are really different, for example because of the inputs they use or because of the algorithms used to develop the models.