Sunday, May 10, 2009

Modeling and Prediction Using R - Simplifying Data

Imagine you are an advertising agency responsible for managing the ad inventory of a media house that owns multiple television channels. Your agency has placed metering devices in a sample of homes spread across the geography, each with a people meter that also tracks who is watching the channel. The data is fed into a central system where you collate it. I have chosen the example of a television media house only to illustrate the problem. The approach may not suit this exact scenario, but it may suit other similar ones.

So if there are:

10 different channels

200 different regions - demographics

20 different psychographic profiles

24 hours per day

7 days a week

Then you have 10*200*20*24*7 = 6,720,000 different combinations across which you collate the data. You need to store that volume of data indexed on most of the columns.
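As a quick check of the arithmetic above, here is a minimal Python sketch. The dictionary keys are just illustrative labels for the five dimensions listed; the sizes are the ones from the example.

```python
from math import prod

# Dimension sizes from the example above; the key names are illustrative labels.
dimensions = {
    "channel": 10,
    "region": 200,
    "psychographic_profile": 20,
    "hour_of_day": 24,
    "day_of_week": 7,
}

# Every cell of the collated data is one combination of the five dimensions.
total_combinations = prod(dimensions.values())
print(f"{total_combinations:,}")  # 6,720,000
```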

Let's say an advertiser wants to reach a certain number of eyeballs in a certain psychographic segment, in a certain set of regions, at a certain time of day and day of week. You have to respond by saying whether you have that many ad slots in the required category and whether they are still available (i.e., not yet booked by someone else).

First of all, to calculate the total slots available, you need to break the requirement down along the above 5 dimensions, do a lookup for each matching combination and sum up the results. For an ad suitable for everybody, you need to do 6,720,000 lookups!
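The lookup-and-sum step could look roughly like the sketch below. It assumes the inventory is held as a dictionary keyed by (channel, region, profile, hour, day); the function name total_slots and the tiny inventory are hypothetical and exist only for illustration.

```python
from itertools import product

def total_slots(inventory, channels, regions, profiles, hours, days):
    """Sum the available slots over every combination that matches the request."""
    return sum(
        inventory.get(key, 0)  # one lookup per matching combination
        for key in product(channels, regions, profiles, hours, days)
    )

# Hypothetical toy inventory: 2 channels x 2 regions x 1 profile x 24 hours x 7 days,
# with exactly one free slot in every cell.
inventory = {
    key: 1
    for key in product(range(2), range(2), range(1), range(24), range(7))
}

# Request: any channel, region 0, profile 0, hours 19-21, days 0-4.
requested = total_slots(inventory, range(2), [0], [0], range(19, 22), range(5))

# Request suitable for everybody: every cell has to be visited.
everything = total_slots(inventory, range(2), range(2), range(1), range(24), range(7))
```

The cost of a query grows with the product of the sizes of the requested value sets, which is why the unrestricted request is the worst case.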

Simplifying the problem:

Though we have 24 hours a day, the number of slots available may not vary every hour. For example, the trend could be that the slots fall into the groups 7AM-9AM, 9AM-5PM, 5PM-11PM and 11PM-7AM. We need to be able to discover this trend. The hourly trend may also differ between urban and rural regions. In rural regions, viewership might be higher during the day and fall off earlier in the evening than in urban regions. All urban regions may share one pattern and all rural regions another. The day of week may matter for one psychographic profile and not for another. On top of this, not all patterns may be statistically significant.

Though it seems daunting, once we find all significant patterns, we can reduce the number of dimensions of our data and have a much simpler set to query.

Implementing a solution:

This solution can be implemented using the R statistical package to do part of the heavy lifting.

To capture seasonal trends in viewership, and to smooth out aberrations, we can do a trend analysis of the data across multiple weeks. So if viewership increases towards the festive season, we should be able to capture it through the trend line. R has linear regression functions (such as lm) to help do this.
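To make the idea of a trend line concrete, here is a minimal ordinary least squares sketch in plain Python. It is an illustration of the technique, not the R workflow itself. The function name fit_trend and the weekly figures are made up for the example.

```python
def fit_trend(values):
    """Fit y = intercept + slope * week by ordinary least squares.

    `values` holds one observation per week, for weeks 0, 1, 2, ...
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical weekly viewership for one segment (made-up numbers).
weekly_viewers = [100, 104, 99, 108, 112, 110, 118, 121]

slope, intercept = fit_trend(weekly_viewers)

# Extrapolate the fitted line one week past the observed data.
next_week = intercept + slope * len(weekly_viewers)
print(f"trend: {slope:+.2f} viewers/week, next week ~ {next_week:.0f}")
```

The fitted line averages over week-to-week aberrations, so a single unusual week moves the forecast far less than it would move a naive "same as last week" estimate.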

R has packages for recursive partitioning (rpart, randomForest, etc.). Recursive partitioning can be used to simplify the data and come up with a model that partitions it, with each partition representing one significant segment of our population. Partitions are split only on the most significant parameters. Variations that are not statistically significant, or that are random in nature, are ignored and averaged across.
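The sketch below shows the core of recursive partitioning as a tiny regression tree in plain Python. It is a simplified stand-in, not what rpart does internally. In particular, the min_gain cut-off is an assumed, crude substitute for a proper significance test, and the viewership figures are made up so that only the hour of day matters.

```python
def sse(ys):
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(rows, ys):
    """Return (gain, feature, threshold) for the best binary split, or None."""
    base, best = sse(ys), None
    for f in range(len(rows[0])):
        # Skip the smallest value so that both sides of the split are non-empty.
        for t in sorted({r[f] for r in rows})[1:]:
            left = [y for r, y in zip(rows, ys) if r[f] < t]
            right = [y for r, y in zip(rows, ys) if r[f] >= t]
            gain = base - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def grow(rows, ys, min_gain):
    """Split recursively while the error reduction is at least min_gain."""
    split = best_split(rows, ys)
    if split is None or split[0] < min_gain:
        return {"mean": sum(ys) / len(ys), "n": len(ys)}  # leaf: average it out
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, ys) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, ys) if r[f] >= t]
    return {
        "feature": f,
        "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left], min_gain),
        "right": grow([r for r, _ in right], [y for _, y in right], min_gain),
    }

def leaves(tree):
    if "mean" in tree:
        return [tree]
    return leaves(tree["left"]) + leaves(tree["right"])

def features_used(tree):
    if "mean" in tree:
        return set()
    return {tree["feature"]} | features_used(tree["left"]) | features_used(tree["right"])

def predict(tree, row):
    while "mean" not in tree:
        side = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[side]
    return tree["mean"]

def viewers_at(hour):
    """Made-up viewership that depends only on the hour band."""
    if 7 <= hour < 9:
        return 50
    if 9 <= hour < 17:
        return 20
    if 17 <= hour < 23:
        return 80
    return 5

# Rows are (hour_of_day, day_of_week); the day has no effect in this toy data.
rows = [(h, d) for h in range(24) for d in range(7)]
ys = [viewers_at(h) for h, _ in rows]

tree = grow(rows, ys, min_gain=1.0)
print(len(rows), "cells reduced to", len(leaves(tree)), "partitions")
print("features used for splitting:", features_used(tree))
```

Because the day of week carries no signal here, no split on it ever clears the threshold, so that dimension drops out of the model and the hours collapse into a handful of bands. Note that a band that wraps past midnight (such as 11PM-7AM) shows up as two partitions, since the tree only splits on thresholds along the hour axis.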

R is single-threaded and would not scale to a large number of records. To be able to work on the large volume of data that we may have, we'll have to split the data and have R work on each split in a distributed manner.
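A rough sketch of the split-and-process idea follows, in plain Python. It assumes the data can be partitioned by region and that each partition can be modelled independently. The summarize function is a hypothetical placeholder for whatever per-partition analysis is run, and the thread pool merely stands in for separate worker processes or machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def split_by_region(records):
    """Group (region, hour, viewers) records into one partition per region."""
    parts = defaultdict(list)
    for region, hour, viewers in records:
        parts[region].append((hour, viewers))
    return dict(parts)

def summarize(partition):
    """Placeholder for the per-partition modelling step: mean viewers."""
    return sum(viewers for _, viewers in partition) / len(partition)

# Made-up records: (region, hour_of_day, viewers).
records = [
    ("north", 8, 40),
    ("north", 20, 80),
    ("south", 8, 10),
    ("south", 20, 30),
]

parts = split_by_region(records)

# Each partition is handled independently, so the work can be farmed out.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(zip(parts, pool.map(summarize, parts.values())))

print(results)
```

The key design choice is the partitioning key: it should be a dimension across which the patterns do not need to be compared inside a single model run, otherwise the per-partition results cannot simply be combined afterwards.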