Predictive Analytics in Microsoft Excel: Building Your Own Data Collector

Suppose you supply a product that another company or companies resell on their own websites. Although those companies might share their web traffic information with you, it’s more likely that they regard it as proprietary. This chapter show how to retrieve data from the web and perform analytics without the active cooperation of the source website.

This chapter is from the book

The word analytics connotes, among other notions, the idea that the raw data that gets analyzed includes quantitative measures of online browsing behavior. Data collection instruments such as Omniture and Google Analytics are useful in part because they track a variety of behavior—hits, views, downloads, buys, and so on—along with information about the source of the visit. However, there are times that you want to measure product performance but can’t access the traffic data.

Suppose you supply a product that another company or companies resell on their own websites. Although those companies might share their web traffic information with you, it’s more likely that they regard it as proprietary. In that case, if you want to analyze end users’ purchasing behavior, you might be limited to any data that the resellers’ websites make generally available.

That’s the position that I’m in when it comes to sales of my books by web resellers such as Amazon. Although I hear rumors from time to time that Amazon shares data about actual sales with book authors, suppliers of music, and manufacturers of tangible products, it hasn’t reached me in any particularly useful form. So I have to roll my own.

Fortunately, one of the sales and marketing devices that Amazon employs is product rankings. As to books, there are a couple of different rankings: overall sales, and sales within categories such as Books > Computers & Technology > Microsoft > Applications > Excel. Although as an author I hope that my books achieve a high sales ranking in a category—that gives them greater visibility—I really hope that they achieve a nice, high overall sales ranking, which I believe to bear a closer relationship to actual sales.

So what I want to do is create a means of accessing information about book ratings (including titles that compete with mine), save the data in a form that I can use for analysis, and make inferences about the results of the analysis.

In this chapter, I show you how to do just that: get analytics without the active cooperation of the website. The techniques don’t require that you be interested in book sales. They can be used for anything from product sales to stock prices to yards gained per pass attempt.

Planning an Approach

Before you undertake something like the project I’m about to describe, there are a few issues you should keep in mind. If you can’t think of a satisfactory way to deal with them, you might want to consider taking a completely different approach.

A Meaningful Variable

Most important is the availability of one or more variables on a website that bear on what you’re interested in, even if only indirectly.

For example, Amazon does not publish on a publically accessible web page how many copies of a book it has sold, whether actual, physical books or downloaded electronic copies. But as I mentioned in the prior section, Amazon does publish sales rankings. For the project described here, I decided that I could live with the sales rankings as an indicator of sales figures.

I also learned that although Amazon usually updates the sales ranking every hour, sometimes the updates don’t take place. Sometimes they’re a little late. Sometimes several hours pass without any update occurring. And sometimes the rankings don’t entirely make sense; particularly in the wee hours, a ranking can drift from, say, 20,000 at 4:00 a.m. to 19,995 at 5:00 a.m. to 19,990 at 6:00 a.m. and so on. That kind of movement can’t reflect sales, and the deltas are much smaller and more regular than at other times of day. But I found that most of the time the updates take place hourly—give or take a couple of minutes—and correspond either to no sales (the rankings get larger) or to a presumed sale (the rankings get smaller, often by a generous amount).

Identifying Sales

I decided that I could live with changes in rankings as a stand-in for actual sales. I also decided that I could make an assumption about an increase in ranking, such as from 25,000 to 20,000. That’s a big enough jump that I can assume a sale took place. I can’t tell for sure how many units were sold. But these books don’t sell like a Stieg Larsson novel. Amazon sells four or five copies of one of my books each day. So when I see an improvement in ranking of a few thousand ranks, it almost certainly indicates the sale of a single unit.

Given that information (or, maybe more accurately, educated guesses), it’s possible to get a handle on what you need to include in a workbook to access, analyze, and synthesize the data sensibly.