A public diary on themes around my books

August 14, 2006

A billion dollar question

This weekend I ran a session at the "SciFoo" camp (an interdisciplinary meeting of scientists and technologists) at Google that was focused on an interesting statistical problem in the Long Tail. I've long argued that the "natural" shape of most markets is a powerlaw, and that any deviation from that shape is due to some bottleneck in distribution. Get rid of the bottleneck and you can tap the latent demand in the market, unlocking the potential of the Long Tail.

The usual example I give is this one, which shows US box office revenues over a three-year period (2003-2005). Remember that a powerlaw looks like a straight line in a log-log plot, so the key part of the chart below is where the real-world data drops off the line (around rank 350).
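To make the log-log claim concrete, here is a minimal Python sketch. The revenue scale, the exponent of 1.2 and the shape of the bottleneck are invented for illustration (they are not the box office data). A pure powerlaw has the same log-log slope everywhere, while a market that runs out of capacity after rank 350 bends below the line.

```python
import math

def loglog_slope(points):
    """Least-squares slope of log(value) against log(rank) for (rank, value) pairs."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(value) for _, value in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Assumed, made-up parameters: revenue = SCALE * rank ** -EXPONENT
SCALE, EXPONENT, CAPACITY = 400e6, 1.2, 350

def powerlaw_revenue(rank):
    return SCALE * rank ** -EXPONENT

def bottlenecked_revenue(rank):
    # Hypothetical bottleneck: past CAPACITY, revenue decays exponentially
    # on top of the powerlaw.
    penalty = math.exp(-(rank - CAPACITY) / 100) if rank > CAPACITY else 1.0
    return powerlaw_revenue(rank) * penalty

ranks = range(1, 1001)
pure = [(r, powerlaw_revenue(r)) for r in ranks]
capped = [(r, bottlenecked_revenue(r)) for r in ranks]

head_slope = loglog_slope(pure[:350])            # ranks 1..350
tail_slope = loglog_slope(pure[350:])            # ranks 351..1000
capped_tail_slope = loglog_slope(capped[350:])   # same ranks, with the bottleneck

print(f"pure head {head_slope:.2f}, pure tail {tail_slope:.2f}, "
      f"capped tail {capped_tail_slope:.2f}")
```

The head and tail slopes of the pure curve agree, which is what "straight line" means here. The capped curve's tail slope is much steeper, which is the drop-off described above.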

In this case, the explanation for the fall-off is simple: they just ran out of screens. The carrying capacity of the US megaplex theater network is about 100 films per year, or 300 over three years. Over the same period about 13,000 films were shown in film festivals, although only a tiny fraction of them got mainstream commercial distribution. But if you could distribute niche films as easily as the blockbusters, the curve would look like the straight line predicted by the theory, which is, as it happens, exactly what we see with the Netflix data.

However, there is another common distribution that looks a lot like a powerlaw at first, but then deviates from the straight line on its own, even without scarcity effects and other distortions. It's called a lognormal distribution and it looks a little like this:

How can you tell one from the other? This is obviously an important question, since if the theory has any real predictive power, you've got to be able to say ahead of time whether the "natural" shape of the market is a powerlaw or a lognormal. Otherwise you can't tell if the fall-off is due to a removable bottleneck, such as inefficient distribution, or not.
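As a rough illustration of why the two are easy to confuse (not a method for settling the question on real, noisy data), the Python sketch below builds an idealized lognormal rank curve and measures its log-log slope in the head and in the tail. The market size and parameters (N, MU, SIGMA) and the powerlaw exponent are assumptions chosen for the example.

```python
import math
from statistics import NormalDist

N, MU, SIGMA = 10_000, 0.0, 2.0   # assumed values, for illustration only
STANDARD_NORMAL = NormalDist()

def lognormal_value_at_rank(rank):
    """Idealized size of the item at a given rank (1 = largest) in a lognormal market."""
    upper_tail = (rank - 0.5) / N              # share of items at least this large
    z = STANDARD_NORMAL.inv_cdf(1.0 - upper_tail)
    return math.exp(MU + SIGMA * z)

def powerlaw_value_at_rank(rank):
    return 1000.0 * rank ** -1.0               # assumed scale and exponent

def local_slope(value_at_rank, rank_lo, rank_hi):
    """Slope of the straight line joining two points of the curve on log-log axes."""
    rise = math.log(value_at_rank(rank_hi)) - math.log(value_at_rank(rank_lo))
    run = math.log(rank_hi) - math.log(rank_lo)
    return rise / run

lognormal_head = local_slope(lognormal_value_at_rank, 1, 100)
lognormal_tail = local_slope(lognormal_value_at_rank, 5_000, 10_000)
powerlaw_head = local_slope(powerlaw_value_at_rank, 1, 100)
powerlaw_tail = local_slope(powerlaw_value_at_rank, 5_000, 10_000)

print(f"lognormal: head {lognormal_head:.2f}, tail {lognormal_tail:.2f}")
print(f"powerlaw:  head {powerlaw_head:.2f}, tail {powerlaw_tail:.2f}")
```

The powerlaw keeps one slope throughout, while the lognormal starts shallow and steepens on its own. If you only see the head, the lognormal can pass for a powerlaw, and the tail is exactly where bottleneck effects would also show up.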

The difference between those two curves is the subject of a lot of research at the cutting edge of complexity theory, and the simple answer seems to be that it comes down to the nature of the network effects that create unequal ("rich get richer") distributions such as the powerlaw and lognormal in the first place. I lay out the basics of that research in this presentation, which I gave on Saturday at Google. (Make sure you're displaying the notes field in that PowerPoint so you can read my narration.)

One final note: This doesn't affect any of the conclusions of the book, which are based on real-world data rather than predictions (although I discuss the problem briefly in the notes on page 229). Instead, it's an issue when people attempt to make predictions based on the theory, such as estimating the latent value of a television or film archive based on the assumption that the natural shape of demand would be a powerlaw. But billions of dollars in valuation depend on getting that calculation right. If you're into this stuff, check out the presentation and see if you can see a path to a solution.

UPDATE: See related discussion: Jakob Nielsen analyzes the observed fall-off in web traffic on one site here. Chris Edwards gives another perspective on that here. And the always astute Nick Carr is following the conversations and adds commentary here.

In the academic world, there are good posts here ("how not to fit a straight line") and here (powerlaws versus lognormals in web tagging). Meanwhile, the two must-read papers in this domain are:

TrackBack

» The shape of the tail from Rough Type: Nicholas Carr's Blog
When it comes to evaluating a tail, which matters more: its length or its shape? The answer, of course, is largely a matter of personal taste. But Douglas Galbi argues, compellingly, that we have been so focused on long tails and short tails that we ha...

Comments

A really great example of a distribution bottleneck, and one that I'm particularly interested in, is the art market. If you're ever going around the galleries up there in SF, ask some of them how many artists they can represent properly. An honest answer will probably be 12 to 20 at best. So with maybe two dozen serious art galleries in the city and thousands of aspiring artists, "shelfspace" becomes a huge issue. It's exponentially worse here in LA, and New York is beyond description.

I'm curious to see how Long Tail economics can apply to this market. Lots of people look at art online, but except for posters, mostly they still go to galleries to buy. Unless artists put their content into purely digital form, in which case I'm not sure if anyone would buy it, it seems like we're still stuck with the bottleneck.

Very interesting; I'm glad you posted on this. When you gave your talk at Amazon and showed this slide, my first thought was that the curve looked so regular that it could easily be some different distribution.

I don't personally have the stats mojo to solve this, but if you find the answer someday, please post about it.

Techies may know that there is special graph paper that will plot data that is normally distributed as a straight line. A bunch of years ago I wrote plotting software that could do the same for any distribution. It was originally intended for log-normal distributions.

The way it worked was that you gave it a function that computed the density distribution. It then numerically integrated this to get the cumulative distribution and did inverse linear interpolation to make the graph paper and plot the data.
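The steps described above can be sketched in Python roughly as follows. This is a reconstruction under assumptions, not the original program: the function names, the trapezoid rule, the integration range and the sample values are all made up for illustration.

```python
import bisect
import math

def tabulate_cdf(density, lo, hi, steps=4000):
    """Integrate the density over [lo, hi] with the trapezoid rule; return (xs, cdf)."""
    width = (hi - lo) / steps
    xs = [lo + i * width for i in range(steps + 1)]
    cdf = [0.0]
    for left, right in zip(xs, xs[1:]):
        cdf.append(cdf[-1] + 0.5 * (density(left) + density(right)) * width)
    total = cdf[-1]
    return xs, [c / total for c in cdf]   # renormalise so the table ends at 1

def quantile(xs, cdf, p):
    """Inverse linear interpolation: the x at which the tabulated CDF reaches p."""
    i = bisect.bisect_left(cdf, p)
    if i == 0:
        return xs[0]
    if i == len(cdf):
        return xs[-1]
    fraction = (p - cdf[i - 1]) / (cdf[i] - cdf[i - 1])
    return xs[i - 1] + fraction * (xs[i] - xs[i - 1])

def paper_points(data, xs, cdf):
    """Pair each sorted observation with the axis position the 'paper' gives it."""
    ordered = sorted(data)
    n = len(ordered)
    return [(quantile(xs, cdf, (k + 0.5) / n), value)
            for k, value in enumerate(ordered)]

def standard_normal_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

xs, cdf = tabulate_cdf(standard_normal_density, -8.0, 8.0)

# Hypothetical observations. For a log-normal check, plot their logarithms
# on the normal "paper": log-normal data should fall close to a straight line.
sample = [3.1, 0.4, 12.0, 1.7, 0.9]
points = paper_points([math.log(v) for v in sample], xs, cdf)
```

Like the program described, this only lays out the paper and places the data on it. It does not fit parameters.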

If I still had the source (long gone), I could take data and see how log-normal it is.

The program did NOT try to find the parameters that would make a distribution best fit the data.

To visually assess whether a data set follows a normal distribution, plot a QQ chart. (See example of usability data that's indeed very close to a normal distribution, though websites usually have a few outliers.)
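A minimal Python sketch of the QQ idea, using only the standard library. The sample is synthetic (seeded random lognormal draws, an assumption for the example), and the correlation of the QQ points is used here as a crude stand-in for eyeballing how straight the chart is.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """(theoretical normal quantile, observed value) pairs for a normal QQ chart."""
    ordered = sorted(data)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    return [(reference.inv_cdf((k + 0.5) / n), value)
            for k, value in enumerate(ordered)]

def straightness(points):
    """Pearson correlation of the QQ points; values near 1 mean a nearly straight line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(0)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(500)]  # synthetic data

raw_r = straightness(qq_points(sample))
log_r = straightness(qq_points([math.log(v) for v in sample]))
print(f"raw data: {raw_r:.3f}, log of data: {log_r:.3f}")
```

Lognormal data bends away from the line on a normal QQ chart, but its logarithm follows the line closely, so taking logs first is a quick way to check for lognormality.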

One reason for the difference is that distribution isn't the only limiting factor. Imagine a society that has few car dealerships and a median income of $300 a year. Beyond a certain point, no improvement in distribution is going to sell a new car to someone who can't afford it. The same can be said about some item, say a riding power lawn mower, that only a certain number of people can use--those with large lawns.

If you want an example of a rather stupid, artificially created limit on distribution, read what the guy above wrote about the limited number of art galleries and then take a look at Washington State.

Children's Hospital in Seattle was placing art on its walls and selling it. That seemed a win/win situation for everyone. People got art, while the hospital and artist got much needed money. Poor kids got medical treatment. The state even got to collect sales tax. Then the state property tax assessors stepped in and said that art on the wall (unlike gifts in the gift store or food in the cafeteria) had nothing to do with the non-profit purpose of the hospital. If the hospital wanted to sell art, it had to pay a much higher for-profit property tax for the walls where it sells art. (How do you measure the horizontal square footage of a vertical wall? Is the commercial space just that of the painting, or is it the space of someone standing in front of the painting? How big is that person and should the taxes allow for two people?) Rather than face that sort of messy accounting, the hospital and a number of other non-profits, including my church, quit displaying art for sale.

The result is, of course, a lose/lose situation for everyone. Perhaps it would help if someone would explain the long tail to the twits in our state property tax assessors office.

My master's research in estimate uncertainty suggested that the lognormal BIVARIATE distribution does a good job in modeling the uncertainty associated with the ratio of the actual to the estimate (ACT/EST). I'm posting this out of curiosity about whether the results we found for the lognormal bivariate would support use of the lognormal model for market estimates, such as latent value. We performed curve fitting against historical data across a range of product types and levels of risk within one organization. Since market valuations are predictions, the lognormal for the ratio of the ACT/EST may be more in line with real-world scenarios (like a limited number of screens).

Michael Mitzenmacher here. I'd like to thank you for the compliment of mentioning my survey article. I thought I'd shamelessly self-promote and mention I have an editorial on the future of power law research and the difficult problem of figuring out when you really have a process following a power law in the current issue of Internet Mathematics. The article can also be found on my Web page at http://www.eecs.harvard.edu/~michaelm/ListByYear.html in pdf format. (Feel free to look at my other work as well!)

In a less self-promoting vein, there's another very good article on power laws in this issue of Internet Mathematics, and I encourage any academics or researchers out there to encourage your library to subscribe to this new journal.

Why on earth do you generate all your diagrams with Rank on the bottom axis? No one in the real world does that. We all use probability density and (cumulative) distribution graphs. And 'we' doesn't mean hardcore statisticians and the like. 'We' is everyone who has ever seen the infamous bell curve (i.e. normal distribution), and is being forced by you to compare it to a ranked curve.

This is great. I learned about unconferences a couple years ago, and I got to participate in one recently at WikiSym. I even tried to run one as part of a community event ... but only 24 people came at all, so you can imagine what the unformatted sessions were like.

This is a very important way of communicating. Although it is somewhat demoralizing to list an idea that no one wants to talk about, the small group format is ideal for brainstorming. And the lack of an Expert role in the conversation is so important.

Did you notice any differences with regard to quality of conversation and group size for each topic? It would be an interesting study to see if there is a kind of self-organizing maximum that would surface, or if people would be willing to talk in very large groups because they are more attached to the topic.
