Histograms

I've never quite understood why someone might use a histogram with varying class width if they have sufficient data to produce a standard bar chart with constant class width.

Take, as an example, the attached image. If we assume that they had enough data to produce a histogram with constant class width, why might they choose to vary it? They appear to actually lose information by doing so...

I see what you mean...you would usually use a range of data when you don't have enough to look at each class individually. But that's not always true...

What if a company didn't really care about studying each best seller individually, but just wanted to know which were its top 3 best sellers for the year? That would be ideal for a histogram as opposed to a bar graph.

Histograms can be overlaid with, or approximated by, a smooth curve (a normal curve, if the data are roughly normal). By viewing a histogram one can visually glean the mean and standard deviation, as well as skewness. Histograms are best used when graphing continuous, incremental data (such as time intervals, flow or failure rates, etc.), specifically where there is a clear expected value.

Bar charts are just graphical representations of groupings. They are most useful for giving a visual representation of categorical data that cannot be averaged or calculated.

Looking at that graph I can see that the average commute time is 20 - 25 minutes, and that about 10% of people have a commute time of 60 - 90 minutes. This kind of information cannot be gleaned from a bar chart.
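For what it's worth, here is a rough sketch of how you might pull an approximate mean out of grouped data with unequal class widths, and how the bar heights are normally scaled (count divided by width, so that area rather than height stands for the count). The bin edges and counts below are made up for illustration; they are not read off the attached graph, and the function names are my own.

```python
def grouped_mean(edges, counts):
    """Approximate the mean of binned data by treating every
    observation as if it sat at the midpoint of its bin."""
    if len(edges) != len(counts) + 1:
        raise ValueError("need exactly one more edge than counts")
    midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
    return sum(m * c for m, c in zip(midpoints, counts)) / sum(counts)


def frequency_densities(edges, counts):
    """Bar heights for unequal bins: count per unit of bin width."""
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    return [c / w for c, w in zip(counts, widths)]


# Hypothetical commute-time bins (minutes) and counts, for illustration only.
edges = [0, 10, 20, 30, 60, 90]
counts = [20, 30, 30, 15, 5]

print(grouped_mean(edges, counts))         # approximate mean in minutes
print(frequency_densities(edges, counts))  # bar heights, people per minute
```

The midpoint trick is only an approximation, and it gets cruder as the bins get wider, which is one price you pay for merging classes.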

Sure you could convert this to a bar chart and make similar statements - but a bar chart is not really a statistical model. Bar charts are best used for "What brand of car do you use to commute to work with?" Then try to find the mean of "Honda" - it just can't be done.

I think varying the width is necessary because histograms are based on real data, and each bar effectively has a CONFIDENCE INTERVAL. If a bar contains enough cases, its confidence interval is small (small enough that you don't even bother to draw it). If you lack data and draw a fixed-width histogram anyway, some bars can end up with confidence intervals several times larger than the height of the bar. Such a histogram is neither informative nor clear, since some of the columns are effectively random numbers with no statistical significance. The sum of several such bars is more meaningful, though, because a sum of random numbers is relatively "less random" than any single term of the sum. Big bars have this property simply because they are big, i.e. statistically significant.

So in your case, to create a detailed fixed-width histogram you would have to count more cases, which may be too expensive. Moreover, once you had counted more cases, you could plot the well-populated regions with even thinner "good" columns (each of which still summarizes a lot of data), so plotting everything at one fixed width would again lose information.

This is similar to the story about an inventor. He pointed out that you never use a pencil up completely, so there is no need for the lead to run the full length. He got a bonus for the invention, and the factory started making pencils with a lead 1 inch shorter than the pencil itself. A year later he said: our pencils have an extra inch of wood that costs about $0.0001, and we produce about 1 billion pencils each year, so if we cut off that inch of wood we will save $100,000 each year. And he got a bonus once again.

It is the same story with a histogram: large bars can be drawn thinner than small ones if you try to give every bar approximately the same confidence interval. But a histogram drawn that way is awkward to read. If you put aesthetics first, you give up the better resolution you could have had. In other words, in your case, to get equal widths you would have to merge the thin columns until they became as wide as the thick ones.
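The merging idea can be sketched in a few lines. This is only one simple way to do it (a greedy left-to-right pass), under the assumption that you start from fine fixed-width bins and want every final bin to hold at least some minimum count; the function name, the threshold and the sample counts are all hypothetical.

```python
def merge_sparse_bins(edges, counts, min_count):
    """Greedily merge neighbouring bins, left to right, until every
    merged bin holds at least min_count observations."""
    if len(edges) != len(counts) + 1:
        raise ValueError("need exactly one more edge than counts")
    new_edges = [edges[0]]
    new_counts = []
    running = 0
    for right_edge, c in zip(edges[1:], counts):
        running += c
        if running >= min_count:
            # Enough data accumulated: close the current merged bin here.
            new_edges.append(right_edge)
            new_counts.append(running)
            running = 0
    if new_edges[-1] != edges[-1]:
        # A sparse tail is left over: fold it into the last merged bin,
        # or keep it as the only bin if nothing reached the threshold.
        if new_counts:
            new_counts[-1] += running
            new_edges[-1] = edges[-1]
        else:
            new_counts.append(running)
            new_edges.append(edges[-1])
    return new_edges, new_counts


# Hypothetical fixed-width bins (minutes) with a thin right-hand tail.
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
counts = [120, 300, 250, 150, 90, 40, 25, 10, 15]
merged_edges, merged_counts = merge_sparse_bins(edges, counts, min_count=50)
print(merged_edges)
print(merged_counts)
```

The well-populated bins on the left keep their original width, while the thin tail collapses into one wide class, which is the kind of shape the graph in the question has.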

Part of the problem here is that the graph itself isn't clearly described.

What does "Travel time, minutes" mean? Is it a precise, repeatable number? When 1000 are grouped at, say, 40-45 minutes, does it mean that they took 40-45 minutes in their attempt to travel a fixed distance? Or is the average time of a number of trips for each person found to be in the 40-45 minute slot?

As it happens, my current travel time home-to-office is approx 53 minutes. But there is quite a range of times for me over a number of trips.

Since I hit a few highways with significant traffic each trip, and since weather can be so variable, and since a variety of military installations are clustered around that bring the unpredictable convoys that can snarl traffic, my "range" of times is pretty wide and it's regularly unpredictable. It might be that as travel times (for this graph) get longer, the variability also becomes wider.

I'd suspect that this could relate to any 'confidence interval' as mentioned above. A longer travel time brings more opportunity for variability.

When one bar holds 100 experiments out of a total of 1000, it means that if we run 1000 experiments again the next day, we'll get roughly 90 to 110 (for example) in that particular bar. That is fine and doesn't contradict the figure of 100.

If a bar contains 1 experiment out of 1000, the next time it may contain 0 or it may contain 3. To prevent this, we merge a few neighboring bars whose heights are extremely small (and essentially random).
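To put rough numbers on that, here is a small sketch that assumes each experiment lands in a given bar independently with a fixed probability, so the bar's count follows a binomial distribution. That assumption is mine, not something stated about the data behind the graph, and the function name is just for illustration.

```python
import math


def bin_count_spread(n_total, expected_in_bin):
    """Standard deviation of a bar's count, and that deviation relative
    to the expected count, under a binomial model (an assumption)."""
    p = expected_in_bin / n_total
    sd = math.sqrt(n_total * p * (1 - p))
    return sd, sd / expected_in_bin


sd_big, rel_big = bin_count_spread(1000, 100)   # a well-filled bar
sd_small, rel_small = bin_count_spread(1000, 1) # a nearly empty bar

print(f"100 of 1000: sd = {sd_big:.1f}, relative = {rel_big:.0%}")
print(f"  1 of 1000: sd = {sd_small:.1f}, relative = {rel_small:.0%}")
```

Under this model the well-filled bar wobbles by about a tenth of its height from one run to the next, while the nearly empty bar wobbles by about its whole height, which is why merging the sparse ones makes sense.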