3.14

All those books that Google has been scanning for the past ten years are surprisingly rich in numbers as well as words. The Google Books data set released last December by a Harvard-Google team includes (by my count) 9,620,835,344 occurrences of 458,794 distinct numbers. (Plus another 31,293 numeric values that have dollar signs attached.)

In recognition of pi day, I want to zero in on some successive approximations to the world’s favorite irrational:

In tabular form here are the closest approximations found in the files, along with the abundance of each value:

3.141592

704

3.1415923

80

3.1415926

1141

3.14159265

1300

3.141592653

143

3.1415926535

286

3.141592653589

54

3.14159265358979

338

3.141592653589793

453

3.1415926535898

65

3.14159265359

177

3.1415926536

289

3.141592654

512

3.1415927

776

3.1415928

109

3.1415929

133

3.141593

843

The data set includes only items that appear at least 40 times in the collection of scanned volumes. Closer approximations to pi evidently fell below that threshold. In particular there is no sign of William Shanks’s famous 707-digit calculation, which was published in 1873. So, just for the sake of celebrating 3.14, here are 707 digits of pi—but unlike the product of Shanks’s many years of labor, I think these digits may be correct:

9 Responses to 3.14

That’s an interesting place to stop, as printing 707 digits gives “…20200″, but printing 708 digits gives “…201996″. It’s a rare (well, one in a hundred, I suppose) occurrence that three digits depend on your choice of rounding.

I would have expected a lot more occurrences of pi related values. Did your search allow for 1 (one) being scanned as l (ell), 0 (zero) as o or O (ohh)? I found that this made a big difference when matching hexadecimal literals

Trying to measure the impact of OCR errors in the Google Books data set is very much on my mind. But getting reliable answers is not easy. You might think we could just compare the frequency of 3.1416 with all the plausible variants: 3.l416, 3.i416, 3.i4l6, 3.!4ib, &.iAlh, and so on. There are at least two problems with this approach. The first is that entropy works against us: There’s only one correct reading, but there are exponentially many possible errors. Thus even if the total number of erroneous readings is large, any one variant is likely to be rare and therefore unlikely to reach the 40-count threshold for inclusion in the data set.

In this particular case, the situation is even worse. The OCR algorithm reads a string of decimal digits with a single noninitial period as a number, but in any other context a period is taken to be a word separator. Thus “3.1416″ –> “3.1416″ but “3.14i6″ –> “3″ “.” “14i6″.

With integers we have a better chance of learning something. I have not done any sort of comprehensive study, but here’s a single example, plucked more or less at random—the number of occurrences for a few likely misreadings of “1997″

"i997" 10835
"I997" 19602
"l997" 171548
"L997" 114

That’s more than 200,000 errors (assuming they are all in fact OCR failures, which is probably not far from the truth). This sounds horrendous. On the other hand, “1997″ itself occurs 19,761,552 times, so we’re looking at errors at the 1 percent level. Is that something to worry about? I just don’t know for sure. It probably depends on what you want to do with the data.

Thanks very much for that pointer to your posting on hex strings in the Google Books data. We must try roman numerals!

How about fractional approximations: 22/7, 335/113 and possibly 314/100? In particular, I expected 22/7 to show up on the graph, but it seems indistinguishable from the rest of the distribution.

Another curiosity on the graph: numbers rounded to 1 decimal place are 10-20 times as common as those rounded to 2 decimal places (the pattern of the terminal digits shows that terminal zeros are dropped), but these are about 40 times as common as numbers rounded to 3 decimal places. Perhaps section numbers are distorting the results.

To verify the 707 digits are correct, you should also publish a checksum. That way when I do the calculation, I can compare. I’ve seen several pi approximations, what are the correct float, double, and long double values? The int one is 3.

Characters such as & and ! are each treated as a single entity so the number of possible combinations is not that great.

What made you pick on the 40 per book threshold, it sounds a bit high? I suspect 3.14 is more likely to refer to a section number than a poor representation of pi and perhaps have a handful of occurrences.

I have found a worryingly wide number of different representation of pi in source code.

Roman numerals would be interesting; I would expect the usage to decrease over time. Another possibility is to search for incorrect values, such as the 3.1459265 I found in Mifit’s core/Helix.cpp

When talking about the number of digits after the decimal point so the “3.” does not get counted. The 706 digits which ended with “200” should have been left at “1995” without rounding up.

So, which book contained the best approximation of pi? The key board here is book no web location. I have heard of a book with 2,000,000 digits of pi. It was known as the most boring book in the world. I have printed the first 1,000,000 digits of pi for the fun of it.

If you want to find out more about the number William Shanks produced with the error that stood for 72 years, Google “William Shanks 707 digits” for my file. William Shanks calculated 709 and rounded down to 707 digits