Explorations

The Empire State Building always brings to mind that iconic image of King Kong atop the skyscraper, swatting away biplanes as he clutches Fay Wray in his massive hand. But how massive? How much would the 1933 version of the mighty Kong have weighed?

King Kong, the 1933 movie

IMDb provides some relevant information on scale. Apparently, the size of the enormous ape varies from location to location and scene to scene. The publicity described Kong as 50 feet tall, but the sets in the jungle of his home island were consistent with an 18 ft beast. The models for close-up photography of his hand were built to a scale that would fit a 40 ft animal, and the New York scenes were consistent with a 24 ft scale. Since it was the image of Kong on the Empire State Building that sparked this thought, let’s go with that figure, and treat him as 7.32 m tall.

If we take a western gorilla as the model for Kong when calculating height/weight ratios, we can scale up the height and use the square-cube law to scale up the weight: mass grows with volume, and so with the cube of height. A very large gorilla of this species would be around 1.8 m tall and would weigh about 230 kg. So Kong was just over 4 times as tall as a very large gorilla (7.32 / 1.8 is about 4.07), and using the cube of that ratio to scale his weight, we need a factor of 67.25 to give us a final mass of just under 15,500 kilograms. Does this seem reasonable? Three times as big as an elephant? I guess it does.
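The scaling above can be checked in a few lines of R. This is only a sketch of the arithmetic using the figures quoted in the text; the variable names are my own.

```r
# Square-cube scaling: mass grows with the cube of the linear scale factor.
gorilla_height_m <- 1.8    # very large western gorilla (figure from the text)
gorilla_mass_kg  <- 230
kong_height_m    <- 7.32   # 24 ft, the New York scale

height_ratio <- kong_height_m / gorilla_height_m   # linear scale factor
mass_factor  <- height_ratio^3                     # volume (and so mass) factor
kong_mass_kg <- gorilla_mass_kg * mass_factor

round(height_ratio, 2)   # just over 4
round(mass_factor, 2)    # 67.25
round(kong_mass_kg)      # just under 15,500
```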

It’s entirely feasible that Kong would be scornful of the aircraft, since the planes used in the scene were Curtiss O2C-2 'Helldivers', which have a gross mass of a little over 2000 kg, around one eighth of his weight. But those annoying planes, equipped as they are with machine guns, finally cause the mighty Kong to lose his grip and tumble the 381 metres (52 times his own height) to the street below. And compared to that iconic building, the giant ape comes in at less than 1/20,000 of the mass of the Empire State Building itself, which is estimated to weigh 331 million kilograms (about 331,000 tonnes).

I've recently been taking forward an idea that's been in the back of my mind for a while: www.IsThatABigNumber.com is a website that has a simple aim: to put big numbers in context, and in so doing, start to develop a more intuitive feel for them.

While I can intellectually understand the meaning of large numbers, typically written in scientific notation (e.g., 2.5 x 10^8) or expressed in billions and trillions, that's not quite the same as having a "feeling" for very large numbers. In fact, when I really think about it, I think my sense of comfort with numbers runs out somewhere around the 1000 mark. That is, I think I can visualise 1000 items without things becoming blurry, but not much more than that. But that is another blog post for another day.

The topic for today is how we talk about numbers. The website IsThatABigNumber.com is all about numbers, and the expression of those numbers needs to be clear and comprehensible.

Take measurements of length: I was taught about the SI system, based on meters, kilograms and seconds. Now for scientists and engineers, it's perfectly fine to talk about 4 x 10^7 m. It's convenient for calculations and it's the proper thing to do. But if I want to explain how long the equator is, I want to talk about 40 thousand kilometers instead.

Because? Because that's the way folk talk. Not 4 x 10^7 m; not 40 Megameters; not even 40 million meters. In my mind, things that can be measured using "meters" as the unit range from a bit less than one meter to somewhat more than a thousand. Half a meter? 0.5m is just fine; a 10,000m race? That's fine too. 50,000m? Nah, I'm better with 50km; 0.02m? Nope, give me 2cm or 20mm.

So, here are some of the principles that I am using for IsThatABigNumber:

For all numbers:

Numbers are expressed in three parts: a base magnitude between 1 and 1000, followed by a multiple, and where needed, a unit. So the population of the world is expressed as 7 billion, not 7,000,000,000 (all those zeroes? too hard to grok)

The multiple used is based around powers of 1000, with the exception that ...

"12,500" is more natural than "12.5 thousand", so for numbers in the 1000 - 999,999 range, we make an exception and use numerals

But "12.5 million" is more natural than 12,500,000, so for a million and beyond, we use "*illion" words, to the limit of septillion - 10^24 (and I struggle with septillion!)

Beyond septillion, fall back to scientific notation starting with 10^27. In this area, the game is pretty much out of the hands of "folk", and in the hands of the scientists.
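The number rules above could be sketched in R roughly as follows. This is a minimal illustration, not the site's actual code: the function name `format_big_number` is hypothetical, rounding to three significant figures is my assumption, and the scientific-notation branch uses R's own `e+` style rather than a "10^27" rendering.

```r
# Hypothetical helper: express a number as "base magnitude + multiple".
format_big_number <- function(x) {
  illions    <- c("million", "billion", "trillion", "quadrillion",
                  "quintillion", "sextillion", "septillion")
  thresholds <- c(1e6, 1e9, 1e12, 1e15, 1e18, 1e21, 1e24)

  if (x < 1e6) {
    # Below a million: plain numerals with thousands separators, e.g. "12,500"
    return(format(x, big.mark = ",", scientific = FALSE, trim = TRUE))
  }
  if (x >= 1e27) {
    # Beyond septillion: fall back to scientific notation
    return(format(x, scientific = TRUE, digits = 3))
  }
  # Pick the largest power of 1000 that leaves a base magnitude of at least 1
  i    <- findInterval(x, thresholds)
  base <- signif(x / thresholds[i], 3)   # assumption: 3 significant figures
  paste(format(base, trim = TRUE), illions[i])
}

format_big_number(7e9)      # "7 billion"
format_big_number(12500)    # "12,500"
format_big_number(12.5e6)   # "12.5 million"
```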

Then, when it comes to units, for distance measures:

Meters are used between 1m and 999m

Kilometers are used for distances above 1km

Millimeters are used for distances below 1 m.

For measuring mass:

Kilograms are used for masses above 1kg

Grams are used for masses below 1 kg

(Thinking about using metric tons - 1000kg for bigger masses - but currently undecided)
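As a rough sketch, the distance and mass rules listed above amount to a pair of threshold checks. The function names `format_length` and `format_mass` are hypothetical, and treating exactly 1000 m as kilometers is my own reading of the boundary.

```r
# Hypothetical helpers: choose the unit from the size of the measurement.
format_length <- function(metres) {
  if (metres >= 1000) {
    paste(metres / 1000, "km")    # kilometers from 1 km upwards
  } else if (metres >= 1) {
    paste(metres, "m")            # meters between 1 m and 999 m
  } else {
    paste(metres * 1000, "mm")    # millimeters below 1 m
  }
}

format_mass <- function(kg) {
  if (kg >= 1) paste(kg, "kg") else paste(kg * 1000, "g")
}

format_length(50000)   # "50 km"
format_length(0.02)    # "20 mm"
format_mass(230)       # "230 kg"
format_mass(0.25)      # "250 g"
```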

Time is a whole separate problem, not yet addressed. For now, years are the only units in use, but really, days and seconds seem more natural for small time periods. But then this is about BIG Numbers.

Money is the other measure included in IsThatABigNumber.com. For now, US Dollars are the standard unit, rendered with a "$" sign.

Do numbers make you numb?

Way back in May 1982, Douglas Hofstadter (he of "Gödel, Escher, Bach" fame) wrote an article for Scientific American called "Number Numbness, or Why innumeracy may be just as dangerous as illiteracy". To provoke the readers to think about how they internalise big numbers, he concocted this scenario:

'The renowned cosmologist Professor Bignumska, lecturing on the future of the universe, had just stated that in about a billion years, according to her calculations, the earth would fall into the sun in a fiery death. In the back of the auditorium a tremulous voice piped up: "Excuse me, Professor, but h-h-how long did you say it would be?" Professor Bignumska calmly replied, "About a billion years." A sigh of relief was heard. "Whew! For a minute there, I thought you'd said a million years."'

The absurdity of the comment arises because a million and a billion years are both so far beyond our lifespans as to make the difference meaningless from a personal point of view. In the article, he makes the case that most people have little real grasp of large numbers: not really being able to distinguish millions from billions from trillions, even though there is a thousand-fold difference between each.

But while this distinction may not give us sleepless nights when used in comparison to human lifespans, there are areas of life (national and corporate budgets, national population statistics, even hard disk sizes) where the billion vs million distinction DOES affect our lives, and many of us lack the "Number Sense" to be aware, instinctively, of the difference. Hofstadter argues that this "numbness" to numbers causes a loss of perspective, to the detriment of public debate.

Numbers in the News

The media themselves often fail to establish a proper context for the numbers in the news. Any number ending in "...illion" just ends up in a mental category called "big number".

In November 2015, the UK public sector net borrowing was around £14 billion; debt was around £1.5 trillion. Are those big numbers? Of course they are, but are they unexpectedly big? Are they alarmingly big? Are they big in context?

Lionel Messi earns around 25 million Euros a year. Is this a big number? Of course it is, but how big, in context? And what context should we use? Other footballers? Other sports people? Other individuals? Corporations?

I'm a huge fan of the BBC Radio 4 programme "More or Less". This programme tears apart statistical claims floating about in current debates: I think it makes a vital contribution to understanding what's really going on, and to debunking inaccurate claims. And one question they will often start with, when looking at some reported statistic, is "Is that a big number?".

All this is by way of introducing an idea I am currently working on - an online service to answer just that question. Enter a number, any number, and it'll respond with a bunch of relevant comparisons, to put the number in context.

For example: in 2015, there were 72.4 million cars sold in the world. Is that a big number? The web service tells us: "One for every 100 people in the world". 17.5 million cars sold in the USA? That's "One for every 18 people in the USA". Big numbers? You can draw your own conclusions. And that's the point: to allow people to make informed judgements by putting things in context.
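The "one for every N people" comparison is just a population divided by a count. Here is a minimal sketch in R; the function name `one_for_every` is hypothetical, and the population figures (about 320 million for the USA and about 7.3 billion for the world in 2015) are my own round assumptions, not numbers taken from the service.

```r
# Hypothetical helper: express a count as "one for every N people".
one_for_every <- function(count, population, place) {
  people_per_item <- round(population / count)
  paste0("One for every ", people_per_item, " people in ", place)
}

usa_population   <- 320e6   # assumption: rough 2015 figure
world_population <- 7.3e9   # assumption: rough 2015 figure

one_for_every(17.5e6, usa_population, "the USA")
# "One for every 18 people in the USA"

# Comes out close to 100; the service evidently rounds more coarsely.
one_for_every(72.4e6, world_population, "the world")
```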

We'll throw in a few quirky measures too, just for fun. How long is an Imperial Star Destroyer, in terms of X-Wings? How long is a football pitch in terms of iPhones laid end to end?

It's very much in development but you can play around with what's been done here (www.isthatabignumber.com). As you can see from all the not-yet-live links there's a lot more to come. We're hoping to use this as a hub for a variety of numeracy-related services: a number-led blog, educational resources.

One of the beauties of the "R" programming language is the vitality of the user community. Language users are continuously uploading newly developed or revised versions of extension functionality. Looking at the range of packages available on CRAN, the "Comprehensive R Archive Network", I was struck by how many of these packages had recent versions registered. So, I decided to dig a little, and at the same time give you a little flavour of quick and dirty data exploration with R. Some highlights:

Around 8%! Let's look at the distribution by age. For convenience, convert weeks to approximate years:

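# packages$age holds each package's age in weeks; convert to approximate years
ageInYears <- packages$age / 52
# Histogram of package ages, asking for roughly 20 bins
hist(ageInYears, breaks = 20)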

More than half the packages are fresher than 1 year old; and it's easy to see that the growth took off just about 4 years ago after several years of slow burn. Let's look at the growth just over the past year (roughly 44 weeks):

Previously on "R is for ..."

One of R's greatest strengths is the level of activity in the user community and the range of packages that have been developed and contributed to the general good. There are thousands of packages out there and the list grows daily. How is the young data scientist to stay on top of this flood of material, I hear you ask. Various helpful lists have been contributed by bloggers and other commentators, such as "10 R packages I wish I knew about earlier". The CRANtastic website provides a list of the favourites based on user ratings http://crantastic.org/popcon, and r-bloggers provides a list by frequency of download in RStudio http://www.r-bloggers.com/a-list-of-r-packages-by-popularity/.

Dependencies

Another way of looking at this is to ask which packages are most fundamental to the broader R community: which packages do package authors build upon? The CRAN repository provides structured data on each package: among the data provided are "Depends" and "Imports", which list the packages each is built upon. It seemed a fun thing to see which packages were most depended upon, which were the most fundamental in the R ecosystem.

First-Order

For this exercise I didn't bother distinguishing between "Depends" and "Imports". I wrote a simple routine to take the list of packages from CRAN and then, for each, to harvest the contents of the "Depends" and "Imports" properties from the relevant page on the CRAN website, and stash those package names in a table which I called "antecedants". The table has columns "self", the package in question; "ante", the antecedent package; and "order", the depth of the dependency.
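To make the shape of that table concrete, here is a minimal sketch of the parsing step in R. It is not the original routine: it starts from a small made-up data frame with "Package", "Depends" and "Imports" columns (the same column names that `available.packages()` returns), and the helper `parse_field` and the version numbers shown are illustrative assumptions.

```r
# Toy stand-in for the harvested CRAN data (contents are illustrative only)
pkgs <- data.frame(
  Package = c("mypackage", "ggplot2", "plyr"),
  Depends = c("R (>= 3.0), ggplot2", "R (>= 2.14)", NA),
  Imports = c("plyr", "plyr", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split a comma-separated field into bare package names
parse_field <- function(field) {
  if (is.na(field)) return(character(0))
  pkg_names <- trimws(strsplit(field, ",")[[1]])
  pkg_names <- sub("\\s*\\(.*\\)$", "", pkg_names)   # drop version constraints
  setdiff(pkg_names[pkg_names != ""], "R")           # "R" itself is not a package
}

# One row per (package, antecedent) pair, all at order 1
rows <- lapply(seq_len(nrow(pkgs)), function(i) {
  ante <- unique(c(parse_field(pkgs$Depends[i]), parse_field(pkgs$Imports[i])))
  if (length(ante) == 0) return(NULL)
  data.frame(self = pkgs$Package[i], ante = ante, order = 1L,
             stringsAsFactors = FALSE)
})
antecedants <- do.call(rbind, Filter(Negate(is.null), rows))
antecedants
```

Treating "Depends" and "Imports" alike, as the text does, is why the two fields are simply concatenated before de-duplication.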

That gave the first-order dependencies, and here are some interesting glimpses into that table. I used table() to count the order-1 dependencies for each antecedent, to see which are most re-used, and then sorted that table to reveal the top ten.
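The count-and-sort step might look something like this in R. The tiny `antecedants` table here is made up purely so that the snippet runs on its own.

```r
# Toy order-1 table (illustrative data only)
antecedants <- data.frame(
  self  = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC"),
  ante  = c("ggplot2", "plyr", "plyr", "plyr", "ggplot2"),
  order = 1L,
  stringsAsFactors = FALSE
)

# Count how often each package appears as an order-1 antecedent
order1 <- antecedants[antecedants$order == 1, ]
counts <- sort(table(order1$ante), decreasing = TRUE)

# The ten most re-used packages (here there are only two)
head(counts, 10)
```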

So something over a thousand packages are in some way re-used, for a total of over 10,000 order-1 dependencies, and the most popular include many of the usual suspects like ggplot2 and plyr.

Going Deeper

But just looking at the first level is not good enough. If your package builds on, say, ggplot2, which has plyr among its antecedents, then of course plyr is an antecedent of your package too, but a second-order one. So we need to get recursive, and we can do this just by analysing the antecedants table. We can build the order 2 antecedants table based on the order 1 table, and the order 3 from the order 2, and so on, until we finally bottom out and reach the maximum depth. Along the way we need to make sure we don't double-count: if a package uses ggplot2 and also uses plyr directly, we don't want to count plyr twice.
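One way to sketch that level-by-level expansion in R is to join the rows at the current depth back onto the order-1 table, keep only pairs not already seen, and repeat until nothing new turns up. This is my own illustration of the idea rather than the original code; the toy data and the intermediate names (`links`, `mid`, `deeper`) are assumptions.

```r
# Toy order-1 table (illustrative data only)
antecedants <- data.frame(
  self  = c("mypackage", "mypackage", "ggplot2", "plyr"),
  ante  = c("ggplot2", "plyr", "plyr", "Rcpp"),
  order = 1L,
  stringsAsFactors = FALSE
)

# Order-1 links, renamed so the join columns are unambiguous
links <- data.frame(mid = antecedants$self, deeper = antecedants$ante,
                    stringsAsFactors = FALSE)

depth <- 1L
repeat {
  at_depth <- antecedants[antecedants$order == depth, ]
  current  <- data.frame(self = at_depth$self, mid = at_depth$ante,
                         stringsAsFactors = FALSE)
  # self -> mid (at this depth) -> deeper (one more step down)
  step <- merge(current, links, by = "mid")
  candidates <- unique(data.frame(self = step$self, ante = step$deeper,
                                  stringsAsFactors = FALSE))
  # Drop pairs already recorded at a shallower depth: no double-counting
  seen     <- paste(antecedants$self, antecedants$ante)
  is_new   <- !(paste(candidates$self, candidates$ante) %in% seen)
  new_rows <- candidates[is_new, , drop = FALSE]
  if (nrow(new_rows) == 0) break   # bottomed out
  depth <- depth + 1L
  new_rows$order <- depth
  antecedants <- rbind(antecedants, new_rows)
}
antecedants
```

In this toy example mypackage reaches plyr both directly and via ggplot2, but plyr is recorded only once, at order 1, and Rcpp appears as an order-2 antecedent of both mypackage and ggplot2.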