The humble phone number. Global, local, extensions, alternates, and sometimes pure garbage: without data entry constraints, there is no telling what you might find in a typical phone number field. Until now.

The topic of phone number crunching has arisen at the Monastery before, multiple times, with answers and insightful speculations, but thus far all seem to have underestimated the complexity of the unrestrained Beast.

I come bearing loads of international phone number DATA found running rampant in the wilds of the data-entry savannahs, plus my particular solution to the problem of making sense of it all. From a representative field of nearly 100,000 numbers I distilled a subset of über-patterns and their appearance frequencies.

What follows is my eventual solution for parsing these noisy numbers -- a rather brute-force approach that evolved to fit the data at hand, data perhaps better suited to analysis by a neural net; my reflections on the nature of data entry, ambiguity, and alternate approaches; and most importantly, a scrambled but meaningful representative data set on which you are free to chew at your leisure. I'm sure better approaches exist, so I invite thoughts, code commentary, and better solutions.

warning: node size ~63k

I am here to chew bubble gum and parse some data...and I'm all out of bubble gum.
-- Me Nada

Phone numbers, like email addresses, are inherently impossible to prove "correct" merely by parsing. The only way to discover whether a phone number is valid is to dial it and see what rings. Of course, there are still general formats we expect to see, regardless of whether the number is indeed valid. Beyond the International Country Codes, however, numbers are subject to the vagaries and capacities of the host country's network. Without knowing the rules for every country, there is no guarantee which bits of a number correspond to an area or province code, municipality, etc. The only parts of a number we can reasonably expect to identify in a globally generic way are:

International Dial Direct codes (what locals use to dial out of their country)

International Country Codes (what the world dials to reach a particular country after the IDD)

The local phone number, including area/province codes and possibly long distance codes

Extensions (to be dialed after a connection is made)

On top of this we should also expect indications of alternates for numbers, suffixes, and extensions in unconstrained data entry fields.

All is not doom and gloom for the country networks, however. If, for example, you happen to know that a large proportion of your numbers are supposed to be in a ten-digit format, then you can use that knowledge to infer structure and build rules of thumb, especially for parsing alternate suffixes.
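As an illustration of that rule of thumb -- in Python rather than the Perl used below, with a function name of my own invention -- a crude check for the ten-digit North American shape might look like this:

```python
import re

def looks_like_nanp(raw: str) -> bool:
    """Rule-of-thumb check (illustrative sketch, not the phone.pl code):
    does this entry reduce to the ten-digit North American format?"""
    digits = re.sub(r"\D", "", raw)
    # Tolerate an optional leading long-distance '1'.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return len(digits) == 10

print(looks_like_nanp("(555) 555-6666"))    # ten digits -> True
print(looks_like_nanp("+44 20 5555 6666"))  # twelve digits -> False
```

Entries that pass a check like this can then be held to stricter expectations when parsing trailing alternates.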

1-900-ILOVEYOU: I have made no attempt to parse vanity numbers. First, none are represented in this data set. Second, I toyed with the idea and eventually decided that there was too much ambiguity in telling vanity letters apart from extension markers, which usually appear as some fragment of 'extension' combined with periods, hashes, and whitespace. I can see how the distinction might be drawn, but did not implement it, since this data has loads of extensions but no vanity numbers.
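Those extension markers are worth pinning down, though. Here is a Python sketch of the sort of marker matching involved -- the regex and function name are my own for illustration, not lifted from phone.pl:

```python
import re

# Extension markers of the kind described above: fragments of 'extension'
# ('x', 'xt', 'ex', 'ext'), '#', with optional periods and whitespace.
EXT_RE = re.compile(r"[\s,]*(?:e?xt?\.?|extension|#)[\s.]*(\d+)\s*$", re.I)

def split_extension(entry: str):
    """Peel a trailing extension off an entry; return (number, ext or None)."""
    m = EXT_RE.search(entry)
    if not m:
        return entry, None
    return entry[: m.start()].rstrip(), m.group(1)

print(split_extension("555 555 6666 ext. 777"))
print(split_extension("555 555 6666 x99"))
```

You can see the vanity problem lurking: an 'x' embedded in a vanity word would match the same marker pattern.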

Finally, there is one unavoidable fact: There are plenty of garbage entries that are either incomplete or incomprehensible even to a human. In a well-controlled universe this garbage would have been caught at the data-entry stage -- even a rudimentary attempt at enforcing validity would clean up much of the garbage. Such is not the case here, however, though those tasked with parsing the result may fervently wish it otherwise. Though no longer in vogue, the GIGO principle ultimately still stands for these cases.

As mentioned above, the original data comes from 100,000 or so international and U.S. domestic numbers entered into unconstrained entry fields. From these numbers I derived meta patterns with which to play:

Pattern Count   Generality
-------------   ----------
1503            single digits (\d), single alphas ([a-z])
1269            single digits, single alphas, whitespace (\d, [a-zA-Z], and \s+)
328             digit clusters, single alphas, whitespace (\d+, [a-zA-Z], and \s+)
312             digit clusters, alpha clusters, whitespace (\d+, [a-zA-Z]+, and \s+)

If extraction were the only goal, then the most general pattern collection at the bottom of that list would be sufficient for this particular data set. In this case, however, we have a raw data field that is merely supposed to contain a phone number, and our job is to parse that number -- a more complicated task that presents its own set of challenges. Therefore, in the __DATA__ section of the phone.pl script below, I have included 1269 example entries representing the patterns derived from letting single alphanumerics (not clusters) float and collapsing whitespace.
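For the curious, that pattern derivation -- single digits and alphas float, whitespace collapses -- can be sketched in a few lines. Python here for brevity; the real work was done in Perl:

```python
import re

def meta_pattern(entry: str) -> str:
    """Reduce an entry to its floating-single-alphanumeric pattern:
    each alpha becomes 'a', each digit becomes 'd', whitespace collapses.
    My reconstruction of the derivation, not the original code."""
    s = re.sub(r"\s+", " ", entry.strip())
    s = re.sub(r"[a-zA-Z]", "a", s)   # alphas first, so the 'd' markers survive
    return re.sub(r"\d", "d", s)

print(meta_pattern("+1 (555)  555-6666  ext. 777"))
```

Two entries that differ only in their particular digits and letters map to the same pattern, which is how 100,000 numbers boil down to a much smaller pattern set.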

These are not real phone numbers, except perhaps some by random chance. International country codes, where identifiable, were replaced with a random country code of identical length. 1's and 0's were largely left alone (due mostly to the presence of IDD codes) and the rest of the digits were sequentially overwritten. Text strings have been replaced with nonsense unless they are somehow generically germane (eg "Extension", "PAGER", "email only", " - xxxx", etc). The result is a set of fake but convincing numbers with valid country codes (when present) that each correspond to one of the patterns.
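The digit-scrambling itself can be approximated like so (a Python sketch of the description above; the random country-code swap is omitted):

```python
import itertools
import re

def scramble(entry: str) -> str:
    """Keep 0s and 1s (mostly IDD material) intact and sequentially
    overwrite the remaining digits -- an approximation of the scrambling
    described above, not the script actually used."""
    seq = itertools.cycle("23456789")
    return re.sub(r"[2-9]", lambda m: next(seq), entry)

print(scramble("555-6789"))   # digits 2-9 replaced in sequence
print(scramble("0101"))       # 0s and 1s untouched
```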

Each number is preceded by a percentage measuring its match frequency in the original data set. Though I did not use this information for parsing purposes, it is instructive to see that the majority of numbers fall into patterns that are reasonable to extract and that indeed, we need not fear for the collective skills of our world's typists. The percentages might also be useful to those seeking their own solutions -- either to form heuristics or to realize when "enough is enough" and be content with their 95%. Note that with a data set this size, percentages of less than 1% are common and can represent a significant number of entries.

There are three tasks involved with data such as this: extraction, normalization, and parsing. Though logically distinct, in reality they are hopelessly entangled. Much of the noise that gets dropped during normalization, for instance, can briefly serve as clues to the meaning of various parts of a phone number. My end result, therefore, is a series of steps, sequentially cohesive, that are executed in a specific order with each step passing the remnants of its operation to the next. The steps involved are a direct reflection of the nature of this particular data set.

Loosely stated, my approach boils down to the following steps:

1. Split the entry into multiple numbers, if present.

2. Extract phone extensions.

3. Remove IDD prefixes (possibly using them to infer upcoming country codes).

4. Interpolate alternate numbers and suffixes.

5. Extract country codes, using a list of valid codes.
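The first few of these steps can be sketched as a toy pipeline. This is Python for illustration -- the regexes and names are mine, not phone.pl's internals:

```python
import re

def parse_entry(raw: str):
    """Toy pipeline: split multiple numbers, peel extensions, strip IDDs.
    An illustrative sketch, not the original Perl."""
    # 1. Split an entry that holds several distinct numbers.
    candidates = re.split(r"\s*(?:/|\bor\b|;)\s*", raw)
    results = []
    for cand in candidates:
        # 2. Pull off a trailing extension, if any.
        ext = None
        m = re.search(r"(?:ext\.?|x|#)\s*(\d+)\s*$", cand, re.I)
        if m:
            ext = m.group(1)
            cand = cand[: m.start()]
        # 3. Strip a leading IDD such as '011' or '00', normalizing to '+'.
        cand = re.sub(r"^\s*0(?:11|0)\b[\s.-]*", "+", cand)
        results.append((cand.strip(), ext))
    return results

print(parse_entry("011 44 20 5555 6666 x99"))
```

Each step hands its remnants to the next, which is the sequential cohesion mentioned above.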

Some clarifications are in order. In step #3 I mention removing IDD prefixes -- the sequence of numbers used to dial out of a particular country. These are of little use to someone wanting to dial a number unless they happen to live in a country with that same IDD. Sometimes in the data there is no '+' to indicate an international number -- there is merely an IDD, usually some combination of 0 and 1 -- so these codes can be handy for inferring the imminent arrival of a country code. I store the IDDs where found, but beyond that inference they are of no particular use for my purposes.
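The IDD handling amounts to something like the following. The prefix table here is a toy; real IDDs vary per country, and the names are mine:

```python
# Illustrative IDD prefixes only ('011' North America, '00' much of the
# world, '010' e.g. Japan historically); the real table is per-country.
IDD_PREFIXES = ("011", "00", "010")

def strip_idd(digits: str):
    """Remove a leading IDD and report whether a country code should
    follow -- the inference described above. Sketch, not phone.pl."""
    for idd in IDD_PREFIXES:
        if digits.startswith(idd):
            return digits[len(idd):], True   # expect a country code next
    return digits, False

print(strip_idd("01144205556666"))
```

The boolean is the useful part: even without a '+', a stripped IDD tells the next stage to go hunting for a country code.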

In step #4 I mention interpolating numbers. Sometimes a number might be listed as something like '555-555-6666, 7777, or 8888'. There are three numbers there, all beginning with '555-555'. There are cases such as '555 555 6666-7' that present ambiguity: is the '7' an extension or an alternate ending to the preceding suffix? In my solution that particular example is interpreted as the alternate suffix '6667'. In the original data it was more obvious, because these tended to appear as numerically sequential numbers, i.e., adjacent suffixes. The scrambling of the data has destroyed some of its intuitive "look" and might cause you to wonder about my decisions in these ambiguous areas. These decisions are not bulletproof -- at the time they just seemed more likely to be correct.
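The comma/'or' flavor of that interpolation can be sketched as follows (Python; my guess at the behavior, and the ambiguous '6666-7' single-digit case would need extra handling beyond this):

```python
import re

def expand_alternates(number: str):
    """Expand '555-555-6666, 7777, or 8888' style alternates by overlaying
    each alternate on the tail of the base number's digits."""
    parts = re.split(r"\s*(?:,|\bor\b)\s*", number.strip())
    parts = [p for p in parts if re.search(r"\d", p)]  # drop empty fragments
    base = re.sub(r"\D", "", parts[0])
    out = [base]
    for alt in parts[1:]:
        alt_digits = re.sub(r"\D", "", alt)
        # Replace the tail of the base number with the alternate suffix.
        out.append(base[: len(base) - len(alt_digits)] + alt_digits)
    return out

print(expand_alternates("555-555-6666, 7777, or 8888"))
```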

Step #5 is perhaps the most interesting. I broke down and eventually came to rely on a list of actual, valid Country Codes. Mechanically detecting country codes can go only so far. Think "+44 555 666 7777" vs "+445556667777". With no knowledge of country codes -- other than perhaps their typical maximum length of three digits and their prefix-code (Huffman-like) structure -- there is no bulletproof way of pulling the country code out of the second example. In addition, the mechanical approach cannot deal with invalid country codes following a '+'. So in the CountryCodes package I provide some routines for pulling valid country codes out of a string of digits; in addition, there is a small routine for grabbing an updated list of codes off the Net. Two methods are included, pull_cc_smart and pull_cc_guess (the latter unused), which illustrate the difference.
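The idea behind pull_cc_smart -- match the front of the digit string against a table of valid codes, longest first -- reduces to something like this Python sketch (toy code table; the real list has around 200 entries):

```python
# A few real ITU country codes for illustration; the full list is longer.
COUNTRY_CODES = {"1", "44", "49", "33", "380", "971"}

def pull_cc(digits: str):
    """Extract a valid country code from the front of a digit string,
    trying the longest candidates first. Sketch of the pull_cc_smart
    idea, not the CountryCodes package itself."""
    for length in (3, 2, 1):          # codes are at most three digits
        cc = digits[:length]
        if cc in COUNTRY_CODES:
            return cc, digits[length:]
    return None, digits

print(pull_cc("445556667777"))
```

Because country codes form a prefix code, a hit here is unambiguous -- which is exactly why the table beats pure mechanics on input like "+445556667777".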

As I mentioned, I expect that the dataset is the most valuable contribution here. My code is not optimized, tricky, or beautiful -- it is merely a straightforward evolution of a solution from data with pollution. (Am I a poet or what?)

The test script phone.pl is a simple harness around the data. For each line it prints the raw entry and the extracted phone numbers, separated by a colon. In cases where multiple numbers were extracted, each additional number appears on a line of its own below the first number found.
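For those not running the Perl, the report format looks roughly like this (a Python mock-up of the layout described, not phone.pl itself):

```python
def report(raw: str, numbers: list) -> str:
    """Mock up one harness line: raw entry, a colon, the first extracted
    number; any further numbers on their own lines below."""
    first = numbers[0] if numbers else ""
    lines = [f"{raw} : {first}"]
    lines.extend(numbers[1:])
    return "\n".join(lines)

print(report("555-555-6666, 7777", ["5555556666", "5555557777"]))
```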
