Cooperation in the development team in practice

Menu

Theory of probability in the office. How to correct typos in the data?

Previously I have shown how to use Excel and VBA language to automatically format financial statements or other documents. Today I will describe how to use theory of probability to correct typos in the data collected in our database e.g spreadsheets.

Surprised? Yes, we use theory of probability and statistics, completely useless piece of …. maths, to automatically find and correct typos made by our clients or colleagues.

– We don’t need that because we have autocorrection tools in our office software!
– Show me then, how to correct surnames or account numbers using these tools?– Oh… We need to wait when our customer will tell us about the mistake in his/her personal data…
– And then you need to say sorry ;)
– What can I do instead?
– Let’s look at the example below …

Compute the probability, that the city equals a value in a row. In the first set the probabilities will be: 0.8 for Warsaw and 0.2 for Waraw.

Notice that mistakes will be less probable than the correct values and the probability of mistakes will be lower the more data is collected. Don’t be afraid checking millions of entries!

To find typos among cities examine the values with the lowest probability. There will be much less data that should be examined manually and the probability that we will find mistakes among them is truly high. In the same time the probability that we will find a mistake among high probable values will be very low. Our work will be very efficient!

Repeat this procedure for surname. Here you may also notice that Conor is an uncle, a cousin, a friend or a other person that paid for a Novak :) You may use this information in the future.

How to find incorrect account number? Let’s group our data according to surname and the city after correcting these columns. We should get two sets of data:

Then the probability that account number equals specified value in a row should be: 0.8 for 0000-1111-0000 and 0.2 for 0000-111-0000. You know what it means ;)

That’s how theory of probability help us in everyday life :) But … it is only a very beginning of the story. In the next article I will show you a working solution in an Excel spreadsheet and an algorithm recognizing that Novk is a typo but Conor is something totally different. We will use levenshtein distance. Sounds complicated? Not necessarily!