Sniffing out bogus data

/* Imagine that you have some names and addresses, and you are anxious to check the ones that may be spurious. As well as looking at such things as the vowel-to-consonant ratio and the digraph frequencies (see my previous posts), you might also want to look for sequences where the same letter occurs three or more times in a row. Names like AAAAAA crop up quite a lot when people don't want to give you their names or addresses. It is not a big deal to cope with this using a brute-force approach. Here is a technique based on the idea of generating the matches on the fly using a 'numbers' table, which I'm assuming you already have. All you need to do is to define your alphabetic characters.

Normally, you'd create a rule or a constraint to prevent the stuff getting into the database in the first place, but it is tricky to create a defence before you know what's going to attack your database. Because of this, you'll always find something that slips through your first-line defences.*/
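
/* To make the idea concrete, here is a minimal sketch of it, under some loudly stated assumptions. The @Numbers table variable is only a stand-in for your own 'numbers' table, and @Sample, with its three made-up names, is purely hypothetical test data. For each letter of the alphabet we build a three-character run ('AAA', 'BBB' and so on) and look for it anywhere in the name. Whether 'aaa' also matches depends on your collation, and a name containing runs of two different letters will appear once for each. */

DECLARE @Alphabet VARCHAR(26) = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

-- stand-in for an existing numbers table: the integers 1 to 26
DECLARE @Numbers TABLE (number INT PRIMARY KEY);
INSERT INTO @Numbers (number)
  SELECT TOP (26) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS number
  FROM sys.all_objects
  ORDER BY number;

-- hypothetical sample data, for illustration only
DECLARE @Sample TABLE (Name VARCHAR(50));
INSERT INTO @Sample (Name)
  VALUES ('AAAAAA'), ('John Smith'), ('Mr XXX');

-- one row per name and per letter that occurs three or more times in a row
SELECT s.Name,
       SUBSTRING(@Alphabet, n.number, 1) AS RepeatedLetter
FROM @Sample AS s
  INNER JOIN @Numbers AS n
    ON n.number <= LEN(@Alphabet)
   AND s.Name LIKE '%' + REPLICATE(SUBSTRING(@Alphabet, n.number, 1), 3) + '%';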

/* So we'll create a typical name-and-address database and fill it with 50,000 addresses, just to check that our algorithm is going to work. I use SQL Data Generator just because it makes this sort of operation laughably easy. */