Introduction

This Gibberish Classification algorithm aims to detect whether text is valid or was randomly typed on a keyboard. It returns a percentage: a low value means valid text, and a high value means gibberish. The algorithm is at an early stage, so it still returns some incorrect values.

If a result is lower than 50%, it's likely that the text is valid. If a result is higher than 50%, it's likely that the text is gibberish. The algorithm is optimized for the English language and for longer text; it still works for shorter text (for example, one sentence), but the results are then less accurate. The algorithm never gives a percentage lower than 1%, unless the input string is null or empty, in which case it returns 0%.

The C# implementation can be used for the .NET Framework 4.0 and higher (the binary in the download targets .NET 4.5). The Python implementation can be used in both Python 2.x and Python 3.x.

The Algorithm

The algorithm performs three checks:

It checks whether the percentage of unique chars (measured per chunk of 35 chars) is in a usual range.

It checks whether the percentage of vowels among the letters is in a usual range.

It checks whether the word/char ratio (as a percentage) is in a usual range.

Checking the unique characters

To check the percentage of unique characters, we first split the string into chunks of 35 characters. The last chunk may end up with fewer than 35 characters; if it has fewer than 10, append its characters to the chunk before it and delete the last chunk. (This is only possible if there are 2 chunks or more.)
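The splitting step can be sketched in Python as follows. This is a minimal illustration rather than the code from the download; the function name split_in_chunks and the parameters chunk_size and min_last are names picked for this sketch.

```python
def split_in_chunks(text, chunk_size=35, min_last=10):
    """Split text into chunks of chunk_size characters.

    If the last chunk is shorter than min_last characters and there are
    at least two chunks, its characters are appended to the chunk before
    it and the last chunk is dropped.
    (Names and defaults are assumptions made for this sketch.)
    """
    # Slice the string every chunk_size characters.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Merge a too-short trailing chunk into the previous one.
    if len(chunks) > 1 and len(chunks[-1]) < min_last:
        chunks[-2] += chunks[-1]
        chunks.pop()
    return chunks
```

For a 75-character string this gives two chunks of 35 and 40 characters, because the 5 leftover characters are merged into the second chunk.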

After splitting the string into chunks, create an empty list. Then loop over all chunks: for each chunk, count its unique characters, divide that count by the total number of characters in the chunk, and add the result to the list.

After doing the above, calculate the average of the list, multiply it by 100, and return it.

That calculates the percentage of unique characters in chunks; checking whether it is in a usual range is done later.
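A minimal Python sketch of the averaging step is shown here. It takes an already-split list of non-empty chunks as input; the function name unique_chars_percentage is chosen for this sketch, and returning 0 for an empty list is an assumption, since the text does not cover that case. It assumes Python 3 division (or the __future__ import on Python 2).

```python
def unique_chars_percentage(chunks):
    """Average share of unique characters per chunk, as a percentage.

    chunks is a list of non-empty strings.
    (Function name and the empty-list behaviour are assumptions.)
    """
    if not chunks:
        return 0.0  # assumption: nothing to measure
    ratios = []
    for chunk in chunks:
        # len(set(chunk)) counts the distinct characters in the chunk.
        ratios.append(len(set(chunk)) / len(chunk))
    # Average of the per-chunk ratios, scaled to a percentage.
    return sum(ratios) / len(ratios) * 100
```

For the two chunks "aabb" and "abcd", the per-chunk ratios are 0.5 and 1.0, so the function returns 75.0.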

Checking the vowels

When checking the percentage of vowels, first initialize two integers, vowels and total, to zero. Then loop over each character in the given string. If the character is not a letter of the alphabet, skip it. If it is a letter, increase total by 1 and check whether it is a vowel: if it is, increase vowels by 1. After looping over all characters, return vowels / total * 100 (or 0 if total is zero, that is, if the string contains no letters).
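In Python, the vowel count could look roughly like this. It is a sketch under the assumptions that "vowel" means one of a, e, i, o, u in either case, that a string without letters yields 0, and that division is Python 3 style; the function name vowels_percentage is chosen for this illustration.

```python
def vowels_percentage(text):
    """Percentage of vowels among the letters of text."""
    vowels = 0
    total = 0
    for c in text:
        if not c.isalpha():
            continue  # digits, spaces and punctuation are ignored
        total += 1
        if c in "aeiouAEIOU":
            vowels += 1
    if total == 0:
        return 0.0  # no letters at all, so avoid dividing by zero
    return vowels / total * 100
```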

Checking the word/char ratio

To check the word/char ratio, split the string by the regex [\W_] (that is, on every non-word character and on underscores). Then remove all whitespace-only/empty items from the resulting array. Finally, divide the number of words by the number of chars, multiply it by 100 and return it.
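A possible Python sketch of this step follows. It assumes that "the amount of chars" means the length of the whole input string, and that an empty string yields 0; both are assumptions of this illustration, as is the function name word_to_char_ratio. Python 3 division is assumed.

```python
import re


def word_to_char_ratio(text):
    """Number of words divided by number of characters, as a percentage."""
    if len(text) == 0:
        return 0.0  # assumption: an empty string has no ratio to report
    # Split on every non-word character and on underscores.
    parts = re.split(r"[\W_]", text)
    # Adjacent separators produce empty strings; drop them.
    words = [part for part in parts if part.strip() != ""]
    return len(words) / len(text) * 100
```

For "hello, world" the split yields "hello", an empty string and "world", so there are 2 words for 12 characters, which gives a ratio of roughly 16.7%.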

Calculating "deviation score"

The above functions all return a percentage, but we cannot directly use these to calculate the final score -- first we have to calculate how much the percentage deviates from the usual range, and then give it a score. The higher this score, the more the percentage deviates.

The function to calculate this score has to accept three arguments: the given percentage, the lower bound of the usual range and the upper bound.

If the percentage is lower than the lower bound, return log(lower_bound - percentage, lower_bound) * 100 (where the second argument is the base). The factor 100 puts the score on the 0 to 100 scale that the final score calculation expects.

If the percentage is higher than the upper bound, return log(percentage - upper_bound, 100 - upper_bound) * 100.

If the percentage is none of the above (meaning that it's in the usual range), return 0.
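The three cases can be sketched in Python with math.log, whose optional second argument is the base. The sketch scales the logarithm by 100; that scaling is an assumption inferred from the final score calculation, which expects deviation scores of up to 100 (a log10 of 2).

```python
import math


def deviation_score(percentage, lower_bound, upper_bound):
    """Score how far percentage lies outside [lower_bound, upper_bound].

    Returns 0 inside the range. Outside it, the result can be negative
    when the percentage is less than 1 point away from the bound, which
    is why callers clamp the score before using it.
    (The factor 100 is an assumption of this sketch.)
    """
    if percentage < lower_bound:
        return math.log(lower_bound - percentage, lower_bound) * 100
    if percentage > upper_bound:
        return math.log(percentage - upper_bound, 100 - upper_bound) * 100
    return 0
```

With this scaling, a percentage of 0 or 100 gives a score of 100, the largest possible deviation.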

Calculating the final score

Using all above functions, we can calculate the final score! First we calculate the percentages using the first three functions. Then, we call the deviation score for each of them:

For the unique chars %, the lower bound is 45 and the upper bound is 50 -> deviation_score(percentage, 45, 50)

For the vowels %, the lower bound is 35 and the upper bound is 45 -> deviation_score(percentage, 35, 45)

For the word/char ratio, the lower bound is 15 and the upper bound is 20 -> deviation_score(percentage, 15, 20)

Where do these bounds come from? From testing: I ran some paragraphs taken from the internet through all 3 functions.

After calculating these deviation scores, we go through them and set each one to 1 if it is lower than 1. The reason for this is that we are going to call log10 on these scores; a score lower than 1 would lead to a negative logarithm (or an error in case of zero), which is undesired.

The next step is to calculate the base-10 logarithm of each deviation score, add the three results together, divide the sum by 6, and multiply by 100 to get a percentage. (We divide by 6 because the maximum we can get from one log10 operation is 2, and we have three operations here.) We return max(final_score, 1). We do not return the exact final score if it's below one because even if the final score is 0%, it's not impossible that the entered text is gibberish; it's just unlikely. The higher the final score, the higher the chance that a string is gibberish.
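The clamping and combining steps can be sketched in Python as below. The function takes the three deviation scores as input, so it does not depend on any particular bounds. Summing the three logarithms and multiplying by 100 to obtain a percentage are this sketch's reading of the step, and the function name final_score is chosen for the illustration.

```python
import math


def final_score(deviation_scores):
    """Combine three deviation scores (each at most 100) into a percentage.

    (Summing the logarithms and scaling by 100 are assumptions of this
    sketch; the divisor 6 assumes exactly three scores.)
    """
    # Clamp to 1 so that log10 never sees zero or a negative number.
    clamped = [max(score, 1) for score in deviation_scores]
    # Each log10 is at most 2, so three of them sum to at most 6.
    total = sum(math.log10(score) for score in clamped)
    # Never report less than 1%.
    return max(total / 6 * 100, 1)
```

Three scores of 100 give 100%, and three scores of 0 (everything in the usual range) give the minimum of 1%.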

C# and Python implementation

In the C# implementation, all methods are static and put in a GibberishClassifier class. In the Python implementation, all methods are put in a gibberishclassifier module. The Python version works in both Python 2.x and Python 3.x. Because integer division truncates by default in Python 2.x, we have to add this at the top of the file:

from __future__ import division

After doing that, division no longer truncates in Python 2.x.

The first implemented method is the method to split a string into chunks. The C# method uses a for loop and Substring to take the appropriate amount of characters. The Python method uses a for loop, range and the slice notation.

Then the method to get the percentage of unique chars per chunk is implemented. It uses the above method. The C# implementation uses .Distinct().Count() to get the count of unique characters in one chunk, and .Average() to calculate the average of all percentages of unique characters in the chunks. The Python implementation uses len(set(chunk)) to get the number of unique characters in one chunk, and it uses sum to get the sum of all percentages, which is then divided by the number of percentages.

Next comes the method to get the percentage of vowels. The C# method uses !Char.IsLetter to check if a char is not a letter; in that case, we go to the next char in the string. If it is a letter, it increments the total variable and it checks whether it is a vowel using "aeiouAEIOU".Contains(c). If it is a vowel, it increments the vowels variable. At the end of the method, it returns the percentage if total is not zero, and if it is zero, then the method returns 0.

The Python method uses not c.isalpha() to check if a char is not a letter; in that case, we go to the next char in the string. If it is a letter, it increments the total variable and it checks whether it is a vowel using c in "aeiouAEIOU". If it is a vowel, it increments the vowels variable. At the end of the method, it returns the percentage if total is not zero, and if it is zero, then the method returns 0.

Then the method to get the word/char ratio is implemented. The C# method uses the Regex.Split method (in System.Text.RegularExpressions) to split the string. Then it uses the LINQ .Where method to remove all parts that are whitespace-only (it uses the String.IsNullOrWhiteSpace method to check that), and the Count() method to get the number of words.

The Python method uses re.split (requires import re) to split the string, and it uses x for x in ... if ... to remove all whitespace-only parts. It uses x.strip() != "" to check whether a string is whitespace-only or empty. Then it uses the len function to find out the number of words.

The last method is the one that does the actual classifying. If the input string is empty or null (C#) or None (Python), it returns 0%. Otherwise, it calls the above methods, calculates the deviation scores and calculates the final score, as described in the algorithm section.