Lexalytics Text Analysis Work with Common Crawl Data
This is a guest blog post by Oskar Singer
Oskar Singer is a Software Developer and Computer Science student at University of Massachusetts Amherst. He recently did some very interesting text analytics work during his internship at Lexalytics . The post below describes the work, how Common Crawl data was used, and includes a link to code.
At Lexalytics, I have been working with our head of software engineering, Paul Barba, on improving our accuracy with Twitter data for POS-tagging, entity extraction, parsing and ultimately sentiment analysis through building an interesting model-based approach for handling misspelled words.
Our approach involves a spell checker that automatically corrects the input text internally for the benefit of the engine and outputs the original text for the benefit of the engine user, so this must be a different kind of automated spell-correction.
The First Attempt:
Our first attempt was to take the top scoring word from the list of unranked correction suggestions provided by Hunspell, an open-source spell checking library. We calculated each suggestion’s score as word frequency from Common Crawl data divided by string edit distance with consideration for keyboard distance.
The resulting corrections were scored against hand-corrected tweets by counting the number of tokens that differed. Hunspell scored worse than the original tweets. It corrected usernames and hashtags and gave totally unreasonable suggestions. My favorite Hunspell correction was the mapping from “ur” (as in the short-form for “your” or “you’re”) to “Ur” (as in the ancient Mesopotamian city-state).
Hunspell also missed mistakes like misused homophones, which did not count as a misspelling when considered in isolation. This last issue seemed to be the primary issue with our data, so the problem required a method with the ability to consider context.
The Second (and final) Attempt:
We title the next attempt “the Switchabalizer”, and it can be summarized as a multinomial, sliding-window, Naive-Bayes word classifier. On a high level, we classify each of the target words in a piece of text, based on the preceding and succeeding words, as itself or one of its homophones.
The training process starts with a list of bigrams from the Common Crawl data paired with their occurrence counts. We use this data to calculate P(wi-1 | wi) = #(wi-1wi)/#(wi-1) and P(wi+1 | wi) = #(wiwi+1)/#(wi+1) where wi is the current word, wi-1 is the preceding word and wi+1 is the succeeding word. These probabilities are serialized and archived so they can be deserialized into C++ data structures instead of recalculated for each instantiation of the spell check object. In other words, we’re building a set of probabilities that each switchable “generated” the words preceding and succeeding wi.
The inference process starts with a set S of sets and an inverted index. Each s ∈ S represents a group of commonly confused homophones (e.g. two, too, 2, to), and no word is a member of multiple s ∈ S. The inverted index maps each word w in the union of all s ∈ S to the s in which w holds membership. Each word wi in the ordered sequence of words in a document is checked for an entry in the inverted index. If an entry V is found, the algorithm replaces wi with argmaxv∈V P(v) = P(wi-1 | v) + P(wi+1 | v).
As a matter of efficiency, we assumed that Wikipedia articles have perfect use of the target homophones. I wrote a Python script that took in text, randomly replaced target homophones with members of their switchable set, then output the result.
We ran the Switchabalizer on this data and compared to the original Wikipedia data. Comparing the corrections to the words changed by our test generator, Hunspell, even when forced to ignore usernames, had a 216% error rate (i.e. it made false corrections), and the Switchabalizer had a 20% error rate. Although the test data does not match the target data, the massive and varied data set provided by Common Crawl should ensure good results from the Switchabalizer on many types of data, hopefully even the near-nonsense from the bowels of Twitter.
The Switchabalizer approach is clearly superior to a traditional spell checker for our targeted issues, but still requires significant testing, tuning and improvement. The following section provides some possibilities for improvement and expansion. We hope this approach can be of use to other people with the same problem, and we would like to thank Common Crawl for the fantastic resource that they provide!
Possible future experiments include further testing on different types of data, integration of higher-order n-gram features, implementation of a discriminative model, implementation for other languages, and corrections of common misspellings like “ur”, which cannot be included in sets of switchables without risking the model mapping words to non-words.
The commented Python scripts that generate the testing data and perform feature extraction/training/feature selection can be found on my github account at https://github.com/oskarsinger/PythonScriptsFromLexalytics/tree/master/AutomatedSpellCheck/