Bitext Blog

Improve text mining using Phrase Extraction

As we have mentioned before in this blog, structured data is invaluable for businesses looking to extract relevant information from text. Whereas the problem used to be how to get enough useful data for the results to be meaningful, the challenge today is how to process the large amounts that are available. This task is almost impossible without the right tools because, on top of being vast, the data is most often unstructured. At Bitext, we offer a range of Text Analytics Tools that allow users to structure their raw data and extract the information most relevant to their goals.

Our Phrase Extraction tool applies syntactic analysis to detect key ideas and trends quickly and accurately in any data set. Whether the text is standard, such as news or legislation, or colloquial, such as social media or a blog, our deep analysis is able to extract the main concepts.

How does this syntactic analysis help? Not all words hold the same semantic weight in a sentence: some carry more meaning than others. Our analysis lets us distinguish the words that carry little meaning on their own (prepositions, conjunctions, etc.) from those that carry most of it (nouns, verbs, adjectives). By finding and tagging Noun Phrases, Verbal Phrases, and Adjectival Phrases, we know where to look for meaningful information. For example, if we find the most frequent noun phrases in a text, we have the main entities the text talks about.
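To make the last point concrete, here is a minimal Python sketch of counting noun phrases. It is only an illustration, not Bitext's implementation: the input is a toy text whose part-of-speech tags were assigned by hand, the tag names are assumptions, and the chunking rule (a maximal run of adjectives and nouns that ends in a noun) is a deliberately simple stand-in for full syntactic analysis.

```python
from collections import Counter

# Hand-tagged toy input: (token, coarse part-of-speech tag). In practice the
# tags would come from a tagger or parser; here they are assigned by hand.
TAGGED = [
    ("the", "DET"), ("customer", "NOUN"), ("service", "NOUN"), ("was", "VERB"),
    ("slow", "ADJ"), ("but", "CONJ"), ("the", "DET"), ("new", "ADJ"),
    ("phone", "NOUN"), ("is", "VERB"), ("great", "ADJ"), (".", "PUNCT"),
    ("I", "PRON"), ("called", "VERB"), ("customer", "NOUN"), ("service", "NOUN"),
    ("twice", "ADV"), (".", "PUNCT"),
]

def noun_phrases(tagged):
    """Return maximal runs of ADJ/NOUN tokens that end in a NOUN."""
    phrases, run = [], []
    # The sentinel entry flushes a run that reaches the end of the input.
    for token, tag in tagged + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((token, tag))
            continue
        # Drop trailing adjectives so that every phrase ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            phrases.append(" ".join(tok.lower() for tok, _ in run))
        run = []
    return phrases

# The most frequent noun phrases approximate the main entities of the text.
phrase_counts = Counter(noun_phrases(TAGGED))
top_phrases = phrase_counts.most_common(2)
```

In this toy text, "customer service" is found twice and "new phone" once, so "customer service" comes out as the main entity, while predicate adjectives such as "slow" and "great" are not counted as entities.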

Phrase Extraction yields a better analysis than methods such as keyword or bigram extraction, for two reasons:

It avoids irrelevant combinations of words that bigrams would include, such as “but I” in the sentence “I love Samsung phones, but I hate their customer service.” Our analysis instead targets the relevant phrases [“love”, “Samsung phones”, “hate” and “customer service”].

Unlike the keyword approach, Phrase Extraction understands compound concepts such as “checking account” or “customer service”. This is essential for a good extraction of concepts and ideas, since talking about one’s “checking account” is not the same as talking about one’s “Twitter account”. A bank, for instance, will find an analysis that keeps this distinction far more useful.
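The two reasons above can be seen side by side in a small Python sketch. It is a rough illustration under stated assumptions, not how our tool works internally: the `CONTENT_TAGS` table is a hypothetical, hand-written substitute for real syntactic analysis, and the `chunk` helper simply groups adjacent words that share the same tag.

```python
import re

SENTENCE = "I love Samsung phones, but I hate their customer service."

# Hypothetical hand-assigned tags for the content words of SENTENCE; every
# other token is treated as a function word or punctuation.
CONTENT_TAGS = {
    "love": "VERB", "hate": "VERB",
    "samsung": "NOUN", "phones": "NOUN",
    "customer": "NOUN", "service": "NOUN",
}

# Split into lowercase word tokens and single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", SENTENCE.lower())
words = [t for t in tokens if t.isalnum()]

# Bigram baseline: every pair of adjacent words, meaningful or not.
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]

# Keyword baseline: isolated content words, so compounds are split apart.
keywords = [w for w in words if w in CONTENT_TAGS]

def chunk(tokens, tags):
    """Group adjacent tokens that share the same content tag into phrases."""
    phrases, run, run_tag = [], [], None
    # The empty sentinel token flushes a phrase that ends the sentence.
    for token in tokens + [""]:
        tag = tags.get(token)
        if tag != run_tag:
            if run:
                phrases.append(" ".join(run))
            run, run_tag = [], tag
        if tag is not None:
            run.append(token)
    return phrases

phrases = chunk(tokens, CONTENT_TAGS)
```

Here `bigrams` contains noise such as "but i" and "their customer", and `keywords` splits "customer service" into two unrelated words, while `phrases` keeps the four units from the example: "love", "samsung phones", "hate" and "customer service".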

Once your text has undergone Bitext’s Deep Linguistic Analysis, it is not only structured and tagged, but the concepts are also normalized so we can find all instances of the same concept appearing in the text. This output can be used to extract the main concepts and ideas mentioned in a text, but also to build categorization or sentiment dictionaries. The results can also be tailored to specific customer needs and combined with our other services such as entity extraction.
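As a simple illustration of what normalization buys you, the Python sketch below groups different surface forms under one concept. The `LEMMAS` table is a tiny hypothetical lookup that stands in for real morphological analysis, and the function names are ours for this example only.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for real morphological analysis.
LEMMAS = {"phones": "phone", "accounts": "account", "services": "service"}

def normalize(phrase):
    """Lowercase a phrase and replace each word with its lemma, if known."""
    return " ".join(LEMMAS.get(word, word) for word in phrase.lower().split())

def group_by_concept(phrases):
    """Map each normalized concept to the surface forms that express it."""
    groups = defaultdict(list)
    for phrase in phrases:
        groups[normalize(phrase)].append(phrase)
    return dict(groups)

extracted = [
    "Samsung phones", "samsung phone",
    "Checking accounts", "checking account",
    "customer service",
]
concepts = group_by_concept(extracted)
```

With this grouping, "Samsung phones" and "samsung phone" count as two mentions of the same concept, which is the kind of output you would want before building a categorization or sentiment dictionary.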

If you want to know more, take a look and try Phrase Extraction here.