Computational Linguistics and Text Mining

By Frank Jennings, September 19, 2013

A method to fingerprint the structure of English sentences and compute the grammatical distance between fragments.

Distance Between Text Fragments

With this research as a base, is it possible to measure how two text fragments differ from each other? In information theory, there are known ways to measure the edit distance between two text fragments. For instance, Levenshtein proposed a model in which the distance between two fragments is the number of single-character edits needed to transform one into the other. These techniques are not effective here because they don't evaluate the grammatical structure of the text fragments. Using computational linguistics techniques, I devised a method that finds the "grammatical distance" between text fragments.
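For reference, the character-level edit distance mentioned above can be sketched as follows. This is the standard dynamic-programming formulation of Levenshtein distance, not part of the POSTALS method; the function name levenshtein is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # substitution
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Life is an untold tale.",
                      "Acceptance is a subtle pain."))
```

The result depends only on the characters involved, so two fragments with the same grammatical structure but different words still come out far apart.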

Consider two similar text fragments, Fragment 1 and Fragment 2:

The POSTALS-based grammatical deconstruction of Fragment 1, "Life is an untold tale.", generates the result in Figure 2.

Figure 2: Life is an untold tale.

Similarly, for Fragment 2, "Acceptance is a subtle pain.", the result is shown in Figure 3.

Figure 3: Acceptance is a subtle pain.

No string-comparison or distance-measuring tool will treat these two fragments as "equal," though they are "grammatically equal." By grammatically equal, I mean that the two fragments are identical in their POS sequences and ordering, as shown in Figures 2 and 3.
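A minimal sketch of this notion of grammatical equality follows. The tags are assigned by hand using Penn Treebank labels, which is an assumption made here for illustration; the tag set and tagger behind POSTALS may differ, and the helper names pos_sequence and grammatically_equal are hypothetical.

```python
# Hand-tagged fragments as (word, POS tag) pairs. The Penn Treebank
# tags below are an assumption, not output from the POSTALS system.
fragment_1 = [("Life", "NN"), ("is", "VBZ"), ("an", "DT"),
              ("untold", "JJ"), ("tale", "NN"), (".", ".")]
fragment_2 = [("Acceptance", "NN"), ("is", "VBZ"), ("a", "DT"),
              ("subtle", "JJ"), ("pain", "NN"), (".", ".")]


def pos_sequence(tagged):
    """Drop the words and keep only the ordered POS tags."""
    return tuple(tag for _, tag in tagged)


def grammatically_equal(tagged_a, tagged_b):
    """True when both fragments have the same POS tags in the same order."""
    return pos_sequence(tagged_a) == pos_sequence(tagged_b)


if __name__ == "__main__":
    print(pos_sequence(fragment_1))
    print(grammatically_equal(fragment_1, fragment_2))
```

Under these hand-assigned tags the two fragments share one tag sequence even though the fragments differ in almost every word.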

Computing the Distance

The grammatical distance between these two strings is computed as follows: The two POSTALS Prints are overlaid, and the distance from each POS tag in one fragment to the corresponding node in the other fragment is computed. Distances for earlier POS tags are given less weight, and distances for later POS tags are given more.
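The article does not give the exact formula, so the following is only one possible reading of the overlay idea. It assumes a per-position cost of 1 for any tag mismatch (including a missing tag when the fragments differ in length) and a weight equal to the 1-based position, so later mismatches count more. Both choices, and the function name grammatical_distance, are assumptions.

```python
from itertools import zip_longest


def grammatical_distance(tags_a, tags_b):
    """Position-weighted mismatch count between two POS tag sequences.

    Assumed scheme: a mismatch at 1-based position p adds p to the
    total, so earlier tags weigh less and later tags weigh more.
    """
    total = 0
    # zip_longest pads the shorter sequence with None, so extra tags
    # in the longer fragment are counted as mismatches.
    for position, (tag_a, tag_b) in enumerate(
            zip_longest(tags_a, tags_b), start=1):
        if tag_a != tag_b:
            total += position
    return total


if __name__ == "__main__":
    base = ["NN", "VBZ", "DT", "JJ", "NN"]
    early_change = ["NNS", "VBZ", "DT", "JJ", "NN"]
    late_change = ["NN", "VBZ", "DT", "JJ", "NNS"]
    print(grammatical_distance(base, base))
    print(grammatical_distance(base, early_change))
    print(grammatical_distance(base, late_change))
```

A real implementation would likely also need a graded notion of how far apart two tags are (for example, NN and NNS being closer than NN and VBZ); this sketch treats every mismatch alike.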

What is more interesting is that, with the ability to compute the POSTALS distance between any two text fragments, I can enable the system to construct similar sentences based on that distance. I have built a GUI that generates random sentences from an input sentence and a POSTALS distance (Figure 4).
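The internals of the generator are not described, so the sketch below covers only the simplest case: producing a random sentence at distance zero, that is, one with exactly the same POS sequence as the input. The tiny LEXICON, the Penn Treebank tags, and the function name generate_sentence are all hypothetical stand-ins for a mined text base.

```python
import random

# Hypothetical word lists keyed by POS tag; a real system would
# draw these from a large mined and tagged text base.
LEXICON = {
    "NN": ["life", "acceptance", "tale", "pain", "silence"],
    "VBZ": ["is"],
    "DT": ["the", "this"],
    "JJ": ["untold", "subtle", "quiet"],
}


def generate_sentence(tag_sequence, rng):
    """Pick one random word per POS tag, preserving the tag order."""
    words = [rng.choice(LEXICON[tag]) for tag in tag_sequence]
    # Capitalize the first word and close the sentence with a period.
    return " ".join(words).capitalize() + "."


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs match
    for _ in range(3):
        print(generate_sentence(["NN", "VBZ", "DT", "JJ", "NN"], rng))
```

Generating sentences at a nonzero distance would additionally require perturbing the tag sequence before filling it with words, which this sketch does not attempt.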

Figure 4: GUI for the POSTALS system.

Conclusion

Using a combination of computational linguistics and text mining, we can build a complex text-generating system that can "talk" and "respond" like humans when trained effectively. Why compute the POSTALS Print? How will this information be useful in linguistic studies? I believe:

In the future, machines will be able to construct simple-to-complex English sentences with almost 100% accuracy, provided the mined text base is reasonably large and accurate.

The machines will start "understanding" the mood of sentences and be able to group sentences based on mood and other factors.

A grammar-checking system that flags bad grammar and auto-suggests right structures purely based on mined data is possible.

An AI-based learning system that can "understand" and "communicate" with us in English is possible.

When machines do this, this kind of fingerprinting and categorization of sentences will play a crucial part in their linguistic capabilities.

Frank Jennings is a Senior Content and Community Lead at Adobe Systems.
