Gaps and Errors: A Linguistic Struggle

Scholars associated with the CAS project 'SynSem: From Form to Meaning – Integrating Linguistics and Computing' face a number of challenges that 'regular' people, though they use language every day, would never consider.

‘One might think language is made up of a set of grammatical rules that it follows without exception. But this is a somewhat idealistic notion,’ said Dan Flickinger, a senior researcher at Stanford University in the US.

For the past 25 years, Flickinger has tried to find a computational approach to linguistic questions -- an issue that is harder to solve than one might imagine. Computer programs are strictly rule-based -- they follow instructions and cannot ‘think’ for themselves. Language and people, on the other hand, are much more flexible.

Some grammatical rules cannot be broken or altered if a sentence is to be correct -- ‘the’ cannot come after the noun it modifies, for example. Some rules apply only in formal writing -- ‘my brother and I’ rather than ‘me and him’ -- though informal phrasings are slowly gaining acceptance. But people also understand grammatically incorrect sentences, as when talking with someone new to the language or listening to Yoda from the Star Wars franchise.

Such flexibility is much harder for computers to handle.

Finding the missing pieces

‘As linguists, we like to sit down in our chairs and think about how language is being used, how people structure their phrases, and how they encode meaning,’ Flickinger said.

At CAS Oslo, Flickinger has invested his time in understanding the particular linguistic phenomenon called gapping.

‘For anyone listening, the sentence "I gave my sister a flower and my brother a book" makes perfect sense, but it is actually extremely inconvenient when trying to construct a computational framework,’ Flickinger said.

He explains that the sentence’s meaning is essentially a conjunction of two sentences -- ‘I gave my sister a flower’ and ‘I gave my brother a book.’ But when put together, the sentence is made more efficient by dropping the second ‘I gave.’ Our minds can, based on the first part of the sentence, fill in the gap (hence the term gapping).
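As a toy illustration of what gap-filling involves -- not the grammar-engineering approach Flickinger's group actually uses, which relies on full grammatical rules rather than string manipulation -- one can sketch the reconstruction of the elided material like this:

```python
# Toy sketch of gapping reconstruction: assumes the elided material is
# exactly the subject + verb of the first conjunct. Real grammars need
# far more sophisticated rules, as the article goes on to explain.

def fill_gap(sentence, conj="and"):
    """Expand 'SUBJ VERB obj1 ... and obj2 ...' into two full clauses."""
    first, _, second = sentence.partition(f" {conj} ")
    subj_verb = " ".join(first.split()[:2])   # e.g. "I gave"
    return [first, f"{subj_verb} {second}"]

clauses = fill_gap("I gave my sister a flower and my brother a book")
# clauses[0] == "I gave my sister a flower"
# clauses[1] == "I gave my brother a book"
```

The hard part, of course, is that the simplistic 'first two words' assumption breaks down as soon as the sentence gets longer or noisier -- which is exactly the difficulty described next.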

But making a computer do the same is a difficult process.

‘One has to tell the computer how to do everything by using rules,' Flickinger said. 'Constructing rules for gapping is no easy task, and the problem becomes more complicated the longer the sentence is as there will be more "noise."'

As an example, take the sentence ‘I gave my sister a flower for her birthday today and my brother a book for his last Sunday.’ The sentence may not be any harder for people to understand than the previous example, but the second phrase is now missing both ‘I gave’ and ‘birthday.’

Understanding incorrect language

When new rules are written, they have to be tested. One way of doing this is through machine learning.

Gosse Bouma is an associate professor at the University of Groningen in the Netherlands who specialises in applications of natural language processing. He explains that when they use this technique, they feed positive data into the computer -- giving it a library of ‘right answers’ to go with the rules.

The machine makes a model based on the samples of input, which it can use to make data-driven predictions for future input. This method is generally speaking a form of pattern recognition, and machines need large amounts of data in order to make reliable predictions.
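A minimal sketch of this idea -- learning from positive examples only and then scoring new input by how well it matches the learned patterns -- might look as follows. (This toy bigram counter is an assumption for illustration; the article does not specify which models the researchers use.)

```python
from collections import Counter

# Train on positive data only: a library of 'right answers'.
def train(sentences):
    counts = Counter()
    for s in sentences:
        words = ["<s>"] + s.lower().split()
        counts.update(zip(words, words[1:]))  # record observed word pairs
    return counts

# Score new input by how many of its word pairs were seen in training.
def score(model, sentence):
    words = ["<s>"] + sentence.lower().split()
    return sum((a, b) in model for a, b in zip(words, words[1:]))

model = train(["the cat sleeps", "the dog sleeps", "a cat runs"])
score(model, "the cat runs")    # all three word pairs seen: score 3
score(model, "cat the sleeps")  # unfamiliar word order: score 0
```

With only three training sentences the predictions are crude; this is why, as noted above, machines need large amounts of data to make reliable predictions.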

Although limited to the Dutch language, Bouma and his co-workers have found an almost endless source of data: Twitter.

‘In the course of a day, we collect about 200 000 Dutch tweets,’ Bouma said. ‘Tweets are a great source of data, as the length of a tweet is limited.’

As most people know, one of Twitter’s hallmarks has been its 140-character limit, which in late 2017 was doubled to 280. This constraint means a tweet contains only short, concise sentences -- a perfect size for data input.

Of course, the limit has also spurred creativity, as users had to cram their thoughts into 140 characters. The result was a lot of abbreviations, as well as shortened, or even missing, words. And some of these trends have become more than just social media slang.

‘Lol.’ ‘Omg.’ ‘Bae.’ ‘Gonna.’ ‘Thx.’ ‘Pls.’ ‘Selfie.’

‘I didn’t go to class today, because tired.’

Should a linguistic framework accommodate these kinds of ‘errors,’ or should it convert all ‘faulty’ tweets into grammatically correct sentences?
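The second option -- converting ‘faulty’ tweets into standard forms before parsing -- can be sketched as a simple lookup-based normaliser. (The table below is hypothetical; the article does not describe how, or whether, the researchers normalise their data.)

```python
# Hypothetical slang-to-standard lookup table for tweet normalisation.
SLANG = {
    "thx": "thanks",
    "pls": "please",
    "gonna": "going to",
    "omg": "oh my god",
}

def normalise(tweet):
    """Replace known slang tokens with their standard-language forms."""
    return " ".join(SLANG.get(w.lower(), w) for w in tweet.split())

normalise("thx for the pic pls")  # "thanks for the pic please"
```

Even so, a lookup table cannot rescue sentences with genuinely missing words, such as ‘because tired’ -- those still demand a framework that tolerates the gap.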

Can one size fit all?

‘There is not one theory or framework that can solve all problems, but different frameworks can do different things well,’ Flickinger said. ‘Our job has to be to find the strengths and shortcomings of the frameworks so that they can be improved. In the end, we might have a program that represents language well.’

‘The ultimate goal will be to create an automatic system for all languages,’ Bouma said. ‘Here at CAS Oslo, we are working together on a cross-linguistic framework.'

If successful, such a framework could be used for a number of applications, regardless of language.

The researchers will have to map similarities and differences between languages and find ways to describe them as rules for a computer to understand.

‘We work with a lot of languages -- English, Norwegian, Dutch, Swedish, Polish, and Czech -- and having all of us working together under the same roof has been very fruitful,’ Flickinger said.

‘Creating such a framework will take much longer to complete than we have time at CAS Oslo, and so I believe our collaboration will continue when our stay is over,’ Bouma said.

Although computational linguistics faces very complex problems, there is no doubt the field has progressed tremendously in recent years.

‘When I first started working on this 20 years ago, I thought it would be purely theoretical. I never thought it would actually be implemented in the ways we see today,’ Bouma said.

Now there are question-answering bots, search algorithms, translation websites, and autocorrection -- all thanks to computational representations of language.

Solving a linguistic problem adds a new piece to the puzzle. Although the puzzle is far from finished and there is a lot of work to be done, Bouma and Flickinger believe their field will contribute to many wondrous applications people have yet to invent.