Use regex to harvest multi-word terms

Hans CafeTran Wiki

started a topic
over 3 years ago

Hello Kevin,
Can you define regexes, or does your platform already offer prebuilt ones, to harvest multi-word glossary candidates like:
The old man
I for me
a brand-new day
See the link in my posting about regex in the Tips&Tricks subforum.
Cheers
H

kevin

said
over 3 years ago

Hi Hans,

TM-Town currently harvests multi-word terms by searching for n-grams (bi-grams, tri-grams, 4-grams), extracting those that occur at a high frequency, and filtering out redundant ones (i.e. removing a bi-gram that is part of a larger tri-gram).
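
For readers who want to try the approach described above, here is a minimal sketch in Python. It is not TM-Town's actual code: the function names, the tokenizer regex, and the `min_count` threshold are all assumptions made for illustration, and the subsumption rule (drop any frequent n-gram contained in a longer frequent n-gram) is only one possible reading of the filter.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def contained_in(short, long):
    """True if the tuple `short` occurs as a contiguous slice of `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def harvest_terms(text, min_count=2, sizes=(2, 3, 4)):
    """Hypothetical helper: frequent n-grams minus those inside longer ones."""
    # Naive tokenizer (an assumption): lowercased words, hyphenated words kept
    # whole. It ignores sentence boundaries and will not segment languages
    # that are written without spaces.
    tokens = re.findall(r"\w+(?:-\w+)*", text.lower())

    counts = Counter()
    for n in sizes:
        counts.update(ngrams(tokens, n))

    # Keep only n-grams that reach the frequency threshold.
    frequent = {gram: c for gram, c in counts.items() if c >= min_count}

    # Drop an n-gram when a longer frequent n-gram already contains it.
    kept = {}
    for gram, count in frequent.items():
        subsumed = any(
            len(other) > len(gram) and contained_in(gram, other)
            for other in frequent
        )
        if not subsumed:
            kept[" ".join(gram)] = count
    return kept


sample = "The old man saw a brand-new day. The old man liked a brand-new day."
print(harvest_terms(sample))
```

On this tiny sample only "the old man" and "a brand-new day" survive, each with a count of 2. On real documents the same procedure also keeps frequent but uninteresting sequences, since nothing in it looks at grammar.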

I'd be interested to learn more about how you would use a regex. One concern I have with a regex solution is that it is difficult to make it language-independent. In other words, it might work fine for European languages, but it will probably fail badly for Asian languages.

TM-Town can still get much, much better at extracting multi-word terms, so I'm open to any and all ideas.

Kevin

Hans CafeTran Wiki

said
over 3 years ago

Hello Kevin,
The solution you describe has already been offered in many other tools. None of them succeeds in producing anything but a lot of noise, with an occasional gem.
Language-independent approaches won't be possible. But consider it from the user's perspective instead of the developer's perspective: why should a French translator be interested in a solution for Russian?
I think that with the community it should be possible to define some very productive regexes for the most frequently used languages in the CAT world. No need to cover all n languages of the world.
Why not, for instance, use the fact that German nouns start with a capital letter? Why throw away this valuable linguistic marker just because other languages don't use it?
Why not use the sheer mathematical regularity of adjective inflections to find that the words of a multi-word compound actually belong to the same nest?
Why not use MT, perhaps even into several languages simultaneously, to isolate the noun compounds, verb compounds, etc.?
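
As a rough illustration of the German capitalisation idea, here is a minimal sketch in Python. Everything in it is an assumption made for the example, not something taken from CafeTran or TM-Town: the pattern, the short determiner list, and the function name. It looks for a lowercase word with a typical adjective ending (-e, -er, -es, -en, -em) followed directly by a capitalised word, and it does not normalise inflected forms or handle sentence-initial capitals.

```python
import re
from collections import Counter

# Assumed pattern: lowercase word with a common adjective ending, whitespace,
# then a capitalised word (a likely German noun).
ADJ_NOUN = re.compile(
    r"\b([a-zäöüß]+(?:e|er|es|en|em))\s+([A-ZÄÖÜ][a-zäöüß]+)\b"
)

# Small, incomplete determiner list (an assumption) so that pairs such as
# "den Antrieb" are not reported as adjective + noun.
DETERMINERS = {
    "der", "die", "das", "den", "dem", "des",
    "eine", "einen", "einem", "einer", "eines",
}


def german_candidates(text):
    """Hypothetical helper: count adjective + capitalised-noun pairs."""
    counts = Counter()
    for adjective, noun in ADJ_NOUN.findall(text):
        if adjective not in DETERMINERS:
            counts[f"{adjective} {noun}"] += 1
    return counts


sample = (
    "Die hydraulische Pumpe treibt den Antrieb an. "
    "Eine hydraulische Pumpe braucht einen elektrischen Antrieb."
)
print(german_candidates(sample))
```

On the sample this reports "hydraulische Pumpe" twice and "elektrischen Antrieb" once, while "den Antrieb" is filtered out by the determiner list. A real rule set would need many more patterns per language, which is where community-contributed regexes could come in.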
Cheers,
Hans

kevin

said
over 3 years ago

Hello Hans,

Thanks, all great points! I definitely plan to spend more time soon on improving the term extraction in TM-Town. I've already made some adjustments over the past few days based on your email feedback.

If you are willing, it would be great if you could send me a small sample doc in your language of choice (maybe German) and a second document with the terms that you would like/expect to see extracted. This way I have a base to test against.

I know you are busy, so no worries if you can't, but much appreciated if possible.