Michael Beijer

started a topic
over 3 years ago

Hi Kevin,

I was wondering when/if you will be taking integration to the next level, by adding fuzzy matching (and colour-coded highlighting/fonts, strikethrough, etc.) to the online TMs that can be used from within CafeTran? Without this, these TMs are only of limited use.

kevin

said
over 3 years ago

Hi Michael,

Could you give a little more detail about what you would like to see? I think CafeTran currently highlights the fuzzy match portion in the source results (I am using the default settings and it appears in light green for me). For example, here I highlight 'Video of yourself' and click 'Search your segments'. You'll see 'Video of yourself' is highlighted in the results on the left.

Are you looking for something more? Do you also want the fuzzy match "guess" highlighted in the target sentence? This would require word aligning each segment...which is something I have considered, but not yet implemented.

Kevin

Michael Beijer

said
over 3 years ago

Hi Kevin,

No, I don't mean highlighting any matches when you select a phrase and press 'Search your segments' (or CT’s R1 button), but fuzzy matching and highlighting as shown in a TM window, automatically, when reaching a new segment. So: searching all online TMs for potential matches and fuzzy matches.

I know that LF Aligner uses bilingual glossaries to assist its aligner; is this in any way related to your suggestion? Also, can you expand on what you mean by ‘word aligning each segment’?

Michael

Michael Beijer

said
over 3 years ago

PS [warning! I am thinking out loud here, so much of it is likely to be total nonsense]:

Does ‘word aligning each segment’ mean: using bilingual glossaries to predict which words on each side (source and target) will be connected? I know very little about this. All I really know is that LF Aligner uses bilingual glossaries (actually, tons of aligned term lists, in different languages, from Hunalign) to help its aligner.

This thought today led me to wonder if it would be possible to put all the many bilingual Dutch-English glossaries I have collected over the years to good use: use them to help improve subsegment matching. I know that a few other CT users have accumulated large glossaries over the years, consisting of very short phrases, and of course single terms. Could all of this data be used for something like this?

Igor Kmitowski

said
over 3 years ago

Hi Michael,

It is an interesting idea to have glossary support during subsegment matching, but think of usability in the CafeTran context. Such a glossary can produce the match on its own, so do we really need to extract the same match from the TM segment? That would be useful for a tool that does not have a separate glossary matching function.

Igor

kevin

said
over 3 years ago

Hi Michael,

"No, I don't mean highlighting any matches when you select a phrase and press 'Search your segments' (or CT’s R1 button), but fuzzy matching and highlighting as shown in a TM window, automatically, when reaching a new segment. So: searching all online TMs for potential matches and fuzzy matches."

The example I showed was highlighting a phrase, but it should also show you exact matches and fuzzy matches automatically if you have the correct settings in place. If you right-click the "Search your segments" tab you'll see some additional options. You want to have 'Automatic Search' selected and also 'Search your segments by default'. Now when you click 'Next' to move to the next segment it will automatically search your TM-Town TMs and return any exact matches or fuzzy matches, sorted by the "closeness" of the match, with the closer or exact matches being at the top.
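To make "sorted by closeness" concrete, here is a minimal sketch of ranking TM entries against a new segment. It is not how TM-Town or CafeTran actually score matches; it simply assumes a word-level similarity ratio from Python's standard `difflib`, and the function names, the 0.5 threshold, and the sample TM entries are all invented for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between 0.0 and 1.0 (assumed metric)."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def rank_matches(segment, tm, threshold=0.5):
    """Return (score, source, target) tuples at or above the threshold, closest first."""
    scored = [(similarity(segment, src), src, tgt) for src, tgt in tm]
    hits = [hit for hit in scored if hit[0] >= threshold]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)

# Invented sample TM: (source, target) pairs.
tm = [
    ("Upload a photo of yourself", "Upload een foto van jezelf"),
    ("Delete your account", "Verwijder je account"),
    ("Upload a video of yourself", "Upload een video van jezelf"),
]

results = rank_matches("Upload a video of yourself", tm)
for score, src, tgt in results:
    print(f"{score:.0%}  {src}  ->  {tgt}")
```

With this toy data the exact match comes first, the one-word-off entry second, and the unrelated entry is dropped by the threshold.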

"I know that LF Aligner uses bilingual glossaries to assist its aligner; is this in any way related to your suggestion? Also, can you expand on what you mean by ‘word aligning each segment’?"

LF Aligner is for aligning parallel texts on a segment level. It uses bilingual glossaries to assist in this.

When I said 'word aligning each segment' I probably should have said 'word aligning each translation unit'. This wiki explains the general concept. Word alignment involves matching each source word with its corresponding target translation. A source word can be aligned with multiple target words, one target word, or no target words.

Typically the output of a word alignment will look something like the following:

0-0 0-1 1-2 2-3

This shows the pairwise matching of the aligned words. 0-0 means that the word in the first position in the source segment (the first position is 0) is aligned with the word in the first position in the target segment, 0-1 means that the same source word is also aligned with the second target word, and so on.
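As an illustration of how to read that output, here is a small sketch that turns an alignment line into word pairs. The function names and the example sentence pair are made up for this illustration; only the `source-target` index format comes from the line shown above.

```python
def parse_alignment(line):
    """Parse 'i-j' tokens into (source_index, target_index) pairs."""
    pairs = []
    for token in line.split():
        src_idx, tgt_idx = token.split("-")
        pairs.append((int(src_idx), int(tgt_idx)))
    return pairs

def aligned_words(source, target, alignment_line):
    """List each source word with the target words it is aligned to.

    A source word with no alignment gets an empty list.
    """
    src_words = source.split()
    tgt_words = target.split()
    result = [(word, []) for word in src_words]
    for src_idx, tgt_idx in parse_alignment(alignment_line):
        result[src_idx][1].append(tgt_words[tgt_idx])
    return result

# Invented example: one source word maps to two target words.
pairs = aligned_words("Tengo un libro", "I have a book", "0-0 0-1 1-2 2-3")
for src_word, tgt_list in pairs:
    print(src_word, "->", tgt_list)
```

Here the first source word is aligned with two target words, and each of the other two source words is aligned with one.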

Word alignment is used for machine translation. One of the features I want to add to TM-Town in the near future is the ability to train your own machine translation engine for private use. You could do this now on your home computer if you were interested. The most popular open-source machine translation tool is called Moses. One of the steps in training Moses with your TMs is word aligning them. The most popular word alignment tools are:

One thing to note about word alignment is that it is very time-consuming and also requires a lot of memory.

Anyway, back to the question of subsegment matching. I've only briefly read the Lift thread, so I'll have to look into that more. I think, though, that in a way MT is a form of subsegment matching. If you have a machine translation engine trained on your work, and you use the results from that alongside results from your TMs and glossaries, I think that can help improve efficiency. In my opinion, translators sometimes dismiss MT because they expect the full-segment MT result to be perfect, instead of using the MT result as another reference point (particularly subsegments within the result) to translate faster.

Kevin

kevin

said
over 3 years ago

When I said "Do you also want the fuzzy match "guess" highlighted in the target sentence?" I was thinking of something like Linguee, where they show the results highlighting the target word as well. I used the word "guess" in quotes as word alignment tools, while probably having a decent level of accuracy, are definitely not perfect (yet?).

Michael Beijer

said
over 3 years ago

Hi Kevin,

I see. But I don't just mean showing the exact matches and fuzzy matches; I also mean indicating any differences between them and the source segment (via highlighting, colours, strikethrough, etc.), as a CAT tool does. That would make it much more useful.
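A rough sketch of the kind of difference marking meant here, assuming a word-level diff from Python's standard `difflib`. The `[-...-]` and `{+...+}` markers are just plain-text stand-ins for strikethrough and colour, and the function name and sample segments are invented; this is not how CafeTran or any particular CAT tool does it.

```python
from difflib import SequenceMatcher

def mark_differences(tm_source, new_segment):
    """Mark words removed from the TM source as [-x-] and words added as {+x+}."""
    old = tm_source.split()
    new = new_segment.split()
    out = []
    matcher = SequenceMatcher(None, old, new)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old[i1:i2])
            continue
        if i1 < i2:  # words only in the TM source (delete or replace)
            out.append("[-" + " ".join(old[i1:i2]) + "-]")
        if j1 < j2:  # words only in the new segment (insert or replace)
            out.append("{+" + " ".join(new[j1:j2]) + "+}")
    return " ".join(out)

marked = mark_differences("Upload a photo of yourself", "Upload a video of yourself")
print(marked)  # Upload a [-photo-] {+video+} of yourself
```

A real implementation would render these spans with fonts and colours in the TM window instead of inline markers.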

Thanks for all the info on aligning, etc!

Michael

kevin

said
over 3 years ago

Hi Michael,

I see. I think that should be possible, but would be a question/request for Igor. I'll ping him and ask about it.

Kevin

woorden

said
over 3 years ago

Yes, Kevin, ping him. I'm sure Igor will love the idea. It will make the free version of CafeTran unlimited (as I mentioned a few months ago). Or your TM-T will be kicked out of the free version. Win-win!

Hans

Michael Beijer

said
over 3 years ago

Hi Kevin,

Thanks!

I really like the idea of being able to keep my TMs online, primarily because I am hoping it will make it possible to use larger TMs in CafeTran than is currently possible.

kevin

Hans, if what you mention becomes an issue I think Igor could limit the TM-Town extension to only those who have paid for a license. Problem solved.

Kevin

woorden

said
over 3 years ago

That's what I said, "Or your TM-T will be kicked out of the free version," no good. No good for Kevin.

But then again, I'm harmless. Be sure to keep a working CT install back-up.

kevin

said
over 3 years ago

I don't think it is a bad thing if it is limited to only users who have purchased a license for CafeTran. I think those are the people who are using the extension now anyway. This is similar to the integrations with ProZ.com. You have to be a paying ProZ.com member to use the feature to send your portfolio samples from ProZ.com to TM-Town.

There may be a slight benefit to being able to try out the TM-Town extension in the free trial. In that case, instead of limiting it to those with a license, maybe one could get 500 queries from TM-Town (or some similar limitation) in the free trial version.