This ticket proposes integrating the generated alignments into Content translation as an additional criterion to consider during the parameter mapping process. When a template is added to a translation for a language pair with alignment data available, the alignments will be used to identify additional mappings that could not be found with the default approaches. That is, metadata from TemplateData and Parsoid will still be used; the alignments will only surface additional possible mappings that were not considered before.

Since the alignment information comes with probability data, we need to define a reasonable threshold. In this case, I think it makes more sense to err on the side of the information being copied to the wrong parameter (favoring coverage) rather than being lost (favoring accuracy), but we may need to experiment and iterate on the exact value.
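To illustrate the trade-off, here is a minimal sketch of threshold-based filtering. The threshold value and the data shape are assumptions for illustration; a lower threshold keeps more candidate mappings (higher coverage) at the risk of wrong copies, a higher one drops uncertain mappings (higher accuracy):

```python
# Hypothetical sketch: filter alignment candidates by score.
# SCORE_THRESHOLD is an assumed starting value, to be tuned by experiment.
SCORE_THRESHOLD = 0.5

def filter_mappings(candidates, threshold=SCORE_THRESHOLD):
    """Keep only param mappings whose alignment score meets the threshold."""
    return {
        source: (target, score)
        for source, (target, score) in candidates.items()
        if score >= threshold
    }

# Sample scores taken from the example below.
candidates = {
    "fecha": ("data", 0.562465710685373),
    "apellido": ("cognom", 0.733333677522359),
}
```

With a threshold of 0.5 both mappings survive; raising it to 0.6 would drop fecha/data, losing a correct mapping.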

Regarding metrics, it would be great to measure how many templates can be adapted with this method, both overall and compared to those that are adapted incompletely or not at all. Depending on the complexity of this, a separate ticket can be created.

After a brief analysis of the generated JSON mapping, this is what I am planning to do:

The JSON files are big, so parsing them and doing lookups will be slow. Even if we plan to load and cache them, it is 12 MB across the 210 files for 15 languages, and this is expected to grow as we add more languages. So I plan to load all this data into an SQLite database and use queries to check whether a mapping exists and to retrieve it.
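A sketch of that approach (table and column names are hypothetical, and the JSON shape `{template: {source_param: [target_param, score]}}` is an assumption): the alignment files are loaded once into an indexed SQLite table, and lookups become single indexed queries instead of JSON parsing:

```python
import json
import sqlite3

def create_db(path=":memory:"):
    """Create the alignment table with an index covering the lookup key."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS template_param_alignment (
        source_lang TEXT, target_lang TEXT, template TEXT,
        source_param TEXT, target_param TEXT, score REAL)""")
    db.execute("""CREATE INDEX IF NOT EXISTS idx_alignment_lookup
        ON template_param_alignment
        (source_lang, target_lang, template, source_param)""")
    return db

def load_alignments(db, source_lang, target_lang, json_path):
    """Bulk-load one language pair's JSON alignment file into the table."""
    with open(json_path) as f:
        data = json.load(f)
    rows = [
        (source_lang, target_lang, template, src, tgt, score)
        for template, params in data.items()
        for src, (tgt, score) in params.items()
    ]
    db.executemany(
        "INSERT INTO template_param_alignment VALUES (?, ?, ?, ?, ?, ?)",
        rows)
    db.commit()

def lookup(db, source_lang, target_lang, template, source_param):
    """Return (target_param, score) for a source parameter, or None."""
    return db.execute(
        """SELECT target_param, score FROM template_param_alignment
           WHERE source_lang = ? AND target_lang = ?
             AND template = ? AND source_param = ?""",
        (source_lang, target_lang, template, source_param)).fetchone()
```

The index on `(source_lang, target_lang, template, source_param)` keeps each lookup cheap even as more language pairs are added.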

The template alignment system gave us the following mapping and scores:

| Source param | Target param | Score |
| --- | --- | --- |
| apellido | cognom | 0.733333677522359 |
| año | any | 0.621203767666562 |
| conferencia | conferència | 0.79699115204052 |
| fecha | data | 0.562465710685373 |
| nombre | nom | 0.770534231312654 |
| título | títol | 0.674103936218413 |
| url | url | 0.599459685454743 |

Great work, and thanks for the clear example, @santhosh. It is great to see that this is automatically providing extra mappings that we were not finding before.

It's interesting that a mapping was found for word pairs that are very different, such as apellido/cognom or fecha/data, but not for common similar words such as formato/format or páginas/pàgines. Maybe @diego can confirm whether these were cut because of the threshold, because those words were not available in the corpora used, or for some other reason.

Given that there are no false positives in the obtained mappings, and if this example is representative, we may even consider making the threshold a bit less strict to get some more mappings.