Do you feel like taking a sizeable bite into useful functionality for the Translate project?

These mini projects are mostly standalone, in other words, they do not affect other parts of Pootle or the Translate Toolkit but are useful tools for the Translate project. If you would like to hack on one then please contact the team and we will assign the project to you.

Both XLIFF and TMX can make use of a standard for segmentation. What is segmentation? It is how you break up paragraphs of text into sentences. The rules differ between languages, and the standard developed by LISA allows you to define segmentation rules for text.

The advantage of segmentation is that it creates better reuse. One text might have a paragraph containing a sentence that exactly matches a sentence in another piece of text, but if we cannot get down to the sentence level we will not see the match. So segmentation makes our Translation Memory much more usable. And if it is part of XLIFF, it makes matching during the actual translation more usable too.

Your job would be to implement the segmentation standard and integrate that into XLIFF and TMX as needed.
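To give a feel for what rule-based segmentation involves, here is a minimal sketch. It is not an implementation of the standard: the rule format (a pattern before the break, a pattern after it, and a break/no-break flag), the `RULES` list and the `segment` function are all assumptions made up for illustration. The only idea it tries to show is that exception rules are checked before the general "break after punctuation" rule.

```python
import re

# Hypothetical rules: (pattern before the position, pattern after it, is_break).
# They are tried in order and the first rule that matches decides.
RULES = [
    (r"\b(?:Mr|Mrs|Dr|e\.g|i\.e)\.", r"\s", False),  # exceptions: no break here
    (r"[.?!]+", r"\s+", True),                       # default: break after punctuation
]

def segment(text, rules=RULES):
    """Split text into sentences using ordered break/no-break rules."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for before, after, is_break in rules:
            if (re.search("(?:%s)$" % before, text[:pos])
                    and re.match("(?:%s)" % after, text[pos:])):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, later rules are ignored
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(segment("Dr. Smith arrived. He sat down."))
```

A real implementation would read the rules per language from a rules file rather than hard-coding them, and would need something faster than re-scanning the text at every position.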

You have two pieces of text, the original and the translated text, but you do not have them combined in a bilingual translation file. Perhaps this is old work you inherited from someone else, or you've found a source of good translations and you want to be able to use them in your Translation Memory. You might only have the latest source document and an old translation, so you don't expect them to align completely.

In this case we need an alignment tool. The tool should be able to read the files using our base classes and present the texts side by side, hopefully using the segmentation rules to make good guesses. The role of the user is to validate the alignment and to adjust it if needed.

The end result should be that all the text items have been either aligned or rejected.

The program can then output a new bilingual translation file, e.g. XLIFF or PO, or a TMX Translation Memory file.

Some notes pasted from another page:

At the moment, po2tmx can create a TMX file, but the TUs in the TMX often contain more than one sentence. A TM is more useful to translators if it contains sentences, not paragraphs. So, how about a tool that takes such a TMX and attempts to convert each TU into sentences (source and target)? The output can be CSV, so that a translator can open it in a graphical CSV editor and correct misalignments. The output can even be a plaintext tab-delimited file so that one can open it as a table in OpenOffice and use shortcuts to correct the alignment.

The idea would be to keep the first sentence of each TU aligned. This means that if a previous TU had a different number of sentences in the source and target, there would be empty cells above the current TU (either in the source or in the target).
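The row layout described above could look roughly like this. It is only a sketch: `split_sentences` is a deliberately naive splitter, the TUs are passed in as plain (source, target) string pairs instead of being read from a TMX file, and the function names are made up for illustration.

```python
import csv
import io
import re
from itertools import zip_longest

def split_sentences(text):
    """Naive splitter, for illustration only: break after . ? or ! plus whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

def tu_rows(units):
    """Turn (source, target) TUs into sentence rows, one TU after the other.

    Each TU starts on a fresh row, so its first sentences stay aligned; the
    shorter side of a TU is padded with empty cells.
    """
    rows = []
    for source, target in units:
        rows.extend(zip_longest(split_sentences(source),
                                split_sentences(target), fillvalue=""))
    return rows

def to_tsv(rows):
    """Write the rows as tab-delimited text that a spreadsheet can open."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()

units = [("One. Two.", "Een."), ("Three.", "Drie.")]
print(to_tsv(tu_rows(units)))
```

The translator would then fix the misaligned rows by hand in the spreadsheet or CSV editor.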

(note: some initial work has been done on a tool as described here. See poglossary.)

When you translate you should start with a glossary of terms. Most glossary words are frequently occurring in a body of text. But you
might also have frequently occurring phrases that you would want to translate differently from the single words. The glossaries are then used
by translators and reviewers to check translations and to ensure consistency.

The glossary extractor tool would look at a number of source files and extract candidate words and phrases. The user would be able to set the
frequency levels (e.g. how many times a term must occur before we extract it), a list of stop words, the maximum phrase length, etc.

The user should be able to eliminate words, check context in the originating text, pull online definitions and link them to the glossary
entry or add their own clarification notes (this might be the role of a separate glossary editor).

The output would be a TBX file or other file that can be imported into an application to populate the translations of the terms.
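The extraction step could be sketched as simple phrase counting. Everything here is an assumption for illustration: the function name, the parameter names (`min_count`, `max_words`, `stopwords`) and the choice to drop phrases that start or end with a stop word. It works on plain strings and does not read translation files or write TBX.

```python
import re
from collections import Counter

def extract_candidates(texts, min_count=2, max_words=3, stopwords=frozenset()):
    """Count words and phrases of up to max_words words; keep the frequent ones."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"\w+", text.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                phrase = words[i:i + n]
                # Skip candidates that start or end with a stop word
                if phrase[0] in stopwords or phrase[-1] in stopwords:
                    continue
                counts[" ".join(phrase)] += 1
    return {term: count for term, count in counts.items() if count >= min_count}

texts = ["Open the file menu.", "The file menu is open.", "Close the file."]
print(extract_candidates(texts, min_count=2, max_words=2, stopwords={"the", "is"}))
```

Stop words are still allowed in the middle of a phrase, so a term like "save as file" would survive with "as" on the stop list.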

Think of this as a glossary guesser. Use statistical techniques to take an empty glossary file and, using your existing translations, try to guess what might be a translation for each glossary entry.

The simple case is where the term occurs on its own in translations. The harder case would be where the word occurs in a sentence or paragraph. Some initial work is already in place in poterminology. Obviously plain TM approaches can also get you started (look into pot2po).
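The simple case can be sketched in a few lines: collect the units whose whole source text is the term, and pick the translation seen most often. This is not how poterminology or pot2po work; the function name and the plain (source, target) pairs are assumptions for illustration, and the harder in-sentence case is not attempted.

```python
from collections import Counter, defaultdict

def guess_translations(terms, units):
    """Guess a translation for each term that occurs on its own as a source text."""
    seen = defaultdict(Counter)
    for source, target in units:
        seen[source.strip().lower()][target.strip()] += 1
    guesses = {}
    for term in terms:
        candidates = seen.get(term.lower())
        if candidates:
            # Take the translation that was used most often for this exact source
            guesses[term] = candidates.most_common(1)[0][0]
    return guesses

units = [("File", "Lêer"), ("file", "Lêer"), ("File", "Leer"),
         ("Open file", "Maak lêer oop")]
print(guess_translations(["file", "edit"], units))
```

Terms with no standalone occurrence (here "edit") are simply left out, so the user can see which entries still need attention.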

The toolkit provides a framework that allows you to define a storage format (e.g. Gettext PO, .properties, etc.)
and allows a converter to migrate translations between those and the base formats (XLIFF, PO). The following
are formats that would be useful to add to the Translate Toolkit. They are in no particular order, but
we have limited them to ones that we regard as most useful.

Gettext is the home of PO format. It would be good if the Gettext tools could also do XLIFF. These are areas that
need to be modified to allow full use of XLIFF. They are in the order of most important to least important.

msgfmt - Extend the PO compiler msgfmt to compile XLIFF files into MO files. As a first step, there is a Python compiler in the Translate Toolkit that could be extended.

xgettext & msgmerge - Allow us to extract and create XLIFF template files and merge them with existing files.

intltool - Although not pure Gettext, these tools and methods need to be adapted to allow GNOME projects to use the Gettext tools for XLIFF or PO.

align the new msgctxt facility in gettext-0.15 with the context specifiers in XLIFF.

Most Wikis, CMSs and general websites DO NOT do proper content negotiation. In this mini-project we are not concerned about the actual content but simply about the interface. It would be nice if, for instance, MediaWiki's interface defaulted to the user's preferred language when they view the site. Most of these systems allow people to specify their language when they sign in. But that is not enough.

This project would look at a few things, such as:

Create a library or function within PHP or PHP development frameworks that makes it easy to correctly set the language based on the user's preferred language setting, but also allows cookie-based, dropdown-selected languages to work in conjunction with it.

Implement this in some key Wikis: MediaWiki, DokuWiki, Tikiwiki, etc

Document clearly how to use this

Do similar things for frameworks written in other languages: Python, etc.
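For the Python side, the core of such a function might look like the sketch below. It assumes the browser's preference arrives in the HTTP `Accept-Language` header and that an explicit choice from a dropdown is stored in a cookie and should win; the function names, the `cookie_lang` parameter and the fallback from a regional tag such as `af-za` to `af` are illustrative choices, not an existing library's API.

```python
def parse_accept_language(header):
    """Return the language tags from an Accept-Language header, best first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not wanted"
        if quality > 0:
            prefs.append((tag, quality))
    # sorted() is stable, so tags with equal weight keep their header order
    return [tag for tag, _ in sorted(prefs, key=lambda p: p[1], reverse=True)]

def choose_language(available, accept_header="", cookie_lang=None, default="en"):
    """Pick the interface language: cookie first, then the browser preference."""
    supported = {code.lower() for code in available}
    if cookie_lang and cookie_lang.lower() in supported:
        return cookie_lang.lower()
    for tag in parse_accept_language(accept_header):
        if tag in supported:
            return tag
        base = tag.split("-")[0]  # fall back from "af-za" to "af"
        if base in supported:
            return base
    return default

print(choose_language(["en", "af", "pt-br"], "af-ZA,af;q=0.9,en;q=0.8"))
```

Letting the cookie override the header means a user on a shared or misconfigured browser can still switch language from the dropdown.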

I often get abbreviations in the text that I know might be written in full form elsewhere in the text, but I can't guess what the full form might be. So how about a tool that will search for possible full forms of abbreviations? Input VFS, and it searches for things like Virtual File Server, virtual file server, etc. Also add a stopword list so that short words inside full forms can be ignored.

This is a hack that will work for you now. It searches in the source (msgid or source tu), ignores case and searches for a structure of words that start with V then M then L. It wouldn't find XML - eXtensible Markup Language.
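The same idea, written as a small Python sketch rather than as a search command: build a regular expression with one word per letter of the abbreviation and allow stop words in between. The function names and the default stop word list are assumptions for illustration, and it searches a plain string rather than a translation file.

```python
import re

def full_form_pattern(abbrev, stopwords=("of", "the", "and", "for")):
    """Build a regex: one word per letter, with optional stop words in between."""
    filler = r"(?:(?:%s)\s+)*" % "|".join(map(re.escape, stopwords))
    words = [r"\b%s\w*" % re.escape(letter) for letter in abbrev]
    return re.compile((r"\s+" + filler).join(words), re.IGNORECASE)

def find_full_forms(abbrev, text):
    """Return every stretch of text that could be the full form of abbrev."""
    return [m.group(0) for m in full_form_pattern(abbrev).finditer(text)]

text = "The Virtual File Server and the virtual file system."
print(find_full_forms("VFS", text))
```

Like the hack above, it only looks at the first letter of each word, so it still misses forms such as eXtensible Markup Language for XML.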

I wish the Toolkit could export to a table in a wordprocessing document and reimport from a table in a wordprocessing document. This would make it possible for almost anyone to help translate in a Toolkit based project. The best table format is probably an OpenDocument table, so that MS Word users can use it without screwing it up too much. The table can have three columns (or more, but additional columns are ignored -- possibly to be used by the proofreader for notes to the translator, etc).

If ODT is too difficult, how about exporting to a three-column table in HTML? An HTML file can be opened in WYSIWYG mode in MS Word and OpenOffice.org, and although the saved HTML file will have horrible machine-generated code to give anyone in alt.html.critique a heart attack, it will still be a valid table which can simply be converted back to PO.

po2csv isn't really feasible because different programs have different CSV definitions. Excel and Calc interpret a CSV file in two different ways. So exporting the PO to CSV only works for tools that can correctly interpret the Toolkit's chosen dialect of CSV.

The underlying toolkit CSV module supports various flavours of CSV. It currently uses the Excel flavour. So it is possible to output for different spreadsheets if needed. It might be better to understand what exactly fails. I know we had to hack things to prevent the loss of leading single quotes, which in most word processors are interpreted as meaning 'treat this as text' --- Dwayne Bailey 2007/10/18 03:13

Well, Excel doesn't convert cleanly to a word processing format and back. Excel wasn't designed as a text editing tool anyway. The “normal” program to edit text in, surely, is a word processor. -- Samuel

It would be nice if one could do a pofilter check that takes a list of words from an input file and checks to see if they occur in the target text. This list of words would be a blacklist of terms that should not be used in the translation, no matter what. Useful for when a client decides to change his prescribed terminology and you want to do a bulk pogrep on your files while keeping the blacklist of words all in one place.

The blacklist should respect word boundaries. So if “klik” is on the blacklist, then “toeganklik” should not trigger it.
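The word boundary rule is easy to get wrong with a plain substring search, so here is a minimal sketch of the matching step. It is a standalone function, not pofilter's API; the name `blacklist_hits` and the case-insensitive matching are assumptions.

```python
import re

def blacklist_hits(target, blacklist):
    """Return the blacklisted words that occur as whole words in the target text."""
    hits = []
    for word in blacklist:
        # (?<!\w) and (?!\w) make sure the word is not part of a longer word
        pattern = r"(?<!\w)%s(?!\w)" % re.escape(word)
        if re.search(pattern, target, re.IGNORECASE):
            hits.append(word)
    return hits

print(blacklist_hits("Dit is toeganklik.", ["klik"]))
print(blacklist_hits("Klik hier om voort te gaan.", ["klik"]))
```

A unit would fail the check whenever this list is not empty for its target text.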

What I mean is that a pofilter check should take a bilingual list of words as an input file, and check to see if a term in the source text was translated using the right word in the target text. In other words, if the bilingual list contains:

computer = rekenaar

then pofilter will check which source texts contain 'computer' and then check if all of their corresponding target texts contain 'rekenaar'. Those that don't, fail the check.

The bilingual list check should not respect word boundaries (or: should do a fuzzy check), so that “rekenaar” would also match “rekenaars” and “berekenaar”.
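A minimal sketch of this check, again as standalone functions rather than a real pofilter test: `parse_termlist` reads lines in the `source = target` form shown above, and `terminology_failures` uses a case-insensitive substring match on both sides. Using a substring match for the source as well is an assumption; the text above only asks for it on the target side.

```python
def parse_termlist(lines):
    """Read 'source = target' lines into a dictionary."""
    terms = {}
    for line in lines:
        if "=" in line:
            source_term, target_term = line.split("=", 1)
            terms[source_term.strip()] = target_term.strip()
    return terms

def terminology_failures(units, terms):
    """Return (source, target, term) for units that miss the prescribed translation."""
    failures = []
    for source, target in units:
        for source_term, target_term in terms.items():
            # Substring match, so "rekenaar" also accepts "rekenaars" and "berekenaar"
            if (source_term.lower() in source.lower()
                    and target_term.lower() not in target.lower()):
                failures.append((source, target, source_term))
    return failures

terms = parse_termlist(["computer = rekenaar"])
units = [("Restart the computer", "Herbegin die rekenaar"),
         ("Two computers", "Twee rekenaars"),
         ("My computer", "My komper")]
print(terminology_failures(units, terms))
```

Only the last unit is reported, because its target text does not contain "rekenaar" in any form.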

Okay, this is a Pootle wish list item… perhaps it belongs elsewhere. Currently, you can do a search for a word in Pootle, and Pootle jumps to the next instance of that word. The advantage of this is context. The disadvantage is not seeing all the instances on one page. I suggest the following:

Let there be a tickbox option next to the search box in Pootle whereby a search result opens in a new browser window. The result can be a normal plaintext PO file that would be the normal result of a pogrep action (but a single page, not multiple pages), or… it can be a simple HTML file in which the search term is highlighted in each string.

The current search system assumes that the user might want to edit the strings that form the search result. The purpose of the proposed system would be to do quick searches on term usage, but not enable users to edit the strings there and then.

(Actually… well… it's a pity that the files in Pootle are nested in a directory tree, otherwise this search method could open multiple pages that advanced users can download, edit, and upload if they wanted.)