The following are project ideas for Google Summer of Code 2008. The project has applied to be accepted as a mentoring organisation, and these are the ideas that we have gathered.

Our main aim has been to identify work that can be completed in 3 months, that is useful to us as a project, and that is also challenging and interesting.

Our software includes:

Translate Toolkit - a set of tools and libraries that transforms files into translatable formats and allows those formats to be managed and manipulated. Our other tools use it as a library for the same management and manipulation tasks.

Description: The XLIFF standard is an XML based standard for localisation. It can store various state information and can be adapted to manage a translation workflow.

By workflow we mean the simple process that moves from untranslated → translated → reviewed → approved. There are also processes for updating existing translations. These can be more complex: a review may be 'authoritative', meaning the reviewer can make changes directly, or 'non-authoritative', meaning the reviewer's changes are simply suggestions to the translator, who then decides whether she wishes to apply them.

This work would involve defining levels of workflow, finding a suitable toolkit or implementing the workflow classes, implementing the workflow itself, and creating new tools or adapting existing ones to manage it.
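The linear workflow and the two review styles described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the `Unit` class, `WorkflowError` and the `review` function are hypothetical names, not existing Translate Toolkit classes, and the transition table covers only the four states named in the text.

```python
# Allowed transitions for the simple linear workflow named in the text.
TRANSITIONS = {
    "untranslated": {"translated"},
    "translated": {"reviewed"},
    "reviewed": {"approved"},
    "approved": set(),
}


class WorkflowError(Exception):
    """Raised when a unit is asked to make a transition the workflow forbids."""


class Unit:
    """Hypothetical stand-in for a translation unit (not a toolkit class)."""

    def __init__(self, source, target=""):
        self.source = source
        self.target = target
        self.state = "untranslated"

    def move_to(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise WorkflowError(f"{self.state} -> {new_state} is not allowed")
        self.state = new_state


def review(unit, suggestion, authoritative):
    """Apply a reviewer's suggestion to a translated unit.

    An authoritative reviewer changes the target and moves the unit on.
    A non-authoritative reviewer leaves the unit untouched and the
    suggestion is returned so it can be passed back to the translator.
    """
    if unit.state != "translated":
        raise WorkflowError("only translated units can be reviewed")
    if authoritative:
        unit.target = suggestion
        unit.move_to("reviewed")
        return None
    return suggestion
```

Keeping the transitions in a plain table means a new standard workflow is a new table rather than a new engine, which fits the aim of shipping fixed workflows instead of a workflow editor.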

What this is not is a workflow engine. Our goal is not to make a workflow editor, no matter how much people beg, but to create a set of standard workflows that meet the needs of current translations. If you do this well it will be possible for anyone to adapt this in the code and in time we should cover all conceivable workflows that could be created without burdening this with an editing tool.

Your main aim is to stay focused on the basics of the workflow and deliver a solution that implements workflow.

You might want to consider how notification (email, Jabber, RSS) forms part of the workflow, but this is not a crucial component.

Poke the code:
Not much code to poke I'm afraid.
phase - is useful to understand some tools used to manage process

Description:
Segmentation is the process of taking a block of text and breaking it into sentence segments. While this looks simple at first, problems arise when the text contains abbreviations such as "e.g.", whose full stops do not end a sentence. You then have to build rules and lists of words that must not be segmented.

The main advantage of segmenting is that it allows us to match existing translations at a sentence level. Thus in a block of text you might have 3 sentences, 1 of which will match 100% while the others match less well and need to be reviewed. If you had not segmented you would probably not have matched anything.
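A minimal sketch of the abbreviation problem, assuming a hand-written exception list. The `NO_BREAK` list and the `segment` function are illustrative names only; this is neither the toolkit's existing segmenter nor ICU or SRX rules.

```python
import re

# Abbreviations that end in a full stop but do not end a sentence.
# This short list is illustrative; real rule sets are much longer.
NO_BREAK = ("e.g.", "i.e.", "Mr.", "Dr.")

# Candidate break: sentence-final punctuation followed by whitespace.
BREAK = re.compile(r"[.!?]\s+")


def segment(text):
    """Split text into sentences, skipping breaks after known abbreviations."""
    segments = []
    start = 0
    for match in BREAK.finditer(text):
        end = match.start() + 1  # include the punctuation mark itself
        candidate = text[start:end]
        if candidate.endswith(NO_BREAK):
            continue  # the full stop belongs to an abbreviation
        segments.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments
```

A list like this never handles an abbreviation that really does end a sentence, which is one reason to prefer established rule sets over ad hoc lists.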

Another advantage of segmentation is that it allows us to recover old translations. If someone translated some text but didn't keep the Translation Memory (the 1-to-1 map of translations) then we can use segmentation to break both source and target text into segments and try to align the texts.

Your main tasks in this project will be to:

Integrate ICU into the toolkit to allow us to use their segmentation rules (or find some similar established segmentation software, or expand the existing segmentation software in the toolkit)

Implement the SRX standard that allows segmentation rules to be specified in XML.

Create an alignment tool that will allow two pieces of text to be segmented and aligned. The tool should allow people to merge items that the tool segmented, move them up or down, and generally edit the text. The output will be in TMX format so that the text can be used as a Translation Memory.

An interesting aspect which you might want to include is some idea of automatic alignment that will try to guess which pieces of text should belong together.
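As a rough sketch of the last step of such a tool, the following pairs already-segmented text by position and writes the pairs as TMX using only the standard library. The function names and the `creationtool` value are hypothetical, and positional pairing is the simplest possible stand-in for real automatic alignment, which would need to handle merged and split sentences.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pair_segments(source_segments, target_segments):
    """Pair segments by position; refuse to guess when the counts differ."""
    if len(source_segments) != len(target_segments):
        raise ValueError("segment counts differ; manual merging is needed")
    return list(zip(source_segments, target_segments))


def to_tmx(pairs, srclang, tgtlang):
    """Serialise (source, target) pairs as a TMX 1.4 document string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "alignment-sketch",  # hypothetical tool name
        "creationtoolversion": "0",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": srclang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((srclang, source), (tgtlang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```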

Description:
In translation it is important to have glossaries as these guide existing and new translators to use the correct words. Glossaries are like dictionaries but usually very focused on a specific domain and don't need all the detail you would find in a traditional dictionary. Usually they contain only the Source and Target words. They might contain a definition or an indication of the part of speech.

The TBX format is a format for TermBase Exchange, i.e. to allow glossaries to be exchanged. The Translate Toolkit currently has very basic support for this format. Full or better support would allow much more detail and important information to be stored and shared.

Glossaries are immediately useful but they would be more useful if your translation tool was able to warn you when a word that should have been used has not been used. In order to do this it needs a stemmer so that it can find the root of the English text before looking up your words and phrases in the glossary.
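A small sketch of such a terminology check follows. The stemmer here is a deliberately crude suffix stripper used as a placeholder, not the Snowball algorithm, and `stem`, `missing_terms` and the glossary layout (a plain dictionary of source term to expected translation) are assumptions made for the example.

```python
import re

# Crude English suffix list; a placeholder for a real stemmer such as Snowball.
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip one common suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def missing_terms(source, target, glossary):
    """Return (term, translation) pairs where the term occurs in the
    source text but the expected translation is absent from the target."""
    source_stems = {stem(word) for word in re.findall(r"\w+", source)}
    target_lower = target.lower()
    problems = []
    for term, translation in glossary.items():
        if stem(term) in source_stems and translation.lower() not in target_lower:
            problems.append((term, translation))
    return problems
```

The target side is checked with a plain substring test, so inflected target-language terms can be missed; only the English source is stemmed, as in the description above.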

For this project a student would have to:

Develop a glossary extraction tool

Implement the majority of the TBX specification

Implement a stemmer e.g. snowball stemmer

Using the stemmer create a terminology matching checker

A basic terminology editor

These should be integrated with or extend existing tools: Translate Toolkit, Pootle or our offline editor.

For those who want a greater challenge, you should add a glossary populator. This would take a blank glossary and try to populate it based on existing translations. The results would be checked by a human, but the population stage should save quite a bit of time.

Description:
One of the primary roles of the Translate Toolkit is to convert formats that you want to translate into translatable formats. Thus we can convert MediaWiki text into Gettext PO. We do this so that translators can use one set of tools instead of having to learn a new tool for every new format. Coders use one tool, and we think localisers deserve the same respect.

The toolkit already supports a large number of formats, but most of these are focused on localisation, not content translation.

Your primary work would be to flesh out the current format support to allow many more formats to be supported. This is done by implementing the format on our base classes and then creating a conversion tool that converts the format to a translatable format.

The list of formats that we would want supported are the following:

PDF - so that the almost 300 million PDF documents might become translatable

RC files - these are used by WINE and ReactOS, we'd like to localise those applications. It would also make many Windows applications localisable.

TTX - A proprietary format used by the proprietary Trados translation tools. Supporting this format would 1) allow tools using the Translate Toolkit to be used as commercial localisation tools, and 2) allow translators using Trados to translate free software, as we'll be able to convert to their format.

Qt .ts (expand to support v1.1) and Qt .qm (allow correct compilation of these formats) - this would allow all Qt related software to be translated and resources to be compiled.

Others - the formats page lists many others that we might want to support

Implementing the properties, PO and HTML formats to convert to XLIFF according to the XLIFF representation guides would help push this to a hard project. Defining representation guides for the more important formats listed here or currently handled by the toolkit would also turn this into a hard project.

Description:
Implement OpenID authentication on Pootle (Pootle is a web-based translation tool built on the toolkit). This will allow authentication against any Pootle or OpenID server allowing people to login easily. The vision of Pootle was never to be centralised but to allow various Pootle servers to be created for various needs, thus many people would need to access multiple Pootle servers. Using OpenID they can suddenly access all of these servers.

OpenID also allows the exchange of various pieces of personal data including email, name, etc. These should also be shared so that the user only needs to maintain one set of personal details.

All of this would be implemented in the Translate Toolkit so that it can be shared by Pootle and by other users of the toolkit.

Pootle currently does not implement a very good system for tracking progress or individual translation work. Using SIMILE Timeplot would allow people to see very interesting aspects to the translation progress.

Pootle also does not do a very good job of notifying people about changes in state that need their attention: new files for translation, a newly registered user, a suggested translation that has been provided. These notifications need to be either emails or Jabber messages that a Pootle user can click on and respond to.

Description:
Any previous translation is termed your translation memory (TM). Translators use/leverage these to give fuzzy matches. This ensures that they save time and also that they consistently translate across their translation tasks.
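To make "fuzzy match" concrete, here is a minimal sketch that uses the standard library's `difflib` as a stand-in for a real TM matcher. The `fuzzy_matches` function, the dictionary-based memory and the 0.7 cutoff are assumptions for illustration, not how the toolkit scores matches.

```python
import difflib


def fuzzy_matches(source, memory, cutoff=0.7, limit=3):
    """Return (stored source, stored target, score %) for the closest entries.

    `memory` maps previously translated source strings to their targets.
    """
    candidates = difflib.get_close_matches(source, list(memory), n=limit, cutoff=cutoff)
    results = []
    for candidate in candidates:
        ratio = difflib.SequenceMatcher(None, source, candidate).ratio()
        results.append((candidate, memory[candidate], round(ratio * 100)))
    return results
```

An identical string scores 100, and anything below the cutoff is not offered to the translator at all.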

Placeables are pieces of 'text' that can be replaced in a TM without really altering the context of the text. A placeable would be any of the following:

Accelerator keys: e.g. & in KDE, _ in Gnome, ~ in OpenOffice.org

Variables: %s, $1, $var$, etc

Tags: <b>, <tag>, etc

Numbers: 1,000.00

When using a TM from Gnome on KDE you want it to be able to recognise and alter the accelerator key, thus ensuring a 100% match. For variables and tags you want the matcher to be able to discard and alter these intelligently. For numbers you want to be able to alter a number, and even the formatting of a number, as needed.
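The placeable types listed above can be approximated with regular expressions. The sketch below is an assumption-laden illustration: the `PLACEABLE` pattern covers only the example forms given in the list, and `find_placeables` and `convert_accelerator` are hypothetical helper names rather than toolkit functions.

```python
import re

PLACEABLE = re.compile(
    r"%\d*\$?[sd]"        # printf-style variables such as %s or %1$s
    r"|\$\w+\$"           # variables such as $var$
    r"|\$\d+"             # positional variables such as $1
    r"|</?\w+[^>]*>"      # tags such as <b> or </b>
    r"|\d[\d,.]*\d|\d"    # numbers such as 1,000.00
)


def find_placeables(text):
    """Return the placeables found in text, in order of appearance."""
    return PLACEABLE.findall(text)


def convert_accelerator(text, old, new):
    """Swap the first accelerator marker, e.g. Gnome '_' for KDE '&'."""
    pattern = re.escape(old) + r"(?=\w)"  # marker must precede a letter
    return re.sub(pattern, lambda match: new, text, count=1)
```

With a helper like `convert_accelerator`, a Gnome entry such as `_File` can be rewritten to `&File` before matching, so the KDE string still scores as a 100% match.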

Your work in this project would see you doing the following:

Creating a placeables framework such that any format can define these placeables

Getting features upstreamed to, for instance, Gettext to make your work easier.

Building an XML-RPC TM server that would store TMs, allow people to submit TMs, allow people to query them, and do transformations on the fly

Make alterations to our TMX support as needed.

Implement full identification of placeable variables for Gettext PO and adapt po2xliff to include placeables in the resultant XLIFF file.

Poke the code:

pofilter - provides a method to specify and detect variables in some formats

Description: While creating a diffing tool might seem rather redundant, and in fact easy, this project involves a few more things.

Normal diffing tools are in most cases useless for checking changes in translated files. Very often the context diff is cut off so that you can't see the full context. Slight changes in layout are shown as diffs even though the actual Source or Target text remains unchanged. Changes in the header are marked as differences when they are not really that important. All of these issues lead to noisy diffs that mask the actual content that should be examined.

Another useful area to see diffs is the new Gettext feature that allows previous messages to be stored. In this case as you have the current and previous translation you would be able to see a diff of these two. The same applies to alt-trans items in XLIFF. You can compare various fuzzy matches to see exactly how the source text of the suggested fuzzy match differs from your source text.

In Pootle we use the Python diff module quite effectively to show differences between suggested translations and the current translation. With character level diffs and the correct way to represent them these become very powerful.
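A character-level diff of two translations can be sketched with `difflib.SequenceMatcher` from the standard library. The `[-...-]` and `{+...+}` markers are one arbitrary way to represent the changes, and `char_diff` is an illustrative name, not Pootle's own code.

```python
import difflib


def char_diff(old, new):
    """Mark character-level changes as [-removed-] and {+added+}."""
    pieces = []
    matcher = difflib.SequenceMatcher(None, old, new)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pieces.append(old[i1:i2])
            continue
        if op in ("delete", "replace"):
            pieces.append("[-" + old[i1:i2] + "-]")
        if op in ("insert", "replace"):
            pieces.append("{+" + new[j1:j2] + "+}")
    return "".join(pieces)
```

Because only the source or target strings are compared, layout changes and header noise in the file never reach the diff.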

Description: The Translate Toolkit has two parsers for PO files. The first, which we call pypo, is written in Python. The second, which we call cpo, uses libgettextpo, a C library from the Gettext package.

The majority of the work to get cpo working is in place in the Translate Toolkit, but the hard parts are not yet complete. Currently you can run almost all of the Translate Toolkit commands and they will work. However, we are not releasing memory correctly, so we cannot use cpo in Pootle.

There are also a number of features within cpo that are not yet implemented in pypo. These include previous messages and some new header functionality.

Your task would be to complete the porting to cpo. You will need to manage the correct releasing of memory within Python. We will test your implementation on the Pootle translation server. We believe that this will both reduce memory usage and improve speed. If successful, the OpenOffice.org Pootle server will be your first grateful user.

Your other tasks in terms of fleshing out the PO coverage will include implementing each feature in both cpo and pypo and ensuring that pot2po works correctly with the new features. You will also need to examine po2xliff and implement the conversion of msgctxt and previous messages to XLIFF.

Description: Pootle is a file based online translation tool. We don't use a database backend. This has many benefits but some performance disadvantages. The primary problem is that we can only run a single instance of Pootle and cannot rely on Apache with mod_python to allow the server to run several instances of Pootle.

The main issue behind this problem is that we do not fully implement locking of files within the server. Thus your first task will be to implement locking of files for reading and writing. We expect some performance impact, so you might want to address that after implementing the locking.

Once locking is in place we can move on to working with Pootle on mod_python. Pootle uses a web toolkit called jToolkit that can already work with mod_python. Your task will be to set up and document how to make Pootle work with jToolkit on mod_python. Any bugs that occur and need correcting will be yours.
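One possible shape for read and write locking is sketched below using advisory locks from the standard `fcntl` module. This is an assumption about approach, not a description of Pootle's code: `locked` is a hypothetical helper, `fcntl` is only available on Unix-like systems, and advisory locks only protect against other processes that also take the lock.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def locked(path, write=False):
    """Open path under a shared (read) or exclusive (write) advisory lock."""
    # "a+" creates the file if needed without truncating it before the
    # lock is held; truncation happens only once we own the lock.
    handle = open(path, "a+" if write else "r")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX if write else fcntl.LOCK_SH)
        if write:
            handle.seek(0)
            handle.truncate()
        yield handle
    finally:
        fcntl.flock(handle, fcntl.LOCK_UN)
        handle.close()
```

Several readers can hold the shared lock at once, while a writer waits until it has the file to itself. If a helper like this lived in the toolkit rather than in Pootle, the command line tools would pick up the same locking.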

Lastly you will need to check and test that Translate Toolkit command line tools can be used on the Pootle installation. In other words these tools must also understand locking. You could easily address this by putting the locking logic in the toolkit itself which probably makes the most sense.

While not essential you may also want to ensure that Pootle can reload its configuration while running. Currently changes while Pootle is running are lost.

At the end of this you will have greatly improved the usefulness of Pootle for large installations, as you will have removed the potential for blocking in long running tasks. You will also have made it possible to safely use all the command line tools on a running instance of Pootle.

Description: Pootle makes use of the jToolkit web framework. This has served us well, but jToolkit is no longer maintained and we feel that we would be better served by moving to a newer, better maintained framework.

There has been some work porting Pootle to the Django framework, which you may wish to continue. We are open to discussion about the target framework, although we are currently quite interested in TurboGears.

Your first task will be to put Pootle on a diet. This will mean moving as much functionality as possible out of jToolkit and out of Pootle and into the Translate Toolkit. This will ensure that good functionality that we will wish to reuse in offline tools is preserved. It also ensures that Pootle becomes smaller and hopefully easier to migrate.

Next will be the task of migration. We would like whoever takes on this role to approach this as an iterative process. We don't want to rewrite Pootle; we want you to migrate it to the new platform. Once that is done, features can be improved and added, and performance reexamined.

Poke the code:

the current (incomplete) migration of Pootle in the Subversion branch django-migration

Description:
While working with Pootle at OLPC, we have come across a number of feature requests, most (if not all) of which can be implemented within the GSoC timeframe. Some of the highest priority ones are:

Ability for the translation admin to merge the translations with the latest POT for her project.

Ability for the Pootle admin to easily set permissions for languages on a global basis (i.e. give user foo admin rights for all Spanish translation projects)

Support for validation of translated strings on submission (equivalent of msgfmt --check, but for individual strings)

Integration with http://open-tran.eu/ (use the XMLRPC interface of the site to generate suggestions)

Ability for language administrator to get in touch with members of the translation team

Ability for translators to use an intermediate language (e.g. an Aymara translator can use Spanish to understand the English msgids better) while translating (partially implemented, needs some more work).
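As an illustration of the per-string validation item in the list above, the sketch below performs one check of the kind msgfmt applies to format strings: the printf-style directives in the translation should match those in the source. It is an assumption-based example, not a reimplementation of msgfmt --check; the `DIRECTIVE` pattern handles only a few common conversions and `check_format` is a hypothetical name.

```python
import re
from collections import Counter

# A small subset of printf-style directives, plus the literal "%%".
DIRECTIVE = re.compile(r"%(?:\d+\$)?[-+#0]*\d*(?:\.\d+)?[sdif]|%%")


def _directives(text):
    """Count the directives in text, ignoring literal percent signs."""
    return Counter(d for d in DIRECTIVE.findall(text) if d != "%%")


def check_format(msgid, msgstr):
    """Return a list of problems; an empty list means the string passed."""
    if not msgstr:
        return []  # an untranslated string is not an error
    expected = _directives(msgid)
    found = _directives(msgstr)
    problems = []
    for directive in sorted((expected - found).elements()):
        problems.append(f"missing {directive}")
    for directive in sorted((found - expected).elements()):
        problems.append(f"unexpected {directive}")
    return problems
```

Running a check like this at submission time lets the translator fix the string immediately instead of finding out when the whole file fails to compile.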