The Translate Toolkit is the technology platform for Pootle, Virtaal and many other localisation tools. It contains our core technologies for format support and natural language technology, along with many useful tools and converters powering teams for Mozilla, OpenOffice.org and many others.

Generic skill requirements:

Python. Experience in another OO language will help, but without Python experience the work might be harder for you.

Experience in computational linguistics is useful in some projects, but most have no specific language requirement.

Experience in localisation is helpful to understand the needs of a localiser.

Description: The Translate Toolkit has a terminology extraction
tool, poterminology, used to generate glossaries from translation
files (integrated in Pootle). The algorithm is mostly based on word
frequency (though it also does some clever stop word handling) and is
best suited to software localization; for instance, by default it
counts the number of occurrences of a phrase in the source code using
location comments in PO files.

The project would involve improving poterminology to use NLP
techniques, for example part-of-speech tagging to reject phrases that
are unlikely to be good terms, and better segmentation and stemming
before counting word/phrase frequency.
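To make the idea of stemming before counting concrete, here is a minimal sketch. It is not how poterminology works internally: the stop word list, the `crude_stem` helper and the `candidate_terms` function are all hypothetical stand-ins, and a real implementation would use a proper stemmer and part-of-speech tagger.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real tool would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for"}


def crude_stem(word):
    """Strip a couple of common English suffixes (stand-in for a real stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def candidate_terms(strings, min_count=2):
    """Count stemmed, non-stop-word tokens across the source strings."""
    counts = Counter()
    for text in strings:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                counts[crude_stem(token)] += 1
    # Keep only stems that occur often enough to be plausible terms.
    return {term: n for term, n in counts.items() if n >= min_count}


sources = ["Open the file", "Opening files", "Save the file", "Close all files"]
terms = candidate_terms(sources)
```

Without stemming, "file" and "files" (and "open" and "opening") would be counted as separate words and each would look less frequent than it really is.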

Description: The translate toolkit implements a set of translation
quality checks. These are used by the pofilter command and are integrated in Pootle.

One of the most difficult tasks when reviewing the work of multiple
translators is ensuring that they use a unified terminology list, which keeps translations consistent and avoids confusing users. This project would involve implementing a new quality check that detects whether the terms in a terminology file have been used consistently.

A simple keyword search would lead to too many false positives, so
stemming and other NLP preprocessing should be used to improve the test
results.

Optional Task: develop a different fuzzy matching algorithm for
languages that lack a usable stemmer.
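As a rough illustration of why stemming helps such a check, the sketch below compares stems rather than exact words. It does not use the pofilter checker API; the `stem`, `stems` and `inconsistent_terms` helpers and the glossary contents are assumptions made for the example, and only single-word terms are handled.

```python
import re


def stem(word):
    """Crude stand-in for a real stemmer: drop a trailing plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def stems(text):
    """Return the set of stems of all words in the text."""
    return {stem(token) for token in re.findall(r"\w+", text.lower())}


def inconsistent_terms(source, target, glossary):
    """Return glossary source terms that occur in `source` while their
    expected translation is missing from `target`."""
    src, tgt = stems(source), stems(target)
    return [
        s for s, t in glossary.items()
        if stem(s.lower()) in src and stem(t.lower()) not in tgt
    ]


glossary = {"file": "fichier", "folder": "dossier"}  # hypothetical term list
ok = inconsistent_terms("Delete all files", "Supprimer tous les fichiers", glossary)
bad = inconsistent_terms("Open the folder", "Ouvrir le répertoire", glossary)
```

Here `ok` is empty because "files" and "fichiers" reduce to the glossary stems, while `bad` reports "folder" because the expected translation was not used. A plain keyword search would have missed the plural forms.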

Description: The Translate Toolkit includes tmserver,
a lightweight translation memory web service. Virtaal uses it to
display suggestions for translations based on previous translations. tmserver relies on a sqlite3 database and makes use of full text indexing to further narrow down the relevant strings. It then picks the most appropriate matches by using Levenshtein distances.

This approach works well for very close matches, where the required match threshold is quite high. If the acceptable threshold is lowered, the quality of matches decreases dramatically.

This project would involve exploring multiple ideas for improving
matches, for example:

A different distance algorithm that gives less weight to case or punctuation changes.
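One possible shape for such an algorithm is a Levenshtein variant with per-character costs. The sketch below is only an illustration of the idea, not tmserver's matching code; the function names and the 0.2 weight are arbitrary assumptions.

```python
import string

CHEAP = 0.2  # assumed weight for case and punctuation edits


def sub_cost(a, b):
    """Substitution cost: cheap for case-only or punctuation-only changes."""
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return CHEAP
    if a in string.punctuation and b in string.punctuation:
        return CHEAP
    return 1.0


def indel_cost(ch):
    """Insertion/deletion cost: cheap for punctuation characters."""
    return CHEAP if ch in string.punctuation else 1.0


def weighted_distance(s, t):
    """Levenshtein distance using the weighted costs above."""
    prev = [0.0]
    for ch in t:
        prev.append(prev[-1] + indel_cost(ch))
    for a in s:
        cur = [prev[0] + indel_cost(a)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + indel_cost(a),       # delete a from s
                cur[j - 1] + indel_cost(b),    # insert b into s
                prev[j - 1] + sub_cost(a, b),  # substitute a with b
            ))
        prev = cur
    return prev[-1]
```

With these weights, "Save file" and "save file." are only about 0.4 apart (one case change and one added full stop), whereas plain Levenshtein would count two full edits.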

Description: Virtaal already provides spell checking by means of Gtkspell. Gtkspell is a bit limited in terms of support for Windows and OSX, and is not really extensible. We would like to provide richer functionality to users, for example spell checking only translatable text and ignoring accelerators.

Your task would be to implement a GUI for spell checking similar to Gtkspell with the same level of functionality as a start (using enchant, supporting the personal word list, providing suggestions on right click). Then we need to add support for ignoring accelerators, and to define regions to be spell checked.

A successful candidate will probably look into the API in the Translate Toolkit for dealing with placeables to ensure that only translatable text is passed to the spell checker.
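The sketch below shows the general idea of filtering text before it reaches the spell checker. It deliberately does not use the Toolkit's placeables API or enchant; the `PLACEABLE` pattern and the `checkable_words` helper are hypothetical and only cover a few simple cases.

```python
import re

# Hypothetical patterns for text that should not be spell checked:
# printf-style variables, XML/HTML tags and URLs.
PLACEABLE = re.compile(r"%\w|</?\w+[^>]*>|https?://\S+")


def checkable_words(text, accel_marker="&"):
    """Return the words worth spell checking, with accelerator markers
    removed and placeable-like spans skipped."""
    # Blank out placeables first so they cannot merge neighbouring words.
    masked = PLACEABLE.sub(" ", text)
    # Drop the accelerator marker when it directly precedes a word character.
    masked = re.sub(re.escape(accel_marker) + r"(?=\w)", "", masked)
    # Keep runs of letters only (no digits or underscores).
    return re.findall(r"[^\W\d_]+", masked)


words = checkable_words("Sa&ve <b>%s</b> to disk")
```

Here "Sa&ve" is checked as the single word "Save" instead of being flagged as two misspelled fragments, and the tag and variable are never passed to the checker.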

A magnificent success would be integration with the MS Office spell checker over COM (we have Python code to do that) and/or integration with the platform spell checker on OSX (Enchant has some initial support for this, but without build scripts).

Skills: Some Python, some Django. Experience with more than one SCM/VCS tool.

Description: Pootle can update translation files
and commit new translations using various version control systems
(SVN, CVS, Git, etc.), allowing translation coordinators to submit the
work without having to deal with the complexity of version control and
making it easier to delegate commit rights to translators without
worrying about them touching code or other files. However, version
control support has to be set up from the command line, which is a
difficult and obscure process.

The project will involve implementing a way to set up version control
directly from the web interface using VCS checkout URIs. This should
include both anonymous read-only checkouts and authenticated ones
where commits are allowed.

Description: Placeables are special parts of text that should be
copied unchanged when translating. The Translate Toolkit has support
for two kinds of placeables. Explicit placeables are encoded as such in the translation file (XLIFF placeables). Discovered placeables are things like XML tags, emails, URLs, numbers, filenames and variables: patterns of text that are parsed via regular expressions and are unlikely to change on translation.

Pootle lacks any support for placeables. Complex XML/HTML tags, URLs
and variables have to be typed manually, which is error-prone and might
require a keyboard layout switch (slowing down translators). Pootle
also offers no way of handling XLIFF placeables (we depend on XLIFF
placeables for translating ODF documents).

The project will involve using the Toolkit's placeables support to
highlight placeables in the source text (original text) and using
JavaScript to insert a placeable's text into the translation when it is clicked.
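The server-side half of this could look roughly like the following sketch, which finds discovered placeables with regular expressions and wraps them in markup that JavaScript could attach click handlers to. It is an assumption-laden illustration: the patterns, the `find_placeables` and `highlight` helpers and the `placeable` CSS class are invented for the example, and the Toolkit's own parsers are more thorough.

```python
import html
import re

# Illustrative patterns only, tried in this order at each position.
PATTERNS = re.compile(
    r"(?P<tag></?[a-zA-Z][^>]*>)"
    r"|(?P<url>https?://[^\s<]+)"
    r"|(?P<var>%\(\w+\)s|%[sd])"
    r"|(?P<num>\d+(?:\.\d+)?)"
)


def find_placeables(text):
    """Return (kind, start, end, value) for each placeable-like span."""
    return [
        (m.lastgroup, m.start(), m.end(), m.group())
        for m in PATTERNS.finditer(text)
    ]


def highlight(text):
    """Escape the text for HTML and wrap each placeable in a <span>."""
    out, pos = [], 0
    for kind, start, end, value in find_placeables(text):
        out.append(html.escape(text[pos:start]))
        out.append(
            '<span class="placeable %s">%s</span>' % (kind, html.escape(value))
        )
        pos = end
    out.append(html.escape(text[pos:]))
    return "".join(out)


found = find_placeables("Copy %d files to <b>%(dest)s</b>")
```

The `kind` of each match is kept so that different placeable types can be styled differently in the source text.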

XLIFF placeables cannot be inserted as-is in the text area (they need
to be interpreted as XML, not as inline text), and they tend to be ugly, so there is no need to display them in full. Instead they should be
displayed using some graphics and a textual token (like 1)
inserted in their place, and replaced with the proper tags on insert.

Optional Task: select and insert placeables using keyboard shortcuts.

Virtaal has very rich placeables support. Play with it to get a sense
of what Pootle needs.

Description: Pootle's automated translation
quality review is one of its most powerful features. Built on the
Translate Toolkit's filters, it allows translators to step through
strings that fail a number of quality checks.

The project will involve redesigning the UI and workflow of quality
checks review to introduce a number of improvements:

Display quality checks when viewing or editing translations, not just in review mode.

Some quality check failures are specific to certain parts of the text (missing xml/html tags for instance). These should be highlighted in the source (original) text and also in the target (translated) text when possible.

Automated quality checks are most of the time just a guess. A difference in punctuation between the source and target might be a mistake or a deliberate choice of the translator. Only a human reviewer can tell. But Pootle lacks a way of indicating false positives which makes it difficult to estimate progress in translation review.

Description: Translation Memory is one of the most popular
features of CAT tools, but at the moment Pootle's support
for TM is quite primitive. Compare this with Virtaal, which can get
suggestions for translations from a variety of local and remote
sources and presents them in an intuitive interactive widget.

The project would involve implementing at least one of these TM sources:

directly from Pootle's database

Translate Toolkit's TMServer

OpenTran.eu

Implement an interactive jQuery based widget for displaying TM
suggestions and inserting the selected suggestion.

The TM widget should order suggestions based on how similar the
original text is to the text currently being translated. For Virtaal
we use Levenshtein distance to measure match quality. Some quality
measure will have to be implemented in JavaScript.

If all three sources are implemented and there is time left, you can
implement Machine Translation support (Apertium, Google, etc.).

Description: Pootle has a popular Terminology feature where
translations for specific keywords are suggested based on either a
site-wide terminology glossary or a project-specific one, but it lacks
support for remote terminology.

The project would involve writing JavaScript to query one or more
remote terminology sources (OpenTran.eu is our favorite) and
interactively display results.

It would also involve a redesign of the current terminology suggestions
UI, since it cannot fit the large number of suggestions remote
glossaries tend to return.

Description: Pootle's translation form is large
and somewhat complex due to the many features it supports. As more
features are added (review some of the project ideas above) it might
get too cluttered.

One way to avoid the complexity is by implementing a richer text
editing widget in JavaScript (think TinyMCE) that is able to
incorporate many of the features through context menus, toolbars and
keyboard shortcuts.

Play with Virtaal's editing interface for inspiration on what a simple
but powerful translation widget looks like.

Description: Pootle versions prior to 2.0
supported user-defined translation goals, in which certain files could
be grouped as a single goal to break down translation work, and files
could be assigned to specific users for translation. This feature
was lost when Pootle was ported to Django and needs to be
reimplemented.

The project would involve:

Designing and implementing the database models/relations for specifying file-level goals and assignments.

Exploring the possibility of unit-level goals and assignments.

Implementing the UI/views for specifying goals, assigning work to users, tracking progress on goals and on users' assigned work, and collecting statistics about users' work.

Description: The classical software localization workflow assumes
that only one or two translators will work on a file or set of strings
and then maybe one or two more will review their work. But as Pootle
is increasingly being used as a social translation tool, a single file
or even a single unit might be translated and retranslated by many. It
is difficult for reviewers to keep track of activity with such large
teams.

Pootle collects useful statistics about user contributions, but these
statistics only measure quantity of work, not quality. It is difficult
with large groups to build a reputation as a skilled or dedicated
translator based on these stats only.

We can take inspiration from wikis here and offer full revision
history and more granular tracking of user contributions.

The project would involve:

Designing models for translation revisions

UI for displaying translation history with diffs. While we take inspiration from wikis we hope the interface can be much simpler than what a typical wiki offers.

Revert action?

Views for browsing user activity

Calculate a score (karma) for users based on quality of contributions (could be measured through quality checks, how many contributions end up being final revisions, how large the average diff is between their contribution and the final revision)
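As one possible starting point for the karma idea, the sketch below scores each contribution by how close it is to the final revision and subtracts a penalty for failed quality checks. Everything here is an assumption for illustration: the `Contribution` class, the 0.1 penalty and the use of `difflib` similarity are not part of Pootle.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Contribution:
    text: str               # what the user submitted
    final_text: str         # the revision that ended up being final
    failed_checks: int = 0  # number of quality checks the submission failed


def contribution_score(c):
    """Similarity to the final revision, minus a penalty per failed check."""
    similarity = SequenceMatcher(None, c.text, c.final_text).ratio()
    penalty = 0.1 * c.failed_checks  # arbitrary weight
    return max(0.0, similarity - penalty)


def karma(contributions):
    """Average score over all of a user's contributions (0.0 if none)."""
    if not contributions:
        return 0.0
    return sum(contribution_score(c) for c in contributions) / len(contributions)
```

A contribution that became the final revision unchanged scores 1.0, while one that was completely rewritten scores close to 0.0, so the average rewards translators whose work survives review.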

This would involve designing a RESTful API that consumes and produces
JSON for file level tasks. While full Virtaal integration would be
most welcome it is not required for this task to be completed,
simple example CLI tools are sufficient.

Browse available files on a remote Pootle server, displaying last updated and translation progress.

Download file from a remote Pootle server, upload back to server on save.

Description: Pootle has proved to be a simple system popular with many small language teams and for several projects. Online translation has unfortunately always been a bit slow due to network latencies, especially from countries with poorer Internet connectivity. The addition of some clever AJAX to some pages will help make Pootle feel much more interactive, and might even lessen the load on the server a bit.

A start to this project could be to provide Pootle statistics in JSON form so that AJAX code (and other clients) can obtain them from Pootle easily. Pages showing statistics could then test whether stats are available when the page is built. If not, the page would instead include an AJAX call to fetch them later, ensuring that the page is still sent to the user quickly.

The main part of the project will be to allow continuous editing on the translate page of Pootle with AJAX queries helping to keep the data available by sending submissions asynchronously and prefetching data necessary for translating the next units. A proper implementation will have to support all features of the translate page, including terminology and translation memory, translator and developer notes, suggestions, etc.

Description:
Segmentation is the process of taking a block of text and breaking it into segments, such as sentences. While initially this looks simple, you might find problems as soon as you start using non-trivial text. Abbreviations in English could confuse a simple method, for example.
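The abbreviation problem can be seen in a small sketch like the one below. This is not posegment's algorithm; the `ABBREVIATIONS` set and the `segment` function are hypothetical, English-only and intentionally naive.

```python
import re

# Illustrative, English-only list; real data would be per language.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "mr.", "dr."}


def segment(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace,
    unless the word ending there is a known abbreviation."""
    segments, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        end = match.start() + 1  # include the punctuation mark
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # not a sentence boundary
        segments.append(text[start:end])
        start = match.end()
    if start < len(text):
        segments.append(text[start:])
    return segments


parts = segment("Ask Dr. Smith. He knows, e.g. about PO files. Done")
```

A method that split on every full stop would break after "Dr." and "e.g.", producing fragments that would never match a translation memory entry.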

The main advantage of segmenting is that it allows us to use translation memory at a sentence level. Thus in a block of text you might have three sentences, one of which will match 100% while the others might match less and need to be reviewed. If you had not segmented, you would probably not have matched anything.

The Translate Toolkit already has a simple tool for sentence segmentation, called posegment. This will give you some idea of where to start to do the segmentation in different languages. For Virtaal, you would have to use this information to indicate the current segment in the current string and allow a user to interact with it (for example with Ctrl+down and TM lookup).

Your main tasks in this project will be to:

Provide a GUI to display the currently active segment

Enable some current string level actions to work on a segment level instead and define the user interaction for these cases (like copying source to target, TM lookup and reuse).

Allow the user to correct the segmentation where the automatic method went wrong by altering the bounds of the segment as detected by the automatic method.

Description: The XLIFF standard is an XML based standard for localisation. It can store various state information and can be adapted to manage a translation workflow. Furthermore XLIFF can contain suggestions in <alt-trans> tags that could be reviewed in an editor and removed as the unit is updated.

By workflow we mean the simple process that moves from untranslated → translated → reviewed → approved. There are also processes for updating existing translations. These can be more complex where the review is 'authoritative' (the reviewer can make changes) vs. 'non-authoritative' (these are simply suggestions to the translator who then decides if she wishes to fix them).

This work would involve defining the possible states for XLIFF and other storage formats and defining an API that will make it easier for our tools to access and manipulate these in useful workflows for translators. Whereas currently most of our tools enforce a fuzzy/not fuzzy way of thinking about units, we should now have a list of states that are applicable to the format being used.
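One way to think about format-specific state lists is sketched below. The `State` names follow the workflow described above, but the mapping in `FORMAT_STATES` is a simplified assumption (real XLIFF state values are richer), and `nearest_state` is a hypothetical helper, not an existing Toolkit API.

```python
from enum import IntEnum


class State(IntEnum):
    UNTRANSLATED = 0
    FUZZY = 1        # needs work
    TRANSLATED = 2
    REVIEWED = 3
    APPROVED = 4


# Assumed, simplified view of which states each format can store.
FORMAT_STATES = {
    "po": [State.UNTRANSLATED, State.FUZZY, State.TRANSLATED],
    "xliff": list(State),
}


def nearest_state(fmt, state):
    """Map a state onto the closest one the format can store,
    never promoting a unit beyond its actual state."""
    supported = [s for s in FORMAT_STATES[fmt] if s <= state]
    return max(supported)
```

With this mapping a unit approved in an XLIFF workflow would simply be stored as translated in a PO file, which is the kind of decision the API would need to make explicit.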

This is not a workflow engine. Our goal is not to make a workflow editor, but to create a set of standard workflows that meet the needs of current translations and expose the inherent workflow of the file format.

Your main aim is to stay focused on the basics of unit states for the major formats (PO, XLIFF, TS) and deliver a solution that allows basic interaction with it in Virtaal and/or Pootle.

Furthermore you can look at helping users of Virtaal and/or Pootle to deal with suggestions in <alt-trans> tags by cleaning them, or removing them as they are used.

Description:
While working with Pootle at OLPC, we have come across a number of feature requests, most (if not all) of which can be implemented within the GSoC timeframe. Some of the highest-priority ones are:

Support for validation of translated strings on submission (equivalent of msgfmt --check, but for individual strings).

Ability for language administrators to get in touch with members of the translation team.
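For the first request, a per-string check might look something like the sketch below. It only approximates two of the things msgfmt --check looks at (format specifiers and trailing newlines); the `PRINTF` pattern and the `validate` function are assumptions for illustration, not Pootle or gettext code.

```python
import re

# Simplified printf-style specifier: optional position, flags, width,
# precision and a conversion letter. The space flag is left out on purpose
# so that text like "100% sure" is not mistaken for a specifier.
PRINTF = re.compile(r"%(?:\d+\$)?[-+#0]*\d*(?:\.\d+)?[a-zA-Z]")


def specifiers(text):
    """Return the sorted format specifiers in text, ignoring literal '%%'."""
    return sorted(PRINTF.findall(text.replace("%%", "")))


def validate(source, target):
    """Return a list of problems found in a single translated string."""
    errors = []
    if specifiers(source) != specifiers(target):
        errors.append("format specifiers differ")
    if source.endswith("\n") != target.endswith("\n"):
        errors.append("trailing newline differs")
    return errors
```

Running such a check at submission time would let Pootle reject or flag a string immediately, rather than the problem surfacing later when the whole file is compiled.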

This isn't exactly a “code” project as such, but I think this would be important to do this year. It was very nice to see the machine translation support added for Kreyòl in Virtaal, but the language is not even added to pootle.locamotion.org, as it hasn't ever been considered as a localization target. Machine translation has many limitations - and especially if translations are being done by non-native speakers, it is important to bring in native speakers (who may not have much command of French, let alone English) to improve and correct the inevitable mistakes. Adding the language to the Pootle instance is easily done - more ambitious is the necessary outreach to find native speakers who can perform the localization. That has always been the hard part, but if there is some money available from Google for a student to do this work, that might make the difference in jump-starting this project.

Description:
Pootle still doesn't have support for transcribing online video and audio such as that found on YouTube, Vimeo, etc. Currently there are some proprietary online apps that can do that (dotsub.com, subtitle-horse.org, etc.), but nothing open source yet.

A start to this project would be to provide the core support in Pootle to create a transcription from a video streaming URL. The transcription page view should have an embedded Flash player (the JW FLV Player with Copyleft license is recommended) so the user can adjust the timing of the subtitles.

In addition, Pootle should distinguish between transcribers, proofreaders and translators, and have different permissions for each. The interface should support playing the audio/video from the time position where a given phrase occurs.

Extra features:

Support for configuring a speech recognition tool or API to automate an initial machine transcription.

Support for configuring a multilingual speech synthesis tool (e.g. Festival) to provide audio of the translated subtitles.

Modifying the source code of the JW player to provide fast access to the subtitle list for each video (e.g. SubPly player).