OmegaT is a free translation memory application written in Java. It is a tool intended for professional translators. It does not translate for you! (Software that does this is called “machine translation”, and you will have to look elsewhere for it.)

amaGama is a web service written in Python implementing a large-scale translation memory on top of PostgreSQL. A translation memory is a database of previous translations which can be searched to find good matches to new strings.
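The idea of searching a translation memory for good matches can be illustrated with a minimal sketch. It is not amaGama's implementation (which runs on PostgreSQL); it only ranks stored sources by similarity to a new string using the standard library's `difflib`. The function name `best_matches`, the threshold, and the sample entries are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_matches(memory, query, threshold=0.7, limit=3):
    """Return up to `limit` (ratio, source, target) tuples, best first.

    `memory` is a list of (source, target) pairs. The similarity measure
    is difflib's ratio, chosen only because it ships with Python.
    """
    scored = []
    for source, target in memory:
        ratio = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if ratio >= threshold:
            scored.append((ratio, source, target))
    # Highest similarity first; an exact match scores 1.0.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:limit]


# Illustrative, made-up memory entries (English -> Catalan).
MEMORY = [
    ("Open the file", "Obre el fitxer"),
    ("Close the window", "Tanca la finestra"),
    ("Save the file", "Desa el fitxer"),
]

if __name__ == "__main__":
    for ratio, source, target in best_matches(MEMORY, "Open a file", threshold=0.5):
        print("{:.2f}  {}  ->  {}".format(ratio, source, target))
```

A real service would index the sources instead of scanning them all, but the ranking-by-similarity principle is the same.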

There are currently no releases of amaGama, but the source code is available in the https://github.com/translate/amagama repository.

A public deployment of amaGama is available at http://amagama.locamotion.org. Check the documentation to learn how to use it.
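As a rough sketch of how a client might address such a deployment, the snippet below only builds a query URL and performs no network request. The `/tmserver/<source>/<target>/unit/<text>` path layout is an assumption here and should be checked against the amaGama documentation; `amagama_unit_url` is a hypothetical helper name.

```python
from urllib.parse import quote


def amagama_unit_url(base_url, source_lang, target_lang, text):
    """Build a lookup URL for `text` (path layout assumed, see lead-in)."""
    return "{}/tmserver/{}/{}/unit/{}".format(
        base_url.rstrip("/"),
        quote(source_lang, safe=""),
        quote(target_lang, safe=""),
        quote(text, safe=""),  # percent-encode spaces, slashes, etc.
    )


if __name__ == "__main__":
    print(amagama_unit_url("http://amagama.locamotion.org", "en", "ca", "Open file"))
```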

This is the toolset used at Softcatalà to build translation memories for all the projects we know of that have been translated into Catalan and make their translations openly available.

The toolset contains the following components with their own responsibility:

Builder (fetch and build memories)

Download and unpack the files from source repositories
Convert from the different translation formats (ts, strings, etc.) to PO
Create a translation memory for each project in PO and TMX formats
Produce a single translation memory file that contains all the projects
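The last Builder step, merging every project into one memory file, might look roughly like the sketch below. It is not the Softcatalà code: `build_tmx`, the header values, the `x-project` property, and the sample data are all assumptions, and a real builder would read the pairs from the converted PO files.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(projects, source_lang="en", target_lang="ca"):
    """Merge {project_name: [(source, target), ...]} into one TMX string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-builder",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "po",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for project_name, pairs in projects.items():
        for source, target in pairs:
            tu = ET.SubElement(body, "tu")
            # Record which project the unit came from.
            ET.SubElement(tu, "prop", type="x-project").text = project_name
            for lang, text in ((source_lang, source), (target_lang, target)):
                tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
                ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx({
        "project-a": [("Open", "Obre")],
        "project-b": [("Save", "Desa")],
    }))
```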
Web

Provides a web application and an API that allow users to download memories and search translations
Provides an index-creator that creates a Whoosh index with all the strings, which the user can then search using the web app
Provides a download-creation that creates a zip file with all the memories that the user can download

Terminology (terminology extraction)

Analyzes the PO files and creates a report with the most common terminology across the projects

Quality (feedback on how to improve translations)

TMop is open-source software written in Python, designed for cleaning and maintaining a Translation Memory (i.e. a collection of (source, target) segments, called Translation Units, used to aid human translators working in a Computer-assisted Translation framework).

The goal of TMop is to identify and remove from the TM all the “bad” TUs, in which any of the two textual elements is either:

Heartsome Translation Studio 8.0 is the latest version of Heartsome’s CAT software series. This version features many revolutionary improvements compared to previous versions, especially with regard to ease of use and file format support.

Heartsome Translation Studio 8.0 has incorporated feedback based on practical experience from project managers, translators and proofreaders in the localization industry, which has resulted in a wealth of improvements and innovations, including:

As it is now quite easy to make lots of large TMXes with Bitextor and other tools, it would be good to have some tools for processing them. There are various things that it would be useful to do. For example, given a TMX file:

extract the translation units for any given pair of languages (tmx-extract). Bitextor and other tools generate TMX files with many possible combinations of languages; for a file of “no is en da sv”, just give me all TUs which are “en-da”.
strip out duplicate translation units (tmx-uniq).
sort the file by: line length, language, etc. (tmx-sort)
trim the file of dubious TUs (tmx-trim): very short translations of long segments, very different punctuation, translations that are exactly the same as the reference, translations which consist only of numbers, etc. It would also be nice to have an option to supply an MT of the target language to get a better edit-distance comparison.
re-perform language identification (tmx-rident) of all segments given a number of options (e.g. you know the file is in either Swedish or Danish, but some entries come up as Icelandic).
re-format a TMX so that it fits the standard (tmx-clean), e.g. turns ‘&’ into ‘&amp;’ etc. and optionally removes formatting.[1]
merge TMX files (tmx-merge), merge TMX files and uniq them on the way.
split a TMX file with many different languages (tmx-split) into tmx files with each of the different language pairs, optionally while re-identifying the language of each segment before placing it in a separate file.
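A few of the wished-for tools above can be sketched in Python with the standard library. This is only an illustration of the extract, uniq, and trim ideas, not an existing tmx-* tool: the function names, the length-ratio threshold, and the sample data in the test are assumptions, and the whole file is parsed into memory, which would not suit very large TMXes.

```python
import xml.etree.ElementTree as ET

# The parser exposes xml:lang under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_units(tmx_text):
    """Return one {language: segment text} dict per <tu> element."""
    units = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            unit[tuv.get(XML_LANG)] = text
        units.append(unit)
    return units


def extract_pair(units, lang_a, lang_b):
    """tmx-extract idea: keep only units that contain both languages."""
    return [(u[lang_a], u[lang_b]) for u in units if lang_a in u and lang_b in u]


def uniq(pairs):
    """tmx-uniq idea: drop exact duplicate pairs, keeping first occurrences."""
    seen, result = set(), []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


def trim(pairs, max_ratio=3.0):
    """tmx-trim idea: drop identical, numbers-only, or badly unbalanced pairs."""
    kept = []
    for source, target in pairs:
        if source == target:
            continue  # translation identical to the reference
        if target.replace(" ", "").isdigit():
            continue  # translation consists only of numbers
        shorter = min(len(source), len(target))
        longer = max(len(source), len(target))
        if shorter == 0 or longer / shorter > max_ratio:
            continue  # very short translation of a long segment (or vice versa)
        kept.append((source, target))
    return kept
```

Chaining them as `trim(uniq(extract_pair(read_units(text), "en", "da")))` gives the “en-da” units of a multilingual file with duplicates and obviously dubious entries removed.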

To prepare the data for training the translation system, we have to perform the following steps:
tokenisation: This means that spaces have to be inserted between (e.g.) words and punctuation.
truecasing: The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.
cleaning: Long sentences and empty sentences are removed as they can cause problems with the training pipeline, and obviously mis-aligned sentences are removed.
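The cleaning step can be sketched in a few lines of Python. This is a simplified stand-in for what a corpus-cleaning script does, not the Moses script itself; `clean_corpus` and its default limits (80 tokens, length ratio 9) are assumptions for illustration.

```python
def clean_corpus(pairs, min_len=1, max_len=80, max_ratio=9.0):
    """Filter (source, target) sentence pairs by token count.

    Drops pairs where either side is empty or too long, and pairs whose
    lengths differ so much that they are probably mis-aligned.
    """
    kept = []
    for source, target in pairs:
        src_len = len(source.split())
        tgt_len = len(target.split())
        if src_len < min_len or tgt_len < min_len:
            continue  # empty (or too short) sentence
        if src_len > max_len or tgt_len > max_len:
            continue  # overly long sentence
        # max(..., 1) guards against division by zero when min_len is 0.
        if max(src_len, tgt_len) / max(min(src_len, tgt_len), 1) > max_ratio:
            continue  # likely mis-aligned
        kept.append((source, target))
    return kept
```

Lengths are counted in whitespace-separated tokens, so the filter is meant to run after tokenisation.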

So get mosesdecoder first:
cd ..
git clone https://github.com/moses-smt/mosesdecoder.git

Now it’s time to preprocess the bilingual pairs. We use the fr-en data as the example.

The original English data looks like this, shown first as raw text and then after tokenisation:

SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold.
Lately, with gold prices up more than 300% over the last decade, it is harder than ever.
Just last December, fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment, sensibly pointing out gold’s risks.
Wouldn’t you know it?

SAN FRANCISCO – It has never been easy to have a rational conversation about the value of gold .
Lately , with gold prices up more than 300 % over the last decade , it is harder than ever .
Just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .
Wouldn ’ t you know it ?
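The difference between the raw and tokenised sample lines can be approximated with a naive regular expression. This is only a sketch under simple assumptions, not the Moses tokenizer, which also handles abbreviations, language-specific rules, and escaping; `naive_tokenize` is a hypothetical name.

```python
import re

# Any character that is not a word character, whitespace, hyphen, or en dash
# is treated as punctuation and gets a space on both sides.
_PUNCT = re.compile(r"([^\w\s\-–])")


def naive_tokenize(line):
    """Insert spaces around punctuation and collapse repeated whitespace."""
    spaced = _PUNCT.sub(r" \1 ", line)
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(naive_tokenize("Wouldn’t you know it?"))
```

Hyphens are deliberately left alone so that words such as “op-eds” stay in one piece, as they do in the sample.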

Truecase:

The truecaser first requires training, in order to extract some statistics about the text. After truecasing, the sample looks like this:

San FRANCISCO – It has never been easy to have a rational conversation about the value of gold .
lately , with gold prices up more than 300 % over the last decade , it is harder than ever .
just last December , fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment , sensibly pointing out gold ’ s risks .
wouldn ’ t you know it ?
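The train-then-apply idea behind truecasing can be sketched as follows. It is a simplification, not the Moses truecaser: the model here is just the most frequent surface form of each word seen in non-initial position, and only the first token of a sentence is rewritten. The function names and the tiny training corpus in the usage example are assumptions.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Map each lower-cased word to its most frequent observed casing.

    Sentence-initial tokens are skipped during training because their
    capitalisation says little about the word's usual form.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(sentence, model):
    """Replace the first token with its most probable casing, if known."""
    tokens = sentence.split()
    if tokens:
        tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)


if __name__ == "__main__":
    model = train_truecaser([
        "He said that lately it rains a lot .",
        "We met in San Francisco last year .",
    ])
    print(truecase("Lately , it is harder than ever .", model))
    print(truecase("SAN FRANCISCO – It has never been easy .", model))
```

Because only the first token is touched, “FRANCISCO” keeps its capitals, which matches the first line of the sample.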

WordSegment is an Apache2 licensed module for English word segmentation, written in pure-Python, and based on a trillion-word corpus.

Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
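To show how such counts can drive segmentation, here is a toy sketch of a unigram-only search. It is not the wordsegment code, which also uses bigram counts and the real corpus data: the counts in `TOY_COUNTS` are invented, the length-based penalty for unknown words is an assumption of this sketch, and `toy_segment` is a hypothetical name.

```python
import math
from functools import lru_cache

# Invented counts, standing in for the real unigram data.
TOY_COUNTS = {"this": 50, "is": 80, "a": 100, "test": 20, "at": 30, "his": 10}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 24


def score(word):
    """Log10 probability of `word`; unknown words are penalised by length."""
    if word in TOY_COUNTS:
        return math.log10(TOY_COUNTS[word] / TOTAL)
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def _search(text):
    """Return (best log score, tuple of words) for `text`."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        prefix, suffix = text[:i], text[i:]
        rest_score, rest_words = _search(suffix)
        candidates.append((score(prefix) + rest_score, (prefix,) + rest_words))
    return max(candidates)


def toy_segment(text):
    """Split `text` into the highest-scoring sequence of words."""
    return list(_search(text)[1])


if __name__ == "__main__":
    print(toy_segment("thisisatest"))
```

Memoising `_search` keeps the number of distinct subproblems linear in the length of the input, since every call works on a suffix of the original text.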

Help on module wordsegment:
NAME
wordsegment - English Word Segmentation in Python
FILE
/Library/Python/2.7/site-packages/wordsegment.py
DESCRIPTION
Word segmentation is the process of dividing a phrase without spaces back
into its constituent parts. For example, consider a phrase like "thisisatest".
For humans, it's relatively easy to parse. This module makes it easy for
machines too. Use `segment` to parse a phrase into its parts:
>>> from wordsegment import segment
>>> segment('thisisatest')
['this', 'is', 'a', 'test']
In the code, 1024908267229 is the total number of words in the corpus. A
subset of this corpus is found in unigrams.txt and bigrams.txt which
should accompany this file. A copy of these files may be found at
http://norvig.com/ngrams/ under the names count_1w.txt and count_2w.txt
respectively.
Copyright (c) 2016 by Grant Jenks
Based on code from the chapter "Natural Language Corpus Data"
from the book "Beautiful Data" (Segaran and Hammerbacher, 2009)
http://oreilly.com/catalog/9780596157111/
Original Copyright (c) 2008-2009 by Peter Norvig
FUNCTIONS
clean(text)
Return `text` lower-cased with non-alphanumeric characters removed.
divide(text, limit=24)
Yield `(prefix, suffix)` pairs from `text` with `len(prefix)` not
exceeding `limit`.
isegment(text)
Return iterator of words that is the best segmenation of `text`.
load()
Load unigram and bigram counts from disk.
main(args=())
Command-line entry-point. Parses `args` into in-file and out-file then
reads lines from in-file, segments the lines, and writes the result to
out-file. Input and output default to stdin and stdout respectively.
parse_file(filename)
Read `filename` and parse tab-separated file of (word, count) pairs.
score(word, prev=None)
Score a `word` in the context of the previous word, `prev`.
segment(text)
Return a list of words that is the best segmenation of `text`.
DATA
ALPHABET = set(['0', '1', '2', '3', '4', '5', ...])
BIGRAMS = {u'0km to': 116103.0, u'0uplink verified': 523545.0, u'1000s...
DATADIR = '/Library/Python/2.7/site-packages/wordsegment_data'
TOTAL = 1024908267229.0
UNIGRAMS = {u'a': 9081174698.0, u'aa': 30523331.0, u'aaa': 10243983.0,...
__author__ = 'Grant Jenks'
__build__ = 2048
__copyright__ = 'Copyright 2016 Grant Jenks'
__license__ = 'Apache 2.0'
__title__ = 'wordsegment'
__version__ = '0.8.0'
VERSION
0.8.0
AUTHOR
Grant Jenks

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.
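The synset-and-relation structure can be pictured with a toy graph. The sketch below does not read the WordNet database: the identifiers such as `record.n` are hypothetical, and the few entries are copied by hand from the “record, record book, book” sense purely for illustration.

```python
# Each synset groups the words that express one concept.
SYNSETS = {
    "record.n": ("record", "record book", "book"),
    "logbook.n": ("logbook",),
    "won-lost_record.n": ("won-lost record",),
    "card.n": ("card", "scorecard"),
}

# Hyponym relation: synset -> more specific synsets.
HYPONYMS = {
    "record.n": ["logbook.n", "won-lost_record.n", "card.n"],
}


def hyponym_lines(synset_id, depth=0):
    """Return the hyponym tree below `synset_id` as indented text lines."""
    lines = []
    for child in HYPONYMS.get(synset_id, []):
        lines.append("    " * depth + "=> " + ", ".join(SYNSETS[child]))
        # Recurse so that deeper levels, if present, are indented further.
        lines.extend(hyponym_lines(child, depth + 1))
    return lines


if __name__ == "__main__":
    print(", ".join(SYNSETS["record.n"]))
    for line in hyponym_lines("record.n"):
        print(line)
```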

The most recent Windows version of WordNet is 2.1, released in March 2005. Version 3.0 for Unix/Linux/Solaris/etc. was released in December 2006. Version 3.1 is currently available only online.

In this example we use WordNet 3.0, the stable release, which supports UNIX-like systems including Linux, Mac OS X and Solaris. Before installing WordNet from source, download the tar-gzipped archive: WordNet-3.0.tar.gz

If everything is OK, you will find WordNet 3.0 in the “/usr/local/WordNet-3.0/” directory, and the related binaries (wishwn, wn and wnb) in its binary subdirectory “/usr/local/WordNet-3.0/bin”.

Hyponyms of noun book
7 of 11 senses of book
Sense 1
book
=> authority
=> curiosa
=> formulary, pharmacopeia
=> trade book, trade edition
=> bestiary
=> catechism
=> pop-up book, pop-up
=> storybook
=> tome
=> booklet, brochure, folder, leaflet, pamphlet
=> textbook, text, text edition, schoolbook, school text
=> workbook
=> copybook
=> appointment book, appointment calendar
=> catalog, catalogue
=> phrase book
=> playbook
=> prayer book, prayerbook
=> reference book, reference, reference work, book of facts
=> review copy
=> songbook
=> yearbook
HAS INSTANCE=> Das Kapital, Capital
HAS INSTANCE=> Erewhon
HAS INSTANCE=> Utopia
Sense 2
book, volume
=> album
=> coffee-table book
=> folio
=> hardback, hardcover
=> journal
=> novel
=> order book
=> paperback book, paper-back book, paperback, softback book, softback, soft-cover book, soft-cover
=> picture book
=> sketchbook, sketch block, sketch pad
=> notebook
Sense 3
record, record book, book
=> logbook
=> won-lost record
=> card, scorecard
Sense 4
script, book, playscript
=> promptbook, prompt copy
=> continuity
=> dialogue, dialog
=> libretto
=> scenario
=> screenplay
=> shooting script
Sense 5
ledger, leger, account book, book of account, book
=> cost ledger
=> general ledger
=> subsidiary ledger
=> daybook, journal
Sense 9
Bible, Christian Bible, Book, Good Book, Holy Scripture, Holy Writ, Scripture, Word of God, Word
=> family Bible
HAS INSTANCE=> Vulgate
HAS INSTANCE=> Douay Bible, Douay Version, Douay-Rheims Bible, Douay-Rheims Version, Rheims-Douay Bible, Rheims-Douay Version
HAS INSTANCE=> Authorized Version, King James Version, King James Bible
HAS INSTANCE=> Revised Version
HAS INSTANCE=> New English Bible
HAS INSTANCE=> American Standard Version, American Revised Version
HAS INSTANCE=> Revised Standard Version
Sense 10
book
HAS INSTANCE=> Genesis, Book of Genesis
HAS INSTANCE=> Exodus, Book of Exodus
HAS INSTANCE=> Leviticus, Book of Leviticus
HAS INSTANCE=> Numbers, Book of Numbers
HAS INSTANCE=> Deuteronomy, Book of Deuteronomy
HAS INSTANCE=> Joshua, Josue, Book of Joshua
HAS INSTANCE=> Judges, Book of Judges
HAS INSTANCE=> Ruth, Book of Ruth
HAS INSTANCE=> I Samuel, 1 Samuel
HAS INSTANCE=> II Samuel, 2 Samuel
HAS INSTANCE=> I Kings, 1 Kings
HAS INSTANCE=> II Kings, 2 Kings
HAS INSTANCE=> I Chronicles, 1 Chronicles
HAS INSTANCE=> II Chronicles, 2 Chronicles
HAS INSTANCE=> Ezra, Book of Ezra
HAS INSTANCE=> Nehemiah, Book of Nehemiah
HAS INSTANCE=> Esther, Book of Esther
HAS INSTANCE=> Job, Book of Job
HAS INSTANCE=> Psalms, Book of Psalms
HAS INSTANCE=> Proverbs, Book of Proverbs
HAS INSTANCE=> Ecclesiastes, Book of Ecclesiastes
HAS INSTANCE=> Song of Songs, Song of Solomon, Canticle of Canticles, Canticles
HAS INSTANCE=> Isaiah, Book of Isaiah
HAS INSTANCE=> Jeremiah, Book of Jeremiah
HAS INSTANCE=> Lamentations, Book of Lamentations
HAS INSTANCE=> Ezekiel, Ezechiel, Book of Ezekiel
HAS INSTANCE=> Daniel, Book of Daniel, Book of the Prophet Daniel
HAS INSTANCE=> Hosea, Book of Hosea
HAS INSTANCE=> Joel, Book of Joel
HAS INSTANCE=> Amos, Book of Amos
HAS INSTANCE=> Obadiah, Abdias, Book of Obadiah
HAS INSTANCE=> Jonah, Book of Jonah
HAS INSTANCE=> Micah, Micheas, Book of Micah
HAS INSTANCE=> Nahum, Book of Nahum
HAS INSTANCE=> Habakkuk, Habacuc, Book of Habakkuk
HAS INSTANCE=> Zephaniah, Sophonias, Book of Zephaniah
HAS INSTANCE=> Haggai, Aggeus, Book of Haggai
HAS INSTANCE=> Zechariah, Zacharias, Book of Zachariah
HAS INSTANCE=> Malachi, Malachias, Book of Malachi
HAS INSTANCE=> Matthew, Gospel According to Matthew
HAS INSTANCE=> Mark, Gospel According to Mark
HAS INSTANCE=> Luke, Gospel of Luke, Gospel According to Luke
HAS INSTANCE=> John, Gospel According to John
HAS INSTANCE=> Acts of the Apostles, Acts
=> Epistle
HAS INSTANCE=> Revelation, Revelation of Saint John the Divine, Apocalypse, Book of Revelation
HAS INSTANCE=> Additions to Esther
HAS INSTANCE=> Prayer of Azariah and Song of the Three Children
HAS INSTANCE=> Susanna, Book of Susanna
HAS INSTANCE=> Bel and the Dragon
HAS INSTANCE=> Baruch, Book of Baruch
HAS INSTANCE=> Letter of Jeremiah, Epistle of Jeremiah
HAS INSTANCE=> Tobit, Book of Tobit
HAS INSTANCE=> Judith, Book of Judith
HAS INSTANCE=> I Esdra, 1 Esdras
HAS INSTANCE=> II Esdras, 2 Esdras
HAS INSTANCE=> Ben Sira, Sirach, Ecclesiasticus, Wisdom of Jesus the Son of Sirach
HAS INSTANCE=> Wisdom of Solomon, Wisdom
HAS INSTANCE=> I Maccabees, 1 Maccabees
HAS INSTANCE=> II Maccabees, 2 Maccabees
