16 years ago, when the Web was strictly 1.0, Google was still in its nascent state, and there were a mere 361 million Internet users, David Bowie made one of the most visionary statements about the future of the Internet:

What the Internet is going to do to society is unimaginable

Indeed, with more than 3 billion Internet users today, one can safely say that Bowie’s prediction has come to fruition.

If you are in the language service industry, you are undoubtedly on the lookout for ways to improve the productivity of your team – more translated words in less time is what drives your clients as well as you. Machine Translation (MT) seems the logical step forward in today’s world of content explosion and tightening deadlines. However, for most Language Service Providers (LSPs), the challenge lies in the actual implementation of this sophisticated technology.

For this reason, it is important that, no matter which translation management tools you use, they are integrated with a powerful MT engine that is reliable, scalable, flexible, and can be trained and re-trained constantly for maximum efficiency and quick turnaround times.

In today’s fast-paced world of content explosion on the Internet, the need for translating this organically growing content with the help of machines has become inevitable. While post-editing the machine translated content will always be required, choosing the right MT features will ensure that translators do not spend countless frustrating hours on those edits.

In this Kantanwebinar, Tony O’Dowd and Louise Faherty (Quinn) of the KantanMT Professional Services Team will show how you can improve the translation productivity of your team, and manage effort estimations and project deadlines better, with a powerful MT engine.

The rapidly evolving, dynamic marketplace today has created an enormous spike in the demand for Machine Translation (MT) in a number of industries. According to a new study by Grand View Research, the global Machine Translation market is expected to reach USD 983.3 million by 2022. This is a huge leap from 2014, when the MT market was valued at USD 331.7 million, and this growth projection reflects a broader market trend. Thanks to globalization, there is an increased demand for cost efficiency in translation, but the amount of linguistic knowledge and time required to translate all the content a business produces exceeds the capacity of human translation alone.

Key Insights into the Machine Translation Market

Some of the key insights about Machine Translation that the study discusses are summed up here:

Statistical Machine Translation (SMT) is a clear winner over Rules-Based Machine Translation (RBMT) when it comes to current market requirements.

Globalization and the need to address diverse cultural groups have led to the popularity of translation technology in Asia Pacific, thus opening up new potential markets for MT providers.

Machine translation as a service (MTSaaS) makes use of SMT and is accessible via the web. This allows users to customise their MT engines with their own Translation Memories (TMs).

What this means is that deploying an integrated MT solution will become a critical success factor for gaining market share in the future.

Potential Challenges

The study reveals major challenges for the MT industry, which include a lack of quality translations and Quality Estimation (QE), and competition from free translation service providers. Needless to say, a well-rounded back-end knowledge base, efficient Natural Language Processing (NLP) capabilities and a scalable model are critical to gaining a competitive advantage in the market. MT providers need to go beyond simply providing machine translation services; they need to become solution providers.

How is KantanMT contributing to the MT market?

The KantanMT platform offers a massive competitive advantage, not only because we were one of the first entrants in the MT market, but also because, thanks to our strategic market insights, we have already identified most of these challenges and developed solutions to address them. As solution providers, we use an intuitive approach that can be summed up in a few words: speed, scalability, simplicity and security.

Speed

In a market where new products and innumerable variants of those products are being developed almost every day, it is important to have on-demand translated content ready to be deployed. KantanMT helps its clients gain a first-mover advantage over their competitors by translating content on the fly.

KantanMT engines have the capacity to translate 114 million words in a single day and, as of 7 September 2015, we have exceeded 2 billion translated words, with 1 billion words translated in the last two months alone.

KantanMT Platform statistics as of 8th September 2015

Scalability

As a business trying to make its mark in the global MT market, it is extremely important to have a solution with limitless scalable potential. KantanMT engines, with scaling technologies such as KantanAutoScale, are designed to ensure that no matter how sudden the spike in content, the quality and volume of translated content will never suffer.

The power of KantanMT’s engine was summed up by Tony O’Dowd, Founder and Chief Architect of KantanMT.com:

“We are only starting to see the potential growth of the Machine Translation market, and I doubt any other player can operate at this scale as flawlessly.”

Simplicity

Simplicity is at the very core of KantanMT. The company name itself is derived from the Japanese word for simplicity, 簡単 (かんたん). KantanMT strives to take the complexity out of the user interface, while powerful MT engines do all the hard work in the back end. Easy-to-understand analytics can be generated through the KantanMT engines to gather insights for improving engine quality and maintaining translation quality.

Security

Cloud-based MT solutions have become the industry norm. However, security concerns are high, especially if you are in the eCommerce industry or deal with legal information. KantanMT’s multilayered security approach protects and monitors translations, ensuring all industry secrets are safe. Unlike with a number of open-source translation tools, you own the source as well as the translated words.

Final words

One of the key findings of the Grand View Research review is that “strategic joint ventures, coupled with mergers and acquisitions, have been among the key strategies adopted” by major players in the Machine Translation industry. KantanMT recognises the importance of both industry and academic relationships in building a complete MT ecosystem.

As a team of people with an unbridled passion for innovation in the Machine Translation industry, we were not much surprised by Monday’s news that Reverie Technologies, a Bengaluru-based startup, has bagged a $4M investment. This brilliant news highlights once again that in the ever-changing world of retail marketing and globalization, any business with plans to accelerate its products into global markets needs to localize its content for an enhanced user experience. This in turn drives global revenues and increases brand equity in existing and new markets.

When it comes to Machine Translation, we know that quantity does not always equal quality. In your opinion, how many words will it take to build a fully functional engine?

Tony O’Dowd: Great question! Looking across the entire community of Kantan users today, we have more than 7,600 engines on our system. Those engines range from very small all the way up to very large. The biggest engines, which are in the eCommerce domain, contain about a billion words each.

If we exclude all the billion-word MT engines so they don’t distort the results, then the average size of a KantanMT engine today is approximately 5 million source words.

For example, if you look at our clients in the automotive industry, they have engines in and around 5 million source words, which are producing very high quality MT output.

How long does it take to build an engine of that size?

TOD: Again, using KantanMT.com as an example, we can build an MT engine at approximately 4 million words per hour. Therefore, a 5 million-word engine takes approximately 60 to 90 minutes to build. Compared with other MT providers in the industry, this is insanely fast.

This speed is possible because of our AWS cloud infrastructure. At the moment, we have 480 servers running the system. With such fast build times, our clients can retrain their engines more frequently, giving them higher levels of productivity and higher levels of quality output than most other systems. Read a client use case where speed had a positive impact on MT quality for eCommerce product descriptions (Netthandelen/Milengo case study).
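The throughput figure quoted above can be turned into a back-of-the-envelope build-time estimate. This is an illustrative sketch, not a KantanMT API; the 4-million-words-per-hour rate is simply the number quoted in the answer.

```python
# Hypothetical helper: estimate engine build time from the quoted
# throughput of ~4 million words per hour. Not part of KantanMT itself.
def estimate_build_hours(source_words: int, words_per_hour: int = 4_000_000) -> float:
    """Approximate build time in hours for an engine of this size."""
    return source_words / words_per_hour

# A typical 5 million-word engine:
hours = estimate_build_hours(5_000_000)
print(f"{hours:.2f} hours (~{hours * 60:.0f} minutes)")  # 1.25 hours (~75 minutes)
```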

How long does it take to accumulate that many words?

TOD: Most of our clients are able to deliver those words themselves, but clients who don’t have 5 million source words will normally upload what they have and select one of our stock engines to help them reach a higher word count.

When we look at building an engine for a client, we look at the number of source words, but the key number for us is the number of unique words in an engine. For instance, if I want a high-quality German engine in a narrow domain, it might consist of 5 million source words. More importantly, the unique word count in that engine is going to be close to a million unique words, or slightly more.

If I have a high unique word count, I know the engine is going to know how to translate German correctly. Therefore, we don’t look at one word count, we look at a number of different word counts to achieve a high quality engine.

Another factor to consider is the level of inflected forms in the language. This is an indicator of how many words are needed: to educate and train the system, we need more examples and usage examples of those inflected forms. Generally speaking, highly inflected languages require a lot more training data, so to build an engine for Hungarian, an incredibly inflected language, you will need in excess of 2-3 times the average word count to get workable, high-quality output.
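The distinction between the total source-word count and the unique-word count discussed above can be sketched in a few lines. This is a naive illustration (whitespace tokenisation and lowercasing); production systems use language-aware tokenisers.

```python
# Naive sketch: total source-word count vs. unique-word (vocabulary) count.
def word_counts(segments):
    total = 0
    vocab = set()
    for segment in segments:
        tokens = segment.lower().split()  # real systems tokenise per language
        total += len(tokens)
        vocab.update(tokens)
    return total, len(vocab)

total, unique = word_counts(["Das Auto ist rot", "Das Auto ist neu"])
print(total, unique)  # 8 5
```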

What kind of additional monolingual data do you have?

TOD: There are three areas where we can help in sourcing suitable, relevant, high-quality monolingual data.

We have a library of stock training-data engines on KantanMT.com, all of which include monolingual data in a variety of domains (Medical, IT, Financial, etc.).

In addition to stock engines, most of our clients upload their own monolingual data as PDF, DOCX or plain text files, and we normalise that data. We have an automatic process in place to cleanse the data and convert it into a format suitable for machine translation and machine learning.

We also offer a spider service, where clients give us a list of domain-related URLs from which we can collect monolingual data. For example, we recently built a medical engine in Mexican Spanish for a client in the US, and we collected more than 150k medical terms from health service content, which provided a great boost to the quality and, more importantly, the fluency of the MT engine.

Selçuk Özcan: At Transistent, we collect data from open-source projects and open-source data sets. First, we define filters to ensure that we have relevant monolingual data from the open-source tools, which also involves spidering techniques. We then create a total corpus from the monolingual data we collected, which is used for training the MT engine.
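The text harvesting both answers describe can be illustrated with a minimal HTML text extractor. This sketch uses only Python’s standard library and is not the actual spider service either provider runs; fetching the URLs themselves is omitted.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text from an HTML page, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed("<h1>Salud</h1><script>var x;</script><p>Consulte a su médico.</p>")
print(extractor.chunks)  # ['Salud', 'Consulte a su médico.']
```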

What is the difference between pre-normalisation and final normalisation?

SÖ: The normalisation process is related to the TMS (Translation Management System), CMS (Content Management System) and TM (Translation Memory) systems. Pre-normalisation is applied to the text extracted from your systems to ensure that the job will be processed properly. Final normalisation is then applied to the MT output to ensure that content is successfully integrated back into those systems.
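As an illustration of the two passes, here is a toy pair of functions. The rules shown (whitespace cleanup and shielding inline tags so the engine treats each as one token) are assumptions made for the sketch; real rule sets depend on the specific TMS, CMS and language pair.

```python
import re

TAG = re.compile(r"<[^>]+>")  # any inline markup tag

def pre_normalise(text: str) -> str:
    """Prepare extracted text for MT: collapse whitespace, shield inline tags."""
    text = re.sub(r"\s+", " ", text).strip()
    return TAG.sub(lambda m: m.group(0).replace(" ", "_"), text)

def final_normalise(text: str) -> str:
    """Undo the shielding so the MT output re-imports cleanly."""
    return TAG.sub(lambda m: m.group(0).replace("_", " "), text)

print(pre_normalise("Click   <a href='x'>here</a>  to continue"))
# Click <a_href='x'>here</a> to continue
```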

Can pre-normalisation and final normalisation be applied to corpora from TMs?

SÖ: It is possible to implement normalisation rules to corpora from TM systems. You have to configure your rules depending on your TM tool. Each tool has its own identification and encoding features for tags, markups, non-translatable strings and attributes.

How many words is considered too many in a long segment?

TOD: As part of our data cleansing policy, any data uploaded to a Kantan engine goes through 12 phases of data cleansing. Only segments that pass those 12 phases are included in the engine training. That may seem like a very harsh regime, but it is in place for a very good reason.

At KantanMT, the three things we look for in training data are:

Quality

Relevance

Quantity

We make sure that all the data you upload is very clean from a structural and linguistic point of view before we include it in your engine. If the training data fails any of those 12 steps, it is rejected. For example, one phase checks for long segments. By default, any segment with more than 40 words is rejected. This can be changed depending on the language combination and domain, but the default is 40 words, or 40 tokens.

SÖ: As Tony mentioned, it also depends on the language pair. Nevertheless, you may also want to define the threshold value according to the dynamics of your system, i.e. data, domain, required target quality and so on. We usually split segments at 40-45 words.
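The long-segment phase described above amounts to a simple length filter. A minimal sketch with the 40-token default from the answer (the function name and interface are illustrative, not KantanMT’s):

```python
def filter_long_segments(segments, max_tokens=40):
    """Split segments into (kept, rejected) by whitespace token count."""
    kept, rejected = [], []
    for segment in segments:
        (kept if len(segment.split()) <= max_tokens else rejected).append(segment)
    return kept, rejected

kept, rejected = filter_long_segments(["A short training segment.", "word " * 50])
print(len(kept), len(rejected))  # 1 1
```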

How long does it take to normalise the data?

SÖ: The time frame for normalising data depends on a number of factors, including the language pair, the differences between the linguistic structures you are working with, how clean the data is and the source of the data. If you have lots of formulas or non-standard characters, it will take longer to normalise the data.

For Turkish, it might take 10-15 days to normalise around 10 million words. Of course, this depends on the size of the team involved and the volume of data to be processed.

TOD: The time required to normalise data is very much data driven. A rule of thumb in the Kantan Professional Services Team: standard text consisting mostly of words, such as text from a book, online help or perhaps user-interface text, where the predominant token is a word, is normalised very quickly, because there are no mixed tokens in the data set, only words.

However, if you have numerical data, scientific formulas and product specifications, such as measurements with a lot of part numbers, there is a high diversity of individual tokens as opposed to simple words. This type of data takes a little longer to normalise, because you have to instruct the engine, which you can do using the GENTRY programming language and Named Entity Recognition (NER) software.

We have GENTRY and NER built into KantanMT.com, so we can educate the engine to recognise those tokens. This is important because if the engine doesn’t recognise the data, it can’t handle it during the translation phase.

The more diverse the tokens in your input, the longer the normalisation process takes; conversely, the less diverse the tokens, the quicker the data can be processed. If it’s just words, the system can handle it automatically.

We use this rule of thumb when working with clients to estimate how long it will take to build their engines, as we need to be able to give them some sense of a schedule around building an actual MT engine.
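One crude way to quantify that rule of thumb is to measure the share of tokens that are not plain words (part numbers, measurements, formulas). The regex below, and any threshold you might apply to the resulting ratio, are assumptions for illustration, not KantanMT internals.

```python
import re

WORD = re.compile(r"^[^\W\d_]+$")  # letters only, Unicode-aware

def non_word_ratio(text: str) -> float:
    """Fraction of whitespace tokens that are not plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if not WORD.match(t)) / len(tokens)

print(non_word_ratio("Tighten bolt A-113 to 4.5 Nm"))  # 2 of 6 tokens are mixed
```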

What volume of words would you suggest for a good Turkish engine?

SÖ: It makes no sense to work on a Turkish MT system if you do not have at least a million words of bilingual data and 3 million words of monolingual data. Even in this case, you will have to work more on analyses, testing procedures and rule sets. Ideally, you will have approximately 10 million words of bilingual data. It’s the basic equation of SMT engine training: the more data you have, the higher the quality you achieve.

How long does it take to build an engine for Turkish?

SÖ: It depends on the language pair and the field of expertise or domain. Things may be harder if you are working on a language pair whose languages have very different linguistic structures, such as English and Turkish. However, it’s not impossible to build a mature MT system for such a language pair; you will just need to spend longer on it. Another parameter that affects the time required to reach a mature MT system is the quality of the data to be utilised. It is hard to give a specific time estimate without looking at the data, but in general it will probably take 2 to 6 months to have the intended production system.

What are Gap Analysis and KantanTimeLine?

Gap Analysis identifies and reports any untranslated words in the training data set and allows you to take preventive measures quickly by fine-tuning training data and filling data gaps. The KantanTimeLine™ provides a chronological history of activities for each engine and uses version control for precise management of released and production-ready engines.
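Conceptually, a gap analysis boils down to finding source words the engine has never seen in training. A toy sketch follows; the vocabulary and sentences are invented for illustration and this is not KantanMT’s implementation.

```python
def gap_analysis(source_segments, training_vocab):
    """Return source words absent from the engine's training vocabulary."""
    gaps = set()
    for segment in source_segments:
        gaps.update(w for w in segment.lower().split() if w not in training_vocab)
    return sorted(gaps)

vocab = {"the", "pump", "valve", "is", "open"}
print(gap_analysis(["The valve is open", "The solenoid is open"], vocab))
# ['solenoid']
```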

Using KantanTimeLine and Gap Analysis:

In KantanBuildAnalytics, click the Gap Analysis tab to see the number of untranslated words that remain in the generated translations. You will be directed to the Gap Analysis page, where you will see a breakdown of any gaps in your training data.

A table appears with four headings: ‘#’, ‘Unknown Word’, ‘Reference/Source’ and ‘KantanMT Output’. Under those headings you will find details of any untranslated words, their source and the KantanMT output.

Click Download to download your Gap Analysis report.

Note: You can also click the Timeline tab to view your profile’s Timeline, which is essentially a record of the changes you have made to your engine.

This is one of the many features provided in KantanBuildAnalytics, which aids Localization Project Managers in improving an engine’s quality after its initial training. To see other features in the KantanBuildAnalytics suite, please see the links below.

Translation Error Rate (TER) is a metric used by Machine Translation specialists to determine the amount of post-editing required for machine translation jobs. The automatic metric measures the number of edit actions required to bring a translated segment in line with one of the reference translations. It’s quick to use, language independent and corresponds well with post-editing effort. When tuning your KantanMT engine, we recommend a maximum score of 30%; a lower score means less post-editing is required!
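At its core, TER divides the number of word-level edits by the reference length. The sketch below uses plain Levenshtein distance over words; full TER as defined by Snover et al. also counts phrase shifts as single edits, so treat this as an approximation rather than the exact metric KantanMT computes.

```python
def ter(hypothesis: str, reference: str) -> float:
    """Approximate TER: word-level edit distance / reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    dist = list(range(len(ref) + 1))  # distances for the empty hypothesis
    for i, h in enumerate(hyp, 1):
        prev, dist[0] = dist[0], i
        for j, r in enumerate(ref, 1):
            prev, dist[j] = dist[j], min(dist[j] + 1,      # deletion
                                         dist[j - 1] + 1,  # insertion
                                         prev + (h != r))  # substitution
    return dist[len(ref)] / len(ref)

# One missing word against a six-word reference: 1/6, i.e. about 17%.
print(ter("the cat sat on mat", "the cat sat on the mat"))
```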

How to use TER in KantanBuildAnalytics™

The TER scores for your engine are displayed in the KantanBuildAnalytics™ feature. You can get a quick overview or snapshot in the summary tab. But for a more in-depth analysis, and to calculate the amount of post-editing required for the engine’s MT output, select the ‘TER Score’ tab, which takes you to the ‘TER Scores’ page.

Place your cursor on the ‘TER Scores Chart’ to see the ‘Translation Error Rate’ for each segment. If you hold the cursor over the segment, a pop-up will appear on your screen with details of each segment under these headings, ‘Segment no.’, ‘Score’, ‘Source’, ‘Reference/Target’ and ‘KantanMT Output’.

To see a breakdown of the ‘TER Scores’ for each segment in table format, scroll down. You will now see a table with the headings ‘No’, ‘Source’, ‘Reference/Target’, ‘KantanMT Output’ and ‘Score’.

To see an even more in-depth breakdown of a particular segment, click on the triangle beside each number.

To download the ‘TER Scores’ for all segments, click the ‘Download’ button on the ‘TER Scores’ page.

This is one of the many features included in KantanBuildAnalytics, which can help the Localization Project Manager improve an engine’s quality after its initial training. To see other features used in KantanBuildAnalytics please see the links below.