Workshop sponsors

The MultilingualWeb project is looking at best practices and standards related to all aspects of creating, localizing and deploying the Web multilingually. Coordinated by the W3C, the project aims to raise the visibility of existing best practices and standards and identify gaps. This second workshop in Pisa, Italy, is hosted jointly by the Istituto di Informatica e Telematica and Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche.

Each main session except Localizers begins with a half-hour 'anchor' presentation. This is followed by a series of 15-minute talks. Timing is very strict. Questions and answers are saved for a (typically) half-hour discussion slot at the end of each session, and can of course be continued in breaks and at the evening reception. All attendees participate in all sessions.

The IRC log is the raw scribe log, which has not undergone careful post-editing and may contain errors or omissions. It should be read with that in mind. It constitutes the best efforts of the scribes to capture the gist of the talks and discussions that followed, in real time. IRC was used not only to capture notes on the talks; it could also be followed in real time by remote participants or participants with accessibility problems. People following on IRC could also add contributions to the flow of text themselves.

This page is still a work in progress. Where no link is provided to slides, we are still waiting to receive them. Some video links are unavailable because the speaker requested it. In one case the speaker was unable to attend the workshop, but their slides are available. One speaker withheld their slides. You can also find links to all videos on the VideoLectures workshop page. Thanks to VideoLectures for hosting the videos, and CNR for the recording.

Director of the Institute for Informatics and Telematics (IIT), Italian National Research Council (CNR)

Workshop opening and welcome

The Italian approach to Internationalized Domain Names (IDNs)

abstract Basically this is a system through which you can "write" on the
Internet, for example in Danish or Chinese, using accented letters or
non-Latin characters. Until recently, the choice of domain names was
limited by the twenty-six Latin characters used in English (in addition to
the ten digits and the hyphen "-"). IDN, introduced by
ICANN (Internet Corporation for Assigned Names and Numbers), represents a breakthrough for hundreds of millions of Internet users around the world who until now were forced to use an alphabet that was not their own.
With regard to Italy, the impact of accents will certainly be less marked,
but it will give everyone the opportunity to register domains which
completely match the name of the person, company or brand name chosen.
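
To make the idea concrete, here is a minimal sketch (not taken from the talk) of how an accented domain name can be mapped to the ASCII form that the DNS actually stores. It assumes Python's built-in `idna` codec, which implements the older IDNA 2003 rules; the domain names and the helper names `to_ascii` and `to_unicode` are made up for illustration.

```python
# Minimal sketch: converting between Unicode and ASCII (Punycode) domain names.
# Assumption: Python's built-in "idna" codec (IDNA 2003) is good enough for
# illustration. The domain names below are hypothetical examples.

def to_ascii(domain: str) -> str:
    """Return the ASCII-compatible form of a Unicode domain name."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Return the Unicode form of an ASCII-compatible domain name."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(to_ascii("bücher.example"))           # xn--bcher-kva.example
    print(to_unicode("xn--bcher-kva.example"))  # bücher.example
    # An Italian name with an accented letter round-trips the same way.
    print(to_unicode(to_ascii("città.example")))
```

Each label containing non-ASCII characters is rewritten with an `xn--` prefix, while plain ASCII labels such as `example` are left unchanged.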

Complementarity of information found in media reports across different countries and languages

abstract There is ample evidence that information published in the media
in different countries is largely complementary and that only the biggest
stories are being discussed internationally. This applies to facts (e.g. on
disease outbreaks or violent events) and to opinions (e.g. the same subject
may be discussed with very different emotions across countries), but there
is also a more subtle bias of the media: National media prefer to talk
about local issues and about the actions of their politicians, giving their
readers an inflated impression of the importance of their own country.
Monitoring the media from many countries and aggregating the information
found there would give readers a less biased and more balanced view,
but how can this aggregation be achieved? The speaker will give evidence of such
information complementarity from the Europe Media Monitor family of
applications (accessible at http://emm.newsbrief.eu/overview.html) and show
first steps towards the aggregation of information from highly multilingual
news collections.

abstract This talk will describe the use of XForms to simplify the administration of multi-lingual forms and applications. A number of approaches are possible, using generic features of XForms, that allow a single form to be used, with all the text centralised and kept separate from the application itself. This can be compared to how style sheets allow styling to be centralised away from a page, and allow one page to have several stylings; the XForms techniques can provide a sort of Language-Sheet facility.

abstract The W3C's Widget specifications have seen a great deal of support and uptake within industry. Widget-based products are now numerous in the market and play a central role in delivering packaged web applications to consumers. Despite this, the W3C's Widget specifications, and their proponents, have faced significant challenges in both specifying and achieving adoption of i18n capabilities. This talk will describe how the W3C's Web Apps and i18n Working Groups collaborated to create an i18n model, the challenges we faced in the market and within the W3C Consortium, and how some of those challenges were overcome. It will also propose a rethink of some best practices and relay some hard lessons learned from the trenches.

abstract HTML5 is proposing changes to the markup used for internationalization of web pages. They include character encoding declarations, language declarations, ruby, and the new elements and attributes for bidi support. HTML5 is still very much a work in progress, and these topics are still under discussion. The talk aims to spread awareness of proposed changes so that people can participate in the discussion.

abstract I'll address two common i18n problems that users of current mainstream
browsers face. Users should get content from multilingual Web sites automatically in a
language they understand, hence they need a way to state their preferences.
Some browsers give users this option, but others don't. I'll demonstrate live whether and how languages can be set in various browsers and discuss the usability issue that browser vendors have to deal with: the trade-off between functionality and a simple user interface.
Users should also be able to enter email addresses with international domain names into forms. That might not be possible in modern browsers that already support HTML5's new email input type. I'll show how to validate email addresses without being too restrictive and eventually raise the
question: Does the HTML5 specification have to be changed to reflect the
users' needs?
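
As an illustration of what validating "without being too restrictive" might look like, the sketch below accepts addresses whose domain part contains non-ASCII characters. It is not the validation algorithm from the HTML5 specification and is not taken from the talk. The function name `is_plausible_email`, the rule that a domain must contain a dot, and the example addresses are all assumptions made for this example.

```python
# Minimal sketch of a permissive email check that tolerates international
# domain names. Assumptions: Python's built-in "idna" codec (IDNA 2003) is
# used to test whether the domain can be expressed in ASCII, and a domain
# must contain at least one dot. Neither rule comes from the HTML5 spec.

def is_plausible_email(address: str) -> bool:
    """Return True if the address has a local part and an encodable domain."""
    if " " in address:
        return False
    # Split on the last "@" so the local part may itself contain "@".
    local, separator, domain = address.rpartition("@")
    if not separator or not local or not domain:
        return False
    try:
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty or over-long labels and for invalid characters.
        return False
    return "." in ascii_domain


if __name__ == "__main__":
    for candidate in ("user@bücher.example", "user@example.org", "no-at-sign"):
        print(candidate, is_plausible_email(candidate))
```

A check like this deliberately leaves the local part almost unconstrained; the only reliable test of an address is to send mail to it.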

What's Next in Multilinguality, Web News & Social Media Standardization?

abstract The Web is no longer just a protocol (HTTP) and a mark-up language
(XHTML); rather, it has become an ecosystem of different content mark-up standards,
conventions, proprietary technologies, and multimedia (audio, video, 3D). The static Web page
is no longer the sole inhabitant of that ecosystem: it now shares it with Web applications (from CGI to AJAX),
Web services, and social media hubs with huge transaction volumes that exhibit some
properties of both IT systems and social fabric. In this talk, I would like to discuss some of the challenges that this diversity implies for the technology stack, to assess the standardization situation, and to speculate what the future may (and perhaps should?) bring.

abstract Office.com is one of the largest multilingual, content-driven web-sites in the world. With more than 1 billion visits per year, it is published in 40 languages. For the Office 2010 release, authoring and publishing for Office.com was changed to make use of Microsoft Word and SharePoint. A large migration effort was undertaken to move 5 million+ assets for 40 markets to new file formats and management systems. In this talk we will present lessons learnt for designing and managing multilingual web-sites from this major re-engineering exercise.

abstract Internationalization Tag Set (ITS) is a set of generic elements and
attributes which can be used in any XML content format to support easier
internationalization and localization of documents. In this talk
examples and advantages of using ITS in formats like XHTML, DITA and
DocBook will be shown. Also problems of integration with HTML5 will be
briefly discussed.

Obstacles for following i18n best practices when developing content at INAF

abstract This talk addresses the following: a. what could be the best way to produce multilingual web content to communicate astrophysical science and projects;
b. how we could educate and persuade our content creators to follow internationalization best practices while using their preferred web authoring tools or web content management systems.

abstract Additional standards are required to facilitate the use and construction of multilingual web sites. The user interface standards should be a best practices guide combining existing mechanisms such as transparent content negotiation (TCN) and new techniques such as a language button in the browser. Servers should expect the same API to the content, though eventually one should address the whole cycle of Authorship, Translation and Publishing Chain (ATP-chain).
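
For readers unfamiliar with language negotiation, the following sketch shows the simplest server-driven variant: choosing among available translations from an `Accept-Language` request header. It is not an implementation of transparent content negotiation (TCN) and is not taken from the talk. The function names, the fallback to the primary language subtag, and the default language are assumptions made for this example.

```python
# Minimal sketch of server-driven language selection from an Accept-Language
# header. Assumptions: quality values default to 1.0, malformed q-values are
# treated as 0, and a regional tag such as "en-GB" may fall back to "en".

def parse_accept_language(header):
    """Return (language tag, quality) pairs, highest quality first."""
    preferences = []
    for item in header.split(","):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0
        for parameter in parts[1:]:
            name, _, value = parameter.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        preferences.append((tag, quality))
    # sorted() is stable, so equally weighted tags keep their header order.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


def choose_language(header, available, default="en"):
    """Pick the best available language for the given header."""
    lookup = {language.lower(): language for language in available}
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in lookup:
            return lookup[tag]
        primary = tag.split("-")[0]
        if primary in lookup:
            return lookup[primary]
    return default


if __name__ == "__main__":
    print(choose_language("da, en-gb;q=0.8, en;q=0.7", ["en", "it"]))  # en
    print(choose_language("it-IT, it;q=0.9", ["en", "it"]))            # it
```

A visible language switch on the page, or the browser language button proposed in the abstract, would still be needed for users whose header does not reflect their real preferences.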

abstract Web on-the-go is now an everyday reality. It touches all of our lives from the moment we wake, to our commute, from work to an evening out on the town. This reality presents both an opportunity and an incredible challenge as Web content managers attempt to optimize customer engagement. Because visitors do not see themselves as part of a global audience but as individuals, we will examine the WCM software requirements that enable organizations to maintain central control, while providing their audiences with locally relevant and translated content. From a Global Brand Management perspective, we will examine how organizations can manage, build and sustain a global brand identity by reusing brand assets across all channels (multiple, multilingual websites, email and mobile websites). We will also take a fresh look at automated personalization and profiling, and how Web content can be targeted for specific language requirements as well as the local interests of local audiences.

abstract Although support for standards such as XLIFF and TMX has increased
interoperability among tools, today's translation-related processes are
facing challenges beyond the ability to import and export files. They
require standards that are granular and more flexible.
Using concrete examples of the ways that various tools can interoperate
beyond the exchange of files, this session walks through some of the issues
encountered and outlines the use of a new approach to standardization in
which modular standards, similar to Lego® blocks, could serve as core
components for tomorrow's agile, interoperable, and innovative
translation technologies.

Multilingual transformations on the web via XLIFF current and via XLIFF next

abstract David will argue that content metadata must survive language
transformations to be of use in the multilingual web. In order to achieve
that goal, the metadata related to content creation and to content language
transformation must be congruent, i.e. designed upfront with the
transformation processes in mind. To make the point for XLIFF as the
principal vehicle for critical metadata throughout multilingual
transformations, it will be necessary to give a high-level overview of
XLIFF structure and functions, both in the current version and the next
generation standard that is currently a major and exciting work in
progress in the OASIS XLIFF TC.

Interoperability Now! A pragmatic approach to interoperability in language technology

abstract Existing language technology standards give the false impression of
interoperability between tools. There's a gap to bridge that is mostly
about mindsets, technology and mutual consent on the interpretation of
standards. A couple of players agreed to search for this mutual consent
based on existing standards to bridge this gap. The talk will give some
background on the issues with the use of existing standards and how
Interoperability Now! is approaching this.

abstract GTS has developed a plugin for websites developed using the open-source WordPress CMS. It is the only solution that supports post-editing MT and allows content publishers to create their own translation community. This talk will present our system and describe some of the challenges in translation of dynamic web content and the potential rewards that our concept holds.

abstract Opera Software has a large community, with members from all over the world. The talk will present various obstacles encountered and lessons learned from using a community of external volunteer resources for localization in a closed-source environment. Included topics will be training and organization of volunteers and managing terminology and branding, as well as other issues that come with the territory.

Flexibility and robustness: The cloud, standards, web services and the hybrid future of translation technology

abstract First 5 minutes: Introducing the current state of affairs, describing leading innovations. Also lamenting the demise of LISA. Second 5 minutes: Describing the possible future and who will be the winners, who will be the losers. Last 5 minutes: What we can do to get standards moving internally in medium and large organisations.

abstract The web is an open space and the standards by which it is "governed" must be open. However, one barrier clearly remains to make the web even more
transnational and truly global. This has been called "the language barrier".
The Language Service Providers' translation business model is clearly antiquated and it is increasingly
being questioned when we face real translation needs by web users. Here,
immediacy is paramount. This talk is about open standards in machine translation
technologies and workflows, supporting a truly multilingual web.

details To further promote networking among attendees, there will be a reception at 8pm in the Capitolium Hall of the Chiostro di San Francesco, a wonderful ancient cloister, next to the church of St. Francesco. Entry is free to workshop participants. See a map showing the route from the workshop location to the church. The Capitolium Hall has frescoes by Niccolò di Pietro Gerini with Histories of the life of Christ (1392). The rectangular cloister is from the 14th century.

5 April

0900

Machines

Dave Lewis

Centre for Next Generation Localisation: Trinity College Dublin

Semantic Model for end-to-end multilingual web content processing

abstract This talk will present a Semantic Model for end-to-end
multilingual web content processing flows that encompass content
generation, its localisation and its adaptive presentation to users. The
Semantic Model is captured in the RDF language in order to both provide
semantic annotation of web services and to explore the benefits of using
federated triple stores, which form the Linked Open Data cloud that is
powering a new range of real world applications. Key applications include
the provenance-based Quality Assurance of content localisation and the
harvesting and data cleaning of translated web content and terminology
needed to train data-driven components such as statistical machine
translation and text classifiers.

abstract Developing multilingual Web services in agile software teams is a multifaceted enterprise which comprises various areas, including methodology, governance and localization. We will report on our employment of standards and best practices, particularly where and how they fit or did not fit, and the gaps we have encountered, including our strategy to bridge them effectively as well as some of our workarounds.

abstract Small markets, limited language resources, tiny research communities – these are some of the obstacles in development of technologies for smaller languages. In this presentation we will share experience and best practices from EU collaborative projects with a particular focus on acquiring resources and developing machine translation technologies for smaller languages. Novel methods help to collect more training data for statistical MT, involve users in data sharing and MT customization, collect multilingual terminology and adapt MT to terminology and stylistic requirements of particular applications.

abstract With the constant growth of web-based content, large collections of text are becoming available. Many if not most professional non-English web sites offer webpages translated into English and into the other languages of their clients and partners. These are usually professional translations and are abundant. We call this the Hidden Web. We intend to present possibilities, problems and best practices for harnessing such aligned textual corpora. Such data can then be efficiently used as a translation memory, for example to help human translators, or as training data for machine translation algorithms.

abstract No question about it, companies are embracing social media and working it
on a global scale. But the expansion is not without its challenges. Chief
among them is how to effectively communicate on multiple platforms, in
multiple languages, with a variety of cultural audiences. So how are
companies making it happen? In what ways are they using social media
globally? What are the emerging best practices for dealing with language
and culture on blogs, Twitter, community forums and other platforms?

abstract There is little doubt that the web is being fundamentally transformed by
social media. The realization that we now live a significant part of our
lives online is giving rise to new perspectives on text analytics and to
new interaction paradigms. Emotions and experiences are key to
communication in social media: recognizing and tracking them in highly
dynamic multilingual text streams produced by users around Europe, or even
around the globe, is an emerging area for research and innovation. I will
illustrate this with a few examples derived from online reputation
management and large scale mood tracking.

abstract The talk will touch, from the perspective of a Language Service Provider (LSP), on how Multilingual Search Engine Optimization (MSEO) is already an essential part of the language Localization process. The presentation will provide an in-depth look at the nascent Best Practices and explain the concepts behind Multilingual Search Engine Optimization.

Controlled and uncontrolled environments in social networking websites and linguistic rules for multilingual websites

abstract In social networking websites, a "controlled" component, generated by content creators, must coexist with an "uncontrolled" component that is generated by the users. Even if the latter is more difficult to control, it is the former that creates more challenges in terms of l10n/i18n. The use of a crowdsourcing approach has proven successful for Facebook, but this was achieved thanks to the implementation of standard linguistic rules that are complex and detailed but, at the same time, easily understandable by the actors involved in the translation process.

abstract Users are increasingly using social media and different devices next to the 'traditional' web and offline media. Information that was previously unavailable or inaccessible is today shaping their opinions and buying behaviour. As a result, users' expectations have changed and have raised the bar for any organization that interacts with them. They expect that information is always targeted and relevant to their needs, available in their language and on the device of their choice. The presentation will highlight some of the specific challenges that are emerging as well as demonstrate the technology available to solve them.

abstract This presentation will give a summary of the joint TAUS-LISA survey on translation industry interoperability and a report from the recent Standards Summit in Boston (February 28-March 1) as well as perspectives on open translation platforms from TAUS Executive Forums.

From multilingual documents to multilingual websites: challenges for international organizations with a global mandate

abstract International organizations face many challenges when trying to reach their global audience in as many languages as possible. The Food and Agriculture Organization of the United Nations (FAO) works in six languages (Arabic, Chinese, English, French, Russian and Spanish) to try to have an impact in the agricultural sector of its member countries. The presentation will focus on the need for multilingual support on the Web and will refer to the standards and best practices needed. It will cover aspects such as the creation and deployment of multilingual content, the translation needs and possible integration of TM and MT, the availability of CAT tools, etc.

On the way to sharing Language Resources: principles, challenges, solutions

abstract This talk will present the basic features of the META-SHARE architecture, the
repositories network, and the metadata schema. We will then discuss the principles that META-SHARE uses regarding language resource sharing and the instruments that support them, the membership types along with the privileges and obligations they entail, as well as the legal infrastructure that META-SHARE will employ to achieve its goals. We will conclude by elaborating on potential synergies with neighbouring initiatives and
future plans at large.