First Stop: Europe • Lycos search technology was initially built for ASCII only. In-house work made the data paths 8-bit clean to accommodate European languages. • Otherwise relatively straightforward. Components such as ad servers and Web servers required little if any change. • Euro service came online in May 1997.

What’s Unicode? Where’s Japan? • The more interesting problem. • Business reasons to introduce Japanese search. • But not a lot of international(ization) experience within Lycos at the time. • We needed assistance and chose Basis Technology.

Goals • Quick deployment of Japanese search • 1995 to 1997, Japanese Internet more than doubling each year • Marketing need to launch in Japan ASAP • Economical and efficient solution • Produce reusable internationalized code • Poise Lycos for even quicker deployment into other languages • Get "more bang for the buck"

Two Main Functions of a Search Engine • Building a catalog: Compiling an indexed catalog of webpages from the Internet • Performing a query: Delivering a list of webpages matching certain keywords and parameters input by the user

Japanese Issues for Catalog • Double-Byte: Japanese characters are double-byte. • Multiple encodings: Japanese webpages use three encodings: Shift-JIS, EUC-JP, and ISO-2022-JP. • Options: Multiple vs. Single Catalog • Three catalogs: one in Shift-JIS, one in EUC-JP, one in ISO-2022-JP (an awkward and complicated solution to implement) OR • One catalog: all catalog data either in one Japanese encoding or in Unicode

Single Catalog Options • A) Convert all data to one Japanese encoding • ISO-2022-JP, Shift-JIS, or EUC-JP • B) Convert all data to Unicode • The quick and economical choice. Unicode is . . . • A superset of all scripts and character set encodings used on the Web, therefore reusable for other languages • More easily integrated into existing code originally written for processing single-byte ASCII

Encoding Conversion • Purpose: Convert data between encodings used on the Web and Unicode (which is still not used universally on the Web) • From 寿司 in Shift-JIS you want 寿司 in Unicode • Functionality provided by Basis Technology's Rosette embedded in Lycos code as source • Rosette is a cross-platform C++ library for Unicode; http://www.basistech.com/products/ • Complete set of mapping tables between Unicode and major legacy encodings • Conversions performed quickly and economically with minimal impact on performance
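
The conversion step can be sketched with Python's built-in codecs. This is only a stand-in for illustration: it does not use Rosette, and the helper names `to_unicode` and `from_unicode` are hypothetical, not part of any Lycos or Basis Technology API.

```python
# Minimal sketch of legacy-encoding <-> Unicode conversion.
# Assumption: Python's standard codecs stand in for Rosette's mapping tables.

def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode bytes in a known legacy encoding into a Unicode string."""
    return raw.decode(source_encoding)

def from_unicode(text: str, target_encoding: str) -> bytes:
    """Encode a Unicode string into a legacy encoding."""
    return text.encode(target_encoding)

sushi = "寿司"
sjis_bytes = from_unicode(sushi, "shift_jis")
euc_bytes = from_unicode(sushi, "euc_jp")
jis_bytes = from_unicode(sushi, "iso2022_jp")

# The three byte sequences differ, but each maps back to the same Unicode text.
for raw, encoding in ((sjis_bytes, "shift_jis"),
                      (euc_bytes, "euc_jp"),
                      (jis_bytes, "iso2022_jp")):
    print(encoding, raw.hex(), to_unicode(raw, encoding))
```

The caller has to supply the source encoding, which is why the conversion step depends on detecting the encoding first.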

Why Encoding Auto-Detection? • In order to convert text to another encoding, you have to know where you’re starting from. Or you could get . . . • Ex. Text in EUC-JP when viewed as other encodings. EUC-JP: 寿司 コンピュータ 花見 Shift-JIS: ｼﾊ ･ｳ･ﾔ･蝪ｼ･ｿ ｲﾖｸｫ ASCII:
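
The effect on the slide above can be reproduced in a few lines. This sketch uses Python's standard codecs (an assumption for illustration, not the tooling used at Lycos) to decode EUC-JP bytes once with the right encoding and once with the wrong one.

```python
# Decode the same EUC-JP bytes with the correct and an incorrect encoding.
original = "寿司 コンピュータ 花見"
euc_bytes = original.encode("euc_jp")

right = euc_bytes.decode("euc_jp")
# errors="replace" substitutes U+FFFD for byte sequences that are invalid
# in Shift-JIS instead of raising, so the garbled result can be inspected.
wrong = euc_bytes.decode("shift_jis", errors="replace")

print("as EUC-JP:   ", right)
print("as Shift-JIS:", wrong)
```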

Encoding Auto-Detection • Purpose: to correctly identify encoding of webpage or query in order to convert properly from one encoding to another. • Functionality provided by Basis Technology's Rosette • Auto-detection on Japanese text in Shift-JIS, EUC-JP, or ISO-2022-JP encodings • Enhanced tiebreaker functionality to auto-detect very short strings (queries)
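
One simple way to approach detection is sketched below. It is a naive heuristic built on Python's standard codecs, not Rosette's algorithm: ISO-2022-JP is recognized by its escape sequences, the other two candidates are tried with strict decoding, and a hypothetical scoring function (`_score`) acts as the tiebreaker when more than one candidate decodes cleanly.

```python
# Naive sketch of Japanese encoding auto-detection (not Rosette's method).
CANDIDATES = ("euc_jp", "shift_jis")

def _score(text: str) -> int:
    """Tiebreaker: reward kana and kanji, penalize half-width katakana,
    which is rare in real text but typical of misread EUC-JP."""
    score = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF:
            score += 2
        elif 0xFF61 <= cp <= 0xFF9F:
            score -= 1
    return score

def detect_japanese_encoding(raw: bytes) -> str:
    # ISO-2022-JP is 7-bit and switches character sets with ESC sequences.
    if b"\x1b$" in raw or b"\x1b(" in raw:
        return "iso2022_jp"
    if all(b < 0x80 for b in raw):
        return "ascii"
    best, best_score = None, None
    for encoding in CANDIDATES:
        try:
            text = raw.decode(encoding)  # strict: invalid bytes raise
        except UnicodeDecodeError:
            continue
        score = _score(text)
        if best_score is None or score > best_score:
            best, best_score = encoding, score
    return best or "unknown"

sample = "寿司 コンピュータ 花見"
for encoding in ("shift_jis", "euc_jp", "iso2022_jp"):
    print(encoding, "->", detect_japanese_encoding(sample.encode(encoding)))
```

Very short inputs such as queries give the scorer little evidence to work with, which is why the slide calls out enhanced tiebreaking for that case.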

Japanese Word Breaking • Purpose: To return indexable units (words) for creating an index, or for breaking the query into words to look up in the index. • Problem: Japanese words are not delimited by spaces • Solution: Basis Technology's Japanese Morphological Analyzer (http://www.basistech.com/products/) • Dictionary-based Japanese word breaking • Elimination of stop words (e.g., “a”, “the”) • Looks for longest word match
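
Dictionary-based longest-match word breaking can be sketched as follows. This is a toy illustration, not the Basis Technology analyzer: the function name `break_words`, the tiny dictionary, and the choice of Japanese particles as stop words are all assumptions made for the example.

```python
# Toy greedy longest-match word breaker with stop-word removal.

def break_words(text, dictionary, stop_words=frozenset()):
    """Split text into dictionary words, preferring the longest match
    at each position and dropping stop words and whitespace."""
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # unknown character: emit it on its own
        i += len(candidate)
        if not candidate.isspace() and candidate not in stop_words:
            words.append(candidate)
    return words

DICTIONARY = {"寿司", "寿", "花見", "花", "食べる", "コンピュータ", "を", "で"}
STOP_WORDS = {"を", "で"}  # hypothetical stop list for the example

print(break_words("花見で寿司を食べる", DICTIONARY, STOP_WORDS))
# → ['花見', '寿司', '食べる']
```

Because the longest candidate is tried first, 花見 is returned as one word even though 花 is also in the dictionary.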

Selecting Unicode Representation (1) UCS2 characteristics • Depending on the task, either the UCS2 or UTF8 representation of Unicode was used in different parts of the Lycos search • Characteristics of UCS2 • Each coded character element is fixed width, 16 bits • Data paths must all accommodate 16 bits • Text in UCS2 is easy to manipulate and analyze (from a programming viewpoint)

Selecting Unicode Representation (2) UTF8 characteristics • Characteristics of UTF8 • Each coded character is composed of one to six octets (one octet = 8 bits) in the original definition; RFC 3629 later limited UTF8 to four octets • Data paths need only be "8-bit clean" • None of the octets in a multi-byte character is null (i.e., has the value zero) • Text in UTF8 is more difficult to manipulate or analyze because characters vary in width. • "8-bit clean" = computer code that treats all 8 bits of a byte as significant. True of any code that processes European languages properly, but not necessarily true of code that processes only ASCII, which uses only 7 bits per character.
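
The difference between the two representations can be seen by encoding a few characters both ways. In this sketch, Python's `utf-16-be` codec stands in for UCS2, which is an assumption that holds only for characters that fit in 16 bits; the helper `ucs2_encode` is hypothetical.

```python
# Compare UCS2 (fixed 16 bits) with UTF8 (variable width) per character.

def ucs2_encode(text: str) -> bytes:
    """Encode as big-endian UCS2; reject characters that need more than 16 bits."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the 16-bit range of UCS2")
    return text.encode("utf-16-be")

SAMPLES = ("A", "\u00e9", "寿")  # ASCII letter, accented Latin letter (é), kanji

for ch in SAMPLES:
    utf8 = ch.encode("utf-8")
    ucs2 = ucs2_encode(ch)
    print(f"{ch}: UTF8 {len(utf8)} octet(s) {utf8.hex()}, "
          f"UCS2 {len(ucs2)} octets {ucs2.hex()}")

# UCS2 text contains zero octets for ASCII characters; UTF8 text does not,
# so code that treats a zero byte as a string terminator keeps working.
print(0 in ucs2_encode("A"), 0 in "".join(SAMPLES).encode("utf-8"))
```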

Unicode in the Lycos System • UCS2: Japanese Morphological Analyzer from Basis Technology • Using UCS2 is the quick and economical way to process huge volumes of Japanese text. • UTF8: Lycos Catalog • Economy of disk space: ASCII is smaller in UTF8 • On the Web*: ASCII 79%, double-byte Asian less than 5%, European encodings and others 16% • Ease of integration with existing code (a.k.a. transmissibility) • *Based on the number of Web hosts on the Internet by country (total number of hosts for English-speaking domains as a percentage of the total number of hosts worldwide). Source: Survey by Network Wizards, http://www.nw.com
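
The disk-space argument can be made concrete with a rough back-of-the-envelope estimate. The sketch below rests on strong assumptions that are not in the slides: the host shares are treated as if they were shares of stored characters, "less than 5%" is taken as 5%, every European character is counted as two UTF8 octets, and every double-byte Asian character as three.

```python
# Rough average storage cost per character under the assumed mix.
SHARES = {"ascii": 0.79, "european_and_other": 0.16, "double_byte_asian": 0.05}
UTF8_OCTETS = {"ascii": 1, "european_and_other": 2, "double_byte_asian": 3}
UCS2_OCTETS = 2  # fixed width for every character

utf8_avg = sum(SHARES[k] * UTF8_OCTETS[k] for k in SHARES)
ucs2_avg = UCS2_OCTETS * sum(SHARES.values())

print(f"UTF8: about {utf8_avg:.2f} octets per character")
print(f"UCS2: about {ucs2_avg:.2f} octets per character")
```

Under these assumptions UTF8 comes out at roughly 1.26 octets per character against 2 for UCS2, which is consistent with choosing UTF8 for a catalog dominated by ASCII text.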

Project Complete: Lycos Japan (1) • Quick: Prototype of Japanese search is produced in two months. Lycos Japan: http://www.lycos.co.jp • Beta version of Japanese search debuts July 1998; enters competitive Japanese search engine race in 4th place* • Upon formal launch grabs 2nd place in October 1998* • *According to Search Desk, http://www.searchdesk.com

Project Complete: Lycos Japan (2) • E-conomical: Today, Lycos has spider, catalog, and query software that can easily be set to build catalogs in different languages by swapping localized pieces in and out: • Settings for target domains • Encoding detection and conversion calls • Language-specific word breaker (if needed)
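
The idea of swapping localized pieces can be sketched as a small configuration object. Everything here is hypothetical (the `LocalePieces` class, its field names, the domain lists, and the placeholder detectors and word breaker); it illustrates the design only and does not reflect the actual Lycos code.

```python
# Sketch: one pipeline, with the language-specific pieces passed in.
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

@dataclass(frozen=True)
class LocalePieces:
    target_domains: Sequence[str]                  # which hosts to crawl
    detect_encoding: Callable[[bytes], str]        # encoding detection call
    word_breaker: Optional[Callable[[str], List[str]]] = None  # if needed

def index_terms(raw: bytes, pieces: LocalePieces) -> List[str]:
    """Detect the encoding, convert to Unicode, and break into index terms."""
    text = raw.decode(pieces.detect_encoding(raw))
    # Languages that delimit words with spaces need no special breaker.
    breaker = pieces.word_breaker or str.split
    return breaker(text)

# Placeholder pieces: fixed-answer detectors and a per-character breaker.
european = LocalePieces(
    target_domains=(".de", ".fr"),
    detect_encoding=lambda raw: "latin-1",
)
japanese = LocalePieces(
    target_domains=(".jp",),
    detect_encoding=lambda raw: "euc_jp",
    word_breaker=lambda text: [ch for ch in text if not ch.isspace()],
)

print(index_terms("café crème".encode("latin-1"), european))
print(index_terms("寿司".encode("euc_jp"), japanese))
```

Adding a language then means supplying a new `LocalePieces` value rather than changing `index_terms`.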