Introduction to Inputting Amharic

For those of you who missed the Spring Africana Librarians Council (ALC) meeting, or who are curious about how to input Ethiopic script into MARC library records, this overview may help. It will work best for a user who has had some preliminary exposure to Amharic, or for someone working alongside a native speaker. Ethiopic (or Ge’ez) script is also used for the Tigre and Tigrinya languages, among others.

Update #1 (4/12/16): Some of the more difficult visual distinctions are between syllables such as ሳ (sā) and ላ (lā); ሰ (sa), ስ (se) and ለ (la); or ጻ (ṣā) and ጾ (ṣo). It is also important to keep in mind the distinction between transliterated glottal vowels like ʼa (አ) and pharyngeal ones like ʻa (ዐ). Glottals are romanized using the alif, while pharyngeals are romanized using the ayn character. It is also worth noting here that the Ethiopic calendar has thirteen months (one is very short), and is offset from the Gregorian calendar by seven to eight years.
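Because several of these syllables look alike on screen, one way to double-check a character before it goes into a record is to look up its Unicode code point and name. What follows is a minimal Python sketch using the standard unicodedata module. The helper name describe is hypothetical, and the code points used for the alif and ayn (U+02BC and U+02BB) are an assumption that should be checked against your own system's character mapping.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Look-alike syllables mentioned above, plus the glottal/pharyngeal pair.
lookalikes = "ሳላሰስለጻጾአዐ"

for ch, code, name in describe(lookalikes):
    print(ch, code, name)

# Assumed code points for the romanization marks (alif, then ayn).
for ch, code, name in describe("\u02bc\u02bb"):
    print(ch, code, name)
```

The Unicode names spell out what the eye may miss: አ is reported as a GLOTTAL A and ዐ as a PHARYNGEAL A.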

Update #2 (5/25/16): Work is under way on OCR (optical character recognition) for Amharic and Tigrinya, using the open source OCR engine Tesseract. Look for the language packs listed here.

Euan Cochrane helped me find a free front-end product that works with Tesseract; I will be looking to pull the pieces together over the next couple of weeks to do some testing.
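For anyone who would rather call Tesseract directly than go through a front end, here is a rough Python sketch of how that might look. It assumes the tesseract command-line program is installed and on the path, and that the Amharic language pack is installed under the code amh (tir for Tigrinya). The function names and the file name in the comment are hypothetical.

```python
import subprocess

def build_tesseract_command(image_path, lang="amh"):
    """Build a Tesseract command that prints recognized text to standard output."""
    # Passing "stdout" as the output base asks Tesseract to print the text
    # instead of writing it to a file.
    return ["tesseract", image_path, "stdout", "-l", lang]

def ocr_image(image_path, lang="amh"):
    """Run Tesseract on one image file and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,  # requires Python 3.7 or later
        check=True,           # raise an error if Tesseract fails
    )
    return result.stdout.decode("utf-8")

# Example call (hypothetical file name):
# text = ocr_image("south_africa_page.png", lang="amh")
```

Keeping the command construction separate from the call makes it easy to inspect exactly what will be run before pointing it at a batch of scans.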

Update #3 (5/26/16): Preliminary testing of Tesseract in Amharic is moving ahead. I started with what should have been an easy test, a Wikipedia page on South Africa. A sample line follows:

Input: በደቡብ አፍሪቃ ሕገ መንግሥት መሠረት 11 ልሳናት በእኩልነት ይፋዊ ኹኔታ አላቸው።

Output: በየበበ አፍረፆሐገመገማሥን መሠረን ገገ ልስናን በአከልነን ርፋዊ ጤል አሳኘውዞ

The accuracy rate is running at about 42% so far. More work is needed.
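One common way to put a number on OCR quality is character accuracy based on edit distance: count the fewest single-character insertions, deletions and substitutions needed to turn the OCR output into the original, and divide by the length of the original. The Python sketch below illustrates that approach only. It is not necessarily how the figure above was obtained, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]

def char_accuracy(reference, ocr_output):
    """Return 1 - (edit distance / reference length), never below zero."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    distance = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - distance / len(reference))

# The sample line from the test above.
reference = "በደቡብ አፍሪቃ ሕገ መንግሥት መሠረት 11 ልሳናት በእኩልነት ይፋዊ ኹኔታ አላቸው።"
ocr_output = "በየበበ አፍረፆሐገመገማሥን መሠረን ገገ ልስናን በአከልነን ርፋዊ ጤል አሳኘውዞ"

print(f"Character accuracy: {char_accuracy(reference, ocr_output):.0%}")
```

A single line is too small a sample to say much, so a measure like this is more meaningful when averaged over whole pages.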