The Tigrigna language lacks the text and speech corpora needed to develop speech technologies. In this work, after considering the phonetic nature of Tigrigna, we have gathered and pre-processed an initial, relatively large corpus of sentences. Using the syllable as the basic phonetic unit, we derive two sub-corpora from that initial text corpus: one that is phonetically rich and one that is phonetically balanced. Each sub-corpus is built with a different method, and both methods are variants of previously published ones. Finally, the frequencies of occurrence of the syllabic units in the balanced corpus are compared with those reported in a previous study of Tigrigna phonetics, and the two are found to be consistent.
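
To make the idea of a phonetically rich sub-corpus concrete, the following is a minimal sketch of one common approach, greedy selection by syllable coverage. It is not the method used in this work, whose details are not given here. The function names `syllables`, `select_rich` and `syllable_frequencies` are hypothetical, and the sketch assumes that each non-space character of a sentence can be treated as one syllabic unit, which is only a rough approximation for text written in the Ge'ez script.

```python
from collections import Counter


def syllables(sentence):
    # Assumption: every non-space character counts as one syllabic unit.
    return [ch for ch in sentence if not ch.isspace()]


def select_rich(sentences, max_sentences):
    """Greedily pick sentences that add the most not-yet-covered units."""
    covered = set()
    chosen = []
    remaining = list(sentences)
    while remaining and len(chosen) < max_sentences:
        # Sentence contributing the largest number of new units.
        best = max(remaining, key=lambda s: len(set(syllables(s)) - covered))
        gain = set(syllables(best)) - covered
        if not gain:
            break  # no remaining sentence adds a new unit
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen, covered


def syllable_frequencies(sentences):
    """Relative frequency of each unit, e.g. to compare against reference values."""
    counts = Counter(u for s in sentences for u in syllables(s))
    total = sum(counts.values())
    return {u: c / total for u, c in counts.items()}
```

A balanced sub-corpus would instead be selected so that the output of something like `syllable_frequencies` stays close to the unit distribution of the full corpus; that selection step is not sketched here.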