Swahili and Somali Query Translations of CLEF Bilingual Dataset Available to Researchers

10/23/19 3:00 AM

Researchers at the Center for Intelligent Information Retrieval (CIIR), part of the University of Massachusetts Amherst College of Information and Computer Sciences, are providing a dataset of Swahili and Somali queries translated from the CLEF 2000-2003 Campaign for Bilingual Ad-Hoc Retrieval Tracks (http://catalog.elra.info/en-us/repository/browse/ELRA-E0008/).

To support research on low-resource languages, the CIIR has produced an extension of 200 queries by translating all four years of bilingual queries (2000-2003) into Swahili and Somali. The topic IDs, C001-C200, correspond to those used for the other languages in the CLEF data. The researchers used a translation organization to translate the title and description of each English query in that topic set into Swahili and Somali. Somali belongs to the Afro-Asiatic language family, and Swahili belongs to the Niger-Congo language family. Both are spoken mainly in Africa.

More information can be found in their paper, “Simulating CLIR Translation Resource Scarcity using High-resource Languages,” by Hamed Bonab, James Allan, and Ramesh Sitaraman, in the Proceedings of the ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2019).