التونسية
Tunisian Arabic Corpus

For more information on how to use this search tool, please
see the Using the Corpus page.

There are several limitations to this system. The accuracy of the
stemmer is currently 88%, which means that roughly one out of every eight words is parsed
incorrectly. For this reason, it's a good idea to try searching for both stems and lemmas.
The parser does not parse plurals
(so بنت and بنات are listed as two
separate stems). In addition, the parser does not account for spelling changes, so
عليها and عليه will be under the stem
علي, separate from على, and
كرهبتها will be under the stem كرهبت,
separate from the stem for الكرهبة. (Code to correct for orthographic
changes is planned for the next version of the parser.)
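
Until the parser handles these orthographic changes, one workaround is to search for more than one spelling of a stem. The sketch below is not the parser's code. It is a hypothetical helper (the names `FINAL_LETTER_VARIANTS` and `candidate_stems` are invented for this example) that covers only the two changes mentioned above: a stem-final ي that is written ى at the end of a word, and a stem-final ت that is written ة at the end of a word.

```python
# Hypothetical helper for illustration; not the corpus parser's actual code.
# Maps the letter a stem ends in before a suffix to its word-final spelling.
FINAL_LETTER_VARIANTS = {
    "ي": "ى",  # e.g. the stem علي corresponds to the word على
    "ت": "ة",  # e.g. the stem كرهبت corresponds to the word كرهبة
}


def candidate_stems(stem: str) -> list[str]:
    """Return the stem itself plus one alternative spelling, if any applies."""
    candidates = [stem]
    if stem and stem[-1] in FINAL_LETTER_VARIANTS:
        candidates.append(stem[:-1] + FINAL_LETTER_VARIANTS[stem[-1]])
    return candidates


for stem in ("علي", "كرهبت", "بنت"):
    print(stem, "->", candidate_stems(stem))
```

This rule over-generates: for بنت it also proposes a spelling ending in ة that is not the intended word, so the candidates are only useful as extra search terms, not as corrected stems.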

About the Project

Because of the many varieties of Arabic, there can never be “one” authoritative corpus of the language.
To achieve the best results for language-learning resources and natural language processing,
corpora for both the standard language and the spoken varieties need to be available. To this end,
Tunisiya.org is a project, led by Karen McNeil and Miled Faiza, seeking to build a four-million-word
corpus of Tunisian Spoken Arabic.

To find out more about the Tunisian Arabic Corpus project, please read the attached
summary paper.

Project Status

There are currently 2,006 texts in the corpus, comprising 881,964 words.
The main categories currently included are displayed in the chart on the right. As you can see,
internet sources are currently dominant. ("Web" is a category for materials that have been harvested
from the internet but not yet put into more specific categories.)

Quarter Million New Words Added to Corpus

Saturday, April 14, 2018

Thanks to some new large texts and technological improvements that enabled the parsing of previously unanalyzed texts, we have now added almost 250,000 parsed words to the corpus.

Special thanks to Emna Souissi, Assistant Professor of Computer Science at ENSIT (University of Tunis), for her contribution of a 25,000 corpus of SMS and Facebook communications. Anyone who uses this data should, in addition to citing the Tunisian corpus, cite her corpus as well:

Jihene Younes, Hadhemi Achour and Emna Souissi, "Constructing linguistic resources for the Tunisian dialect using textual user-generated contents on the social web," in Proceedings of the 1st International Workshop on Natural Language Processing for Informal Text (NLPIT 2015), in conjunction with the International Conference on Web Engineering (ICWE 2015), Rotterdam, The Netherlands, 2015, pp. 3-14.

Corpus is Back Up!

Saturday, March 31, 2018

I've finished the move to the new webhost (as well as upgrading the code from Python 2 to Python 3). I'm still ironing out a few bugs, but everything should be mostly working. This new setup will be much easier to maintain and edit, so expect some improvements to the site soon!

Corpus Temporarily Down

Friday, March 23, 2018

We're in the process of migrating the corpus to another webhost, and the database is going to be down during this transition. Our apologies, and we will have the data back up as soon as possible. Email karenlmcneil@gmail.com if you have an urgent request.

Download Function Improved

Friday, April 29, 2016

The download function has been improved, so that the elements (before context, search term, and after context) appear in the correct order.
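
One way to guarantee a fixed column order in such an export is to name the columns explicitly when writing the file. The sketch below is an assumption about how this could be done with Python's standard `csv` module, not the site's actual code. The column names `before`, `match`, and `after` and the function `concordance_to_csv` are invented for the example.

```python
import csv
import io

# Assumed column names; the real download may label its columns differently.
FIELDNAMES = ["before", "match", "after"]


def concordance_to_csv(rows: list[dict]) -> str:
    """Serialize concordance rows with columns in a fixed left-to-right order."""
    buffer = io.StringIO()
    # DictWriter emits columns in the order given by fieldnames,
    # regardless of the key order inside each row dictionary.
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# The keys are deliberately out of order here.
rows = [{"after": "is online", "match": "corpus", "before": "the new"}]
print(concordance_to_csv(rows))
```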

Corpus To Be Presented at University of Vienna: July 6, 2015

Tuesday, June 23, 2015

Karen will be giving the keynote address at the International Symposium on Tunisian and Libyan Arabic Dialects, at the University of Vienna on July 6, 2015. Her presentation is entitled "Tunisian Arabic Corpus: Creating a Written Corpus of an 'Unwritten' Language." She will also be presenting separately about her research on the use of fī ("in") as a marker of the progressive verbal aspect in Tunisian (and Libyan) Arabic. This work was informed by data from the corpus.

Problem with Search Function

Corpus Presented at Brown University Digital Humanities Workshop

Saturday, October 18, 2014

Karen had an opportunity to present a poster about the corpus at Brown's Digital Islamic Humanities Workshop. Here's the handout, which provides a brief overview of the project and its current status: TACHandout.pdf.

Search Tool Improved

Friday, May 30, 2014

There were several improvements made to the search tool:

A "category" field was added, so you can filter results by text category.

Bug Fix: Added validation to the form, so that it will not allow users to submit empty queries (which used to lead to errors).

Bug Fix: Added validation to check that any regular expression entered is valid.

Right now the new search tool is only here on the index page. There were some difficulties adding it to the corpus results page, but we'll try to straighten them out in the next update.
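
The two validation checks described above could look roughly like the following. This is a minimal sketch using Python's standard `re` module. The function name `validate_query`, its `is_regex` parameter, and the error messages are assumptions made for this example and do not come from the site's code.

```python
import re


def validate_query(query: str, is_regex: bool = False) -> list[str]:
    """Return a list of error messages; an empty list means the query is acceptable."""
    errors = []
    if not query.strip():
        # Reject empty or whitespace-only submissions.
        errors.append("Please enter a search term.")
    elif is_regex:
        try:
            re.compile(query)
        except re.error as exc:
            # re.compile raises re.error for malformed patterns.
            errors.append(f"Invalid regular expression: {exc}")
    return errors


print(validate_query("   "))
print(validate_query("بن(ت", is_regex=True))
print(validate_query("بنا?ت", is_regex=True))
```

Checking the pattern before running the search lets the form show a readable message instead of an error page.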

Google Chrome Problem Fixed

Wednesday, October 10, 2012

It was brought to our attention that the concordance results were not displaying correctly in Google Chrome. The issue has now been fixed.

Stability Improved

Friday, August 31, 2012

We've added a test server, to validate any changes before they go live. So if you've visited the site and been greeted with an unpleasant error message, this should ensure that that doesn't happen anymore.

Large Amount of Web Data Added

Thursday, August 30, 2012

A large number of internet texts have been added to the corpus, using WebBootCaT (through Sketch Engine). These texts will need to be de-duped, and may contain non-Tunisian material, but at a first pass they seem to be largely Tunisian. They come from blogs, forum postings, YouTube comments, and other informal sources. There's also some erotic fiction (expanding the breadth of vocabulary represented into previously uncovered territory), and there may be other fiction as well. This would be a great addition to the corpus, since there is no prose fiction currently represented, with the exception of folktales. Besides being welcome in their own right, these new texts will also point to sites (especially blogs) where more Tunisian texts can be gathered.
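
A first de-duplication pass over harvested texts could drop exact repeats by comparing a hash of each text after light normalization. The sketch below illustrates that idea only. It is not the procedure used for the corpus, the function names are invented, and it will not catch near-duplicates such as the same post quoted with small changes.

```python
import hashlib
import unicodedata


def fingerprint(text: str) -> str:
    """Hash a text after Unicode normalization and whitespace collapsing."""
    normalized = unicodedata.normalize("NFKC", text)
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct text, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


pages = ["first  post\n", "first post", "second post"]
print(drop_exact_duplicates(pages))
```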

Results Now Downloadable

Tuesday, July 24, 2012

A link has been added to the concordance page which allows the search results to be downloaded as a .csv file. The CSV file can then be opened in Microsoft Excel or any text editor for further analysis.
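
The downloaded results can also be analyzed with a short script. The sketch below assumes each row has three columns (before context, search term, after context) and no header row, which may not match the real file. It tallies which word forms were matched, which is useful after a stem search. The function name `count_matched_forms` is invented for the example.

```python
import csv
import io
from collections import Counter


def count_matched_forms(csv_text: str) -> Counter:
    """Count how often each form appears in the (assumed) middle column."""
    reader = csv.reader(io.StringIO(csv_text))
    # Skip blank or malformed rows that have fewer than three columns.
    return Counter(row[1] for row in reader if len(row) >= 3)


sample = (
    "the new,corpus,is online\r\n"
    "in the,corpus,there are\r\n"
    "two large,corpora,were added\r\n"
)
print(count_matched_forms(sample).most_common())
```

To read a saved file instead of a string, pass an open file object to `csv.reader`, opened with `newline=""` as the `csv` module documentation recommends.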

Search Capability Upgraded

Monday, July 23, 2012

The search tool has been upgraded with a morphological parser, allowing users to search for words by the stem and get results for all inflected forms. The parser currently has an accuracy of 88% (recall: 0.868, precision: 0.970, F-score: 0.916). Future versions of the parser will attempt to improve this accuracy.
The parser is rule-based, with some additional statistical processing to improve results. For more details on the internal workings of the parser and how it was developed, an informal paper on the topic is available.
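
The F-score quoted above follows from the reported precision and recall. As a quick check, the sketch below computes the standard F-measure (the harmonic mean of precision and recall when `beta` is 1). The function name `f_score` is chosen for this example.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-measure; beta = 1 weights precision and recall equally."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Values reported for the parser: precision 0.970, recall 0.868.
print(round(f_score(precision=0.970, recall=0.868), 3))  # 0.916
```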