Shared Task Dates

Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS typically occurs at the intersentential, intrasentential (mixing of words from multiple languages in the same utterance) and even morphological (mixing of morphemes) levels. CS presents serious challenges for language technologies such as parsing, Machine Translation (MT), Automatic Speech Recognition (ASR), information retrieval (IR) and extraction (IE), and semantic processing. Traditional techniques trained for one language quickly break down when input from another language is mixed in. Even for problems that are considered solved, such as language identification or part-of-speech tagging, performance degrades at a rate proportional to the amount and level of mixed-language content present.

This workshop aims to bring together researchers interested in solving the problem and to increase community awareness of viable solutions that reduce the complexity of the phenomenon. The workshop invites contributions from researchers working on NLP approaches for the analysis and processing of mixed-language data, especially with a focus on intrasentential code-switching. Topics of relevance to the workshop will include the following:

Development of linguistic resources to support research on code-switched data

On this occasion we are organizing a Named Entity Recognition (NER) shared task on CS data with the purpose of providing
even more resources to the community. The goal is to allow participants to explore the use of supervised, semi-supervised
and/or unsupervised approaches to predict the entity types in CS data. We will release the gold standard data for tuning
and testing systems in the following language pairs: Spanish-English (SPA-ENG) and
Modern Standard Arabic-Egyptian (MSA-EGY). We will use Twitter data for both language pairs.

Participants in this shared task will be required to submit the output of their systems within a prespecified
time window in order to qualify for evaluation. They will also be required to submit a
paper describing their system.

Entity Types

Updates will be given through the workshop Google group, codeswitching_workshop@googlegroups.com,
and the Twitter account @WCALCS. Direct updates
will be sent by email to participants, based on the information provided in the registration form.

Data Release

IMPORTANT

We will send the ENG-SPA data directly to the
registered participants because some of the development tweets can no longer
be fetched. If you have already registered for the shared task but haven't
received the data, please don't hesitate to send us an email.

For both MSA-EGY and ENG-SPA tweets, we provide packages that retrieve, tokenize,
and synchronize the NE types of the training and development data:
MSA-EGY Package and ENG-SPA Package. Instructions
on how to use the packages are included. Additionally, the data has been tagged using
the IOB scheme with the entity types listed above.
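
As an illustration of the IOB scheme, here is a minimal Python sketch that groups per-token tags into entity spans.
The example tweet, the function name iob_to_spans, and the entity types PER and LOC are hypothetical and only serve
the illustration; the actual types and file format are those of the released packages.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An O tag, a B- tag, or an I- tag of a different type closes any open span.
        if tag == "O" or tag.startswith("B-") or tag[2:] != current:
            if current is not None:
                spans.append((current, start, i))
            start, current = None, None
        # Open a new span on B-, or on a stray I- that has no matching open span.
        if tag != "O" and current is None:
            start, current = i, tag[2:]
    # Close a span that runs to the end of the sequence.
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical code-switched tweet with hypothetical entity types.
tokens = ["Ayer", "vi", "a", "Lady", "Gaga", "in", "New", "York"]
tags = ["O", "O", "O", "B-PER", "I-PER", "O", "B-LOC", "I-LOC"]

for entity_type, start, end in iob_to_spans(tags):
    print(entity_type, " ".join(tokens[start:end]))
```

In this sketch, B- marks the first token of an entity, I- marks a continuation, and O marks tokens outside any entity.
A stray I- tag without a preceding B- is treated as the start of a new span, which is one common lenient choice
rather than a rule stated by the organizers.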

Task Details

The language pairs ENG-SPA and MSA-EGY are independent tasks. Although we highly encourage
submissions on both pairs, participants can choose one or both language pairs.

When the test phase starts, we will open a competition in CodaLab. The shared task will be divided into
two competitions, one for ENG-SPA and the other for MSA-EGY. We will add more information when the
test phase date is closer.

NOTE: Participants can use any resources (e.g., pre-trained word embeddings,
gazetteers, etc.) that they consider appropriate for the task. The competition does not
distinguish between systems that use such resources and systems that do not. However,
we highly encourage participants to keep track of performance as resources are added, so that such insights
can be included in the paper.