These are unofficial data splits for the MADCAT Arabic corpus (LDC2013T15, LDC2013T09, LDC2012T15).
LDC provides only training data for these corpora, not the original dev/eval sets, so the original
training data have been split into three disjoint parts to allow performance to be evaluated in the
usual way. The split is made at the document level: no sentences/lines from the same document should
appear in different sets, since each document in the MADCAT data is handwritten/transcribed by a different author.
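
The document-level disjointness described above can be checked with a short script. The following is
a minimal sketch in Python, assuming each split has already been read into a list of document names.
The function name and the example document names are hypothetical and are not part of the distributed files.

```python
from itertools import combinations


def overlapping_documents(splits):
    """Return the documents shared by each pair of splits.

    `splits` maps a split name to an iterable of document names.
    The result maps (split_a, split_b) to the set of shared documents;
    it is empty when the splits are disjoint at the document level.
    """
    overlaps = {}
    for (name_a, docs_a), (name_b, docs_b) in combinations(splits.items(), 2):
        shared = set(docs_a) & set(docs_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical document names, for illustration only.
example_splits = {
    "train": ["doc_001", "doc_002"],
    "dev": ["doc_003"],
    "eval": ["doc_004"],
}

if __name__ == "__main__":
    problems = overlapping_documents(example_splits)
    print("disjoint" if not problems else problems)
```

An empty result means no document contributes lines to more than one set.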

Also, please note that the license applies only to the splits. You still need to obtain the original databases
and respect the databases' licenses!

Each split contains the MADCAT XML name and segment id (s{1,2,3,4}). For example: