Language Data

L33 - Yahoo News Ranked Multi-label Corpus, version 1.0 (59MB)

Tagging textual documents and articles with relevant tags is an important problem for many applications, including Yahoo News, Newsroom, Tumblr, and other textual media platforms. Multilabel learning is at the core of this problem and has recently attracted renewed interest. Many standard datasets are available for this task, but all of them provide features rather than the actual text of the documents. This corpus provides the actual text so that researchers can derive the features best suited to their algorithms. In addition, to the best of our knowledge, this corpus is the only one that ranks the labels of each document by importance.
Related publications to be cited:
1. "RIPML: A Restricted Isometry based Approach to Multilabel Learning", Akshay Soni and Yashar Mehdad. FLAIRS 2017.
2. "DocTag2Vec: An embedding based Multilabel Learning approach for Document Tagging", Sheng Chen, Aasish Pappu, Akshay Soni and Yashar Mehdad. 2017.