Tools and Datasets

The RSS file contains 10,995 items from 33 blog authors. Each author's posts are stored in a separate folder, one post per text file, and each post still has all of its original HTML inline.
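Because the posts keep their HTML, most uses of the data will need to strip the markup first. The following is a minimal Python sketch of one way to do that, not part of the distributed files. It assumes a layout of `<root>/<author>/<post>.txt`; the function names `strip_html` and `load_posts` are hypothetical, and the actual folder and file names may differ.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of `markup` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


def load_posts(root):
    """Map each author folder under `root` to a list of cleaned posts.

    Assumed layout (not confirmed by the dataset): root/<author>/<post>.txt
    """
    posts = {}
    for author_dir in sorted(Path(root).iterdir()):
        if not author_dir.is_dir():
            continue
        posts[author_dir.name] = [
            strip_html(path.read_text(encoding="utf-8", errors="replace"))
            for path in sorted(author_dir.glob("*.txt"))
        ]
    return posts
```

Note that this simple extractor also keeps the contents of any `<script>` or `<style>` elements, so posts that contain them may need extra filtering.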

The tweets file contains 64,808 tweets from 49 authors. Each tweet is stored in a text file named after the tweet's unique ID (assigned by Twitter). Each author's tweets are stored in a file with the author's name on it.

This is the system submitted to the Text Alignment task of PAN'13. Given a pair of source and suspicious files, the program reports the character index and offset of the plagiarized passages in both files. The output is currently in XML format. The program can be downloaded here: text_alignment.tar

Please read the readme file in the root folder, which has instructions for running the program and for changing the parameters.
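Once the program has produced its XML output, the detected passages can be read back with a few lines of Python. The sketch below is an illustration only. It assumes the output follows the annotation format commonly used for the PAN text alignment task, that is, `feature` elements named `detected-plagiarism` with the attributes `this_offset`, `this_length`, `source_reference`, `source_offset` and `source_length`. These names are an assumption and should be checked against the readme and an actual output file; `read_detections` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def read_detections(xml_text):
    """Return one dict per detected passage found in `xml_text`.

    The element and attribute names are assumed, not confirmed by the tool.
    """
    root = ET.fromstring(xml_text)
    detections = []
    for feature in root.iter("feature"):
        if feature.get("name") != "detected-plagiarism":
            continue
        detections.append({
            "suspicious_start": int(feature.get("this_offset")),
            "suspicious_length": int(feature.get("this_length")),
            "source_file": feature.get("source_reference"),
            "source_start": int(feature.get("source_offset")),
            "source_length": int(feature.get("source_length")),
        })
    return detections
```

Given the text of the two input files, a passage can then be recovered by slicing, for example `suspicious_text[start:start + length]`.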