I am looking for a text dataset for a set of experiments that compare the effectiveness of a set of algorithms working at the paragraph level with the same algorithms working at the document level. For this reason, I want a dataset that has both paragraph-level and document-level labels. It's OK if only a subset of the items at each level is labelled. I found a lot of papers that did text processing on both paragraph-level and document-level datasets, but none of their datasets is publicly available.

Edit: I want to run a set of experiments at the paragraph level and see whether the results are better than learning the same concept at the document level. As long as the labels are binary, it doesn't matter what they represent.

Can you clarify what "at the document level" means? So you want a bunch of content marked up in <p> tags, but all of it inside an HTML document (ideally)? I'm just confused, sorry.
– albert♦ Feb 8 '17 at 20:52

It's a big file. For testing, the best option is probably to download the Baby edition (less than 200 MB once uncompressed). As you can see in the sample below, the documents are labelled <wtext>, the paragraphs <p>, the sentences <s> (and the words <w>, of course).
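To make the nesting concrete, here is a minimal Python sketch of how document-level and paragraph-level units could be pulled out of markup shaped like this. The XML fragment is invented for illustration and is not copied from the corpus; the `<corpus>` wrapper, the use of a `type` attribute on <wtext> as a document-level label, and the `split_levels` helper are all assumptions, so check the real files before relying on any of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for illustration; NOT taken from the corpus.
# The <corpus> wrapper and the "type" attribute are assumptions.
SAMPLE = """
<corpus>
  <wtext type="FICTION">
    <p><s><w>It </w><w>was </w><w>late</w></s><s><w>Nobody </w><w>spoke</w></s></p>
    <p><s><w>She </w><w>left</w></s></p>
  </wtext>
  <wtext type="NEWS">
    <p><s><w>Prices </w><w>rose</w></s></p>
  </wtext>
</corpus>
"""


def split_levels(xml_text):
    """Return (documents, paragraphs) as lists of plain dicts.

    Each document carries the (assumed) "type" attribute as its label.
    Each paragraph keeps the id of the document it came from, so labels
    at the two levels can be compared later.
    """
    root = ET.fromstring(xml_text)
    documents, paragraphs = [], []
    for doc_id, wtext in enumerate(root.iter("wtext")):
        para_texts = []
        for p in wtext.iter("p"):
            # Join the <w> tokens of every sentence in this paragraph.
            words = [(w.text or "").strip() for w in p.iter("w")]
            text = " ".join(word for word in words if word)
            para_texts.append(text)
            paragraphs.append({"doc_id": doc_id, "text": text})
        documents.append({
            "doc_id": doc_id,
            "label": wtext.get("type"),  # assumed document-level label
            "text": " ".join(para_texts),
        })
    return documents, paragraphs


documents, paragraphs = split_levels(SAMPLE)
for doc in documents:
    print(doc["doc_id"], doc["label"], doc["text"])
```

Note that this sketch only produces a label at the document level. Paragraph-level labels would have to come from some other annotation, or be added by hand for a subset of the paragraphs.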

Thank you. It's a huge corpus, and I don't understand its structure. Could you please tell me, for example, what I can use for document-level labels and what for paragraph-level labels?
– user85361 Feb 8 '17 at 8:00

It would be good if the paragraph-level labels are not the same as the document-level labels.
– user85361 Feb 8 '17 at 8:11