I'm looking to build a sentence extraction program, that is, a program that pulls the most important sentences out of a body of text. The first step for me is to work out what characteristics important sentences share, and how they differ from unimportant ones.

As such, I am looking for a corpus of documents in which the most important sentences are marked in some way. I already have some metrics in mind, and I would like to do inference on how important each of those metrics is in distinguishing the interesting from the filler, so I need some data.

Edit: I believe that using the output of a different sentence extraction program would defeat the purpose; I am instead looking for the results of human work.

What would be the common assessment of "important"? It seems like a considerable judgment call, probably best left to whatever researcher was going to act on the definition.
– Joe Germuska, Dec 31 '15 at 18:15

@JoeGermuska That is a great question, one I was hoping to avoid by looking at the results of someone else's work. But to be specific, I am ideally looking for the results of "Here is an article, circle the sentences which you feel to be important" addressed to an untrained individual. As such, what "important" means is not exactly defined, but is left to be interpreted by the individual completing the task, and may vary.
– John Madden, Jan 1 '16 at 0:33

@FranckDernoncourt I actually ended up making one by paying folks on Mechanical Turk to label documents. I'll put it on my GitHub and link to it when I get home; thanks for reminding me of this question.
– John Madden, Aug 26 '17 at 15:24

@FranckDernoncourt I actually already had them on there; I've pasted the link in an answer.
– John Madden, Aug 26 '17 at 21:13

1 Answer

I actually ended up paying folks on Mechanical Turk to label sentences as important/unimportant from a couple of news articles I downloaded. There are 410 sentences total, which I have on my GitHub here.
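
For anyone wanting to use labels like these the way the question describes (checking how well each candidate metric separates important from unimportant sentences), here is a minimal sketch. Everything in it is an assumption for illustration: the toy sentences are made up and are not from the labeled set, the two metrics (`word_count`, `is_early`) are hypothetical stand-ins for whatever metrics you have in mind, and the standardized mean difference is just one simple way to compare them, not the method used to build the dataset.

```python
from statistics import mean, pstdev

# Hypothetical toy data: (sentence, position_in_article, is_important).
# These rows are invented for illustration; they are NOT from the labeled set.
LABELED = [
    ("The city council approved the new budget on Tuesday.", 0, True),
    ("It was a long meeting.", 1, False),
    ("The budget raises school funding by ten percent.", 2, True),
    ("Coffee was served.", 3, False),
    ("Officials said the changes take effect in July.", 4, True),
    ("Some people left early.", 5, False),
]


def word_count(sentence, position):
    """Hypothetical metric: number of whitespace-separated tokens."""
    return float(len(sentence.split()))


def is_early(sentence, position):
    """Hypothetical metric: 1.0 if the sentence is among the first three."""
    return 1.0 if position < 3 else 0.0


METRICS = {"word_count": word_count, "is_early": is_early}


def effect_size(important, unimportant):
    """Standardized mean difference between the two labeled groups."""
    pooled = ((pstdev(important) ** 2 + pstdev(unimportant) ** 2) / 2) ** 0.5
    if pooled == 0:
        # Undefined when neither group varies; 0.0 is a neutral placeholder.
        return 0.0
    return (mean(important) - mean(unimportant)) / pooled


def rank_metrics(labeled, metrics):
    """Return (metric_name, effect_size) pairs, strongest separation first."""
    scores = {}
    for name, fn in metrics.items():
        pos = [fn(s, p) for s, p, label in labeled if label]
        neg = [fn(s, p) for s, p, label in labeled if not label]
        scores[name] = effect_size(pos, neg)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


if __name__ == "__main__":
    for name, score in rank_metrics(LABELED, METRICS):
        print(f"{name}: {score:.2f}")
```

A larger absolute value means the metric differs more between the two groups. With real data you would swap `LABELED` for the labeled sentences and would probably want a proper model (for example, logistic regression) rather than comparing metrics one at a time.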