
| *One line summary* | Create a word frequency list from a Lucene index and try to ascertain the subject matter of the collection that the index was created against. |
| *Detailed description* | The solution for [AQuA:Characterising Externally Generated Content] generated a Lucene index of the collection content. A small piece of Java code was developed to scan through the terms in the text-content field of the Lucene documents (the metadata was not trawled). A list was created of the terms in the index and the frequency with which each term occurred. \\ \\ The initial results were disappointing, as Lucene indexed all of the words, and the most frequent terms were simply words that occur commonly in plain English. The General Service List ([http://jbauman.com/aboutgsl.html|http://jbauman.com/aboutgsl.html]) is a list of commonly occurring words, with their frequencies, deemed most useful to people learning English. Andrew Jackson used this list to determine how much more frequently each word occurred in the Lucene index than in "common English", as defined by the GSL. |
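The original Java code is not reproduced on this page. As a rough, dependency-free sketch of the relative-weight idea described above (class and method names here are hypothetical, and the tiny in-line maps stand in for the real index term counts and the GSL), the calculation boils down to comparing each term's observed proportion in the index against its expected proportion in the reference list:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RelativeWeights {

    // For each term counted in the index, divide its observed proportion by
    // its expected proportion in the reference list. A ratio well above 1
    // marks a word that is over-represented relative to common English.
    static Map<String, Double> relativeWeights(Map<String, Long> indexCounts,
                                               Map<String, Long> referenceCounts) {
        long indexTotal = indexCounts.values().stream().mapToLong(Long::longValue).sum();
        long refTotal = referenceCounts.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : indexCounts.entrySet()) {
            Long refCount = referenceCounts.get(e.getKey());
            if (refCount == null) {
                continue; // term absent from the reference list: no baseline
            }
            double observed = (double) e.getValue() / indexTotal;
            double expected = (double) refCount / refTotal;
            weights.put(e.getKey(), observed / expected);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Hypothetical sample counts, not real collection data.
        Map<String, Long> index = new LinkedHashMap<>();
        index.put("the", 900L);
        index.put("archive", 400L);

        Map<String, Long> reference = new LinkedHashMap<>();
        reference.put("the", 6000L);
        reference.put("archive", 10L);

        relativeWeights(index, reference).forEach((word, ratio) ->
                System.out.printf("%s %.2f%n", word, ratio));
    }
}
```

Sorting the resulting ratios in descending order and keeping the top entries would yield a list like the top-50 chart shown below, with subject-specific vocabulary floating to the top while function words like "the" drop away.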

!sampleTop50relativeWeights.png|align=center,border=1,width=300!

Of course, a lovely next step would be to link each word to the corresponding search results, allowing the context in which each word is used to be explored.