Financial news, which is otherwise overrepresented in the data source, was removed, and instead of listing the raw keywords and headlines we manually described the detected trends.
These are the top 50 trends, ordered chronologically, with the top 10 highlighted in bold.
January
2014-01-29: Obama's state of the union address

September
2014-09-05: NATO summit with respect to IS and Ukraine conflict
2014-09-11: Scottish referendum upcoming - poll results are close
2014-09-23: U.N. on legality of U.S. air strikes in Syria against ISIS
2014-09-26: Star manager Bill Gross leaves Allianz/PIMCO for Janus

December
2014-12-11: CIA prisoner and U.S. torture centers revealed
2014-12-15: Sydney cafe hostage siege
2014-12-17: U.S. and Cuba relations improve unexpectedly
2014-12-18: Putin criticizes NATO, U.S., Kiev
2014-12-28: AirAsia flight QZ-8501 missing
Similar to the result for 2013, this list covers many key geopolitical events of 2014.
There is probably one "false positive" in it: 2014-04-09 has a lot of articles talking about "experience" and "views", but not all of them refer to the same topic (we did not do topic modeling yet).

There are also some events missing that we would have liked to appear; many of these narrowly missed the top 50 but do appear in the top 100, such as the Sony cyber-attack (#51) and the Ferguson riots on November 11 (#66).

Significant Trends in Large Data Streams

In this model, a trend is considered significant if the number of articles is at least 3 standard deviations above the expected value, a very classic definition from statistics. To make this work on streaming data, we used exponentially weighted averages and standard deviations. To reduce spurious trends (in particular first occurrences of terms), we added a simple bias term, akin to Laplacian correction, that removes such background noise. The main challenge is to scale this up to every term, and every term combination, in the data set: Facebook is mentioned every day, but the combination of Facebook and WhatsApp rarely occurred until Facebook acquired WhatsApp. Single terms can also be useful to track, as the chart below shows: Ukraine trended most when the Malaysia Airlines plane was shot down in July 2014 (bottom chart), although it had more coverage in March 2014.
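The significance test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exact implementation: the class name `EwmaTrendDetector`, the parameter names `alpha` (smoothing weight), `beta` (bias term) and `threshold`, their default values, and the precise way the bias enters the score are all choices made for this example.

```python
import math


class EwmaTrendDetector:
    """Streaming z-score with exponentially weighted mean and variance.

    All names and defaults here are illustrative assumptions.
    """

    def __init__(self, alpha=0.1, beta=1.0, threshold=3.0):
        self.alpha = alpha          # weight of the newest observation
        self.beta = beta            # bias term that damps rare or new terms
        self.threshold = threshold  # number of standard deviations required
        self.mean = 0.0             # exponentially weighted average
        self.var = 0.0              # exponentially weighted variance

    def score(self, x):
        # Assumed form of the bias: the expected value is never taken to be
        # below beta, and beta is also added to the standard deviation.
        expected = max(self.mean, self.beta)
        return (x - expected) / (math.sqrt(self.var) + self.beta)

    def update(self, x):
        """Score x against the history so far, then fold x into the history."""
        z = self.score(x)
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
        return z, z >= self.threshold


if __name__ == "__main__":
    detector = EwmaTrendDetector()
    daily_counts = [10, 12, 9, 11, 10, 11, 9, 10, 12, 10] * 3 + [60]
    for day, count in enumerate(daily_counts):
        z, significant = detector.update(count)
        if significant and day >= 10:  # skip the warm-up period
            print(f"day {day}: count={count} z={z:.1f}")
```

Because the statistics start at zero, the first few observations of a frequent term still score high in this sketch; a warm-up period (or a larger `beta`) is needed before the scores become meaningful.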

To make this approach scale up to monitoring every word and word pair mentioned over time, we employ a classic hashing/sketching trick. We accept heavy-hitters style inaccuracy in rare terms, but by using multiple hash functions we will, with high probability, not miss any frequently mentioned trend: in order to be missed, a trend needs to collide with a more frequent term in every hash function.
With a fixed amount of memory for the hash table (we used 256 MB), we can track trends this way without specifying keywords in advance, even on large data sets such as Twitter. Using our algorithm, data sets such as this much smaller news data set can be processed on a single Raspberry Pi (Model B, 512 MB).
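To illustrate the idea of counting every word and word pair in fixed memory with several hash functions, here is a generic count-min sketch in Python. It is a stand-in for the general technique rather than the data structure actually used: the names `CountMinSketch` and `add_document`, the table size, and the choice of salted BLAKE2 hashes are assumptions made for this example.

```python
import hashlib
from itertools import combinations


class CountMinSketch:
    """Approximate counters in fixed memory (generic sketch, illustrative)."""

    def __init__(self, width=1 << 16, depth=4):
        self.width = width  # buckets per hash function
        self.depth = depth  # number of hash functions
        self.tables = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One salted hash per row; each salt gives a different hash function.
        data = key.encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, bucket in self._buckets(key):
            self.tables[row][bucket] += count

    def estimate(self, key):
        # Collisions can only inflate a bucket, so the minimum over all rows
        # is the tightest estimate and never undercounts.
        return min(self.tables[row][bucket] for row, bucket in self._buckets(key))


def add_document(sketch, words):
    """Count each distinct word and each unordered word pair once per document."""
    unique = sorted(set(words))
    for word in unique:
        sketch.add(word)
    for a, b in combinations(unique, 2):
        sketch.add(a + "\x00" + b)  # separator keeps pair keys unambiguous


if __name__ == "__main__":
    sketch = CountMinSketch()
    add_document(sketch, ["facebook", "buys", "whatsapp"])
    add_document(sketch, ["facebook", "earnings"])
    print(sketch.estimate("facebook"))
    print(sketch.estimate("facebook\x00whatsapp"))
```

A rare term can be overestimated when it shares buckets with frequent terms, which is the inaccuracy accepted above; a frequent term is only obscured if it collides with an even more frequent one in every row. In a full pipeline, each bucket would feed a moving average and variance rather than a plain counter.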

Visualizing Trend Clusters

Essential to understanding the results is visualization. Showing absolute numbers and the significance as interpreted by the algorithm is helpful, but there may be many words and word pairs trending at the same time. To visualize this, we also created a semantic word cloud, where the words are not placed randomly (as is common with word clouds) but reflect the association of words with each other. In the following image, you can see July 20, when two major clusters trended: many people were killed in the Israel-Gaza conflict (green cluster on the left), and fighting with pro-Russian rebels in eastern Ukraine also caused many fatalities. Links in this figure indicate terms that trend together; colors indicate a cluster structure obtained from this data.

Bio: Dr. Erich Schubert is a research and teaching assistant at the Ludwig-Maximilians-Universität München, Germany. He finished his PhD in 2013 on "Generalized and Efficient Outlier Detection for Spatial, Temporal, and High-Dimensional Data Mining" and is one of the lead authors of the open-source ELKI data mining toolkit. He is expanding his research into text mining and big data analysis, and is interested in post-doc and assistant professor opportunities in his research areas.