Anomaly detection

By anomaly, we refer to a pattern that is unusual, unexpected or rare in the data under analysis. QMiner can detect anomalies on several types of static or dynamic data.

Anomalies in text streams

Let’s try to detect novel topics published in the media. Novel (or “anomalous”) topics are topics that were never or rarely observed in the past. We assume that there is a stream of articles being pushed to the system.

Next, we define the pipeline that processes the news articles. The pipeline has several processing modules: the model, the trigger, the anomaly detection algorithm and the actual alert.

To keep a model of the past articles, we need a feature space aggregator on the article store. The model uses the text from the “Text” field of each article and applies the “tfidf” weight to compute the importance of individual words. The model needs at least two articles to be initialized and considers sequences of up to two words (the n-grams parameter).
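To make the weighting concrete, here is a minimal standalone sketch of TF-IDF over single words and word pairs. It does not use QMiner and is not its implementation; the helper names `tokenize` and `tfidfVectors`, the tokenization rule and the sample sentences are all assumptions made for illustration.

```javascript
// Standalone illustration of TF-IDF with n-grams; NOT QMiner's implementation.
// tokenize, tfidfVectors and sampleDocs are hypothetical names for this sketch.

// Split text into lowercase word tokens and emit all n-grams of 1..maxN words.
function tokenize(text, maxN = 2) {
    const words = text.toLowerCase().match(/[a-z0-9]+/g) || [];
    const terms = [];
    for (let n = 1; n <= maxN; n++) {
        for (let i = 0; i + n <= words.length; i++) {
            terms.push(words.slice(i, i + n).join(" "));
        }
    }
    return terms;
}

// Build one L2-normalised TF-IDF vector (a Map from term to weight) per document.
function tfidfVectors(docs, maxN = 2) {
    const tokenized = docs.map(doc => tokenize(doc, maxN));

    // Document frequency: in how many documents does each term occur?
    const docFreq = new Map();
    for (const terms of tokenized) {
        for (const term of new Set(terms)) {
            docFreq.set(term, (docFreq.get(term) || 0) + 1);
        }
    }

    return tokenized.map(terms => {
        // Raw term counts for this document.
        const vec = new Map();
        for (const term of terms) vec.set(term, (vec.get(term) || 0) + 1);

        // Weight each count by the inverse document frequency.
        let norm = 0;
        for (const [term, tf] of vec) {
            const weight = tf * Math.log(docs.length / docFreq.get(term));
            vec.set(term, weight);
            norm += weight * weight;
        }

        // Normalise to unit length so documents of different sizes are comparable.
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (const [term, weight] of vec) vec.set(term, weight / norm);
        }
        return vec;
    });
}

// Hypothetical sample input.
const sampleDocs = [
    "Prevc wins ski jumping event",
    "Prevc wins again in ski jumping",
    "Stock markets fall sharply"
];
const vectors = tfidfVectors(sampleDocs);
```

In this sketch, a term that occurs in only one document (such as “event”) receives a higher weight than a term shared by several documents (such as “prevc”), which is the property that lets rare vocabulary stand out.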

This prototype examines articles at a constant rate. We simulate this rate with an artificial time series, implemented using the tick stream aggregator on the “Articles” store. The model defined above is updated at this rate.

Novelty detection is implemented using the Nearest Neighbors anomaly detector aggregator. The aggregator takes timestamped features as input: the timestamps are provided by the previously defined tick aggregator and the features by the feature space aggregator. It considers a window of the 200 most recent articles and computes the distances between them. The most distant articles are considered anomalies. The rate of anomalies can be tuned using the “rate” parameter. Play with it on RunKit!
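The idea of scoring an article by its distance to the nearest article in a sliding window can be sketched without QMiner. The class below is an illustration only: the class name, the Euclidean distance, the dense toy vectors and the fixed distance threshold are assumptions, and the real aggregator's scoring and its “rate” parameter may well behave differently.

```javascript
// Standalone sketch of nearest-neighbour novelty scoring over a sliding window.
// NOT QMiner's implementation: the fixed threshold and Euclidean distance
// are assumptions made for this illustration.
class WindowedNoveltyDetector {
    constructor(windowSize, threshold) {
        this.windowSize = windowSize; // how many recent vectors to remember
        this.threshold = threshold;   // distance above which a vector is novel
        this.window = [];
    }

    // Euclidean distance between two dense vectors of equal length.
    static distance(a, b) {
        let sum = 0;
        for (let i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) ** 2;
        }
        return Math.sqrt(sum);
    }

    // Score a new vector against the window, then add it to the window.
    add(vec) {
        let score = 0;
        if (this.window.length > 0) {
            // Distance to the closest vector currently in the window.
            score = Math.min(
                ...this.window.map(v => WindowedNoveltyDetector.distance(v, vec))
            );
        }
        // The very first vector has nothing to be compared with.
        const isNovel = this.window.length > 0 && score > this.threshold;

        this.window.push(vec);
        if (this.window.length > this.windowSize) {
            this.window.shift(); // forget the oldest vector
        }
        return { score, isNovel };
    }
}

// Hypothetical toy stream: three similar vectors followed by an outlier.
const detector = new WindowedNoveltyDetector(200, 1.0);
const stream = [[1, 0], [0.9, 0.1], [1, 0.05], [0, 5]];
const flags = stream.map(vec => detector.add(vec).isNovel);
```

Only the last vector is flagged here, because it lies far from everything already in the window. Lowering the threshold flags more items, which is loosely analogous to tuning how many anomalies are reported.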

Now that the processing pipeline is defined, we can start pushing the stream of data into the system. Articles are loaded from our server.

// Define global variables for storing the results.
let results = [];
let rates = [];
let count = [0, 0, 0, 0];

// For this example, we use English articles from the international media
// about the ski champion Peter Prevc. Other data sets are also available.
let ARTICLES_URL = "http://atena.ijs.si/data/novelty/PeterPrevcENG.json";
// let ARTICLES_URL = "http://atena.ijs.si/data/novelty/MicrosoftENG.json";
// let ARTICLES_URL = "http://atena.ijs.si/data/novelty/EuropeanComissionENG.json";

// Download the data set as plain text (one JSON-encoded article per line).
let got = URL => require("got")(URL, { json: false });
let response = await got(ARTICLES_URL);
let lines = response['body'].split('\n');

lines.forEach(line => {
    if (line !== "") {
        let currentArticle = JSON.parse(line);

        // Push the article into the store; this triggers the stream aggregators.
        articles.push({
            Time: currentArticle["date"] + "T" + currentArticle["time"],
            Text: currentArticle["body"],
            Title: currentArticle["title"],
            Number: 1
        });

        // Record the anomaly level reported for this article.
        results.push({ rate: anomaly.getInteger(), title: currentArticle["title"] });
        rates.push(anomaly.getInteger());
    }
});
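Once the stream has been processed, the collected results can be summarized. The sketch below tallies entries per anomaly level. It assumes that each `rate` is an integer from 0 to 3, which the four-element `count` array suggests but this section does not state. The helper `tallyByRate` and the sample entries are hypothetical.

```javascript
// Standalone sketch: tally results by anomaly level.
// Assumes rate is an integer in [0, levels); tallyByRate and sampleResults
// are hypothetical and not part of the original example.
function tallyByRate(results, levels = 4) {
    const count = new Array(levels).fill(0);
    for (const { rate } of results) {
        // Ignore values outside the assumed range instead of failing.
        if (Number.isInteger(rate) && rate >= 0 && rate < levels) {
            count[rate] += 1;
        }
    }
    return count;
}

// Hypothetical entries in the same shape as the results array above.
const sampleResults = [
    { rate: 0, title: "Article A" },
    { rate: 2, title: "Article B" },
    { rate: 0, title: "Article C" },
    { rate: 3, title: "Article D" }
];

const sampleCount = tallyByRate(sampleResults);

// Titles of the entries at or above a chosen level (here 2, picked arbitrarily).
const flaggedTitles = sampleResults
    .filter(entry => entry.rate >= 2)
    .map(entry => entry.title);
```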