1. Introduction to Opprentice

Challenges the system faces:

Definition Challenges: It is difficult to precisely define anomalies in reality.

Detector Challenges: To provide reasonable detection accuracy, selecting the most suitable detector requires both algorithm expertise and domain knowledge about the given service KPI (Key Performance Indicator). To address the definition challenge and the detector challenge, we advocate using supervised machine learning techniques.

2. Background:

KPIs and KPI Anomalies:

KPIs: KPI data are time series in the format (timestamp, value). In this paper, Opprentice focuses on three kinds of KPIs: the search page view (PV), which is the number of successfully served queries; the number of slow responses of search data centers (#SR); and the 80th percentile of search response time (SRT).

Anomalies: KPI time series can present several unexpected patterns (e.g. jitters, slow ramp-ups, sudden spikes and dips) at different severity levels, such as a sudden drop of 20% or 50%.

4. Opprentice’s Design:

Architecture: Operators label the data, and numerous detectors function as feature extractors for the data.

Label Tool:

Operators label anomalies manually, using the mouse in a labeling tool.

Detectors:

(i) Detectors As Feature Extractors:

For each parameterized detector, we sample its parameters to obtain several fixed detectors. We call a detector with specific sampled parameters a (detector) configuration. Thus a configuration acts as a feature extractor:

data point + configuration (detector + sampled parameters) -> feature

(ii) Choosing Detectors:

Opprentice can find suitable detectors among a broadly selected set and still achieve relatively high accuracy.

We implement 14 widely used detectors in Opprentice:

"Diff": it simply measures anomaly severity using the differences between the current point and the point of the last slot, the point of the last day, and the point of the last week.

"MA of diff": it measures severity using the moving average of the differences between the current point and the point of the last slot.
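The two simple detectors above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of the absolute difference as the severity, and the `window` parameter are assumptions made for the example.

```python
from statistics import mean

def diff_severity(series, i, lag):
    """Severity of point i: absolute difference to the point `lag` slots earlier.

    lag=1 compares with the last slot; a lag of one day or one week
    (expressed in slots) gives the other two variants.
    """
    if i < lag:
        return None  # not enough history
    return abs(series[i] - series[i - lag])

def ma_of_diff_severity(series, i, window):
    """Severity of point i: moving average of last-slot differences
    over the `window` slots ending at i."""
    if i < window:
        return None  # not enough history
    diffs = [abs(series[j] - series[j - 1]) for j in range(i - window + 1, i + 1)]
    return mean(diffs)

series = [10, 11, 10, 12, 11, 30]  # made-up KPI values
print(diff_severity(series, 5, lag=1))
print(ma_of_diff_severity(series, 5, window=3))
```

Each (function, parameter) pair here plays the role of one configuration, producing one feature value per data point.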

The other 12 detectors come from previous literature. Among these detectors, there are two variants of detectors using MAD (Median Absolute Deviation) around the median, instead of the standard deviation around the mean, to measure anomaly severity.
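To see why MAD around the median can be preferable to the standard deviation around the mean, the sketch below compares the two severity measures on a history that already contains one outlier. The function names and the sample values are hypothetical, and the way the paper's detectors normalize severity may differ.

```python
from statistics import mean, median, pstdev

def std_severity(history, value):
    """Distance from the mean, in units of standard deviation."""
    sigma = pstdev(history)
    return abs(value - mean(history)) / sigma if sigma else 0.0

def mad_severity(history, value):
    """Distance from the median, in units of MAD (median absolute deviation)."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    return abs(value - med) / mad if mad else 0.0

history = [10, 11, 9, 10, 12, 10, 100]  # one past outlier (100)
value = 20                              # new point to score

print(std_severity(history, value))  # small: the old outlier inflates mean and std
print(mad_severity(history, value))  # large: median and MAD ignore the outlier
```

The past outlier drags the mean and the standard deviation upward, so the new point looks normal under the first measure, while the median-based measure still flags it.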

(iii) Sampling Parameters:

There are two methods for sampling the parameters of detectors.

(1) The first one is to sweep the parameter space. For example, in EWMA, we can sample five values of the smoothing parameter $\alpha$ to obtain 5 typical features from EWMA. Holt-Winters has three [0,1]-valued parameters $\alpha$, $\beta$, and $\gamma$; sampling each of them from a small set of values yields one feature per combination of the three.

(2) The second one is to estimate the parameters from the data. For example, in ARIMA, we can estimate the "best" parameters from the data and generate only one set of parameters, or one configuration, for each detector.
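Sweeping the parameter space can be sketched as below. The grid values, the dictionary layout of a configuration, and the simple EWMA severity function are all assumptions chosen for illustration; only the idea of "one sampled parameter set = one configuration = one feature" comes from the text.

```python
from itertools import product

def ewma_severity(series, alpha):
    """|actual - EWMA prediction| for every point after the first."""
    prediction = series[0]
    severities = []
    for x in series[1:]:
        severities.append(abs(x - prediction))
        prediction = alpha * x + (1 - alpha) * prediction
    return severities

# Hypothetical grids; the values are illustrative only.
EWMA_ALPHAS = [0.1, 0.3, 0.5, 0.7, 0.9]
HOLT_WINTERS_GRID = [0.2, 0.4, 0.6, 0.8]

# One configuration per sampled parameter set.
ewma_configs = [("ewma", {"alpha": a}) for a in EWMA_ALPHAS]
holt_winters_configs = [
    ("holt_winters", {"alpha": a, "beta": b, "gamma": g})
    for a, b, g in product(HOLT_WINTERS_GRID, repeat=3)
]

series = [10, 10, 10, 20]  # made-up KPI values
# Each EWMA configuration yields one feature column for the series.
features = {cfg["alpha"]: ewma_severity(series, cfg["alpha"]) for _, cfg in ewma_configs}
print(len(ewma_configs), len(holt_winters_configs))
print(features[0.5])
```

With a grid of k values per parameter, a detector with three parameters produces k^3 configurations, which is why sweeping quickly multiplies the number of features.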

Random Forest is an ensemble classifier using many decision trees. Its main principle is that a group of weak learners (e.g. individual decision trees) can together form a strong learner. To grow different trees, a random forest adds some elements of randomness. First, each tree is trained on a subset sampled from the original training set. Second, instead of evaluating all the features at each level, the trees only consider a random subset of the features each time. The random forest combines those trees by majority vote. These properties of randomness and ensembling make a random forest more robust to noise, and better at coping with irrelevant and redundant features, than a single decision tree.
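The two sources of randomness and the voting step can be illustrated with a toy ensemble. This is deliberately not a real random forest: it uses depth-one decision stumps instead of full trees, and all names and data are made up. A real system would rely on a library implementation such as the one in scikit-learn.

```python
import random

def train_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) rule 'value > threshold => anomaly'
    with the fewest training errors among the given features."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            errors = sum((r[f] > threshold) != bool(y) for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, f, threshold)
    return best[1], best[2]

def train_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Randomness 1: each learner sees a bootstrap sample of the training set.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Randomness 2: each learner considers only a random subset of features.
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def anomaly_probability(forest, row):
    """Fraction of learners voting 'anomaly'; compare it with a cThld to classify."""
    votes = [row[f] > t for f, t in forest]
    return sum(votes) / len(votes)

# Made-up feature vectors (e.g. severities from two configurations) and labels.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (9, 9), (8, 9), (9, 8)]
labels = [0, 0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels)
print(anomaly_probability(forest, (1, 1)), anomaly_probability(forest, (9, 9)))
```

Majority vote corresponds to classifying a point as anomalous when this fraction is at least 0.5; using the fraction itself is what makes a configurable threshold possible.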

Configuring cThlds: (computing and predicting the thresholds)

(i) Methods to select proper cThlds (offline part):

We need to configure cThlds rather than use the default one (e.g. 0.5) for two reasons.

(1) First, when faced with imbalanced data (anomalous data points are much less frequent than normal ones in the data sets), machine learning algorithms typically fail to identify the anomalies (low recall) if they use the default cThld (e.g. 0.5).

(2) Second, operators have their own preference regarding the precision and recall of anomaly detection.

The metrics that combine precision and recall to select a cThld are:

(1) F-Score: F-Score = 2*precision*recall/(precision+recall).

(2) SD(1,1): it selects the point with the shortest Euclidean distance to the upper right corner where the precision and the recall are both perfect.

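(3) PC-Score: (the paper uses this metric to select a suitable threshold)

If r >= R and p >= P, then PC-Score(r, p) = 2*r*p/(r+p) + 1; otherwise PC-Score(r, p) = 2*r*p/(r+p). Here, R and P come from the operators' preference "recall >= R and precision >= P". Since the F-Score is at most 1, adding 1 guarantees that every point satisfying the preference scores higher than any point that does not. We can therefore choose the cThld corresponding to the point with the largest PC-Score.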
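The three metrics can be written down directly from their definitions. In the sketch below the function names, the `(cThld, recall, precision)` tuple layout, and the sample precision-recall points are assumptions made for illustration.

```python
import math

def f_score(r, p):
    return 2 * r * p / (r + p) if (r + p) > 0 else 0.0

def sd11(r, p):
    """Euclidean distance to the upper right corner (recall=1, precision=1)."""
    return math.hypot(1 - r, 1 - p)

def pc_score(r, p, R, P):
    """F-Score plus a bonus of 1 when the operators' preference is met."""
    bonus = 1.0 if (r >= R and p >= P) else 0.0
    return f_score(r, p) + bonus

def select_cthld(pr_points, R, P):
    """pr_points: list of (cThld, recall, precision) tuples."""
    return max(pr_points, key=lambda t: pc_score(t[1], t[2], R, P))[0]

# Made-up points on a precision-recall curve.
pr_points = [(0.2, 0.95, 0.50), (0.4, 0.80, 0.70), (0.6, 0.55, 0.98)]

by_f_score = max(pr_points, key=lambda t: f_score(t[1], t[2]))[0]
by_sd11 = min(pr_points, key=lambda t: sd11(t[1], t[2]))[0]
by_pc_score = select_cthld(pr_points, R=0.9, P=0.4)
print(by_f_score, by_sd11, by_pc_score)
```

With the preference "recall >= 0.9 and precision >= 0.4", only the first point qualifies, so PC-Score picks it even though another point has a higher plain F-Score.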

(ii) EWMA Based cThld Prediction:

In online detection, we need to predict cThlds for detecting future data.

We use EWMA to predict the cThld of the i-th week (or the i-th test set) based on the historical best cThlds. Specifically, EWMA works as follows:

If $i = 1$, then $cThld_i^p$ = 5-fold prediction;

Else ($i > 1$), $cThld_i^p = \alpha \cdot cThld_{i-1}^b + (1 - \alpha) \cdot cThld_{i-1}^p$, where $cThld_{i-1}^b$ is the best cThld of the (i-1)-th week, $cThld_i^p$ is the predicted cThld of the i-th week (and also the one used for detecting the i-th week data), and $\alpha \in [0, 1]$ is the smoothing constant.

For the first week, we use 5-fold cross-validation to initialize $cThld_1^p$. As $\alpha$ increases, EWMA gives the recent best cThlds more influence in the prediction. We use a fixed value of $\alpha$ in this paper.
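A minimal sketch of this prediction, assuming the standard EWMA recurrence (the new prediction is a weighted sum of last week's best cThld and last week's prediction). The function name, the sample cThlds, and the value of `alpha` are illustrative assumptions; the first-week prediction is taken as given rather than computed by cross-validation.

```python
def predict_cthlds(best_cthlds, initial_prediction, alpha):
    """Predict the cThld for each week.

    best_cthlds: best cThld observed for weeks 1..n (known only afterwards).
    initial_prediction: week-1 prediction, e.g. obtained by 5-fold cross-validation.
    Returns predictions for weeks 1..n+1.
    """
    predictions = [initial_prediction]
    for best in best_cthlds:
        # Weight alpha on the most recent best cThld, the rest on the old prediction.
        predictions.append(alpha * best + (1 - alpha) * predictions[-1])
    return predictions

# Made-up history: the best cThlds of weeks 1 and 2 turned out to be 0.4 and 0.6.
predictions = predict_cthlds([0.4, 0.6], initial_prediction=0.5, alpha=0.5)
print(predictions)
```

A larger `alpha` makes the prediction track the latest best cThld more closely, while a smaller one smooths over week-to-week fluctuations.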

5. Evaluation

In the evaluation of Opprentice, red denotes Opprentice's approach and black denotes the other, additional approaches.

Opprentice has 14 detectors with about 9500 lines of Python, R and C++ code. The machine learning block is based on the scikit-learn library.

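As a further exploration of intelligent operations systems, the next paper is titled "Focus: Shedding Light on the High Search Response Time in the Wild", from Professor Dan Pei of Tsinghua University. Its goal is, once high search response time has been observed during operations, to use machine learning algorithms to discover the causes of and rules behind the anomalies. The system (Focus) has been applied to 2.5 months of data and has analyzed billions of logs. The main content of the paper is described in detail below.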

Problem statement:

To help search operators debug HSRT (high search response time), Focus is a search log analysis framework that answers three questions:

(1) What is the HSRT condition?

(2) Which HSRT condition types are prevalent across days?

(3) How does each attribute affect SRT in those prevalent HSRT condition types?

Solution:

Focus has one component for each of the above questions:

(1) A decision tree based classifier to identify HSRT conditions in search logs of each day;

(2) A clustering based condition type miner to combine similar HSRT conditions into one type, and find the prevalent condition types across days, following Occam's razor principle;

(3) An attribute effect estimator to analyze the effect of each individual attribute on SRT within a prevalent condition type.

Preliminaries:

(A) Search Logs:

For each measured query, the search log records two types of data: (1) SRT and SRT components, and (2) query attributes.

(1) SRT and SRT components: (feature level)

$t_0$ is when a query is submitted; $t_1$ is when the result HTML file has been downloaded; $t_2$ is when the browser finishes parsing the HTML; $t_3$ is when the page is completely rendered. SRT is measured by $t_3 - t_0$, the user-perceived search response time.

$T_{server}$ is the server response time of the HTML file, which is recorded by servers; $T_{net}$ is the network transmission time of the HTML file; $T_{browser}$ is the browser parsing time of the HTML; $T_{other}$ is the remaining time spent before the page is rendered, e.g. download time of images from image servers.
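The relation between the four timestamps and the four SRT components can be sketched as below. The field names are hypothetical, and the way each component is derived from the timestamps (in particular, network time as download time minus the server-recorded time) is an assumption consistent with the descriptions above, not a formula quoted from the paper.

```python
from dataclasses import dataclass

@dataclass
class QueryTiming:
    t_submit: float     # query submitted
    t_html_done: float  # result HTML downloaded
    t_parsed: float     # browser finished parsing the HTML
    t_rendered: float   # page completely rendered
    t_server: float     # server response time, recorded by servers

def srt_breakdown(q):
    """Split SRT into server, network, browser, and remaining time (seconds)."""
    return {
        "srt": q.t_rendered - q.t_submit,
        "server": q.t_server,
        # Assumed: time to get the HTML that was not spent on the server.
        "net": (q.t_html_done - q.t_submit) - q.t_server,
        "browser": q.t_parsed - q.t_html_done,
        "other": q.t_rendered - q.t_parsed,
    }

# Made-up timings for one query.
q = QueryTiming(t_submit=0.0, t_html_done=0.4, t_parsed=0.5, t_rendered=1.2, t_server=0.25)
parts = srt_breakdown(q)
print(parts)
```

Under these assumptions the four components add up to the SRT, so an increase in SRT can be attributed to one or more of them.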

(2) Query Attributes: (feature level)

The search logs record the following attributes for each measured query:

(iii) Location: The client IP is converted to its geographic location. In total, there are 32 provinces.

(iv) #Image: the number of embedded images in the result page.

(v) Ads: whether a result page contains paid advertisement links.

(vi) Loading Mode: The loading mode of a result page can be either synchronous or asynchronous.

(vii) Background page views: On the service side, the search engine S also post-analyzes the logs and generates the background page views. The background PVs (page views) for a query q are measured by the number of queries served within 30 seconds before and after q is served. This reflects the average search request load at the time q is served. Due to confidentiality constraints, we normalize specific background PVs by the maximum value. (This post-hoc analysis computes the necessary features, which are then fed into the machine learning model of the Focus system.)

(B) HSRT and HSRT Conditions: (sample level)

Usually, we can use the cumulative distribution function (CDF) of SRT in the search logs to decide what counts as high search response time (HSRT). In this paper, we define HSRT as an SRT longer than 1s.

Challenges of Identifying HSRT Conditions: Identifying HSRT conditions in multi-dimensional search logs raises the following difficulties.

(a) Naive single-dimension methods, such as pair-wise correlation analysis, are inefficient.

(b) Attributes can be interdependent, which means that the Naive Bayes method may not be applicable in this situation.

Key ideas and system overview

Condition is a combination of attributes and specific values in search logs.

HSRT Condition is a condition that covers at least 1% of total queries and has a fraction of HSRT larger than the global level:

(# of HSRT queries in a HSRT condition / # of queries in a HSRT condition) > (# of HSRT queries / # of queries). This definition is only used to assign labels, i.e. to decide what counts as HSRT; in practice we can adopt a different definition depending on the scenario, for example one based on return codes.
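The definition translates into a short check over a day's logs. In this sketch the log records are plain dictionaries with hypothetical field names (`srt`, `images`, `ads`), a condition is represented as a predicate function, and the 1% coverage and 1s HSRT threshold are taken from the text as default arguments.

```python
def is_hsrt_condition(logs, condition, min_coverage=0.01, hsrt_threshold=1.0):
    """True if `condition` covers enough queries and has an HSRT fraction
    above the global HSRT fraction."""
    matched = [q for q in logs if condition(q)]
    if not matched or len(matched) < min_coverage * len(logs):
        return False
    global_fraction = sum(q["srt"] > hsrt_threshold for q in logs) / len(logs)
    condition_fraction = sum(q["srt"] > hsrt_threshold for q in matched) / len(matched)
    return condition_fraction > global_fraction

# Made-up search log records.
logs = [
    {"srt": 1.5, "images": 30, "ads": True},
    {"srt": 1.3, "images": 25, "ads": False},
    {"srt": 0.4, "images": 2, "ads": False},
    {"srt": 0.6, "images": 3, "ads": True},
    {"srt": 0.5, "images": 28, "ads": False},
]

many_images = lambda q: q["images"] > 20
few_images = lambda q: q["images"] <= 20
print(is_hsrt_condition(logs, many_images), is_hsrt_condition(logs, few_images))
```

Swapping the labeling rule, for example to one based on return codes, only requires changing how the HSRT queries are counted inside the function.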

‘Focus’ System Overview:

Input: search logs

(i) Use a decision tree based classifier to identify HSRT conditions in the search logs of each day;

(ii) Use a clustering based condition type miner to identify condition types of similar HSRT conditions, and find prevalent condition types across days;

(iii) Use an attribute effect estimator to analyze how an attribute affects SRT and SRT components in each prevalent condition type, i.e. which attributes influence the label most.

Output: prevalent condition types and their attributes' effects on SRT (the condition types from step (ii) and the attribute effects from step (iii)).

Part (ii): Condition Type Miner: group HSRT conditions that have (1) the same combination of attributes, (2) the same value for each categorical attribute, and (3) similar intervals for each numeric attribute, using the Jaccard index to measure the similarity between intervals.
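The three grouping criteria can be sketched as a pairwise test between two conditions. The representation of a condition as a dictionary (tuples for numeric intervals, plain values for categorical attributes), the bounded intervals, and the `min_similarity` cutoff are assumptions for this example; the paper's clustering procedure may differ.

```python
def interval_jaccard(a, b):
    """Jaccard index of two bounded intervals: overlap length / union length."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    overlap = max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
    union = (a_hi - a_lo) + (b_hi - b_lo) - overlap
    return overlap / union if union > 0 else 0.0

def same_condition_type(c1, c2, min_similarity=0.5):
    """Check the three criteria: same attributes, same categorical values,
    and similar numeric intervals."""
    if c1.keys() != c2.keys():
        return False
    for attr in c1:
        v1, v2 = c1[attr], c2[attr]
        if isinstance(v1, tuple) and isinstance(v2, tuple):
            if interval_jaccard(v1, v2) < min_similarity:
                return False
        elif v1 != v2:
            return False
    return True

# Made-up HSRT conditions from three different days.
day1 = {"ads": True, "images": (20, 40)}
day2 = {"ads": True, "images": (25, 40)}
day3 = {"ads": False, "images": (20, 40)}
print(same_condition_type(day1, day2), same_condition_type(day1, day3))
```

Conditions that pass this test against each other can be merged into one condition type, and a type that recurs on many days is a candidate prevalent type.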

Part (iii): Attribute Effect Estimator: For each condition type $C = \{c_1, c_2, \ldots, c_n\}$, where each $c_i$ is an attribute condition, we design a method to understand how each attribute condition affects SRT.

For example, what is the HSRT fraction caused by $c_i$ in $C$? Which SRT components are affected by $c_i$?

Main Idea: flip the condition $c_i$ to its opposite to get a variant condition type $C'$. From the past days, we have the total number of HSRT events, the number of HSRT events in condition $C$, and the number of HSRT events in condition $C'$. We believe that a comparison based on this historical data provides a reasonable estimate of the attribute effects. The comparison between $C$ and $C'$ over these days is based on the specific HSRT conditions of those days. (This is used to determine which attributes are more likely to cause HSRT.)
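The flipping idea can be sketched as follows: compute the HSRT fraction among queries matching the condition type, negate one attribute condition, and compute it again. Conditions are represented as predicate functions over dictionary records with hypothetical field names; the data is made up, and the estimator in the paper works on historical logs across days rather than a single list.

```python
def hsrt_fraction(logs, conditions, hsrt_threshold=1.0):
    """HSRT fraction among queries that satisfy every attribute condition."""
    matched = [q for q in logs if all(c(q) for c in conditions)]
    if not matched:
        return None
    return sum(q["srt"] > hsrt_threshold for q in matched) / len(matched)

def flip_effect(logs, conditions, index):
    """Change in HSRT fraction after flipping one attribute condition.
    Negative means fewer HSRT queries after the flip (an improvement)."""
    original = conditions[index]
    flipped = list(conditions)
    flipped[index] = lambda q: not original(q)
    before = hsrt_fraction(logs, conditions)
    after = hsrt_fraction(logs, flipped)
    if before is None or after is None:
        return None
    return after - before

logs = [
    {"srt": 1.5, "images": 30, "ads": True},
    {"srt": 1.3, "images": 25, "ads": True},
    {"srt": 0.4, "images": 2, "ads": True},
    {"srt": 0.6, "images": 3, "ads": True},
    {"srt": 0.5, "images": 28, "ads": True},
    {"srt": 0.7, "images": 30, "ads": False},
    {"srt": 1.4, "images": 22, "ads": False},
]
condition_type = [lambda q: q["images"] > 20, lambda q: q["ads"]]

print(flip_effect(logs, condition_type, 0))  # flip the image-count condition
print(flip_effect(logs, condition_type, 1))  # flip the ads condition
```

In this toy data, flipping the image-count condition lowers the HSRT fraction much more than flipping the ads condition, which is the kind of ranking the estimator is meant to expose.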

In Table IV, the results are sorted by the variation of the fraction of HSRT in condition types (HSRT% column) caused by flipping an attribute condition.

(i) We highlight the variations greater than zero (getting worse after flipping an attribute condition).

(ii) We focus on the cases where flipping an HSRT branching attribute condition yields an improvement in HSRT%. For example, the conditions of the form #image > x are all ranked at the top. This means that reducing the impact of images on SRT offers the highest potential improvement in HSRT.

(iii) Table III and Table IV are the output of Focus to the operators for these months.