Early classification of multivariate temporal observations by extraction of interpretable shapelets.

Ghalwash MF, Obradovic Z - BMC Bioinformatics (2012)

Bottom Line:
Early classification of time series is beneficial for biomedical informatics problems including, but not limited to, disease change detection. In addition, extracting patterns from the original time series helps domain experts gain insight into the classification results. The time series were classified by searching for the earliest closest patterns.

Background: Early classification of time series is beneficial for biomedical informatics problems including, but not limited to, disease change detection. Early classification can be of tremendous help by identifying the onset of a disease before it has time to fully take hold. In addition, extracting patterns from the original time series helps domain experts gain insight into the classification results. This problem has been studied recently using time series segments called shapelets. In this paper, we present a method, which we call Multivariate Shapelets Detection (MSD), that allows for early and patient-specific classification of multivariate time series. The method extracts time series patterns, called multivariate shapelets, from all dimensions of the time series that distinctly manifest the target class locally. The time series were classified by searching for the earliest closest patterns.

Results: The proposed early classification method for multivariate time series has been evaluated on eight gene expression datasets from viral infection and drug response studies in humans. In our experiments, the MSD method outperformed the baseline methods, achieving highly accurate classification by using as little as 40%-64% of the time series. The obtained results provide evidence that using conventional classification methods on short time series is not as accurate as using the proposed methods specialized for early classification.

Conclusion: For the early classification task, we proposed a method called Multivariate Shapelets Detection (MSD), which extracts patterns from all dimensions of the time series. We showed that the MSD method can classify the time series early by using as little as 40%-64% of the time series' length.

Figure 4: Candidate distance threshold. The distance threshold δ1 splits the dataset into two datasets so that it has 4 true positives, 0 false positives, 4 true negatives, and 1 false negative. The information gain of δ1 is 0.4090. The distance threshold δ2 divides the dataset into two datasets so that it has 4 true positives, 1 false positive, 3 true negatives, and 1 false negative. The information gain of δ2 is 0.1591. Hence, δ1 has better information gain than δ2.

Mentions:
where mc is the number of time series of class c and M is the number of all time series. To compute the distance threshold, the method sorts the distances between the shapelet and all time series. Then, it finds the midpoint between two consecutive distances as a candidate for the threshold. The dataset is then divided into two datasets DL and DR as illustrated in Figure 4. The dataset DL contains all time series such that the distance between the shapelet and time series is less than or equal to the candidate threshold. The dataset DR contains the rest of the time series. Then the entropies EL and ER of the datasets DL and DR are computed, respectively. By comparing the entropy before and after the split, we obtain a measure of information gain which is computed as
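
The threshold search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the usual weighted form of information gain, IG = E - (|DL|/M) EL - (|DR|/M) ER, and an entropy taken with the natural logarithm. The function names and the example distances are hypothetical; the labels are arranged only so that the two splits match the counts given in the Figure 4 caption.

```python
import math
from typing import List, Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Entropy -sum_c (m_c / M) * ln(m_c / M); natural log is an assumption."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for c in set(labels):
        p = list(labels).count(c) / total
        result -= p * math.log(p)
    return result


def candidate_thresholds(distances: Sequence[float]) -> List[float]:
    """Midpoints between consecutive distinct distances, after sorting."""
    d = sorted(set(distances))
    return [(a + b) / 2 for a, b in zip(d, d[1:])]


def information_gain(distances: Sequence[float],
                     labels: Sequence[int],
                     threshold: float) -> float:
    """Entropy before the split minus the size-weighted entropies of DL and DR."""
    # DL: distance <= threshold; DR: the rest of the time series.
    left = [y for d, y in zip(distances, labels) if d <= threshold]
    right = [y for d, y in zip(distances, labels) if d > threshold]
    m = len(labels)
    return (entropy(labels)
            - len(left) / m * entropy(left)
            - len(right) / m * entropy(right))


def best_threshold(distances: Sequence[float],
                   labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, gain) for the candidate with the highest gain."""
    scored = [(t, information_gain(distances, labels, t))
              for t in candidate_thresholds(distances)]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical shapelet-to-series distances for 9 series (5 positive, 4 negative).
distances = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0]

gain_delta1 = information_gain(distances, labels, 4.5)  # 4 TP, 0 FP, 4 TN, 1 FN
gain_delta2 = information_gain(distances, labels, 5.5)  # 4 TP, 1 FP, 3 TN, 1 FN
print(round(gain_delta1, 4), round(gain_delta2, 4))
print(best_threshold(distances, labels))
```

Under these assumptions the first split evaluates to roughly 0.4090, matching the value reported for δ1, and the second to about 0.159, close to the value reported for δ2. The agreement is what motivates the natural-log choice here; with a base-2 logarithm the ranking of the two thresholds stays the same but the numbers differ.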
