Efficient Algorithms for Mining Large Spatio-Temporal Data

Abstract

Knowledge discovery on spatio-temporal datasets has attracted growing interest. Recent advances in remote sensing technology mean that massive amounts of spatio-temporal data are being collected, and their volume keeps increasing at an ever faster pace. It has become critical to design efficient algorithms for identifying novel and meaningful patterns in massive spatio-temporal datasets. Unlike other data sources, these data exhibit significant space-time statistical dependence, so the i.i.d. assumption is no longer valid. Exact modeling of space-time dependence causes model complexity to grow exponentially as the data size increases. This research focuses on the construction of efficient and effective approaches using approximate inference techniques for three main mining tasks: spatial outlier detection, robust spatio-temporal prediction, and novel applications to real-world problems.

Spatial novelty patterns, or spatial outliers, are data points whose characteristics are markedly different from those of their spatial neighbors. There are two major branches of spatial outlier detection methodologies: global Kriging-based methods and local Laplacian-smoothing-based methods. The former require exact modeling of spatial dependence, which is time-consuming; the latter assume that the smoothed observations are i.i.d., which is not statistically sound. Both approaches are restricted to numerical data, but real-world applications often involve a variety of non-numerical data types, such as count, binary, nominal, and ordinal data. To summarize, the main research challenges are: 1) how much spatial dependence can be eliminated via Laplacian smoothing; 2) how to effectively and efficiently detect outliers in large numerical spatial datasets; 3) how to generalize numerical detection methods and develop a unified outlier detection framework suitable for large non-numerical datasets; 4) how to achieve accurate spatial prediction even when the training data has been contaminated by outliers; 5) how to deal with spatio-temporal data for the preceding problems.
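
The local, neighborhood-based idea described above can be illustrated with a minimal sketch. It is not the method developed in this work: it simply scores each point by the difference between its value and the mean of its k nearest neighbors, then standardizes the differences with the median and the median absolute deviation. The function names, the choice of k, and the toy grid are all assumptions made for illustration.

```python
import math
import statistics

def knn_indices(points, i, k):
    """Indices of the k points nearest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def local_difference_scores(points, values, k=4):
    """Robust z-scores of (value minus mean value of the k nearest neighbors)."""
    diffs = []
    for i in range(len(points)):
        nbrs = knn_indices(points, i, k)
        nbr_mean = statistics.fmean([values[j] for j in nbrs])
        diffs.append(values[i] - nbr_mean)
    med = statistics.median(diffs)
    mad = statistics.median([abs(d - med) for d in diffs])
    # 1.4826 * MAD estimates the standard deviation under normality;
    # fall back to 1.0 if the MAD is zero (hypothetical safeguard).
    scale = 1.4826 * mad if mad > 0 else 1.0
    return [(d - med) / scale for d in diffs]

# Toy data (assumed): a smooth surface on a 5x5 grid with one planted outlier.
points = [(x, y) for x in range(5) for y in range(5)]
values = [float(x + y) for x, y in points]
values[12] = 30.0  # the center point (2, 2) no longer matches its neighbors

scores = local_difference_scores(points, values, k=4)
worst = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(worst, points[worst])
```

In this toy grid the planted center point receives the largest absolute score. Its immediate neighbors also receive inflated scores, because the outlier contaminates their neighborhood means; this is one reason a simple local difference is only a starting point.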

To address the first and second challenges, we mathematically validated the effectiveness of Laplacian smoothing in eliminating spatial autocorrelation. This work provides fundamental support for existing Laplacian-smoothing-based methods. We also discovered a nontrivial side effect of Laplacian smoothing: it introduces additional spatial variation into the data due to convolution effects. To capture this extra variability, we proposed a generalized local statistical model and designed two fast forward and backward outlier detection methods that achieve a better balance between computational efficiency and accuracy than most existing methods and are well suited to large numerical spatial datasets.

We addressed the third challenge by mapping non-numerical variables to latent numerical variables via a link function, such as the logit function used in logistic regression, and then using error-buffer artificial variables, which follow a Student-t distribution, to capture the large variations caused by outliers. We proposed a unified statistical framework that integrates the advantages of the spatial generalized linear mixed model, the robust spatial linear model, reduced-rank dimension reduction, and the Bayesian hierarchical model. A linear-time approximate inference algorithm was designed to infer the posterior distribution of the error-buffer artificial variables conditioned on the observations. We demonstrated that traditional numerical outlier detection methods can be applied directly to the estimated artificial variables for outlier detection. To the best of our knowledge, this is the first linear-time outlier detection algorithm that supports a variety of spatial attribute types, such as binary, count, ordinal, and nominal.
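
Two ingredients named above, the logit link and the heavy-tailed Student-t distribution, can be sketched in a few lines. This is only an illustration of why a Student-t variable can absorb an outlier that a Gaussian would penalize heavily; it is not the inference algorithm of this work. The degrees of freedom (nu = 3), the unit scale, and the function names are assumptions.

```python
import math

def logit(p):
    """Link function: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse link: maps a latent real value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def gaussian_logpdf(x, sigma=1.0):
    """Log-density of a zero-mean Gaussian with standard deviation sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - x ** 2 / (2.0 * sigma ** 2)

def student_t_logpdf(x, nu=3.0, sigma=1.0):
    """Log-density of a zero-mean Student-t with nu degrees of freedom and scale sigma."""
    z = x / sigma
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1.0) / 2.0 * math.log1p(z * z / nu))

# A latent deviation of 6 scale units is far less surprising under the
# Student-t than under the Gaussian, so it distorts the fit much less.
for x in (0.0, 2.0, 6.0):
    print(x, round(gaussian_logpdf(x), 3), round(student_t_logpdf(x), 3))
```

Near zero the two log-densities are close, but at a deviation of 6 the Gaussian log-density is much lower than the Student-t one, which is the behavior that lets a heavy-tailed variable act as a buffer for outliers.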

To address the fourth and fifth challenges, we proposed a robust version of the Spatio-Temporal Random Effects (STRE) model, namely the Robust STRE (R-STRE) model. The regular STRE model is a recently proposed statistical model for large spatio-temporal data that has linear time complexity but is not well suited to non-Gaussian and contaminated datasets. This deficiency can be systematically addressed by increasing the robustness of the model, using a heavy-tailed distribution such as the Huber, Laplace, or Student-t distribution, instead of the traditional Gaussian, to model the measurement error. However, the resulting R-STRE model becomes analytically intractable, and direct application of approximate inference techniques still has cubic time complexity. To address this computational challenge, we reformulated the prediction problem as a maximum a posteriori (MAP) problem with a non-smooth objective function, transformed it into an equivalent quadratic programming problem, and developed an efficient interior-point numerical algorithm with near-linear complexity. This work presents the first near-linear-time robust prediction approach for large spatio-temporal datasets in both offline and online cases.
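
The effect of replacing a Gaussian error model with a Huber one can be seen on a much simpler problem than R-STRE: estimating a single location parameter from contaminated measurements. The sketch below uses iteratively reweighted least squares (IRLS), which is a different and far simpler technique than the quadratic-programming and interior-point approach described above; the threshold delta = 1.5, the data, and the function names are assumptions for illustration.

```python
import statistics

def huber_weight(r, delta):
    """IRLS weight for the Huber loss: 1 inside [-delta, delta], delta/|r| outside."""
    a = abs(r)
    return 1.0 if a <= delta else delta / a

def huber_location(xs, delta=1.5, iters=50, tol=1e-10):
    """Huber M-estimate of location, computed by IRLS starting from the median."""
    mu = statistics.median(xs)
    for _ in range(iters):
        w = [huber_weight(x - mu, delta) for x in xs]
        new_mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Five clean measurements near 10 and one gross outlier (assumed toy data).
data = [9.8, 10.1, 10.0, 9.9, 10.2, 60.0]
print(statistics.fmean(data))   # Gaussian (least-squares) estimate, pulled toward 60
print(huber_location(data))     # Huber estimate, stays close to the clean values
```

Under the Huber loss the outlier's influence is capped at delta, so it can shift the estimate only by a bounded amount, whereas the least-squares mean moves in proportion to the size of the outlier.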