NIST 2015 Pre-Pilot Data Science Evaluation

We participated in the 2015 Pre-Pilot Data Science Evaluation organized by the National Institute of Standards and Technology (NIST). The primary goal of the pre-pilot evaluation was to develop and exercise the evaluation process in the context of data science. The evaluation consisted of four tasks: data cleaning, data alignment, forecasting, and prediction. Our DSR lab participated in the data cleaning and traffic event prediction tasks, and submitted several running systems based on different algorithms and configurations. Most of the submissions were based on final project results from Dr. Daisy Zhe Wang’s Fall 2015 Introduction to Data Science class. The course introduced basic data science techniques, including programming in Python, exploratory and statistical analysis, and Map-Reduce for small- and big-data manipulation and analytics. In our class, 7 groups of 3-6 students participated in the pre-pilot; 4 groups were mainly undergraduates and 3 were mainly graduate students (master’s and PhD).

Since the 1980s, NIST has conducted evaluations of data-centric technologies, including automatic speech transcription, information retrieval, machine translation, speaker and language recognition, image recognition, fingerprint matching, event detection from text, video, and multimedia, and automatic knowledge base construction, among many others. These evaluations have enabled rigorous research by providing the following fundamental elements: (1) the use of common tasks, datasets, and metrics; (2) the presence of specific research challenges meant to drive the technology forward; (3) an infrastructure for developing effective measurement techniques and measuring the state of the art; and (4) a venue for encouraging innovative algorithmic approaches. The tasks included in the pre-pilot are illustrated in Figure 1 and consist of: 1) Cleaning: finding and eliminating errors in dirty data. 2) Alignment: relating different representations of the same object across data sources. 3) Prediction: determining possible values for an unknown variable based on known variables. 4) Forecasting: determining future values of a variable based on past values.

In the data cleaning task, we were given a set of detector measurements (see Figure 2 for the detector distribution) of vehicle velocity, lane occupancy, and flow. Some of the given values are incorrect (e.g., noise added by an artificial program), and the task is to correct them. The major challenges of this task come from three aspects. First, the data is big (150GB of text data with around 1.46 billion measurement entries), which complicates system development, debugging, and fine-tuning. Second, the model used to add noise to the data is unknown, which makes it difficult to reliably detect erroneous values. Finally, once a measurement is flagged as incorrect, replacing it with a correct value is itself a challenge. For this task we submitted three runs, but none of them performed better than the baseline systems. The evaluation metric was the mean absolute error (MAE); our best-performing system had an MAE of 0.40, while the baseline systems had MAEs around 0.28. According to the report from NIST, most of our errors came from false alarms (cases where a flow value is correct but our system flagged it as incorrect). A follow-up discussion with NIST at a workshop confirmed that false alarms can easily occur given the noising model they used. In addition, while some erroneous flow values can be detected easily (such as extremely high or negative values), a good portion cannot. In the future, we should improve our system to reliably detect such incorrect flow values and reduce the number of false alarms.
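The detect-then-replace pipeline for the cleaning task can be sketched as follows. This is a minimal illustration rather than our submitted system: the column name, thresholds, and rolling-median repair are assumptions, and the synthetic data stands in for the actual 150GB detector feed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for detector flow measurements; the column name and
# injected noise pattern are assumptions, not the actual NIST schema.
rng = np.random.default_rng(0)
flow = rng.normal(50.0, 5.0, 200)
flow[[20, 80, 150]] = [-10.0, 500.0, 480.0]  # injected negative/extreme values
df = pd.DataFrame({"flow": flow})

# Step 1: flag values with an extreme robust z-score (median/MAD based),
# which catches gross outliers without assuming the (unknown) noise model.
median = df["flow"].median()
mad = (df["flow"] - median).abs().median()
df["is_noisy"] = (0.6745 * (df["flow"] - median) / mad).abs() > 3.5

# Step 2: replace flagged values with a centered rolling median of neighbors.
df["flow_clean"] = df["flow"].mask(df["is_noisy"])
df["flow_clean"] = df["flow_clean"].fillna(
    df["flow"].rolling(11, center=True, min_periods=1).median()
)
```

A scheme like this catches the "extremely high or negative" values mentioned above, but, as noted, it cannot detect subtler noise and can produce false alarms when the flag threshold is too aggressive.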

Figure 2. Detector Distribution

Figure 3. Number of Event Occurrence by Year

For the prediction task, we developed systems that predict the number and types of traffic events for a given (geographic bounding box, time interval) pair. In this task, we have event data for the past 12 years (see Figure 3 for the number of event occurrences by year), with events located around the Baltimore-DC area. At testing time, a system needs to predict the number of events that will occur within a given geographic bounding box and time interval (1 month). We submitted 7 runs in total. While all the runs use regression models for prediction, some of them make use of extra data such as weather and OpenStreetMap in addition to time and event counts from past years. We also tested different regression models, including linear regression, second-order polynomial regression, and support vector regression. The evaluation metric in this task was root mean squared error (RMSE), and our submitted systems had RMSE values ranging from 5.17 to 33.44. Based on the report, we found that the best-performing systems share the following features: 1) cleaning the noisy data (e.g., “zero counts”) before feeding it into the regression models; 2) using higher-order regression models rather than linear regression; and 3) training one model per year instead of per month to alleviate the curse-of-dimensionality effect caused by sparse training data.
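The recipe behind the better-performing runs (drop zero counts, then fit a second-order polynomial per year) can be sketched with NumPy’s `polyfit`. The monthly counts below are fabricated for this example and are not taken from the evaluation data.

```python
import numpy as np

# Illustrative monthly event counts for one year; the numbers are made up
# for this sketch. Month 11 carries a suspicious "zero count".
months = np.arange(1, 13, dtype=float)
counts = np.array([12, 10, 14, 18, 22, 30, 33, 31, 25, 20, 0, 11], dtype=float)

# Step 1: drop suspicious "zero counts" before fitting.
mask = counts > 0

# Step 2: fit a second-order polynomial regression of count on month
# (one such model per year in the better-performing configuration).
coeffs = np.polyfit(months[mask], counts[mask], deg=2)
model = np.poly1d(coeffs)

# Evaluate with the task's metric, RMSE, on the retained months.
rmse = np.sqrt(np.mean((model(months[mask]) - counts[mask]) ** 2))
```

The same pattern extends to richer feature sets (weather, map features) by switching from `polyfit` to a multivariate regressor; the per-year grouping keeps the number of model parameters small relative to the sparse training points.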

Overall, the benefit of participating in the pre-pilot evaluation was tremendous. The students had a very realistic and positive experience learning to tackle big data science challenges, including data volume in the data cleaning task and data veracity and value in the traffic event prediction task. The tasks of the pre-pilot evaluation are independent, allowing entry from groups with different expertise, and can be tackled by undergraduate and graduate student groups with models and tools of varying sophistication. Students in our data science class learned a great deal, from using basic tools such as Pandas, Hadoop, and Apache Spark to developing scalable systems and combining machine learning with big data analytics. We also learned valuable lessons about the evaluation process in the context of data science, and provided feedback to help improve future evaluations (e.g., the upcoming Pilot by NIST). We conclude with three observations based on our participation in the NIST Data Science Evaluation workshop held at NIST in March 2016 (please see our presentation slides, part 1 and part 2, for more details): 1) prototype with a simple model and fewer data types first, and analyze potential correlations between data types; 2) beware of curse-of-dimensionality problems in the prediction task when the given training data is sparse; 3) it may be useful to release part of the ground-truth data for the cleaning task, to help participants avoid over- or under-aggressive cleaning schemes.