M.S. in Mathematics with Emphasis in Data Mining

Tarleton State University houses the Center for Agribusiness Excellence (CAE), which uses data mining techniques to screen all of the USDA's crop insurance data for fraud. In 2010, CAE was awarded a Top 10 Data Mining Case Study by the Institute of Electrical and Electronics Engineers (IEEE). In partnership with CAE, the mathematics department offers an M.S. in mathematics with emphasis in data mining, and students in this program are eligible for $25,000 research assistantships.

Research assistants have a 15-month appointment, beginning on June 1 and concluding on August 31 of the following year. During the first 12 months, they work 20 hours per week for CAE while completing coursework, and during the last three months, they work 40 hours per week for CAE.

Detecting Anomalous Crop Insurance Claims using Satellite Images

From the red and near-infrared bands of a satellite image, it is possible to calculate the normalized difference vegetation index, or NDVI. NDVI serves as a proxy for the amount of green vegetation in a given geographic region and, therefore, for the health of crops being grown in that area. A k-means algorithm was applied to cluster NDVI curves for Nebraska crop insurance claims, resulting in a relatively healthy cluster (Cluster 1) and an unhealthy one (Cluster 2).
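As a rough illustration of the two steps described above, the sketch below computes NDVI with the standard formula (NIR - Red) / (NIR + Red) and then clusters the resulting curves with a plain k-means loop. The reflectance values, the function names, and the deterministic choice of starting centers are all assumptions made for this example; they are not CAE's data or implementation.

```python
def ndvi(nir, red):
    """Standard NDVI: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(curves, k=2, max_iterations=50):
    """Lloyd's algorithm on equal-length curves; returns (labels, centers)."""
    # Assumed deterministic start: curves evenly spaced by mean NDVI.
    ordered = sorted(curves, key=lambda c: sum(c) / len(c))
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centers = [list(ordered[round(i * step)]) for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assign every curve to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(curve, centers[j]))
            for curve in curves
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the pointwise mean of its member curves.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:
                centers[j] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centers


# Hypothetical (NIR, red) reflectances at four dates for four fields.
fields = [
    [(0.30, 0.20), (0.45, 0.10), (0.50, 0.08), (0.40, 0.12)],
    [(0.32, 0.19), (0.47, 0.11), (0.52, 0.09), (0.41, 0.13)],
    [(0.25, 0.22), (0.28, 0.20), (0.27, 0.21), (0.24, 0.22)],
    [(0.26, 0.23), (0.27, 0.21), (0.28, 0.22), (0.25, 0.23)],
]
curves = [[ndvi(nir, red) for nir, red in field] for field in fields]
labels, centers = kmeans(curves, k=2)
print(labels)
```

With these made-up inputs, the first two fields (high NDVI through the season) land in one cluster and the last two in the other. The cluster numbers themselves are arbitrary labels.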

This clustering was then compared to spot checklist (SCL) flags, used by CAE to flag anomalous insurance claims. A Fisher's exact test comparing the clustering to the SCL flags resulted in a p-value less than $10^{-5}$, demonstrating a highly statistically significant association between the NDVI clusters and the SCL flags.
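For readers unfamiliar with the test, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of cluster membership against SCL flag status. The function name and the counts are invented for illustration and are not the project's data; a statistics library would normally be used instead of this hand-rolled version.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    return sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Rows: NDVI Cluster 1 / Cluster 2.  Columns: SCL-flagged / not flagged.
# These counts are invented for illustration only.
p_value = fisher_exact_two_sided(10, 0, 0, 10)
print(p_value)
```

A small p-value means that the observed split of flags between the two clusters would be very unlikely if cluster membership and flag status were independent.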

Below, Charles, Adam, Rebecca, and Dan are shown speaking with Kirk Bryant, Deputy Director for Strategic Data Acquisition and Analysis for the USDA Risk Management Agency at the National Consortium for Data Science Data Showcase.

Bayesian Ensemble Models of Climate Variability in South Texas

Possibly the most important application of data mining in the 21st century is building and refining models of climate change and then using those models to predict climate behavior in local regions. Juliann Booth and Nina Culver are using Bayesian model averaging to predict future precipitation in South Texas, an important concern, given the projected decline in water availability in this region by 2050.

Thirty-five CMIP5 models $f_1, \ldots, f_{35}$ for temperature and precipitation were obtained from the World Climate Research Programme's Working Group on Coupled Modeling. For each model $f_k$, the probability of observing a temperature/precipitation measurement $y$ is $p(y \mid f_k)$, and the probability that $f_k$ is the best model given observed target data $y_T$ is $p(f_k \mid y_T)$. Synthesizing these two types of probabilities using Bayes' theorem yields the overall probability of observing a future temperature/precipitation measurement $y$ as follows:
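In the standard formulation of Bayesian model averaging, which is assumed here to be the form intended, the predictive density is a mixture of the per-model densities weighted by the posterior model probabilities:

$$p(y \mid y_T) = \sum_{k=1}^{35} p(y \mid f_k)\, p(f_k \mid y_T), \qquad p(f_k \mid y_T) \propto p(y_T \mid f_k)\, p(f_k).$$

The sketch below works through that formula for three hypothetical models. The Gaussian error model, the uniform prior over models, the numbers, and all function names are assumptions made for this example, not details of the project.

```python
import math


def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def gaussian_log_likelihood(observed, predicted, sd):
    """log p(y_T | f_k), assuming independent Gaussian errors (an assumption)."""
    return sum(math.log(normal_pdf(o, p, sd)) for o, p in zip(observed, predicted))


def posterior_weights(log_likelihoods, priors=None):
    """p(f_k | y_T) from Bayes' theorem, normalized in log space for stability."""
    k = len(log_likelihoods)
    priors = priors or [1.0 / k] * k  # uniform prior over models by default
    logs = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    top = max(logs)
    unnormalized = [math.exp(v - top) for v in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def bma_density(y, means, sds, weights):
    """p(y | y_T) = sum_k p(y | f_k) * p(f_k | y_T)."""
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))


# Hypothetical observed target data and three model hindcasts of it.
y_target = [20.0, 22.0, 21.0]
hindcasts = [
    [20.1, 21.9, 21.2],
    [18.0, 20.0, 19.0],
    [23.0, 25.0, 24.0],
]
log_liks = [gaussian_log_likelihood(y_target, h, sd=1.0) for h in hindcasts]
weights = posterior_weights(log_liks)

# Hypothetical future projections from the same three models.
future_means = [23.0, 21.5, 26.0]
future_sds = [1.0, 1.0, 1.0]
bma_mean = sum(w * m for w, m in zip(weights, future_means))
density_at_23 = bma_density(23.0, future_means, future_sds, weights)
print(weights, bma_mean, density_at_23)
```

Models whose hindcasts track the observed target data receive larger weights, so they dominate the combined prediction.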

Below is a visualization of temperature predictions for the thirty-five CMIP5 models for the South Texas region being studied.

Modeling Nitrate Contamination in Water Wells Based on Proximity to CAFOs

Nitrate contamination of groundwater is a serious health concern that can lead to conditions such as methemoglobinemia (blue baby disease), miscarriages, and non-Hodgkin lymphoma. The EPA has therefore set a maximum contaminant level (MCL) for nitrate of 10 mg/L. Proximity of concentrated animal feeding operations (CAFOs) to water wells has been linked to nitrate contamination of those wells, and Charles Tintera and Lain Tomlinson are currently applying data mining techniques to model this relationship more accurately.

A novel feature of this project is modeling flowpaths in the aquifer from a given CAFO using the hydraulic gradient obtained from a geographic information system (GIS). By taking into account the distance from a well to a CAFO's flowpath, the length of that flowpath, and the waste application rate at that CAFO, a CAFO Migration Score (CMS) is calculated to summarize the overall impact of CAFOs on the well under consideration. The Epanechnikov kernel is applied to model diminished probabilities of contamination that result from increased distances from the flowpath.
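The Epanechnikov kernel has the standard form K(u) = 0.75 (1 - u^2) for |u| <= 1 and 0 otherwise, so its weight falls smoothly to zero once the scaled distance reaches 1. The sketch below shows the kernel together with a toy distance-weighted score. The score formula, the bandwidth, and the sample values are hypothetical: the exact CMS definition, including how flowpath length enters, is not given here, so the toy score leaves flowpath length out.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def toy_migration_score(cafos, bandwidth_m):
    """Hypothetical score: waste application rate weighted by kernel distance.

    `cafos` is a list of (distance_to_flowpath_m, waste_application_rate)
    pairs for one well. This is NOT the project's CMS formula.
    """
    return sum(
        rate * epanechnikov(distance / bandwidth_m)
        for distance, rate in cafos
    )


# Made-up example: three CAFO flowpaths near one well.
cafos_near_well = [(0.0, 100.0), (500.0, 200.0), (2000.0, 300.0)]
score = toy_migration_score(cafos_near_well, bandwidth_m=1000.0)
print(score)
```

In this toy version, a flowpath farther from the well than the bandwidth contributes nothing, no matter how large its waste application rate is.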

Once CAFO migration scores were computed, a logistic regression model demonstrated a highly statistically significant relationship between CMS and nitrate contamination ($P = 7.19 \times 10^{-12}$). In the image below, 344 wells have been divided into deciles based on CAFO migration score, so each point in this plot represents approximately 34 wells. The x-coordinate of each point is the average CMS value for wells in that decile, and the y-coordinate is the observed number of wells in that decile with nitrate concentrations exceeding 3 mg/L. The plot indicates strong agreement between the observed data and the logistic regression model, as confirmed with a Hosmer-Lemeshow goodness of fit test.
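The decile comparison can be sketched as follows: fit a one-predictor logistic regression, sort the wells by score, split them into ten groups, and compare observed and expected counts in each group. The data below are synthetic and deterministic, and the function names and the Newton-Raphson fitting routine are assumptions for illustration; they are not the project's data or code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iterations=50):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w             # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def decile_table(xs, ys, probs, groups=10):
    """Sort by score, split into equal-sized groups, summarize each group."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    n = len(order)
    table = []
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        table.append({
            "n": len(idx),
            "mean_x": sum(xs[i] for i in idx) / len(idx),
            "observed": sum(ys[i] for i in idx),
            "expected": sum(probs[i] for i in idx),
        })
    return table


def hosmer_lemeshow(table):
    """Hosmer-Lemeshow statistic: sum of (O - E)^2 / (E * (1 - E / n))."""
    return sum(
        (r["observed"] - r["expected"]) ** 2
        / (r["expected"] * (1.0 - r["expected"] / r["n"]))
        for r in table
    )


# Synthetic wells: contamination becomes more common as the score rises.
n_wells = 40
scores = [i / (n_wells - 1) for i in range(n_wells)]
contaminated = [1 if ((i * 7) % 10) / 10 < scores[i] else 0 for i in range(n_wells)]

b0, b1 = fit_logistic(scores, contaminated)
probs = [sigmoid(b0 + b1 * x) for x in scores]
table = decile_table(scores, contaminated, probs)
hl_statistic = hosmer_lemeshow(table)
print(b1, hl_statistic)
```

The Hosmer-Lemeshow statistic is conventionally compared with a chi-square distribution with (groups - 2) degrees of freedom, and a small statistic indicates that observed and expected counts agree across the groups. Grouping by score rather than by fitted probability gives the same groups here because the fitted probability increases with the score.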

Charles and Lain are now working to extend this analysis to include more variables, such as depth to water table, pH, total dissolved solids, percent clay, percent organic matter, and annual rainfall. They are also applying random forests, support vector machines, k-nearest neighbors, and other classification algorithms to improve the model's classification accuracy. Because testing for nitrate contamination is expensive, the goal is to provide a tool that will help farmers estimate a well's probability of being contaminated using readily available information about that well.

Where are they now?

Below are LinkedIn profiles for some of our previous graduates of the data mining program.