meltt

Matching Event Data by Location, Time and Type

Framework for merging and disambiguating event data based on spatiotemporal co-occurrence and secondary event characteristics. It can account for intrinsic "fuzziness" in the coding of events, varying event taxonomies and different geo-precision codes.

meltt provides a method for integrating event data in R. Event data seek to capture micro-level information on event occurrences that are temporally and spatially disaggregated. For example, information on neighborhood crime, car accidents, terrorism events, and marathon running times are all forms of event data. These data provide a highly granular picture of the spatial and temporal distribution of a specific phenomenon.

In many cases, more than one event dataset exists capturing related topics (for example, one dataset that captures information on burglaries and muggings in a city and another that records assaults), and it can be useful to combine these data to bolster coverage, capture a broader spectrum of activity, or validate the coding of these datasets. However, matching event data is notoriously difficult:

Jittering Locations: different geo-referencing software can produce slightly different longitude and latitude locations for the same place. This results in an artificial geo-spatial "jitter" around the same location.

Temporal Fuzziness: given how information about events is collected, the exact date of a reported event might differ from source to source. For example, if data is generated using news reports, the reports might differ in the exact timing they give for the event, especially if precise on-the-ground information is hard to come by. This creates a temporal fuzziness where the same empirical event falls on different days in different datasets.

Conceptual Differences: different event datasets are built for different reasons, meaning each dataset will likely contain its own coding schema for the same general category. For example, a dataset recording local muggings and burglaries might have a schema that records these types of events categorically (e.g. "mugging", "break in", etc.), whereas another crime dataset might record violent crimes and do so ordinally (1, 2, 3, etc.). Both datasets might be capturing the same event (say, a violent mugging) but each has its own method of coding that event.

In the past, to overcome these hurdles, researchers have typically relied on hand-coding to systematically match these data, which, needless to say, is extremely time-consuming, error-prone, and hard to reproduce. meltt provides a way around this problem by implementing a method that automates the matching of different event datasets in a fast, transparent, and reproducible way.

More information about the specifics of the method can be found in an upcoming R Journal article as well as in the package's documentation.

Installation

The package can be installed through the CRAN repository.

install.packages("meltt")

Or install the development version from GitHub:

# install.packages("devtools")

devtools::install_github("css-konstanz/meltt")

Currently, the package requires that users have both Python (>= 2.7) and a version of the numpy module installed on their computer. To quickly get both, install an Anaconda platform. meltt will use these programs in the background.

Usage

In the following illustrations, we use (simulated) Maryland car crash data. These data constitute three separate datasets capturing the same thing: car crashes in the state of Maryland for January 2012. But each dataset differs in how it codes information on the car's color, make, and the type of accident. Besides a date column, each dataset contains the following variables:

enddate: if the event occurred across more than one day, i.e. an "episode";

longitude & latitude: geo-location information;

model_tax: coding scheme of the type of car;

color_tax: coding scheme of the color of the car;

damage_tax: coding scheme of the type of accident.

The variable names across datasets have already been standardized (for reasons further outlined below).

The goal is to match these three event datasets to locate which reported events are the same, i.e., the corresponding data set entries are duplicates, and which are unique. meltt formalizes all input assumptions the user needs to make in order to match these data.

First, the user has to specify a spatial and temporal window that any potential match could plausibly fall within. Put differently, how close in space and time does an event need to be to qualify as potentially reporting on the same incident?
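To make the idea of a spatiotemporal window concrete, the sketch below checks whether two reports are close enough in space and time to count as candidates. It is only an illustration in base R: the haversine distance, the helper names haversine_km() and within_window(), and the two example reports are assumptions made here and do not describe how meltt computes proximity internally.

```r
# Great-circle distance in kilometers between two longitude/latitude points
# (haversine formula; the Earth radius of 6371 km is an approximation).
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# TRUE when two events are close enough in space AND time to be candidates.
# Hypothetical helper, not a meltt function.
within_window <- function(e1, e2, spat_km, t_days) {
  dist_km <- haversine_km(e1$longitude, e1$latitude,
                          e2$longitude, e2$latitude)
  gap_days <- abs(as.numeric(e2$date - e1$date))
  dist_km <= spat_km && gap_days <= t_days
}

# Two made-up reports, one day and roughly 2 km apart.
report_a <- list(date = as.Date("2012-01-05"),
                 longitude = -76.61, latitude = 39.29)
report_b <- list(date = as.Date("2012-01-06"),
                 longitude = -76.63, latitude = 39.30)

within_window(report_a, report_b, spat_km = 4, t_days = 2)  # TRUE
within_window(report_a, report_b, spat_km = 1, t_days = 2)  # FALSE
```

Tightening the spatial window from 4 km to 1 km drops the pair from consideration, which is why the choice of window matters for the final match.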

Second, to articulate how different coding schemas overlap, the user needs to input an event taxonomy. A taxonomy is a formalization of how variables overlap, moving from as granular as possible to as general as possible. In this case, it describes how the coding of the three car-specific properties (model, color, damage) across our three data sets correspond.

Generating a taxonomy

Among the three variables that exist in all three datasets, we consider the damage_tax variable recorded in each dataset for an in-depth example:

unique(crash_data1$damage_tax)

# [1] "1" "5" "4" "6" "2" "3" "7"

unique(crash_data2$damage_tax)

# [1] "Flip"

# [2] "Mid-Rear Damage"

# [3] "Front Damage"

# [4] "Side Damage While In Motion"

# [5] "Hit Tree"

# [6] "Side Damage"

# [7] "Hit Property"

unique(crash_data3$damage_tax)

# [1] "Vehicle Rollover" "Rear-End Collision"

# [3] "Sideswipe Collision" "Object Collisions"

# [5] "Side-Impact Collision" "Liable Object Collisions"

# [7] "Head-On Collision"

Each variable records information on the type of accident a little differently. The idea of introducing a taxonomy is then, as mentioned before, to generalize across each category by clarifying how each coding scheme maps onto the other.

crash_taxonomies$damage_tax

#    data.source  base.categories              damage_level1
# 1  crash_data1  1                            Multi-Vehicle Accidents
# 2  crash_data1  2                            Multi-Vehicle Accidents
# 3  crash_data1  3                            Multi-Vehicle Accidents
# 4  crash_data1  4                            Single Car Accidents
# 5  crash_data1  5                            Multi-Vehicle Accidents
# 6  crash_data1  6                            Single Car Accidents
# 7  crash_data1  7                            Single Car Accidents
# 8  crash_data2  Mid-Rear Damage              Multi-Vehicle Accidents
# 9  crash_data2  Side Damage                  Multi-Vehicle Accidents
# 10 crash_data2  Side Damage While In Motion  Multi-Vehicle Accidents
# 11 crash_data2  Flip                         Single Car Accidents
# 12 crash_data2  Front Damage                 Multi-Vehicle Accidents
# 13 crash_data2  Hit Tree                     Single Car Accidents
# 14 crash_data2  Hit Property                 Single Car Accidents
# 15 crash_data3  Rear-End Collision           Multi-Vehicle Accidents
# 16 crash_data3  Side-Impact Collision        Multi-Vehicle Accidents
# 17 crash_data3  Sideswipe Collision          Multi-Vehicle Accidents
# 18 crash_data3  Vehicle Rollover             Single Car Accidents
# 19 crash_data3  Head-On Collision            Multi-Vehicle Accidents
# 20 crash_data3  Object Collisions            Single Car Accidents
# 21 crash_data3  Liable Object Collisions     Single Car Accidents

The crash_taxonomies object contains three pre-made taxonomies, one for each of the three overlapping variable categories. As you can see, the damage_tax taxonomy contains only a single level describing how the different coding schemes overlap. When matching the data, meltt uses this information to score potential matches that are proximate in space and time.
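As a rough illustration of what such a taxonomy does, the sketch below rebuilds a six-row excerpt of the damage taxonomy by hand and uses it to translate each dataset's original code into the shared damage_level1 category. The helper to_level1() and the object toy_taxonomies are hypothetical names introduced here; the lookup is plain base R and is not meltt's own matching code.

```r
# A hand-built excerpt of the damage taxonomy (6 of its 21 rows).
damage_tax <- data.frame(
  data.source = c("crash_data1", "crash_data1",
                  "crash_data2", "crash_data2",
                  "crash_data3", "crash_data3"),
  base.categories = c("1", "4",
                      "Mid-Rear Damage", "Flip",
                      "Rear-End Collision", "Vehicle Rollover"),
  damage_level1 = c("Multi-Vehicle Accidents", "Single Car Accidents",
                    "Multi-Vehicle Accidents", "Single Car Accidents",
                    "Multi-Vehicle Accidents", "Single Car Accidents"),
  stringsAsFactors = FALSE
)

# Taxonomies are collected in a list named after the variable they describe.
toy_taxonomies <- list(damage_tax = damage_tax)

# Hypothetical helper: translate one dataset's original code to level 1.
to_level1 <- function(tax, source, code) {
  tax$damage_level1[tax$data.source == source & tax$base.categories == code]
}

level_a <- to_level1(toy_taxonomies$damage_tax, "crash_data1", "4")
level_b <- to_level1(toy_taxonomies$damage_tax, "crash_data2", "Flip")
level_c <- to_level1(toy_taxonomies$damage_tax, "crash_data3", "Rear-End Collision")

identical(level_a, level_b)  # TRUE: both are "Single Car Accidents"
identical(level_a, level_c)  # FALSE: level_c is "Multi-Vehicle Accidents"
```

A "4" in crash_data1 and a "Flip" in crash_data2 look nothing alike as raw codes, but the taxonomy places them in the same category, so they remain comparable.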

Likewise, we formalized how the model_tax and color_tax variables map onto one another.

The color and model taxonomies contain more levels than the damage taxonomy, moving from specific to increasingly broad categories under which the color and model of the cars can be described. For example, the model_tax goes from make_level1, which contains a schema with 7 unique entries using the Euro coding of car models as a way of specifying overlap, to make_level3, which contains a schema with only two categories (i.e. differentiating between large and small vehicles).

Generally, specifications of taxonomy levels can be as granular or as broad as one chooses. The more fine-grained the levels one includes to describe the overlap, the more specific the match. At the same time, if categories are too narrow, it is difficult to conceptualize potential matches across datasets. There is thus a trade-off between specific categories that can better differentiate among possible duplicate entries and unspecific categories that more easily recognize potentially matching information across datasets.

As a general rule, we therefore recommend including, whenever it is conceptually warranted, both specific fine-grained categories and a few increasingly broader ones. In this case, meltt will have more information to work with when differentiating between sets of potential matches. In establishing which entries are most likely to correspond, meltt always favors the one that corresponds most precisely when several potential matches exist in one dataset. A good taxonomy is the key to matching data, and is the primary vehicle by which a user's assumptions regarding how the data fit together are made transparent.
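The benefit of having several levels can be sketched with a toy tie-breaking rule: among several candidates, prefer the one that agrees with the target at the most specific level. Everything below is an assumption made for illustration, including the level values, the candidate names, and the helper finest_agreement(). It is one simple way to favor fine-grained agreement and should not be read as the matching score meltt actually uses.

```r
# Hypothetical three-level codes (level1 = most specific, level3 = broadest).
target <- c(level1 = "C-segment", level2 = "Passenger car", level3 = "Small")

candidates <- list(
  cand_a = c(level1 = "D-segment", level2 = "Passenger car",   level3 = "Small"),
  cand_b = c(level1 = "C-segment", level2 = "Passenger car",   level3 = "Small"),
  cand_c = c(level1 = "J-segment", level2 = "Utility vehicle", level3 = "Large")
)

# Position of the most specific level on which two entries agree
# (Inf when they agree on no level at all).
finest_agreement <- function(x, y) {
  agree <- which(x == y)
  if (length(agree) == 0) Inf else as.numeric(min(agree))
}

depth <- vapply(candidates, finest_agreement, numeric(1), y = target)
depth                              # cand_a = 2, cand_b = 1, cand_c = Inf

# The candidate agreeing at the finest level is preferred.
best <- names(which.min(depth))
best                               # "cand_b"
```

With only the broadest level available, cand_a and cand_b would be indistinguishable; the finer levels are what separate them.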

A few technical things to note:

Taxonomies must be organized as lists: each taxonomy data.frame is read into meltt as a single list object.

Taxonomies must be named the same as the variables they seek to describe: meltt relies on simple naming conventions to identify which variable is what when matching.

names(crash_taxonomies)

# [1] "model_tax" "color_tax" "damage_tax"

colnames(crash_data1)[7:9]

# [1] "model_tax" "color_tax" "damage_tax"

colnames(crash_data2)[7:9]

# [1] "model_tax" "color_tax" "damage_tax"

colnames(crash_data3)[7:9]

# [1] "model_tax" "color_tax" "damage_tax"

Each taxonomy must contain a data.source and base.categories column: this last convention helps meltt identify which variable is contained in which data object. The data.source column should reflect the names of the data objects for the input data, and the base.categories column should reflect the original coding of the variable on which the taxonomy is built.

Each input dataset must contain a date, enddate (if one exists), longitude, and latitude column: the variables must be named accordingly (no deviations in naming conventions). The dates should be in an R date format (as.Date()), and the geo-reference information must be numeric (as.numeric()).
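The conventions above can be checked before matching. The function below is a hypothetical pre-flight check written in base R for illustration; check_meltt_input() is not part of the meltt package, and the toy data frame is made up. It only tests the column names and types named in the list above.

```r
# Hypothetical pre-flight check; not part of the meltt package.
# Returns a character vector of problems (empty when none are found).
check_meltt_input <- function(data, taxonomy_names) {
  problems <- character(0)
  required <- c("date", "longitude", "latitude")
  missing_cols <- setdiff(c(required, taxonomy_names), colnames(data))
  if (length(missing_cols) > 0) {
    problems <- c(problems,
                  paste("missing column(s):",
                        paste(missing_cols, collapse = ", ")))
  }
  if ("date" %in% colnames(data) && !inherits(data$date, "Date")) {
    problems <- c(problems, "date is not of class Date; see as.Date()")
  }
  for (coord in intersect(c("longitude", "latitude"), colnames(data))) {
    if (!is.numeric(data[[coord]])) {
      problems <- c(problems,
                    paste(coord, "is not numeric; see as.numeric()"))
    }
  }
  problems
}

toy <- data.frame(
  date = "2012-01-05",     # character, not yet a Date
  longitude = "-76.61",    # character, not yet numeric
  latitude = 39.29,
  damage_tax = "Flip",
  stringsAsFactors = FALSE
)
before <- check_meltt_input(toy, taxonomy_names = c("damage_tax", "color_tax"))
before   # three problems: missing color_tax, date type, longitude type

# Apply the conversions and add the missing column, then re-check.
toy$date <- as.Date(toy$date)
toy$longitude <- as.numeric(toy$longitude)
toy$color_tax <- "Red"
after <- check_meltt_input(toy, taxonomy_names = c("damage_tax", "color_tax"))
after    # character(0)
```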

Matching Data

Once the taxonomy is formalized, matching several datasets is straightforward. The meltt() function takes four main arguments:

...: input data;

taxonomies =: list object containing the user-input taxonomies;

spatwindow =: the spatial window (in kilometers);

twindow =: the temporal window (in days).

Below we assume that any two events in two different datasets occurring within 4 kilometers and 2 days of each other could plausibly be the same event. This "fuzziness" sets the boundaries on how precisely we believe the spatial location and timing of events are coded. It is usually best practice to vary these specifications systematically to ensure that no one specific combination drives the outcomes of the integration task.

We then assume that event categories map onto each other according to the way that we formalized in the taxonomies outlined above. We fold all this information together using the meltt() function and then store the results in an object named output.

output <- meltt(crash_data1, crash_data2, crash_data3,

taxonomies = crash_taxonomies,

spatwindow = 4,

twindow = 2)

meltt also contains a range of adjustments to offer the user additional controls regarding how the events are matched. These auxiliary arguments are:

smartmatch: when TRUE (default), all available taxonomy levels are used and meltt uses a matching score that ensures that fine-grained agreement is favored over broader agreement, if more than one taxonomy level exists. When FALSE, only specific taxonomy levels are considered.

certainty: specification of the exact taxonomy level to match on when smartmatch = FALSE.

partial: specifies whether matches along only some of the taxonomy dimensions are permitted.

averaging: implements averaging of all values on which events are matched when matching across multiple data.frames. That is, as events are matched dataset by dataset, the metadata is averaged. (Note that this can generate distortion in the output.)

weight: specifies weights for each taxonomy level to increase or decrease the importance of each taxonomy's contribution to the matching score.

At times, one might want to know which taxonomy level is doing the heavy lifting. By turning off smartmatch, and specifying certain taxonomy levels by which to compare events, or by weighting taxonomy levels differently, one is able to better assess which assumptions are driving the final integration results. This can help with fine-tuning the input assumptions for meltt to gain the most valid match possible.

Output

When printed, the meltt object offers a brief summary of the output.

output

# MELTT Complete: 3 datasets successfully integrated.

# ===================================================

# Total No. of Input Observations: 195

# No. of Unique Obs (after deduplication): 140

# No. of Unique Matches: 34

# No. of Duplicates Removed: 55

# ===================================================

In matching the three car crash datasets, there are 195 total entries (i.e. 71 entries from crash_data1, 64 entries from crash_data2, and 60 entries from crash_data3). Of those 195, 140 remain as unique events once duplicates are removed. The other 55 entries were found to be duplicates, identified within 34 unique matches.

Given that meltt objects can be saved and referenced later, the summary function offers a recap of the input parameters and assumptions that underpin the match (i.e. the datasets, the spatiotemporal window, the taxonomies, etc.). Again, information regarding the total number of observations, the number of unique and duplicate entries, and the number of matches found is reported. This time, however, the summary also reports how many of those matches were event-to-event (i.e. events that played out along one time unit, where the date is equal to the end date) and episode-to-episode (i.e. events that played out over a couple of days).

A summary of overlap is also provided, articulating how the different input datasets overlap and where. For example, of the 34 matches, 5 occurred between crash_data1 and crash_data2, 4 between crash_data1 and crash_data3, 4 between crash_data2 and crash_data3, and 21 between all three.
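The reported figures can be cross-checked with a few lines of arithmetic. The sketch assumes that each match keeps exactly one entry, so a match between two datasets removes one duplicate and a match across all three removes two; under that assumption the numbers in the printed summary are mutually consistent.

```r
# Input sizes and match counts as reported in the text.
n_input <- c(crash_data1 = 71, crash_data2 = 64, crash_data3 = 60)
pairwise_matches <- c(d1_d2 = 5, d1_d3 = 4, d2_d3 = 4)
three_way_matches <- 21

total_input <- sum(n_input)                                   # 195
total_matches <- sum(pairwise_matches) + three_way_matches    # 34

# Assumption: one entry is kept per match, so a pairwise match drops
# one duplicate and a three-way match drops two.
duplicates_removed <- sum(pairwise_matches) * 1 + three_way_matches * 2  # 55
unique_after <- total_input - duplicates_removed              # 140

c(total_input = total_input, total_matches = total_matches,
  duplicates_removed = duplicates_removed, unique_after = unique_after)
```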

Visualization

For quick visualizations of the matched output, meltt contains three plotting functions.

plot() offers a bar plot that graphically articulates the unique and overlapping entries. Note that the entries from the leading dataset (i.e. the dataset first entered into meltt) are all black. In this representation, all matching (or duplicate) entries are expressed in reference to the datasets that came before them. Any match found in crash_data2 is with respect to crash_data1, and any in crash_data3 is with respect to crash_data1 and crash_data2.

plot(output)

tplot() offers a time series plot of the meltt output. The plot works as a reflection, where raw counts of the unique entries are plotted right-side up and the raw counts of the removed duplicates are plotted below it. This offers a quick snapshot of when duplicates are found. Temporal clustering of duplicates may indicate an issue with the data and/or the input assumptions, or it's potentially evidence of a unique artifact of the data itself.

Users can specify the temporal unit by which the data should be binned (day, week, month, year). Given that the data only covers one month, we'll look at the output by day.

tplot(output, time.unit="day")

Similarly, mplot() presents a summary of the spatial distribution of the data by plotting the spatial points onto a Google map. Events where matches were detected are labeled by blue diamonds. Again, the goal is to get a sense of the spatial distribution of the matches to both identify any clustering/disproportionate coverage in where matches are located, and to also get a sense of the spread of the integrated output.

mplot(output)

mplot() also contains an interactive = argument that, when set to TRUE, generates an interactive Google map in the user's primary browser for more granular inspection of the spatial matches. Information regarding the input criteria by which each entry was assessed (e.g. the taxonomy inputs) is retained and can be referenced by hovering over the point with the mouse.

Extracting Data

meltt provides two methods for extracting data from the output object.

meltt.data() returns the de-duplicated data along with any necessary columns the user might need. This is the primary function for extracting matched data and moving on with subsequent analysis. The columns = argument takes any vector of variable names and returns those variables in the output. If no variables are specified, meltt returns the spatio-temporal and taxonomy variables that were employed during the match. In addition, the function returns a unique event and data ID for reference.

uevents <- meltt.data(output,columns = c("date","model_tax"))

head(uevents) # first 6 entries

#   meltt.dataID  meltt.eventID  date        model_tax
# 1 crash_data1   1              2012-01-01  Full-Sized Pick-Up Truck
# 2 crash_data1   2              2012-01-01  Mid-Size Car
# 3 crash_data1   3              2012-01-02  Cargo Van
# 4 crash_data1   4              2012-01-02  Mini Suv
# 5 crash_data1   5              2012-01-02  Mid-Size Car
# 6 crash_data1   6              2012-01-02  Cargo Van

dim(uevents) # the unique events after de-duplication

# [1] 140 4

meltt.duplicates(), on the other hand, returns a data frame of all events that matched up. This provides a quick way of examining and assessing the events that matched. Since the quality of any match is only as good as the assumptions we input, it is key that the user qualitatively evaluate the meltt output to assess whether any assumptions should be adjusted. Like meltt.data(), the columns = argument can be customized to return variables of interest.

Note that the data is presented differently than in meltt.data(); here each dataset (and its corresponding variables) is presented in a separate column. This representation is chosen for ease of comparison. For example, the entry for row 1 denotes that the 55th entry in crash_data2 matched with entry 57 from crash_data3, whereas no entry from crash_data1 matched (as indicated by a "dataID" and "eventID" of 0 and a "date" of NA). The requested columns are intended to assist with validation.

Inside the Output Object

Like most S3 objects, the output from meltt is a nested list containing a range of useful information. It retains the original input data, the taxonomies, and the specification assumptions, as well as lists of contender events (i.e. events that were flagged as potential matches but did not match as closely as another event). Note that we are expanding meltt's functionality to include more posterior functions that ease extraction of this information, but for now, it can simply be accessed using the usual $ operator.
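For readers less familiar with S3 objects, the sketch below shows the general pattern for exploring a classed nested list. The object toy_output is a stand-in built by hand, and its element names (inputs, results) and class name are made up for this example; they are not the element names of a real meltt object, which are best discovered by running names() on the actual output.

```r
# Stand-in object: a nested list with an S3 class, NOT a real meltt result.
# The element names below are hypothetical.
toy_output <- structure(
  list(
    inputs = list(spatwindow = 4, twindow = 2),
    results = data.frame(id = 1:3)
  ),
  class = "toy_meltt"
)

names(toy_output)               # top-level elements
str(toy_output, max.level = 1)  # one-level overview of the structure
toy_output$inputs$twindow       # drill down with the $ operator
```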