meltt: Matching Event Data by Location, Time and Type

Description

meltt merges and disambiguates event data based on spatiotemporal co-occurrence and secondary event characteristics. It can account for intrinsic "fuzziness" in the coding of events through the incorporation of user-specified taxonomies and adjusts for different degrees of geospatial and temporal precision by allowing for the specification of spatiotemporal "windows".

Usage

meltt(..., taxonomies, twindow, spatwindow, smartmatch = TRUE, certainty = NA,
      partial = 0, averaging = FALSE, weight = NA, silent = FALSE)

Arguments

...

input datasets. See Details.

taxonomies

list of user-specified taxonomies. Each taxonomy maps onto the variable in the input data that has the same name as the taxonomy. See Details.

twindow

specification of the temporal window in days. See Details.

spatwindow

specification of a spatial window in kilometers. See Details.

smartmatch

implement matching using all available taxonomy levels. When FALSE, matching occurs only on the taxonomy level specified by certainty. Default = TRUE. See Details.

certainty

specification of the exact taxonomy level to match on when smartmatch = FALSE. Default = NA. See Details.

partial

specifies whether matches along less than the full taxonomy dimensions are permitted. Default = 0. See Details.

averaging

implement averaging of all values that events are matched on when matching across multiple data frames. Default = FALSE. See Details.

weight

weights for each taxonomy that increase or decrease the importance of that taxonomy's contribution to the matching score. Default = NA. See Details.

silent

Boolean specifying whether or not messages are displayed. Default = FALSE.

Details

meltt expects input datasets to be of class data.frame. At a minimum, each dataset must have the columns "date" (formatted as "YYYY-mm-dd" or "YYYY-mm-dd hh:mm:ss"), "longitude" and "latitude" (both in degrees; we assume global coordinates formatted in WGS-84), and the columns representing the dimensions used in the matching taxonomies. Note that meltt requires at least two datasets as input and can otherwise, in principle, handle any number of datasets.
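As a minimal sketch of the input format described above, the following base R snippet builds a toy dataset and checks it for the required columns. The names events_a, event_type and check_input are hypothetical and used only for illustration; the check is not part of meltt and does not reproduce its internal validation.

```r
# Hypothetical toy dataset with the minimal columns described above.
events_a <- data.frame(
  date       = c("2015-03-01", "2015-03-02 14:30:00"),
  latitude   = c(38.9072, 39.2904),
  longitude  = c(-77.0369, -76.6122),
  event_type = c("collision", "rollover"),  # hypothetical taxonomy variable
  stringsAsFactors = FALSE
)

# Illustrative helper (an assumption, not a meltt function): verifies that the
# required columns exist, dates parse, and coordinates are plausible degrees.
check_input <- function(df, taxonomy_vars = character(0)) {
  required <- c("date", "longitude", "latitude", taxonomy_vars)
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  parsed <- as.Date(as.character(df$date), format = "%Y-%m-%d")
  if (anyNA(parsed)) {
    stop("dates must start with the pattern YYYY-mm-dd")
  }
  if (any(abs(df$latitude) > 90) || any(abs(df$longitude) > 180)) {
    stop("coordinates must be given in degrees")
  }
  invisible(TRUE)
}

check_input(events_a, taxonomy_vars = "event_type")
```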

The input taxonomies are expected to be of class list, containing one or more taxonomy data frames. Each taxonomy must have a column denoting the "base.category" (i.e. the version of the variable that appears in each data frame) and a "data.source" column that matches the object name of the dataset containing those variables. All subsequent columns in each taxonomy denote the user-specified levels of generalization, which capture the degree to which the taxonomy category generalizes out. The leftmost level column must contain the most granular level and the rightmost the broadest. An error is issued if the taxonomy levels are not in the correct order.
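The taxonomy layout described above might look like the following sketch. All object names, category labels and level column names (events_a, events_b, event_type, Level_1, Level_2) are hypothetical; the column names "data.source" and "base.category" are taken from the description above and should be compared against the taxonomies shipped with the installed package. The ordering check is only a rough proxy: it counts distinct labels per level and is not the check meltt itself performs.

```r
# Hypothetical taxonomy for a variable called "event_type" that appears in two
# hypothetical input datasets, events_a and events_b.
event_tax <- data.frame(
  data.source   = c("events_a", "events_a", "events_b", "events_b"),
  base.category = c("collision", "rollover", "Crash", "Overturn"),
  Level_1       = c("multi-vehicle", "single-vehicle",
                    "multi-vehicle", "single-vehicle"),  # most granular level
  Level_2       = c("accident", "accident",
                    "accident", "accident"),             # broadest level
  stringsAsFactors = FALSE
)

# The list element is named after the variable it generalizes.
taxonomies <- list(event_type = event_tax)

# Rough sanity check: the number of distinct labels should not increase when
# moving from the leftmost (granular) to the rightmost (broad) level.
level_cols <- setdiff(names(event_tax), c("data.source", "base.category"))
n_unique <- sapply(event_tax[level_cols], function(x) length(unique(x)))
levels_ordered <- all(diff(n_unique) <= 0)
levels_ordered
```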

The twindow and spatwindow inputs specify the temporal and spatial distances within which entries are considered to be spatio-temporally proximate and, with that, potential matches (i.e. duplicate entries). For all potential matches, meltt then leverages the secondary information about events (formalized through the mapping of categories specified in taxonomies) to identify the most likely matches.
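To illustrate the idea of a spatiotemporal window, the sketch below tests whether two events fall within twindow days and spatwindow kilometers of each other. It assumes a great-circle (haversine) distance on a sphere of radius 6371 km and inclusive window boundaries; both are assumptions made for this example, and the helpers haversine_km and is_proximate are hypothetical rather than meltt's own implementation.

```r
# Great-circle distance in kilometers between two points given in degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# TRUE if both the temporal and the spatial distance lie inside the windows.
is_proximate <- function(e1, e2, twindow, spatwindow) {
  dt_days <- abs(as.numeric(difftime(as.Date(e2$date), as.Date(e1$date),
                                     units = "days")))
  dist_km <- haversine_km(e1$latitude, e1$longitude,
                          e2$latitude, e2$longitude)
  dt_days <= twindow && dist_km <= spatwindow
}

# Two hypothetical events one day and roughly 1.1 km apart.
e1 <- list(date = "2015-03-01", latitude = 39.00, longitude = -77.00)
e2 <- list(date = "2015-03-02", latitude = 39.01, longitude = -77.00)

is_proximate(e1, e2, twindow = 1, spatwindow = 3)
is_proximate(e1, e2, twindow = 1, spatwindow = 1)
```

Under these assumptions the first call treats the pair as a potential match and the second does not, because the events are slightly more than 1 km apart.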

meltt by default uses smartmatch, which leverages all taxonomy levels, i.e., it accepts agreement on any taxonomy level while discounting coarser agreement through a matching score. When smartmatch is set to FALSE, a certainty must be set, specifying which taxonomy level (i.e., 1 for the base level of the taxonomy, 2 for the next broader level, etc.) two events must agree on to be considered a match.
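As a hedged usage sketch, the call below turns smartmatch off and requires agreement on a single taxonomy level. It reuses the crashMD example data from the Examples section and assumes the package and its dependencies are installed; the object name strict is hypothetical, and certainty = 1 is chosen only because every taxonomy has a base level.

```r
# Runs only if the meltt package is available.
if (requireNamespace("meltt", quietly = TRUE)) {
  library(meltt)
  data(crashMD)

  # Match on the base taxonomy level only instead of scoring all levels.
  strict <- meltt(crash_data1, crash_data2, crash_data3,
                  taxonomies = crash_taxonomies,
                  twindow = 1, spatwindow = 3,
                  smartmatch = FALSE, certainty = 1)
  print(strict)
}
```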

partial specifies the number of dimensions along which no matching information is permitted for events to still be considered a potential match. In this case, every dimension not matched is assigned the worst matching score in the calculation of the overall fit. By default, all dimensions are considered, i.e. partial=0.

averaging allows users to take the average of all input information (date, longitude, latitude, taxonomy, etc.) when merging more than one dataset. When set to FALSE, merged events use the input information of the first (leftmost) dataset in the order in which the data were passed.

weight allows users to weight matches for different taxonomies in order to discount one (or several) event dimensions relative to others, or vice versa. If weight=NA, the package assumes homogeneous weights of 1. If weights are specified manually, they must sum to the total number of taxonomy dimensions used, i.e., the weights must average to 1. If not, the package returns an error.
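The weight constraint can be sketched in a few lines of base R. The helper check_weights is hypothetical and only mirrors the rule stated above (default weights of 1, manual weights summing to the number of taxonomy dimensions); it is not the package's own validation code.

```r
# Illustrative helper: returns the weights to use, or stops if the manual
# weights do not sum to the number of taxonomy dimensions.
check_weights <- function(weight, n_taxonomies) {
  # weight = NA means homogeneous weights of 1
  if (length(weight) == 1 && is.na(weight)) {
    return(rep(1, n_taxonomies))
  }
  if (length(weight) != n_taxonomies) {
    stop("one weight per taxonomy dimension is required")
  }
  if (!isTRUE(all.equal(sum(weight), n_taxonomies))) {
    stop("weights must sum to the number of taxonomy dimensions")
  }
  weight
}

check_weights(NA, 3)               # homogeneous weights
check_weights(c(1.5, 1, 0.5), 3)   # valid: the weights sum to 3
```

With three taxonomies, a vector such as c(2, 2, 2) would be rejected by this sketch because it sums to 6 rather than 3.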

Value

Returns an object of class "meltt".

The functions summary, print and plot overload the standard outputs for objects of class "meltt", providing summary information and visualizations specific to the output object. The generic accessor functions meltt_data, meltt_duplicates, tplot and mplot extract various useful features of the integrated data frame: the unique de-duplicated entries, all duplicate entries (or matches), a histogram of the temporal distribution and a map of the integrated output.

An object of class "meltt" is a list containing at least the following components. First, a list named "processed" that contains all outputs of the integration process:

complete_index

a data.frame of initial input data (location information, time stamp, and secondary criteria) converted to a numeric matrix. The matrix is what is processed by the meltt algorithm.

deduplicated_index

a posterior data.frame of initial input data converted to a numeric matrix, with duplicate entries removed. It further contains information about "episodal events" (i.e. events that span more than one time unit with a start and end date) that potentially match unique events but could not be automatically classified as matches or non-matches.

event_matched

Numeric matrix containing indices for each matching event from each input dataset. The leading dataset is the leftmost one; every matching event to its right is identified as a duplicate of the initial entry and is removed.

event_contenders

Numeric matrix containing indices for each "runner up" event from each input dataset that was identified as a potential but less optimal match based on its matching score.

episode_matched

Numeric matrix containing indices for each matching "episode" (i.e. an event that spans more than one time unit with a start and end date) from each input dataset. Only contains matches between episodes. Matches between events and episodes must be manually reviewed by users (see meltt_inspect).

episode_contenders

Numeric matrix containing indices for each "runner up" episode from each input dataset that was identified as a potential but less optimal match based on its matching score.

Second, it contains a comprehensive summary of the input data, parameters and taxonomy specifications. Specifically it returns:

inputData

List containing the original object name and information of the input data prior to integration.

parameters

List containing information on all input parameters on which the data was integrated.

inputDataNames

Vector of the object names of the input datasets. These names are carried through the integration process to differentiate between input datasets. The index keys contained in the numeric matrix representations of the data follow the order the data was entered.

taxonomy

List containing the taxonomy (secondary assumption criteria) datasets used to integrate the input data. The list contains: the names of the taxonomies (which must match the names of the variables they seek to generalize in the input data), an integer of the number of input taxonomies, a vector containing information on the depth (i.e. the number of columns) of each taxonomy, and a list of the original input taxonomies.
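A short, hedged sketch for exploring the returned object: it reuses the crashMD example data from the Examples section, assumes the package is installed, and relies only on the component names listed above. str() is used so that the snippet still runs if a component is named differently in the installed version.

```r
# Runs only if the meltt package is available.
if (requireNamespace("meltt", quietly = TRUE)) {
  library(meltt)
  data(crashMD)
  output <- meltt(crash_data1, crash_data2, crash_data3,
                  taxonomies = crash_taxonomies, twindow = 1, spatwindow = 3)

  # Top-level components of the "meltt" object
  str(output, max.level = 1)

  # Names of the integration outputs stored in the "processed" list
  print(names(output$processed))
}
```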

Author(s)

Karsten Donnay and Eric Dunford.

References

Karsten Donnay, Eric T. Dunford, Erin C. McGrath, David Backer, David E. Cunningham. (2018). "Integrating Conflict Event Data." Journal of Conflict Resolution.

See Also

meltt_data, meltt_duplicates, meltt_inspect, tplot, mplot

Examples


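library(meltt)

# Load the example crash data and taxonomies shipped with the package
data(crashMD)

# Integrate the three datasets with a 1-day temporal and 3-km spatial window
output = meltt(crash_data1, crash_data2, crash_data3,
               taxonomies = crash_taxonomies, twindow = 1, spatwindow = 3)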
plot(output)

# Extract De-duplicated events
dataset = meltt_data(output)
head(dataset)


meltt documentation built on Oct. 27, 2022, 1:05 a.m.