    knitr::opts_chunk$set(collapse = TRUE, echo = TRUE, comment = "#>")
    library(crimelinkage)
    CACHE = FALSE  # Set TRUE for development, then set to FALSE for release

`crimelinkage` package

The `crimelinkage` package^[This work was supported by the National Institute of Justice project 2010-DE-BX-K255: *Statistical Methods for Crime Series Linkage*.] is a set of tools to help crime analysts and researchers with tasks related to crime linkage. This package specifically includes methods for criminal case linkage, crime series identification and clustering, and suspect identification.
More details about the methodology and its performance on actual crime data can be found in [@NIJreport; @CaseLinkage; @CrimeClust; @Consistency].

The object of the `crimelinkage` package is to provide a statistical approach to criminal linkage analysis that discovers and groups crime events that share a common offender and prioritizes suspects for further investigation. Bayes factors are used to describe the strength of evidence that two crimes are linked. Using concepts from agglomerative hierarchical clustering, the Bayes factors for crime pairs can be combined to provide similarity measures for comparing two crime series. This facilitates crime series clustering, crime series identification, and suspect prioritization.
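To make the idea of combining pairwise Bayes factors concrete, here is a minimal base R sketch. The log Bayes factor values and the helper `series_similarity()` are made up for illustration and are not part of `crimelinkage`; the package's own linkage rules may differ in detail (for example, in whether Bayes factors or their logs are averaged).

```r
# Hypothetical log Bayes factors for every pair (crime in series A, crime in series B).
# Rows index the 3 crimes in series A, columns the 2 crimes in series B.
logBF <- matrix(c( 1.2, -0.4,
                   0.3,  2.1,
                  -1.0,  0.5),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("A", 1:3), paste0("B", 1:2)))

# Linkage rules borrowed from agglomerative hierarchical clustering
# (illustrative helper, not a crimelinkage function)
series_similarity <- function(logBF, method = c("single", "complete", "average")) {
  method <- match.arg(method)
  switch(method,
         single   = max(logBF),   # strongest single pairwise link
         complete = min(logBF),   # weakest pairwise link
         average  = mean(logBF))  # average over all pairs
}

sim <- sapply(c("single", "complete", "average"),
              function(m) series_similarity(logBF, m))
print(sim)
```

Single linkage rewards two series that share even one very similar pair of crimes, while average linkage asks for similarity across the whole series.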

Make sure you have installed and loaded the `crimelinkage` package.

    #- If the crimelinkage package is not installed, type: install.packages("crimelinkage")
    library(crimelinkage)

Crime linkage requires two primary data sets: one with crime incidents and another that links offenders with the crimes they have committed. `crimelinkage` has some fictitious data (named `crimes` and `offenders`) that can help illustrate how the various linkage methods work.

The `crimes` data is a data frame of 490 simulated crime events. You can get the first 6 records by typing:

    data(crimes)
    head(crimes)

Each crime has an ID number, spatial coordinates, 3 categorical MO features, and a temporal window of when the crime could have occurred. More details about these fields can be found by typing `?crimes`. An `NA` indicates the record is missing a value.

The `offenders` data links the solved crimes to the offender(s).

    data(offenders)
    head(offenders)

The crime series of an individual offender can be extracted using `getCrimeSeries()`. For example, offender O:40 has committed `r length(getCrimeSeries("O:40",offenders)$crimeID)` known crimes.

    cs = getCrimeSeries("O:40",offenders)
    cs

If we want to see the actual incident data, use the function `getCrimes()`:

    getCrimes("O:40",crimes,offenders)

Co-offenders can be identified with the `getCriminals()` function. For example, offender O:40 has `r length(getCriminals(cs$crimeID,offenders))` known co-offenders:

    getCriminals(cs$crimeID,offenders)

We can also examine all offenders of a *particular* crime:

    getCriminals("C:78",offenders)

*Pairwise case linkage* is the process of determining if a given pair of crimes share the same offender. This is a binary classification problem where each crime pair is considered independently. In order to carry out pairwise case linkage, crime pairs must be compared and evidence variables created.

The first step is to combine all the crime and offender data using the `makeSeriesData()` function:

    seriesData = makeSeriesData(crimedata=crimes,offenderTable=offenders)

Note that the data.frame passed into `crimedata=` must at minimum have columns (exactly) named: `crimeID`, `DT.FROM`, and `DT.TO` (if the times of all crimes are known exactly, then `DT.TO` is not needed). The `crimeID` column should be a character vector indicating the crime identification number. The `DT.FROM` and `DT.TO` columns should be datetime objects of class `POSIXct`. See the functions `strptime()`, `as.POSIXct()`, and the package `lubridate` for details on converting date and time data into the proper format.
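As a minimal sketch of that conversion, the following base R code builds a small table with the required column names and types. The `raw` data frame, its `from` and `to` columns, the date format string, and the UTC time zone are all assumptions made for illustration; real records will need their own format and time zone.

```r
# Hypothetical raw incident table with date-times stored as text
raw <- data.frame(
  crimeID = c("C:1", "C:2"),
  from    = c("2012-03-01 22:00", "2012-03-05 08:30"),
  to      = c("2012-03-02 06:00", "2012-03-05 08:30"),
  stringsAsFactors = FALSE
)

# Rename to the required column names and convert the text to POSIXct
crimedata <- data.frame(
  crimeID = as.character(raw$crimeID),
  DT.FROM = as.POSIXct(raw$from, format = "%Y-%m-%d %H:%M", tz = "UTC"),
  DT.TO   = as.POSIXct(raw$to,   format = "%Y-%m-%d %H:%M", tz = "UTC"),
  stringsAsFactors = FALSE
)
str(crimedata)
```

The second record has identical start and end times, which is how an exactly known event time would appear when both columns are kept.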

From the series data, we can get some useful information about the offenders and crime series. For example,

    nCrimes = table(seriesData$offenderID)  # length of each crime series
    table(nCrimes)                          # distribution of crime series length
    mean(nCrimes>1)                         # proportion of offenders with multiple crimes

provides some information on the length of crime series. From this, we see that `r table(nCrimes)[1]` offenders were charged with only one crime, `r table(nCrimes)[2]` were charged with two crimes, and the proportion of offenders that had multiple crimes in the data is `r round(mean(nCrimes>1),3)`.

The rate of co-offending can also be examined with the following commands:

    nCO = table(seriesData$crimeID)  # number of co-offenders per crime
    table(nCO)                       # distribution of number of co-offenders
    mean(nCO>1)                      # proportion of crimes with multiple co-offenders

We see that `r table(nCO)[1]` crimes were committed by a single offender and `r 100*round(mean(nCO>1),3)`% of crimes had multiple offenders.

Case linkage compares the variables from two crimes to assess their level of similarity and distinctiveness. To facilitate this comparison, the crime variables for a pair of crimes need to be transformed and converted into *evidence variables* that are more suitable for case linkage analysis [@CaseLinkage].

The first step is to get the crime pairs we want to compare. The function `makePairs()` is used to get the linked and unlinked pairs that will be used to build a case linkage model.

    set.seed(1)  # set random seed for replication
    allPairs = makePairs(seriesData,thres=365,m=40)

The crime pairs are restricted to occur no more than 365 days apart. The threshold `thres` is used to minimize the effects of offenders that may change their preferences (e.g., home location) over time. The crime pairs only include solved crimes.

The linked pairs (crimes committed by same offender) have `type = 'linked'`. For these, the `weight` column corresponds to $1/N$, where $N$ is the number of crime pairs in a series. These weights will be used when we construct the case linkage models. This attempts to put all crime series on an equal footing by downweighting the crime pairs from offenders with large crime series.

The unlinked pairs (`type = 'unlinked'`) were generated by selecting `m = 40` crimes from each crime group and comparing them to crimes in different crime groups. Crime groups are the connected components of the offender graph. This approach helps prevent (unknown) linked crimes from being included in the unlinked data. The weights for unlinked pairs are 1.
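One way to picture crime groups is with a small label-propagation sketch in base R. The toy `offenderTable` is hypothetical, and this is only an illustration of connected components; the package may compute them differently.

```r
# Hypothetical offender table; O:1 and O:2 co-offend on C:2
offenderTable <- data.frame(
  offenderID = c("O:1", "O:1", "O:2", "O:2", "O:3"),
  crimeID    = c("C:1", "C:2", "C:2", "C:3", "C:4"),
  stringsAsFactors = FALSE
)

# Label propagation: start with one label per row, then repeatedly give all
# rows of the same offender, and all rows of the same crime, their smallest label
group <- seq_len(nrow(offenderTable))
repeat {
  old   <- group
  group <- ave(group, offenderTable$offenderID, FUN = min)
  group <- ave(group, offenderTable$crimeID,    FUN = min)
  if (all(group == old)) break   # stop once no label changes
}

# Crimes with the same label belong to the same crime group
crimeGroup <- tapply(group, offenderTable$crimeID, min)
print(crimeGroup)
```

Crimes C:1, C:2, and C:3 end up in one group because O:1 and O:2 are connected through the shared crime C:2, so a pair such as (C:1, C:3) would not be used as an unlinked pair even though no single offender is recorded for both.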

This generates `r sum(allPairs$type=="linked")` linked pairs and `r sum(allPairs$type=="unlinked")` unlinked pairs, with weights.

The next step is to compare all crime pairs using the function `compareCrimes()`. Crime $V_i$ is compared to crime $V_j$ (rows from crime incident data) and evidence variables are created that measure the similarity or dissimilarity of the two crimes across their attributes. The list `varlist` specifies the column names of `crimes` that provide the spatial, temporal, and categorical crime variables. More information about how the evidence variables are made can be found in the help for `?compareCrimes`.

    varlist = list( spatial = c("X", "Y"),
                    temporal = c("DT.FROM","DT.TO"),
                    categorical = c("MO1", "MO2", "MO3"))  # crime variables list
    X = compareCrimes(allPairs,crimedata=crimes,varlist=varlist,binary=TRUE)  # Evidence data
    Y = ifelse(allPairs$type=='linked',1,0)  # Linkage indicator. 1=linkage, 0=unlinked

This information is now in the correct format for building classification models for case linkage.

    head(X)
    table(Y)

Now that we have the linkage data in a suitable format, we can build case linkage (i.e., classification) models and compare their performance. First, partition the linkage data into training (70\%) and testing (30\%) sets and create a data.frame `D.train` containing the training data:

    set.seed(3)  # set random seed for replication
    train = sample(c(TRUE,FALSE),nrow(X),replace=TRUE,prob=c(.7,.3))  # assign pairs to training set
    test = !train
    D.train = data.frame(X[train,],Y=Y[train])  # training data

Technically, any binary classification model can be used for case linkage (as long as it accepts weighted observations^[If a classification method can't handle weighted observations then sampling can be used.]). We will illustrate with three models: logistic regression (with stepwise selection), naive Bayes, and boosted trees.
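The sampling alternative mentioned in the footnote can be sketched in base R as follows. The toy data frame `D`, its values, and the number of draws are hypothetical; the idea is simply to draw rows with probability proportional to their weight, so that an unweighted classifier sees roughly the same balance that a weighted fit would.

```r
set.seed(10)  # for a reproducible illustration

# Hypothetical training pairs: three down-weighted linked pairs, three unlinked pairs
D <- data.frame(
  spatial = c(0.2, 0.4, 0.3, 5.1, 7.8, 6.0),
  Y       = c(1, 1, 1, 0, 0, 0),
  weight  = c(1/3, 1/3, 1/3, 1, 1, 1)
)

# Draw rows with replacement, with probability proportional to weight
n_draw <- 1000
idx <- sample(nrow(D), size = n_draw, replace = TRUE, prob = D$weight)
D.resampled <- D[idx, c("spatial", "Y")]

# The share of linked pairs should be close to sum(weight[Y == 1]) / sum(weight) = 0.25
mean(D.resampled$Y)
```

The resampled data can then be passed to a classifier without a `weights=` argument, at the cost of some extra sampling variability.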

Make the formula for the models:

    vars = c("spatial","temporal","tod","dow","MO1","MO2","MO3")
    fmla.all = as.formula(paste("Y ~ ", paste(vars, collapse= "+")))
    fmla.all

Fit stepwise logistic regression using the `glm()` and `step()` functions:

    fit.logistic = glm(fmla.all,data=D.train,family=binomial,weights=weight)
    fit.logisticstep = step(fit.logistic,trace=0)
    summary(fit.logisticstep)

The stepwise procedure chooses the spatial, temporal, and time-of-day distances, along with MO1, as the relevant predictor variables.

A naive Bayes model can be constructed with the `naiveBayes()` function. This function fits a non-parametric naive Bayes model which bins the continuous predictors and applies shrinkage to control the complexity. The `df=` argument is the degrees of freedom for each predictor, `nbins=` is the number of bins to use for continuous predictors, and `partition=` controls how the bins are constructed. The resulting component plots can be made with the `plotBF()` function.

Here we fit a naive Bayes model with up to 10 degrees of freedom for each component, using 15 bins that are designed to have an approximately equal number of training points in each bin (quantile binning).

    NB = naiveBayes(fmla.all,data=D.train,weights=weight,df=10,nbins=15,partition='quantile')
    #- Component Plots
    plot(NB,ylim=c(-2,2))  # ylim sets the limits of the y-axis
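To give a feel for what a binned naive Bayes component looks like, here is a rough base R sketch for a single continuous predictor. The simulated data are hypothetical, and the `+ 1` pseudo-count is an ad hoc smoother standing in for the package's shrinkage, whose exact form is not reproduced here.

```r
set.seed(2)

# Hypothetical evidence variable: linked pairs tend to be spatially closer
n <- 2000
Y <- rbinom(n, 1, 0.3)
spatial <- ifelse(Y == 1, rexp(n, rate = 1), rexp(n, rate = 0.25))

# Quantile binning: breakpoints at equally spaced quantiles of the predictor,
# so each bin holds roughly the same number of training points
nbins  <- 15
breaks <- quantile(spatial, probs = seq(0, 1, length.out = nbins + 1))
bin    <- cut(spatial, breaks = breaks, include.lowest = TRUE)

# Per-bin log Bayes factor: log P(bin | linked) - log P(bin | unlinked).
# The +1 pseudo-count is an assumption made here, not the package's shrinkage.
n1 <- table(bin[Y == 1]) + 1
n0 <- table(bin[Y == 0]) + 1
logBF <- log(n1 / sum(n1)) - log(n0 / sum(n0))

round(logBF, 2)
```

Under the naive Bayes assumption, the overall log Bayes factor for a crime pair is the sum of such components across predictors, which is why each component can be inspected separately.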

The naive Bayes model also gives the strongest influence to the spatial and temporal distances and MO1.

Fit boosted trees with the `gbm()` function in the `gbm` package. Install this package first if it is not already installed.

    library(gbm)  # if not installed type: install.packages("gbm")
    set.seed(4)
    fit.gbm = gbm(fmla.all,distribution="bernoulli",data=D.train,weights=weight,
                  shrinkage=.01,n.trees=500,interaction.depth=3)
    nt.opt = fit.gbm$n.trees  # see gbm.perf() for better options
    #- Relative influence and plot
    print(relative.influence(fit.gbm,n.trees=nt.opt))
    par(mfrow=c(2,4))
    for(i in 1:7) plot(fit.gbm,i)

The boosted tree model uses an interaction depth of 3 (allowing up to 3-way interactions). It finds that the spatial and temporal distance variables have the strongest influence.

The models can be evaluated on the test data.

    D.test = data.frame(Y=Y[test],X[test,])
    X.test = D.test[,vars]
    Y.test = D.test[,"Y"]

Now, we will predict the values of the Bayes factor (up to a constant) from each method and store them in an object called `BF`. The predictions are evaluated with `getROC()`:

    #- Predict the Bayes factor for each model
    BF = data.frame(
      logisticstep = predict(fit.logisticstep,newdata=X.test,type='link'),
      nb = predict(NB,newdata=X.test),
      gbm = predict(fit.gbm,newdata=X.test,n.trees=nt.opt)
      )
    #- Evaluation via ROC like metrics
    roc = apply(BF,2,function(x) getROC(x,Y.test))
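The phrase "up to a constant" can be made explicit with Bayes' rule. The notation here is an assumption introduced for illustration: $L$ and $U$ denote the linked and unlinked classes, and $x$ the evidence variables for a crime pair.

$$
\log \frac{P(L \mid x)}{P(U \mid x)} = \log \frac{f(x \mid L)}{f(x \mid U)} + \log \frac{P(L)}{P(U)}
$$

The left-hand side is the log posterior odds, which is what a logistic regression linear predictor (`type='link'`) estimates. The first term on the right is the log Bayes factor, and the second is the log prior odds, which does not depend on $x$. Because that constant shifts every pair's score by the same amount, it has no effect on how the pairs are ranked.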

For each model, the `getROC()` function orders the crime pairs in the testing set according to their estimated Bayes factor, and the predictive performance is determined from the number of actual linkages found in the ordered list.
The first plot shows the proportion of all linked crimes that are correctly identified (True Positive Rate) if a certain number of crime pairs are investigated.

    with(roc[[1]], plot(Total,TPR,type='l',xlim=c(0,1000),ylim=c(0,1),las=1,
                        xlab='Total Cases Examined',ylab='True Positive Rate'))
    grid()
    for(i in 2:length(roc)){
      with(roc[[i]], lines(Total,TPR,col=i))
    }
    legend('topleft',names(roc),col=1:length(roc),lty=1, cex=.75,bty="n")
    title("Proportion of overall linked crimes captured")

The next plot shows the proportion of the ordered crime pairs that are actual linkages (Precision) for a given number of cases examined.

    with(roc[[1]], plot(Total,PPV,type='l',xlim=c(0,1000),ylim=c(0,1),las=1,
                        xlab='Total Cases Examined',ylab='Precision'))
    grid()
    for(i in 2:length(roc)){
      with(roc[[i]], lines(Total,PPV,col=i))
    }
    legend('topright',names(roc),col=1:length(roc),lty=1, cex=.75,bty="n")
    title("Proportion of identified crimes that are actually linked")
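The two quantities plotted above can be computed by hand for a small example. The helper `roc_sketch()` and the toy scores are hypothetical; this is a sketch of the ordering idea and is not the package's `getROC()` implementation, which may return additional columns or handle ties differently.

```r
# Hypothetical helper (not the package's getROC): rank pairs by score and
# accumulate the linkages found as more pairs are examined
roc_sketch <- function(score, y) {
  ord   <- order(score, decreasing = TRUE)   # highest Bayes factor first
  y     <- y[ord]
  Total <- seq_along(y)                      # number of pairs examined
  TP    <- cumsum(y)                         # true linkages found so far
  data.frame(Total = Total,
             TPR = TP / sum(y),              # share of all linkages captured
             PPV = TP / Total)               # share of examined pairs that are linked
}

# Five toy crime pairs: estimated (log) Bayes factor and true linkage status
score <- c(2.3, -0.5, 1.1, 0.4, -1.7)
y     <- c(1,    0,   0,   1,    0)
res   <- roc_sketch(score, y)
print(res)
```

In this toy example both true linkages are found within the top three ranked pairs, so the true positive rate reaches 1 at `Total = 3` while the precision there is 2/3.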

The purpose of this analysis is to illustrate some capabilities of the `crimelinkage` package for **pairwise case linkage**.
More functionality will be shown in other vignettes.
