Crime Series Identification and Clustering"
In crimelinkage: Statistical Methods for Crime Series Linkage

knitr::opts_chunk$set(collapse = TRUE, echo=TRUE,comment = "#>",
                      cache=FALSE)
library(crimelinkage)

Introduction to Crime Series Linkage

Statistical linkage and clustering of criminal events can be used by crime analysts to create of lists of potential suspects for an unsolved crime, identify groups of crimes that may have been committed by the same individuals or group of individuals, for offender profiling, and for predicting future events. Pairwise case linkage attempts to establish if a \emph{pair} of crimes share a common offender. In practice, there is more interest in crime series linkage, which attempts to identify the set of crimes committed by a common offender. For example, a crime analyst may need to identify the additional crimes that are part of a crime series (crime series identification) or discover all of the crime series in a criminal incident database (crime series clustering).

Crime series clustering is fundamentally a clustering problem where we seek to group the crimes into clusters that correspond to a serial offender (or group of offenders). Instead of simultaneously clustering all crimes in a criminal database, Crime series identification is focused on identifying the additional unsolved crimes that are part of an existing crime series. Crime series identification can be useful, for example, in interrogations [@Adderly-Musgrove-2003; @Adderly-2004] by providing investigators a list of additional crimes (similar to the known crimes) that a suspect may be responsible for. A crime analyst could also use crime series identification when investigating a crime series or suspected crime series to generate a list of additional crimes to jointly investigate.

The crimelinkage package provides several tools for crime series identification and clustering based on methods from hierarchical and model-based clustering.

Hierarchical Based Approaches

The hierarchical-based approaches to crime series linkage are algorithmic in nature and involve creating measures of similarity between sets of crimes. The pairwise (crime-crime) similarity is a function of the Bayes factors obtained from case linkage. The group (series-series) similarity scores are based on a linkage method from hierarchical cluster analysis (e.g., single, complete, average linkage).

Because the hierarchical-based crime series clustering and crime series identification methods are based on the output from pairwise case linkage, they require having a model for estimating the log Bayes factor for a crime pair. Details for getting a Bayes factor model are given in the vignette Statistical Methods for Crime Series Linkage. The steps are repeated here for convenience. This analysis uses a naive Bayes to estimate the log Bayes factor.

#-- Load the package and get the example crime data
library(crimelinkage)
data(crimes)               # some example crime incident data
data(offenders)            # some example crime offender data
seriesData = makeSeriesData(crimedata=crimes,offenderTable=offenders)

#-- Make Crime Pairs for Case Linkage
set.seed(1)         # set random seed for replication
allPairs = makePairs(seriesData,thres=365,m=40)

#-- Make Evidence Variables for Case Linkage
varlist = list( spatial = c("X", "Y"), 
                temporal = c("DT.FROM","DT.TO"), 
                categorical = c("MO1",  "MO2", "MO3"))    # crime variables list
X = compareCrimes(allPairs,crimedata=crimes,varlist=varlist,binary=TRUE) # Evidence data
Y = ifelse(allPairs$type=='linked',1,0)      # Linkage indicator. 1=linkage, 0=unlinked


#-- Get Training Data
set.seed(3)                                        # set random seed for replication
train = sample(c(TRUE,FALSE),nrow(X),replace=TRUE,prob=c(.7,.3))  # assign pairs to training set
test = !train
D.train = data.frame(X[train,],Y=Y[train])          # training data

#-- Fit naive Bayes model and make estimateBF() function 
vars = c("spatial","temporal","tod","dow","MO1","MO2","MO3") 
fmla.all = as.formula(paste("Y ~ ", paste(vars, collapse= "+")))
NB = naiveBayes(fmla.all,data=D.train,weights=weight,df=10,nbins=15,partition='quantile')

estimateBF <- function(X){           # estimateBF() returns the estimated log Bayes factor
  predict(NB,newdata=X)
}

Agglomerative Hierarchical Crime Series Clustering

Hierarchical clustering is an algorithmic approach to crime series cluster analysis that sequentially forms a hierarchy of cluster solutions. The agglomerative approach starts with every observation (e.g., crime incident) in its own cluster. Then it sequentially merges the two closest clusters to form a new larger cluster. This process is repeated until all observations are in the same cluster or a stopping criterion is met.

This algorithm requires two similarity measures to be specified: the pairwise similarity between two observations and the similarity between two groups of observations. There are three primary approaches to measuring the similarity between groups of observations. \emph{Single linkage}, or nearest neighbor, uses the most similar pair between the two groups as the group similarity measure. In contrast, \emph{complete linkage} uses the least similar pair between two groups as the measure of group similarity. \emph{Average linkage} uses the average similarity between all pairs in the two groups.

The crimelinkage package provides the function crimeClust_hier() for agglomerative hierarchical crime clustering. It uses the log Bayes factor as the pairwise similarity measure. That is, the similarity between crimes $i$ and $j$ is $S(i,j) = \log BF(i,j)$, where $BF(i,j)$ is the estimated Bayes factor for linkage. Then one of: average, single, or complete linkage is used for the similarity between groups.

In this example, we will cluster all of the unsolved crimes from crimes data and display dendrogram of results with plot_hcc() function.

#-- Get unsolved crimes
unsolved = subset(crimes, !crimeID %in% seriesData$crimeID)

#-- Run agglomerative hierarchical crime clustering
tree = crimeClust_hier(unsolved,varlist,estimateBF,linkage='average', binary=TRUE)

#-- Plot results in dendrogram using plot_hcc()
plot_hcc(tree,yticks=seq(-2,6,by=2),type="triangle",hang=.05,main="Average Linkage")

The dendrogram (using average linkage) merges r sum(-tree$height + tree$offset>=4) crime pairs with a log Bayes factor of more than 4.

Examining the crimes C:431 and C:460 (which were the most similar), we can see they are close in space, around 2 weeks apart in time, and have matching values for MO1, MO2, and Mo3.

#-- Examine crimes C:431 and C:460 
subset(crimes,crimeID %in% c('C:431','C:460'))

If there is a particular crime of interest, then the function clusterPath() will find the sequence of merges and scores for the event.

#-- Find path info for crime C:429
cp = clusterPath('C:429',tree)
cp[cp$logBF>0,]                 # only return path for scores > 0

This shows that crime C:429 first get grouped with crime r cp[1,2] with a log Bayes factor of r round(cp[1,1],2). Then this pair gets grouped with r cp[2,2] with an average log Bayes factor of r round(cp[2,1],2).

Hierarchical Based Crime Series Linkage

This approach to crime series identification compares an unsolved crime to every crime in criminal incident database and calculates its similarity as the log Bayes factor (according to the model developed for case linkage). Then it aggregates the similarity scores over the crime groups using single, complete, or average linkage. Single linkage uses the largest score (most similar crime) from each group, complete linkage uses the smallest score (least similar crime) from each group, and average linkage uses the average score as the group score.

To give an example, extract the solved and unsolved crimes from the crimes data.

solved = subset(crimes, crimeID %in% seriesData$crimeID)
unsolved = subset(crimes, !crimeID %in% seriesData$crimeID)

The function seriesID() can be used to find the most similar crime series to the unsolved crime.

crime = unsolved[2,]             # use the 2nd unsolved crime C:392
crime
results = seriesID(crime,solved,seriesData,varlist,estimateBF)
head(results$score)

This shows that the unsolved crime is most similar to the crime(s) in crime group r results$score[1,1] with an average linkage log Bayes factor of r round(results$score[1,'average'],2). To get the crimes and offenders associated with these groups, just use the subset() function with the groups object:

subset(results$groups,group=='12')      # most similar crime series
subset(results$groups,group=='154')     # 2nd most similar series
subset(results$groups,group=='9')       # a series with multiple crimes

We can do this for another unsolved crime

crime4 = unsolved[4,]             # use the 4th unsolved crime
results4 = seriesID(crime4,solved,seriesData,varlist,estimateBF)
head(results4$score)

Because the scores are so low (log Bayes factors around 1), this unsolved crime is not very similar to any other solved crimes in the crime database. Perhaps this is the start of a new crime series?

It is also possible to compare a crime to all unsolved crimes to detect potential unsolved crime series.

#- using crime C:394 (the 4th unsolved crime)
pairs = data.frame(i1=unsolved$crimeID[4],i2=unique(unsolved$crimeID[-4]))  
X = compareCrimes(pairs,unsolved,varlist,binary=TRUE)     # Evidence data
score = data.frame(pairs,logBF=estimateBF(X))  
head(score[order(-score$logBF),])

There are no unsolved crimes that are very similar to this one - probably not enough evidence to link this crime to any others.

This approach also gives similar results to what was obtained from the hierarchical clustering path approach:

C429 = which(unsolved$crimeID %in% 'C:429')       # now use crime C:429
pairs = data.frame(i1=unsolved$crimeID[C429],i2=unique(unsolved$crimeID[-C429]))  
X = compareCrimes(pairs,unsolved,varlist,binary=TRUE)     # Evidence data
score = data.frame(pairs,logBF=estimateBF(X))  
head(score[order(-score$logBF),])             

#-- results from hierarchical clustering
cp = clusterPath('C:429',tree)
cp[cp$logBF>0,]

Bayesian Model-Based Approaches

This section illustrates the partially-supervised Bayesian model-based clustering approach to crime series linkage of [@CrimeClust]. This approach is partially-supervised because the offender is known for a subset of the events, and utilizes spatiotemporal crime locations as well as crime features describing the offender\'s modus operandi.

The hierarchical model naturally handles complex features often seen in crime data, including missing data, interval censored event times, and a mix of discrete and continuous variables. It can also provide uncertainty assessments for all model parameters, including the relative influence of each feature (space, time, method of entry, etc.) in the model. In addition, the model produces posterior clustering probabilities which allow analysts to act on model output only as warranted.

The function crimeClust_bayes() is used for the Bayesian model-based clustering approach. Because it uses both solved and unsolved crimes, the labels (crime group) for the solved crimes is also passed into the function. (Note: this function will take 20+ minutes to run.)

#-- Make the crime group labels for each crime (NA for unsolved crimes)
seriesData$CG = makeGroups(seriesData,method=2)      # method=2 uses unique co-offenders
group_info = unique(seriesData[,c('crimeID','CG')])  # extract the group info
A = merge(crimes,group_info,by="crimeID",all.x=TRUE) # add group info to crimes
A = A[order(A$CG),]                                  # order by crime group

#-- Run MCMC
fit = crimeClust_bayes(A$CG, spatial=A[,c('X','Y')],t1=A$DT.FROM,t2=A$DT.TO,
                       Xcat=A[,c("MO1","MO2","MO3")],maxcriminals=1000,
                       iters=3000,burn=1000,update=100,seed=5)

#-- Extract pairwise probabilities
pp = fit$p.equal    # probability that crime i is linked to crime j
diag(pp) = NA

The matrix pp contains the pairwise estimated probability that two crime are linked (share a common offender). We can use this information for crime series identification.

#-- Save the results that take a long time to run
save(A,fit,pp,file="vignettes/MCMC-results.RData")

load("MCMC-results.RData")

Using the image.plot() from the fields package, we can see how strongly the unsolved crimes are linked to the existing (solved) crime series.

library(fields) # if not installed, type: install.packages("fields")

#-- Get index of unsolved crimes
ind.unsolved = which(is.na(A$CG))          # index of unsolved crimes
n = nrow(A)                                # number of crimes

#-- Image plot of linkage probabilities
fields::image.plot(1:n,ind.unsolved,pp[1:n,ind.unsolved],
           xlab="Crime",ylab="Unsolved Crime",
           main="Probability crimes are linked")

We see that some unsolved crimes are linked to solved crimes with a posterior probability above 0.25. These crimes may be worth further investigations.

Here we plot the maximum posterior probability that an unsolved crime is linked to another crime (solved or unsolved).

#-- Find strongest linkages
unsolved.probs = apply(pp[ind.unsolved,],1,max,na.rm=TRUE)  # maximum probability
plot(ind.unsolved,unsolved.probs,xlab="unsolved crime",ylab='maximum probability of linkage')
abline(h=0.25)
ind = ind.unsolved[unsolved.probs > 0.25]
investigate = as.character(A$crimeID[ind])       # crimeIDs for crimes with strongest linkage
investigate

This shows that r investigate are the crimes with the strongest linkages (posterior probabilities greater that 0.25). A particular crime can be investigated in more detail with the function bayesProb():

bp = bayesProb(pp[A$crimeID %in% "C:417"])
bp$crimeID = A$crimeID[bp$index]
bp$CG = A$CG[bp$index]
head(bp)

For this example, our model provides a list of the most likely crimes associated with the unsolved crime C:417. The first two crimes (C:15 and C:26) are solved crimes indicating that the offender(s) responsible for these crimes may also be responsible for C:417. The next two crimes, C:459 and C:446 do not have a group ID. This means that they are unsolved crimes. By providing the posterior probabilities, crime analysts may choose to investigate further only if the linkage is strong enough.