Setup

# devtools::install_github("dkahle/ggmap")
library(learnr)
library(RSocrata)
library(ggplot2)
library(ggmap) 
library(e1071)
library(class)
library(caret)
library(PRROC)
library(pROC)
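
If any of the CRAN packages are missing, they can be installed first (ggmap can also be installed from GitHub, as noted above); a minimal sketch:

# Assumption: some of these packages are not installed yet
install.packages(c("learnr", "RSocrata", "ggplot2", "e1071",
                   "class", "caret", "PRROC", "pROC"))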

Data

For this notebook we use data on incidents of crime in the City of Chicago. This data "... is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system." It contains basic information about each crime incident, such as the date, location, type of crime, and whether there was an arrest. Here we only pull in data from January 2018.

Source: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

ccj2018 <- read.socrata("https://data.cityofchicago.org/resource/6zsd-86xi.json?$where=date between '2018-01-01' and '2018-01-31'")

#str(ccj2018)
head(ccj2018)
names(ccj2018)

Some quick data preparation is needed, since most variables are read in as type character by default. We also exclude cases with missing values on the variables we will use.

ccj2018$arrest <- as.factor(ccj2018$arrest)
ccj2018$latitude <- as.numeric(ccj2018$latitude)
ccj2018$longitude <- as.numeric(ccj2018$longitude)

# Keep only cases with complete arrest and coordinate information
ccj2018 <- subset(ccj2018, complete.cases(ccj2018[, c("arrest", "latitude", "longitude")]))
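
As a quick sanity check, we can confirm that no missing values remain in the three variables used below:

colSums(is.na(ccj2018[, c("arrest", "latitude", "longitude")]))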

Train and test set

Next, we split the data into a train and test set.

set.seed(765)
train <- sample(1:nrow(ccj2018), 0.8*nrow(ccj2018))
c_train <- ccj2018[train,]
c_test <- ccj2018[-train,]
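
A quick check that the 80/20 split came out as intended:

nrow(c_train)
nrow(c_test)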

In addition, we need separate X and y objects for the training and test data as input for knn(). In the next sections, the outcome will be arrest, and we use (only) latitude and longitude as features.

X_train <- ccj2018[train, c("latitude", "longitude")]
X_test <- ccj2018[-train, c("latitude", "longitude")]
y_train <- ccj2018[train, "arrest"]
y_test <- ccj2018[-train, "arrest"]

A quick look at our outcome variable.

summary(y_train)
summary(y_test)
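
Since summary() on a factor only shows counts, class proportions make the imbalance between arrests and non-arrests easier to see:

prop.table(table(y_train))
prop.table(table(y_test))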

We can plot the crime incidents colored by arrest.

ggplot(X_train, aes(x = longitude, y = latitude, color = y_train)) +
  geom_point(size = 1, alpha = 0.5)
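
Since ggmap is loaded above, the incidents can also be plotted over a basemap. A sketch, assuming a Stadia Maps API key has been registered via register_stadiamaps() (required by recent ggmap versions) and using an approximate bounding box for Chicago:

# Assumption: register_stadiamaps("YOUR-API-KEY") has been run; bbox is approximate
chi <- get_stadiamap(bbox = c(left = -87.95, bottom = 41.64,
                              right = -87.50, top = 42.03),
                     zoom = 10, maptype = "stamen_toner_lite")
ggmap(chi) +
  geom_point(data = X_train, aes(x = longitude, y = latitude, color = y_train),
             size = 1, alpha = 0.5)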

kNN

In order to find a useful kNN setup, we tune k using 10-fold cross-validation. This can be done, e.g., with tune.knn().

set.seed(761)
tune <- tune.knn(X_train, y_train, k = 1:25, tunecontrol = tune.control(sampling = "cross", cross = 10))
summary(tune)
plot(tune)
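
Rather than reading the best value of k off the plot, it is also stored in the tuning object:

tune$best.parameters$k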

Seems like k = 21 is a good choice. We pass this information to knn(), together with the X data from the test set. Note that the resulting object already contains the test set predictions, since with kNN there is no separate model to be stored.

y_knn <- knn(X_train, X_test, y_train, k = 21, prob = TRUE)

We can also add a logistic regression model for comparison, although this is unlikely to perform well given the prediction task at hand.

logit <- glm(arrest ~ latitude + longitude, data = c_train, family = binomial)
summary(logit)
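
The coefficients are on the log-odds scale; exponentiating them yields odds ratios, which are often easier to interpret:

exp(coef(logit))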

Given the logit object, we can generate predicted risk scores for the test set and transform those into predicted classes. Note that we are using an arbitrary classification threshold (0.5), which might not be the best option.

yp_logit <- predict(logit, newdata = c_test, type = "response")
# Fix both factor levels so confusionMatrix() works even if one class is never predicted
y_logit <- factor(ifelse(yp_logit > 0.5, "TRUE", "FALSE"), levels = c("FALSE", "TRUE"))

Prediction performance

Now we can inspect the prediction performance of kNN and the logit model using confusionMatrix() from caret, which, given predicted classes, also reports a wide range of performance measures.

confusionMatrix(y_knn, y_test, mode = "everything", positive = "TRUE")
confusionMatrix(y_logit, y_test, mode = "everything", positive = "TRUE")

Additionally, ROC and PR curves are helpful for evaluating prediction performance with categorical outcomes. Here we could, e.g., use the PRROC package. As an example, we only consider the kNN model.

First, get predicted risk scores. Note that knn() stores the proportion of votes for the winning class, so this has to be converted into the probability of an arrest.

# Vote share of the winning class -> P(arrest == "TRUE")
yp_knn <- ifelse(y_knn == "TRUE", attr(y_knn, "prob"), 1 - attr(y_knn, "prob"))

Then, create helper objects...

pc <- yp_knn[y_test == "TRUE"]
nc <- yp_knn[y_test == "FALSE"]

...that can be passed to roc.curve() (see ?roc.curve).

roc <- roc.curve(scores.class0 = pc, scores.class1 = nc, curve = TRUE)

Finally, we can print and plot the resulting roc object.

roc
plot(roc, scale.color = heat.colors(100))

The same works for the PR curve.

pr <- pr.curve(scores.class0 = pc, scores.class1 = nc, curve = TRUE)
pr
plot(pr, scale.color = heat.colors(100))
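
For comparison, the same curves can be computed for the logit model by reusing the predicted risk scores yp_logit from above; a minimal sketch:

# Risk scores for actual arrests vs. non-arrests in the test set
pc_logit <- yp_logit[y_test == "TRUE"]
nc_logit <- yp_logit[y_test == "FALSE"]
roc.curve(scores.class0 = pc_logit, scores.class1 = nc_logit)
pr.curve(scores.class0 = pc_logit, scores.class1 = nc_logit)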

Try to calculate precision at top 100, i.e., the expected precision when classifying the 100 test incidents with the highest risk scores as arrests (TRUE). For this, we need to create a new prediction vector. The function order() is helpful here.

# Sort test cases by predicted risk score, highest first
yp <- data.frame(yp_knn, y_test)
yp <- yp[order(-yp$yp_knn), ]
# Classify the top 100 cases as arrests, all others as non-arrests
yp$yt_knn <- "FALSE"
yp$yt_knn[1:100] <- "TRUE"

Next, compute the precision given the new predicted classes and y_test.

precision(as.factor(yp$yt_knn), yp$y_test, relevant = "TRUE")
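
As a plausibility check, precision at top 100 is simply the share of actual arrests among the 100 highest-scoring test cases; using the sorted data frame from above:

mean(yp$y_test[1:100] == "TRUE")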

Classification thresholds

In the previous plots, we have seen that performance measures such as sensitivity and specificity depend heavily on the underlying classification threshold. Therefore, let's try to find a threshold that satisfies some optimality criterion, instead of simply using 0.5. For this purpose, we have to create another roc object for the kNN result, now using the pROC package.

roc2 <- roc(y_test, yp_knn)
roc2

This package provides the function coords(), which can be used for threshold optimization (see ?coords). Note that in an actual application, we couldn't use the test set for this purpose, so another hold-out set would be needed.

knn_t <- coords(roc2, x = "best", best.method = "closest.topleft", best.weights = c(1, 0.2))
knn_t
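
coords() can also report how sensitivity and specificity trade off across explicitly chosen thresholds, which helps to put the "best" threshold into context; a quick sketch:

coords(roc2, x = seq(0.1, 0.5, by = 0.1), input = "threshold",
       ret = c("threshold", "sensitivity", "specificity"))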

We can now use this new threshold to predict class membership.

# Again fix both factor levels for confusionMatrix()
y_knn2 <- factor(ifelse(yp_knn > unlist(knn_t[1]), "TRUE", "FALSE"), levels = c("FALSE", "TRUE"))

And finally build a confusion matrix using the predicted classes from above.

confusionMatrix(y_knn2, y_test, mode = "everything", positive = "TRUE")
