The reclues
package mimics the SQL Murder Mystery done by
Northwestern’s Knight
Lab - their game is set
up to use SQL
to solve the mystery, reclues
makes their data
available in R
to solve the mystery using R
;-).
A brief of the task at hand is given in their walkthrough. Want to play the SQL version of the game online? Here it is courtesy of Simon Willison!
A crime has taken place and the detective needs your help. The detective gave you the crime scene report, but you somehow lost it. You vaguely remember that the crime was a murder that occurred sometime on Jan.15, 2018 and that it took place in SQL City. All the clues to this mystery are buried in a huge database, and you need to use your R skills to navigate through this vast network of information. Your first step to solving the mystery is to retrieve the corresponding crime scene report from the police department’s database.
Take a look at the cheatsheet
(this Readme
file) to learn some tips
on how to do this using the tidyverse tools! From there, you can use
your R skills to find the murderer. The crime may be solved with other
tools besides tidy tools, please feel free to use the dataset as a
learning tool for other packages in R or base R itself if that is your
wish. The cheatsheet is for tidyverse tools for now as a start, but
since I am learning myself I may expand this to include data.table etc.
in future. Have fun!
The goal of reclues is to provide the datasets from the SQL Challenge
mentioned above within R, and some cheatsheet
tidyverse commands for
getting you on your way to solving the mystery.
Please see the reclues package
site for how to use the
package, as well as the vignette included in the Get Started
section
which contains info on using the actual sqlite database included in the
package.
The datasets will be available as soon as you install the package. These are the datasets available and the data contained within them.
| Table Name | Fields | Rows | | :------------------------: | :------------------------------------------------------------------------------------- | ----------: | | crime_scene_report | date, type, description, city | 1,228 rows | | drivers_license | id, age, height, eye_color, hair_color, gender, plate_number, car_make, car_model | 10,007 rows | | facebook_event_checkin | person_id, event_id, event_name, date | 20,011 rows | | get_fit_now_check_in | membership_id, check_in_date, check_in_time, check_out_time | 2,703 rows | | get_fit_now_member | id, person_id, name, membership_start_date, membership_status | 184 rows | | income | ssn, annual_income | 7,514 rows | | interview | person_id, transcript | 4,991 rows | | person | id, name, license_id, address_number, address_street_name, ssn | 10,011 rows |
The raw SQLite database as per @knightlab is also available through the
get_db()
function. To use the SQLite DB for your investigation you
will need the DBI package.
install.packages("DBI")
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("sciencificity/reclues")
library(reclues)
library(dplyr)
# basic example code
# see the first 6 observations
head(crime_scene_report)
#> # A tibble: 6 x 4
#> date type description city
#> <int> <chr> <chr> <chr>
#> 1 20180115 robbery A Man Dressed as Spider-Man Is on a Robbery Spr~ NYC
#> 2 20180115 murder Life? Dont talk to me about life. Albany
#> 3 20180115 murder Mama, I killed a man, put a gun against his hea~ Reno
#> 4 20180215 murder REDACTED REDACTED REDACTED SQL Ci~
#> 5 20180215 murder Someone killed the guard! He took an arrow to t~ SQL Ci~
#> 6 20180115 theft Big Bully stole my lunch money! Chicago
glimpse(crime_scene_report)
#> Observations: 1,228
#> Variables: 4
#> $ date <int> 20180115, 20180115, 20180115, 20180215, 20180215, ...
#> $ type <chr> "robbery", "murder", "murder", "murder", "murder",...
#> $ description <chr> "A Man Dressed as Spider-Man Is on a Robbery Spree...
#> $ city <chr> "NYC", "Albany", "Reno", "SQL City", "SQL City", "...
# Notice that the type field contains info on the
# type of crime which took place?
# Want to see how many reports of each incident type we have?
table(crime_scene_report$type)
#>
#> arson assault blackmail bribery fraud murder robbery
#> 148 145 130 135 130 148 134
#> smuggling theft
#> 117 141
Other great packages to explore your data are DataExplorer
and
skimr
.
create_report(dataset)
creates an html report
with summary stats, missing data, graphs of categorical data etc.skim(dataset)
creates a nice summary of your dataset
separating the different types of data allowing you to look at
summary stats by data type.In R dplyr’s select()
works much the same as SELECT
in SQL. You use
it to get only specific columns you are interested in.
Let’s say I wanted a closer look at the plate_number
, car_make
and
car_model
from the drivers_license table?
# Let's have a look at a few columns of interest from
# the drivers_license table
drivers_license %>%
select(plate_number, car_make, car_model) %>%
head(3) %>%
# formattable func from the formattable package just prints a nice table in the Readme
formattable::formattable()
plate\_number
car\_make
car\_model
P24L4U
Acura
MDX
XF02T6
Cadillac
SRX
VKY5KR
Scion
xB
# There are also helper functions to select columns of interest
# starts_with('start_text') will help select columns that begin with start_text
# ends_with('end_text') will help select columns that end with end_text
drivers_license %>%
# Maybe I am only interested in the columns describing the car...
select(starts_with('car')) %>%
head(3) %>%
# formattable just prints a nice table in the Readme
formattable::formattable()
car\_make
car\_model
Acura
MDX
Cadillac
SRX
Scion
xB
SQL Equivalent is:
SELECT plate_number, car_make, car_model FROM drivers_license LIMIT 3
Here’s a snippet from the online SQL version:
Let’s say we wanted to see a part of the data - the head() function
returns 6 observations and performs a similar functionality as the
LIMIT
keyword in SQL.
head()
gives you the first 6 observations of the data in the
“table”tail()
gives you the last 6 observations of the data in the
“table”You can also specify a number as an argument to the head()
or tail()
functions. For example, head(15)
and tail(10)
will give you the
first 15, and last 10 observations respectively.
crime_scene_report %>%
select(description) %>%
head(8) %>%
# formattable func from the formattable package just prints a nice table in the Readme
formattable::formattable()
description
A Man Dressed as Spider-Man Is on a Robbery Spree
Life? Dont talk to me about life.
Mama, I killed a man, put a gun against his head…
REDACTED REDACTED REDACTED
Someone killed the guard\! He took an arrow to the knee\!
Big Bully stole my lunch money\!
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua.
A lone hunter stalks the night, firing arrows into the Darkness. There
is no hiding, no escape. In the distance, the beast falters, tethered to
the void. The killing blow comes without hesitation, without mercy.
SQL Equivalent is:
SELECT description FROM crime_scene_report LIMIT 8
Here’s a snippet from the online SQL version:
Maybe I am interested in having a look at all variables associated with a person but I just want to have a look at the data not bring back all 10,011 rows.
person %>%
head(4) %>%
formattable::formattable()
id
name
license\_id
address\_number
address\_street\_name
ssn
10000
Christoper Peteuil
993845
624
Bankhall Ave
747714076
10007
Kourtney Calderwood
861794
2791
Gustavus Blvd
477972044
10010
Muoi Cary
385336
741
Northwestern Dr
828638512
10016
Era Moselle
431897
1987
Wood Glade St
614621061
SQL Equivalent is:
SELECT * FROM person LIMIT 4;
Here’s a snippet from the online SQL version:
Let’s say we wanted to see the different types of criminal activity
the reports identify ….
The type
field in the crime scene reports table seems to contain this
info. We will use the distinct
function from dplyr
.
library(magrittr)
# the magrittr package contains the pipe %>% function
# Take the crime scene report data AND THEN
# give me the distinct values for the `type` variable.
crime_scene_report %>%
distinct(type) %>%
formattable::formattable()
type
robbery
murder
theft
fraud
arson
bribery
assault
smuggling
blackmail
SQL Equivalent is:
SELECT DISTINCT(type) FROM crime_scene_report
Here’s a snippet from the online SQL version:
Let’s say we were wondering which city has the highest number of crimes - here we want the city and a count of the times that city is mentioned in the crime scene report …
crime_scene_report %>%
count(city) %>%
arrange(desc(n)) %>%
# filter to limit the print-out
filter(n >= 7) %>%
formattable::formattable()
city
n
Murfreesboro
9
SQL City
9
Duluth
8
Evansville
8
Jersey City
8
Toledo
8
Dallas
7
Hollywood
7
Kissimmee
7
Lancaster
7
Little Rock
7
Newark
7
Portsmouth
7
Reno
7
Waterbury
7
Wilmington
7
Hhmmm looks like SQL City is quite notorious for crime!
SQL Equivalent is:
SELECT city, count(city) AS n FROM crime_scene_report GROUP BY city ORDER BY n DESC
Here’s a snippet from the online SQL version:
Sometimes there are fields like crime_scene_report.description
which
are hard to see because the text runs over several lines. Even using
View() or printing just the description to the screen sometimes does not
help.
Enter pull()
from the dplyr package which extracts a column from
the data.
Hint: You will need something like this to read some of the textual description and transcript information.
crime_scene_report %>%
head(8) %>%
pull(description) %>%
# these next 2 lines are just for displaying the result nicely in the Readme
tibble::enframe(name = NULL) %>%
formattable::formattable()
value
A Man Dressed as Spider-Man Is on a Robbery Spree
Life? Dont talk to me about life.
Mama, I killed a man, put a gun against his head…
REDACTED REDACTED REDACTED
Someone killed the guard\! He took an arrow to the knee\!
Big Bully stole my lunch money\!
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua.
A lone hunter stalks the night, firing arrows into the Darkness. There
is no hiding, no escape. In the distance, the beast falters, tethered to
the void. The killing blow comes without hesitation, without mercy.
interview %>%
filter(stringr::str_length(transcript) >= 230) %>%
pull(transcript) %>%
# these next 2 lines are just for displaying the result nicely in the Readme
tibble::enframe(name = NULL) %>%
formattable::formattable()
value
I was hired by a woman with a lot of money. I don’t know her name but I
know she’s around 5’5" (65“) or 5’7” (67"). She has red hair and she
drives a Tesla Model S. I know that she attended the SQL Symphony
Concert 3 times in December 2017.
Let’s say we’re interested in finding the people that start with a
Z. We will use the stringr
package for this. The str_detect()
function can be used in conjunction with regular expressions - here we
looking for names that start with (^)
Z.
library(stringr)
person %>%
filter(stringr::str_detect(name, "^Z")) %>%
# Limit to top 5 for the print-out
head(5) %>%
formattable::formattable()
id
name
license\_id
address\_number
address\_street\_name
ssn
10452
Zachary Angeloro
702210
3713
Currant Ave
965949567
10797
Zack Pentecost
150590
1839
Rushwood St
532875652
11264
Zada Laban
808317
1663
Testa Circle
983015244
14182
Zack Karwoski
964918
1546
W Middleton Way
482427372
14930
Zella Pietrzyk
440964
3680
Sandgate Circle
443068729
SQL Equivalent is:
SELECT * FROM person WHERE name LIKE ‘Z%’
Here’s a snippet from the online SQL version:
dplyr
has joining functions such as inner_join()
, left_join()
etc.
for joining one table to another. This mimics the SQL
INNER JOIN
etc.
You will notice that the person
table has a field called id
and the
interview
table has a person_id
field. Let’s join these tables and
see what we get.
person %>%
# Since the two tables have diff field names for the common field
# we have to specify the `by` argument.
# by = c('field_name_from_left_table' = 'field_name_from_right_table')
inner_join(interview, by = c('id' = 'person_id')) %>%
# Let's say we're only interested in interviews from people who live
# on some Drive abbreviated to 'Dr'
filter(stringr::str_detect(address_street_name, 'Dr')) %>%
# Limit for print-out
head(3) %>%
formattable::formattable()
id
name
license\_id
address\_number
address\_street\_name
ssn
transcript
10027
Antione Godbolt
439509
2431
Zelham Dr
491650087
nearer to watch them, and just as she came up to them she heard one of
10034
Kyra Buen
920494
1873
Sleigh Dr
332497972
a kind of serpent, thatâs all I can say.â
10039
Francesco Agundez
278151
736
Buswell Dr
861079251
Beau–ootiful Soo–oop\!
SELECT * FROM person INNER JOIN interview ON person.id = interview.person_id WHERE address_street_name LIKE ‘%Dr%’ LIMIT 3
Here’s a snippet from the online SQL version:
Solution Checker
Head over to ‘The SQL Murder Mystery
Page’ OR ‘The SQL Murder Mystery
Walkthrough’ to check
your solution! At the bottom of both pages there is a Check your
solution
section where you enter the name of the individual you
suspect committed the crime.
Run the following commands in R once you think you’ve solved the
problem. You will need the DBI package and if you’ve been using the
datasets to solve the mystery and not the SQLite database (i.e. the
individual dataframes of person
, drivers_license
etc.) then
uncomment the first line to make a connection to the database, run the
queries below after you’ve put in the culprit you suspect, and then
disconnect from the database.
conn <- reclues::get_db()
# Replace 'Insert the name of the person you found here' with the name of the individual you found.
DBI::dbExecute(conn, "INSERT INTO solution VALUES (1, 'Insert the name of the person you found here')")
# Did we solve it? You'll either get a "That's not the right person." or a "Congrats,..." message.
DBI::dbGetQuery(conn, "SELECT value FROM solution;")
DBI::dbDisconnect(conn)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.