```r
# Load packages
library(learnr)
library(lubridate)
library(tidyverse)

# Copy files
if (!dir.exists("css")) dir.create("css")
purrr::walk(
  dir("../css/"),
  ~ file.copy(stringr::str_glue("../css/{.}"), stringr::str_glue("css/{.}"), overwrite = TRUE)
)

# Set knitr options
knitr::opts_chunk$set(echo = FALSE)

# Load data
download.file(
  url = "https://mpjashby.github.io/crimemappingdata/aggravated_assaults.xlsx",
  destfile = agg_assaults_file <- tempfile(fileext = ".xlsx"),
  mode = "wb"
)
agg_assault_data <- readxl::read_excel(agg_assaults_file, sheet = "Austin")
```
A major step in using any data to make decisions or draw conclusions is data wrangling: the process of transforming data from the format in which we originally have it to the format needed to analyse and present it to our audience.
Start off by watching this video that walks through the different steps in wrangling a dataset. We will cover all the steps in the video in more detail during the rest of this tutorial.
In this tutorial we will learn how to wrangle data in R using functions – specialised pieces of code that do something to the data we give them. The code to use a function (sometimes called calling the function) has two parts: the function name followed by a pair of parentheses, inside which are zero or more arguments separated by commas. Arguments are a way of providing input for a function to work on, or of fine-tuning the way the function works (we will see many examples of this later). Remember that you can identify a function in R because the name will always have parentheses after it.
One basic R function is `sqrt()`, which calculates the square root of a number. The `sqrt()` function has only one argument: the number that we want to find the square root of.
::: {.tutorial}
Type `sqrt(2)` in the box below and click the *Run Code* button to calculate the square root of 2.
:::

```r
sqrt(2)
```
When you run code in R, by default R prints the output of your code – in this case, just the number `r format(sqrt(2), digits = 7)` (for now, you can ignore the number `[1]` in square brackets).
```r
#| echo: false
package_count <- available.packages(repos = "https://cran.rstudio.com/") |>
  nrow() |>
  scales::comma(accuracy = 100)
```
R contains thousands of different functions that do different things. A few functions are contained in the default installation of R that you have already installed (this is sometimes referred to as base R). But most functions are contained in packages, which are extensions to base R. Most packages focus on a particular type of data analysis, so that there are packages devoted to time-series analysis, testing whether events are clustered in particular places, network analysis and thousands of other tasks. Packages are often developed by experts in the field, and are typically updated to introduce new features.
To use a package in R, we must do two things: first, install the package on our computer; second, load the package in each R session in which we want to use it.
The `install.packages()` function downloads and installs packages from the Comprehensive R Archive Network (universally known as CRAN), which contains about `r package_count` different packages. Some packages that are still in the early stages of development are not available on CRAN, but all the packages we will use are there.
So to install (for example) the package called `tidyverse`, which we will use extensively in this tutorial, we would run the R code:

```r
install.packages("tidyverse")
```
We only have to install a package once for each computer that we will use to run R, although we would have to do it again if we updated to a new version of R. Once a package is installed on our computer, we have to load it so that we can use it in our code. We load packages using the `library()` function, which should probably have been called `load_package()` but isn't. So to load the `tidyverse` package, we run the code:

```r
library(tidyverse)
```
Many packages are focused on specialist tasks and so are only used occasionally, but a few packages are likely to be useful in almost all the code we write. Fortunately, packages can themselves load other packages, and all the main packages we need are themselves loaded by the `tidyverse` package. That is why you will often see `library(tidyverse)` at the top of R code in subsequent tutorials – that short line of code loads several packages containing hundreds of functions that we can use in data analysis.
Before we can do anything with any data, we have to load it into R. In this course we will read tabular data in comma-separated values (CSV) and Excel formats, as well as spatial data in different formats (because there are lots of ways to store spatial data). We will learn how to read CSV and Excel data now, but leave loading spatial data until later.
Tabular data contains multiple columns where every column has the same number of rows. For example, crime data might have columns for the type of crime, date and address at which the crime occurred.
```r
#| echo: false
tibble::tribble(
  ~type, ~date, ~address,
  "homicide", as_date(now(tzone = "UTC") - years(1) + days(10)), "274 Main St",
  "non-residential burglary", as_date(now(tzone = "UTC") - months(5) + days(20)), "541 Station Rd",
  "personal robbery", as_date(now(tzone = "UTC") - days(8)), "10 North Av"
) |>
  dplyr::mutate(date = strftime(date, "%d %b %Y")) |>
  knitr::kable(caption = "Crime data in rectangular format")
```
```r
#| echo: false
san_fran_rob <- read_csv("https://mpjashby.github.io/crimemappingdata/san_francisco_robbery.csv")
```
Data stored in CSV format is easy to load with the `read_csv()` function from the `readr` package. `readr` is one of the packages loaded by the `tidyverse` package, so all we need to do to use this package is include the code `library(tidyverse)` on the first line of our R script. We will use comments (lines of code beginning with `#`) to help explain as we go.
```r
# Load the tidyverse suite of packages, including the readr package
# that contains the read_csv() function
library(tidyverse)

# We can load data from a file in the same folder as our R script
san_fran_rob <- read_csv("san_francisco_robbery.csv")

# Or another folder on your computer ('../' is short for the parent
# folder of the current folder)
san_fran_rob <- read_csv("../san_francisco_robbery.csv")

# Or directly from a file online
san_fran_rob <- read_csv("http://example.com/san_francisco_robbery.csv")
```
In each of these examples, the code stores the result of the `read_csv()` function in an object named `san_fran_rob`. Objects are places where we can store data. To create an object and store our data in it, we use the assignment operator `<-` (a less-than sign followed by a dash). Continually typing `<-` can be tedious, so in RStudio we can use the keyboard shortcut Option+- (on Mac) or Alt+- (on Windows or Linux) to insert the complete operator.
::: {.box .notewell}
When choosing object names, it is important to remember that if you assign a value (such as the number `1` or the result of the function `read_csv()`) to an object name, R will overwrite any existing value of that object name. We can see this in a simple example:

```r
one_to_ten <- 1:10
one_to_ten <- sqrt(2)
```

If we were to run this code, the object `one_to_ten` would not actually hold the numbers from one to ten, but instead the value `r format(sqrt(2), digits = 7)` (the square root of two). There is also no way to undo assignment of a value to an object, so once you have run the code `one_to_ten <- sqrt(2)` it is not possible to recover any previous value that was assigned to the object `one_to_ten`.
:::
Objects come in several different types, with tabular data typically being stored as a data frame. The `read_csv()` function actually produces a modern variation on the data frame called (slightly strangely) a tibble, which makes use of some advances in how R handles data since the data-frame format was set 20 years ago. Tibbles behave just like data frames almost all of the time (so much so that people working with tibbles often call them data frames) except for a few occasions where they behave in a more-convenient way.
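For instance (a quick sketch using made-up values, not part of the dataset we will use below), we can build a small tibble by hand with the `tibble()` function and see how it prints, listing its dimensions and each column's type:

```r
library(tibble)

# A small hand-made tibble with two columns (illustrative values only)
crimes <- tibble(
  type = c("robbery", "burglary", "assault"),
  count = c(12, 30, 7)
)

# Printing a tibble shows its dimensions and the type of each column
crimes
```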
We can use `read_csv()` to load data from https://mpjashby.github.io/crimemappingdata/san_francisco_robbery.csv and store it in an object called `san_fran_rob`.
::: {.tutorial}
Use the code above to help you, or click the *Solution* button to reveal the answer.
:::

```r
san_fran_rob <- read_csv("https://mpjashby.github.io/crimemappingdata/san_francisco_robbery.csv")
```
If the data are loaded successfully, R will list the columns in the data and the type of variable (numeric, date etc.) stored in each column. The format of this is somewhat esoteric, but if you are interested they are explained in the 'Extra detail' box below.
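If you want to see that column specification again later, the `readr` package provides the `spec()` function, which works on any tibble created by `read_csv()`. As a self-contained sketch (using a tiny made-up CSV supplied as a literal string via `I()`, rather than the robbery data):

```r
library(readr)

# Read a tiny CSV from a literal string; I() marks it as literal data
dat <- read_csv(I("type,count\nrobbery,12\nburglary,30\n"))

# spec() prints the column specification that read_csv() guessed
spec(dat)
```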
To see the first few rows of data currently stored in an object, we can use the `head()` function.

::: {.tutorial}
Type the code needed to view the first few rows of the `san_fran_rob` object and click the *Run Code* button.
:::

```r
head(san_fran_rob)
```
Loading data from Microsoft Excel files is very similar to loading CSV data, with a few important differences. Functions to load Excel data are contained in the `readxl` package, which was installed automatically when we installed the `tidyverse` package.
There are two main things we must do to import Excel data that are not required for importing CSV data. The first is that the `readxl` package cannot directly load files from a URL, instead only loading files that are present on your computer. To get round this, we will first download an Excel file and store it in a temporary directory (to avoid cluttering up our computers).
::: {.box .notewell}
**Using `download.file()` on Windows**

If you are using a Windows computer, you may find that the `download.file()` function in the code below does not work as expected. This is because Windows handles files in a way that distinguishes between plain-text files such as `.txt` and `.csv` files and binary files, which includes most other file types (including compressed files). Since `aggravated_assaults.xlsx` is not a plain-text file, on Windows you need to specify that you want it to be downloaded as a binary file. To do this, add the argument `mode = "wb"` to the `download.file()` function so that it reads:

```r
download.file(
  url = "https://mpjashby.github.io/crimemappingdata/aggravated_assaults.xlsx",
  destfile = temp_file,
  mode = "wb"
)
```
If you are using a Mac or a Linux computer then you do not need to worry about this.
:::
```r
# Specify the name and location of our temporary file: it does not matter
# what this file is called or where it is stored, so we use the tempfile()
# function to create a file in the correct location automatically
temp_file <- tempfile(fileext = ".xlsx")

# Download the Excel file and store it in the temporary location
download.file(
  url = "https://mpjashby.github.io/crimemappingdata/aggravated_assaults.xlsx",
  destfile = temp_file,
  mode = "wb"
)
```
The `download.file()` function does not produce any output if the file has been successfully downloaded, so you will not see any output when you run this code.
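If you want to reassure yourself that the download worked, one option (a quick optional check, not needed for the rest of the tutorial) is to ask R whether the file now exists at the temporary location and how big it is:

```r
# TRUE if the downloaded file now exists at the temporary location
file.exists(temp_file)

# Size of the downloaded file in bytes (0 would suggest a failed download)
file.size(temp_file)
```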
Now we have downloaded our data, we can load it into R. Since Excel files can contain multiple sheets, we need to specify which sheet we would like to load into a tibble. We can use the `excel_sheets()` function to get a list of sheets in an Excel file:
```r
#| echo: false
# Load the readxl package
library(readxl)

# Specify the name and location of our temporary file
temp_file <- tempfile(fileext = ".xlsx")

# Download the Excel file and store it in the temporary location
download.file(
  url = "https://mpjashby.github.io/crimemappingdata/aggravated_assaults.xlsx",
  destfile = temp_file,
  mode = "wb"
)
```
```r
# Load the readxl package
library(readxl)

# Get a list of sheets in an Excel file
excel_sheets(temp_file)
```
We can now load the sheet containing data for Austin and view the first few rows of the resulting object:
```r
agg_assault_data <- read_excel(temp_file, sheet = "Austin")

head(agg_assault_data)
```
Now we have learned how to load our data into an object, we can use other R functions to work with that data in many different ways.
::: {.box .notewell}
Different types of data are loaded into R with different functions, e.g. CSV files are loaded with the `read_csv()` function from the `readr` package and Microsoft Excel files are loaded with the `read_excel()` function from the `readxl` package. You can download a table of functions for reading data into R to remind you of which function to use to load each type of file.
:::
::: {.box .reading}
Learn more about how to read data into R by reading this chapter of the free online book R for Data Science.
Excel data can often be messy and the `readxl` package contains various other functions that can be used to deal with this. You can learn more about how to handle messy Excel data in this online tutorial.
:::
::: {.tutorial}
Answer the following questions to check your understanding of what we've learned so far in this tutorial. If you get a question wrong, you can keep trying until you get the right answer.
```r
#| echo: false
quiz(
  caption = "",
  question(
    "What R package contains the function `read_csv()` to read CSV data?",
    answer("`readr`", correct = TRUE),
    answer(
      "`readxl`",
      message = "The `readxl` package provides functions for reading Microsoft Excel data, not CSV data – try looking further up this page for the answer."
    ),
    answer(
      "`reader`",
      message = "That's nearly right, but you must spell R package names exactly correctly for R to be able to find and load the package – try looking further up this page for the answer."
    ),
    answer(
      "`readcsv`",
      message = "That's not quite right – try looking further up this page for the answer."
    ),
    correct = random_praise(),
    allow_retry = TRUE,
    random_answer_order = TRUE
  ),
  question(
    "What R code prints the first few rows of the tibble called `san_fran_rob`?",
    answer(
      "`head(san_fran_rob)`",
      correct = TRUE,
      message = "To see the first few rows of an object in R, we can use the `head()` function."
    ),
    answer(
      "`message(san_fran_rob)`",
      message = "`message()` is used to print a text message, but if a vector longer than one item is used inside `message()` then what is printed will be all the items in the vector stuck together, so `message(san_fran_rob)` prints a very long and incomprehensible string of text."
    ),
    answer(
      "`summary(san_fran_rob)`",
      message = "`summary()` is used to print a summary of an object, but if the object is a data frame then it will print a summary of each variable in the dataset, rather than the first few rows."
    ),
    answer(
      "`peak_inside(san_fran_rob)`",
      message = "There is no function called `peak_inside()` in any package that we have loaded, so this code will result in an error message."
    ),
    correct = random_praise(),
    allow_retry = TRUE,
    random_answer_order = TRUE
  ),
  question(
    "If we create an object using the code `number_ten <- 10` and then run the code `number_ten <- sqrt(2)`, what value will the object `number_ten` now have?",
    answer("`1.414214` (the square root of 2)", correct = TRUE),
    answer(
      "`10` (the number 10)",
      message = "That's not quite right, because the original value of `number_ten` has been overwritten by the code `number_ten <- sqrt(2)`."
    ),
    answer(
      "`10.41421` (10 plus the square root of 2)",
      message = "If you assign a value to an existing object that already stores another value, the first value is simply overwritten rather than the two values being added together (which wouldn't make sense for non-numeric values) – try looking further up this page for the answer."
    ),
    answer(
      "`14.14214` (10 times the square root of 2)",
      message = "If you assign a value to an existing object that already stores another value, the first value is simply overwritten rather than the two values being multiplied together (which wouldn't make sense for non-numeric values) – try looking further up this page for the answer."
    ),
    correct = random_praise(),
    allow_retry = TRUE,
    random_answer_order = TRUE
  )
)
```
:::
In this section we will learn how to reduce the size of our data by selecting only the columns we need and discarding the rest. This can be particularly useful if we are working with a very-large dataset, or if we want to produce a table containing only some columns.
We can use the `select()` function from the `dplyr` package (one of the packages that is loaded automatically when we call the `library(tidyverse)` function) to select columns.

If we wanted to select just the `date` and `location_type` columns from the `agg_assault_data` we loaded in the previous section:

```r
select(agg_assault_data, date, location_type)
```
In a previous section, we mentioned that the code needed to run (or call) a function in R has two parts: the function name followed by a pair of parentheses, inside which are zero or more arguments separated by commas. The arguments in the `select()` function (and many other functions in the `dplyr` package) work in a slightly different way to many other functions. Here, the first argument is the name of the data object that we want to select from. All the remaining arguments (here, `date` and `location_type`) are the names of the columns we want to select from the data.
We can select as many columns as we want, by just adding the names of the columns separated by commas.
::: {.tutorial}
Write the code necessary to select the `longitude` and `latitude` columns from the `agg_assault_data` object:
:::

```r
select(agg_assault_data, longitude, latitude)
```
The columns in our new dataset will appear in the order in which we specify them in the `select()` function.
We can also use `select()` to rename columns at the same time as selecting them. For example, to select the columns `date` and `location_type` while also renaming `location_type` to be called `type`:

```r
select(agg_assault_data, date, type = location_type)
```
If we want to rename a column while keeping all the columns in the data, we can instead use the `rename()` function (also from the `dplyr` package):

```r
rename(agg_assault_data, type = location_type)
```
Remember that functions in R generally do not change existing objects, but instead produce (or return) new ones. This means if we want to store the result of this function so we can use it later, we have to assign the value returned by the function to a new object (or overwrite the existing object):
```r
agg_assault_locations <- select(agg_assault_data, lon = longitude, lat = latitude)

head(agg_assault_locations)
```
::: {.box .reading}
You can learn more about selecting, filtering and arranging data using the functions in the `dplyr` package by reading this Introduction to dplyr tutorial.
:::
::: {.tutorial}
```r
#| echo: false
quiz(
  caption = "",
  question(
    "Which `dplyr` function allows you to change the name of columns while keeping all the columns in the original data?",
    answer("`rename()`", correct = TRUE),
    answer(
      "`select()`",
      message = "`select()` does allow you to rename columns, but keeps only the columns you specify"
    ),
    correct = random_praise(),
    allow_retry = TRUE,
    random_answer_order = TRUE
  ),
  question(
    "Which `dplyr` function allows you to choose only some columns in the original data?",
    answer("`select()`", correct = TRUE),
    answer(
      "`rename()`",
      message = "`rename()` does allow you to rename columns, but keeps all the columns in the original data"
    ),
    correct = random_praise(),
    allow_retry = TRUE,
    random_answer_order = TRUE
  )
)
```
:::
Often in crime mapping we will only be interested in part of a particular dataset. In the same way that we can select particular columns in our data, we can filter particular rows using the `filter()` function from the `dplyr` package.

If we were only interested in offences in the `agg_assault_data` dataset that occurred in residences, we could use `filter()`:

```r
filter(agg_assault_data, location_type == "residence")
```
Note that:

* `location_type` is not surrounded by quotes but the column value `"residence"` is, and
* the `==` (equal to) operator is used, since a single equals sign `=` has another meaning in R.

We can filter using the values of more than one column simultaneously. To filter offences in which the `location_category` is 'retail' and the `location_type` is 'convenience store':

```r
filter(
  agg_assault_data,
  location_category == "retail",
  location_type == "convenience store"
)
```
As well as filtering using the `==` operator, we can filter using the greater-than (`>`), less-than (`<`), greater-than-or-equal-to (`>=`) and less-than-or-equal-to (`<=`) operators. For example, we can choose offences that occurred in residences on or after 1 July 2019:

```r
filter(
  agg_assault_data,
  location_type == "residence",
  date >= as.Date("2019-07-01")
)
```
Sometimes we will want to filter rows that are one thing or another. We can do this with the `|` (or) operator. For example, we can filter offences that occurred either in leisure facilities or shopping malls on or after 1 July 2019:

```r
filter(
  agg_assault_data,
  location_category == "leisure" | location_type == "mall",
  date >= as.Date("2019-07-01")
)
```
If we want to filter offences that have any one of several different values of the same column, we can use the `%in%` (in) operator. To filter offences that occurred in either streets or publicly accessible open spaces:

```r
filter(agg_assault_data, location_category %in% c("open space", "street"))
```
The code `c("open space", "street")` produces what is referred to in R as a vector (sometimes referred to as an atomic vector, especially in error messages). A vector is a one-dimensional sequence of values of the same type (i.e. all numbers, all character strings etc.). For example, a vector might hold several strings of text (as in the vector `c("open space", "street")`) or a series of numbers such as `c(1, 2, 3)`. There is lots we could learn about vectors, but for now it's only necessary to know that we can create vectors with the `c()` (or combine) function.
If we wanted to re-use a vector of values several times in our code, it might make sense to store the vector as an object. For example:
```r
# Create vector of location types we are interested in
location_types <- c("open space", "street")

# Filter the data
filter(agg_assault_data, location_category %in% location_types)
```
Finally, you can filter based on the output of any R function that returns `TRUE` or `FALSE`. For example, missing values are represented in R as `NA`. We can test whether a value is missing using the `is.na()` function. If we wanted to remove rows from our data that had missing location types, we would filter for those rows that are not `NA`. We can do this by combining the `is.na()` function with the `!` (not) operator:

```r
filter(agg_assault_data, !is.na(location_type))
```
We will see lots more examples of how to use `filter()` in future tutorials.
::: {.tutorial}
```r
#| echo: false
quiz(
  caption = "",
  question(
    "What is a vector (sometimes known as an atomic vector) in R?",
    answer(
      "A type of object that stores a one-dimensional sequence of values of the same type",
      correct = TRUE
    ),
    answer(
      "A type of object that stores a one-dimensional sequence of values that can be of different types",
      message = "Vectors are objects that store a one-dimensional sequence of values, but all the values must be of the same type."
    ),
    answer(
      "A type of object that stores a tibble or data frame",
      message = "Vectors can only store one-dimensional data, whereas a tibble or data frame has two dimensions (rows and columns). Try looking further up the page for more information about vectors."
    ),
    answer(
      "There is no such thing as a vector in R",
      message = "Vectors are a type of object in R. Try looking further up the page for more information about vectors."
    ),
    correct = random_praise(),
    allow_retry = TRUE,
    random_answer_order = TRUE
  ),
  question(
    'Which offences (rows) will be returned by the code `filter(agg_assault_data, location_type %in% c("restaurant", "mall"))`?',
    answer(
      "Offences that occurred either in restaurants *or* in shopping malls",
      correct = TRUE
    ),
    answer(
      "Offences that occurred in both restaurants *and* in shopping malls (e.g. at restaurants inside shopping malls)",
      message = "The `|` (or) operator is used when we want to find rows that match one criteria or the other, not both together."
    ),
    answer(
      "Offences that occurred anywhere *except* restaurants or shopping malls (e.g. in homes)",
      message = "The `|` (or) operator is used when we want to find rows that match one criteria or the other (or both), but they must match at least one of the two."
    ),
    answer(
      "No offences, because the way the two criteria are combined is illogical",
      message = "The `|` (or) operator is used when we want to find rows that match one criteria or the other (or both). Since the location types 'restaurant' and 'mall' are different from one another, there is nothing illogical about this combination. Try looking further up the page for more information about the `|` (or) operator."
    ),
    correct = random_praise(),
    allow_retry = TRUE,
    random_answer_order = TRUE
  ),
  question(
    "What does the `<=` operator mean?",
    answer("less than or equal to", correct = TRUE),
    answer(
      "greater than or equal to",
      message = "Not quite right: the greater-than-or-equal-to operator is `>=`."
    ),
    answer(
      "less than",
      message = "Not quite right: the less-than operator is `<`."
    ),
    answer(
      "greater than",
      message = "Not quite right: the greater-than operator is `>`."
    ),
    correct = random_praise(),
    allow_retry = TRUE,
    random_answer_order = TRUE
  )
)
```
:::
It is often useful to create new columns in our data, or change the values of existing columns. The `mutate()` function in the `dplyr` package gives us a way to transform existing columns in our dataset using almost any R function.

For example, say we wanted to create a new column in our aggravated-assault dataset specifying the day of the week on which each crime occurred. We can do this using the `wday()` function from the `lubridate` package (using the `label = TRUE` argument to produce weekday names, rather than numbers):

```r
library(lubridate)

mutate(agg_assault_data, weekday = wday(date, label = TRUE))
```
Depending on the width of your screen, you might need to click the ▸ button to see the new variable.
We can also categorise an existing variable, for example creating a variable to show whether an offence occurred in the northern or southern half of the city:
```r
mutate(
  agg_assault_data,
  region = if_else(latitude > median(latitude), "northern", "southern")
)
```
We can change existing columns, although (as with objects) there is no way to undo this so you should only replace columns if you are sure you will not need them. For example, if we wanted to remove the time portion of the date variable (which may sometimes be useful, as shown in the next section) using the `as_date()` function (also from the `lubridate` package) and at the same time create the weekday variable:

```r
#| echo: false
library(lubridate)
```

```r
mutate(
  agg_assault_data,
  date = as_date(date),
  weekday = wday(date, label = TRUE)
)
```
You may sometimes want to change only some values in a column. We can do this in various ways, depending on which values we want to change:
```r
mutate(
  agg_assault_data,
  # Change a single value with a new value (and otherwise keep the existing
  # value) using the if_else() function
  location_type = if_else(location_type == "street", "road", location_type),
  # Change multiple values in a categorical variable using the recode()
  # function, in which values are changed using arguments in the format
  # old_value = new_value
  location_category = recode(
    location_category,
    "open space" = "public open space",
    "street" = "street or road"
  )
)
```
We could also make changes based on more-complicated sets of criteria using the `case_when()` function, but we will return to that in a future tutorial.
The R functions that you use inside `mutate()` must return the same number of values as there are rows in the dataset. This is true for most R functions (which are referred to as vectorised functions), but there are some – such as `mean()` and `max()` – that return a single value. These summarising functions cannot be used inside `mutate()` (you will see an error message if you try) but are instead used with the next data-wrangling function we will learn about: `summarise()`.
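To illustrate the difference (with made-up numbers), a vectorised function such as `sqrt()` returns one value for each value it is given, while a summarising function such as `mean()` collapses its input to a single value:

```r
values <- c(1, 4, 9, 16)

# A vectorised function returns as many values as it is given
sqrt(values)

# A summarising function returns a single value
mean(values)
```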
Summarising data is often useful in crime analysis. We can use the `summarise()` function from the `dplyr` package to produce summaries of different columns in our data. There is an identical function called `summarize()` so that you do not have to remember whether to use the US or British spelling.

By default, `summarise()` collapses data into a single row, with each column summarised using a function that you specify. For example, suppose you want to find out which police station a specialist squad should be based at to most easily respond to reports of serious assaults. You might do this by working out the weighted centre of all the offence locations, i.e. the means of the longitudes and latitudes for all the crimes. You could then base the squad at the police station that was closest to the weighted centre.

```r
summarise(
  agg_assault_data,
  mean_lng = mean(longitude, na.rm = TRUE),
  mean_lat = mean(latitude, na.rm = TRUE)
)
```
`summarise()` becomes more useful if we first divide our data into groups, since we then get a summary for each group separately. We can use the `.by` argument of the `summarise()` function to specify that we want separate summaries for each unique value of one or more columns in the data. For example, to produce a separate summary for each unique value of `location_category`, we can use this code:

```r
summarise(
  agg_assault_data,
  mean_lng = mean(longitude, na.rm = TRUE),
  mean_lat = mean(latitude, na.rm = TRUE),
  .by = location_category
)
```
You can add multiple grouping variables using the `c()` (combine) function if you want to generate summary values for groups within groups:

```r
summarise(
  agg_assault_data,
  mean_lng = mean(longitude, na.rm = TRUE),
  mean_lat = mean(latitude, na.rm = TRUE),
  .by = c(location_category, location_type)
)
```
One very common way of summarising data is to count the number of rows in a dataset that have each unique value of one or more columns. For example, if we have a dataset of crimes in which each row represents a single crime, we might want to count how many crimes happened on each day of the week, or how many crimes of each type are in the dataset. We can use `summarise()` to do that, together with the `n()` function (from the same `dplyr` package as `summarise()`). For example, if we wanted to count how many rows in the `agg_assault_data` dataset had each unique combination of `location_category` and `location_type`:

```r
summarise(agg_assault_data, n = n(), .by = c(location_category, location_type))
```
In this code, the `n()` function simply returns the number of rows of data in each group, i.e. the number of rows with each unique combination of values of `location_category` and `location_type`.
Since counting the number of rows in each group of a dataset is a very common task, `dplyr` includes another function called `count()` that allows you to do the same thing as in the code above, but with slightly less typing. So if you wanted to know how many aggravated assaults had occurred in each location category and type:

```r
count(agg_assault_data, location_category, location_type)
```
It is sometimes useful to be able to place rows in a dataset into a particular order. We can do this using the `arrange()` function from the `dplyr` package. For example, we can sort the aggravated-assault data by date:

```r
arrange(agg_assault_data, date)
```
By default, `arrange()` sorts rows in ascending order, i.e. it sorts numeric values from the smallest to the largest, dates from earliest to latest and character values alphabetically. We can instead sort values in descending order by wrapping the name of a column in the `desc()` function:

```r
arrange(agg_assault_data, desc(date))
```
We can also sort the data based on multiple columns – the data are sorted first on the first column that you specify, with tied rows then sorted on the subsequent columns in order.
```r
arrange(agg_assault_data, date, desc(location_type), location_category)
```
::: {.tutorial}
Type the code necessary to arrange `agg_assault_data` in order of latitude, in descending order (from largest to smallest).

Run your code using the *Run Code* button, then (if necessary) correct your code and run it again. Once you are happy that your code does what it is intended to do, click the *Solution* button to check.

```r
arrange(agg_assault_data, desc(latitude))
```
:::
Once we have finished wrangling a particular dataset, it is often useful to save it to a file so that we can use it again in future without going through all the steps of data wrangling again.
Most R functions that begin with `read_` (like `read_csv()` and `read_excel()`) have equivalent functions that begin with `write_` and which save data into a particular file format. In this example, we will use the `write_csv()` function from the `readr` package, which is loaded when we load the `tidyverse` package.
```r
# We can write data to a file in the same folder as our R script
write_csv(agg_assault_data, "austin_agg_assaults.csv")

# Or another folder on your computer ('../../' is short for the parent folder of
# the parent folder of the current folder)
write_csv(agg_assault_data, "../../austin_agg_assaults.csv")
```
For very large datasets, we can save a compressed version of the file by adding `.gz` (for gzip) to the end of the file name, which tells R to compress the file after creating it.
```r
write_csv(agg_assault_data, "austin_agg_assaults.csv.gz")
```
`read_csv()` can read gzipped CSV files, but some other programs (such as Excel) cannot, so only use this option if you are sure you will only need to open the file in software that can handle it.
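As a minimal, self-contained sketch of the write-then-read round trip (using a temporary file and a made-up `toy` tibble rather than real project data):

```r
library(readr)
library(tibble)

# A small made-up dataset for illustration
toy <- tibble(x = 1:3, y = c("a", "b", "c"))

# Writing to a path ending in .gz compresses the file automatically
path <- tempfile(fileext = ".csv.gz")
write_csv(toy, path)

# read_csv() decompresses gzipped files transparently
read_csv(path, show_col_types = FALSE)
```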
There are corresponding write functions for other types of data (which we will come back to when we learn how to handle spatial data), but in this course we will store all non-spatial data in CSV format because it can be read by many different programs.
In this tutorial we have learned how to use the `dplyr` functions `select()`, `filter()`, `mutate()`, `summarise()`, `count()` and `arrange()` to wrangle data from one format to another. Data wrangling is part of almost all data analysis, so these are skills we will use frequently.
Data wrangling often involves multiple steps. For example, we might want to load some data, select certain columns, filter some rows, mutate some of the variables, summarise the dataset and save the result. We can do each of these steps separately, assigning the result of each step to a new object.
```r
# Load the lubridate package, since we will need it below
library(lubridate)

# Read San Francisco robbery data
robbery1 <- read_csv("https://mpjashby.github.io/crimemappingdata/san_francisco_robbery.csv")

# Remove a column we do not need
robbery2 <- select(robbery1, -offense_type)

# Filter only those offences that occurred in the first quarter of 2019
robbery3 <- filter(robbery2, as.Date(date_time) <= as.Date("2019-03-31"))

# Create a new weekday variable
robbery4 <- mutate(robbery3, weekday = wday(date_time, label = TRUE))

# Count how many offences occurred on each weekday
q1_weekday_counts <- count(robbery4, weekday)

# Print the first few rows of the result
head(q1_weekday_counts, n = 7)
```
This code works, but involves creating five new objects, even though we only need the final object for our analysis. You may notice that the first argument expected by `select()`, `filter()`, `mutate()` and `count()` is always the data tibble produced by the previous step. This means we can skip saving the result of each step as a new object, and just run the previous function inside the first function. This approach produces the following code, which produces exactly the same result as the code above.
```r
q1_weekday_counts <- count(
  mutate(
    filter(
      select(
        read_csv("https://mpjashby.github.io/crimemappingdata/san_francisco_robbery.csv"),
        -offense_type
      ),
      as.Date(date_time) <= as.Date("2019-03-31")
    ),
    weekday = wday(date_time, label = TRUE)
  ),
  weekday
)

head(q1_weekday_counts, n = 7)
```
This code works and takes up less space, but it's quite difficult to read – which can be a problem for finding and fixing problems with your code. For example, it's quite hard (without counting pairs of parentheses) to work out that the reference to the column `weekday` on line 12 of this code belongs to the `count()` function on line 1.
It's possible to write this code so that it is readable and does not require us to create multiple different objects to store the result of each step in our code. This method uses the `|>` (or pipe) operator. The pipe operator works by using the result of the code on the left-hand side of the pipe as the first argument to the function on the right-hand side. So the code `x |> fun1() |> fun2(y)` is the same as the code `fun2(fun1(x), y)`, but it is much easier to see that `fun1()` is run before `fun2()`. It may be useful to read the pipe operator as 'and then', since piped code does the first thing and then the second thing with the result, and then the third thing with the result of that, and so on. Piped code (sometimes called a pipeline) is a lot like the series of steps in a recipe.
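As a minimal, self-contained illustration of that equivalence (using only base R functions; note that the `|>` operator requires R version 4.1 or later):

```r
# Without the pipe: fun2(fun1(x), y) reads inside-out
round(sqrt(2), digits = 3)

# With the pipe: the result of sqrt(2) becomes the first argument to round()
2 |> sqrt() |> round(digits = 3)

# Both return 1.414
```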
Since each function we are using returns the new data, and the first argument to all of those functions is the name of the input data object, the pipe means we can just omit the first argument to all except the first function.
```r
san_fran_rob <- read_csv("https://mpjashby.github.io/crimemappingdata/san_francisco_robbery.csv")

q1_weekday_counts <- san_fran_rob |>
  select(-offense_type) |>
  filter(as.Date(date_time) <= as.Date("2019-03-31")) |>
  mutate(weekday = wday(date_time, label = TRUE)) |>
  count(weekday)

head(q1_weekday_counts, n = 7)
```
This code strikes a good balance between being easy to read and not requiring us to manage lots of intermediate variables. You might not find the pipe operator completely intuitive at the moment, but it will become easier as you see more examples in future tutorials.
::: {.box .reading}
You can find out more about how to use the pipe operator in the Introducing magrittr
online tutorial.
:::
::: {.box .welldone}
In this tutorial, you have learned how to wrangle data in R using functions from packages in the `tidyverse` suite of packages. You can now construct a complete pipeline of R code to take raw data and transform it into the format(s) you need to effectively map crimes.
:::
::: {.box .reading}
Developing your data wrangling skills will help you to produce better, faster analysis of crime (and other) data. If you would like to develop your skills further, you might be interested in the documentation for the `dplyr` package, which is very useful for reminding you of the code needed to run each of the functions we have used in this tutorial.
:::