library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(janitor)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60,
        tutorial.storage = "local")

students <- read_csv("data/students.csv")

students2 <- students |>
  clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = if_else(age == "five", "5", age),
    age = parse_number(age)
  )

# For creating this file once.
# write_csv(students2, "data/students2.csv")

simple_csv <- "
  x
  10
  .
  20
  30"

another_csv <- "
x,y,z
1,2,3"

tbl_1 <- tibble(John = 1, Aliya = 2, Maxilla = 3)
This tutorial covers Chapter 7: Data import from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to import data into your R project using read_csv() and related functions from the readr package. You will also learn how to write out data to files with functions like write_csv().
This section provides practical advice for handling features like column names, types, and missing data.
Note the distinction between Tidyverse (capitalized and in italics) and tidyverse (no capitalization and in bold). The latter refers to an actual R package, the one which we use almost every time we use R. The former has a more general meaning:
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
readr is the key package for reading in data. Load readr by typing library(readr)
at your Console. Then run help(package = "readr")
. Copy/paste the first header below. (It should be a single line which references the version number.)
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 1)
Because readr is one of the core packages in the Tidyverse we rarely load it directly. Instead, we run library(tidyverse)
which loads all those packages.
The "core" Tidyverse consists of the nine packages loaded by library(tidyverse), but the broader Tidyverse universe includes many more packages which share the same general philosophy.
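To see that broader list, a quick sketch using the tidyverse_packages() helper from the tidyverse package:
library(tidyverse)
tidyverse_packages()   # every package in the broader Tidyverse universe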
Consider this data:
read_lines("data/students.csv") |> cat(sep = "\n")
Write code for reading this data into R. Use read_csv()
and set the file
argument to "data/students.csv". All the files we will use in this tutorial live in the data/
directory.
read_csv(file = ...)
read_csv(file = "data/students.csv")
When you run read_csv()
, it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about retrieving the full column specification and how to quiet this message.
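For example, you can retrieve that full specification later with readr's spec() function. A quick sketch:
students <- read_csv(file = "data/students.csv")
spec(students)   # prints the column specification that readr used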
Instead of just reading in the file and "dumping" the results back to the screen, run the same code as above but assign the result to a new object, students
.
students <- read_csv(file = ...)
students <- read_csv(file = "data/students.csv")
There are two primary modes of doing data analysis. In the first, we write a series of statements, connected with pipes, which "dump out" their results directly to the screen. This approach is useful for interactive analysis. In the second, which usually comes later, we assign the results of the commands to an object, like students
, which we will work with later.
Print out the students
object.
students
students
In the favourite.food
column, there are a bunch of food items, and then the character string "N/A", which should have been a real NA that R will recognize as “not available”. This is something we can address using the na
argument.
By default, read_csv()
only recognizes empty strings ("") in this dataset as NAs. We want it to also recognize the character string "N/A". Run read_csv()
on "data/students.csv", setting the na
argument to c("N/A", "")
.
students <- read_csv(file = "data/students.csv", na = ...)
students <- read_csv(file = "data/students.csv", na = c("N/A", ""))
There is a bewildering set of characters that different people will use to indicate that data is missing. Look carefully at your data to find them all.
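Once you find them, you can list every marker in a single na argument. A sketch (the extra markers here are hypothetical, not actually present in students.csv):
read_csv(file = "data/students.csv",
         na = c("N/A", "", ".", "-99", "missing"))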
You might also notice that the Student ID and Full Name columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names; they’re non-syntactic names. To refer to these variables, you need to surround them with backticks, like `Student ID`.
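For example, a quick sketch of selecting both non-syntactic columns:
students |>
  select(`Student ID`, `Full Name`)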
Pipe students
to the rename()
function, with student_id = "Student ID"
as the argument.
students |> rename(... = "Student ID")
students |> rename(student_id = "Student ID")
This fixes the variable name Student ID
, but we still have to deal with Full Name
. Sometimes, a data set will have scores of weirdly named variables. In that case, we recommend using clean_names()
from the janitor package.
The janitor package is also commonly used for cleaning names. Load in the package below. Note: Nothing will be displayed if the code runs correctly.
library(janitor)
library(janitor)
janitor has several useful functions, including make_clean_names()
, which does the same thing as clean_names()
but can be used directly during data import rather than as part of a pipe.
Pipe students
to clean_names()
.
students |> clean_names()
students |> clean_names()
clean_names()
not only fixes the non-syntactic names like Full Name
; it also cleans up any variable name which does not follow the standard approach of, first, no capitalization and, second, using underscores as a word separator. Note how favourite.food becomes favourite_food and mealPlan becomes meal_plan.
Another common task after reading in data is to consider variable types. For example, meal_plan
is a categorical variable with a known set of possible values, which in R should be represented as a factor. Continue the pipe by adding a call to mutate()
, with `meal_plan = factor(meal_plan)` as the argument.
students |> clean_names() |> mutate(meal_plan = ...(meal_plan))
students |> clean_names() |> mutate(meal_plan = factor(meal_plan))
Note that the values in the meal_plan
variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (<chr>
) to factor (<fct>
).
Before you analyze these data, you’ll probably want to fix the age
column. Currently, age
is a character variable because one of the observations is typed out as five
instead of a numeric 5
.
Continue the pipe by adding a new line to your call to mutate() in which you re-define age with age = if_else(age == "five", "5", age).
students |>
  clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = ...(age == "five", "5", age)
  )
students |>
  clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = if_else(age == "five", "5", age)
  )
A new function here is if_else()
, which has three arguments. The first argument test
should be a logical vector. The result will contain the value of the second argument, yes
, when test is TRUE
, and the value of the third argument, no
, when it is FALSE
. Here we’re saying if age
is the character string "five", make it "5", and if not leave it as age
.
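Here is a small sketch of if_else() on its own, outside of a pipe:
x <- c(1, 3, 5)
if_else(x > 2, "big", "small")   # returns "small" "big" "big"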
The result of the pipe still shows age
as a character variable. But we know that age
is a number. R has a collection of parse_*
functions which transform variable types. Continue the pipe by adding a third line to the mutate()
statement: age = parse_number(age)
. Don't forget to separate the different mutate()
steps with commas.
students |>
  clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = if_else(age == "five", "5", age),
    age = ...(age)
  )
students |>
  clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = if_else(age == "five", "5", age),
    age = parse_number(age)
  )
We could combine the two manipulations of age
into a single line within mutate()
as:
age = parse_number(if_else(age == "five", "5", age))
But it is often easier to build a pipe step-by-step, checking that each part is doing what you want before moving on.
"CSV" stands for comma-separated values, meaning that the variable names and data values are separated by commas in the file.
Consider the contents of the test_1.csv
file.
cat(readLines("data/test_1.csv"), sep = "\n")
Write code to read this file into R using read_csv()
, setting the file
argument to "data/test_1.csv".
Use read_csv() to read in a csv file. Set the `file` argument to "data/test_1.csv".
read_csv(... = "data/test_1.csv")
read_csv(file = "data/test_1.csv")
The result when your code is run should look like this:
read_csv("data/test_1.csv")
Working with R interactively is like having a conversation. You say something to R and then R says something back. In this case, you say, "Read this file." R responds with "This data has two rows and three columns." The size of a data set is where the conversation begins.
R then provides some information about the columns in the data set. The column specification message is a suggestion from R to you to specify the data types for each column of data. R "guesses" a data type unless we use the col_types
argument.
Make the message disappear by setting the show_col_types
argument to FALSE
.
read_csv(file = "data/test_1.csv", show_col_types = ...)
read_csv("data/test_1.csv", show_col_types = FALSE)
Hitting "Run Code" should generate this output:
read_csv("data/test_1.csv", show_col_types = FALSE)
It is always better to use the col_types
argument explicitly in order to ensure that the variable types are what you want them to be.
R is trying to be helpful by nagging you with that message. It is saying "Don't you want to specify the column types for this data? That would be a good idea!"
Consider the contents of the test_2.csv
file.
cat(readLines("data/test_2.csv"), sep = "\n")
Write code for skipping the text at the top of "data/test_2.csv"
by setting the second argument skip
to the appropriate number.
In addition to the `file` argument, you will
need to use the`skip` argument here. Set `skip`
to 2.
read_csv(file = "data/test_2.csv", skip = ...)
read_csv("data/test_2.csv", skip = 2)
The result when your code is run should look like this:
read_csv("data/test_2.csv", skip = 2)
The argument skip
is used to skip rows, but to skip columns, you can use the cols_only()
function as the argument to col_types
in order to read in only the columns which you want.
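As a sketch (the column name x here is hypothetical, not necessarily a column in test_1.csv):
read_csv(file = "data/test_1.csv",
         col_types = cols_only(x = col_double()))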
Consider the contents of the test_3.csv
file.
cat(readLines("data/test_3.csv"), sep = "\n")
Write code that will create default names for "data/test_3.csv"
by setting the col_names
argument to FALSE
.
Use the `col_names` argument and set it to FALSE
read_csv(file = "data/test_3.csv", col_names = ...)
read_csv("data/test_3.csv", col_names = FALSE)
The result when your code is run should look like this:
read_csv("data/test_3.csv", col_names = FALSE)
The argument col_names
can also be used to create custom column names.
Consider, again, the contents of the test_3.csv
file.
cat(readLines("data/test_3.csv"), sep = "\n")
Set the argument col_names
to a vector containing the column names "a"
, "b"
, and "c"
.
Use the `col_names` argument and set it to c("a", "b", "c").
read_csv(file = "data/test_3.csv", ... = c("a", "b", "c"))
read_csv("data/test_3.csv", col_names = c("a", "b", "c"))
The result when your code is run should look like this:
read_csv("data/test_3.csv", col_names = c("a", "b", "c"))
The col_names
argument is not just specific to read_csv()
; it can be used in other functions such as read_excel()
and read_delim().
Get rid of the column specification message by setting the col_types
argument to cols(a = col_double(), b = col_double(), c = col_double())
.
read_csv("data/test_3.csv", col_names = c("a", "b", "c"), ... = cols(a = ..., b = col_double(), ... = col_double()))
read_csv("data/test_3.csv", col_names = c("a", "b", "c"), col_types = cols(a = col_double(), b = col_double(), c = col_double()))
The result when your code is run should look like this:
read_csv("data/test_3.csv", col_names = c("a", "b", "c"), col_types = cols(a = col_double(), b = col_double(), c = col_double()))
There are many other arguments to cols
. Type ?cols
into your Console to explore!
Consider the contents of the test_5.csv
file. Note the "." for the first value for b
. In this file, a period indicates a missing value. This is not always true. Missing values can be indicated in many different ways. And, sometimes, a period is just a period.
cat(readLines("data/test_5.csv"), sep = "\n")
Write code to recognize the .
value for b
in "data/test_5.csv"
as an NA value by setting the na
argument to "."
in read_csv()
.
Use the `na` argument and set it to "."
read_csv(file = "data/test_5.csv", na = ".")
read_csv("data/test_5.csv", na = ".")
The result when your code is run should look like this:
read_csv("data/test_5.csv", na = ".")
Before "." was treated as NA, the column type of b was character; afterwards, it became a double. A single stray element can change the type of an entire column, which can mess up other parts of your code.
Consider the contents of the test_6.csv
file.
cat(readLines("data/test_6.csv"), sep = "\n")
Write code for skipping the text lines within "data/test_6.csv"
by setting the comment
argument to "#"
.
Use the `comment` argument and set it to "#".
read_csv(file = "data/test_6.csv", comment = "...")
read_csv("data/test_6.csv", comment = "#")
The result when your code is run should look like this:
read_csv("data/test_6.csv", comment = "#")
It doesn't always have to be "#"; it can be any character that designates a line as a comment!
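For instance, a sketch using readr's ability to read literal data from a quoted string (a trick covered later in this tutorial), with "//" marking comment lines:
read_csv("// file header, not data
x,y
1,2", comment = "//")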
Consider the contents of the test_7.csv
file.
cat(readLines("data/test_7.csv"), sep = "\n")
Write code to make sure the column grade
within "data/test_7.csv"
appears as an integer variable (col_integer()
), and student
as a character variable (col_character()
).
Use the col_types argument and set it to cols(grade = col_integer(), student = col_character())
read_csv("data/test_7.csv", col_types = cols(grade = col_integer(), student = col_character()))
The result when your code is run should look like this:
read_csv("data/test_7.csv", col_types = cols(grade = col_integer(), student = col_character()))
There are functions which specify each data type: col_logical()
, col_double()
, col_date()
, and so on.
Consider the contents of the test_bad_names.csv
file.
cat(readLines("data/test_bad_names.csv"), sep = "\n")
Many files will have column names that are not formatted correctly, but read_csv() has the name_repair
argument to fix that. Using the contents of "data/test_bad_names.csv"
, use the name_repair
argument and set it to "universal"
in read_csv().
read_csv(file = ..., ... = "universal")
read_csv(file = "data/test_bad_names.csv", name_repair = "universal")
The result when your code is run should look like this:
read_csv(file = "data/test_bad_names.csv", name_repair = "universal")
Setting name_repair to "universal" makes sure the column names are all unique and syntactic. There are other options, such as "minimal" and "unique", for this argument. Try them out!
Of course, a variable name like ..2021.enrolled
still looks pretty ugly, but it is better than a name which starts with a number.
Now read the file "data/test_bad_names.csv"
using read_csv()
. Then pipe it into clean_names()
, a function from the janitor package.
read_csv(... = "data/test_bad_names.csv") |> ...
read_csv(file = "data/test_bad_names.csv") |> clean_names()
The result when your code is run should look like this:
read_csv(file = "data/test_bad_names.csv") |> clean_names()
The function clean_names()
applies the janitor package's naming conventions and also makes the names unique. This lets you easily access the different columns without running into errors. x2021_enrolled
is a much better variable name than ..2021.enrolled
.
To make the code cleaner and to reduce the number of pipes, you can set the name_repair
argument to janitor::make_clean_names
in read_csv()
.
read_csv(file = "data/test_bad_names.csv", name_repair = ...)
read_csv(file = "data/test_bad_names.csv", name_repair = janitor::make_clean_names)
The result when your code is run should look like this:
read_csv(file = "data/test_bad_names.csv", name_repair = janitor::make_clean_names)
The janitor package has a function called remove_empty()
to remove empty rows and columns, remove_constant()
to remove columns of constant values, and many more.
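A quick sketch of those two functions on a made-up tibble:
dat <- tibble(
  a = c(1, NA, 3),
  b = NA,       # an entirely empty column
  c = "same"    # a constant column
)
dat |>
  remove_empty("cols") |>   # drops the empty column b
  remove_constant()         # drops the constant column c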
CSV files are just one type of text file. A text file is any file which includes plain text. The contents of such files are easy to look at in any text editor, or in RStudio.
Consider the contents of the text file delim_1.txt
:
cat(readLines("data/delim_1.txt"), sep = "\n")
Write code for reading this file into R. The values in the file are separated by pipes rather than commas. So, instead of read_csv()
, you should use read_delim()
. Don't forget that the delim_1.txt
file, like all the files in this tutorial, is in the data
directory.
Set the file argument to "data/delim_1.txt". Also use the `delim` argument and set it to "|".
read_delim("data/delim_1.txt", delim = "|")
The result when your code is run should look like this:
read_delim("data/delim_1.txt", delim = "|")
Note how the spaces and commas are included in the values for town
. You can't use read_csv()
here because the columns are not separated by commas.
Consider the contents of the text file delim_2.txt
:
cat(readLines("data/delim_2.txt"), sep = "\n")
Write code for reading this file into R. Use the col_types
argument to, first, prevent the col_types
message from printing out and, second, to set population
as an integer and, third, to ensure that date
is a <date>
variable.
Set the `col_types` argument to cols(date = col_date(format = ""), population = col_integer(), town = col_character())
read_delim("data/delim_2.txt", delim = "|", ... = cols(date = col_date(format = ""), ... = col_integer(), town = col_character()))
read_delim("data/delim_2.txt", delim = "|", col_types = cols(date = col_date(format = ""), population = col_integer(), town = col_character()))
The result when your code is run should look like this:
read_delim("data/delim_2.txt", delim = "|", col_types = cols(date = col_date(format = ""), population = col_integer(), town = col_character()))
Once you’ve mastered read_csv()
, using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for.
A CSV file doesn’t contain any information about the type of each variable (i.e. whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself.
In this tutorial, we will make use of the "quotation trick" which allows read_csv()
and related functions to read data directly from a quoted string, rather than a file. As an example, run this code:
read_csv("
  a, b, c
  1, 2, 3")
read_csv("
  a, b, c
  1, 2, 3")
This produces the same tibble as if the character string were in a separate file.
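In recent versions of readr, you can make this intent explicit by wrapping the string in I(), which marks it as literal data rather than a file path:
read_csv(I("a, b, c
1, 2, 3"))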
readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000 rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:
Does it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it's a logical.
Does it contain only numbers (e.g., 1, -4.5, 5e6, Inf)? If so, it's a number.
Does it match the ISO8601 standard? If so, it's a date or date-time.
Otherwise, it's a string.
You can see that behavior in action in this simple example. Press "Run Code".
read_csv("
  logical,numeric,date,string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi
")
read_csv("
  logical,numeric,date,string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi
")
The first row is just treated as column names by read_csv()
. It does not use this information when determining variable types.
The most common way column detection fails is that a column contains unexpected values, resulting in a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the NA
that readr expects. Press "Run Code".
simple_csv <- "
  x
  10
  .
  20
  30"
read_csv(simple_csv)
simple_csv <- "
  x
  10
  .
  20
  30"
read_csv(simple_csv)
Note how x
is read in as a character column when, obviously, it should be a number. In this very small case, you can easily see the missing value .
. But what happens if you have thousands of rows with only a few missing values represented by .
s speckled among them?
One approach is to tell read_csv()
that x
is a numeric column, and then see where it fails. As we saw earlier in the tutorial, you can do that with the col_types
argument, which takes a named list where the names match the column names in the CSV file. Run read_csv()
with simple_csv
as the first argument and col_types = list(x = col_double())
as the second.
read_csv(
  simple_csv,
  ... = list(x = col_double())
)
read_csv(
  simple_csv,
  col_types = list(x = col_double())
)
This worked in that x
is a <dbl>
. R has two built-in number variable types: integers and doubles. But how can we investigate the warning?
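A quick sketch of the distinction:
typeof(5)    # "double", the default for numbers in R
typeof(5L)   # "integer", requested with the L suffix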
Take the call to read_csv()
from the previous question and assign the output to an object called df
. Then, in the next line, run problems()
on df
.
... <- read_csv(
  simple_csv,
  col_types = list(x = col_double())
)
problems(...)
df <- read_csv(
  simple_csv,
  col_types = list(x = col_double())
)
problems(df)
This tells us that there was a problem in row 3, col 1 where readr expected a double but got a .
. That suggests this dataset uses .
for missing values. Real world data sets will often have many more issues but, with tools like problems()
, you can solve them one-by-one.
Since we now know that .
means a missing value in this data, we can now call read_csv()
on simple_csv
with the na
argument set to "."
.
read_csv(simple_csv, ... = ".")
read_csv(simple_csv, na = ".")
readr provides a total of nine column types for you to use. Here are the four most important:
col_logical()
and col_double()
read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.
col_integer()
reads integers. We seldom distinguish integers and doubles because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
col_character()
reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. long series of digits that identifies some object, but it doesn’t make sense to (e.g.) divide it in half, for example, a phone number, social security number, credit card number, etc.
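For example, a sketch (with made-up data) of why this matters: reading an identifier as a number silently drops its leading zero.
read_csv("phone
0123456789")   # guessed as a number: 123456789
read_csv("phone
0123456789", col_types = cols(phone = col_character()))   # kept as "0123456789"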
It’s also possible to override the default column type by switching from list()
to cols()
and specifying .default
. Use read_csv()
to read in another_csv
with the col_types
argument set to cols(.default = col_character()).
another_csv <- "
x,y,z
1,2,3"
read_csv(
  another_csv,
  ... = cols(... = col_character())
)
read_csv(
  another_csv,
  col_types = cols(.default = col_character())
)
Here are the other 5 column types from readr.
col_factor()
, col_date()
, and col_datetime()
create factors, dates, and date-times respectively.
col_number()
is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies.
col_skip()
skips a column so it’s not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.
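Here is a sketch of those last two in action on made-up literal data:
read_csv('price,notes
"$1,200",first
"$3,450",second', col_types = cols(price = col_number(), notes = col_skip()))
# a one-column tibble: price = 1200 and 3450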
Another useful helper is cols_only()
which will read in only the columns you specify. Run read_csv()
on another_csv
with the col_types
argument set to cols_only(x = col_character()).
read_csv(
  another_csv,
  col_types = ...(x = ...)
)
read_csv(
  another_csv,
  col_types = cols_only(x = col_character())
)
The help page for cols()
includes more details and discussion.
Consider another example file:
cat(readLines("data/ex_2.csv"), sep = "\n")
Read the data/ex_2.csv
file into R to check if there are any parsing mistakes.
read_csv("data/ex_2.csv")
read_csv("data/ex_2.csv")
Notice that R parses column a
and b
both as doubles. But what if column a
should be parsed as an integer and column b
should be parsed as a date?
cat(readLines("data/ex_2.csv"), sep = "\n")
Begin by using read_csv()
to read in the file ex_2.csv
. Then, set the col_types
argument to cols()
. Within cols()
, set .default
to col_character()
.
read_csv(...,
         col_types = cols(.default = ...))
read_csv("data/ex_2.csv",
         col_types = cols(.default = col_character()))
This is not what we want. But it is often convenient to read in a new file as all character variables. It is then easier to examine its contents directly.
cat(readLines("data/ex_2.csv"), sep = "\n")
Pipe the results of read_csv()
to the function mutate()
. Within mutate()
set a
to parse_integer(a)
.
... |> mutate(a = ...)
read_csv("data/ex_2.csv",
         col_types = cols(.default = col_character())) |>
  mutate(a = parse_integer(a))
You can also use parse_number()
but the resulting variable will be a <dbl>
, not an <int>
.
cat(readLines("data/ex_2.csv"), sep = "\n")
Continue your pipe with mutate()
. Use parse_date()
to transform b
to dates. The first argument to parse_date()
should be b
. The second argument should be format
. Set format
to "%Y%m%d"
.
... |> mutate(b = parse_date(b, format = "..."))
read_csv("data/ex_2.csv",
         col_types = cols(.default = col_character())) |>
  mutate(a = parse_integer(a)) |>
  mutate(b = parse_date(b, format = "%Y%m%d"))
"%Y%m%d" tells R how to read the string as a date: %Y matches a four-digit year, %m a two-digit month, and %d a two-digit day.
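A couple of sketches of parse_date() with different format strings:
parse_date("20210115", format = "%Y%m%d")     # "2021-01-15"
parse_date("15/01/2021", format = "%d/%m/%Y") # "2021-01-15"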
Let's explore one last file, ex_3.csv
, that has parsing problems.
cat(readLines("data/ex_3.csv"), sep = "\n")
Run read_csv("data/ex_3.csv")
and examine the parsing failures.
read_csv(...)
read_csv("data/ex_3.csv")
What are the problems here? First, R parses column x
as a character, when it is clearly a date. Also, column z
should be parsed as an integer, not a character!
cat(readLines("data/ex_3.csv"), sep = "\n")
Let's first fix column x
. Pipe the results of read_csv("data/ex_3.csv")
to the function mutate()
. Within mutate()
set x
equal to parse_date(x, "%d %B %Y")
.
... |> mutate(x = parse_date(...))
read_csv("data/ex_3.csv") |> mutate(x = parse_date(x, "%d %B %Y"))
The %d matches the day of the month as a number, the %B matches the full month name rather than a number, and the spaces in between match the spaces between each value. By customizing the format string, we can parse dates in almost any format.
Also note that we did not need to use the .default = col_character()
trick before we used mutate()
. Why? Because R already read all of the columns as characters to begin with.
cat(readLines("data/ex_3.csv"), sep = "\n")
Continue your pipe with mutate()
. Within mutate()
set z
to parse_number(z)
.
... |> mutate(z = parse_number(...))
read_csv("data/ex_3.csv") |>
  mutate(x = parse_date(x, "%d %B %Y")) |>
  mutate(z = parse_number(z))
parse_number()
is good at dealing with currency signs and other detritus.
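A quick sketch of that permissiveness:
parse_number(c("$1,000", "20%", "It cost 12.50"))   # returns 1000, 20, 12.5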
Cleaning multiple files at once is a common task.
Run list.files("data")
to see which files are in the data
folder.
...("data")
list.files("data")
The list.files()
function is part of base R. Check out its help page.
Change the call to list.files("data")
by setting the argument pattern
to "similar"
to list only the files with "similar" in their names.
list.files("data", ... = "similar")
list.files("data", pattern = "similar")
The result when your code is run should look like this:
list.files("data", pattern = "similar")
You can also set pattern to ".csv" or ".txt" to match those file types in a folder. Since pattern is interpreted as a regular expression, "\\.csv$" is the most precise way to match a file extension.
To show the directory each file came from, set the argument full.names
to TRUE
in list.files()
.
list.files("data", pattern = "similar", full.names = ...)
list.files("data", pattern = "similar", full.names = TRUE)
The result when your code is run should look like this:
list.files("data", pattern = "similar", full.names = TRUE)
These are the contents of similar_1.csv
, similar_2.csv
, similar_3.csv
, respectively.
cat(readLines("data/similar_1.csv"), sep = "\n")
cat(readLines("data/similar_2.csv"), sep = "\n")
cat(readLines("data/similar_3.csv"), sep = "\n")
Let's combine the files by piping the last call to list.files()
directly to read_csv()
!
list.files("data", pattern = "similar", full.names = TRUE) |> ...
list.files("data", pattern = "similar", full.names = TRUE) |> read_csv()
Column b
's type is chr
because the "." in similar_1.csv
makes R treat the entire column as character. We will fix that using the na
argument in read_csv()
.
Using the same pipeline, change read_csv()
to set the argument na
to "."
to get rid of the character in column b
.
... |> read_csv(... = ".")
list.files("data", pattern = "similar", full.names = TRUE) |> read_csv(na = ".")
The result when your code is run should look like this:
list.files("data", pattern = "similar", full.names = TRUE) |> read_csv(na = ".")
Because the "." is gone, column b
's type is dbl
now.
Now let's get rid of the annoying "specify column types" message by using the show_col_types
argument. In the call to read_csv()
, add the show_col_types
argument to FALSE
.
... |> read_csv(na = ".", ... = FALSE)
list.files("data", pattern = "similar", full.names = TRUE) |> read_csv(na = ".", show_col_types = FALSE)
You can use the other arguments of read_csv()
to further clean your files, such as col_names
, col_types
, or skip
.
Consider the three sales files currently in the data
directory. Run list.files()
with "data"
as the path
(first) argument and "sales"
as the value for the pattern
argument.
list.files(path = ..., pattern = ...)
list.files(path = "data", pattern = "sales")
Although there are only 3 files here, in many cases you will need to deal with hundreds or even thousands of files.
Pipe the results from list.files()
to read_csv()
. Don't forget to add full.names = TRUE
to the call to list.files()
, otherwise read_csv()
won't be able to find the files.
list.files(path = "data", pattern = "sales", full.names = ...) |> ...
list.files(path = "data", pattern = "sales", full.names = TRUE) |> read_csv()
Although this works, we have lost the information about which rows come from which input files. This is important because the file names tell us which month the data is from.
Using the same code as above, add the id
argument to read_csv()
with a value of "file"
.
list.files(path = "data", pattern = "sales", full.names = TRUE) |> read_csv(... = "file")
list.files(path = "data", pattern = "sales", full.names = TRUE) |> read_csv(id = "file")
This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources. It is often useful to know the "provenance" of a given piece of data. If something seems wrong later in the process, we will want to track it back to its original source.
readr also comes with two useful functions for writing data back to disk: write_csv()
and write_tsv()
. The most important arguments to these functions are x
(the data frame or tibble to write) and file
(the location to write it to). You can also specify how missing values are written with na
, and if you want to append
to an existing file.
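A sketch of those extra arguments (the file path here is just an example):
write_csv(students2, "data/students2.csv", na = "")        # write missing values as empty strings
write_csv(students2, "data/students2.csv", append = TRUE)  # add rows to an existing file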
Let's first create a new R object, students2
which is the result of the clean up we did above on the original "students.csv" file. Press "Run Code."
students2 <- students |>
  clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = if_else(age == "five", "5", age),
    age = parse_number(age)
  )
students2
students2 <- students |>
  clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = if_else(age == "five", "5", age),
    age = parse_number(age)
  )
students2
This workflow is very common. First, we interactively add code, line-by-line, to a pipe, running the entire pipe each time, examining the output as it is "spat" back to the screen. Second, once the pipe produces what we want, then we add an object, students2
in this case, to the front of the pipe, allowing us to create a permanent object with which we can then work.
Type students2
and hit "Run Code." This will produce the same output as print(students2)
.
students2
students2
students2
has been cleaned up from the original "students.csv", most importantly in terms of the variable names. Note that meal_plan
is a <fct>
, meaning a factor.
Use write_csv()
to write the contents of the students2
object to a file called "students2.csv"
which is located in the data
directory. Do this by setting the first argument of write_csv()
, x
, to students2
and then the second argument, file
, to "data/students2.csv"
.
write_csv(x = ..., ... = "data/students2.csv")
write_csv(x = students2, file = "data/students2.csv")
As with many commonly used functions, we will often drop the argument names. In that case, we would typically write write_csv(students2, "data/students2.csv")
.
Now let’s read that csv file back in. Run read_csv()
on "data/students2.csv"
.
...("data/students2.csv")
read_csv("data/students2.csv")
Note that variable type information is lost when you save to CSV because you’re starting over by reading from a plain text file again. meal_plan
is now <chr>
, meaning a character variable.
This makes CSVs a little unreliable for caching interim results --- you need to recreate the column specification every time you load the data in. There are two main alternatives: write_rds()
/read_rds()
and write_parquet()
/read_parquet()
.
RDS files store R objects in a file which can be saved on your computer. Then, if you come back to a project, even after restarting R, you can quickly load back the object, without redoing all the code which created it.
Consider the following plot.
iris |>
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_jitter() +
  labs(title = "Sepal Dimensions of Various Species of Iris",
       x = "Sepal Length",
       y = "Sepal Width")
We have saved the plot for you to an object named iris_p
. On line 8, use write_rds()
to save this plot to a file named test_1.rds
which is located within the data
directory. Note: Nothing will be displayed for you to see.
iris_p <- iris |>
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_jitter() +
  labs(title = "Sepal Dimensions of Various Species of Iris",
       x = "Sepal Length",
       y = "Sepal Width")
The first argument should be the object you want to save. The second argument should be the location in which you want the file to be saved and the file's name.
...(iris_p, "data/test_1.rds")
iris_p <- iris |>
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_jitter() +
  labs(title = "Sepal Dimensions of Various Species of Iris",
       x = "Sepal Length",
       y = "Sepal Width")

write_rds(iris_p, "data/test_1.rds")
The big advantage of creating an RDS file is that we can reload the object it contains later, without re-running the code which created it.
Run list.files("data")
. You should see your newly created file listed.
...("data")
list.files("data")
To find the file on your computer, you can set the list.files()
argument full.names
to TRUE
. This causes the full path for each file to be returned.
Let's now use read_rds()
to read in the newly created file! Set the file
argument to "data/test_1.rds"
.
read_rds(... = "data/test_1.rds")
read_rds(file = "data/test_1.rds")
Plots are just one example of what we can store in an RDS file. We can also store datasets.
Consider the following dataset.
glimpse(mtcars)
Use write_rds()
to save mtcars
to a file named test_2.rds
which is located in a directory called data
.
The first argument should be the object you want to save. The second argument should be the path you want the file saved as. The path includes both the location of the file and its name.
...(mtcars, "data/test_2.rds")
write_rds(mtcars, "data/test_2.rds")
An RDS file stores a single R object. If you want to save several objects at once, combine them into a list and save that list.
Run list.files("data")
. You should see your newly created file listed.
list.files(...)
list.files("data")
You can use append
with write_csv()
and similar text-based functions to append data to an existing file. That won't work with write_rds()
and other functions which work with binary data. In that case, the files must be recreated each time.
Let's now use read_rds()
to read in the newly created file! Set the file
argument to "data/test_2.rds"
.
read_rds(... = "data/test_2.rds")
read_rds(file = "data/test_2.rds")
write_rds()
and read_rds()
are the most commonly used approaches for saving/using R objects.
write_rds()
and read_rds()
are not the best approach for working with large data sets. In that case, use the functions write_parquet()
and read_parquet()
from the arrow package.
Copy/paste any question from the Apache Arrow FAQ below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
Parquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.
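A minimal sketch, assuming the arrow package is installed (the file path is hypothetical):
library(arrow)
write_parquet(students2, "data/students2.parquet")
read_parquet("data/students2.parquet")   # column types are preserved on the round trip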
Sometimes you’ll need to assemble a tibble “by hand,” doing a little data entry in your R script. There are two useful functions to help you do this which differ in whether you lay out the tibble by columns (tibble()
) or by rows (tribble()
).
Create a tibble by using the tibble()
function. Pass three arguments to tibble()
, which are the three variables you want to include in the new tibble: x = c(1, 2, 5)
, y = c("h", "m", "g")
, and z = c(0.08, 0.83, 0.60)
. Don't forget to separate input arguments with commas.
tibble( x = ..., ... = c("h", "m", "g"), ... = ... )
tibble(x = c(1, 2, 5), y = c("h", "m", "g"), z = c(0.08, 0.83, 0.60))
Laying out the data by column can make it hard to see how the rows are related, so an alternative is tribble()
, short for transposed tibble, which lets you lay out your data row by row.
Use tribble()
to create the same tibble as in the previous question. The first argument will be a row containing ~x, ~y, ~z
. The second argument, on a new row, will be 1, "h", 0.08
. And so on. Don't forget to add a comma at the end of each row.
tribble(
  ~x, ..., ~z,
  1, "h", ...,
  2, ..., 0.83,
  5, "g", ...
)
tribble(
  ~x, ~y, ~z,
  1, "h", 0.08,
  2, "m", 0.83,
  5, "g", 0.60
)
The difference between the name tibble()
and the name tribble()
is one letter: r. The r stands for rows since tribble()
allows you to type in the data by row.
This tutorial covered Chapter 7: Data import from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned how to import data into your R project using read_csv()
and related functions from the readr package. You also learned how to write out data to files with functions like write_csv().
The janitor package includes a variety of useful functions, especially clean_names()
.