library(learnr) library(gradethis) library(readr) library(dplyr) gradethis_setup() gap_asia_2007 <- gapminder %>% filter(year == 2007, continent == "Asia")
Today's class is all about reading into R and writing out data files out of R.
Because sometimes you want to work with your own damn data, instead of data about flowers, cars, life expectancy, or the Big Five personality scores of 2800 of your closest friends.
Normally there would be a separate worksheet file, but we'll do everything in-line in the guide for this class.
Let's first load the built-in gapminder dataset and the tidyverse:
library(gapminder) library(tidyverse)
Next, let's filter the data to only 2007 and only the Asia continent and save it to a new object.
(gap_asia_2007 <- gapminder %>% filter(year == 2007, continent == "Asia"))
We can write this to a comma-separated value (csv) file with just one command:
write_csv(gap_asia_2007, "exported_file.csv")
Try writing a csv named "gapminder_asia_2007.csv
":
write_csv(gap_asia_2007, "gapminder_asia_2007.csv")
grade_code()
But where did this file go? We should save the file in a sensible location. We need to practice controlling where R is running, where it looks for files, and where it writes out your files. To that end, let's talk about the working directory and RStudio Projects.
When you open, it "runs" in some folder on your computer. This is the place it will look for files to import and write files as output. Think about where your participation and homework files end up when you knit them.
If you have R/RStudio closed, and you open a .R or .Rmd file, R/RStudio will start in the folder holding that file.
If you open R/RStudio from the Windows Start menu, the Mac dock, the Mac Spotlight, etc., R/Studio will start in its default location (probably your user home directory, see Tools → Global Options → General → Default working directory…).
When I say "R/Studio will start in…", what I am referring to is R's "working
directory". Like I say above, this is the place R will look for files to import
and write files as output. You can check what R's current working directory
is using the getwd()
function:
getwd()
You can also change the working directory using the setwd()
function:
setwd(file.path("path", "to", "folder"))
Do not use setwd()
! You should always write your R scripts so that the entire
project is self-contained in a folder. All of the scripts, folders, data, output,
etc. should all "live" within this project folder.
Write all of your R scripts assuming:
rm(list = ls())
to clean the workspace--the workspace is already cleanlibrary()
install.packages()
calls unless they are commented out
or otherwise set not to runfile.choose()
,
read.clipboard()
, the buttons in RStudio, etc.data/cats-data_2020-03-04.xlsx
or file.path("data", "cats-data_2020-03-04.xlsx")
/Users/brenton/Research/cats_project/data/cats-data_2020-03-04.xlsx
or C:\\Users\\brenton\\Research\\cats_project\\data\\cats-data_2020-03-04.xlsx
file.path("C:", "Users", "brenton", "Research", "cats_project", "data", "cats-data_2020-03-04.xlsx")
file.path()
or here::here()
, rather
than typing out the whole path (Windows and Mac/Linux have different
syntaxes for file paths):path/to/folder
path\\to\\folder
This approach has several advantages:
RStudio projects can help with following these best practices. Let's make a project.
Click File → New Project…
and then make a new project in your participation repo
folder. Give it a useful name like DataSci-participation.Rproj
.
When you double click on a .Rproj file, it:
You can also set specfic options for each RStudio project (e.g., number of spaces to insert when you type Tab, etc.).
Let's practice closing RStudio and re-opening it by opening the RStudio project.
Run getwd()
to see where R is running.
You can make your projects very portable by using .Rproj RStudio projects and
relative paths with file.path()
. This still can be a little fiddly:
file.choose()
for relative paths.Enter the here
package. here
takes the idea of relative paths and extends it
in very useful ways.
The major function is here::here()
. Like file.path()
, here::here()
lets you
specify a path to a file and then adds the system-appropriate separators (/
or \\
).
Where here::here()
shines is that it figures out better where the relative paths
should start from. It looks round in the folders in your directory and finds
the .Rproj file, then constructs the relative file paths from there.
For example, create a new folder "data" in your participation repo, and within
it make a subfolder called "s008_data". Then, save gap_asia_2007
using the
here::here()
and write_csv()
functions:
write_csv(gap_asia_2007, here::here("data", "s008_data", "exported_file.csv"))
More details on here
are available in this
short article.
The same csv file that we just saved to disk can be imported into R again by specifying the path where it exists:
read_csv(here::here("data", "s008_data", "exported_file.csv"))
Normally we would store the imported data into a new object that you can use in
subsequent analyses using the assignment function <-
.
Notice that the output of the imported file is the same as the original tibble,
and read_csv()
was intelligent enough to detect the types of the columns.
This won't always be true so it's worth checking! The
read_csv package has
many additional options including the ability to specify column types (e.g.,
is "1990" a year or a number?), skip columns, skip rows, rename columns on
import, trim whitespace, and more.
One of the most important options to set is the na
argument, which specifies
what values to treat as NA
on import. By default, read_csv()
treats blank
cells (i.e., ""
) and cells with "NA"
as missing. You might need to change
this (e.g., if missing values are entered as -999
). Note that readxl::read_excel()
by default only has na = c("")
(no "NA"
)!
To import a CSV file from a web, assign the URL to a variable
url <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv"
and then apply read_csv file to the url
.
read_csv(url)
You try and read one in:
Link = "http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv"
read_csv("http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv")
grade_code()
(This doesn't work for all import functions, so I usually keep them separate as two steps.)
First, we'll need the package to load in Excel files:
library(readxl)
Datafiles from this tutorial were obtained from: https://beanumber.github.io/sds192/lab-import.html#data_from_an_excel_file
Unlike with a CSV file, tto import an .xls or .xlsx file from the internet, you first need to download it localy.
Note: The folder you want to save the file to has to exist!. If it doesn't, you will get an error.
Either create the folder path in Finder/Windows Explorer, or use the dir.create()
function:
dir.create(here::here("data", "s008_data"), recursive = TRUE)
Next, you download the file. To download it, create a new object called xls_url
and then using download.file
to dowload it to a specified destination path.
xls_url <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/GreatestGivers.xls" download.file(xls_url, here::here("data", "s008_data", "some_file.xls"), mode = "wb")
NOTE: The mode = "wb"
argument at the end is really important if you are on
Windows. If you omit, you will probabbly get a message about downloading a
corrupt file. More details about this behaviour can be found
here.
Naming a file "some_file" is extremely bad practice (hard to keep track of the files), and you should always give it a more descriptive name. Often, it's a good idea to name the file similarly (or the same as) the original file (sometimes that might not be a good idea if the original name is non-descriptive).
There's handy trick to extract the filename from the URL:
file_name <- basename(xls_url) download.file(xls_url, here::here("data", "s008_data", file_name), mode = "wb")
Now we can import the file:
read_excel(here::here("data", "s008_data", file_name))
Let's load a sample SPSS file and work with it. Download the file from
here
and save it in your data/s008_data
folder.
These data are a random subset of the data used in this paper. This was a study looking at personality traits that distinguish C-level executives from lower-level managers among men and women.
The subset of data here consists of 200 cases, with variables indicating: 1. The language of assessment (English, Dutch, or French) 2. Gender 3. C-level or not 4. Extraversion level, as well as 4 facet traits (Leading, Communion, Persuasive, Motivating)
Let's load in the data using the haven
package.
(clevel <- haven::read_spss(here::here("data", "s008_data", "clevel.sav")))
Notice that this tibble looks a little different for the language
and gender
variables than normal. It has labels for the numeric values. This format
is what SPSS uses, but it's not standard for R. Let's convert those variables,
and isClevel
as well, to factors:
clevel_cleaned <- clevel %>% mutate(language = as_factor(language), gender = as_factor(gender), isClevel = factor(isClevel, levels = c(0, 1), labels = c("No", "Yes")) ) %>% print()
Notice how the variables are now factors with labels as the entries, instead of the original code numbers.
Let's save this file as a CSV file so that it's a smaller file and easier to import again in the future.
write_csv(clevel_cleaned, here::here("data", "s008_data", "clevel_cleaned.csv"))
Now let's make a plot.
clevel_plot <- clevel_cleaned %>% mutate(isClevel = recode(isClevel, No = "Below C-level", Yes = "C-level"), gender = recode(gender, Female = "Women", Male = "Men")) %>% ggplot(aes(paste(isClevel, gender, sep = "\n"), Extraversion, color = gender)) + geom_boxplot() + geom_jitter(height = .2) + scale_color_manual(values = c("#1b9e77", "#7570b3")) + ggtitle("Extraversion Stan Scores") + scale_y_continuous(breaks = 1:9) + ggthemes::theme_fivethirtyeight() %>% print()
Let's save the plot in several formats. This is useful if we want to use the plot
outside of Markdown. Save plots using the ggsave()
funnction.
ggsave()
has many options. See the help function ?ggsave
for full details.
The main arguments are filename
, plot
, width
and height
(inches by default),
and dpi
(dots per inch; for bitmap formats).
ggsave()
will try to guess what format you want based on the file name. If you
want, you can specify a specific format or R graphics device to save with using
the device
argument.
dir.create(here::here("output", "figures")) ggsave(here::here("output", "figures", "clevel_extraversion.svg"), clevel_plot)
You can save to several formats. Generally, work with a vector format like
.svg
, .eps
, or .pdf
. Vector graphics represent the image as a series of
data points and equations. This means that they can be made smaller or larger or
zoomed in on without damaging the image quality.
If you can't use a vector format for some reason, you can also export to a bitmap format. Bitmap graphs represent the image as colored dots or pixels. This means that the image quality will suffer if you make the image larger or zoom in on it (making it smaller can also sometimes compromise quality). With bitmap images, you need to be concerned with resolution (how many pixels/dots per inch when printed). Always use at least 300 DPI resolution.
There are several popular bitmap image formats. .tiff
/.tif
is the highest
quality, but also the largest. Use it for print graphics, but you should probably
avoid it for images to be hosted on the web. .png
is a bit smaller, and it
should be your go to for charts, figures, line drawings, etc. More complex
images (e.g., photos) can get pretty big with .png
, though. .jpg
/.jpeg
is
probably the most popular bitmap format. It works well for photographs hosted on
the web, but its compression often makes line drawing and charts look terrible.
.jpg
/.jpeg
also degrades in quality each time you edit/save it, so don't
use it for images you intend to edit. Generally avoid .gif
unless you are
making an animation or you need very small file size for a simple image. .gif
supports very few colors, so always check your image quality after making a .gif
.
Let's save to some other formats:
ggsave(here::here("output", "figures", "clevel_extraversion.eps"), clevel_plot) ggsave(here::here("output", "figures", "clevel_extraversion.pdf"), clevel_plot) ggsave(here::here("output", "figures", "clevel_extraversion.tiff"), clevel_plot) ggsave(here::here("output", "figures", "clevel_extraversion.png"), clevel_plot)
Follow a consistent folder structure for all of your projects. This will make it easier for you to say organized and make your code, data, and projects easy to share.
In your project's root folder, you should have a README.md file and a .Rproj project. Then, you should have folders that separate different types of files:
data
: Stores all of your data files for a projectdata-raw
and data
folders is a good idea. markdown
: Stores your RMarkdown docuemnts. output
: Stores any output your scripts generatefigures
, reports
, etc.You might add additional folders, such as:
tests
: A folder that includes tests to check that your scripts or results
are accurate. Check out the testthat
package.templates
: A folder to hold template files (e.g., RMarkdown templates,
Word templates, CSS files for HTML output, TeX templates for PDF output)admin
: For adminstrative documents (e.g., IRB approval, grant information)doc
: For documentation (e.g., variable codebooks, style guides), scripts
or src
or R
: Folders to store functions and scripts that
you call from your markdown (e.g., a data import and cleaning script)R
, python
, sql
, C
, etc.scripts
or src
folder if not too manyrio
Do the clevel
activity again, but this time use the rio::import()
and
rio::export()
functions instead of the read_*()
and write_*()
functions.
In rio::import()
, be sure to specify: rio::import(..., setclass = "tibble")
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.