library(learnr)
library(gradethis)
library(readr)
library(dplyr)

gradethis_setup()

gap_asia_2007 <- gapminder %>% filter(year == 2007, continent == "Asia")

File input/output (I/O)

Today's class is all about reading into R and writing out data files out of R.

Because sometimes you want to work with your own damn data, instead of data about flowers, cars, life expectancy, or the Big Five personality scores of 2800 of your closest friends.

Worksheet

Normally there would be a separate worksheet file, but we'll do everything in-line in the guide for this class.

Resources

References and tutorials

Package documentation

Writing data to disk

Let's first load the built-in gapminder dataset and the tidyverse:

library(gapminder)
library(tidyverse)

Next, let's filter the data to only 2007 and only the Asia continent and save it to a new object.

(gap_asia_2007 <- gapminder %>% filter(year == 2007, continent == "Asia"))

We can write this to a comma-separated value (csv) file with just one command:

write_csv(gap_asia_2007, "exported_file.csv")

Try writing a csv named "gapminder_asia_2007.csv":


write_csv(gap_asia_2007, "gapminder_asia_2007.csv")
grade_code()

But where did this file go? We should save the file in a sensible location. We need to practice controlling where R is running, where it looks for files, and where it writes out your files. To that end, let's talk about the working directory and RStudio Projects.

The working directory and RStudio Projects

When you open, it "runs" in some folder on your computer. This is the place it will look for files to import and write files as output. Think about where your participation and homework files end up when you knit them.

If you have R/RStudio closed, and you open a .R or .Rmd file, R/RStudio will start in the folder holding that file.

If you open R/RStudio from the Windows Start menu, the Mac dock, the Mac Spotlight, etc., R/Studio will start in its default location (probably your user home directory, see Tools → Global Options → General → Default working directory…).

The working directory

When I say "R/Studio will start in…", what I am referring to is R's "working directory". Like I say above, this is the place R will look for files to import and write files as output. You can check what R's current working directory is using the getwd() function:

getwd()

You can also change the working directory using the setwd() function:

setwd(file.path("path", "to", "folder"))

Do not use setwd()! You should always write your R scripts so that the entire project is self-contained in a folder. All of the scripts, folders, data, output, etc. should all "live" within this project folder.

Write all of your R scripts assuming:

  1. You are running them by opening a fresh new R session
    • Don't use rm(list = ls()) to clean the workspace--the workspace is already clean
    • You need to load required pacakges with library()
    • Don't work on several different projects in one R session at the same time!
  2. You have all of the necessary packages installed
    • Don't include install.packages() calls unless they are commented out or otherwise set not to run
  3. The script will run without human input
    • You need to import or load any data you are working with
    • Load data and write output using R commands, not file.choose(), read.clipboard(), the buttons in RStudio, etc.
  4. All of the needed files live in your project folder
    • Write relative paths, rather than absolute paths
    • data/cats-data_2020-03-04.xlsx or file.path("data", "cats-data_2020-03-04.xlsx")
    • Not: /Users/brenton/Research/cats_project/data/cats-data_2020-03-04.xlsx or C:\\Users\\brenton\\Research\\cats_project\\data\\cats-data_2020-03-04.xlsx
      orfile.path("C:", "Users", "brenton", "Research", "cats_project", "data", "cats-data_2020-03-04.xlsx")
  5. Your script might run on any system
    • Write the paths to files using file.path() or here::here(), rather than typing out the whole path (Windows and Mac/Linux have different syntaxes for file paths):
    • Mac/Linux: path/to/folder
    • Windows: path\\to\\folder

This approach has several advantages:

  1. Frictionless running on different computers
  2. Less breakage (e.g., if you move a folder around, the relative paths will still work)
  3. Less surprising or weird behaviors due to session crud
  4. Easy to tweak data/code and update results

RStudio projects

RStudio projects can help with following these best practices. Let's make a project.

Click File → New Project… and then make a new project in your participation repo folder. Give it a useful name like DataSci-participation.Rproj.

When you double click on a .Rproj file, it:

  1. Opens a new fresh R session, with
  2. The working directory set to the location of the .Rproj file, and
  3. No connection whatsoever to any other R sessions you already have open

You can also set specfic options for each RStudio project (e.g., number of spaces to insert when you type Tab, etc.).

Let's practice closing RStudio and re-opening it by opening the RStudio project. Run getwd() to see where R is running.

here::here()

You can make your projects very portable by using .Rproj RStudio projects and relative paths with file.path(). This still can be a little fiddly:

  1. Need to remember to open the .Rproj file, not a .Rmd or .R file
  2. If you open a .Rmd file contained in a subfolder, the working directory won't be right for working with file.choose() for relative paths.

Enter the here package. here takes the idea of relative paths and extends it in very useful ways.

The major function is here::here(). Like file.path(), here::here() lets you specify a path to a file and then adds the system-appropriate separators (/ or \\). Where here::here() shines is that it figures out better where the relative paths should start from. It looks round in the folders in your directory and finds the .Rproj file, then constructs the relative file paths from there.

For example, create a new folder "data" in your participation repo, and within it make a subfolder called "s008_data". Then, save gap_asia_2007 using the here::here() and write_csv() functions:

write_csv(gap_asia_2007, here::here("data", "s008_data", "exported_file.csv"))

More details on here are available in this short article.

Reading data from disk

The same csv file that we just saved to disk can be imported into R again by specifying the path where it exists:

read_csv(here::here("data", "s008_data", "exported_file.csv"))

Normally we would store the imported data into a new object that you can use in subsequent analyses using the assignment function <-.

Notice that the output of the imported file is the same as the original tibble, and read_csv() was intelligent enough to detect the types of the columns. This won't always be true so it's worth checking! The read_csv package has many additional options including the ability to specify column types (e.g., is "1990" a year or a number?), skip columns, skip rows, rename columns on import, trim whitespace, and more.

One of the most important options to set is the na argument, which specifies what values to treat as NA on import. By default, read_csv() treats blank cells (i.e., "") and cells with "NA" as missing. You might need to change this (e.g., if missing values are entered as -999). Note that readxl::read_excel() by default only has na = c("") (no "NA")!

Import a file from the web/cloud

Import a CSV file from the internet

To import a CSV file from a web, assign the URL to a variable

url <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv"

and then apply read_csv file to the url.

read_csv(url)

You try and read one in:

Link = "http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv"


read_csv("http://gattonweb.uky.edu/sheather/book/docs/datasets/magazines.csv")
grade_code()

(This doesn't work for all import functions, so I usually keep them separate as two steps.)

Import an Excel file (.xls or .xlsx) from the internet

First, we'll need the package to load in Excel files:

library(readxl) 

Datafiles from this tutorial were obtained from: https://beanumber.github.io/sds192/lab-import.html#data_from_an_excel_file

Unlike with a CSV file, tto import an .xls or .xlsx file from the internet, you first need to download it localy.

Note: The folder you want to save the file to has to exist!. If it doesn't, you will get an error.

Either create the folder path in Finder/Windows Explorer, or use the dir.create() function:

dir.create(here::here("data", "s008_data"), recursive = TRUE)

Next, you download the file. To download it, create a new object called xls_url and then using download.file to dowload it to a specified destination path.

xls_url <- "http://gattonweb.uky.edu/sheather/book/docs/datasets/GreatestGivers.xls"
download.file(xls_url, here::here("data", "s008_data", "some_file.xls"), mode = "wb")

NOTE: The mode = "wb" argument at the end is really important if you are on Windows. If you omit, you will probabbly get a message about downloading a corrupt file. More details about this behaviour can be found here.

Naming a file "some_file" is extremely bad practice (hard to keep track of the files), and you should always give it a more descriptive name. Often, it's a good idea to name the file similarly (or the same as) the original file (sometimes that might not be a good idea if the original name is non-descriptive).

There's handy trick to extract the filename from the URL:

file_name <- basename(xls_url)
download.file(xls_url, here::here("data", "s008_data", file_name), mode = "wb")

Now we can import the file:

read_excel(here::here("data", "s008_data", file_name))

Read in a sample SPSS file.

Let's load a sample SPSS file and work with it. Download the file from here and save it in your data/s008_data folder.

These data are a random subset of the data used in this paper. This was a study looking at personality traits that distinguish C-level executives from lower-level managers among men and women.

The subset of data here consists of 200 cases, with variables indicating: 1. The language of assessment (English, Dutch, or French) 2. Gender 3. C-level or not 4. Extraversion level, as well as 4 facet traits (Leading, Communion, Persuasive, Motivating)

Let's load in the data using the haven package.

(clevel <- haven::read_spss(here::here("data", "s008_data", "clevel.sav")))

Notice that this tibble looks a little different for the language and gender variables than normal. It has labels for the numeric values. This format is what SPSS uses, but it's not standard for R. Let's convert those variables, and isClevel as well, to factors:

clevel_cleaned <-
  clevel %>% 
  mutate(language = as_factor(language),
         gender = as_factor(gender),
         isClevel = factor(isClevel, 
                           levels = c(0, 1), 
                           labels = c("No", "Yes"))
  ) %>% 
  print()

Notice how the variables are now factors with labels as the entries, instead of the original code numbers.

Saving data frames

Let's save this file as a CSV file so that it's a smaller file and easier to import again in the future.

write_csv(clevel_cleaned, here::here("data", "s008_data", "clevel_cleaned.csv"))

Saving plots

Now let's make a plot.

clevel_plot <-
  clevel_cleaned %>% 
  mutate(isClevel = recode(isClevel, 
                           No = "Below C-level", 
                           Yes = "C-level"),
         gender = recode(gender,
                         Female = "Women",
                         Male = "Men")) %>% 
  ggplot(aes(paste(isClevel, gender, sep = "\n"), Extraversion, color = gender)) +
  geom_boxplot() +
  geom_jitter(height = .2) +
  scale_color_manual(values = c("#1b9e77", "#7570b3")) +
  ggtitle("Extraversion Stan Scores") +
  scale_y_continuous(breaks = 1:9) +
  ggthemes::theme_fivethirtyeight() %>% 
  print()

Let's save the plot in several formats. This is useful if we want to use the plot outside of Markdown. Save plots using the ggsave() funnction.

ggsave() has many options. See the help function ?ggsave for full details. The main arguments are filename, plot, width and height (inches by default), and dpi (dots per inch; for bitmap formats).

ggsave() will try to guess what format you want based on the file name. If you want, you can specify a specific format or R graphics device to save with using the device argument.

dir.create(here::here("output", "figures"))
ggsave(here::here("output", "figures", "clevel_extraversion.svg"), clevel_plot)

You can save to several formats. Generally, work with a vector format like .svg, .eps, or .pdf. Vector graphics represent the image as a series of data points and equations. This means that they can be made smaller or larger or zoomed in on without damaging the image quality.

If you can't use a vector format for some reason, you can also export to a bitmap format. Bitmap graphs represent the image as colored dots or pixels. This means that the image quality will suffer if you make the image larger or zoom in on it (making it smaller can also sometimes compromise quality). With bitmap images, you need to be concerned with resolution (how many pixels/dots per inch when printed). Always use at least 300 DPI resolution.

There are several popular bitmap image formats. .tiff/.tif is the highest quality, but also the largest. Use it for print graphics, but you should probably avoid it for images to be hosted on the web. .png is a bit smaller, and it should be your go to for charts, figures, line drawings, etc. More complex images (e.g., photos) can get pretty big with .png, though. .jpg/.jpeg is probably the most popular bitmap format. It works well for photographs hosted on the web, but its compression often makes line drawing and charts look terrible. .jpg/.jpeg also degrades in quality each time you edit/save it, so don't use it for images you intend to edit. Generally avoid .gif unless you are making an animation or you need very small file size for a simple image. .gif supports very few colors, so always check your image quality after making a .gif.

Let's save to some other formats:

ggsave(here::here("output", "figures", "clevel_extraversion.eps"), clevel_plot)
ggsave(here::here("output", "figures", "clevel_extraversion.pdf"), clevel_plot)
ggsave(here::here("output", "figures", "clevel_extraversion.tiff"), clevel_plot)
ggsave(here::here("output", "figures", "clevel_extraversion.png"), clevel_plot)

Organizing your project folders

Follow a consistent folder structure for all of your projects. This will make it easier for you to say organized and make your code, data, and projects easy to share.

In your project's root folder, you should have a README.md file and a .Rproj project. Then, you should have folders that separate different types of files:

You might add additional folders, such as:

Bonus Activity: rio

Do the clevel activity again, but this time use the rio::import() and rio::export() functions instead of the read_*() and write_*() functions. In rio::import(), be sure to specify: rio::import(..., setclass = "tibble").



bwiernik/progdata documentation built on Feb. 1, 2021, 2:33 a.m.