library(learnr)
library(testwhat)
library(magrittr)

tutorial_options(
  exercise.timelimit = 60,
  exercise.checker = testwhat::testwhat_learnr
)
knitr::opts_chunk$set(comment = NA)

Disclaimer

Large parts of this tutorial are adapted from tutorials published on GitHub by RStudio and its Education team, mainly their 2-day internal R bootcamp and the RStudio Cloud primers.

Reading rectangular data

The readr and readxl packages

Reading and summarizing data - The skimr package

nobel <- readr::read_csv(file = "www/nobel.csv")
skimr::skim(nobel)

The skimr package provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a skim_df object which can be included in a pipeline or displayed nicely for the human reader.
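As a minimal sketch of using the skim_df object in a pipeline, the example below skims the built-in `iris` data set (a stand-in chosen here, not the Nobel data) and keeps only the rows describing numeric variables. It assumes a recent skimr version in which the result has `skim_type` and `skim_variable` columns.

```r
# iris ships with R and stands in for your own data
skimmed <- skimr::skim(iris)
class(skimmed)

# The result has one row per variable, so ordinary dplyr verbs apply
numeric_rows <- dplyr::filter(skimmed, skim_type == "numeric")
numeric_rows$skim_variable
```

Printing `skimmed` gives the formatted summary for the human reader, while `numeric_rows` can be passed on to further data manipulation steps.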

Writing data

df <- tibble::tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
)

list.files()
readr::write_csv(df, file = "df.csv") # `path` is deprecated since readr 1.4.0
list.files()
# Print the first three lines of the file (works on any platform):
writeLines(readLines("df.csv", n = 3))
if (file.exists("df.csv"))
  file.remove("df.csv")
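
A slightly more defensive variant of the same round trip is sketched below. It writes to a temporary file instead of the working directory, so nothing needs to be cleaned up by hand. The object names `tmp`, `raw_lines` and `df_back` are illustrative only.

```r
# Same toy data as above
df <- tibble::tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
)

# Write to a temporary file so the working directory stays clean
tmp <- tempfile(fileext = ".csv")
readr::write_csv(df, file = tmp)

# Inspect the raw text that was written, then read it back as a tibble
raw_lines <- readr::read_lines(tmp)
raw_lines
df_back <- readr::read_csv(tmp, show_col_types = FALSE)
df_back

unlink(tmp)
```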

Exercise


dir.create("www")
download.file(
  url = "https://github.com/astamm/teachr/raw/master/inst/tutorials/03_DataImport/www/nobel.csv", 
  destfile = "www/nobel.csv", 
  mode = "wb"
)
**Hint:** Use the `%in%` operator when filtering.
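
To illustrate the hint, here is a minimal sketch on a small made-up tibble. The column names `name` and `category` and all values are hypothetical and are not taken from the Nobel data.

```r
# Hypothetical data, for illustration only
laureates <- tibble::tribble(
  ~name,      ~category,
  "Person A", "Physics",
  "Person B", "Peace",
  "Person C", "Chemistry",
  "Person D", "Literature"
)

# Keep the rows whose category matches any value in the vector
wanted <- c("Physics", "Chemistry")
science <- dplyr::filter(laureates, category %in% wanted)
science
```

`x %in% y` returns one logical value per element of `x`, which is what a filter needs. Writing `category == wanted` instead would compare element by element and recycle the shorter vector.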

Variable names

edi_airbnb <- readr::read_csv(file = "www/edi-airbnb.csv")
names(edi_airbnb)

... but R doesn't allow spaces in variable names:

edi_airbnb$Number of bathrooms

Option 1 - Quote variable names with backticks

edi_airbnb$`Number of bathrooms`
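
The same idea can be tried without the Airbnb file. The sketch below builds a small data frame with a non-syntactic column name (the values are made up) and accesses it in two equivalent ways.

```r
# check.names = FALSE keeps the name with spaces unchanged
rooms <- data.frame(
  `Number of bathrooms` = c(1, 2.5, 1),
  check.names = FALSE
)
names(rooms)

# Backticks work with `$`; a plain string works with `[[`
rooms$`Number of bathrooms`
rooms[["Number of bathrooms"]]
avg_bathrooms <- mean(rooms$`Number of bathrooms`)
avg_bathrooms
```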

Option 2 - Define column names

edi_airbnb_col_names <- readr::read_csv(
  file = "www/edi-airbnb.csv", 
  skip = 1, # skip the original header row, which col_names replaces
  col_names = c(
    "id", "price", "neighbourhood", 
    "accommodates", "bathroom", "bedroom", 
    "bed", "review_scores_rating", 
    "n_reviews", "url"
  )
)

names(edi_airbnb_col_names)

Option 3 - Format text to snake_case

edi_airbnb_cleaned_names <- edi_airbnb %>%
  janitor::clean_names()

names(edi_airbnb_cleaned_names)

The janitor package has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff. Its main functions format data.frame column names, provide quick counts of variable combinations (frequency tables and crosstabs), and explore duplicate records.
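
As a self-contained sketch of what `janitor::clean_names()` does, the example below applies it to a small made-up tibble whose column names mix spaces and capital letters. The data and object names are illustrative only.

```r
# Hypothetical data with awkward column names
messy <- tibble::tibble(
  `Number of bathrooms` = c(1, 2),
  `Review Scores Rating` = c(95, 88),
  URL = c("https://example.com/1", "https://example.com/2")
)

# clean_names() converts every column name to snake_case
clean <- janitor::clean_names(messy)
names(clean)
```

After cleaning, all columns can be accessed with `$` without backticks.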

Variable types

Suppose you have some data stored in a CSV file that looks like this:

knitr::include_graphics("images/df-na.png")

Let us import it in our R session:

readr::read_csv("www/df-na.csv")

What are the types of the different variables? Are they what you expected?

Option 1 - Explicit NAs

A first solution is to use the `na` argument to explicitly list all values in the file that should be treated as NA:

readr::read_csv(
  file = "www/df-na.csv", 
  na = c("", "NA", ".", "9999", "Not applicable")
)
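
The effect of the `na` argument can be checked on a small file written on the fly. The sketch below uses made-up content (not the actual www/df-na.csv) and compares the column type of `x` with and without explicit NA strings.

```r
# Hypothetical CSV content with several spellings of "missing"
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("x,y,z",
    "1,a,hi",
    "NA,b,hello",
    "3,Not applicable,9999",
    ".,d,hola"),
  tmp
)

# Default settings: the "." forces the whole x column to character
default_read <- readr::read_csv(tmp, show_col_types = FALSE)
class(default_read$x)

# Explicit NA strings: x can now be parsed as a number
explicit_na <- readr::read_csv(
  tmp,
  na = c("", "NA", ".", "9999", "Not applicable"),
  show_col_types = FALSE
)
class(explicit_na$x)
na_counts <- colSums(is.na(explicit_na))
na_counts

unlink(tmp)
```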

Option 2 - Specify column types

readr::read_csv(
  file = "www/df-na.csv", 
  col_types = list(
    readr::col_double(), 
    readr::col_character(), 
    readr::col_character()
  )
)
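
With this option, values that cannot be converted to the requested type become NA and are recorded as parsing problems. The sketch below shows this on made-up content (again not the actual www/df-na.csv).

```r
# Hypothetical CSV content: "." cannot be parsed as a double
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("x,y,z",
    "1,a,hi",
    ".,b,9999",
    "3,Not applicable,hola"),
  tmp
)

# suppressWarnings() hides the parsing warning; problems() retrieves the details
typed <- suppressWarnings(
  readr::read_csv(
    file = tmp,
    col_types = list(
      readr::col_double(),
      readr::col_character(),
      readr::col_character()
    )
  )
)
typed
readr::problems(typed)

unlink(tmp)
```

Note that only `x` is turned into a number. The strings "9999" and "Not applicable" in the character columns are kept as they are, so this option complements explicit NA strings rather than replacing them.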

Column types

type function    | data type
---------------- | -------------
col_character()  | character
col_date()       | date
col_datetime()   | POSIXct (date-time)
col_double()     | double (numeric)
col_factor()     | factor
col_guess()      | let readr guess (default)
col_integer()    | integer
col_logical()    | logical
col_number()     | numbers mixed with non-number characters
col_numeric()    | double or integer
col_skip()       | do not read
col_time()       | time
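
A few of these type functions are combined in the sketch below, using `readr::cols()` to assign them by column name. The file content and column names (`id`, `day`, `size`, `notes`) are made up for illustration.

```r
# Hypothetical CSV content
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("id,day,size,notes",
    "1,2023-01-12,small,foo",
    "2,2023-01-13,large,bar"),
  tmp
)

# Name each column explicitly; col_skip() drops it from the result
typed <- readr::read_csv(
  file = tmp,
  col_types = readr::cols(
    id    = readr::col_integer(),
    day   = readr::col_date(),
    size  = readr::col_factor(),
    notes = readr::col_skip()
  )
)
typed

unlink(tmp)
```

Naming the columns in `cols()` makes the specification independent of column order, at the cost of being a bit more verbose than an unnamed list.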

Exercise 1


dir.create("www")
download.file(
  url = "https://github.com/astamm/teachr/raw/master/inst/tutorials/03_DataImport/www/favourite-food.xlsx", 
  destfile = "www/favourite-food.xlsx", 
  mode = "wb"
)

read_rds() and write_rds()

readr::read_rds(file)
readr::write_rds(x, file)
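
These two functions are thin wrappers around base R's `readRDS()` and `saveRDS()`. One reason to prefer the RDS format over CSV is that it stores R objects as they are, including column types. The sketch below illustrates this with a made-up tibble containing a factor.

```r
# Hypothetical data with a factor column and a custom level order
df <- tibble::tibble(
  id = 1:3,
  size = factor(c("small", "large", "small"), levels = c("small", "large"))
)

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")
readr::write_rds(df, rds_file)
readr::write_csv(df, csv_file)

from_rds <- readr::read_rds(rds_file)
from_csv <- readr::read_csv(csv_file, show_col_types = FALSE)

# The factor survives the RDS round trip but comes back as character from CSV
class(from_rds$size)
class(from_csv$size)

unlink(c(rds_file, csv_file))
```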

Exercise 2


dir.create("www")
download.file(
  url = "https://github.com/astamm/teachr/raw/master/inst/tutorials/03_DataImport/www/favourite-food.xlsx", 
  destfile = "www/favourite-food.xlsx", 
  mode = "wb"
)

Exercise 3

The sales data set (located at www/sales.xlsx) looks like this:

readxl::read_excel("www/sales.xlsx")

Read it using appropriate arguments for the readxl::read_excel() function such that it looks like the following:

sales <- readxl::read_excel(
  path = "www/sales.xlsx", 
  skip = 3, 
  col_names = c("id", "n")
)
sales

dir.create("www")
download.file(
  url = "https://github.com/astamm/teachr/raw/master/inst/tutorials/03_DataImport/www/sales.xlsx", 
  destfile = "www/sales.xlsx", 
  mode = "wb"
)

Other types of data

Missing Values

As stated in the description of the naniar package, missing values are ubiquitous in data and need to be explored and handled in the initial stages of analysis.

Visualizing missing values

The naniar package provides data structures and functions that facilitate the plotting of missing values and examination of imputations. This allows missing data dependencies to be explored with minimal deviation from the common work patterns of 'ggplot2' and tidy data.
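
As a brief sketch of what this looks like in practice, the example below uses the built-in `airquality` data set (chosen here for illustration) with three naniar functions: a tabular summary and two ggplot2-based plots.

```r
# Number and percentage of missing values per variable
miss_summary <- naniar::miss_var_summary(airquality)
miss_summary

# Heatmap-style overview of where the missing values are
naniar::vis_miss(airquality)

# Number of missing values per variable, as a plot
naniar::gg_miss_var(airquality)
```

Both plots are regular ggplot objects, so they can be customized further with the usual ggplot2 layers and themes.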

You can always decide to discard every observation that contains a missing value, but this strategy is often suboptimal: it can drastically reduce your sample size and lead to weak statistical conclusions. A better strategy is usually imputation.
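
The cost of discarding incomplete observations is easy to quantify. The sketch below does so in base R on the built-in `airquality` data set, used here purely as an illustration.

```r
n_all <- nrow(airquality)
n_complete <- nrow(na.omit(airquality))

# Rows before and after dropping every observation with at least one NA
c(all = n_all, complete = n_complete)

# Share of observations lost
1 - n_complete / n_all
```

In this data set, 42 of the 153 observations are lost, even though only two of the six variables contain missing values.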

Imputing missing values

Imputation means replacing missing values by actual values using the available non-missing data. For imputation, we recommend the packages Amelia, mice and miceFast. In particular, the Amelia::amelia() and mice::mice() functions provide a very easy way to perform multiple imputation on a data set.
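
A minimal sketch of a multiple imputation workflow with mice is given below, again on the built-in `airquality` data set. The number of imputations, the seed and the regression model are arbitrary choices made for illustration, not recommendations.

```r
# Create m = 5 imputed versions of the data; the seed makes the run reproducible
imp <- mice::mice(airquality, m = 5, seed = 123, printFlag = FALSE)

# Extract the first completed data set
completed <- mice::complete(imp, 1)
anyNA(completed)

# Fit the same model on each imputed data set, then pool the results
fit <- with(imp, lm(Ozone ~ Wind + Temp))
pooled <- mice::pool(fit)
summary(pooled)
```

Pooling matters because it carries the uncertainty due to imputation into the final standard errors, which analysing a single completed data set would ignore.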

The miceFast package has a slightly more complex syntax than mice for achieving the same results. However, it is a complete re-implementation of the mice package from scratch in C++, which comes with two major advantages:

  1. it has very few package dependencies;
  2. it is considerably faster.

It is therefore recommended for those who do not like many dependencies to be installed on their computer or for those who need to perform multiple imputation on big data sets.

Note finally that there are many other packages for handling missing values.



astamm/teachr documentation built on Jan. 12, 2023, 7:21 a.m.