```r
library(learnr)
library(testwhat)
library(magrittr)

tutorial_options(
  exercise.timelimit = 60,
  exercise.checker = testwhat::testwhat_learnr
)
knitr::opts_chunk$set(comment = NA)
```
Many parts of this tutorial are built from tutorials published on GitHub by RStudio and its Education team, mainly from their 2-day internal R bootcamp and from the RStudio Cloud primers.
- `readr::read_csv()` - comma delimited files;
- `readr::read_csv2()` - semicolon separated files (common in countries where `,` is used as the decimal place);
- `readr::read_tsv()` - tab delimited files;
- `readr::read_delim()` - reads in files with any delimiter;
- `readxl::read_excel()` - reads `.xls` and `.xlsx` files.

```r
nobel <- readr::read_csv(file = "www/nobel.csv")
```
```r
skimr::skim(nobel)
```
The skimr package provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a skim_df object which can be included in a pipeline or displayed nicely for the human reader.
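Because `skimr::skim()` returns a `skim_df`, its output can itself be manipulated in a pipeline. A minimal sketch (the `skim_type` column name comes from skimr's documented output; dplyr is assumed to be installed):

```r
library(magrittr)

# Summarise the nobel data, then keep only the summaries
# of the character columns
nobel %>%
  skimr::skim() %>%
  dplyr::filter(skim_type == "character")
```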
```r
df <- tibble::tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
)

list.files()
readr::write_csv(df, file = "df.csv")
list.files()

# For Unix systems:
# writeLines(system("head -n 3 df.csv", intern = TRUE))
# For Windows:
# writeLines(system("gc df.csv | select -first 3", intern = TRUE))
```
```r
if (file.exists("df.csv")) file.remove("df.csv")
```
For this exercise:

1. Read in `www/nobel.csv`;
2. Create a new data frame, `nobel_stem`, that filters for the STEM fields (Physics, Medicine, Chemistry, and Economics);
3. Create another data frame, `nobel_nonstem`, that filters for the remaining fields;
4. Write out the two data frames to `nobel-stem.csv` and `nobel-nonstem.csv`, respectively.

```r
dir.create("www")
download.file(
  url = "https://github.com/astamm/teachr/raw/master/inst/tutorials/03_DataImport/www/nobel.csv",
  destfile = "www/nobel.csv",
  mode = "wb"
)
```
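A possible solution sketch for this exercise, assuming dplyr is available and that the prize field is stored in a column named `category` (check the actual column names with `names(nobel)` first):

```r
library(magrittr)

nobel <- readr::read_csv(file = "www/nobel.csv")

# `category` is an assumed column name; adapt it to the actual data
stem_fields <- c("Physics", "Medicine", "Chemistry", "Economics")

nobel_stem <- nobel %>%
  dplyr::filter(category %in% stem_fields)
nobel_nonstem <- nobel %>%
  dplyr::filter(!(category %in% stem_fields))

readr::write_csv(nobel_stem, "nobel-stem.csv")
readr::write_csv(nobel_nonstem, "nobel-nonstem.csv")
```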
```r
edi_airbnb <- readr::read_csv(file = "www/edi-airbnb.csv")
names(edi_airbnb)
```
... but R doesn't allow spaces in variable names:
```r
edi_airbnb$Number of bathrooms
```

You can, however, refer to such a variable by wrapping its name in backticks:

```r
edi_airbnb$`Number of bathrooms`
```
```r
edi_airbnb_col_names <- readr::read_csv(
  file = "www/edi-airbnb.csv",
  col_names = c(
    "id", "price", "neighbourhood", "accommodates",
    "bathroom", "bedroom", "bed", "review_scores_rating",
    "n_reviews", "url"
  )
)
names(edi_airbnb_col_names)
```
Another option is to convert the variable names to `snake_case` with the janitor package:

```r
edi_airbnb_cleaned_names <- edi_airbnb %>%
  janitor::clean_names()
names(edi_airbnb_cleaned_names)
```
The janitor package has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff. Its main functions can format data frame column names, provide quick counts of variable combinations, and isolate duplicate records.
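Beyond `clean_names()`, two other janitor functions worth knowing are `tabyl()` (frequency tables) and `get_dupes()` (duplicate records). A quick sketch on toy data:

```r
df <- tibble::tibble(
  name = c("Ann", "Bob", "Ann"),
  ses  = c("Low", "High", "Low")
)

# Frequency table of the `ses` variable
janitor::tabyl(df, ses)

# Rows that share the same value of `name`
janitor::get_dupes(df, name)
```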
Suppose you have some data stored in a CSV file that looks like this:
```r
knitr::include_graphics("images/df-na.png")
```
Let us import it in our R session:
```r
readr::read_csv("www/df-na.csv")
```
What is the type of each variable? Is it what you expected?
A first solution is to use the argument `na` to explicitly list all values in the file that should be considered as `NA`:

```r
readr::read_csv(
  file = "www/df-na.csv",
  na = c("", "NA", ".", "9999", "Not applicable")
)
```
A second solution is to explicitly specify the type of each column via the `col_types` argument:

```r
readr::read_csv(
  file = "www/df-na.csv",
  col_types = list(
    readr::col_double(),
    readr::col_character(),
    readr::col_character()
  )
)
```
type function | data type
------------------ | -------------
`col_character()` | character
`col_date()` | date
`col_datetime()` | POSIXct (date-time)
`col_double()` | double (numeric)
`col_factor()` | factor
`col_guess()` | let readr guess (default)
`col_integer()` | integer
`col_logical()` | logical
`col_number()` | numbers mixed with non-number characters
`col_numeric()` | double or integer
`col_skip()` | do not read
`col_time()` | time
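These column types can also be given compactly as a one-letter-per-column string (e.g. `d` for double, `c` for character, `i` for integer, `_` to skip), which is often the quickest way to write a specification:

```r
# Compact string specification:
# first column double, next two columns character
readr::read_csv(
  file = "www/df-na.csv",
  col_types = "dcc"
)
```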
For this exercise:

1. Read in `www/favourite-food.xlsx`;
2. Clean up the `NA`s and make sure you're happy with variable types;
3. Convert the socio-economic status variable to a factor with levels `Low`, `Middle`, `High`;
4. Write out the resulting data frame to `favourite-food.csv`;
5. Read `favourite-food.csv` back in and observe the variable types. Are they as you left them?

```r
dir.create("www")
download.file(
  url = "https://github.com/astamm/teachr/raw/master/inst/tutorials/03_DataImport/www/favourite-food.xlsx",
  destfile = "www/favourite-food.xlsx",
  mode = "wb"
)
```
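A possible solution sketch, assuming the sheet stores socio-economic status in a column named `ses` and uses markers such as `99999` for missing values (inspect the file first and adapt accordingly):

```r
# Read the Excel file, declaring the assumed NA markers
fav_food <- readxl::read_excel(
  path = "www/favourite-food.xlsx",
  na = c("", "N/A", "99999")
)

# `ses` is an assumed column name; adapt it to the actual data
fav_food$ses <- factor(fav_food$ses, levels = c("Low", "Middle", "High"))

readr::write_csv(fav_food, "favourite-food.csv")

# CSV does not store factor levels: `ses` comes back as character
readr::read_csv("favourite-food.csv")
```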
You can read and write R's native RDS files with `read_rds()` and `write_rds()`, respectively:

```r
readr::read_rds(file)
readr::write_rds(x, file)
```
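Unlike CSV, the RDS format preserves R-specific attributes such as factor levels, so a round trip returns exactly the object you saved:

```r
df <- tibble::tibble(
  ses = factor(c("Low", "High"), levels = c("Low", "Middle", "High"))
)

readr::write_rds(df, "df.rds")
readr::read_rds("df.rds")$ses  # still a factor with its three levels

file.remove("df.rds")
```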
For this exercise:

1. Write out the data frame from the previous exercise to `favourite-food.rds`;
2. Read `favourite-food.rds` back in and observe the variable types. Are they as you left them?

```r
dir.create("www")
download.file(
  url = "https://github.com/astamm/teachr/raw/master/inst/tutorials/03_DataImport/www/favourite-food.xlsx",
  destfile = "www/favourite-food.xlsx",
  mode = "wb"
)
```
The `sales` data set (located at `www/sales.xlsx`) looks like this:

```r
readxl::read_excel("www/sales.xlsx")
```
Read it using appropriate arguments of the `readxl::read_excel()` function such that it looks like the following:

```r
sales <- readxl::read_excel(
  path = "www/sales.xlsx",
  skip = 3,
  col_names = c("id", "n")
)
sales
```
```r
dir.create("www")
download.file(
  url = "https://github.com/astamm/teachr/raw/master/inst/tutorials/03_DataImport/www/sales.xlsx",
  destfile = "www/sales.xlsx",
  mode = "wb"
)
```
As stated in the description of the naniar package, missing values are ubiquitous in data and need to be explored and handled in the initial stages of analysis.
The naniar package provides data structures and functions that facilitate the plotting of missing values and examination of imputations. This allows missing data dependencies to be explored with minimal deviation from the common work patterns of 'ggplot2' and tidy data.
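For instance, naniar can summarise missingness per variable or visualise it across a whole data frame; a quick look at the built-in `airquality` data, which contains missing values:

```r
library(naniar)

# Number and proportion of missing values in each variable
miss_var_summary(airquality)

# Heatmap-style overview of missing vs present cells
vis_miss(airquality)
```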
You can always decide to discard any observation that contains a missing value, but this strategy is often suboptimal: you might drastically reduce your sample size and end up with weak statistical conclusions. A better strategy is imputation.
Imputation means replacing missing values by actual values using the available non-missing data. For imputation, we recommend the packages Amelia, mice and miceFast. In particular, the `Amelia::amelia()` and `mice::mice()` functions provide a very easy way to perform multiple imputation on a data set.
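A minimal sketch of multiple imputation with `mice::mice()` on the built-in `airquality` data, whose first two columns contain missing values:

```r
library(mice)

# Run multiple imputation (m = 5 imputed data sets), silencing the log
imp <- mice(airquality, m = 5, seed = 123, printFlag = FALSE)

# Extract the first completed data set; once imputation succeeds,
# it should contain no remaining NA
completed <- complete(imp, 1)
anyNA(completed)
```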
The miceFast package has a slightly more complex syntax to achieve the same results as the mice package. However, it is a complete re-implementation of mice from scratch in C++, which comes with two big advantages: it requires far fewer dependencies and it runs much faster on large data sets.
It is therefore recommended for those who do not like many dependencies to be installed on their computer or for those who need to perform multiple imputation on big data sets.
Note finally that there are many other packages for handling missing values.