```r knitr::opts_chunk$set(comment=NA, fig.width=6, fig.height=6, echo = TRUE, eval = TRUE)
### Slides Here a link to the lecture slides for this session: LINK ### Overview In this practical you'll learn how to read and save data. By the end of this practical you will know how to: 1. Create a project 2. Identify the location of a file 3. Read in data of various types 4. Use R's file connections 5. Scrape the internet (if you get there) ### Functions Here are the main read-in functions: | Function| Description| |:------|:--------| | `read_csv()`| Read flat csv file| | `read_sas()`| Read SAS file| | `read_sac()`| Read SPSS file| | `readRDS()`| Read RDS file | | `file(...,'r'), readLines`| Read from file conection | Here are the main export functions: | Function| Description| |:------|:--------| | `write_csv()`| Write flat csv file| | `write_sas()`| Write SAS file| | `write_sac()`| Write SPSS file| | `saveRDS()`| Save RDS file | | `file(...,'w'), readLines`| Write to file conection | ## Tasks The tutorial begins with the titanic data set, which contains records of 1313 passengers on their name (`Name`), their passenger class (`PClass`), their age (`Age`), their sex (`Sex`), whether they survived (`Survived`), and finally a numeric coding of their sex (`SexCode`). Later you will be working with a randomly generated artificial data set of mine, which will help demonstrate differences in speed and file size. Please download the following data files and save them to an appropriate location on the disc: <a href="{{site.url}}/_materials/data_sets/titanic.csv">titanic.csv</a><br> <a href="{{site.url}}/_materials/data_sets/titanic.sav">titanic.sav</a><br> <a href="{{site.url}}/_materials/data_sets/titanic.sas7bdat">titanic.sas7bdat</a><br> <a href="{{site.url}}/_materials/data_sets/my_data.csv">my_data.csv</a> If you can, please verify whether the `.sav` and `sas7bdat` are appropriate SPSS and SAS files, respectively. ### Read to `tibble` 1. Create a new **project** either in the location of the downloaded data files or its parent folder (one folder higher). To do this go to *File/New Project...*, select existing directory, and enter the desired folder. This step will ease reading and saving data, as it sets the working directory to the proximity of the data files. Confirm this using `getwd()`. 2. Identify the filepath (incl. filename) for each of the three data files. RStudio makes it really easy. Simply write quotation marks in your script editor, e.g., `""`, move the (text) curser between the marks, and press the tab key. This opens a file menu that allows you to conveniently select the path to the file. The result is a character string that exactly defines the location of the file relative to the current working directory. Oftentimes, it is convenient to do this directly within the read function, e.g., `read_csv("")`. 3. Load the packages `readr` and `haven` using `library()`, and read in the three titanic data files usig `read_csv()`, `read_sas()`, and `read_sav()`. Call them `df_csv`, `df_sas`, and `df_sav`, respectively. Inspect each of the imported objects using `print()` and `str()`. What kind of object are they? ```r df_csv <- read_csv('data_files/titanic.csv') df_sas <- read_sas('data_files/titanic.sas7bdat') df_sav <- read_sav('data_files/titanic.sav') df_csv ; df_sas ; df_sav str(df_csv) ; str(df_sas) ; str(df_sav)
==
introduced in the last tutorial and the arithmetic mean function mean()
. Remember, that logical values can be coerced to 0 and 1. Note, that the data sets contain NA's, that is missing values. One way to see this is via the summary function summary()
. Specifically, the variable Age
conains missing values. Use the help file ?mean
to find out how to compute the mean while ignoring missing values and then use ==
to verify the equality between the three data sets. mean(df_csv == df_sas, na.rm=T) == 1 mean(df_csv == df_sav, na.rm=T) == 1 mean(df_sas == df_sav, na.rm=T) == 1
df_csv
using a negative index. Negtive indices mean omit rather than select. E.g., c(1, 2, 3)[-2] returns the vector without the second value. Then write the reduced data frame to disk using write_csv()
. Use the exact same file path. Observe whether R gives you a warning before (over-)writing the data?df_csv = df_csv[,-1] write_csv(df_csv,"data_files/titanic.csv")
my_data.csv
and read it in using read_csv()
. Now write the data back to disk using write_sas
, write_sav
, and also saveRDS
. Each of these functions take as the first argument the data frame and as the second argumen the file path. To obtain the latter adapt the filepath of my_data.csv
by hand to make sure that the file path has the correct file ending, i.e., .sas7bdat
, .sav
, and .RDS
, respectively. While you write the individual data sets pay attention to the speed of execution. my_data = read_csv("data_files/my_data.csv") write_sas(my_data,"data_files/my_data.sas7bdat") write_sav(my_data,"data_files/my_data.sav") saveRDS(my_data,"data_files/my_data.RDS")
saveRDS
function was the slowest. This is because .RDS
generally leads to the highest degree of compression, i.e., the smalles data files. To verify this use the file.info()
on each of the four files, my_data.csv
, my_data.sas7bdat
, my_data.sav
, and my_data.RDS
(using the correct path). Look out for the size
element, which gives the size in bytes
.file.info("data_files/my_data.csv") file.info("data_files/my_data.sas7bdat") file.info("data_files/my_data.sav") file.info("data_files/my_data.RDS")
file()
. The main arguments to file
are a location (a file path, url, etc.) and a mode indicator, e.g., r
for Open for reading in text mode
. Try opening a connection to the titanic.csv
dataset. To do this, you must assign the output of file()
to an object, which then is the connection, e.g., con = file("my_path","r")
. con = file('data_files/titanic.csv','r')
readLines
. readLines
iterates through the file line by line and returns each line as a character string. Store the output in an object. Try reading the file. When done close the connection using close(my_con)
and inspect the data. lin = readLines(con) close(con)
read_csv()
. If you're up for the challenge you can try to nonetheless parse this data. A good way is to split each line using the stri_split_fixed
function using the comma ,
(from the stringi
package). This will return a list of vectors, where each vector is one row of the data frame. Next step is to bind the rows to a matrix using do.call(rbind,my_list)
. What's then left is transform the matrix to a data frame and to take care of column types and names. lin <- readLines(con) close(con) require(stringi) spl <- stri_split_fixed(lin,',') mat <- do.call(rbind,spl) df <- as.tibble(mat) # take care of types
url()
we can establish a connection to a webpage, which we then can use to retrieve information. This is used, for instance, by the very convenient rvest
package. To illustrate how easy it can be, consider the following code that downloads and parses Milestones table of R's Wikipedia page.# load package install.packages('rvest') library(rvest) library(magrittr) # get html url = 'https://en.wikipedia.org/wiki/R_(programming_language)' page = read_html(url) # get table # use XPath from inspect page (e.g., in Chrome) table = page %>% html_node(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>% html_table() # create tibble as.tibble(table)
For more details on all steps of data analysis check out Hadley Wickham's R for Data Science.
For more advanced content on objects check out Hadley Wickham's Advanced R.
For more on pirates and data analysis check out the respective chapters in YaRrr! The Pirate's Guide to R YaRrr! Chapter Link
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.