Data Transformation

library(tidyverse)

Tibbles

Tibbles are data frames, but they tweak some older behaviors to make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It’s difficult to change base R without breaking existing code, so most innovation occurs in packages.

``{block2, type='rmdimportant'} See Chapter 7 in [R for Data Science](http://r4ds.had.co.nz/tibbles.html) andvignette("tibble")` for a more complete description of tibbles.

Key advantages of tibbles:

-    Never changes the types of the inputs (e.g. it never converts strings to factors!).
-    Never changes the names of variables.
-    Never creates row names.
-    Refined printing to the console:
     -   only prints the first 10 rows
     -   shows the data type of each column
     -   highlights missing values
     -   aligns numeric data
-   Enhanced subsetting

When creating a tibble:

-   it will **ONLY** recycle inputs of length 1
-   you can refer to variables you just created

```r
mytibble <- tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)

# versus

mydf <- data.frame(
  x = 1:5,
  y = 1
)

mydf$z <- mydf$x^2 + mydf$y

mytibble
mydf

For the most part you can use data.frames and tibbles interchangeably. However, some older functions in base R do not work properly with tibbles. This is because of the [ subset operator. As we saw in subsetting, if you return a single column with [ on a data.frame you will get a vector back instead of a single column data.frame. Tibbles always return tibbles and never a vector.

You can convert a data frame to a tibble with as_tibble(), while you can coerce a tibble to a data frame with as.data.fame().

``{block2, type='rmdimportant'} dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator,<-`

## Filter rows with `filter()`

`filter()` allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. 

## Arrange rows with `arrange()`

`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

Notes:

-    Use `desc()` to re-order by a column in descending order:
-    Missing values are always sorted at the end.


## Select columns with `select()`

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. `select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

There are a number of helper functions you can use within select():

-    everything(): all variables or everything else not already selected / deselected.
-    starts_with(): start with a prefix.
-    ends_with(): ends with a prefix
-    contains(): contains a literal string
-    matches(): match a regular expression. 
-    num_range(): a numerical range like x1, x2, x3.
-    one_of(): variable in a character vector

```{block type='rmdnote'}
Closely related to `select()` is `rename()` for renaming columns.

Add new variables with mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. mutate() always adds new columns at the end of your dataset so we’ll start by creating a narrower dataset so we can see the new variables. Remember that when you’re in RStudio, the easiest way to see all the columns is View().

Because we have are using tibbles you can refer to columns you just created.

``{block type='rmdtip'} If you only want to keep the new variables, usetransmute()`.

There are a number of helper functions you can use within mutate():

-    na_if(): convert values to NA
-    replace_na(): convert NA to a value
-    if_else(): vectorised if
-    recode(): recode values
-    case_when(): A general vectorised if. Equivalent of the SQL CASE WHEN statement. 


## Pipes

```{block2, type='rmdimportant'}
See Chapter 14 in [R for Data Science](http://r4ds.had.co.nz/pipes.html).

Typically it takes a series of operations to go from raw data to meaningful analysis. There are (at least) four ways to to do this:

Each have the place and utility. None are perfect for every situation. We’ll use piping frequently from now on because it considerably improves the readability of code.

Grouped summaries with summarise()

The last key verb is summarise(). It collapses a data frame to a single row. summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.

Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries.

{block2, type='rmdimportant'} If you need to remove grouping, and return to operations on ungrouped data, use ungroup()

Grouped mutates

Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter()

Reading For Next Class

  1. Read Tidy Data paper published in the Journal of Statistical Software, http://www.jstatsoft.org/v59/i10/paper.

Exercises

  1. Read and Try the exercises in Data transformation. There are quite a few useful functions and use cases we did not cover. It also does a very good job of how does one go from a question we want answered to quickly seeing the answer in a useful form.

  2. We barely scratched the surface of some of the useful dplyr functions for data transformations. Review the dplyr-transformation cheatsheet and see the help file for interesting functions.



DavisBrian/rclassnotes documentation built on May 17, 2019, 8:19 a.m.