knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
In this vignette I will provide an overview of some of the more common strategies you will use to manipulate and organize your data for subsequent analysis. We will be working with two packages that are part of the tidyverse. The first, tidyr, provides a number of functions for reorganizing variables between long and wide format, as well as for separating out new variables based on the values of other variables. The second, dplyr, is used for manipulating data (selecting, filtering, sorting, etc.) and for transforming values, either through recoding or some other operation.
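To make the long versus wide distinction concrete, here is a minimal sketch on a made-up table. It assumes tidyr 1.0.0 or later, where pivot_longer() and pivot_wider() are available; the toy data and the column names `year` and `utterances` are hypothetical and are not part of sdac.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table: one row per speaker, one column per year
wide <- tibble(
  speaker_id = c(1, 2),
  `2019` = c(10, 7),
  `2020` = c(12, 9)
)

# Wide to long: the year columns collapse into a single `year` variable
long <- wide %>%
  pivot_longer(cols = -speaker_id, names_to = "year", values_to = "utterances")

# Long to wide: spread `year` back out into one column per year
wide_again <- long %>%
  pivot_wider(names_from = year, values_from = utterances)
```

The long table has one row per speaker and year, which is the shape most dplyr verbs expect.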
Let's take a look at a dataset included in the analyzr package. First, install and load the package and the main tidyverse tools.
devtools::install_github("WFU-TLC/analyzr")
library(tidyverse)
library(analyzr)
Let's take a look at the sdac dataset.
glimpse(sdac)
This dataset is in tidy format. Take a look at the R documentation for this dataset with ?sdac.
There are a few tidyverse verbs that are very commonly used to manipulate data frames.
select() allows you to select a subset of columns
sdac %>%
  select(speaker_id, damsl_tag, birth_year, utterance_text) %>%
  head()
arrange() sorts a data frame by one or more columns
sdac %>%
  select(speaker_id, damsl_tag, birth_year, utterance_text) %>%
  arrange(birth_year) %>%
  head()
filter() allows you to select rows where the values match certain parameters
sdac %>%
  select(speaker_id, damsl_tag, birth_year, utterance_text) %>%
  arrange(birth_year) %>%
  filter(birth_year == 1971) %>%
  head()
filter() can be combined with numerous operators and vector functions.
sdac %>%
  select(speaker_id, damsl_tag, birth_year, utterance_text) %>%
  arrange(birth_year) %>%
  filter(between(birth_year, 1950, 1969)) %>%
  head()
sdac %>%
  select(speaker_id, damsl_tag, birth_year, utterance_text) %>%
  arrange(birth_year) %>%
  filter(birth_year > 1955) %>%
  head()
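Conditions can also be combined inside a single filter() call. The sketch below uses a small made-up tibble rather than sdac itself, so it runs without the analyzr package; the values are hypothetical stand-ins for a few sdac columns.

```r
library(dplyr)

# Hypothetical stand-in for a few sdac columns
toy <- tibble(
  speaker_id = c(1001, 1002, 1003, 1004),
  damsl_tag = c("sd", "b", "sv", "sd"),
  birth_year = c(1948, 1955, 1962, 1971)
)

# Conditions separated by commas are combined with AND
recent_sd <- toy %>%
  filter(birth_year >= 1950, damsl_tag == "sd")

# %in% matches any of several values; | combines conditions with OR
mixed <- toy %>%
  filter(damsl_tag %in% c("b", "sv") | birth_year < 1950)
```

Here recent_sd keeps only speaker 1004, while mixed keeps speakers 1001, 1002, and 1003.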
You often want to explore your data by summarizing. A basic summary is count().
sdac %>% count()
You can also add column names to count() to group your count summary.
sdac %>% count(birth_year, sort = TRUE)
You can also use the group_by() function to explicitly group your data for multiple operations.
sdac %>% group_by(birth_year) %>% count()
Using group_by() we can sample data as well.
sdac %>%
  group_by(birth_year) %>%
  sample_n(2) %>%
  select(speaker_id, birth_year, utterance_text) %>%
  arrange(birth_year) %>%
  head()
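In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), which does the same grouped sampling. A minimal sketch, assuming that dplyr version and using a made-up tibble in place of sdac:

```r
library(dplyr)

set.seed(42)  # make the random sample reproducible

# Hypothetical data: six speakers across two birth years
toy <- tibble(
  speaker_id = 1:6,
  birth_year = rep(c(1955, 1962), each = 3)
)

# Draw two rows from each birth-year group, then drop the grouping
sampled <- toy %>%
  group_by(birth_year) %>%
  slice_sample(n = 2) %>%
  ungroup()
```

Calling ungroup() at the end keeps later verbs from silently operating per group.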
summarize
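summarize() collapses each group to a single row of summary values. The following is a small sketch on made-up data; the `n_words` column is hypothetical and is not assumed to exist in sdac.

```r
library(dplyr)

# Hypothetical utterance-level data
toy <- tibble(
  speaker_id = c(1001, 1001, 1002, 1003, 1003, 1003),
  birth_year = c(1955, 1955, 1955, 1962, 1962, 1962),
  n_words = c(4, 9, 2, 7, 3, 5)
)

# One row per birth year with three summaries
by_year <- toy %>%
  group_by(birth_year) %>%
  summarize(
    utterances = n(),                    # rows in the group
    speakers = n_distinct(speaker_id),   # unique speakers in the group
    mean_words = mean(n_words)           # average utterance length
  )
```

Unlike count(), summarize() lets you compute several different summaries per group in one step.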
Vector functions
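Vector functions take a column in and return a column of the same length, which makes them a natural fit for mutate(). As a rough illustration of recoding, here is a sketch with made-up data; the `decade` and `cohort` columns and their labels are hypothetical.

```r
library(dplyr)

# Hypothetical speaker-level data
toy <- tibble(
  speaker_id = c(1001, 1002, 1003),
  birth_year = c(1948, 1955, 1971)
)

recoded <- toy %>%
  mutate(
    # Integer division rounds each year down to its decade
    decade = (birth_year %/% 10) * 10,
    # case_when() evaluates the conditions in order and uses the first match
    cohort = case_when(
      birth_year < 1950 ~ "pre-1950",
      birth_year < 1970 ~ "1950-1969",
      TRUE ~ "1970 or later"
    )
  )
```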
knitr::include_graphics(path = "http://www.sthda.com/sthda/RDoc/images/tidyr.png")
separate / unite
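separate() splits one column into several, and unite() pastes several columns back into one. A minimal sketch, assuming an identifier column whose parts are joined by an underscore; the `utterance_id` values and the `doc_id` and `turn` names are made up for illustration.

```r
library(tidyr)
library(dplyr)

# Hypothetical identifiers of the form "<document>_<turn>"
toy <- tibble(utterance_id = c("4325_1", "4325_2", "2151_1"))

# Split on the underscore into two character columns
separated <- toy %>%
  separate(utterance_id, into = c("doc_id", "turn"), sep = "_")

# Paste the two columns back together with the same separator
reunited <- separated %>%
  unite("utterance_id", doc_id, turn, sep = "_")
```

By default separate() leaves the new columns as character vectors; passing convert = TRUE asks it to guess more specific types.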
Two table verbs
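Two-table verbs combine data frames that share a key column. The sketch below uses two made-up tables keyed on speaker_id; neither table is part of analyzr.

```r
library(dplyr)

# Hypothetical utterance table and speaker metadata table
utterances <- tibble(
  speaker_id = c(1001, 1002, 1002, 1005),
  utterance_text = c("Okay.", "Uh-huh.", "I think so.", "Right.")
)
speakers <- tibble(
  speaker_id = c(1001, 1002, 1003),
  birth_year = c(1955, 1962, 1971)
)

# Keep every utterance; birth_year is NA where no speaker row matches
joined <- left_join(utterances, speakers, by = "speaker_id")

# Keep only utterances that have a matching speaker row
matched <- inner_join(utterances, speakers, by = "speaker_id")

# Keep only utterances with no matching speaker row
unmatched <- anti_join(utterances, speakers, by = "speaker_id")
```

Running anti_join() before a real join is a quick way to check for keys that fail to match.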