In this Recipe we will build on our R coding skills to implement descriptive assessment of datasets.
First let's load some packages that will help us manipulate and create summaries of data.
library(tidyverse) # to manipulate and visualize datasets
library(janitor) # to create (cross-)tabulations
library(skimr) # to provide statistical summaries
We will be using a dataset drawn from the Barcelona English Language Corpus [@Munoz2007; @Munoz2006]. The corpus includes a series of tasks that aim to provide data on the effects of age on the acquisition of English as a foreign language (EFL). The task selected here is the "Written composition" which comprises written essays by students from Barcelona ranging from 10 to 17 years of age.
library(TBDBr) # API access to the Talkbank repository

belc_tokens <- getTokens(corpusName = "slabank", corpora = c("slabank", "English", "BELC", "compositions"))
belc_participants <- getParticipants(corpusName = "slabank", corpora = c("slabank", "English", "BELC", "compositions"))

belc_compositions_tokens <-
  left_join(belc_participants, belc_tokens) %>% # join participants and tokens information
  unnest(everything()) %>% # unnest lists from variables
  filter(str_detect(path, "/compositions/")) %>% # remove compositions-2014 data
  separate(col = filename, into = c("time_group", "participant_id"), sep = "c") %>% # isolate participant_id
  mutate(time = str_extract(time_group, "^\\d")) %>% # isolate the testing time
  mutate(group = str_extract(time_group, "[AB]")) %>% # isolate the group
  filter(group == "A") %>% # only keep participants from group A
  filter(!is.na(pos)) %>% # remove lines with no pos
  filter(pos != "L2") %>% # remove lines with pos tagged as L2
  filter(!is.na(stem)) %>% # remove lines with no stem
  select(participant_id, age_group = time, sex, word, num_utterances = numutts, stem, pos) # select only relevant variables

belc <-
  belc_compositions_tokens %>%
  group_by(participant_id, age_group) %>% # grouping by participant composition
  mutate(num_tokens = n(), # number of words
         num_types = n_distinct(word), # number of distinct words
         ttr = round(num_types / num_tokens, 3)) %>% # ratio of distinct words to words
  ungroup() %>% # clear grouping
  select(participant_id, age_group, sex, num_tokens, num_types, ttr) %>% # select relevant variables
  distinct() %>% # simplify the dataset to only the unique values for each row
  mutate(age_group = case_when( # recode group numbers to text
    age_group == "1" ~ "10-year-olds",
    age_group == "2" ~ "12-year-olds",
    age_group == "3" ~ "16-year-olds",
    age_group == "4" ~ "17-year-olds"
  )) %>%
  arrange(age_group, sex) # sort the dataset

write_rds(belc, file = "recipe_4/data/rds/belc.rds")
Let's read the curated dataset^[To inspect the process by which this dataset was created, you can look at the source file for this recipe.].
# read the `belc` dataset from an .rds file
belc <- read_rds(file = "data/rds/belc.rds")
belc <- read_rds(file = "recipe_4/data/rds/belc.rds")
In R there is a set of fundamental vector types, each suited to a distinct kind of informational value. These vector types are associated with either categorical (character and logical) or continuous (integer and double) variables. A couple of things to note. First, character and logical vectors can be re-typed as a more complex vector type called a factor. Factors allow us to encode order in character vectors (say, for ordinal variables) and assign underlying integer codes to the values so that functions can handle them as categories rather than as free text. We will see how factors work later. Second, both integers and doubles are also called numeric vectors. The difference between the two is that doubles allow for decimal places, whereas integers are whole numbers.
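As a minimal sketch of these vector types, the following uses small made-up vectors (the names `words`, `is_learner`, `counts`, and `ratios` are illustrative and not part of the dataset) and base R's `typeof()` to inspect them.

```r
# Illustrative vectors; the names and values are made up for this sketch
words      <- c("cat", "dog", "cat")   # character
is_learner <- c(TRUE, FALSE, TRUE)     # logical
counts     <- c(12L, 7L, 30L)          # integer (note the L suffix)
ratios     <- c(0.5, 0.25, 0.75)       # double

typeof(words)      # "character"
typeof(is_learner) # "logical"
typeof(counts)     # "integer"
typeof(ratios)     # "double"

# Integers and doubles are both numeric
is.numeric(counts) # TRUE
is.numeric(ratios) # TRUE

# A factor stores each distinct value as a level with an integer code
words_fct <- factor(words)
levels(words_fct)     # "cat" "dog"
as.integer(words_fct) # 1 2 1
```

The integer codes are an internal bookkeeping device; they identify the level each value belongs to rather than a quantity to be measured.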
The `belc` dataset reflects a summary of each of the essays by each of the students and includes the number of word tokens (`num_tokens`), the number of word types (`num_types`), and the ratio of types to tokens (`ttr`). The structure of the dataset (`belc`) is seen below.
glimpse(belc) # overview of the dataset
As we see from the overview of the `belc` dataset, we have character, integer, and double vector types in the data frame. Therefore this dataset contains three categorical (`participant_id`, `age_group`, and `sex`) and three continuous variables (`num_tokens`, `num_types`, and `ttr`).
Now let's look at approaching descriptive summaries using the `belc` dataset. First we will look at single vector summaries and then we will look at multiple vector summaries. The type of summary that we apply will depend on the type of values that a variable contains, that is, whether the values are categorical or continuous.
To prepare to work with the categorical data, let's re-type the categorical variables (`participant_id`, `age_group`, and `sex`) as factors. To apply the `as.factor()` function to all the character vectors I make use of the `mutate_if()` function, which allows me to target only the vectors in the dataset that are of type character. Since the `age_group` variable is ordinal we will explicitly encode that order. For this the `mutate()` function is called, targeting only the `age_group` vector.
belc <-
  belc %>%
  mutate_if(is.character, as.factor) %>% # create factors from character variables
  mutate(age_group = factor(age_group, ordered = TRUE)) # create ordered variable for age_group
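When `factor()` is called with `ordered = TRUE` and no `levels` argument on a character vector, the levels are sorted alphabetically, which happens to give the intended order for age labels such as "10-year-olds". As a hedged sketch with a made-up vector (`ages` is hypothetical, not the `belc` data), the order can also be stated explicitly:

```r
# Made-up age labels in arbitrary order (hypothetical data)
ages <- c("16-year-olds", "10-year-olds", "17-year-olds", "12-year-olds")

# State the order of the levels explicitly rather than relying on sorting
ages_ord <- factor(
  ages,
  levels = c("10-year-olds", "12-year-olds", "16-year-olds", "17-year-olds"),
  ordered = TRUE
)

levels(ages_ord)          # levels in the order given above
is.ordered(ages_ord)      # TRUE
ages_ord[1] > ages_ord[2] # TRUE: "16-year-olds" ranks above "10-year-olds"
```

Spelling out the levels is mostly a safeguard for labels whose alphabetical order would not match their intended order.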
Now let's take a look at the dataset.
glimpse(belc) # dataset overview
Now we have factors instead of character vectors for our categorical variables.
Categorical
To get a descriptive summary of the categorical variables I will use the skimr package [@R-skimr], calling the `skim()` function and then piping the results to `yank()` to target only the factor variables.
belc %>% skim() %>% yank("factor")
From this output we see a host of descriptive information about our categorical variables. What is important to note is that categorical variables are summarized by counts. The most common value for a categorical variable is called the mode. If we want to look at a specific variable, say `age_group`, we can use the `tabyl()` function from the janitor package [@R-janitor].
belc %>% tabyl(age_group) # create a tabulation of the `age_group` variable
We see that the `tabyl()` function provides the counts but also the proportions of each of the values of the `age_group` variable. Note that when working with factors the values are often called 'levels'. We can see that two levels of this variable share the highest count. This characterizes what is called a bimodal distribution, as there are two most frequent levels.
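To make the idea of counts, proportions, and modes concrete, here is a small base R sketch on made-up values (the `group` vector is hypothetical). It uses `table()` and `prop.table()` rather than `tabyl()`, so treat it as a cross-check of the concept and not as the janitor approach.

```r
# Made-up categorical values (hypothetical, not the BELC data)
group <- factor(c("a", "b", "b", "c", "c", "d"))

counts <- table(group)       # counts per level
props  <- prop.table(counts) # proportions per level

# The mode(s): every level whose count equals the maximum count
modes <- names(counts)[as.vector(counts) == max(counts)]
modes # "b" "c": two modes, so this toy variable is bimodal
```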
Tabular summaries are often the most effective way to assess categorical variables, but let's set the stage for working with plotting in R.
Among other packages, the tidyverse package loads the ggplot2 package [@R-ggplot2], which is a powerful package for creating plots in R. The 'gg' in ggplot2 refers to the use of "The Grammar of Graphics" approach to building plots. There are three basic elements for all ggplot2 plots: (1) data, (2) mappings, and (3) geometries. The data is, of course, the dataset that we want to use. The mappings are used to select the variables for the plot and to specify how those variables are mapped to the plotting space. Finally, the geometries specify how the mapped values are drawn, for example as bars, points, or lines.^[For reference visit the ggplot2 website]
Let's create a simple plot for the `age_group` variable. First we pass the dataset to `ggplot()`. The `ggplot()` function then requires that we provide the aesthetic mappings with `aes()`. In this case we have one variable, so we will want this variable to appear on the x axis. The `geom_bar()` function by default will then count the levels of our `age_group` factor variable and plot the counts on the y axis.
belc %>% # dataset
  ggplot(aes(x = age_group)) + # map age_group to x
  geom_bar() # create a barplot
There we go. Not a particularly informative plot, given we are only looking at a single categorical variable, but we will build on this basic formula to create more informative graphics.
::: {.tip}
Note that the equivalent of the `%>%` for building ggplot2 plots is the `+` operator. This can be confusing, but it is important to recognize this distinction as it is easily overlooked and can cause unexpected errors.
:::
Continuous
Now let's turn to continuous variables. Where tabular summaries of categorical variables make sense, this is not the case for continuous variables, as by definition a continuous variable is not count-based; rather, its values range along a continuum. Let's look at what the basic descriptives are for our continuous variables with `skim()` (this time selecting only the numeric variables).
belc %>% skim() %>% yank("numeric")
Here we see that the type of summary information is not count-based; rather, we have a new set of descriptives. The mean and sd (standard deviation) are easy to identify and straightforward. The summaries prefixed with `p` represent the percentiles. In this case we have five percentile points (0, 25, 50, 75, 100), which slice the percentile space into four ranges, and so we call these the quartiles. These values are often called the 'five-number summary'. The 50th percentile is also known as the median. The five-number summary provides a numerical view of the distribution of a continuous variable. Another use of these quartiles is to calculate the range between the 25th and 75th percentiles, which covers the middle 50% of the values and is known as the Interquartile Range (IQR). This gives us a more robust measure of the spread of the distribution, as it is not affected by extreme values (those well below the 25th or well above the 75th percentile).
We can calculate this measure manually, or just apply the `IQR()` function.
IQR(belc$num_types) # calculate iqr for `num_types`
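For comparison, the manual route can be sketched with base R's `quantile()` on a small made-up vector (`x` is hypothetical, not taken from `belc`): the IQR is simply the 75th percentile minus the 25th percentile.

```r
# Made-up counts (hypothetical values)
x <- c(12, 15, 18, 22, 25, 31, 40, 58)

# Five-number summary: 0th, 25th, 50th, 75th and 100th percentiles
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
five_num

# IQR by hand: 75th percentile minus 25th percentile
iqr_manual <- unname(five_num["75%"] - five_num["25%"])

# Same result as the built-in function
all.equal(iqr_manual, IQR(x)) # TRUE
```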
Where with categorical variables tabular formats are often the most informative way to understand a variable, for continuous variables plots are the most informative. Let's now create some plots which provide views of the distribution of the `num_types` variable.
Picking up with the quantiles, we can create an Empirical Cumulative Distribution Function (ECDF) plot, which will give us an understanding of the proportion of values that fall at or below each point along the range of the variable.
belc %>% # dataset
  ggplot(aes(x = num_types)) + # map `num_types` to x
  stat_ecdf(geom = "step") # generate the cumulative distribution
Here we can graphically inspect the points that intersect on the x and y axes to estimate the percentile of the compositions that have some number of unique words. We can also get a specific determination by using the following functions. Let's say we want to know how many unique words are in the lower 10% of the written compositions from BELC.
ecdf(belc$num_types) %>% # calculate the cumulative distribution of unique words
  quantile(.1) # pull the value for the 10th percentile
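To see what `ecdf()` and `quantile()` are doing here, consider a small sketch on made-up values (the vector `x` is hypothetical, not the BELC data). The object returned by `ecdf()` is itself a function that gives the proportion of values at or below a given point.

```r
# Made-up values (hypothetical)
x <- c(5, 8, 8, 10, 12, 15, 20, 22, 30, 41)

Fn <- ecdf(x) # ecdf() returns a function

Fn(12)            # proportion of values less than or equal to 12 -> 0.5
quantile(Fn, 0.1) # value at the 10th percentile
quantile(x, 0.1)  # the same value, computed directly from the data
```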
Now let's move towards looking at distributions. This is done by creating either a histogram or a density plot. Let's plot both here. I will assign each plot to an object and then have them output side-by-side using the gridExtra package [@R-gridExtra].
p1 <-
  belc %>% # dataset
  ggplot(aes(x = num_types)) + # map `num_types` to x
  geom_histogram() # create histogram

p2 <-
  belc %>% # dataset
  ggplot(aes(x = num_types)) + # map `num_types` to x
  geom_density() # create density plot

gridExtra::grid.arrange(p1, p2, ncol = 2) # arrange both plots in two columns
As you can see, a histogram also provides count information, like we saw with categorical variables. However, the counts here are based on binned groups: a set of ranges (bins) is calculated that spans the entire value space, and the values that fall within each bin are counted. The size of the bins can be adjusted, but we've just gone with the default (in this case `bins = 30`). The density plot uses proportions to provide a more continuous view of the distribution. Both plot types have their advantages. With histograms it can be easier to identify outliers, while density plots can help us determine more easily whether the distribution is normal or skewed (left or right). As we will see, identifying outliers and determining the type of distribution we are working with will be useful downstream in certain types of analysis approaches.
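The binning behind a histogram can be sketched without plotting by using base R's `hist()` with `plot = FALSE`. The data below are simulated and purely illustrative, and the choice of ten bins is an assumption made for readability, not the ggplot2 default.

```r
set.seed(1) # make the made-up data reproducible
x <- round(rnorm(100, mean = 50, sd = 10)) # hypothetical counts

# Ten equal-width bins spanning the range of the data
breaks <- seq(min(x), max(x), length.out = 11)
h <- hist(x, breaks = breaks, plot = FALSE)

h$counts      # number of values falling into each bin
sum(h$counts) # every value lands in exactly one bin -> 100

# Kernel density estimate: a smoothed, proportion-based view
d <- density(x)
```

Changing the number of breaks changes the counts per bin but never their total, which is why the apparent shape of a histogram can shift with the bin size.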
On the topic of normal distributions, let's look at a useful plot for assessing the extent to which a continuous variable is normally distributed: the Quantile-Quantile plot (QQ plot).
belc %>% # dataset
  ggplot(aes(sample = num_types)) + # map `num_types` to sample
  stat_qq() + # calculate the sample and theoretical quantile points
  stat_qq_line() # plot the theoretical line
In QQ plots, the more the points diverge from the line, the less likely it is that the distribution is normal.
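As a rough numeric companion to the plot, base R's `qqnorm()` can return the theoretical and sample quantiles without drawing anything. The sketch below uses simulated samples (both are made up) and summarizes how closely the points follow a straight line with a simple correlation; this is an informal illustration, not a formal test.

```r
set.seed(1)
x_norm <- rnorm(200) # made-up sample from a normal distribution
x_skew <- rexp(200)  # made-up right-skewed sample

# qqnorm() returns theoretical (x) and sample (y) quantiles
qq_norm <- qqnorm(x_norm, plot.it = FALSE)
qq_skew <- qqnorm(x_skew, plot.it = FALSE)

# Correlation between the two sets of quantiles: closer to 1 means
# the points lie closer to a straight line
cor(qq_norm$x, qq_norm$y)
cor(qq_skew$x, qq_skew$y)
```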
Now we turn our attention to working with multiple variables. We will first look at variables of the same type, and then look at describing variables of distinct types.
Categorical
Just as with descriptions of single categorical variables, tabular summaries are very useful. When there are multiple categorical variables, we cross-tabulate. That is, the values of one variable are tabulated for each value of the other variable(s).
Let's do a cross-tabulation of the `age_group` and `sex` variables. Again we will use the `tabyl()` function but this time with two variables.
belc %>% # dataset
  tabyl(age_group, sex) # cross-tab of `age_group` and `sex`
We can also add proportions to this cross-tabulation by adding the `adorn_percentages()` function. I've also rounded the output with `adorn_rounding()`. Both these functions are part of the janitor package.
belc %>% # dataset
  tabyl(age_group, sex) %>% # cross-tab of `age_group` and `sex`
  adorn_percentages() %>% # add percentages (row by default)
  adorn_rounding(2) # round the output
The order of the variables in the `tabyl()` function lets you rotate the output. This may be desirable depending on the number of levels in a particular categorical variable.
belc %>% # dataset
  tabyl(sex, age_group) %>% # cross-tab of `sex` and `age_group`
  adorn_percentages() %>% # add percentages (row by default)
  adorn_rounding(2) # round the output
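The same logic of counts, row proportions, and rotation can be sketched in base R with `table()`, `prop.table()`, and `t()`. The data frame `df` below is made up for illustration, and these base functions stand in for the janitor functions rather than reproduce their output format.

```r
# Made-up data (hypothetical values)
df <- data.frame(
  group = c("young", "young", "old", "old", "old", "young"),
  sex   = c("female", "male", "female", "female", "male", "female")
)

tab <- table(df$group, df$sex) # counts for each combination
tab

row_props <- prop.table(tab, margin = 1) # proportions within each row
round(row_props, 2)
rowSums(row_props) # each row sums to 1

t(tab) # transposing swaps rows and columns
```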
Now we can visualize this relationship in a bar plot as well. Let's create two bar plots, in fact. One for the counts and the second for proportions.
p1 <-
  belc %>% # dataset
  ggplot(aes(x = sex, fill = age_group)) + # map sex to x and age_group to fill
  geom_bar() + # generate bar plot with counts
  labs(y = "Count") # add labels

p2 <-
  belc %>% # dataset
  ggplot(aes(x = sex, fill = age_group)) + # map sex to x and age_group to fill
  geom_bar(position = "fill") + # generate bar plot with proportions
  labs(y = "Proportion") # add labels

gridExtra::grid.arrange(p1, p2, ncol = 2) # arrange both plots in two columns
The proportions provide an apples-to-apples comparison, allowing us to see the relative sizes of the age group levels for each level of sex.
::: {.tip}
I've added another function, `labs()`, to the plot to change the y axis label to a custom label. With the `labs()` function you can change the labels of any of the axes, of other mapped aesthetics, and of the title.
:::
Continuous
For summaries of continuous variables we can generate correlation statistics as well as visualize relationships. Let's start with building a plot to visualize the relationship between `num_tokens` and `ttr`. To plot points where continuous variables coincide we use the `geom_point()` function. If we want to include a trend line we use the `geom_smooth()` function. If we want that trend line to be linear, then the argument `method = "lm"` is included.
p1 <-
  belc %>% # dataset
  ggplot(aes(x = num_tokens, y = ttr)) + # map num_tokens to x, ttr to y
  geom_point() + # plot x/y points
  labs(x = "Number of tokens", y = "Type-Token Ratio")

p2 <-
  belc %>% # dataset
  ggplot(aes(x = num_tokens, y = ttr)) + # map num_tokens to x, ttr to y
  geom_point() + # plot x/y points
  geom_smooth(method = "lm") + # add a linear trend line
  labs(x = "Number of tokens", y = "Type-Token Ratio")

gridExtra::grid.arrange(p1, p2, ncol = 2) # arrange both plots in two columns
To calculate a statistical summary of a relationship between two continuous variables (correlation) we can use the `cor()` function from base R's stats package. We select the variables we want to explore, assign them to `x` and `y`, and then select the appropriate method for the correlation assessment. For normally distributed continuous variables, we set `method =` to 'pearson' and for non-normal distributions to 'kendall'.
cor(x = belc$num_tokens, y = belc$ttr, method = "kendall") # correlation stat
Correlation statistics range from -1 to 1. The closer the value is to either of these extremes, the stronger the relationship (negative or positive, respectively). A value closer to 0 means the correlation is weak, and if it is 0 or near 0 there is no correlation.
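A tiny sketch with made-up vectors (`x` and `y` are hypothetical) shows these extremes and how the two methods differ: Pearson measures how linear the relationship is, while Kendall only looks at whether the ranks agree.

```r
x <- 1:10
y <- x^3 # increases with x, but not along a straight line

cor(x, y, method = "kendall")  # 1: the ranks agree perfectly
cor(x, y, method = "pearson")  # below 1: the relationship is not linear
cor(x, -y, method = "kendall") # -1: perfectly reversed ranks
```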
::: {.tip}
Remember that to determine if a continuous variable conforms to the normal distribution we can apply the Shapiro-Wilk Normality Test using the `shapiro.test()` function. A significant $p$-value suggests that the distribution is not normal.
:::
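As a hedged sketch of how this check could feed into the choice of correlation method, the following runs `shapiro.test()` on a simulated, strongly skewed sample (made-up data). The 0.05 cutoff is a common convention and an assumption here, not a rule taken from this recipe.

```r
set.seed(1)
x_skew <- rexp(200) # made-up, strongly right-skewed sample

res <- shapiro.test(x_skew)
res$statistic # the W statistic
res$p.value   # very small here, so normality is doubtful

# Choose the correlation method based on the test result
method <- if (res$p.value < 0.05) "kendall" else "pearson"
method
```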
Mixed
The final scenario is one in which we are interested in assessing the relationship between a categorical variable and a continuous variable. We can perform a summary using the `group_by()` and `summarise()` functions. First we group the dataset by the categorical variable and then we create a new variable which is the result of the summary statistic that we want to calculate. In this case, let's look at the mean number of tokens by each level of the learner age group.
belc %>% # dataset
  group_by(age_group) %>% # group dataset by age_group
  summarise(mean_num_tokens = mean(num_tokens)) # calculate the mean num_tokens
The statistic(s) that we want to calculate are up to us and we can create multiple statistics by adding other functions to the `summarise()` function.
belc %>% # dataset
  group_by(age_group) %>% # group dataset by age_group
  summarise(mean_num_tokens = mean(num_tokens), # calculate mean
            sd_num_tokens = sd(num_tokens), # calculate standard deviation
            median_num_tokens = median(num_tokens), # calculate median
            iqr_num_tokens = IQR(num_tokens)) # calculate the interquartile range
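The split-then-summarize logic behind `group_by()` and `summarise()` can also be sketched in base R. The data frame `df` below is made up so the results are easy to verify by hand, and `tapply()`, `split()`, and `sapply()` are used here as a base R stand-in, not as the approach this recipe recommends.

```r
# Made-up data (hypothetical values)
df <- data.frame(
  group  = c("a", "a", "a", "b", "b", "b"),
  tokens = c(10, 20, 30, 40, 60, 80)
)

# One statistic per group
tapply(df$tokens, df$group, mean) # a = 20, b = 60

# Several statistics per group
stats <- sapply(
  split(df$tokens, df$group),
  function(v) c(mean = mean(v), sd = sd(v), median = median(v), iqr = IQR(v))
)
t(stats) # one row per group, one column per statistic
```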
The numeric summaries are helpful for reporting, but a visual can be much easier to interpret. To assess a categorical variable and a continuous variable we turn to box plots.
belc %>% # dataset
  ggplot(aes(x = age_group, y = num_tokens)) + # map age_group to x and num_tokens to y
  geom_boxplot() # create box plot
In this recipe we covered various common strategies for descriptively assessing variables in a dataset. We worked with single variables of both categorical and continuous types, discussing the relationship between R's vector types and informational values as well as looking at descriptive stats and visualizations. We also looked at strategies for assessing multiple variables, either of the same type or mixed.