library(edr) library(tidyverse)
This chapter covers
pivot_longer()
functionna_if()
functionThe last two chapters introduced us to data transformation (using dplyr) and data visualization (using ggplot). We really learned a lot, and we just dove right in with example data tables that were furnished by the edr package. The sw
and dmd
datasets from Chapters 2 and 3 were tidy datasets. We didn't have to think much about the arrangement of the data within those tables, there were used as is. They just worked. Quite often, however, the datasets you'll encounter and want to use will not be tidy. This can cause problems in your analyses, so we really need to make the effort to tidy them before performing any analysis. This chapter is all about recognizing the difference between tidy and untidy data, and then, using strategies to tidy that data before getting to the analysis stage.
While the edr package usually offers tidy datasets, there are a few untidy ones in there as well. We will use the untidy winniweather
dataset in our examples for this chapter. The dataset contains temperature data (four times daily) for the City of Winnipeg during the month of February 2015. As ever, we get access to all of edr's datasets using library(edr)
. Once that statement is executed, the untidy winniweather
table will be available. This is not a large dataset since it has merely 28 rows (one row per day of the study month) and the following six variables:
yearmonth
: a character string representing the year and the month; it's always "2015-2"
in this datasetday
: an integer representing the day of the monthtemp00_00
, temp06_00
, temp12_00
, and temp18_00
: hourly temperatures in degrees Celsius for the hours of 12 AM, 6 AM, 12 PM, and 6 PMFor more information on this dataset, execute help(winniweather)
in the RStudio console. The help page should provide more background detail than what's provided here.
Let's first learn how to identify untidy data. It's not always plain to see but we'll try to hone this sense by learning the principles of tidy data, identifying characteristics that are untidy, and then looking at several examples of untidiness. After that, the focus will be on turning the very untidy winniweather
dataset into a tidy data table. Our eventual goal will be to generate several exploratory plots of the tidied dataset. In doing so, we'll come across common pitfalls, learn how to fix those problems, and make increasingly better plots.
Tidy data is structured in a standardized way such that it is ready for analysis with tidy tools. Put another way, data values within a tidy dataset are organized in a manner expected by these data analysis tools. Having a standard solves many problems of data cleaning and analysis: we don't need to develop our own specification of what an analysis-ready dataset is, and, downstream data analysis tools can take advantage of a standard spec for tidy data and work well together.
In practice raw data is rarely tidy, and, because of that, it is much harder to work with as a result (downstream analysis tools expect tidy data). If we use techniques for tidying the data before we proceed to the analysis phase of our work, we will experience much less difficulty in that analysis. There will be no need to develop workarounds and there will be an overall lower risk of errors during data analysis.
In general, tidy data has these characteristics:
It's important to recognize untidy data when you get a new dataset. Detecting such untidiness will help us develop a strategy for tidying that makes the analysis of the data go that much more smoothly. Here is a list of the more common scenarios:
It can be difficult to both recognize when data is untidy and what the best course of action might be when untidiness is detected. The rest of this chapter will use the untidy winniweather
dataset to point out some of the more common ways that data can be untidy. The lessons will also demonstrate some useful techniques for transforming the untidy dataset into a tidy and useful dataset. The big payoff will be the ability to plot all of the data points in the tidied version of winniweather with ggplot (which is not initially possible without severe complications).
The tidyr package, part of the Tidyverse, has been designed to help with the task of tidying tabular data. We only need a few functions from the package to transform an untidy data table (winniweather
) to a tidy one.
The winniweather
dataset is available in the edr package; let's look at the top of the table by printing the dataset.
r edr::code_hints(
"**CODE //** The ~~winniweather~~ dataset, printed to the console."
)
winniweather
The dataset provides hourly temperatures at four times in each day, at 00:00
, 06:00
, 12:00
, and 18:00
(all 24-hour times) in the last four columns. The temperature values are in degrees Celsius. The first column, yearmonth
, gives us the year and month (as combined string, in a YYYY-M
format), and the second column (day
) has the day of the month.
The winniweather
dataset was introduced as an example of an untidy dataset. Let's look at a few of the traits that make this table untidy. The first and more obvious of these is that there are multiple observations in each line: each line has temperature readings at different times in the day (this is facilitated by encoding column names with time-of-day data). Look at the columns starting with "temp"
, each contains time data in the column names with a tempHH_MM format. According to the principles of tidy data, each observation is placed in its own row. So, to make this tidier, a good strategy is to create hour
and temp
columns. This would result in four rows for every row currently in the dataset and this makes intuitive sense because it's really four separate temperature observations.
Another potential problem is that the yearmonth
column is a combined value of the year and the month. It doesn't make much sense that the year and the month are combined while the day isn't combined with anything. We should split yearmonth
into year
and month
columns since tidy data stipulates that each value is placed in its own cell.
pivot_longer()
FunctionLet's start to transform winniweather
into a tidy dataset by first addressing the "temp*"
columns with tidyr's pivot_longer()
function. This function takes multiple columns and collapses them into key-value pairs (making the resulting table longer, with more rows). Put another way, the column names become values in the new key column and the values in each of the targeted columns go into the value column (we can supply our own names for the new key and value columns). It's far better to show than tell in this case. The following demonstrates how we would use pivot_longer()
to make this table more tidy.
r edr::code_hints(
"**CODE //** How to make data ~~pivot_longer()~~ and thus make it more tidy.",
c(
"#A We are piping the ~~winniweather~~ dataset to the ~~pivot_longer()~~ function.",
"#B We start by supplying a set of columns that contain data to be put in a 'longer' format (i.e., one observation per row). The ~~starts_with()~~ function is familiar: it's also used in **dplyr**'s ~~select()~~ function.",
"#C The ~~names_to~~ argument expects a name for a new column. This new column will contain the column names from the ~~starts_with()~~ selection.",
"#D The ~~values_to~~ argument also expects a name (again, of your choosing) for a new column; this column will contain the values."
))
winni_mod <- winniweather %>% #A pivot_longer( starts_with("temp"), #B names_to = "hour", #C values_to = "temp" #D ) winni_mod
The printing of the new winni_mod
tibble in the console shows the transformation of multiple temperature columns to a single a temp column.
We supplied a key of "hour"
and a value of "temp"
and targeted the columns that start with "temp"
(using the select helper function starts_with()
, for convenience). What we get in the output table (stored as the variable winni_mod
) is a longer table of (112 rows) compared to 28 rows in the original table. This is four times the row count of the original since the four separate observations per row in winniweather
are now given separate rows (as we need one row per observation for tidiness).
Although this is a great start, there is still some work to do to make the resulting table more useful. We can see that the hour
column in its current state is not ideal as it contains a string with the hour embedded within it. It's better to make that a number representing the hour of day. Let's do that now with a combination of dplyr's mutate()
function paired with case_when()
.
r edr::code_hints(
"**CODE //** Modifying values in the hour column with ~~mutate()~~.",
c(
"#A We are piping the ~~winni_mod~~ object to **dplyr**'s ~~mutate()~~ function.",
"#B The ~~hour~~ column already exists. We are essentially modifying values within the ~~hour~~ column on a case-by-case basis with the ~~case_when()~~ function.",
"#C Each line represents a different possible case (~~hour~~ being equal to a specific string) and a replacement value (the integer-based hour of day in 24-hour time)."
))
winni_mod <- winni_mod %>% #A mutate(hour = case_when( #B hour == "temp00_00" ~ 0L, #C hour == "temp06_00" ~ 6L, #C hour == "temp12_00" ~ 12L, #C hour == "temp18_00" ~ 18L #C )) winni_mod
We now find that winni_mod
has an hour
column with integer-based values (instead of text labels).
Remember that mutate()
modifies existing columns (or creates new ones) by providing it with a column name (here, hour
) and an expression. The case_when()
function and the resulting expression we have here is useful when you want to supply different logic on a case-by-case basis. The different cases are hour
being equivalent to each of the four different temp*
values. To the right of each ~
symbol are the expressions for each case. The expressions are very simple in every case: use the integer values of 0
, 6
, 12
, or 18
(the trailing L
tells R to store these values as integers).
separate()
Function to Split a Column into SeveralWe can now separate yearmonth
into year
and month
columns. We'll make these integer columns just like hour. The tidyr function to use here is separate()
and it contains all the features we need to split a column of data into multiple columns.
Using the separate()
function effectively only requires a little planning, which will be outlined here. To start, we know which column we want to separate (yearmonth
) and we also have names for the new columns (year
and month
). These names will be used within the col and into arguments of separate()
. Secondly, we know that the yearmonth
column in this dataset contains the 2015-2
value repeated in all rows (i.e., there is a single unique value). We ultimately want the 2015
part in a new year
column and the 2
part in a new month
column. That the hyphen separates the values we want in our new columns is ideal for our use of separate()
: we can use sep = "-"
and the function will understand that values before and after that character will comprise the new columns. Lastly, we are taking a character column as input (yearmonth
) and we'd like integer-based values in the new year
and month
columns. We can ask separate()
to do this conversion for us with the convert = TRUE
option. Let's see how this comes together in the code and table output:
r edr::code_hints(
"**CODE //** Separating the ~~yearmonth~~ column into ~~year~~ and ~~month~~ columns.",
c(
"#A The ~~winni_mod~~ object is piped into the ~~separate()~~ function, the result of that is stored as ~~winni_tidy~~.",
"#B For the ~~col~~ argument, we should provide the column name that contains the string to separate into multiple columns (~~yearmonth~~). The column name shouldn't be in quotes because the column exists.",
"#C The ~~into~~ argument lets us define column names for the split string. Since we are separating a string into two parts, we will provide two column names (in ~~c()~~): ~~year~~ and ~~month~~. These new names need to be in quotes because they don't yet exist.",
"#D With ~~sep~~, we provide an instruction on how to separate the strings in ~~col~~. In this case, we are splitting on a hyphen (~~-~~): everything left of it will be put into the ~~year~~ column, everything to the right of the hyphen is placed in the ~~month~~ column.",
"#E Using ~~convert = TRUE~~ will convert the new columns (~~year~~ and ~~month~~) to integer-type columns in this case."
))
winni_tidy <- #A winni_mod %>% separate( col = yearmonth, #B into = c("year", "month"), #C sep = "-", #D convert = TRUE #E ) winni_tidy
We see from the table printed in the console that the yearmonth
column has been replaced by the year
and month
columns.
This table is now tidy! As a last thing to do we could arrange the rows to be in the order of the observation times. This is the perfect case for using the arrange()
function.
r edr::code_hints(
"**CODE //** Using ~~arrange()~~ to put the observations in the correct order.",
c(
"#A The ~~arrange()~~ function from the **dplyr** package takes unquoted column names (or expressions with ~~desc()~~, not shown here but demonstrated in Chapter 2). The order is important. We are sorting first by ~~year~~, then ~~month~~, then ~~day~~, and finally by ~~hour~~."
))
winni_tidy <- winni_tidy %>% arrange(year, month, day, hour) #A winni_tidy
The output is now arranged, and the printed tibble above allows us to verify this.
Recall that the columns given to arrange()
have a sorting priority. Here we are sorting in order of the largest time parts to the smallest (year
, month
, day
, hour
). Furthermore, the arrangement is in ascending order, where the time of each observation is increasing as we move down the table.
We now have a single observation per row and, as far as we can tell, all observations are accounted for. There is immense value in having our data in a tidy format. For one thing, it's easier now to plot the data as time-series plot using ggplot. We do have to make one more alteration though (and this is only a small change). We should create a date-time column so that each point in time is represented by a single variable. We can do that with dplyr's mutate()
function and the base function ISOdate()
. With those functions, the new iso_date
column can be created by supplying all of the time-component columns to ISOdate()
.
r edr::code_hints(
"**CODE //** A date-time column is necessary for **ggplot**-based time-series plots."
)
winni_tidy <- winni_tidy %>% mutate(iso_date = ISOdate(year, month, day, hour)) winni_tidy
The winni_tidy
table now indeed contains a new column called iso_date
. The iso_date
column appears to have a type with the <dttm>
(date-time) abbreviation in the tibble print out. The object type of iso_date
is not character
or numeric
but instead POSIXct
. It's a convenient way to store accurate date-time values in a single column of a table.
About POSIXct
: Under the hood, the POSIXct
data type contains the number of seconds since the start of January 1, 1970. Positive numbers represent the number of seconds after this time, and negative numbers represent the number of seconds beforehand. Try using the Sys.time()
function (which gives you a POSIXct
object with the current date-time) with as.numeric()
in the R console like this: Sys.time() %>% as.numeric()
. You'll get the number of seconds after 1970-01-01 00:00:00
.
Now that we have the time data represented in a single column, a basic scatterplot in ggplot requires only two statements: the ggplot()
call followed by defining a layer of points with the geom_point()
function.
r edr::code_hints(
"**CODE //** A scatterplot of all observations in ~~winni_tidy~~ using **ggplot**.",
c(
"#A The ~~ggplot()~~ function call begins the plotting process, where we provide the tidied ~~winni_tidy~~ table as the plot data.",
"#B We use the ~~geom_point()~~ *geom* to indicate that we want a scatterplot. We are giving this function the minimally required aesthetics of ~~x~~ and ~~y~~ (the ~~iso_date~~ and ~~temp~~ columns in ~~winni_tidy~~)."
))
ggplot(data = winni_tidy) + #A geom_point(aes(x = iso_date, y = temp)) #B
(ref:winni-tidy-ggplot-1) The scatterplot with all temperature data reveals a spurious data point.
The plot we obtain from this code might surprise us since there is a temperature value with an extremely high value (9999
). Plots like these are easy to create and are very useful for finding anomalous values hiding in the dataset. In this case, we see a single spurious value. There is a logical explanation for why we have a very high temperature value in this dataset: it is actually missing value. In datasets you may obtain from different sources, the encoding of missing values can take various forms. Sometimes, the field is blank, there may be text (e.g., "N/A"
, "missing"
, "-"
, etc.), or, in this case, it may be a value of the same type (numeric
) but set to a physically improbable value (9999 degrees Celsius!). Once we know how missing values are encoded for any particular dataset, we can take action to re-encode as proper NA
values. In this way, our tidy tools, such as ggplot, can better handle the data.
NA
sThe winni_tidy
table that we worked hard to make tidy requires yet another change in the name of tidiness. The 9999
value, which is a missing value according to the data creator, needs to be transformed to an NA
value. The tidyr package has a useful function for this, and it works nicely in conjunction with dplyr's mutate: the na_if()
function. We can use the same construction as with mutate()
/case_when()
to modify the temp column, replacing that 9999
value with an NA
value.
r edr::code_hints(
"**CODE //** Replacement of missing values with the ~~mutate()~~ and ~~na_if()~~ functions.",
c(
"#A The ~~na_if()~~ function is used inside of ~~mutate()~~ to modify the ~~temp~~ column. Inside ~~na_if()~~ we first supply the column name to modify, then, we supply the value that is considered to be an ~~NA~~ value."
))
winni_tidy <- winni_tidy %>% mutate(temp = na_if(temp, 9999)) #A
The code that mutates the winni_tidy
table writes over the previous version of winni_tidy.
How do we verify that the change has been successful? After all, using winni_tidy
by itself only shows us the first few rows of the table. We need a better way to inspect the modified table, so that we can sanity check our work.
A reasonable method involves both dplyr's select()
function and the base R function summary()
. The idea is to reduce the table to the column of interest first, and then get summary statistics for the single column (so as to not clutter up the console with summary output for many columns). We won't overwrite the winni_tidy
object with a reassignment here, we just want to inspect the table non-destructively.
r edr::code_hints(
"**CODE //** Inspecting the data in a column (~~temp~~) to verify that the mutation had occurred.",
c(
"#A No reassignment back to ~~winni_tidy~~ here like in previous code listings, we just want to inspect the data.",
"#B We are selecting the ~~temp~~ column to focus only on that column's summary statistics.",
"#C The ~~summary()~~ function is very useful for getting simple summary statistics and validating one's assumptions of the data."
))
winni_tidy %>% #A select(temp) %>% #B summary() #C
The output shows that the maximum value for temp
is now -3.30
and not 9999
, which is a huge improvement! Also, we now have a single NA
value in the temp
column (this was our doing).
On this road to tidiness, we have accomplished quite a lot. This is very much how it goes with real data, it's often a touch-and-go process of tidying, inspecting, tidying, inspecting, and so on.
Let's now plot this data again. The next bit of plotting code isn't different than the previous plotting code, but it's here again for sake of convenience.
r edr::code_hints(
"**CODE //** Another attempt at a scatterplot of all the observations in ~~winni_tidy~~ (the even tidier version).",
c(
"#A This **ggplot** code is unchanged from before, which just goes to show that **ggplot** code can be highly reusable."
))
ggplot(data = winni_tidy) + #A geom_point(aes(x = iso_date, y = temp))
(ref:winni-tidy-ggplot-2) The scatterplot with all temperature data after dealing with the spurious data point.
This plot is much changed from the previous one. This plot is now usable, since the temperature value of 9999
has now been classified as an NA
value (and NA
values aren't plotted, of course). Let's make things interesting and facet the plot by the hour of day. We have that flexibility now since there is an hour
column. We do this in ggplot with the facet_wrap()
function, and the result is a plot with four facets (Figure \@ref(fig:winni-tidy-ggplot-3)).
r edr::code_hints(
"**CODE //** Faceting the temperatures in winni_tidy by the time of day.",
c(
"#A We are using ~~facet_wrap()~~ to make a set of plots faceted by the four different hours that temperatures are available each day in this dataset. The ~~labeler = label_both~~ statement always provides useful information and is a nice touch."
))
ggplot(data = winni_tidy) + geom_point(aes(x = iso_date, y = temp)) + facet_wrap(vars(hour), labeller = label_both) #A
(ref:winni-tidy-ggplot-3) A scatterplot with facets. Now possible with our tidy data.
The plot of Figure \@ref(fig:winni-tidy-ggplot-3) couldn't be generated without our tidying transformations. Actually, none of these plots could be made with the same number of data points we have here. The data were effectively divided across four columns, and, we didn't have a date-time column. But now, we have the freedom to further transform the tidy data (e.g., filtering the data, converting temperature values to degrees Fahrenheit, etc.) and plot again with ease.
pivot_longer()
, separate()
, and na_if()
functions from the tidyr package, we were able to tidy an otherwise untidy datasetWe used the separate()
function to split values in one column to two columns. When doing so, why was yearmonth
column not in quotation marks whereas the year
and month
columns were?
Previously, we used the arrange()
function to put the observations in the correct order (observations of temperature by ascending time). If we didn't do this beforehand, what would be resulting plots look like?
The ISOdate()
function was used to create a POSIXct date-time column. There is also an ISOdatetime()
function. Take a look at the help file (?ISOdate
has an article with both functions) and state why using ISOdate()
was the better choice.
Entirely empty columns (NA
s all the way down) are not entirely uncommon in a dataset. Sometimes we can safely discard such columns. How would you do this if winni_tidy
had an empty column called empty
? Provide a statement.
Sometimes, we'll get data tables that have entirely empty rows (NA
s in every field). We typically want to discard those rows. Consult the tidyr documentation and find out which function helps us do this. How would you use that function to remove an entirely empty row from winni_tidy
?
Recall that we replaced the 'missing' (9999
) value with the combination of the mutate()
and na_if()
functions. After that, the table was inspected with summary()
to see that the change was made. Can you think of another way to check that the 9999
value in temp
was recoded with an NA
value?
The reason that yearmonth
was not in quotation marks is because the column (at the time of the function call) existed in the data table. Columns that exist are, by convention, not put in quotes. The year
and month
columns at the time of the function call did not exist (they were yet to be made), that's why they are in quotes.
The plots would look no different without the arrange()
transformation. Then, why did we do it? It was a good review of the arrange()
function.
The ISOdate()
function has defaults for the time parts (hour = 12
, min = 0
, sec = 0
) and for the time zone (tz = "GMT"
) whereas ISOdatetime()
has no such default values. As our data did not have minute or second time components, we would have to specify our own values for those.
The easiest way to do this is to use dplyr's select()
function: winni_tidy %>% select(-empty)
.
The drop_na()
function from tidyr allows us to remove entirely or partially empty rows. One or more entirely empty rows in winni_tidy
can be removed with winni_tidy %>% drop_na()
.
One possible, alternate method is to use dplyr's filter()
function to check for both an NA
and for a 9999
value in the temp column: winni_tidy %>% filter(is.na(temp) | temp == 9999)
. Success looks like a single row with an NA
in the temp column.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.