Questions
How can I use the functions I have learned these days together to improve my workflow?
Where do I go for help to learn more about using the tidyverse functions?
Objectives
To be able to combine the different functions we have covered in tandem to create seamless chains of data handling
To be able to search for more information about the tidyverse and its functions
If you have not worked with a program like R before, the notion of where the program runs from is an unfamiliar one. We never really think about where SPSS or Excel is running from; we have a file open and work on it. But hopefully you know where that file is being saved? This is basically where your program is running from. We call this "the working directory". In R, this concept is important when reading in or saving (writing) files out from R.
The difference in R is that where R is running and where you save, for instance, your script are not necessarily the same place. This can be confusing and lead to difficulties in managing your analysis project. We have ways of setting this working directory for R, but the absolute best way to manage your project is to utilise Rprojects! Rprojects set your working directory for you, and RStudio makes it easy to open your projects and begin where you left off. Cleverly organised Rprojects are one of the best ways to ensure reproducibility in R.
Let us create an Rproject for this course. When you later start your own analysis project, create a new one and start fresh!
Create a project by going to
File -> New Project ... -> New directory -> New project
Select the main directory you want to create your project in, and give the new folder a name, like "swc_tidyverse". Click "Create project".
RStudio will restart in a fresh instance. Notice the breadcrumbs in the navigation bar of the Files pane in RStudio. You should be able to recognise the file path as the one you specified in the GUI.
For your projects we recommend a clear folder structure. Some guiding rules:
data/: to keep all your raw data. To be read from, never written to.
output/: to keep your analysis results. To be written to and treated as disposable; your results should be reproducible by your scripts.
scripts/: where you place your scripts, i.e. files with R code running your data handling and analyses.
For instance, if you place your input data in data/my_data.csv and have started RStudio by opening your project, then in your processing script (saved in scripts/linear_models.R) you can read in the data with:
library(tidyverse)
my_data <- read_csv("data/my_data.csv")
This is called a relative path, because the path to your data is relative to your working directory. And your working directory is the folder where your project's .Rproj file is.
We can also check the working directory in the console to be sure
getwd()
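If you want to set up the recommended folder structure from within R, you can do it from the console once your project is open. A minimal sketch using base R, with the folder names suggested above:

# Create the recommended project folders (run once, from the project root)
dir.create("data")
dir.create("output")
dir.create("scripts")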
Rprojects are also great if you ever need to share your entire workflow with a colleague. If you are only using relative paths in your project, and all your scripts are safely stored in the project, giving your colleague the entire project folder should work without extra effort other than them installing the necessary packages you have used in your scripts.
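The same goes for writing results: as long as you stick to relative paths, output lands inside the project and travels with it. A small sketch, where the object and file name are just hypothetical examples:

# Write results to the project's output folder using a relative path
# `my_results` is a hypothetical object holding something you computed
write_csv(my_results, "output/my_results.csv")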
Room: plenary
Duration: 5 minutes
1. Close RStudio. Find the .Rproj file in the folder you made before for this project (i.e. navigate to the folder you made). Double click it. Did RStudio open your project for you? Double clicking .Rproj files will open them in RStudio. You can also find your project under File -> Recent Projects, which lists your most recently used projects. Hopefully it is in that list!

This session is going to be a little different from the others. We will be working through more challenges and exploring different ways of combining the things we have learned these days. So we will spend more time in break-out rooms solving challenges and being "hands-on", while in the plenary sessions we will talk about how we solved the challenges, and whether things are behaving as we expect or not, and why.
Before the break, and a little scattered through the sessions, we have been combining the things we have learned. It is when we start using the tidyverse as a whole, all the functions together, that it really becomes powerful. In this last session, we will take the things we have learned and apply them together in ways that uncover some of the cool things we can get done.
library(tidyverse)
penguin_path <- palmerpenguins::path_to_file("penguins.csv")
penguins <- read.csv(penguin_path)
Let's say we want to summarise all the measurement variables, i.e. all the columns containing "_". We've learned about summaries and grouped summaries. Can you think of a way to do that using the things we've learned?
penguins %>% pivot_longer(contains("_"))
We've done this before, so why is it a clue now? Now that we have learned grouping and summarising, what if we also group by the new name column, so that we get a summary for each of the original columns as a row!
penguins %>%
  pivot_longer(contains("_")) %>%
  group_by(name) %>%
  summarise(mean = mean(value, na.rm = TRUE))
Now we are talking! We have the mean of each of our observational columns! Let's add other common summary statistics.
penguins %>%
  pivot_longer(contains("_")) %>%
  group_by(name) %>%
  summarise(mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE),
            min = min(value, na.rm = TRUE),
            max = max(value, na.rm = TRUE))
That's a pretty neat table! The repetition of na.rm = TRUE in every summary is a little tedious, though. Let us use an extra argument in pivot_longer() to remove the NAs in the value column.
penguins %>%
  pivot_longer(contains("_"), values_drop_na = TRUE) %>%
  group_by(name) %>%
  summarise(mean = mean(value),
            sd = sd(value),
            min = min(value),
            max = max(value))
Now we have a pretty decent summary table of our data.
Room: break-out
Duration: 10 minutes
1a: In our code making the summary table, add another summary column for the number of records, giving it the name n. Hint: try the n() function.
1b: Try grouping by more variables. Is the output what you would expect it to be?
1c: Create another summary table with the same descriptive statistics (mean, sd, min, max and n), but for all numerical variables, grouped only by the variable names.
## 1a
penguins %>%
  pivot_longer(contains("_"), values_drop_na = TRUE) %>%
  group_by(name) %>%
  summarise(mean = mean(value),
            sd = sd(value),
            min = min(value),
            max = max(value),
            n = n())

## 1b
penguins %>%
  pivot_longer(contains("_"), values_drop_na = TRUE) %>%
  group_by(name, species) %>%
  summarise(mean = mean(value),
            sd = sd(value),
            min = min(value),
            max = max(value),
            n = n())

## 1c
penguins %>%
  pivot_longer(where(is.numeric), values_drop_na = TRUE) %>%
  group_by(name) %>%
  summarise(mean = mean(value),
            sd = sd(value),
            min = min(value),
            max = max(value),
            n = n())
Now that we have the summaries, we can use them in plots too! But typing or copying the same code over and over is tedious. So let us save the summary in its own object, and keep using that.
penguins_sum <- penguins %>%
  pivot_longer(contains("_"), values_drop_na = TRUE) %>%
  group_by(name, species, island) %>%
  summarise(mean = mean(value),
            sd = sd(value),
            min = min(value),
            max = max(value),
            n = n()) %>%
  ungroup()
We can for instance make a bar chart with the values from the summary statistics.
penguins_sum %>%
  ggplot(aes(x = island, y = mean, fill = species)) +
  geom_bar() +
  facet_wrap(~ name, scales = "free")
Error: stat_count() can only have an x or y aesthetic.
This error message is telling us that we have used an aesthetic that is not needed by geom_bar(). That is because geom_bar() calculates frequencies by calling stat_count(). But we don't want to count; we already have the values we want to plot. The ggplot geoms that calculate statistics for plots (like geom_bar()) have a "stat" option. When we have already calculated the stat, we can tell the geom to use the values as they are with stat = "identity".
penguins_sum %>%
  ggplot(aes(x = island, y = mean, fill = species)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ name, scales = "free")
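As a side note not covered in the session: ggplot2 also has geom_col(), which is simply geom_bar() with stat = "identity" built in, so the same plot can be written a little more compactly:

# geom_col() is shorthand for geom_bar(stat = "identity")
penguins_sum %>%
  ggplot(aes(x = island, y = mean, fill = species)) +
  geom_col() +
  facet_wrap(~ name, scales = "free")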
That is starting to look like something nice. But the way the bars for the species are stacking on top of each other makes it a little hard to read. In ggplot, there is an argument called "position" that can help us. By default in bar charts, position is set to "stack". We should try the "dodge" option.
penguins_sum %>%
  ggplot(aes(x = island, y = mean, fill = species)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ name, scales = "free")
Room: break-out
Duration: 10 minutes
2a: Create a bar chart based on the penguins summary data, with the islands on the x axis and the standard deviations on the y axis, filled by species. Make sure to dodge the bars for easier comparison. Create subplots for the different metrics (hint: use facet_wrap()).
2b: Change it so that species is both on the x-axis and the fill for the bar chart. What argument do you need to add to facet_wrap() to make the y-axis scale vary freely between the subplots? Why is this plot misleading?
## 2a
penguins_sum %>%
  ggplot(aes(x = island, y = sd, fill = species)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ name)

## 2b
penguins_sum %>%
  ggplot(aes(x = species, y = sd, fill = species)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ name, scales = "free")
The last plot is misleading because our summary data are grouped by both species and island. Ignoring island in the plot means that the values for the different islands are summed when drawing each bar! While it still portrays the data, it ignores an aspect of the data that might be important to take into account. Instead of showing that a single standard deviation for, for instance, body mass is around 200 grams, it now looks like it is almost 500 grams!
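If you want to see which values end up combined into one bar, you can look at the rows behind a single species and measurement. A quick check, using the penguins_sum object from above:

# Inspect the per-island rows that are combined into one bar in plot 2b
penguins_sum %>%
  filter(name == "body_mass_g", species == "Adelie")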
But we can get even more creative! We mentioned in the pivoting session that pivoting data is a key skill for really discovering how powerful a tool the tidyverse can be. It's when you start thinking of pivoting as a solution to various tasks that it gets super interesting. For instance, in our summary data, we have 4 different statistics, and it's hard to get them all nicely into a plot. But they all give us some information about the underlying data. How can we create a plot that showcases them all?
We can pivot even longer and create subplots for each statistic!
penguins_sum %>%
  pivot_longer(all_of(c("mean", "sd", "min", "max")))
Error: Failed to create output due to bad names.
• Choose another strategy with `names_repair`
Run `rlang::last_error()` to see where the error occurred.
What is this error? We already have a column named name, so when we let pivot_longer() try to make another one, we get an error. Tibbles will not let you create two columns with the same name, thankfully! That would be confusing. Let us make sure the new pivoted column containing the column names gets a distinct name.
penguins_sum %>%
  pivot_longer(all_of(c("mean", "sd", "min", "max")), names_to = "stat")
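As the error message hinted, another option is to let pivot_longer() repair the clashing names automatically via its names_repair argument, although giving the column a meaningful name as above is usually clearer. A small sketch:

# Let tidyr rename the clashing columns (e.g. to name...1, name...2) instead
penguins_sum %>%
  pivot_longer(all_of(c("mean", "sd", "min", "max")), names_repair = "unique")

We will keep using the names_to = "stat" version below.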
Now that we have our extra long data, we can try plotting it all! We will switch facet_wrap() to facet_grid(), which creates a grid of subplots. The formula for the grid uses both sides of the ~ sign, and you can think of it as rows ~ columns. So here we are saying we want the stat values as rows, and the name values as columns in the plot grid.
penguins_sum %>%
  pivot_longer(all_of(c("mean", "sd", "min", "max")), names_to = "stat") %>%
  ggplot(aes(x = species, y = value, fill = island)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(stat ~ name)
Room: break-out
Duration: 10 minutes
3a: It is hard to see the different metrics in the subplots, because they are all on such different scales. Try setting the y-axis scales to vary freely to allow differences between the subplots. Was this the effect you expected?
3b: Try switching up what is plotted as rows and columns in the facet. Does this help the plot?
## 3a
penguins_sum %>%
  pivot_longer(all_of(c("mean", "sd", "min", "max")), names_to = "stat") %>%
  ggplot(aes(x = species, y = value, fill = island)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(stat ~ name, scales = "free")

## 3b
penguins_sum %>%
  pivot_longer(all_of(c("mean", "sd", "min", "max")), names_to = "stat") %>%
  ggplot(aes(x = species, y = value, fill = island)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(name ~ stat, scales = "free")
facet_grid() is more complex than facet_wrap(), as it will always force the y-axis to be the same across a row, and the x-axis to be the same down a column. So while setting scales to "free" will help a little, it will only do so within each row and column, not within each subplot. When the results do not look the way you like, swapping what are rows and what are columns in the grid can often give better results.
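If you really want every subplot to have its own independent scales, one option we did not cover is to go back to facet_wrap() with both variables, since facet_wrap() frees the scales per panel. A sketch:

# facet_wrap() frees the scales for every panel individually
penguins_sum %>%
  pivot_longer(all_of(c("mean", "sd", "min", "max")), names_to = "stat") %>%
  ggplot(aes(x = species, y = value, fill = island)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(name ~ stat, scales = "free")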
Now we have covered how you can create data summaries, and how to create creative plots by combining pivots and summaries. We hope these sessions have given you a leg up in starting your R and tidyverse journey. Remember that learning to code is like learning a new language: the best way to learn is to keep trying. We promise, your efforts will not be in vain as you uncover the power of R and the tidyverse.
As an end to this workshop, we would like to provide you with some learning materials that might aid you in your further pursuit of learning R.
The tidyverse webpage offers lots of resources on learning the tidyverse way of working, and information about the great things you can do with this collection of packages. The R for Data Science learning community is an excellent and welcoming community of other learners navigating the tidyverse. We wholeheartedly recommend joining it! The RStudio Community forum is also a great place to ask questions or look for solutions, and so is Stack Overflow.