vignettes/paper.md

title: 'Welcome to the Tidyverse' output: rmarkdown::html_vignette: keep_md: TRUE tags: - data science - R authors: - name: Hadley Wickham orcid: 0000-0003-4757-117X affiliation: 1 - name: Mara Averick orcid: 0000-0001-9659-6192 affiliation: 1 - name: Jennifer Bryan orcid: 0000-0002-6983-2759 affiliation: 1 - name: Winston Chang orcid: 0000-0002-1576-2126 affiliation: 1 - name: Lucy D'Agostino McGowan orcid: 0000-0001-7297-9359 affiliation: 8 - name: Romain François orcid: 0000-0002-2444-4226 affiliation: 1 - name: Garrett Grolemund orcid: 0000-0002-7765-6011 affiliation: 1 - name: Alex Hayes orcid: 0000-0002-4985-5160 affiliation: 12 - name: Lionel Henry orcid: 0000-0002-8143-5908 affiliation: 1 - name: Jim Hester orcid: 0000-0002-2739-7082 affiliation: 1 - name: Max Kuhn orcid: 0000-0003-2402-136X affiliation: 1 - name: Thomas Lin Pedersen orcid: 0000-0002-5147-4711 affiliation: 1 - name: Evan Miller affiliation: 13 - name: Stephan Milton Bache affiliation: 3 - name: Kirill Müller orcid: 0000-0002-1416-3412 affiliation: 2 - name: Jeroen Ooms affiliation: 14 - name: David Robinson affiliation: 5 - name: Dana Paige Seidel orcid: 0000-0002-9088-5767 affiliation: 10 - name: Vitalie Spinu orcid: 0000-0002-2138-3413 affiliation: 4 - name: Kohske Takahashi orcid: 0000-0001-6076-4828 affiliation: 9 - name: Davis Vaughan orcid: 0000-0003-4777-038X affiliation: 1 - name: Claus Wilke orcid: 0000-0002-7470-9261 affiliation: 6 - name: Kara Woo orcid: 0000-0002-5125-4188 affiliation: 7 - name: Hiroaki Yutani orcid: 0000-0002-3385-7233 affiliation: 11 affiliations: - name: RStudio index: 1 - name: cynkra index: 2 - name: Redbubble index: 3 - name: Erasmus University Rotterdam index: 4 - name: Flatiron Health index: 5 - name: Department of Integrative Biology, The University of Texas at Austin index: 6 - name: Sage Bionetworks index: 7 - name: Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health index: 8 - name: Chukyo University, Japan index: 9 - name: Department of Environmental Science, Policy, & Management, University of California, Berkeley index: 10 - name: LINE Corporation index: 11 - name: University of Wisconsin, Madison index: 12 - name: None index: 13 - name: University of California, Berkeley index: 14 date: 08 November 2019 bibliography: paper.bib

Summary

{ width=120px }

At a high level, the tidyverse is a language for solving data science challenges with R code. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of R packages that share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next.

The tidyverse encompasses the repeated tasks at the heart of every data science project: data import, tidying, manipulation, visualisation, and programming. We expect that almost every project will use multiple domain-specific packages outside of the tidyverse: our goal is to provide tooling for the most common challenges; not to solve every possible problem. Notably, the tidyverse doesn't include tools for statistical modelling or communication. These toolkits are critical for data science, but are so large that they merit separate treatment. The tidyverse package allows users to install all tidyverse packages with a single command.

There are a number of projects that are similar in scope to the tidyverse. The closest is perhaps Bioconductor [@Bioconductor; @Huber2015], which provides an ecosystem of packages that support the analysis of high-throughput genomic data. The tidyverse has similar goals to R itself, but any comparison to the R Project [@R-core] is fundamentally challenging as the tidyverse is written in R, and relies on R for its infrastructure; there is no tidyverse without R! That said, the biggest difference is in priorities: base R is highly focussed on stability, whereas the tidyverse will make breaking changes in the search for better interfaces. Another closely related project is data.table [@R-datatable], which provides tools roughly equivalent to the combination of dplyr, tidyr, tibble, and readr. data.table prioritises concision and performance.

This paper describes the tidyverse package, the components of the tidyverse, and some of the underlying design principles. This is a lot of ground to cover in a brief paper, so we focus on a 50,000-foot view showing how all the pieces fit together with copious links to more detailed resources.

Tidyverse package

The tidyverse is a collection of packages that can easily be installed with a single "meta"-package, which is called "tidyverse". This provides a convenient way of downloading and installing all tidyverse packages with a single R command:

install.packages("tidyverse")

The core tidyverse includes the packages that you're likely to use in everyday data analyses, and these are attached when you attach the tidyverse package:

library(tidyverse)
#> -- Attaching packages --------------------------- tidyverse 1.2.1 --
#> v ggplot2 3.2.1     v purrr   0.3.3
#> v tibble  2.1.3     v dplyr   0.8.3
#> v tidyr   1.0.0     v stringr 1.4.0
#> v readr   1.3.1     v forcats 0.4.0
#> -- Conflicts ------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()

This is a convenient shortcut for attaching the core packages, produces a short report telling you which package versions you're using, and succinctly informs you of any conflicts with previously loaded packages. As of tidyverse version 1.2.0, the core packages include dplyr [@R-dplyr], forcats [@R-forcats], ggplot2 [@ggplot2], purrr [@R-purrr], readr [@R-readr], stringr [@R-stringr], tibble [@R-tibble], and tidyr [@R-tidyr].

Non-core packages are installed with install.packages("tidyverse"), but are not attached by library(tidyverse). They play more specialised roles, so will be attached by the analyst as needed. The non-core packages are: blob [@R-blob], feather [@R-feather], jsonlite [@R-jsonlite], glue [@R-glue], googledrive [@R-googledrive], haven [@R-haven], hms [@R-hms], lubridate [@R-lubridate], magrittr [@R-magrittr], modelr [@R-modelr], readxl [@R-readxl], reprex [@R-reprex], rvest [@R-rvest], and xml2 [@R-xml2].

The tidyverse package is designed with an eye for teaching: install.packages("tidyverse") gets you a "batteries-included" set of 87 packages (at time of writing). This large set of dependencies means that it is not appropriate to use the tidyverse package within another package; instead, we recommend that package authors import only the specific packages that they use.

Components

How do the component packages of the tidyverse fit together? We use the model of data science tools from "R for Data Science" [@r4ds]:

{ width=80% }

Every analysis starts with data import: if you can't get your data into R, you can't do data science on it! Data import takes data stored in a file, database, or behind a web API, and reads it into a data frame in R. Data import is supported by the core readr [@R-readr] package for tabular files (like csv, tsv, and fwf).

Additional non-core packages, such as readxl [@R-readxl], haven [@R-haven], googledrive [@R-googledrive], and rvest [@R-rvest], make it possible to import data stored in other common formats or directly from the web.

Next, we recommend that you tidy your data, getting it into a consistent form that makes the rest of the analysis easier. Most functions in the tidyverse work with tidy data [@tidy-data], where every column is a variable, every row is an observation, and every cell contains a single value. If your data is not already in this form (almost always!), the core tidyr [@R-tidyr] package provides tools to tidy it up.

Data transformation is supported by the core dplyr [@R-dplyr] package. dplyr provides verbs that work with whole data frames, such as mutate() to create new variables, filter() to find observations matching given criteria, and left_join() and friends to combine multiple tables. dplyr is paired with packages that provide tools for specific column types:

There are two main tools for understanding data: visualisation and modelling. The tidyverse provides the ggplot2 [@ggplot2] package for visualisation. ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics [@wilkinson2005].

You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. Modelling is outside the scope of this paper, but is part of the closely affiliated tidymodels [@R-tidymodels] project, which shares interface design and data structures with the tidyverse.

Finally, you'll need to communicate your results to someone else. Communication is one of the most important parts of data science, but is not included within tidyverse. Instead, we expect people will use other R packages, like rmarkdown [@R-rmarkdown] and shiny [@R-shiny], which support dozens of static and dynamic output formats.

Surrounding all these tools is programming. Programming is a cross-cutting tool that you use in every part of a data science project. Programming tools in the tidyverse include:

Design principles

We are still working to explicitly describe the unifying principles that make the tidyverse consistent, but you can read our latest thoughts at https://design.tidyverse.org/. There is one particularly important principle that we want to call out here: the tidyverse is fundamentally human centred. That is, the tidyverse is designed to support the activities of a human data analyst, so to be effective tool builders, we must explicitly recognise and acknowledge the strengths and weaknesses of human cognition.

This is particularly important for R, because it’s a language that’s used primarily by non-programmers, and we want to make it as easy as possible for first-time and end-user programmers to learn the tidyverse. We believe deeply in the motivations that lead to the creation of S: "to turn ideas into software, quickly and faithfully" [@programming-with-data]. This means that we spend a lot of time thinking about interface design, and have recently started experimenting with surveys to help guide interface choices.

Similarly, the tidyverse is not just the collection of packages --- it is also the community of people who use them. We want the tidyverse to be a diverse, inclusive, and welcoming community. We are still developing our skills in this area, but our existing approaches include active use of Twitter to solicit feedback, announce updates, and generally listen to the community. We also keep users apprised of major upcoming changes through the tidyverse blog, run developer days, and support lively discussions on RStudio community.

Acknowledgments

The tidyverse would not be possible without the immense work of the R-core team who maintain the R language and we are deeply indebted to them. We are also grateful for the financial support of RStudio, Inc.

References



hadley/tidyverse documentation built on Nov. 4, 2023, 5:30 a.m.