library(learnr)
library(tibble)
library(ggformula)
library(mosaic)
theme_set(theme_bw())
data(Births, package = "mosaicData")
if (require(NHANES)) {
  data(NHANES, package = "NHANES")
}
for (y in unique(Births$year)) {
  assign(paste0("Births", y), Births |> filter(year == y))
}
Births1978 <- Births |> filter(year == 1978)
# knitr::opts_chunk$set(exercise.checker = checkr::checkr_tutor)
knitr::opts_chunk$set(
  echo = FALSE,
  fig.align = "center",
  fig.width = 6,
  fig.height = 2.5)
Here is an interesting plot showing the number of live births in the United
States each day of 1978. We are going to use it to learn how to create
plots using the `ggformula` package.
gf_point(births ~ date, data = Births1978)
To get R (or any software) to create this plot (or do anything else, really), there are two important questions you must be able to answer. Before continuing, see if you can figure out what they are.
To get R (or any software) to create this plot, there are two important questions you must be able to answer:
To make this plot, the answers to our questions are
A. Make a scatter plot (i.e., a plot consisting of points)
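A. The data used for the plot: the y-axis variable (`births`), the x-axis variable (`date`), and the data set that contains them (`Births1978`).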
We just need to learn how to tell R these answers.
We will provide answers to our two questions by filling in the boxes of this important template:
We just need to identify which portions of our answers go into which boxes.
It is useful to provide names for the boxes:
These names can help us remember which things go where.
(The `...` indicates that there are some additional arguments we will add
eventually.)
Sometimes we will add or subtract a bit from our formula. Here are some other forms we will eventually see.
# simpler version
goal( ~ x, data = mydata )
# fancier version
goal( y ~ x | z , data = mydata )
# unified version
goal( formula , data = mydata )
box | fill in with | purpose |
---|---|---|
`goal` | `gf_point` | plot some points |
`y` | `births` | y-axis variable |
`x` | `date` | x-axis variable |
`mydata` | `Births1978` | name of data set |
Put each piece in its place in the template below and then run the code to create the plot.
goal(y ~ x, data = mydata)
For the record, here are the first few rows of `Births1978`.
Births1978 |> head(3) |> select(births, date, day_of_year)
The most distinctive feature of `ggformula` plots is the use of formulas to
describe the positional information of a plot.
Formulas in R always involve the tilde character, which is easy to overlook.
It looks like this:
knitr::include_graphics("images/tilde.png")
The position of `~` on the keyboard varies from brand to brand. On Apple keyboards, it's here.
knitr::include_graphics("images/keyboard-tilde.png")
Most `gf_` functions take a formula that describes the positional
attributes of the plot. Using one of these functions with no arguments
will show you the "shape" of the formula it requires.
Run this code to see the formula shape for `gf_point()`.
gf_point()
You should see that `gf_point()`'s formula has the shape `y ~ x`,
so the $y$-variable name goes before the tilde and the $x$-variable name goes
after. (Think: "y depends on x". Also note that the $y$-axis label appears
farther left than the $x$-axis label.)
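A formula is an ordinary R object, so you can inspect one directly. The short sketch below uses only base R. The names `f` and `g` are just illustrative, and nothing is plotted, so the variables named in the formulas do not need to exist yet.

```r
# A two-sided formula: y-variable before the tilde, x-variable after it.
f <- births ~ date

class(f)     # "formula"
all.vars(f)  # "births" "date"

# A one-sided formula has nothing before the tilde.
g <- ~ date
all.vars(g)  # "date"
```

Because the formula only records variable names, R looks them up later in whatever data set you pass with `data =`.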
Reverse the roles of the variables -- changing `births ~ date` to
`date ~ births` -- to see how the plot changes.
gf_point(births ~ date, data = Births1978)
Change `date` to `day_of_year` and see how the plot changes.
(If you do this on a separate line, you will see both plots at once.)
gf_point(births ~ date, data = Births1978)
This tutorial includes similar data sets for the years 1969 to 1988. To use data from a different year, just change the name of the data set to indicate the year.
Fill in the "blanks" below to create plots for three different years. To what extent do the patterns from 1978 persist over time?
gf_point(births ~ date, data = Births1978)
gf_point(births ~ date, data = Births____)
gf_point(births ~ date, data = __________)
(Note: This creates three separate plots. We will learn how to combine them into a single plot shortly.)
Our plots have points because we have used `gf_point()`. But there are
many other `gf_` functions that create different types of plots.
Experiment with some other plot types by changing `gf_point()` to
one of the following:
- `gf_line()`: connect the dots
- `gf_smooth()`: smoothed version of `gf_line()`
- `gf_lm()`: regression line (same as `gf_smooth()` with `method = "lm"`)
- `gf_spline()`: another type of smoother (using splines)
gf_point(births ~ date, data = Births1978)
If you created the spline plot, you probably found it "too wiggly". If you created
the smoothed plot, you might prefer that the gray bands not be displayed. For any
of these plots you might have preferred different colors or sizes of things.
Such global characteristics of a plot can be adjusted with additional arguments to
the `gf_` function. These go in the `...` part of the template.
The general form for these is `option = value`.
For example,
- `spar = 0.5` (or any number between 0 and 1) controls the amount of smoothing in `gf_spline()`.
- `se = TRUE` (note capitalization) turns on the "error band" in `gf_smooth()` and `gf_lm()`.
- `color = "red"` or `fill = "navy"` (note quotes) can be used to change the colors of things. (`fill` is typically used for regions that are "filled in" and `color` for dots and lines.)
- `alpha = 0.5` (or any number between 0 and 1) will set the opacity (0 is completely transparent and 1 is completely opaque).
Here are some examples. Adjust some of the options to see how things change. We've inserted some line breaks to make the options easier to locate in the code.
gf_spline(births ~ date, data = Births1978, spar = 0.5, color = "navy")
gf_smooth(births ~ date, data = Births1978, se = TRUE, color = "red")
gf_point(births ~ date, data = Births1978, size = 3, shape = 18, alpha = 0.5, color = "purple")
You can learn about the attributes available for a given layer using the
"quick help" for a layer function. You can find out more by reading the help file
produced with `?`.
# "quick help" for gf_point()
gf_point()
Curious to know all the available color names? Run this code.
colors()
We said that `gf_point()` creates a plot with points.
This isn't quite true. Technically, it creates a layer with points.
A plot may have multiple layers. To create a multi-layered plot, simply
append `|>` at the end of the code for one layer and follow that with
another layer.
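As a small illustration of layering, here is a sketch on a different data set, `KidsFeet` from the mosaicData package (introduced later in this tutorial). It assumes ggformula and mosaicData are installed. The object name `p_layers` is just for illustration.

```r
library(ggformula)
data(KidsFeet, package = "mosaicData")

# First layer: points. Second layer: a regression line.
# The second layer inherits the formula and data from the first.
p_layers <- gf_point(length ~ width, data = KidsFeet) |>
  gf_lm()

p_layers
```

The `|>` must come at the end of a line, not the start of the next one, so that R knows the expression continues.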
- Add `|>` at the end of the first two lines.
- Change `date` to `day_of_year` so that all three years are aligned.
- Try `gf_smooth()` or `gf_spline()` in place of `gf_point()`.
gf_point(births ~ date, data = Births1978, color = "navy")
gf_point(births ~ date, data = Births1969, color = "red")
gf_point(births ~ date, data = Births1988, color = "skyblue")
The births data in 1978 contains two clear "waves" of dots. One conjecture is that these are weekdays and weekends. We can test this conjecture by putting different days in different colors.
In the lingo of `ggformula`, we need to map color to the variable `wday`.
Mapping and setting attributes are different in an important way.
- `color = "navy"` sets the color to "navy". All the dots will be navy.
- `color = ~ wday` maps color to `wday`. This means that the color will
depend on the values of `wday`. A legend (aka, a guide) will be
automatically included to show us which days are which.
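To see the difference side by side without giving away the exercise, here is a sketch using the `KidsFeet` data from mosaicData and its `sex` variable. It assumes ggformula and mosaicData are installed. `p_set` and `p_map` are illustrative names.

```r
library(ggformula)
data(KidsFeet, package = "mosaicData")

# Setting: a fixed value in quotes; every point is navy and no legend appears.
p_set <- gf_point(length ~ width, data = KidsFeet, color = "navy")

# Mapping: a one-sided formula; color now depends on the values of `sex`,
# and a legend is added automatically.
p_map <- gf_point(length ~ width, data = KidsFeet, color = ~ sex)

p_set
p_map
```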
Change the color argument so that it maps to `wday`.
Don't forget the tilde (`~`).
Try some other plot types: `gf_line()`, `gf_smooth()`, etc. Which do you
like best? Why?
gf_point(births ~ date, data = Births1978, color = "navy")
quiz(
  question(
    "Does it appear that the conjecture about weekends is generally correct?",
    answer("yes", correct = TRUE),
    answer("no")
  ),
  question(
    "What happens if you omit the `~` before `~wday`?",
    answer("It works just the same. The tilde isn't required."),
    answer("All the dots are the same color."),
    answer("There is an error message: 'wday' not found.", correct = TRUE)
  )
)
If we want to look at all 20 years of birth data, overlaying the data is likely to put too much information in too little space and make it hard to tell which data is from which year. (Deciphering 20 colors or 20 shapes can be hard, too.) Instead, let's put each year in a separate facet or sub-plot. Facets are better than 20 separate plots because the coordinate systems are shared across the facets, which saves space and makes comparisons across facets easier.
There are two ways to create facets. The simplest way is to add a vertical bar `|` to
our formula.
gf_point(births ~ day_of_year | year, data = Births, size = 0.5)
The second way is to add on a facet command using `|>`:
gf_point(births ~ day_of_year, data = Births, size = 0.5) |> gf_facet_wrap( ~ year)
Edit one of the plots above to do the following:
wday
Now edit one of the plots above to use `date` instead of `day_of_year`.
What advantage do we get from using facets in this case?
The faceting we did on the previous page is called facet wrapping. If the facets don't fit nicely in one row, the facets continue onto additional rows.
A facet grid uses rows, or columns, or both in a fixed way.
gf_point(births ~ day_of_year | year ~ wday, data = Births, size = 0.5)
Recreate the plot above using `gf_facet_grid()`. This works much like `gf_facet_wrap()`
and accepts a formula with one of three shapes:
- `y ~ x` (facets along both axes)
- `~ x` (facets only along x-axis)
- `y ~ .` (facets only along y-axis; notice the important dot in this one)
(These three formula shapes can also be used on the right side of `|`.)
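For a feel of how the two-sided shape behaves, here is a sketch on the `KidsFeet` data from mosaicData rather than the births data. It assumes ggformula and mosaicData are installed. The choice of `sex` for rows and `domhand` for columns is only for illustration.

```r
library(ggformula)
data(KidsFeet, package = "mosaicData")

# Rows are determined by `sex`, columns by `domhand`.
p_grid <- gf_point(length ~ width, data = KidsFeet) |>
  gf_facet_grid(sex ~ domhand)

p_grid
```

Swapping in `~ domhand` or `sex ~ .` gives the one-directional versions described above.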
Some `gf_` functions only require the user to supply one variable. The y-positions
for these plots can be calculated from the x-variable.
Here are two examples. Run the code below to see the formula shapes used by these plots.
gf_histogram()
gf_density()
Notice that when there is only one variable it is always on the right side.
Other examples of functions that take just one variable include
`gf_dens()`, `gf_freqpoly()`, `gf_dotplot()`, `gf_bar()`, and `gf_qq()`.
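Here is what the one-sided formula looks like in practice, sketched on the `KidsFeet` data from mosaicData so the exercise below is left for you. It assumes ggformula and mosaicData are installed. The `bins = 10` setting is an arbitrary choice for this small data set.

```r
library(ggformula)
data(KidsFeet, package = "mosaicData")

# Only an x-variable is supplied; the bar heights (counts) are computed for us.
p_hist <- gf_histogram(~ length, data = KidsFeet, bins = 10)

# Same variable, shown as a smoothed density instead of binned counts.
p_dens <- gf_density(~ length, data = KidsFeet)

p_hist
p_dens
```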
Create some "one-variable" plots using the functions listed above and the template below. Which variable should you use?
gf____( ~ _____, data = Births1978)
Time to explore on your own. Here are three data sets you can use.
- `HELPrct` has data from a study of people addicted to alcohol, cocaine, or heroin.
- `KidsFeet` has information about some kids' feet.
- `NHANES` has lots of physiologic and other measurements from 10,000
subjects in the National Health and Nutrition Examination Survey. (This data
set will only be available if the NHANES package is installed where
this tutorial is running.)
To find out more about the data sets use `?HELPrct`, `?KidsFeet`, or `?NHANES`.
To see the first few rows of the data, you can use `head()`.
To get a list of functions available in `ggformula`, run this code chunk.
# list all functions starting gf_
apropos("gf_")
Make some plots to explore one or more of these data sets.
gf_point(length ~ width, data = KidsFeet)
head(KidsFeet)
?KidsFeet
`df_stats()` works much like the plotting functions in `ggformula` but it produces
a data frame of summary information rather than a plot. These summary data frames
can be useful for plotting or for obtaining some numerical precision for the patterns
we see in our plots.
Consider this plot
gf_boxplot(births ~ wday, data = Births1978)
We can see clear trends, but what if we want to know the mean and standard deviation, or the median and IQR for the number of births on each of the seven weekdays?
Change `gf_boxplot()` to `df_stats()`.
gf_boxplot(births ~ wday, data = Births1978)
The default use of `df_stats()` computes some of the most common summary statistics.
But we can use `df_stats()` to create other numerical summaries as well.
df_stats(births ~ wday, data = Births1978, median, iqr)
Modify the code above to compute some other summaries of your choosing.
Births1978 |> df_stats(births ~ wday, median, iqr)
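The same pattern works on any data set. As a sketch, here it is on `KidsFeet` from mosaicData, assuming the mosaic package is loaded as in this tutorial's setup. The choice of `mean` and `sd` and the name `feet_summary` are only for illustration.

```r
library(mosaic)
data(KidsFeet, package = "mosaicData")

# One row per group (here, per value of `sex`), one column per statistic.
feet_summary <- df_stats(length ~ sex, data = KidsFeet, mean, sd)

feet_summary
```

Because the result is a regular data frame, it can be passed as `data =` to another `gf_` layer.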
In addition to the usual plot elements (lines, points, etc.), sometimes it is
useful to add text with `gf_text()` or `gf_label()`.
See if you can figure out what this does before you execute it.
Births_summary <- df_stats(births ~ wday, data = Births1978)
gf_boxplot(births ~ wday, data = Births1978) |>
  gf_point(mean ~ wday, color = "red", data = Births_summary) |>
  gf_label(6400 ~ wday, label = ~ round(median), data = Births_summary) |>
  gf_text(6700 ~ wday, label = ~ round(mean), data = Births_summary, color = "red")
If you are fussy about your plots, you may be wondering how to have more control over things like:
As you can imagine, all of these things can be adjusted pretty much however you like. But we'll save that for a separate tutorial on customizing your plots.
Not biological anatomy, but rather plot anatomy. In order to talk about plots, it is handy to have words for their various components.
A frame is the bounding rectangle in which the plot is constructed. This
is essentially what you may think of as the x- and y-axes, but the plot isn't actually
required to have visible "axes".
glyphs or marks are the particular symbols placed on the plot (examples: dots, lines, smiley faces, your favorite emoji, or whatever gets drawn).
Each glyph has a number of attributes or properties.
scales map raw data over to attributes of the plot.
guides go in the other direction and help the human map graphical attributes back to the raw data. Guides include what you might call a legend, but also things like axis labeling.
facets are coordinated subplots.