knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "README_files/README-", fig.width=8, fig.height=8, fig.units="in" )
This package provides a collection of custom ggplot2
extensions - Geoms
, Stats
, Coords
, Themes
, etc. The name grcdr
is a contraction of Graphs with Code and Data in R and is a reference to the Graphs with Code and Data blog, to which this package is loosely related.
You can install grcdr
from github with:
# install.packages("devtools") devtools::install_github("bdilday/grcdr")
The below will use dplyr
and ggplot2
library(grcdr) ## load basics library(dplyr) library(ggplot2) ## set the theme ggplot2::theme_set(theme_minimal(base_size = 14))
geom_excursion
plots running quantities as connected scatter plots. It requires an x
and y
aesthetic and also a t
aesthetic to give the ordering (i.e. "time")
Here's some example data provided with the package that gives team-level stats for 4 MLB teams (Cleveland 1999, New York Yankees 1998, Houston Astros 2017, Milwaukee Brewers 1982).
``` {r runs1}
teams_df = read.csv(system.file("extdata/team_stats.csv", package = "grcdr"), stringsAsFactors = FALSE) teams_df$game_date = as.Date(teams_df$game_date) head(teams_df, 2)
We can plot runs scored (`b_r`) on the x-axis and runs allowed (`p_r`) on the y-axis. The time coordinate is the season game number. ``` {r} base_plot = teams_df %>% ggplot() + labs(x="runs scored", y="runs allowed")
By default the data aren't averaged (the run length is 1)
p = base_plot + geom_excursion(aes(x=b_r, y=p_r, t=season_game_number)) + facet_wrap(~k) print(p)
The run_length
parameter controls how many items get summed. The following uses run_length = 10
p = base_plot + geom_excursion(aes(x=b_r, y=p_r, t=season_game_number), run_length = 10) + facet_wrap(~k) print(p)
If you give x_weight
or y_weight
aesthetics then weighted averages are computed instead of sums. Passing _weight = 1
therefore results in straight averages.
p = base_plot + geom_excursion(aes(x=b_r, y=p_r, t=season_game_number, x_weight=1, y_weight=1), run_length = 10) + facet_wrap(~k) print(p)
Additionally, the time ordering can be changed
set.seed(101) random_idx = sample(1:nrow(teams_df), nrow(teams_df)) # order by random p = teams_df %>% cbind.data.frame(random_idx=random_idx) %>% ggplot() + labs(x="runs scored", y="runs allowed") + geom_excursion(aes(x=b_r, y=p_r, t=random_idx, x_weight=1, y_weight=1), run_length = 10) + facet_wrap(~k) print(p)
stat_run
is a lower-level utility than geom_excursion
. It can change the geom
(from path
) and also can plot running line charts in addition to connected scatterplots.
Here I set the y aesthetic but not x, which is effectively a line chart.
p = teams_df %>% ggplot() + stat_run(aes(y=b_r, t=season_game_number)) + facet_wrap(~k) + labs(x="game number", y="runs scored") print(p)
However, unlike a traditional line chart, I can average or sum the y variable
p = teams_df %>% ggplot() + stat_run(aes(y=b_r, t=season_game_number), run_length = 10) + facet_wrap(~k) + labs(x="game number", y="runs scored") print(p)
The default is to generate running totals by using the cumulative sum function and taking differences. A different cumulative aggregation function can be specified, however. Note that following is for illustration and that the run_fun doesn't correspond to any particularly useful quantity(that I'm aware of, anyway).
fun_with_cumsum_fun = function(x) { cumsum(x * (x - 1) * sin(x / 5 * pi)) } p = teams_df %>% ggplot() + stat_run(aes(y=p_r, t=season_game_number), run_length = 10, y_run_fun = fun_with_cumsum_fun) + facet_wrap(~k) + labs(x="game number", y="runs scored") print(p)
Setting the x
aesthetic reproduces a geom_excursion
p = teams_df %>% ggplot() + stat_run(aes(x=b_r, y=p_r, t=season_game_number), run_length = 10) + facet_wrap(~k) + labs(x="runs scored", y="runs allowed") print(p)
If the time coordinate is missing values, we can fill in the corresponding x and y.
censored_df = teams_df %>% filter(season_game_number < 40 | season_game_number >60) p = censored_df %>% ggplot() + stat_run(aes(y=b_r, t=season_game_number, x=season_game_number)) + facet_wrap(~k) print(p) p = censored_df %>% ggplot() + stat_run(aes(y=b_r, t=season_game_number, x=season_game_number), run_length = 10) + facet_wrap(~k) print(p) p = censored_df %>% ggplot() + stat_run(aes(y=b_r, t=season_game_number, x=season_game_number), run_length = 10, y_run_fill_value = 20, x_run_fill_value = 50) + facet_wrap(~k) print(p)
The geom_excursion
layer forces a path
Geom
, but the lower-level stat_run
layer can change the Geom
. For example it can use polygon
(although unclear what the interpretation is)
p = teams_df %>% ggplot() + stat_run(aes(x=b_r, y=p_r, t=game_date), geom='polygon') + facet_wrap(~k) print(p)
This stat applies dimensionality reduction using multi-dimensional scaling. As of this writing the available algorithms are principal components analysis (pca
) or t-distributed stochastic neighbor embedding (tsne
). The variables to use in the dimensionality reduction are passed in the aesthetics x#
where #
is an arbitrary integer. The default geom
is GeomPoint
.
set.seed(101) df1 = data.frame(x1 = rnorm(100)) for (i in 2:10) { k = sprintf("x%d", i) df1[,k] = rnorm(100) } # now, for the last 25 add a constant to create two well separated groups df1[75:100, ] = df1[75:100,] + 2
Use only 2 variables
set.seed(101) p = df1 %>% ggplot(aes(x1=x1, x2=x2)) + stat_mds(mds_method = "pca") print(p)
Use them all
set.seed(101) p = df1 %>% ggplot(aes(x1=x1, x2=x2, x3=x3, x4=x4, x5=x5, x6=x6, x7=x7, x8=x8, x9=x9, x10=x10)) + stat_mds(mds_method = "pca") print(p)
Use them all and label them
set.seed(101) p = df1 %>% mutate(rn=row_number()) %>% ggplot(aes(x1=x1, x2=x2, x3=x3, x4=x4, x5=x5, x6=x6, x7=x7, x8=x8, x9=x9, x10=x10)) + stat_mds(mds_method = "pca", geom="text", aes(label=rn)) print(p)
Apply t-SNE. This requires the Rtsne
package.
set.seed(101) p = df1 %>% mutate(rn=row_number()) %>% ggplot(aes(x1=x1, x2=x2, x3=x3, x4=x4, x5=x5, x6=x6, x7=x7, x8=x8, x9=x9, x10=x10)) + stat_mds(mds_method = "tsne", geom="text", aes(label=rn)) print(p)
This geom implements a tail scatter plot. It is inspired by the xenographics project. The x
and y
aesthetics are points in a two-d plane. Subsequent variables are passed in aesthetics named x#
where x is an arbitrary integer. They do not need to start at 1
, however, the order will be interpreted lexigraphically. The x#
variables are mapped to lines extending at an angle of -(15 + 30 * i)
degrees. This means that variables trying to use 12 or more variables in addition to x
and y
is not supported at this time and will result in lines that overlap.
Some simulated data
set.seed(101) df1 = data.frame(x1 = rnorm(100), x2 = rnorm(100)) df1$x3 = with(df1, x1**2 + abs(x2)) df1$x4 = 100 * df1$x1 ** 2 # make a categorical var df1$g = factor(sample(c(0,1), 100, replace = TRUE))
Plot with geom_tailscatter
p = df1 %>% ggplot(aes(x=x1, y=x2, x3=x3, x4=x4)) + geom_tailscatter(size=2) print(p)
The parameter tail_scale
controls the length of the tail lines
p = df1 %>% ggplot(aes(x=x1, y=x2, x3=x3, x4=x4)) + geom_tailscatter(size=2, tail_scale = 0.5) print(p)
Color by group
p = df1 %>% ggplot(aes(x=x1, y=x2, x3=x3, x4=x4, color=g)) + geom_tailscatter(size=2) print(p)
tsne_linked
is an htmlwidget
. It takes a data set, projects it into 2-dimensions using the t-SNE
algorithm, and then plots a 2-d scatter plot. The points in the scatter plot are linked to a bar graph that shows the values of the coordinates that went into the t-SNE
calculation. The scatter plot uses a Voronoi tessellation to make the mouse-over highlighting smoother.
simulated data
set.seed(101) df1 = data.frame(x1 = rnorm(100), x2 = rnorm(100)) df1$x3 = rnorm(100) df1$x4 = rnorm(100) df1$id = row.names(df1) df1$g = ifelse(df1$x1 > 0, 1, 0) tsne_coords = c("x1", "x2", "x3", "x4")
{r eval=FALSE}
tsne_linked(df1,
tsne_coords = tsne_coords,
label_var = "id",
group_var = "g")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.