library(varyres)
library(ggplot2)
Variable-resolution (VR) heat maps are a solution to the problem of choosing a resolution for a heat map. They integrate region-appropriate resolutions and convey the relative abundance of data in those regions.
Consider the scatterplot in (a) below, which shows the locations of pitches to MLB player Jhonny Peralta from 2008 to 2015. To visualize his success at various locations, we might create a heat map of his success rate by gridding the domain and calculating the empirical success rate of pitches in each grid cell. The resulting heat maps for resolutions of 4x4 and 16x16 are shown in (b) and (c). Which resolution is more appropriate? In some areas of the domain, such as the center, the 4x4 resolution is too low, but in others, such as the edges, the 16x16 resolution is too high. A variable-resolution heat map, (d), solves this problem by allowing the resolution to be determined by the underlying concentration of data points.
library(ggplot2)

no_axis_labs <- theme(axis.title.x = element_blank(),
                      axis.title.y = element_blank())

# (a) scatterplot of pitch locations
ggplot(hitter, aes(x, y)) +
  geom_point(size = 0.25, alpha = 1/15) +
  coord_equal() +
  labs(title = "(a)") +
  no_axis_labs

vr_1 <- varyres(data = hitter, cutoff = 1, max = 4)

# (b) 4x4 heat map
mapit(vr_1[[3]]) + no_axis_labs + labs(title = "(b)")

# (c) 16x16 heat map
mapit(vr_1[[5]]) + no_axis_labs + labs(title = "(c)")

# (d) variable-resolution heat map
vr_200 <- varyres(data = hitter, cutoff = 200, max = 4)
mapit(vr_200[[5]]) + no_axis_labs + labs(title = "(d)")
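The fixed-resolution gridding described above can be sketched in base R. This is a minimal illustration, not code from varyres: the data are simulated stand-ins for the hitter dataset, and grid_rate() is a hypothetical helper introduced only for this example.

```r
# Simulated stand-in for pitch data (NOT the hitter dataset):
# x and y are locations, res is a 0/1 swing outcome.
set.seed(1)
n <- 2000
pitches <- data.frame(
  x   = rnorm(n, mean = 0,   sd = 0.8),
  y   = rnorm(n, mean = 2.5, sd = 0.8),
  res = rbinom(n, size = 1, prob = 0.3)
)

# Hypothetical helper: empirical success rate and sample size
# for every cell of a k-by-k grid over the observed range.
grid_rate <- function(d, k) {
  # k + 1 equally spaced break points along each axis
  bx <- seq(min(d$x), max(d$x), length.out = k + 1)
  by <- seq(min(d$y), max(d$y), length.out = k + 1)
  # Index of the grid column (ix) and row (iy) containing each point
  ix <- cut(d$x, breaks = bx, include.lowest = TRUE, labels = FALSE)
  iy <- cut(d$y, breaks = by, include.lowest = TRUE, labels = FALSE)
  # Keep all k levels so that empty cells still appear in the output
  ix <- factor(ix, levels = seq_len(k))
  iy <- factor(iy, levels = seq_len(k))
  list(
    statistic = tapply(d$res, list(ix, iy), mean),  # NA for empty cells
    count     = table(ix, iy)
  )
}

g4  <- grid_rate(pitches, 4)
g16 <- grid_rate(pitches, 16)
```

Comparing g4$count with g16$count shows the trade-off: the coarse grid pools many points into each central cell, while the fine grid leaves edge cells with few or no points, so their success rates are noisy or undefined.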
This vignette introduces varyres, a package to create these variable-resolution heat maps.
You can download varyres from GitHub with a function in the devtools package.
# install.packages("devtools")
devtools::install_github('cwcomiskey/varyres')
To illustrate the use of the varyres() function, we will use the hitter dataset that is provided with the varyres package (and used to create the motivating plots above).
hitter
hitter contains one row (observation) for each swing that baseball player Jhonny Peralta took between 2008 and 2015; dim(hitter)[1] returns the number of rows. Each observation includes a pitch location and a swing outcome. The data are PITCHf/x data from Sportvision, Inc., in conjunction with Major League Baseball Advanced Media.
library(varyres)
head(hitter)
hitter has four columns/variables for each swing.
x gives the horizontal location of the pitch as it passes through the strike zone, in feet from the middle of home plate.
y gives the vertical location of the pitch as it passes through the strike zone, in feet from the ground.
res is a variable that equals 1 if the swing was successful, and 0 if not.
des gives a short description of the play.

The function varyres(...) creates, from your data set, a VR-ready data frame. From there you can use ggplot2 (or your other favorite plotting package) to create the VR heat map. varyres provides the mapit() function as a quick way to generate a ggplot2 heat map from varyres() output.
For example, the subdivision presented above in (d) was calculated with:
vr <- varyres(data = hitter, cutoff = 200, fun = mean, max = 4)
The resulting object, vr, is a list in which each element describes one iteration of the subdivision algorithm. The last element contains the final iteration and can be displayed with mapit():
mapit(vr[[5]], g = TRUE) + geom_text(aes(label = count), size = 2)
The VR map has finer resolution in the center, coarser resolution around the edges, and intermediate resolution in between as needed. Also, notice how the box sizes implicitly convey the varying data concentration: bigger boxes correspond to less concentrated data, smaller boxes to more concentrated data. The sample sizes printed on the grid boxes (with geom_text()) explicitly show this correspondence.
The cutoff argument gives the user control over the maximum box sample size. For example, cutoff = 100 means that any box with more than 100 observations is subdivided further, so each box ends up with at most 100 observations unless the iteration limit (max, described below) stops the subdivision first. Using cutoff = 100 gives this map.
vr <- varyres(data = hitter, cutoff = 100, fun = mean, max = 5)
mapit(vr[[6]], g = TRUE)
Notice one prominent difference: smaller central boxes, due to further subdivision.
The max argument specifies a maximum number of iterations; the default is max = 6. For example, if you accidentally specified cutoff = 0, the algorithm would never stop without this limit. The fun argument specifies which summary function to apply to each grid box. By default, fun = mean, so the statistic calculated for each box, and stored as the variable statistic, is the mean of the res variable.
varyres(...) returns every subdivision iteration, and viewing these steps helps show how the algorithm works. Here the sequence proceeds by subdividing boxes whose sample sizes are above cutoff = 100.
mapit(vr[[1]]) + no_axis_labs
mapit(vr[[2]]) + no_axis_labs
mapit(vr[[3]]) + no_axis_labs
mapit(vr[[4]]) + no_axis_labs
mapit(vr[[5]]) + no_axis_labs
mapit(vr[[6]]) + no_axis_labs
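The iteration sequence above can be mimicked with a short base R sketch. It is an illustration under stated assumptions, not the varyres source: it assumes each over-full box is split into four equal quadrants (consistent with the 4x4 and 16x16 grids shown earlier, but not confirmed as the package's implementation), it uses simulated data, and summarise_box(), subdivide_sketch(), and the max_iter argument (standing in for max) are hypothetical names.

```r
# Illustrative sketch only; this is NOT the varyres source code.
set.seed(42)
n <- 3000
pts <- data.frame(
  x   = rnorm(n),
  y   = rnorm(n),
  res = rbinom(n, size = 1, prob = 0.3)
)

# Hypothetical helper: count and summary statistic for one box.
# Boxes are half-open, [xmin, xmax) by [ymin, ymax).
summarise_box <- function(d, xmin, xmax, ymin, ymax, fun) {
  inside <- d$x >= xmin & d$x < xmax & d$y >= ymin & d$y < ymax
  data.frame(
    xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
    count = sum(inside),
    statistic = if (any(inside)) fun(d$res[inside]) else NA_real_
  )
}

# Hypothetical helper: split boxes holding more than `cutoff` points
# into four quadrants, for at most `max_iter` iterations.
subdivide_sketch <- function(d, cutoff, fun = mean, max_iter = 6) {
  eps <- 1e-8  # pad the upper edges so half-open boxes cover every point
  boxes <- summarise_box(d, min(d$x), max(d$x) + eps,
                         min(d$y), max(d$y) + eps, fun)
  iterations <- list(boxes)  # element 1 is the undivided domain
  for (i in seq_len(max_iter)) {
    too_full <- boxes$count > cutoff
    if (!any(too_full)) break  # every box is at or below the cutoff
    quadrants <- lapply(which(too_full), function(j) {
      b  <- boxes[j, ]
      xm <- (b$xmin + b$xmax) / 2
      ym <- (b$ymin + b$ymax) / 2
      rbind(
        summarise_box(d, b$xmin, xm, b$ymin, ym, fun),
        summarise_box(d, xm, b$xmax, b$ymin, ym, fun),
        summarise_box(d, b$xmin, xm, ym, b$ymax, fun),
        summarise_box(d, xm, b$xmax, ym, b$ymax, fun)
      )
    })
    # Keep the boxes that were small enough and add the new quadrants
    boxes <- rbind(boxes[!too_full, ], do.call(rbind, quadrants))
    iterations[[i + 1]] <- boxes
  }
  iterations
}

vr_sketch <- subdivide_sketch(pts, cutoff = 100, max_iter = 5)
final <- vr_sketch[[length(vr_sketch)]]
```

Each element of vr_sketch holds the boxes after one iteration, and indexing with length(vr_sketch) retrieves the final iteration without hard-coding its position. The iteration cap plays the role described for max: with cutoff = 0 the loop would otherwise keep splitting any box that still contains data.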