cwcomiskey/varyres: Variable-Resolution Heat Maps

library(varyres)
library(ggplot2)

Variable-resolution (VR) heat maps are a solution to the problem of choosing a resolution for a heat map. They use a resolution appropriate to each region of the domain, and they convey the relative abundance of data in those regions.

Consider the scatterplot in (a) below, which shows the locations of pitches to baseball player Jhonny Peralta in the MLB from 2008 to 2015. To visualize his success at various locations, we might create a heat map of his success rate by gridding the domain and calculating the empirical success rate of pitches in each grid cell. Examples of the resulting heat map for resolutions of 4x4 and 16x16 are shown in (b) and (c). Which resolution is more appropriate? In some areas of the domain, e.g. the center, the 4x4 resolution is too low, but in others, e.g. the edges, the 16x16 resolution is too high. A variable-resolution heat map, (d), solves this problem by allowing the resolution to be determined by the underlying concentration of data points.

library(ggplot2)

no_axis_labs <- theme(axis.title.x=element_blank(),
  axis.title.y=element_blank())

# a
ggplot(hitter, aes(x, y)) +
  geom_point(size = 0.25, alpha = 1/15) +
  coord_equal() +
  labs(title = "(a)") +
  no_axis_labs 

vr_1 <- varyres(data = hitter, cutoff = 1, max = 4)

# b
mapit(vr_1[[3]]) +
  no_axis_labs +
  labs(title = "(b)") 

# c
mapit(vr_1[[5]])  +
  no_axis_labs +
  labs(title = "(c)") 

vr_200 <- varyres(data = hitter, cutoff = 200, max = 4)

# d
mapit(vr_200[[5]]) +
  no_axis_labs +
  labs(title = "(d)")
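
The fixed-resolution gridding behind maps like (b) and (c) can be sketched in base R. This is a minimal illustration, not the package's code: toy is a simulated stand-in for hitter with the same x, y, and res columns, and grid_rates() is a hypothetical helper introduced here.

```r
# Simulated stand-in for pitch data (NOT the real hitter data set).
set.seed(1)
n <- 2000
toy <- data.frame(
  x   = rnorm(n, mean = 0,   sd = 0.8),      # horizontal location, feet
  y   = rnorm(n, mean = 2.5, sd = 0.8),      # vertical location, feet
  res = rbinom(n, size = 1, prob = 0.1)      # 1 = success, 0 = not
)

# Hypothetical helper: empirical success rate on a k x k grid
# laid over the bounding box of the data.
grid_rates <- function(d, k) {
  wx <- (max(d$x) - min(d$x)) / k            # cell width
  wy <- (max(d$y) - min(d$y)) / k            # cell height
  # Column / row index of each point; pmin() keeps the maximum in cell k.
  col <- pmin(floor((d$x - min(d$x)) / wx) + 1, k)
  row <- pmin(floor((d$y - min(d$y)) / wy) + 1, k)
  cell <- factor(paste(row, col, sep = "-"))
  data.frame(
    cell  = levels(cell),
    count = as.vector(table(cell)),              # sample size per occupied cell
    rate  = as.vector(tapply(d$res, cell, mean)) # empirical success rate
  )
}

g4  <- grid_rates(toy, 4)
g16 <- grid_rates(toy, 16)
summary(g4$count)    # sample sizes per occupied cell at 4x4
summary(g16$count)   # sample sizes per occupied cell at 16x16
```

Comparing the two count summaries shows the trade-off: a coarse grid pools many points into each central cell, while a fine grid leaves cells near the edges with very few points.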

This vignette introduces varyres, a package for creating these variable-resolution heat maps.

Installation

You can install varyres from GitHub with the install_github() function in the devtools package.

# install.packages("devtools")
devtools::install_github('cwcomiskey/varyres')

Usage

To illustrate the use of the varyres() function, we will use the hitter dataset that is provided with the varyres package (and used to create the motivating plots above).

Data: `hitter`

hitter contains one row (observation) for each swing that baseball player Jhonny Peralta took between 2008 and 2015; dim(hitter)[1] (equivalently, nrow(hitter)) returns the number of rows. Each observation includes a pitch location and a swing outcome. These are PITCHf/x data, which come from Sportvision, Inc. in conjunction with Major League Baseball Advanced Media.

library(varyres)
head(hitter)

hitter has four columns/variables for each swing.

x gives the horizontal location of the pitch as it passes through the strike zone, in feet from the middle of home plate.
y gives the vertical location of the pitch as it passes through the strike zone, in feet from the ground.
res is a variable that equals 1 if the swing was successful, and 0 if not.
des gives a short description of the play.
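
To make the column layout concrete, here is a tiny stand-in data frame with the same four columns. The values are invented for illustration and are not rows from hitter; toy_swings is a hypothetical name.

```r
# Invented rows that mimic the layout of hitter (not real data).
toy_swings <- data.frame(
  x   = c(-0.5, 0.1, 0.3, 1.2),   # feet from the middle of home plate
  y   = c(2.0, 2.6, 3.1, 1.4),    # feet from the ground
  res = c(0, 1, 0, 0),            # 1 = successful swing, 0 = not
  des = c("Swinging Strike", "In play, no out", "Foul", "Swinging Strike"),
  stringsAsFactors = FALSE
)

# Because res is coded 0/1, its mean is the empirical success rate.
mean(toy_swings$res)

# Tabulate the play descriptions.
table(toy_swings$des)
```

The same two calls applied to hitter give the overall success rate and the breakdown of play descriptions for the real data.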

A Variable-Resolution Heat Map

The function varyres(...) creates, from your data set, a VR-ready data frame. From there you can use ggplot2 (or your other favorite plotting package) to create the VR heat map. varyres provides the mapit() function as a quick way to generate a ggplot2 heat map from varyres() output.

For example, the subdivision presented above in (d) was calculated with:

vr <- varyres(data = hitter, cutoff = 200, fun = mean, max = 4)

The resulting object, vr, is a list in which each element describes one iteration of the subdivision algorithm. The last element contains the final iteration, which can then be displayed with mapit():

mapit(vr[[5]], g = TRUE) + 
  geom_text(aes(label = count), size = 2)

The VR map has finer resolution in the center, coarser resolution around the edges, and intermediate resolution in between, as needed. Also, notice how the box sizes implicitly convey the varying data concentration: bigger boxes correspond to less concentrated data, smaller boxes to more concentrated data. The sample sizes printed on the grid boxes (with geom_text()) show this correspondence explicitly.

Details

The cutoff argument gives the user control over the maximum box sample size. For example, cutoff = 100 means each box will have at most 100 observations, because boxes over the cutoff are subdivided further (provided the iteration limit set by max is not reached first). Using cutoff = 100 gives this map.

vr <- varyres(data = hitter, cutoff = 100, fun = mean, max = 5)
mapit(vr[[6]], g = TRUE)

Notice one prominent difference: smaller central boxes, due to further subdivision.

The max argument specifies the maximum number of iterations; the default is max = 6. It guarantees that the algorithm stops: if you accidentally specify cutoff = 0, for example, the subdivision would never end without this limit. The fun argument specifies which summary function to apply to each grid box. By default, fun = mean, so the statistic calculated for each box, and stored as the variable statistic, is the mean of the res variable.
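
How cutoff, max, and fun interact can be sketched in base R. The sketch assumes each iteration splits every box holding more than cutoff points into four equal quadrants, which is consistent with the 4x4 and 16x16 grids seen at iterations 3 and 5 earlier, but it is not the package's implementation. subdivide_sketch(), split_box(), the max_iter argument (standing in for max), and the simulated toy data are all hypothetical names introduced for illustration.

```r
# Simulated stand-in for hitter (NOT the real data).
set.seed(1)
n <- 2000
toy <- data.frame(x = rnorm(n, 0, 0.8), y = rnorm(n, 2.5, 0.8),
                  res = rbinom(n, size = 1, prob = 0.1))

# Split one box into four quadrants, partitioning its points exactly.
split_box <- function(b, d) {
  xmid <- (b$xmin + b$xmax) / 2
  ymid <- (b$ymin + b$ymax) / 2
  left <- d$x[b$idx] < xmid
  low  <- d$y[b$idx] < ymid
  list(
    list(xmin = b$xmin, xmax = xmid, ymin = b$ymin, ymax = ymid, idx = b$idx[left & low]),
    list(xmin = xmid, xmax = b$xmax, ymin = b$ymin, ymax = ymid, idx = b$idx[!left & low]),
    list(xmin = b$xmin, xmax = xmid, ymin = ymid, ymax = b$ymax, idx = b$idx[left & !low]),
    list(xmin = xmid, xmax = b$xmax, ymin = ymid, ymax = b$ymax, idx = b$idx[!left & !low])
  )
}

subdivide_sketch <- function(d, cutoff, fun = mean, max_iter = 6) {
  # Iteration 1: a single box covering all of the data.
  boxes <- list(list(xmin = min(d$x), xmax = max(d$x),
                     ymin = min(d$y), ymax = max(d$y),
                     idx = seq_len(nrow(d))))
  # Turn the current boxes into a data frame with count and statistic.
  to_df <- function(boxes) {
    do.call(rbind, lapply(boxes, function(b) data.frame(
      xmin = b$xmin, xmax = b$xmax, ymin = b$ymin, ymax = b$ymax,
      count = length(b$idx),
      statistic = if (length(b$idx) > 0) fun(d$res[b$idx]) else NA_real_
    )))
  }
  iterations <- list(to_df(boxes))
  for (i in seq_len(max_iter)) {
    too_big <- vapply(boxes, function(b) length(b$idx) > cutoff, logical(1))
    if (!any(too_big)) break            # every box is at or under the cutoff
    new_boxes <- list()
    for (b in boxes) {
      if (length(b$idx) > cutoff) {
        new_boxes <- c(new_boxes, split_box(b, d))  # subdivide further
      } else {
        new_boxes <- c(new_boxes, list(b))          # keep as is
      }
    }
    boxes <- new_boxes
    iterations[[length(iterations) + 1]] <- to_df(boxes)
  }
  iterations
}

iters <- subdivide_sketch(toy, cutoff = 100, max_iter = 5)
sapply(iters, nrow)   # number of boxes at each iteration

# With cutoff = 0 only the iteration limit stops the loop.
length(subdivide_sketch(toy, cutoff = 0, max_iter = 3))
```

Carrying each box's row indices along with its bounds means every point belongs to exactly one box at every iteration, so the counts always sum to the number of rows.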

Iterations

varyres(...) returns every subdivision iteration as a list, and viewing these steps helps clarify how the algorithm works. The sequence below, taken from the cutoff = 100, max = 5 call above, proceeds by subdividing boxes whose sample sizes exceed the cutoff.

mapit(vr[[1]]) + no_axis_labs
mapit(vr[[2]]) + no_axis_labs
mapit(vr[[3]]) + no_axis_labs 
mapit(vr[[4]]) + no_axis_labs 
mapit(vr[[5]]) + no_axis_labs 
mapit(vr[[6]]) + no_axis_labs