quantileplot: Create a Smooth Quantile Plot

quantileplot    R Documentation

Description

Creates a bivariate, smooth quantile plot. This is the central function of the quantileplot package. This plot visualizes estimates of the marginal density of the predictor, the conditional density of the outcome at selected values of the predictor, and smooth curves showing quantiles of the outcome as smooth functions of the predictor. This package is described in greater depth by Lundberg, Lee, and Stewart (2021), which is a generalization of Lundberg and Stewart (2020). The statistical core of the package relies on the methods of Fasiolo et al. (2020).

Usage

quantileplot(
  formula,
  data,
  weights = NULL,
  quantiles = c(0.1, 0.25, 0.5, 0.75, 0.9),
  slice_n = 7,
  show_ci = FALSE,
  quantile_notation = "legend",
  xlab = NULL,
  ylab = NULL,
  x_data_range = NULL,
  y_data_range = NULL,
  x_axis_range = NULL,
  y_axis_range = NULL,
  x_breaks = NULL,
  y_breaks = NULL,
  x_labels = ggplot2::waiver(),
  y_labels = ggplot2::waiver(),
  x_bw = NULL,
  y_bw = NULL,
  truncation_notation = "label",
  credibility_level = 0.95,
  uncertainty_draws = NULL,
  inverse_transformation = NULL,
  granularity = 512,
  second_formula = NULL,
  argGam = NULL,
  previous_fit = NULL,
  ...
)

Arguments

formula

A bivariate model formula (e.g. y ~ s(x))

data

Data frame containing the variables in formula. If weights are specified, they must be a column of data.

weights

String name for sampling weights, which are a column of data. If not given, a simple random sample is assumed.

quantiles

Numeric vector containing quantiles to be estimated. Values should be between 0 and 1.

slice_n

Integer number of vertical slices (conditional densities of y given x) to be plotted. Default is 7.

show_ci

Logical, defaults to FALSE. Whether to show credible intervals for the estimated smooth quantile curves.

quantile_notation

String, either legend or label. If legend (the default), then quantile curves are denoted by colors with a legend. If label, then quantile curves are annotated in the plot.

xlab

String x-axis title

ylab

String y-axis title

x_data_range

Numeric vector of length 2 containing the range of horizontal values to be plotted. Defaults to the range of the predictor variable in data. You may want to specify a narrower range if the predictor is extremely skewed. Quantile curves and densities will be estimated only on data in this range, and the plot will note the percent truncated.

y_data_range

Numeric vector of length 2 containing the range of vertical values to be plotted. Defaults to the range of the outcome variable in data. You may want to specify a narrower range if the outcome is extremely skewed. Densities are truncated to this range. All data contribute to quantile curve estimation regardless of y_data_range to avoid selection on the outcome, though the visualization is truncated to y_data_range. The plot will note the percent truncated.
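
As a rough illustration of what the "percent truncated" note refers to, the base R sketch below computes the share of observations that fall outside a chosen x_data_range and y_data_range. The helper pct_outside and the example ranges are hypothetical; the exact calculation used inside quantileplot is not shown in this documentation, so treat this as one plausible reading rather than the package's implementation.

```r
set.seed(1)
x <- rbeta(1000, 1, 2)
y <- log(1 + 9 * x) * rbeta(1000, 1, 2)

# Hypothetical helper: percent of values outside a length-2 range
pct_outside <- function(v, range) {
  100 * mean(v < range[1] | v > range[2])
}

# Example ranges chosen only for illustration
x_data_range <- c(0, 0.8)
y_data_range <- c(0, 1.5)

pct_outside(x, x_data_range)  # percent of predictor values outside the range
pct_outside(y, y_data_range)  # percent of outcome values outside the range
```

Per the argument descriptions above, observations outside x_data_range are dropped before estimation, whereas observations outside y_data_range still contribute to the quantile curves and are only hidden from the display.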

x_axis_range

Numeric vector of length 2 for custom x-axis limits. This affects the plotting area but does not affect the data analyzed or displayed. To truncate the data, use x_data_range.

y_axis_range

Numeric vector of length 2 for custom y-axis limits. This affects the plotting area but does not affect the data analyzed or displayed. To truncate the data, use y_data_range.

x_breaks

Numeric vector of values for x-axis breaks. Alternatively, customize after producing the plot by modifying the resulting ggplot2 object. See vignette for examples.

y_breaks

Numeric vector of values for y-axis breaks. Alternatively, customize after producing the plot by modifying the resulting ggplot2 object. See vignette for examples.

x_labels

Vector of length(x_breaks) containing labels, or a function to convert breaks into labels. Alternatively, customize after producing the plot by modifying the resulting ggplot2 object. See vignette for examples.

y_labels

Vector of length(y_breaks) containing labels, or a function to convert breaks into labels. Alternatively, customize after producing the plot by modifying the resulting ggplot2 object. See vignette for examples.

x_bw

Numeric bandwidth for density estimation in the x dimension. The standard deviation of a Gaussian kernel. If NULL, this is set by the defaults in stats::density().

y_bw

Numeric bandwidth for density estimation in the y dimension. The standard deviation of a Gaussian kernel. If NULL, this is set by the defaults in stats::density().
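
To see what the stats::density() default looks like in practice, the base R sketch below compares the bandwidth chosen automatically with one set by hand. It assumes that a NULL x_bw or y_bw corresponds to calling stats::density() without a bw argument, as the descriptions above suggest; how quantileplot applies the bandwidth internally is not shown here.

```r
set.seed(1)
x <- rbeta(1000, 1, 2)

# With no bandwidth supplied, stats::density() uses the rule bw = "nrd0"
d_default <- stats::density(x)
d_default$bw
stats::bw.nrd0(x)  # the same rule, computed directly

# A manual bandwidth: the standard deviation of the Gaussian kernel
d_wide <- stats::density(x, bw = 2 * d_default$bw)
d_wide$bw
```

A value obtained this way could then be adjusted and passed as x_bw or y_bw when the default smooths too much or too little.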

truncation_notation

String, one of label, label_no_pct, or none. If x_data_range or y_data_range is narrower than the range of the data, this argument specifies how to note that truncation on the visualization. If label, then truncation is labeled including the percent of data truncated. If label_no_pct, then truncation is labeled but the percent truncated is omitted. If none, then truncation is not labeled on the plot.

credibility_level

Numeric probability value for credible intervals; defaults to 0.95 to produce 95 percent credible intervals. Only relevant if show_ci = TRUE.

uncertainty_draws

A whole number. If non-null, the number of simulated posterior draws to estimate for each smooth quantile curve. When used with the plot function, these appear in panels below the main plot.

inverse_transformation

A function of a scalar argument. Only used in the rare case where the outcome has an extremely skewed distribution and the user wants to estimate the quantile curves on a transformed outcome, then bring them back to the original scale for the visualization. In that case, this argument is the function that converts the transformed outcome back to the original scale. For instance, if the outcome in the model formula is log(y + 1), then the inverse transformation should be function(y) exp(y) - 1. This use case is rare because it is only relevant when a transformation of the outcome aids the estimation of quantile curves. If you want to visualize on a transformed scale, you should instead create a transformed variable in data rather than conducting the transformation within the model formula. For common transformations (e.g. log(y)), the inverse_transformation argument can be left NULL and will be determined automatically.
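
A quick way to check a candidate inverse before passing it to inverse_transformation is a round trip in base R. The sketch below uses the log(y + 1) example from the description; the names forward and inverse are chosen here for illustration only.

```r
# Forward transformation as it would appear in the model formula: log(y + 1)
forward <- function(y) log(y + 1)
# Matching inverse, as given in the argument description
inverse <- function(y) exp(y) - 1

y <- c(0, 0.5, 10, 1000)
# The round trip should recover the original scale up to numerical tolerance
all.equal(inverse(forward(y)), y)

# Hedged usage, following the argument description (not run here):
# quantileplot(log(y + 1) ~ s(x), data,
#              inverse_transformation = function(y) exp(y) - 1)
```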

granularity

Integer number of points at which to evaluate each density. Defaults to 512, as in stats::density(). Higher values yield more granular density estimates.
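
The effect of the number of evaluation points can be seen directly with stats::density(), whose n argument also defaults to 512. This sketch assumes that granularity plays the same role as n; that mapping is inferred from the description above rather than confirmed from the package source.

```r
set.seed(1)
x <- rbeta(1000, 1, 2)

d_default <- stats::density(x)         # n defaults to 512
d_fine <- stats::density(x, n = 2048)  # more evaluation points

length(d_default$x)  # number of points at which the density is evaluated
length(d_fine$x)
```

More points give a smoother-looking curve at the cost of larger density data frames; the bandwidth, not the number of points, controls how much the estimate itself is smoothed.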

second_formula

Model formula to allow the learning rate to change as a function of the predictor. This is passed to mqgam as the second element in the form argument. Defaults to the same specification as formula but without the outcome variable.

argGam

Additional arguments to the GAM for model fitting. Passed to mqgam.

previous_fit

The result of a previous call to quantileplot. If provided, then the mqgam fit for the quantile curves will not be re-estimated, which can be useful for iteratively deciding about other arguments in settings that are computationally demanding. This argument must be paired with other arguments that match the previous call (e.g. data, formula).
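
The workflow described above might look like the following sketch, which fits once and then changes only a display argument. The simulated data mirror the Examples section; the object names fit and fit_labeled are illustrative, and the calls are guarded so they run only if the quantileplot package is installed.

```r
set.seed(1)
x <- rbeta(1000, 1, 2)
data <- data.frame(x = x, y = log(1 + 9 * x) * rbeta(1000, 1, 2))

if (requireNamespace("quantileplot", quietly = TRUE)) {
  library(quantileplot)
  # First call estimates the quantile curves (the computationally heavy step)
  fit <- quantileplot(y ~ s(x), data)
  # Later call reuses the fit; data and formula match the first call
  fit_labeled <- quantileplot(y ~ s(x), data,
                              previous_fit = fit,
                              quantile_notation = "label")
}
```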

...

Other arguments passed to mqgam.

Value

An object of S3 class quantileplot, which supports summary(), print(), and plot() functions. The returned object has several elements.

  • plot is a ggplot2 object. This contains the most basic plot. The user can customize this output by passing additional layers to quantileplot.out$plot as they would for any ggplot2 object.

  • sim_curve_plots is a list object of ggplot2 objects, one for each quantile curve, which shows the point estimate for the curve in black and a series of simulated posterior samples in gray.

  • densities is a list of length four.

    • marginal and conditional are data frames containing the estimated marginal and conditional densities.

    • x_bw and y_bw are the bandwidths used for Gaussian kernel density estimation.

  • curves is a data frame containing the estimated quantile curves.

  • mqgam.out is the output from the call to the mqgam function in the qgam package, which is used to estimate the quantile curves.

  • x_data_range and y_data_range are the horizontal and vertical ranges of the plot.

  • slice_x_values are the predictor values at which vertical conditional densities are estimated.

  • call is the user's call that produced these results.

  • arguments is a list of all the arguments to the function, including those specified by the user and those specified by defaults.
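
As a sketch of how the returned object might be used, the snippet below fits the plot on simulated data, inspects a few of the elements listed above, and adds a standard ggplot2 layer to the plot element. The title text and object names are illustrative, and the code is guarded so it runs only when both quantileplot and ggplot2 are installed.

```r
set.seed(1)
x <- rbeta(1000, 1, 2)
data <- data.frame(x = x, y = log(1 + 9 * x) * rbeta(1000, 1, 2))
plot_title <- "Outcome quantiles by predictor"  # illustrative title

if (requireNamespace("quantileplot", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(quantileplot)
  fit <- quantileplot(y ~ s(x), data)

  summary(fit)          # S3 summary method mentioned above
  head(fit$curves)      # estimated quantile curves
  fit$densities$x_bw    # bandwidth used in the x dimension

  # Customize the basic plot as with any ggplot2 object
  custom_plot <- fit$plot + ggplot2::ggtitle(plot_title)
}
```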

References

Lundberg, Ian, Robin C. Lee, and Brandon M. Stewart. 2021. "The quantile plot: A visualization for bivariate population relationships." Working paper.

Lundberg, Ian, and Brandon M. Stewart. 2020. "Comment: Summarizing income mobility with multiple smooth quantiles instead of parameterized means." Sociological Methodology 50(1):96-111.

Fasiolo, Matteo, Simon N. Wood, Margaux Zaffran, Raphaël Nedellec, and Yannig Goude. 2020. "Fast calibrated additive quantile regression." Journal of the American Statistical Association.

Examples

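library(quantileplot)
set.seed(1)  # make the simulated data reproducible
# Simulate a right-skewed predictor and an outcome whose scale grows with x
x <- rbeta(1000, 1, 2)
y <- log(1 + 9 * x) * rbeta(1000, 1, 2)
data <- data.frame(x = x, y = y)
# Estimate and draw the smooth quantile plot with default settings
quantileplot(y ~ s(x), data)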

ilundberg/quantileplot documentation built on May 23, 2022, 3:12 a.m.