plotluck: "I'm feeling lucky" for ggplot


Description

The purpose of plotluck is to let the user focus on what to plot, and automate the how. Given a dependency formula with up to three variables, it tries to choose the most suitable type of plot. It also automates sampling large datasets, correct handling of observation weights, logarithmic axis scaling, ordering and pruning of factor levels, and overlaying smoothing curves or median lines.

Usage

plotluck(data, formula, weights, opts = plotluck.options(), ...)

Arguments

data

a data frame.

formula

an object of class formula: a symbolic description of the relationship of up to three variables.

Formula          Meaning                          Plot types
y~1              Distribution of single variable  Density, histogram, scatter, dot, bar
y~x              One explanatory variable         Scatter, hex, violin, box, spine, heat
y~x+z            Two explanatory variables        Heat, spine
y~1|z or y~x|z   One conditional variable         Represented through coloring or facetting
y~1|x+z          Two conditional variables        Represented through facetting

In addition to these base plot types, the dot symbol "." can also be used, and denotes all variables in the data frame. This gives rise to a lattice or series of plots (use with caution, can be slow).

Formula   Meaning
.~1       Distribution of each variable in the data frame, separately
y~.       Plot y against each variable in the data frame
.~x       Plot each variable in the data frame against x
.~.       Plot each variable in the data frame against each other

See also section "Generating multiple plots at once" below.

weights

observation weights or frequencies (optional).

opts

a named list of options (optional); see also plotluck.options.

...

additional parameters to be passed to the respective ggplot2 geom objects.

Value

a ggplot object, or a plotluck.multi object if the dot symbol was used.

Determining the type of plot

Besides the shape of the formula, the algorithm takes into account the type of each variable: numeric, ordered factor, or unordered factor. Often, it makes sense to treat ordered factors similarly to numeric types.

One-variable numeric (resp. factor) distributions are usually represented by density (resp. Cleveland dot) charts, but can be overridden to histograms or bar plots using the geom option. Density plots come with an overlaid vertical median line.

For two numerical variables, by default a scatter plot is produced, but for high numbers of points a hexbin is preferred (option min.points.hex). These plots come with a smoothing line and standard deviation.

The relation between two factor variables is best depicted by a spine (a.k.a. mosaic) plot, unless the factors have too many levels (options max.factor.levels.spine.x, max.factor.levels.spine.y, max.factor.levels.spine.z); in that case, a heat map is produced instead.

For a mixed-type (factor/numeric) pair of variables, violin (overridable to box) plots are generated. However, if the resulting graph would contain too many (more than max.factor.levels.violin) violin plots in a row, the algorithm switches automatically. The number of bins of a histogram can be customized with n.breaks.histogram. The default setting, NA, applies a heuristic estimate.

The case of a response and two explanatory variables ('y~x+z') is covered by either a spine plot (if all are factors) or a heat map.

When there are too few points for one of the aggregate plot types, a scatter plot usually looks better (options min.points.density, min.points.violin, min.points.hex).

If each factor combination occurs only once in the data set, we resort to bar plots.

Conditional variables

Conditional variables are represented either by coloring within the same graph (governed by max.factor.levels.color) or by facetting (preferred dimensions facet.num.wrap for one conditional variable and facet.num.grid for two). Numeric vectors are discretized accordingly. Facets are laid out horizontally or vertically according to the plot type, up to maximum dimensions of facet.max.rows and facet.max.cols.
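The discretization of a numeric conditional variable can be pictured with a small base R sketch. It assumes quantile-based bins, which is only one plausible choice; the helper `discretize` and its `n_bins` argument are hypothetical and merely stand in for whatever plotluck derives from facet.num.wrap or facet.num.grid.

```r
# Hypothetical helper: cut a numeric vector into `n_bins` ordered bins
# with roughly equal counts (quantile binning is an assumption here).
discretize <- function(x, n_bins) {
  probs  <- seq(0, 1, length.out = n_bins + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
  cut(x, breaks = breaks, include.lowest = TRUE, ordered_result = TRUE)
}

# Example with a built-in data set: horsepower split into three ranges.
z <- discretize(mtcars$hp, n_bins = 3)
table(z)
```

Each resulting level could then serve as one facet or one color.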

Reordering of factor levels

To better illustrate the relation between an independent factor variable and a dependent numerical variable (or an ordered factor), levels are reordered according to the value of the dependent variable. If no other numeric or ordered variable exists, we sort by frequency.
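A minimal base R sketch of this kind of reordering follows. The use of the median as the per-level summary and the toy data are assumptions made for illustration; the criterion plotluck applies internally may differ.

```r
# Toy data: a factor and a numeric response (hypothetical values).
df <- data.frame(
  grp = factor(c("a", "a", "b", "b", "c", "c")),
  y   = c(5, 7, 1, 2, 9, 11)
)

# Reorder the levels of `grp` by the median of `y` within each level.
df$grp_sorted <- reorder(df$grp, df$y, FUN = median)
levels(df$grp_sorted)

# Fallback when no numeric or ordered variable exists: sort by frequency.
f <- factor(c("x", "y", "y", "z", "z", "z"))
f_by_freq <- factor(f, levels = names(sort(table(f), decreasing = TRUE)))
levels(f_by_freq)
```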

Instance weights

The weights argument specifies weights or frequency counts for each row of data. All plots and summary statistics take weights into account when supplied. In scatter plots and heat maps, weights are indicated either by a shaded disk with proportional area (default) or by jittering (option dedupe.scatter), if the number of duplicated points exceeds min.points.jitter. The amount of jittering can be controlled with jitter.x and jitter.y.
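To make "summary statistics take weights into account" concrete, here is a sketch of a weighted median in base R. The function `weighted_median` is hypothetical and not part of plotluck; it returns the lower weighted median, which can differ from median() when the total weight splits evenly.

```r
# Hypothetical helper: smallest x whose cumulative weight share reaches 50%.
weighted_median <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  x[which(cum_share >= 0.5)[1]]
}

x <- c(1, 2, 10)
w <- c(1, 1, 5)   # frequency counts: the value 10 occurs five times

weighted_median(x, w)
# For integer counts, this matches expanding the data row by row:
median(rep(x, times = w))
```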

Axis scaling

plotluck supports logarithmic and log-modulus axis scaling. log-modulus is considered if values are both positive and negative; in this case, the transform function is f(x) = sign(x) * log(1+abs(x)).
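The log-modulus transform stated above is straightforward to write in base R. The function names below are illustrative, not plotluck's own; log1p and expm1 are used for numerical accuracy near zero.

```r
# f(x) = sign(x) * log(1 + abs(x)), as given in the text.
log_modulus <- function(x) sign(x) * log1p(abs(x))

# Inverse transform, useful for mapping axis breaks back to data units.
log_modulus_inv <- function(y) sign(y) * expm1(abs(y))

x <- c(-100, -1, 0, 1, 100)
y <- log_modulus(x)
y
all.equal(log_modulus_inv(y), x)
```

Unlike a plain logarithm, the transform is defined at zero and for negative values, and it is symmetric about the origin.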

The heuristic for applying scaling is based on the proportion of the total display range that is occupied by the 'core' region of the distribution between the lower and upper quartiles; namely, on whether the transform would magnify this region by a factor of at least trans.log.thresh.
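One possible reading of this heuristic is sketched below: compute the share of the axis range covered by the interquartile region before and after the transform, and compare the ratio with a threshold. The helper names and the threshold value are assumptions; the actual computation inside plotluck may differ in detail.

```r
# Share of the display range occupied by the region between the quartiles.
core_fraction <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  (q[2] - q[1]) / diff(range(x))
}

# Factor by which a transform enlarges the core region on the axis.
magnification <- function(x, trans = log) {
  core_fraction(trans(x)) / core_fraction(x)
}

# Deterministic example spanning four orders of magnitude.
x <- 10^seq(0, 4, length.out = 101)
magnification(x)

thresh <- 2                  # hypothetical stand-in for trans.log.thresh
magnification(x) >= thresh   # TRUE would suggest a log axis
```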

Missing values

By default, missing (NA or NaN) values in factors are shown as a special factor level "?". They can be removed by setting na.rm=TRUE. Conventionally, missing numeric values are not shown.
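The effect of turning missing values into an explicit level can be reproduced in base R as follows. This is a sketch of the idea only, not plotluck's internal code.

```r
f <- factor(c("a", NA, "b", "a"))

# Make NA an explicit level, then label it "?".
f_na <- addNA(f)
levels(f_na)[is.na(levels(f_na))] <- "?"
table(f_na)

# The alternative (comparable in spirit to na.rm=TRUE): drop missing entries.
f_drop <- f[!is.na(f)]
length(f_drop)
```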

Sampling

For very large data sets, plots can take a very long time (or even crash R). plotluck has a built-in stop-gap: if the data comprises more than sample.max.rows rows, it is sampled down to that size (taking into account weights, if supplied).
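A down-sampling step of this kind might look like the sketch below. The helper `downsample` and its `max_rows` argument are hypothetical; sampling without replacement with probability proportional to weight is an assumption about how weights could be taken into account.

```r
# Hypothetical helper: keep at most `max_rows` rows, optionally favoring
# rows with larger weights.
downsample <- function(df, max_rows, weights = NULL) {
  if (nrow(df) <= max_rows) {
    return(df)
  }
  idx <- sample(nrow(df), size = max_rows, prob = weights)
  df[sort(idx), , drop = FALSE]   # preserve the original row order
}

set.seed(1)
big   <- data.frame(x = seq_len(1000), w = rep(c(1, 9), times = 500))
small <- downsample(big, max_rows = 100, weights = big$w)
nrow(small)
```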

Factor preprocessing

Character (resp. logical) vectors are converted to unordered (resp. ordered) factors.

Frequently, when numeric variables have very few values despite sufficient data size, it helps to treat these values as the levels of a factor; this is governed by option few.unique.as.factor.
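The conversions described in this section can be approximated in base R as shown below. The helper `as_plot_factor`, its `few_unique` cutoff of 5, and the choice of an ordered factor for few-valued numerics are all assumptions for illustration; the sketch also ignores the "sufficient data size" condition mentioned above.

```r
# Hypothetical helper mirroring the preprocessing rules in the text.
as_plot_factor <- function(x, few_unique = 5) {
  if (is.character(x)) {
    return(factor(x))                                   # unordered factor
  }
  if (is.logical(x)) {
    return(factor(x, levels = c(FALSE, TRUE), ordered = TRUE))
  }
  if (is.numeric(x) && length(unique(x)) <= few_unique) {
    return(factor(x, ordered = TRUE))                   # few distinct values
  }
  x                                                     # leave as is
}

class(as_plot_factor(c("u", "v", "u")))
levels(as_plot_factor(c(TRUE, FALSE, TRUE)))
levels(as_plot_factor(mtcars$cyl))   # only 4, 6 and 8 occur
class(as_plot_factor(mtcars$mpg))    # many distinct values: stays numeric
```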

If an unordered factor has too many levels, plots can get messy. In this case, only the max.factor.levels most frequent ones are retained, while the rest are merged into a default level ".other.".
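The pruning of rare levels can be sketched in base R like this. The helper `lump_levels` is hypothetical; whether plotluck counts the ".other." level toward max.factor.levels is not stated here, so the sketch simply keeps the `max_levels` most frequent levels and adds one merged level.

```r
# Hypothetical helper: keep the most frequent levels, merge the rest.
lump_levels <- function(f, max_levels, other = ".other.") {
  counts <- sort(table(f), decreasing = TRUE)
  if (length(counts) <= max_levels) {
    return(f)
  }
  keep <- names(counts)[seq_len(max_levels)]
  out  <- as.character(f)
  out[!is.na(out) & !(out %in% keep)] <- other   # leave NA untouched
  factor(out, levels = c(keep, other))
}

f <- factor(c("a", "a", "a", "b", "b", "c", "d"))
g <- lump_levels(f, max_levels = 2)
table(g)
```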

Coloring

If color or fill aesthetics are used to distinguish different levels or ranges of a variable, the color scheme adjusts to the type. Preferably, a sequential (resp. qualitative) palette is chosen for a numeric/ordered (unordered) factor (palette.brewer.seq, palette.brewer.qual); see also RColorBrewer.

Generating multiple plots at once

If formula contains a dot (".") symbol, the function creates a number of 1D or 2D plots by calling plotluck repeatedly. As described above, this allows either single distribution, one-vs-all and all-vs-all variable plots. To save space, rendering is minimal without axis labels.

In the all-vs-all case, the diagonal contains 1D distribution plots, analogous to the behavior of the default plot method for data frames, see plot.data.frame.

With in.grid=FALSE, plots are produced one after another; otherwise (the default), they are arranged together on one page, or on several pages if necessary. Page size is controlled by multi.max.rows and multi.max.cols.

With entropy.order=TRUE, plots are sorted by an estimate of empirical conditional entropy, with the goal of prioritizing the more predictive variables. Set verbose=TRUE if you want to see the actual values. For large data sets the calculation can be time consuming; entropy calculation can be suppressed by setting multi.entropy.order=FALSE.
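For two discrete variables, an empirical conditional entropy H(Y|X) can be estimated from a contingency table, as in the sketch below. The function `cond_entropy` is hypothetical and only illustrates the quantity being ranked; plotluck's own estimator, including how it treats numeric variables and weights, may differ.

```r
# Empirical H(Y | X) in bits: lower values mean X is more predictive of Y.
cond_entropy <- function(y, x) {
  joint <- table(x, y) / length(y)     # p(x, y)
  px    <- rowSums(joint)              # p(x)
  cond  <- joint / px                  # p(y | x); row i is divided by px[i]
  terms <- joint * log2(cond)
  -sum(terms[joint > 0])               # skip empty cells (0 * log 0)
}

x <- c("a", "a", "b", "b")
cond_entropy(c("u", "u", "v", "v"), x)   # y is determined by x
cond_entropy(c("u", "v", "u", "v"), x)   # y is independent of x
```

Sorting candidate explanatory variables by increasing values of such an estimate would put the most predictive ones first.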

Note: The return value is an object of class plotluck_multi. This class does not have any functionality; its sole purpose is to make this function work in the same way as ggplot and plotluck, namely, to do the actual drawing if and only if the return value is not assigned.

Debugging

With the option verbose=TRUE turned on, the function will print out information about the chosen and applicable plot types, ordering, log scaling, etc.

Column name matching

Variable names can be abbreviated if they match a column name uniquely by prefix.
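Unique prefix matching of this kind can be illustrated with a few lines of base R. The helper `match_column` is hypothetical and is not the function plotluck uses internally.

```r
# Hypothetical helper: resolve an abbreviated name to a unique column.
match_column <- function(name, columns) {
  if (name %in% columns) {
    return(name)                       # exact match wins
  }
  hits <- columns[startsWith(columns, name)]
  if (length(hits) != 1) {
    stop("'", name, "' matches ", length(hits), " columns")
  }
  hits
}

match_column("Spec", names(iris))      # unique prefix of Species
match_column("Petal.L", names(iris))   # unique prefix of Petal.Length
# match_column("Sepal", names(iris)) would fail: the prefix is ambiguous.
```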

Remarks on supported plot types

By default, plotluck uses violin and density plots in place of the more traditional box-and-whisker plots and histograms; these modern graph types convey the shape of a distribution better. With box plots, summary statistics like the mean and quantiles are less useful if the distribution is not unimodal; with histograms, a wrong choice of the number of bins can create misleading artifacts.

Following Cleveland's advice, factors are plotted on the y-axis to make labels most readable and compact at the same time. This direction can be controlled using option prefer.factors.vert.

Due to their well-documented problematic aspects, pie charts and stacked bar graphs are not supported.

With real-world data (as opposed to smooth mathematical functions), three-dimensional scatter, surface, or contour plots can often be hard to read if the shape of the distribution is not suitable, data coverage is uneven, or if the perspective is not carefully chosen depending on the data. Since they usually require manual tweaking, we have refrained from incorporating them.

Remarks on the use of options

For completeness, we have included the description of option parameters in the current help page. However, the tenet of this function is to be usable "out-of-the-box", with no or very little manual tweaking required. If you find yourself needing to change option values repeatedly or find the presets to be suboptimal, please contact the author.

Limitations

plotluck is designed for generic out-of-the-box plotting, and is not suitable for producing more specialized types of plots that arise in specific application domains (e.g., association, stem-and-leaf, star plots, geographic maps, etc.). It is restricted to at most three variables. Parallel plots with variables on different scales (such as time series of multiple related signals) are not supported.

See Also

plotluck.options, sample.plotluck, ggplot

Examples

library(plotluck)

# Single-variable density
data(diamonds, package='ggplot2')
plotluck(diamonds, price~1)
invisible(readline(prompt="Press [enter] to continue"))

# Violin plot
data(iris)
plotluck(iris, Species~Petal.Length)
invisible(readline(prompt="Press [enter] to continue"))

# Scatter plot
data(mpg, package='ggplot2')
plotluck(mpg, cty~model)
invisible(readline(prompt="Press [enter] to continue"))

# Spine plot
data(Titanic)
plotluck(as.data.frame(Titanic), Survived~Class+Sex, weights=Freq)
invisible(readline(prompt="Press [enter] to continue"))

# Facetting
data(msleep, package='ggplot2')
plotluck(msleep, sleep_total~bodywt|vore)
invisible(readline(prompt="Press [enter] to continue"))

# Heat map
plotluck(diamonds, price~cut+color)


# Multi plots
# All 1D distributions
plotluck(iris, .~1)

# 2D dependencies with one fixed variable on vertical axis
plotluck(iris, Species~.)

# See also tests/testthat/test_plotluck.R for more examples!

plotluck documentation built on June 27, 2019, 5:07 p.m.