Description Usage Arguments Value Determining the type of plot Conditional variables Reordering of factor levels Instance weights Axis scaling Missing values Sampling Factor preprocessing Coloring Generating multiple plots at once Debugging Column name matching Remarks on supported plot types Remarks on the use of options Limitations See Also Examples
The purpose of plotluck is to let the user focus on what to plot, and automate the how. Given a dependency formula with up to three variables, it tries to choose the most suitable type of plot. It also automates sampling large datasets, correct handling of observation weights, logarithmic axis scaling, ordering and pruning of factor levels, and overlaying smoothing curves or median lines.
data | a data frame.
formula | an object of class formula. In addition to the base plot types, the dot symbol (".") can be used to generate several plots in one call. See also section "Generating multiple plots at once" below.
weights | observation weights or frequencies (optional).
opts | a named list of options (optional); see also plotluck.options.
... | additional parameters to be passed to the respective ggplot2 geom objects.
a ggplot object, or a plotluck.multi object if the dot symbol was used.
Besides the shape of the formula, the algorithm takes into account the type of each variable: numeric, ordered factor, or unordered factor. Often, it makes sense to treat ordered factors similarly to numeric types.
One-variable numeric (resp. factor) distributions are usually represented by density (resp. Cleveland dot) charts, but can be overridden to histograms or bar plots using the geom option. Density plots come with an overlaid vertical median line.
For two numerical variables, by default a scatter plot is produced, but for high numbers of points a hexbin is preferred (option min.points.hex). These plots come with a smoothing line and standard deviation.
The relation between two factor variables can be depicted best by spine (a.k.a. mosaic) plots, unless they have too many levels (options max.factor.levels.spine.x, max.factor.levels.spine.y, max.factor.levels.spine.z). Otherwise, a heat map is produced.
For a mixed-type (factor/numeric) pair of variables, violin (overridable to box) plots are generated. However, if the resulting graph would contain too many (more than max.factor.levels.violin) violin plots in a row, the algorithm switches automatically. The number of bins of a histogram can be customized with n.breaks.histogram. The default setting, NA, applies a heuristic estimate.
The case of a response with two explanatory variables ('y~x+z') is covered by either a spine plot (if all are factors) or a heat map.
When there are only few points for one of the aggregate plots, a scatter plot often looks better (options min.points.density, min.points.violin, min.points.hex).
If each factor combination occurs only once in the data set, we resort to bar plots.
Conditional variables are represented either by coloring within the same graph (option max.factor.levels.color), or by facetting (preferred dimensions facet.num.wrap for one and facet.num.grid for two conditional variables). Numeric vectors are discretized accordingly. Facets are laid out horizontally or vertically according to the plot type, up to maximum dimensions of facet.max.rows and facet.max.cols.
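As a rough illustration of how a numeric conditional variable might be discretized before coloring or facetting, the following base R sketch bins a vector into quantile-based intervals. The helper name discretize and the choice of quantile breaks are assumptions made for this example; the text does not state which binning scheme plotluck actually uses.

```r
# Hypothetical helper (not part of plotluck): bin a numeric vector into
# n_bins intervals with roughly equal numbers of observations.
discretize <- function(x, n_bins = 4) {
  probs  <- seq(0, 1, length.out = n_bins + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
  # An ordered factor keeps the natural order of the intervals.
  cut(x, breaks = breaks, include.lowest = TRUE, ordered_result = TRUE)
}

bins <- discretize(mtcars$hp, n_bins = 3)
table(bins)
```

Each resulting level could then serve as one facet or one color.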
To better illustrate the relation between an independent factor variable and a dependent numerical variable (or an ordered factor), levels are reordered according to the value of the dependent variable. If no other numeric or ordered variable exists, we sort by frequency.
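The reordering described above can be approximated with base R's reorder(). This is only a sketch: the use of the median as the summary and the ascending sort direction are assumptions, not confirmed details of plotluck.

```r
# Reorder the levels of a factor by the median of a numeric variable.
cyl <- factor(mtcars$cyl)                        # levels "4", "6", "8"
cyl_by_mpg <- reorder(cyl, mtcars$mpg, FUN = median)
levels(cyl_by_mpg)

# Fallback when no numeric or ordered variable exists: sort by frequency.
cyl_by_freq <- reorder(cyl, rep(1, length(cyl)), FUN = sum)
levels(cyl_by_freq)
```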
The weights argument lets you specify weights or frequency counts for each row of data. All plots and summary statistics take weights into account when supplied. In scatter plots and heat maps, weights are indicated either by a shaded disk with proportional area (default) or by jittering (option dedupe.scatter), if the number of duplicated points exceeds min.points.jitter. The amount of jittering can be controlled with jitter.x and jitter.y.
plotluck supports logarithmic and log-modulus axis scaling. Log-modulus is considered if values are both positive and negative; in this case, the transform function is f(x) = sign(x) * log(1+abs(x)).
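The log-modulus transform stated above is easy to write down in base R. The function names logmod and logmod_inv are made up for this sketch and are not plotluck functions.

```r
# Log-modulus transform: f(x) = sign(x) * log(1 + abs(x)).
# log1p(a) computes log(1 + a) accurately for small a.
logmod <- function(x) sign(x) * log1p(abs(x))

# Inverse transform, e.g. for mapping axis breaks back to the data scale.
logmod_inv <- function(y) sign(y) * expm1(abs(y))

x <- c(-1000, -10, -1, 0, 1, 10, 1000)
y <- logmod(x)
round(y, 3)
```

The transform is monotone, maps zero to zero, and is symmetric around the origin, which is why it suits data with both positive and negative values.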
The heuristic to apply scaling is based on the proportion of the total display range that is occupied by the 'core' region of the distribution between the lower and upper quartiles; namely, on whether the transform would magnify this region by a factor of at least trans.log.thresh.
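One way to read this heuristic is sketched below in base R. The helper names, the exact ratio being compared, and the threshold value of 2 are all assumptions for illustration; the text only states the general idea.

```r
# Fraction of the display range covered by the interquartile 'core' region.
core_fraction <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  r <- range(x, na.rm = TRUE)
  (q[2] - q[1]) / (r[2] - r[1])
}

# Hypothetical decision rule: apply log scaling if it magnifies the core
# region by at least a factor of `thresh` (positive data only).
should_log_scale <- function(x, thresh = 2) {
  stopifnot(all(x > 0, na.rm = TRUE))
  core_fraction(log(x)) / core_fraction(x) >= thresh
}

set.seed(1)
skewed  <- rlnorm(1000, meanlog = 0, sdlog = 2)  # heavy right tail
uniform <- runif(1000, min = 1, max = 2)         # no skew

should_log_scale(skewed)
should_log_scale(uniform)
```

For the heavily skewed sample the core region is squeezed into a small part of the linear axis, so the rule favors a log scale; for the uniform sample the transform changes little.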
By default, missing (NA or NaN) values in factors are shown as a special factor level "?". They can be removed by setting na.rm=TRUE. Conventionally, missing numeric values are not shown.
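The treatment of missing factor values can be mimicked in base R as follows. The helper na_as_level is hypothetical and only illustrates the idea of turning NA into an explicit "?" level.

```r
# Hypothetical helper: represent missing factor values as a level "?".
na_as_level <- function(f, label = "?") {
  f <- as.factor(f)
  if (!anyNA(f)) return(f)
  x <- as.character(f)
  x[is.na(x)] <- label
  # Append the new level last and preserve orderedness of the input.
  factor(x, levels = c(levels(f), label), ordered = is.ordered(f))
}

f <- factor(c("a", NA, "b", "a", NA))
table(na_as_level(f))
```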
For very large data sets, plots can take a very long time (or even crash R). plotluck has a built-in stop-gap: If the data comprises more than sample.max.rows rows, it will be sampled down to that size (taking into account weights, if supplied).
Character (resp. logical) vectors are converted to unordered (resp. ordered) factors.
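The conversion rule above might be expressed per column like this in base R; preprocess_column is a made-up name for the sketch.

```r
# Hypothetical helper: character -> unordered factor,
# logical -> ordered factor (FALSE < TRUE), everything else unchanged.
preprocess_column <- function(x) {
  if (is.character(x)) {
    factor(x)
  } else if (is.logical(x)) {
    factor(x, levels = c(FALSE, TRUE), ordered = TRUE)
  } else {
    x
  }
}

df <- data.frame(name  = c("x", "y", "x"),
                 flag  = c(TRUE, FALSE, TRUE),
                 value = c(1.5, 2, 3),
                 stringsAsFactors = FALSE)
df[] <- lapply(df, preprocess_column)
str(df)
```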
Frequently, when numeric variables have very few values despite sufficient data size, it helps to treat these values as the levels of a factor; this is governed by option few.unique.as.factor.
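As an illustration, the following base R sketch converts a numeric vector to a factor when it has only a handful of distinct values. The helper name, the cutoff of 5, and the choice of an ordered factor are assumptions; the text does not specify how few.unique.as.factor is interpreted.

```r
# Hypothetical helper: treat a numeric vector with few distinct values
# as an ordered factor; leave all other vectors unchanged.
as_factor_if_few_unique <- function(x, max_unique = 5) {
  n_unique <- length(unique(x[!is.na(x)]))
  if (is.numeric(x) && n_unique <= max_unique) {
    factor(x, ordered = TRUE)
  } else {
    x
  }
}

class(as_factor_if_few_unique(mtcars$cyl))  # 3 distinct values
class(as_factor_if_few_unique(mtcars$mpg))  # many distinct values
```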
If an unordered factor has too many levels, plots can get messy. In this case, only the max.factor.levels most frequent ones are retained, while the rest are merged into a default level ".other.".
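The pruning of factor levels can be sketched in base R as below. The helper lump_levels is hypothetical; in particular, whether the ".other." level counts toward max.factor.levels is not stated in the text, and this sketch assumes it does not.

```r
# Hypothetical helper: keep the max_levels most frequent levels and merge
# all remaining ones into a single catch-all level.
lump_levels <- function(f, max_levels, other = ".other.") {
  f <- as.factor(f)
  if (nlevels(f) <= max_levels) return(f)
  counts <- sort(table(f), decreasing = TRUE)
  keep   <- names(counts)[seq_len(max_levels)]
  x <- as.character(f)
  x[!is.na(x) & !(x %in% keep)] <- other
  factor(x, levels = c(keep, other))
}

f <- factor(c("a", "a", "a", "b", "b", "c", "d", "e"))
lumped <- lump_levels(f, max_levels = 2)
table(lumped)
```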
If color or fill aesthetics are used to distinguish different levels or ranges of a variable, the color scheme adjusts to the type. Preferably, a sequential palette is chosen for a numeric variable or ordered factor, and a qualitative palette for an unordered factor (options palette.brewer.seq, palette.brewer.qual); see also RColorBrewer.
If formula contains a dot (".") symbol, the function creates a number of 1D or 2D plots by calling plotluck repeatedly. As described above, this allows single-distribution, one-vs-all, or all-vs-all variable plots. To save space, rendering is minimal, without axis labels.
In the all-vs-all case, the diagonal contains 1D distribution plots, analogous to the behavior of the default plot method for data frames; see plot.data.frame.
With the setting in.grid=FALSE, plots are produced in a sequence; otherwise (default), they are arranged together on one page, or on multiple pages if necessary. Page size is controlled by multi.max.rows and multi.max.cols.
With multi.entropy.order=TRUE, plots are sorted by an estimate of empirical conditional entropy, with the goal of prioritizing the more predictive variables. Set verbose=TRUE if you want to see the actual values. For large data sets the calculation can be time-consuming; it can be suppressed by setting multi.entropy.order=FALSE.
Note: The return value is an object of class plotluck_multi. This class does not have any functionality; its sole purpose is to make this function work in the same way as ggplot and plotluck, namely, to do the actual drawing if and only if the return value is not assigned.
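The "draw only when not assigned" behavior relies on R's auto-printing of visible top-level values, which dispatches to a print method for the object's class. The sketch below uses a made-up class demo_multi and character strings as stand-ins for plots; it is not plotluck's implementation.

```r
# Hypothetical container class holding several plot objects.
new_demo_multi <- function(plots) {
  structure(list(plots = plots), class = "demo_multi")
}

# The print method does the actual "drawing". R calls it automatically
# when a value is returned at top level without being assigned.
print.demo_multi <- function(x, ...) {
  for (p in x$plots) print(p)
  invisible(x)
}

m <- new_demo_multi(list("plot 1", "plot 2"))  # assigned: nothing is drawn
m                                              # auto-printed: both are drawn
```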
With the option verbose=TRUE turned on, the function will print out information about the chosen and applicable plot types, ordering, log scaling, etc.
Variable names can be abbreviated if they match a column name uniquely by prefix.
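Unique prefix matching of column names can be sketched in base R as follows. The helper match_column is hypothetical and shown only to make the rule concrete: an exact match wins, otherwise the abbreviation must be the prefix of exactly one column.

```r
# Hypothetical helper: resolve an abbreviated variable name to a column.
match_column <- function(name, columns) {
  if (name %in% columns) return(name)
  hits <- columns[startsWith(columns, name)]
  if (length(hits) != 1) {
    stop("'", name, "' matches ", length(hits), " columns; need exactly one")
  }
  hits
}

match_column("Petal.L", names(iris))   # unique prefix of "Petal.Length"
match_column("Species", names(iris))   # exact match
```

An ambiguous abbreviation such as "Petal", which is a prefix of both Petal.Length and Petal.Width, raises an error in this sketch.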
By default, plotluck uses violin and density plots in place of the more traditional box-and-whisker plots and histograms; these modern graph types convey the shape of a distribution better. For box-and-whisker plots, summary statistics like mean and quantiles are less useful if the distribution is not unimodal; for histograms, a wrong choice of the number of bins can create misleading artifacts.
Following Cleveland's advice, factors are plotted on the y-axis to make labels most readable and compact at the same time. This direction can be controlled using option prefer.factors.vert.
Due to their well-documented problematic aspects, pie charts and stacked bar graphs are not supported.
With real-world data (as opposed to smooth mathematical functions), three-dimensional scatter, surface, or contour plots can often be hard to read if the shape of the distribution is not suitable, data coverage is uneven, or if the perspective is not carefully chosen depending on the data. Since they usually require manual tweaking, we have refrained from incorporating them.
For completeness, we have included the description of option parameters in the current help page. However, the tenet of this function is to be usable "out-of-the-box", with no or very little manual tweaking required. If you find yourself needing to change option values repeatedly or find the presets to be suboptimal, please contact the author.
plotluck is designed for generic out-of-the-box plotting, and is not suitable for producing more specialized types of plots that arise in specific application domains (e.g., association, stem-and-leaf, or star plots, or geographic maps). It is restricted to at most three variables. Parallel plots with variables on different scales (such as time series of multiple related signals) are not supported.
plotluck.options, sample.plotluck, ggplot
library(plotluck)

# Single-variable density
data(diamonds, package='ggplot2')
plotluck(diamonds, price~1)
invisible(readline(prompt="Press [enter] to continue"))
# Violin plot
data(iris)
plotluck(iris, Species~Petal.Length)
invisible(readline(prompt="Press [enter] to continue"))
# Scatter plot
data(mpg, package='ggplot2')
plotluck(mpg, cty~model)
invisible(readline(prompt="Press [enter] to continue"))
# Spine plot
data(Titanic)
plotluck(as.data.frame(Titanic), Survived~Class+Sex, weights=Freq)
invisible(readline(prompt="Press [enter] to continue"))
# Facetting
data(msleep, package='ggplot2')
plotluck(msleep, sleep_total~bodywt|vore)
invisible(readline(prompt="Press [enter] to continue"))
# Heat map
plotluck(diamonds, price~cut+color)
# Multi plots
# All 1D distributions
plotluck(iris, .~1)
# 2D dependencies with one fixed variable on vertical axis
plotluck(iris, Species~.)
# See also tests/testthat/test_plotluck.R for more examples!