One-Dimensional Scatter Diagram, Spike Histogram, or Density
Description
scat1d adds tick marks (bar codes, rug plot) on any of the four
sides of an existing plot, corresponding with non-missing values of a
vector x. This is used to show the data density. It can also
place the tick marks along a curve by specifying y-coordinates to go
along with the x values.
If any two values of x are within eps*w of each other, where eps
defaults to .001 and w is the span of the intended axis, values of x
are jittered by adding a value uniformly distributed in
[-jitfrac*w, jitfrac*w], where jitfrac defaults to .008. Specifying
preserve=TRUE invokes jitter2 with a different logic of jittering.
scat1d also allows plotting random sub-segments to handle very large
x vectors (see tfrac).
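The jittering rule above can be sketched in base R. This is only an illustration of the stated rule, not the code used by scat1d: the helper name jitter_if_close and its axis_range argument are hypothetical, and w is taken from the data range rather than from par("usr").

```r
# Hypothetical helper; illustrates the documented rule, not Hmisc's implementation.
jitter_if_close <- function(x, axis_range = range(x, na.rm = TRUE),
                            eps = 0.001, jitfrac = 0.008) {
  x <- x[!is.na(x)]                      # only non-missing values are used
  w <- diff(axis_range)                  # span of the intended axis
  gaps <- diff(sort(x))                  # distances between neighbouring values
  too_close <- length(gaps) > 0 && any(gaps < eps * w)
  if (too_close && jitfrac > 0) {
    # add uniform noise in [-jitfrac*w, jitfrac*w] to every value
    x <- x + runif(length(x), -jitfrac * w, jitfrac * w)
  }
  x
}

set.seed(1)
x_ties <- c(1, 2, 2, 3, 10)                # the two 2s are within eps*w
x_jit  <- jitter_if_close(x_ties)          # all values receive a small shift
x_same <- jitter_if_close(c(1, 2, 3, 10))  # no near-ties: returned unchanged
```

With the defaults, the shift is at most 0.8 percent of the axis span, so the ticks separate visually without moving far from their true positions.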
jitter2 is a generic method for jittering, which does not add
random noise. It retains unique values and ranks, and randomly spreads
duplicate values at equidistant positions within limits of enclosing
values. jitter2 is especially useful for numeric variables with
discrete values, like rating scales. Missing values are allowed and
are returned. Currently implemented methods are jitter2.default
for vectors and jitter2.data.frame, which returns a data.frame
with each numeric column jittered.
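A rough base-R sketch of the idea behind this kind of jittering follows. It is not the code of jitter2.default: the helper spread_duplicates is hypothetical, it ignores the limit, eps, and presorted arguments, and its handling of the smallest and largest values (using the single available gap) is an assumption.

```r
# Hypothetical helper illustrating the idea; not the code of jitter2.default.
spread_duplicates <- function(x, fill = 1/3) {
  out <- x
  ux <- sort(unique(x[!is.na(x)]))       # distinct non-missing values
  for (i in seq_along(ux)) {
    d <- ux[i]
    idx <- which(x == d)                 # positions holding this value (NAs drop out)
    k <- length(idx)
    if (k < 2) next                      # unique values are left untouched
    # gaps to the enclosing lower and upper distinct values, where they exist
    gaps <- c(if (i > 1) d - ux[i - 1], if (i < length(ux)) ux[i + 1] - d)
    if (length(gaps) == 0) next          # only one distinct value: nothing to scale by
    half <- fill * min(gaps) / 2
    offsets <- seq(-half, half, length.out = k)   # equidistant positions
    out[idx] <- d + offsets[sample.int(k)]        # random assignment to duplicates
  }
  out
}

set.seed(3)
x <- c(0:3, rep(4, 3), 5, rep(7, 10), 9, NA)
x_spread <- spread_duplicates(x)         # NA is returned as NA
```

Because the offsets are bounded by a fraction of the gap to the nearest distinct value, the rank order between distinct values is preserved, which is the property that distinguishes this approach from adding noise.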
datadensity is a generic method used to show data densities in
more complex situations. Here, another datadensity method is
defined for data frames. Depending on the which argument, some
or all of the variables in a data frame will be displayed, with
scat1d used to display continuous variables and, by default,
bars used to display frequencies of categorical, character, or
discrete numeric variables. For such variables, when the total length
of value labels exceeds 200, only the first few characters from each
level are used. By default, datadensity.data.frame will
construct one axis (i.e., one strip) per variable in the data frame.
Variable names appear to the left of the axes, and the number of
missing values (if greater than zero) appears to the right of the axes.
An optional group variable can be used for stratification,
where the different strata are depicted using different colors. If
the q vector is specified, the desired quantiles (over all
groups) are displayed with solid triangles below each axis.
When the sample size exceeds 2000 (this value may be modified using
the nhistSpike argument), datadensity calls histSpike instead of
scat1d to show the data density for numeric variables. This results
in a histogram-like display that makes the resulting graphics file
much smaller. In this case, datadensity uses the minf argument (see
below) so that very infrequent data values will not be lost on the
variable's axis, although this will slightly distort the histogram.
histSpike is another method for showing a high-resolution data
distribution that is particularly good for very large datasets (say
n > 1000). By default, histSpike bins the continuous x variable into
100 equal-width bins and then computes the frequency counts within
bins (if n does not exceed 10, no binning is done). If add=FALSE
(the default), the function displays either proportions or
frequencies as in a vertical histogram. Instead of bars, spikes are
used to depict the frequencies. If add=TRUE, the function assumes you
are adding small density displays that are intended to take up a
small amount of space in the margins of the overall plot. The frac
argument is used as with scat1d to determine the relative length of
the whole plot that is used to represent the maximum frequency. No
jittering is done by histSpike.
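The binning step described above can be approximated in base R as follows. This is a sketch under stated assumptions, not histSpike itself: the helper spike_hist is hypothetical, n is taken to be the number of non-missing observations, and empty bins are simply dropped.

```r
# Hypothetical helper; a base-R approximation of the display, not histSpike.
spike_hist <- function(x, nint = 100, type = c("proportion", "count")) {
  type <- match.arg(type)
  x <- x[!is.na(x)]
  if (length(x) <= 10 || min(x) == max(x)) {
    tab  <- table(x)                     # few observations: no binning
    mids <- as.numeric(names(tab))
    freq <- as.vector(tab)
  } else {
    breaks <- seq(min(x), max(x), length.out = nint + 1)  # equal-width bins
    bin    <- cut(x, breaks, include.lowest = TRUE)
    freq   <- as.vector(table(bin))
    mids   <- (breaks[-1] + breaks[-(nint + 1)]) / 2      # bin midpoints
  }
  keep <- freq > 0                       # draw spikes only for occupied bins
  h <- if (type == "proportion") freq / length(x) else freq
  list(x = mids[keep], height = h[keep])
}

set.seed(2)
s <- spike_hist(rnorm(5000))
plot(s$x, s$height, type = "h", xlab = "x", ylab = "Proportion")  # spikes, not bars
```

Drawing one thin vertical segment per occupied bin, rather than one tick per observation, is what keeps the resulting graphics file small for large n.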
histSpike can also graph a kernel density estimate for x, or add a
small density curve to any of 4 sides of an existing plot. When y or
curve is specified, the density or spikes are drawn with respect to
the curve rather than the x-axis.
histSpikeg is similar to histSpike but is for adding layers
to a ggplot2 graphics object. histSpikeg can also add lowess
curves to the plot. histSpikep has a narrower focus but is similar
for plotly graphics.
Usage
scat1d(x, side=3, frac=0.02, jitfrac=0.008, tfrac,
eps=ifelse(preserve,0,.001),
lwd=0.1, col=par("col"),
y=NULL, curve=NULL,
bottom.align=FALSE,
preserve=FALSE, fill=1/3, limit=TRUE, nhistSpike=2000, nint=100,
type=c('proportion','count','density'), grid=FALSE, ...)
jitter2(x, ...)
## Default S3 method:
jitter2(x, fill=1/3, limit=TRUE, eps=0,
presorted=FALSE, ...)
## S3 method for class 'data.frame'
jitter2(x, ...)
datadensity(object, ...)
## S3 method for class 'data.frame'
datadensity(object, group,
which=c("all","continuous","categorical"),
method.cat=c("bar","freq"),
col.group=1:10,
n.unique=10, show.na=TRUE, nint=1, naxes,
q, bottom.align=nint>1,
cex.axis=sc(.5,.3), cex.var=sc(.8,.3),
lmgp=NULL, tck=sc(.009,.002),
ranges=NULL, labels=NULL, ...)
# sc(a,b) means default to a if number of axes <= 3, b if >=50, use
# linear interpolation within 3-50
histSpike(x, side=1, nint=100, frac=.05, minf=NULL, mult.width=1,
type=c('proportion','count','density'),
xlim=range(x), ylim=c(0,max(f)), xlab=deparse(substitute(x)),
ylab=switch(type,proportion='Proportion',
count ='Frequency',
density ='Density'),
y=NULL, curve=NULL, add=FALSE,
bottom.align=type=='density', col=par('col'), lwd=par('lwd'),
grid=FALSE, ...)
histSpikeg(formula=NULL, predictions=NULL, data,
lowess=FALSE, xlim=NULL, ylim=NULL,
side=1, nint=100,
frac=function(f) 0.01 + 0.02*sqrt(f-1)/sqrt(max(f,2)-1),
span=3/4, histcol='black')
histSpikep(p, x, y, z, group=NULL, color=NULL, hovertext=NULL,
colors=NULL, bottom.align=TRUE, tracename='Proportion', ...)
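The sc(a,b) rule noted in the Usage comment for datadensity can be restated in base R. This is a hypothetical re-statement for illustration only; the explicit naxes argument is an assumption, since the Usage shows sc with two arguments.

```r
# Hypothetical re-statement of the sc(a, b) rule; not the function used by Hmisc.
sc <- function(a, b, naxes) {
  # approx() interpolates linearly between (3, a) and (50, b);
  # rule = 2 holds the end values constant outside that range.
  approx(x = c(3, 50), y = c(a, b), xout = naxes, rule = 2)$y
}

# cex.axis default for 1, 3, 26.5, 50, and 80 axes
cex_axis <- sc(.5, .3, naxes = c(1, 3, 26.5, 50, 80))
```

The effect is that text and tick sizes shrink smoothly as more axes are placed on one page.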

Arguments
x 
a vector of numeric data, or a data frame (for 
object 
a data frame or list (even with unequal number of observations per
variable, as long as 
side 
axis side to use (1=bottom (default for histSpike), 2=left, 3=top (default for scat1d), 4=right)
frac 
fraction of smaller of vertical and horizontal axes for tick mark
lengths. Can be negative to move tick marks outside of plot. For

jitfrac 
fraction of axis for jittering. If jitfrac <= 0, no
jittering is done. If 
tfrac 
Fraction of tick mark to actually draw. If tfrac<1,
will draw a random fraction 
eps 
fraction of axis for determining overlapping points in 
lwd 
line width for tick marks, passed to 
col 
color for tick marks, passed to 
y 
specify a vector the same length as 
curve 
a list containing elements 
bottom.align 
set to 
preserve 
set to 
fill 
maximum fraction of the axis filled by jittered values. If 
limit 
specifies a limit for maximum shift in jittered values. Duplicate
values will be spread within
+/- fill*min(u-d, d-l)/2. The
default 
nhistSpike 
If the number of observations exceeds or equals 
type 
used by or passed to 
grid 
set to 
nint 
number of intervals to divide each continuous variable's axis for

... 
optional arguments passed to 
presorted 
set to 
group 
an optional stratification variable, which is converted to a

which 
set 
method.cat 
set 
col.group 
colors representing the 
n.unique 
number of unique values a numeric variable must have before it is considered to be a continuous variable 
show.na 
set to 
naxes 
number of axes to draw on each page before starting a new plot. You
can set 
q 
a vector of quantiles to display. By default, quantiles are not shown. 
cex.axis 
character size for labels of axis tick marks 
cex.var 
character size for variable names and frequency of 
lmgp 
spacing between numeric axis labels and axis (see 
tck 
see 
ranges 
a list containing ranges for some or all of the numeric variables.
If 
labels 
a vector of labels to use in labeling the axes for

minf 
For 
mult.width 
multiplier for the smoothing window width computed by

xlim 
a 2-vector specifying the outer limits of 
ylim 
y-axis range for plotting (if 
xlab 
x-axis label ( 
ylab 
y-axis label ( 
add 
set to 
formula 
a formula of the form 
predictions 
the data frame being plotted by 
data 
for 
lowess 
set to 
span 
passed to 
histcol 
color of line segments (tick marks) for

p 
an existing 
z 
vector of numeric values to be added to 
color 
an optional grouping variable to be assigned to the color attribute for lines 
hovertext 
a character vector specifying hover text for

colors 
a vector of colors to be used by the 
tracename 
optional trace name, default is 
Details
For scat1d the length of line segments used is
frac*min(par()$pin)/par()$uin[opp] data units, where
opp is the index of the opposite axis and frac defaults
to .02. Assumes that plot has already been called. Current
par("usr") is used to determine the range of data for the axis
of the current plot. This range is used in jittering and in
constructing line segments.
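The segment-length formula above can be worked through in base R. This is a sketch only: R's par() has no uin element, so the code assumes uin means inches per data unit (as in S-PLUS) and derives it from pin and usr. The helper tick_length is hypothetical.

```r
# Sketch only: assumes uin is inches per data unit, derived from pin and usr.
tick_length <- function(frac = 0.02, side = 3) {
  usr <- par("usr")                      # data range of the current plot
  pin <- par("pin")                      # plot size in inches (width, height)
  uin <- pin / c(usr[2] - usr[1], usr[4] - usr[3])
  opp <- if (side %in% c(1, 3)) 2 else 1 # ticks on a horizontal axis extend along y
  frac * min(pin) / uin[opp]             # tick length in data units
}

pdf(NULL)                                # off-screen device so the sketch runs anywhere
plot(1:10, (1:10)^2)
len_top <- tick_length(frac = 0.02, side = 3)   # tick length in y data units
dev.off()
```

Scaling by the smaller plot dimension keeps tick marks the same physical length on all four sides, regardless of the data units on each axis.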
Value
histSpike returns the actual range of x used in its binning.
Side Effects
scat1d adds line segments to plot. datadensity.data.frame draws a
complete plot. histSpike draws a complete plot or adds to an existing
plot.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
Nashville TN, USA
f.harrell@vanderbilt.edu
Martin Maechler (improved scat1d)
Seminar fuer Statistik
ETH Zurich SWITZERLAND
maechler@stat.math.ethz.ch
Jens Oehlschlaegel-Akiyoshi (wrote jitter2)
Center for Psychotherapy Research
Christian-Belser-Strasse 79a
D-70597 Stuttgart Germany
oehl@psyres-stuttgart.de
See Also
segments, jitter, rug, plsmo, lowess, stripplot,
hist.data.frame, Ecdf, hist, histogram, table, density,
stat_plsmo, histboxp
Examples
plot(x <- rnorm(50), y <- 3*x + rnorm(50)/2 )
scat1d(x) # density bars on top of graph
scat1d(y, 4) # density bars at right
histSpike(x, add=TRUE) # histogram instead, 100 bins
histSpike(y, 4, add=TRUE)
histSpike(x, type='density', add=TRUE) # smooth density at bottom
histSpike(y, 4, type='density', add=TRUE)
smooth <- lowess(x, y) # add nonparametric regression curve
lines(smooth) # Note: plsmo() does this
scat1d(x, y=approx(smooth, xout=x)$y) # data density on curve
scat1d(x, curve=smooth) # same effect as previous command
histSpike(x, curve=smooth, add=TRUE) # same as previous but with histogram
histSpike(x, curve=smooth, type='density', add=TRUE)
# same but smooth density over curve
plot(x <- rnorm(250), y <- 3*x + rnorm(250)/2)
scat1d(x, tfrac=0) # dots randomly spaced from axis
scat1d(y, 4, frac=-.03) # bars outside axis
scat1d(y, 2, tfrac=.2) # same bars with smaller random fraction
x <- c(0:3,rep(4,3),5,rep(7,10),9)
plot(x, jitter2(x)) # original versus jittered values
abline(0,1) # unique values unjittered on abline
points(x+0.1, jitter2(x, limit=FALSE), col=2)
# allow locally maximum jittering
points(x+0.2, jitter2(x, fill=1), col=3); abline(h=seq(0.5,9,1), lty=2)
# fill 3/3 instead of 1/3
x <- rnorm(200,0,2)+1; y <- x^2
x2 <- round((x+rnorm(200))/2)*2
x3 <- round((x+rnorm(200))/4)*4
dfram <- data.frame(y,x,x2,x3)
plot(dfram$x2, dfram$y) # jitter2 via scat1d
scat1d(dfram$x2, y=dfram$y, preserve=TRUE, col=2)
scat1d(dfram$x2, preserve=TRUE, frac=0.02, col=2)
scat1d(dfram$y, 4, preserve=TRUE, frac=0.02, col=2)
pairs(jitter2(dfram)) # pairs for jittered data.frame
# This gets reasonable pairwise scatter plots for all combinations of
# variables where
#
# - continuous variables (with unique values) are not jittered at all, thus
# all relations between continuous variables are shown as they are,
# extreme values have exact positions.
#
# - discrete variables get a reasonable amount of jittering, whether they
# have 2, 3, 5, 10, 20 ... levels
#
# - different from adding noise, jitter2() will use the available space
# optimally and no value will randomly mask another
#
# If you want a scatterplot with lowess smooths on the *exact* values and
# the point clouds shown jittered, you just need
#
pairs( dfram ,panel=function(x,y) { points(jitter2(x),jitter2(y))
lines(lowess(x,y)) } )
datadensity(dfram) # graphical snapshot of entire data frame
datadensity(dfram, group=cut2(dfram$x2,g=3))
# stratify points and frequencies by
# x2 tertiles and use 3 colors
# datadensity.data.frame(split(x, grouping.variable))
# need to explicitly invoke datadensity.data.frame when the
# first argument is a list
## Not run:
require(rms)
f <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
data=d)
p <- Predict(f, cholesterol, sex)
g <- ggplot(p, aes(x=cholesterol, y=yhat, color=sex)) + geom_line() +
xlab(xl2) + ylim(-1, 1)
g <- g + geom_ribbon(data=p, aes(ymin=lower, ymax=upper), alpha=0.2,
linetype=0, show_guide=FALSE)
g + histSpikeg(yhat ~ cholesterol + sex, p, d)
# colors <- c('red', 'blue')
# p <- plot_ly(x=x, y=y, color=g, colors=colors, mode='markers')
# histSpikep(p, x, y, z, color=g, colors=colors)
## End(Not run)
