# scat1d: One-Dimensional Scatter Diagram, Spike Histogram, or Density In Hmisc: Harrell Miscellaneous

### Description

scat1d adds tick marks (bar codes, rug plot) on any of the four sides of an existing plot, corresponding to the non-missing values of a vector x. This shows the data density. It can also place the tick marks along a curve when y-coordinates are supplied to go along with the x values.

If any two values of x are within eps*w of each other, where eps defaults to .001 and w is the span of the intended axis, values of x are jittered by adding a value uniformly distributed in [-jitfrac*w, jitfrac*w], where jitfrac defaults to .008. Specifying preserve=TRUE invokes jitter2, which uses a different jittering logic. To handle very large x vectors, scat1d can plot random sub-segments (see tfrac).
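The jittering rule above can be sketched in base R. This is only an illustration of the rule as described, not the Hmisc implementation: the helper name `jitter_if_close` is hypothetical, and w defaults here to the range of x, whereas scat1d takes the span from the current plot's axis.

```r
# Hypothetical helper illustrating the jitter rule described above; not Hmisc code.
# Assumption: w defaults to the range of x (scat1d uses the span of the plot axis).
jitter_if_close <- function(x, w = diff(range(x, na.rm = TRUE)),
                            eps = 0.001, jitfrac = 0.008) {
  x <- x[!is.na(x)]                 # only non-missing values are plotted
  gaps <- diff(sort(x))             # distances between neighbouring values
  if (length(gaps) > 0 && any(gaps <= eps * w)) {
    # At least one pair is too close: shift every value by uniform noise
    x <- x + runif(length(x), -jitfrac * w, jitfrac * w)
  }
  x
}

set.seed(1)
tied   <- c(1, 2, 2, 3, 10)   # contains a tie, so all values are jittered
spread <- c(1, 2, 3, 10)      # all gaps exceed eps * w, so returned unchanged
jitter_if_close(tied)
jitter_if_close(spread)
```

Note that a single close pair triggers jittering of the whole vector, which is why preserve=TRUE (jitter2) may be preferable when most values are already distinct.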

jitter2 is a generic method for jittering, which does not add random noise. It retains unique values and ranks, and randomly spreads duplicate values at equidistant positions within limits of enclosing values. jitter2 is especially useful for numeric variables with discrete values, like rating scales. Missing values are allowed and are returned. Currently implemented methods are jitter2.default for vectors and jitter2.data.frame which returns a data.frame with each numeric column jittered.
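A much simplified sketch of this idea in base R follows. It is not the jitter2 algorithm: the function name `spread_ties` is hypothetical, the `limit`, `eps`, and `presorted` arguments are not modelled, and the handling of the smallest and largest values (reusing the gap on the one existing side) is an assumption made for illustration.

```r
# Hypothetical, simplified illustration of spreading ties without random noise.
# Not the Hmisc jitter2 implementation.
spread_ties <- function(x, fill = 1/3) {
  out <- x
  v <- sort(unique(x[!is.na(x)]))          # distinct non-missing values
  if (length(v) < 2) return(out)           # no enclosing values to spread within
  gaps <- diff(v)
  for (i in seq_along(v)) {
    idx <- which(x == v[i])                # positions of this value (NAs dropped)
    if (length(idx) < 2) next              # unique values are left untouched
    below <- if (i > 1) gaps[i - 1] else gaps[i]
    above <- if (i < length(v)) gaps[i] else gaps[i - 1]
    # Equidistant positions inside a fraction `fill` of the gap on each side
    pos <- seq(v[i] - fill * below, v[i] + fill * above,
               length.out = length(idx))
    out[idx] <- pos[sample.int(length(idx))]  # assign positions in random order
  }
  out
}

x <- c(0:3, rep(4, 3), 5, rep(7, 10), 9, NA)
spread_ties(x)   # singletons and the NA are returned as they were
```

With fill below 1/2 the spread ranges of neighbouring values cannot overlap, so the ordering between distinct values is retained.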

datadensity is a generic method used to show data densities in more complex situations. Here, another datadensity method is defined for data frames. Depending on the which argument, some or all of the variables in a data frame will be displayed, with scat1d used to display continuous variables and, by default, bars used to display frequencies of categorical, character, or discrete numeric variables. For such variables, when the total length of value labels exceeds 200, only the first few characters from each level are used. By default, datadensity.data.frame will construct one axis (i.e., one strip) per variable in the data frame. Variable names appear to the left of the axes, and the number of missing values (if greater than zero) appears to the right of the axes. An optional group variable can be used for stratification, where the different strata are depicted using different colors. If the q vector is specified, the desired quantiles (over all groups) are displayed with solid triangles below each axis.

When the sample size exceeds 2000 (this value may be modified using the nhistSpike argument), datadensity calls histSpike instead of scat1d to show the data density for numeric variables. This results in a histogram-like display that makes the resulting graphics file much smaller. In this case, datadensity uses the minf argument (see below) so that very infrequent data values will not be lost on the variable's axis, although this will slightly distort the histogram.

histSpike is another method for showing a high-resolution data distribution that is particularly good for very large datasets (say n > 1000). By default, histSpike bins the continuous x variable into 100 equal-width bins and then computes the frequency counts within bins (if n does not exceed 10, no binning is done). If add=FALSE (the default), the function displays either proportions or frequencies as in a vertical histogram. Instead of bars, spikes are used to depict the frequencies. If add=TRUE, the function assumes you are adding small density displays that are intended to take up a small amount of space in the margins of the overall plot. The frac argument is used as with scat1d to determine the relative length of the whole plot that is used to represent the maximum frequency. No jittering is done by histSpike.
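The binning step can be illustrated with a short base R sketch. It only mirrors the description above (equal-width bins, no binning for n of 10 or fewer, spikes rather than bars) and is not the histSpike code; the helper name `spike_counts` and the use of bin midpoints as spike positions are assumptions.

```r
# Hypothetical helper: equal-width binning for a spike display; not Hmisc code.
spike_counts <- function(x, nint = 100, type = c("proportion", "count")) {
  type <- match.arg(type)
  x <- x[!is.na(x)]
  if (length(x) <= 10 || min(x) == max(x)) {
    # Small (or constant) samples: tabulate the raw values without binning
    pos <- sort(unique(x))
    f   <- tabulate(match(x, pos))
  } else {
    # nint equal-width bins spanning the observed range of x
    breaks <- seq(min(x), max(x), length.out = nint + 1)
    f   <- as.vector(table(cut(x, breaks, include.lowest = TRUE)))
    pos <- (breaks[-1] + breaks[-length(breaks)]) / 2   # bin midpoints
  }
  keep   <- f > 0                                       # drop empty bins
  height <- if (type == "proportion") f / sum(f) else f
  data.frame(x = pos[keep], height = height[keep])
}

set.seed(2)
s <- spike_counts(rnorm(5000))
# type = "h" draws one vertical line per non-empty bin: spikes instead of bars
plot(s$x, s$height, type = "h", xlab = "x", ylab = "Proportion")
```

Because at most nint spikes are drawn regardless of sample size, the display stays compact for large n, which is consistent with the smaller graphics files mentioned above.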

histSpike can also graph a kernel density estimate for x, or add a small density curve to any of 4 sides of an existing plot. When y or curve is specified, the density or spikes are drawn with respect to the curve rather than the x-axis.

histSpikeg is similar to histSpike but is for adding layers to a ggplot2 graphics object or traces to a plotly object. histSpikeg can also add lowess curves to the plot.

### Usage

    scat1d(x, side=3, frac=0.02, jitfrac=0.008, tfrac,
           eps=ifelse(preserve,0,.001),
           lwd=0.1, col=par("col"),
           y=NULL, curve=NULL,
           bottom.align=FALSE,
           preserve=FALSE, fill=1/3, limit=TRUE, nhistSpike=2000, nint=100,
           type=c('proportion','count','density'), grid=FALSE, ...)

    jitter2(x, ...)

    ## Default S3 method:
    jitter2(x, fill=1/3, limit=TRUE, eps=0,
            presorted=FALSE, ...)

    ## S3 method for class 'data.frame'
    jitter2(x, ...)

    datadensity(object, ...)

    ## S3 method for class 'data.frame'
    datadensity(object, group,
                which=c("all","continuous","categorical"),
                method.cat=c("bar","freq"),
                col.group=1:10,
                n.unique=10, show.na=TRUE, nint=1, naxes,
                q, bottom.align=nint>1,
                cex.axis=sc(.5,.3), cex.var=sc(.8,.3),
                lmgp=NULL, tck=sc(-.009,-.002),
                ranges=NULL, labels=NULL, ...)
    # sc(a,b) means default to a if number of axes <= 3, b if >=50, use
    # linear interpolation within 3-50

    histSpike(x, side=1, nint=100, frac=.05, minf=NULL, mult.width=1,
              type=c('proportion','count','density'),
              xlim=range(x), ylim=c(0,max(f)), xlab=deparse(substitute(x)),
              ylab=switch(type, proportion='Proportion',
                                count     ='Frequency',
                                density   ='Density'),
              y=NULL, curve=NULL, add=FALSE,
              bottom.align=type=='density', col=par('col'), lwd=par('lwd'),
              grid=FALSE, ...)

    histSpikeg(formula=NULL, predictions=NULL, data, plotly=NULL,
               lowess=FALSE, xlim=NULL, ylim=NULL,
               side=1, nint=100,
               frac=function(f) 0.01 + 0.02*sqrt(f-1)/sqrt(max(f,2)-1),
               span=3/4, histcol='black', showlegend=TRUE)

### Details

For scat1d the length of line segments used is frac*min(par()$pin)/par()$uin[opp] data units, where opp is the index of the opposite axis and frac defaults to .02. Assumes that plot has already been called. Current par("usr") is used to determine the range of data for the axis of the current plot. This range is used in jittering and in constructing line segments.
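As a worked illustration of the segment-length formula, the sketch below computes the tick length in data units from explicit plot dimensions. The helper name `tick_length` is hypothetical. It assumes that uin means inches per data unit on each axis, derived as pin divided by the axis range in usr, and that for ticks on the bottom or top (sides 1 and 3) the opposite axis is y.

```r
# Hypothetical helper; not Hmisc code.
# Assumption: uin = plot size in inches / axis range in data units.
tick_length <- function(frac = 0.02, side = 3,
                        usr = par("usr"), pin = par("pin")) {
  uin <- pin / c(usr[2] - usr[1], usr[4] - usr[3])  # inches per data unit (x, y)
  # Ticks on sides 1 and 3 extend along y, so the opposite axis index is 2;
  # ticks on sides 2 and 4 extend along x, so the index is 1.
  opp <- if (side %in% c(1, 3)) 2 else 1
  frac * min(pin) / uin[opp]
}

# A 5 x 4 inch plot region with x in [0, 10] and y in [0, 100]:
# ticks on the top side are 0.02 * 4 = 0.08 inches long,
# which is about 2 units on the y scale (0.04 inches per y unit).
tick_length(frac = 0.02, side = 3, usr = c(0, 10, 0, 100), pin = c(5, 4))
```

Passing usr and pin explicitly keeps the example independent of an open graphics device; with the defaults, the values come from the current plot, which is why plot must already have been called.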

### Value

histSpike returns the actual range of x used in its binning.

### Side Effects

scat1d adds line segments to plot. datadensity.data.frame draws a complete plot. histSpike draws a complete plot or adds to an existing plot.

### Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
Nashville TN, USA
f.harrell@vanderbilt.edu

Martin Maechler (improved scat1d)
Seminar fuer Statistik
ETH Zurich SWITZERLAND
maechler@stat.math.ethz.ch

Jens Oehlschlaegel-Akiyoshi (wrote jitter2)
Center for Psychotherapy Research
Christian-Belser-Strasse 79a
D-70597 Stuttgart Germany
oehl@psyres-stuttgart.de

### Examples

    library(Hmisc)

    plot(x <- rnorm(50), y <- 3*x + rnorm(50)/2 )
    scat1d(x)                 # density bars on top of graph
    scat1d(y, 4)              # density bars at right
    histSpike(x, add=TRUE)    # histogram instead, 100 bins
    histSpike(y, 4, add=TRUE)
    histSpike(x, type='density', add=TRUE)  # smooth density at bottom
    histSpike(y, 4, type='density', add=TRUE)

    smooth <- lowess(x, y)    # add nonparametric regression curve
    lines(smooth)             # Note: plsmo() does this
    scat1d(x, y=approx(smooth, xout=x)$y) # data density on curve
    scat1d(x, curve=smooth)   # same effect as previous command
    histSpike(x, curve=smooth, add=TRUE) # same as previous but with histogram
    histSpike(x, curve=smooth, type='density', add=TRUE)
    # same but smooth density over curve

    plot(x <- rnorm(250), y <- 3*x + rnorm(250)/2)
    scat1d(x, tfrac=0)        # dots randomly spaced from axis
    scat1d(y, 4, frac=-.03)   # bars outside axis
    scat1d(y, 2, tfrac=.2)    # same bars with smaller random fraction

    x <- c(0:3,rep(4,3),5,rep(7,10),9)
    plot(x, jitter2(x))       # original versus jittered values
    abline(0,1)               # unique values unjittered on abline
    points(x+0.1, jitter2(x, limit=FALSE), col=2)
                              # allow locally maximum jittering
    points(x+0.2, jitter2(x, fill=1), col=3); abline(h=seq(0.5,9,1), lty=2)
                              # fill 3/3 instead of 1/3

    x <- rnorm(200,0,2)+1; y <- x^2
    x2 <- round((x+rnorm(200))/2)*2
    x3 <- round((x+rnorm(200))/4)*4
    dfram <- data.frame(y,x,x2,x3)
    plot(dfram$x2, dfram$y)   # jitter2 via scat1d
    scat1d(dfram$x2, y=dfram$y, preserve=TRUE, col=2)
    scat1d(dfram$x2, preserve=TRUE, frac=-0.02, col=2)
    scat1d(dfram$y, 4, preserve=TRUE, frac=-0.02, col=2)

    pairs(jitter2(dfram))     # pairs for jittered data.frame
    # This gets reasonable pairwise scatter plots for all combinations of
    # variables where
    #
    # - continuous variables (with unique values) are not jittered at all, thus
    #   all relations between continuous variables are shown as they are,
    #   extreme values have exact positions.
    #
    # - discrete variables get a reasonable amount of jittering, whether they
    #   have 2, 3, 5, 10, 20, ... levels
    #
    # - different from adding noise, jitter2() will use the available space
    #   optimally and no value will randomly mask another
    #
    # If you want a scatterplot with lowess smooths on the *exact* values and
    # the point clouds shown jittered, you just need
    #
    pairs( dfram, panel=function(x,y) { points(jitter2(x),jitter2(y))
                                        lines(lowess(x,y)) } )

    datadensity(dfram)        # graphical snapshot of entire data frame
    datadensity(dfram, group=cut2(dfram$x2,g=3))
                              # stratify points and frequencies by
                              # x2 tertiles and use 3 colors

    # datadensity.data.frame(split(x, grouping.variable))
    # need to explicitly invoke datadensity.data.frame when the
    # first argument is a list

    ## Not run:
    require(rms)
    f <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
             data=d)
    p <- Predict(f, cholesterol, sex)
    g <- ggplot(p, aes(x=cholesterol, y=yhat, color=sex)) + geom_line() +
      xlab(xl2) + ylim(-1, 1)
    # show.legend replaces the deprecated show_guide argument in ggplot2
    g <- g + geom_ribbon(data=p, aes(ymin=lower, ymax=upper), alpha=0.2,
                         linetype=0, show.legend=FALSE)
    g + histSpikeg(yhat ~ cholesterol + sex, p, d)

    # colors <- c('red', 'blue')
    # p <- plot_ly(x=x, y=y, color=g, colors=colors, mode='markers')
    # histSpikep(p, x, y, z, color=g, colors=colors)

    ## End(Not run)

Hmisc documentation built on May 20, 2017, 5:52 a.m.