Does a hierarchical cluster analysis on variables, using the Hoeffding
D statistic, squared Pearson or Spearman correlations, or proportion
of observations for which two variables are both positive as similarity
measures. Variable clustering is used for assessing collinearity,
redundancy, and for separating variables into clusters that can be
scored as a single variable, thus resulting in data reduction. For
computing any of the three similarity measures, pairwise deletion of
NAs is done. The clustering is done by hclust(). A small function
naclus is also provided which depicts similarities in which
observations are missing for variables in a data frame. The
similarity measure is the fraction of NAs in common between any two
variables. The diagonals of this sim matrix are the fraction of NAs
in each variable by itself. naclus also computes na.per.obs, the
number of missing variables in each observation, and mean.na, a
vector whose ith element is the mean number of missing variables other
than variable i, for observations in which variable i is missing. The
naplot function makes several plots (see the which argument).
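The missingness similarity described above can be sketched in base R. This is only an illustration under stated assumptions: the data frame is made up, the names na_ind, sim, na_per_obs and mean_na are hypothetical, and returning NA for a variable with no missing values is a choice made here, not necessarily what naclus does.

```r
# Illustrative data frame with missing values (made up for this sketch)
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(NA, 2, NA, 4),
                 c = c(NA, NA, 3, 4))

# Numeric indicator matrix: 1 where a value is missing, 0 otherwise
na_ind <- is.na(df) * 1
n <- nrow(df)

# Off-diagonals: fraction of observations missing on both variables.
# Diagonals: fraction of observations missing on each variable by itself.
sim <- crossprod(na_ind) / n

# Number of missing variables in each observation
na_per_obs <- rowSums(na_ind)

# For each variable i: among observations where i is missing, the mean
# number of OTHER variables that are also missing (NA if i is never
# missing; that convention is an assumption of this sketch)
mean_na <- sapply(seq_len(ncol(na_ind)), function(i) {
  miss_i <- na_ind[, i] == 1
  if (!any(miss_i)) return(NA_real_)
  mean(na_per_obs[miss_i] - 1)
})
names(mean_na) <- colnames(na_ind)

print(sim)
print(na_per_obs)
print(mean_na)
```

A matrix like sim could then be clustered with hclust(as.dist(1 - sim)) in the same way as a correlation-based similarity matrix.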
So as to not generate too many dummy variables for multivalued
character or categorical predictors, varclus will automatically
combine infrequent cells of such variables using an auxiliary
function combine.levels that is defined here. If all values of
x are NA, combine.levels returns a numeric vector that is all NA.
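The idea of pooling infrequent cells can be sketched in base R as follows. This is a simplified illustration, not the package's combine.levels: the function name combine_rare is hypothetical, the "OTHER" label is taken from the example output on this page, and the sketch only handles the case where two or more cells fall below the minlev proportion (a single rare cell is left unchanged here, which differs from the package output shown on this page).

```r
# Simplified sketch: pool all levels whose proportion is below minlev
combine_rare <- function(x, minlev = 0.05, other = "OTHER") {
  # All-NA input: return a numeric vector that is all NA
  if (all(is.na(x))) return(rep(NA_real_, length(x)))
  f <- factor(x)
  prop <- prop.table(table(f))            # proportions among non-missing values
  rare <- names(prop)[as.vector(prop) < minlev]
  # Simplification (assumption): do nothing unless 2+ cells are rare
  if (length(rare) < 2) return(f)
  lev <- levels(f)
  lev[lev %in% rare] <- other
  levels(f) <- lev                        # duplicated level names are merged
  f
}

x <- c(1, 2, rep(3, 20))                  # levels 1 and 2 each have 1/22 < 0.05
r <- combine_rare(x)
print(levels(r))
print(table(r))
```

Assigning a levels vector that contains duplicates is the base R idiom for merging factor levels, which keeps the sketch short.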
plotMultSim plots multiple similarity matrices, with the similarity
measure being on the x-axis of each subplot.
na.pattern prints a frequency table of all combinations of
missingness for multiple variables. If there are 3 variables, a
frequency table entry labeled 110 corresponds to the number of
observations for which the first and second variables were missing but
the third variable was not missing.
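A minimal base R sketch of such a missingness-pattern table is given below. The data frame and the names pattern and freq are made up for illustration, and this is not claimed to be how na.pattern itself is implemented.

```r
# Illustrative data: a and b are sometimes missing, c never is
df <- data.frame(a = c(1, NA, NA, 4),
                 b = c(1, NA, NA, NA),
                 c = c(1, 2, 3, 4))

# One character per variable, in column order: "1" if missing, "0" if observed
pattern <- apply(is.na(df) * 1, 1, paste, collapse = "")

# Frequency of each observed pattern
freq <- table(pattern)
print(freq)
```

Here the entry labeled 110 counts the two observations in which a and b are missing and c is present.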
varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos","ccbothpos"),
        type=c("data.matrix","similarity.matrix"),
        method="complete",
        data=NULL, subset=NULL, na.action=na.retain,
        trans=c("square", "abs", "none"), ...)

## S3 method for class 'varclus'
print(x, abbrev=FALSE, ...)

## S3 method for class 'varclus'
plot(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)

naclus(df, method)

naplot(obj, which=c('all','na per var','na per obs','mean na',
                    'na per var vs mean na'), ...)

combine.levels(x, minlev=.05)

plotMultSim(s, x=1:dim(s)[3],
            slim=range(pretty(c(0,max(s,na.rm=TRUE)))),
            slimds=FALSE,
            add=FALSE, lty=par('lty'), col=par('col'),
            lwd=par('lwd'), vname=NULL, h=.5, w=.75, u=.05,
            labelx=TRUE, xspace=.35)

na.pattern(x)

x 
a formula,
a numeric matrix of predictors, or a similarity matrix. If For 
df 
a data frame 
s 
an array of similarity matrices. The third dimension of this array
corresponds to different computations of similarities. The first two
dimensions come from a single similarity matrix. This is useful for
displaying similarity matrices computed by 
similarity 
the default is to use squared Spearman correlation coefficients, which
will detect monotonic but nonlinear relationships. You can also
specify linear correlation or Hoeffding's (1948) D statistic, which
has the advantage of being sensitive to many types
of dependence, including highly nonmonotonic relationships. For
binary data, or data to be made binary, 
type 
if 
method 
see hclust
data 

subset 

na.action 
These may be specified if 
trans 
By default, when the similarity measure is based on
Pearson's or Spearman's correlation coefficients, the coefficients are
squared. Specify 
... 
for 
ylab 
y-axis label. Default is constructed on the basis of 
legend. 
set to 
loc 
a list with elements 
maxlen 
if a legend is plotted describing abbreviations, original labels
longer than 
labels 
a vector of character strings containing labels corresponding to columns in the similarity matrix, if the column names of that matrix are not to be used 
obj 
an object created by naclus
which 
defaults to 
minlev 
the minimum proportion of observations in a cell before that cell is
combined with one or more cells. If more than one cell has fewer than
minlev*n observations, all such cells are combined into a new cell
labeled 
abbrev 
set to 
slim 
2-vector specifying the range of similarity values for scaling the
y-axes. By default this is the observed range over all of 
slimds 
set to 
add 
set to 
lty 

col 

lwd 
line type, color, or line thickness for 
vname 
optional vector of variable names, in order, used in 
h 
relative height for subplot 
w 
relative width for subplot 
u 
relative extra height and width to leave unused inside the subplot. Also used as the space between y-axis tick mark labels and graph border. 
labelx 
set to 
xspace 
amount of space, on a scale of 1: 
options(contrasts= c("contr.treatment", "contr.poly")) is issued
temporarily by varclus to make sure that ordinary dummy variables
are generated for factor variables. Pass arguments to the
dataframeReduce function to remove problematic variables
(especially if analyzing all variables in a data frame).
for varclus or naclus, a list of class varclus with elements
call (containing the calling statement), sim (similarity matrix),
n (sample size used if x was not a correlation matrix already -
n is a matrix), hclust, the object created by hclust,
similarity, and method. naclus also returns the two vectors listed
under description, and naplot returns an invisible vector that is the
frequency table of the number of missing variables per observation.
plotMultSim invisibly returns the limits of similarities used in
constructing the y-axes of each subplot. For similarity="ccbothpos"
the hclust object is NULL.
na.pattern creates an integer vector of frequencies.
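As a rough illustration of what the sim and hclust elements hold under the default settings, the following base R sketch computes squared Spearman correlations with pairwise deletion of NAs and clusters on one minus the similarity. It uses cor() as a stand-in for whatever varclus calls internally, and the names sim and hc are hypothetical, so treat it as an approximation rather than the package's implementation.

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x  <- cbind(x1, x2, x3, x4)

# Similarity: squared Spearman correlation, using all complete pairs
sim <- cor(x, method = "spearman", use = "pairwise.complete.obs")^2

# Hierarchical clustering on the dissimilarity 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "complete")

print(round(sim, 2))
print(hc$merge)   # negative entries are original variables, in column order
# plot(hc) would draw the dendrogram
```

Squaring the correlations makes strongly negative and strongly positive associations count as equally similar, which is usually what is wanted when looking for redundant predictors.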
plots
Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com
Sarle, WS: The VARCLUS Procedure. SAS/STAT User's Guide, 4th Edition, 1990. Cary NC: SAS Institute, Inc.
Hoeffding W. (1948): A nonparametric test of independence. Ann Math Stat 19:546–57.
hclust, plclust, hoeffd, rcorr, cor, model.matrix, locator, na.pattern
library(Hmisc)
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1,x2,x3,x4)
v <- varclus(x, similarity="spear") # spearman is the default anyway
v # invokes print.varclus
print(round(v$sim,2))
plot(v)
# plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
# the -1 causes k dummies to be generated for k countries
# plot(varclus(~ age + factor(disease.code) - 1))
#
#
# use varclus(~., data= fracmiss= maxlevels= minprev=) to analyze all
# "useful" variables - see dataframeReduce for details about arguments
df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
                 e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
par(mfrow=c(2,2))
for(m in c("ward","complete","median")) {
  plot(naclus(df, method=m))
  title(m)
}
naplot(naclus(df))
n <- naclus(df)
plot(n); naplot(n)
na.pattern(df) # built-in function
x <- c(1, rep(2,11), rep(3,9))
combine.levels(x)
x <- c(1, 2, rep(3,20))
combine.levels(x)
# plotMultSim example: Plot proportion of observations
# for which two variables are both positive (diagonals
# show the proportion of observations for which the
# one variable is positive). Chance-correct the
# off-diagonals by subtracting the product of the
# marginal proportions. On each subplot the x-axis
# shows month (0, 4, 8, 12) and there is a separate
# curve for females and males
d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                month=sample(c(0,4,8,12),1000,TRUE),
                x1=sample(0:1,1000,TRUE),
                x2=sample(0:1,1000,TRUE),
                x3=sample(0:1,1000,TRUE))
s <- array(NA, c(3,3,4))
opar <- par(mar=c(0,0,4.1,0)) # waste less space
for(sx in c('female','male')) {
  for(i in 1:4) {
    mon <- (i-1)*4
    s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                      subset=d$month==mon & d$sex==sx)$sim
  }
  plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
              add=sx=='male', slimds=TRUE,
              lty=1+(sx=='male'))
  # slimds=TRUE causes separate scaling for diagonals and
  # off-diagonals
}
par(opar)

Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, round.POSIXt, trunc.POSIXt, units
varclus(x = x, similarity = "spear")
Similarity matrix (Spearman rho^2)
     x1   x2   x3   x4
x1 1.00 0.00 0.26 0.00
x2 0.00 1.00 0.26 0.42
x3 0.26 0.26 1.00 0.12
x4 0.00 0.42 0.12 1.00
No. of observations used for each pair:
    x1  x2  x3  x4
x1 200 200 200 200
x2 200 200 200 200
x3 200 200 200 200
x4 200 200 200 200
hclust results (method=complete)
Call:
hclust(d = as.dist(1 - x), method = method)
Cluster method   : complete
Number of objects: 4
     x1   x2   x3   x4
x1 1.00 0.00 0.26 0.00
x2 0.00 1.00 0.26 0.42
x3 0.26 0.26 1.00 0.12
x4 0.00 0.42 0.12 1.00
The "ward" method has been renamed to "ward.D"; note new "ward.D2"
pattern
00000111 00011101 00100100
       1        1        1
[1] OTHER 2 2 2 2 2 2 2 2 2 2 2
[13] OTHER OTHER OTHER OTHER OTHER OTHER OTHER OTHER OTHER
Levels: OTHER 2
[1] OTHER OTHER 3 3 3 3 3 3 3 3 3 3
[13] 3 3 3 3 3 3 3 3 3 3
Levels: OTHER 3
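The chance-corrected both-positive similarity described in the comments of the plotMultSim example can be sketched in base R as follows. This is an illustration only: the data are simulated, the names pos, both, p and cc are hypothetical, missing values are assumed absent (so no pairwise deletion is shown), and it is not claimed to match the internals of similarity="ccbothpos".

```r
set.seed(2)
x <- cbind(x1 = sample(0:1, 500, TRUE),
           x2 = sample(0:1, 500, TRUE),
           x3 = sample(0:1, 500, TRUE))

# Indicator of a positive value for each variable
pos <- (x > 0) * 1

# Proportion of observations in which both variables are positive;
# the diagonal holds each variable's marginal proportion positive
both <- crossprod(pos) / nrow(pos)
p <- diag(both)

# Chance correction: subtract the product of the marginal proportions
# from the off-diagonals, and keep the marginals on the diagonal
cc <- both - outer(p, p)
diag(cc) <- p

print(round(cc, 3))
```

For independent variables the off-diagonal entries of cc should be close to zero, which is why the example on this page scales diagonals and off-diagonals separately with slimds=TRUE.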