curveRep  R Documentation 
curveRep
finds representative curves from a
relatively large collection of curves. The curves usually represent
timeresponse profiles as in serial (longitudinal or repeated) data
with possibly unequal time points and greatly varying sample sizes per
subject. After excluding records containing missing x
or
y
, records are first stratified into kn
groups having similar
sample sizes per curve (subject). Within these strata, curves are
next stratified according to the distribution of x
points per
curve (typically measurement times per subject). The
clara
clustering/partitioning function is used
to do this, clustering on one, two, or three x
characteristics
depending on the minimum sample size in the current interval of sample
size. If the interval has a minimum number of unique values
of
one, clustering is done on the single x
values. If the minimum
number of unique x
values is two, clustering is done to create
groups that are similar on both min(x)
and max(x)
. For
groups containing no fewer than three unique x
values,
clustering is done on the trio of values min(x)
, max(x)
,
and the longest gap between any successive x
. Then within
sample size and x
distribution strata, clustering of
timeresponse profiles is based on p
values of y
all
evaluated at the same p
equallyspaced x
's within the
stratum. An option allows percurve data to be smoothed with
lowess
before proceeding. Outer x
values are
taken as extremes of x
across all curves within the stratum.
Linear interpolation within curves is used to estimate y
at the
grid of x
's. For curves within the stratum that do not extend
to the most extreme x
values in that stratum, extrapolation
uses flat lines from the observed extremes in the curve unless
extrap=TRUE
. The p
y
values are clustered using
clara
.
print
and plot
methods show results. By specifying an
auxiliary idcol
variable to plot
, other variables such
as treatment may be depicted to allow the analyst to determine for
example whether subjects on different treatments are assigned to
different timeresponse profiles. To write the frequencies of a
variable such as treatment in the upper left corner of each panel
(instead of the grand total number of clusters in that panel), specify
freq
.
curveSmooth
takes a set of curves and smooths them using
lowess
. If the number of unique x
points in a curve is
less than p
, the smooth is evaluated at the unique x
values. Otherwise it is evaluated at an equally spaced set of
x
points over the observed range. If fewer than 3 unique
x
values are in a curve, those points are used and smoothing is not done.
curveRep(x, y, id, kn = 5, kxdist = 5, k = 5, p = 5,
force1 = TRUE, metric = c("euclidean", "manhattan"),
smooth=FALSE, extrap=FALSE, pr=FALSE)
## S3 method for class 'curveRep'
print(x, ...)
## S3 method for class 'curveRep'
plot(x, which=1:length(res),
method=c('all','lattice','data'),
m=NULL, probs=c(.5, .25, .75), nx=NULL, fill=TRUE,
idcol=NULL, freq=NULL, plotfreq=FALSE,
xlim=range(x), ylim=range(y),
xlab='x', ylab='y', colorfreq=FALSE, ...)
curveSmooth(x, y, id, p=NULL, pr=TRUE)
x 
a numeric vector, typically measurement times.
For 
y 
a numeric vector of response values 
id 
a vector of curve (subject) identifiers, the same length as

kn 
number of curve sample size groups to construct.

kxdist 
maximum number of xdistribution clusters to derive
using 
k 
maximum number of xy profile clusters to derive using 
p 
number of 
force1 
By default if any curves have only one point, all curves
consisting of one point will be placed in a separate stratum. To
prevent this separation, set 
metric 
see 
smooth 
By default, linear interpolation is used on raw data to
obtain 
extrap 
set to 
pr 
set to 
which 
an integer vector specifying which sample size intervals
to plot. Must be specified if 
method 
The default makes individual plots of possibly all
xdistribution by sample size by cluster combinations. Fewer may be
plotted by specifying 
m 
the number of curves in a cluster to randomly sample if there
are more than 
nx 
applies if 
probs 
3vector of probabilities with the central quantile first. Default uses quartiles. 
fill 
for 
idcol 
a named vector to be used as a table lookup for color
assignments (does not apply when 
freq 
a named vector to be used as a table lookup for a grouping
variable such as treatment. The names are curve 
plotfreq 
set to 
colorfreq 
set to 
xlim, ylim, xlab, ylab 
plotting parameters. Default ranges are
the ranges in the entire set of raw data given to 
... 
arguments passed to other functions. 
In the graph titles for the default graphic output, n
refers to the
minimum sample size, x
refers to the sequential xdistribution
cluster, and c
refers to the sequential xy profile cluster. Graphs
from method = "lattice"
are produced by
xyplot
and in the panel titles
distribution
refers to the xdistribution stratum and
cluster
refers to the xy profile cluster.
a list of class "curveRep"
with the following elements
res 
a hierarchical list first split by sample size intervals,
then by x distribution clusters, then containing a vector of cluster
numbers with 
ns 
a table of frequencies of sample sizes per curve after
removing 
nomit 
total number of records excluded due to 
missfreq 
a table of frequencies of number of 
ncuts 
cut points for sample size intervals 
kn 
number of sample size intervals 
kxdist 
number of clusters on x distribution 
k 
number of clusters of curves within sample size and distribution groups 
p 
number of points at which to evaluate each curve for clustering 
x 

y 

id 
input data after removing 
curveSmooth
returns a list with elements x,y,id
.
The references describe other methods for deriving
representative curves, but those methods were not used here. The last
reference which used a cluster analysis on principal components
motivated curveRep
however. The kml
package does kmeans clustering of longitudinal data with imputation.
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Segal M. (1994): Representative curves for longitudinal data via regression trees. J Comp Graph Stat 3:214233.
Jones MC, Rice JA (1992): Displaying the important features of large collections of similar curves. Am Statistician 46:140145.
Zheng X, Simpson JA, et al (2005): Data from a study of effectiveness suggested potential prognostic factors related to the patterns of shoulder pain. J Clin Epi 58:823830.
clara
,dataRep
## Not run:
# Simulate 200 curves with percurve sample sizes ranging from 1 to 10
# Make curves with oddnumbered IDs have an xdistribution that is random
# uniform [0,1] and those with evennumbered IDs have an xdist. that is
# half as wide but still centered at 0.5. Shift y values higher with
# increasing IDs
set.seed(1)
N < 200
nc < sample(1:10, N, TRUE)
id < rep(1:N, nc)
x < y < id
for(i in 1:N) {
x[id==i] < if(i %% 2) runif(nc[i]) else runif(nc[i], c(.25, .75))
y[id==i] < i + 10*(x[id==i]  .5) + runif(nc[i], 10, 10)
}
w < curveRep(x, y, id, kxdist=2, p=10)
w
par(ask=TRUE, mfrow=c(4,5))
plot(w) # show everything, profiles going across
par(mfrow=c(2,5))
plot(w,1) # show n=1 results
# Use a color assignment table, assigning low curves to green and
# high to red. Unique curve (subject) IDs are the names of the vector.
cols < c(rep('green', N/2), rep('red', N/2))
names(cols) < as.character(1:N)
plot(w, 3, idcol=cols)
par(ask=FALSE, mfrow=c(1,1))
plot(w, 1, 'lattice') # show n=1 results
plot(w, 3, 'lattice') # show n=45 results
plot(w, 3, 'lattice', idcol=cols) # same but different color mapping
plot(w, 3, 'lattice', m=1) # show a single "representative" curve
# Show median, 10th, and 90th percentiles of supposedly representative curves
plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9))
# Same plot but with much less grouping of x variable
plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9), nx=2)
# Use ggplot2 for one sample size interval
z < plot(w, 2, 'data')
require(ggplot2)
ggplot(z, aes(x, y, color=curve)) + geom_line() +
facet_grid(distribution ~ cluster) +
theme(legend.position='none') +
labs(caption=z$ninterval[1])
# Smooth data before profiling. This allows later plotting to plot
# smoothed representative curves rather than raw curves (which
# specifying smooth=TRUE to curveRep would do, if curveSmooth was not used)
d < curveSmooth(x, y, id)
w < with(d, curveRep(x, y, id))
# Example to show that curveRep can cluster profiles correctly when
# there is no noise. In the data there are four profiles  flat, flat
# at a higher mean y, linearly increasing then flat, and flat at the
# first height except for a sharp triangular peak
set.seed(1)
x < 0:100
m < length(x)
profile < matrix(NA, nrow=m, ncol=4)
profile[,1] < rep(0, m)
profile[,2] < rep(3, m)
profile[,3] < c(0:3, rep(3, m4))
profile[,4] < c(0,1,3,1,rep(0,m4))
col < c('black','blue','green','red')
matplot(x, profile, type='l', col=col)
xeval < seq(0, 100, length.out=5)
s < x
matplot(x[s], profile[s,], type='l', col=col)
id < rep(1:100, each=m)
X < Y < id
cols < character(100)
names(cols) < as.character(1:100)
for(i in 1:100) {
s < id==i
X[s] < x
j < sample(1:4,1)
Y[s] < profile[,j]
cols[i] < col[j]
}
table(cols)
yl < c(1,4)
w < curveRep(X, Y, id, kn=1, kxdist=1, k=4)
plot(w, 1, 'lattice', idcol=cols, ylim=yl)
# Found 4 clusters but two have same profile
w < curveRep(X, Y, id, kn=1, kxdist=1, k=3)
plot(w, 1, 'lattice', idcol=cols, freq=cols, plotfreq=TRUE, ylim=yl)
# Incorrectly combined black and red because default value p=5 did
# not result in different profiles at x=xeval
w < curveRep(X, Y, id, kn=1, kxdist=1, k=4, p=40)
plot(w, 1, 'lattice', idcol=cols, ylim=yl)
# Found correct clusters because evaluated curves at 40 equally
# spaced points and could find the sharp triangular peak in profile 4
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.