knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(andrews)
The andrews
routine, implemented by Jaroslav Myslivec, has been rewritten and extended. The old routine is still available as andrews0
and the rewritten routine should give the same graphics with the same parameter values.
op <- par(mfrow=c(1,2)) andrews0(iris, main="andrews0") andrews(iris, main="andrews") par(op)
op <- par(mfrow=c(1,2)) andrews0(iris, main="andrews0") andrews(iris, main="andrews") par(op)
Some more types of curves has been added. To see more comparisons of andrews
and andrews0
, run interactively
zzz()
Full parameter list for andrews
is:
andrews (df, type = 1, clr = NULL, step = 100, ymax = 10, # old parameters alpha = NULL, palcol = NULL, lwd = 1, lty = "solid", ...) # new parameters
The parameters in the call to andrews
that come in ...
are passed to plot
to draw the base window:
x
, and y
cannot be set by the user through ...
.main
, sub
, xlim
, ylim
, xlab
, ylab
, axes
, asp
and possibly other parameters.Note that the setting of ylim
has priority over the setting of ymax
.
The height of the window can be scaled by setting ymax
or ylim
.
ymax
is not set then ylim=c(-10,10)
is used.ymax
is set to NA
, then ylim=c(-lim,lim)
, where the maximum height lim
is calculated from the curves. Note that this option doubles the calculation time.ylim
directly.op<-par(mfrow=c(1,3)) andrews(iris, main="no ymax") andrews(iris, ymax=NA, main="ymax=NA") andrews(iris, ylim=c(-1,3), main="ylim=c(-1,3)") par(op)
op<-par(mfrow=c(1,3)) andrews(iris, main="no ymax") andrews(iris, ymax=NA, main="ymax=NA") andrews(iris, ylim=c(-1,3), main="ylim=c(-1,3)") par(op)
The width of the window can be scaled by setting xlim
, which is particularly useful for type==3
or type==5
.
andrews(iris, type=3, xlim=c(0,6*pi), ymax=NA)
The parameters lty
and lwd
determine the linetype and width. The length of the parameters must be either 1
or nrow(df)
. In the first case the values are applied to each curve, in the second case lty[i]
and lwd[i]
are used for the i
th line.
andrews(iris, ymax=NA, lwd=3)
The curves can be coloured individually if length(clr)==nrow(df)
.
andrews(iris, ymax=NA, clr=rainbow(nrow(iris)))
By default, the curves are coloured by a variable of the data frame.
andrews(iris, ymax=NA, clr=5) # iris$Species
If is.numeric(df[,clr])
is FALSE
then rainbow(nuv)
is used for the colours, where nuv
is the number of unique values (nuv = length(unique(df[,clr])
). You can use palcol
to select another palette that gives nuv
colours.
andrews(iris, ymax=NA, clr=5, palcol=hcl.colors) # iris$Species
If is.numeric(df[,clr])
is TRUE
, the variable color(df[,clr])
(scaled to the interval $[0, 1]$) is used.
The default is function(v) { hsv(1,1,v,1) }
, but any other colouring function can be used.
andrews(iris, ymax=NA, clr=1) # iris$Sepal.Length
andrews(iris, ymax=NA, clr=1, palcol=function(v) { gray(v) }) # iris$Sepal.Length
or alternatively
andrews(iris, ymax=NA, clr=1, palcol=gray) # iris$Sepal.Length
If you plot many curves then the curve details disappear by overplotting. If the output device supports the alpha channel, the parameter alpha
($0<\alpha<1$) is applied to all curves (length(alpha)==1
) or one for each observation (length(alpha)==nrow(df)
).
andrews(iris, ymax=NA, alpha=0.1)
andrews(iris, ymax=NA, clr=5, alpha=0.1, lwd=2)
All types of Andrews curves can be written as
$$ f_i(t) = \sum_{k=1}^p x_{ik} g_k(t). $$
Currently, six types of curves are implemented, which are determined by the parameter type
:
With deftype
the list of types can be queried by
deftype(1)
A curve type consists of a function function(n, t)
which computes a matrix of $(g_1(t), \ldots, g_n(t))$ and a default range for plotting, usually $-\pi$ till $\pi$.
A new curve type and a default plot range can be implemented by number or name:
deftype("sine", xlim = c(-pi, pi), function(n, t) { n <- as.integer(if (n<1) 1 else n) m <- matrix(NA, nrow=length(t), ncol=n) for (i in 1:n) m[,i] <- sin(i*t) m }) andrews(iris, "sine", ymax=NA, clr=5)
The representation of a set of multivariate observations as curves was proposed by Andrews (1972). For a set of orthonormal basis functions $g_i(t)$ with $i=1, 2, 3, \ldots$ in an interval $[a, b]$ it holds:
$$\int_a^b g_i(t) g_j(t) dt = \delta_{ij} = \begin{cases} 1 & \text{ if } i=j \ 0 & \text{ if } i\neq j \end{cases}$$
If one creates from multivariate observations $x_i=(x_{i1}, \ldots, x_{ip})$ with $i=1,2,\ldots,n$ a function
$$ f_i(t) = \sum_{k=1}^p x_{ik} g_k(t) $$
then it follows:
Property 1 - The integral of the squared difference of two curves is equal to the squared Euclidean distance between the data points $x_i$ and $x_j$: \begin{align} \int_a^b (f_i(t)-f_j(t))^2 dt & = \int_a^b \left(\sum_{k=1}^p (x_{ik}- x_{jk}) g_k(t)\right)^2 dt \ &= \sum_{k=1}^p \sum_{l=1}^p (x_{ik}- x_{jk}) (x_{il}- x_{jl}) \underbrace{\int_a^b g_k(t)g_l(t) dt}{= \delta{kl}}\ &= \sum_{k=1}^p (x_{ik}- x_{jk})^2 \end{align}
This makes the Andrews curves useful for visualising multivariate outliers and clusters in the data.
None of the basis functions of the six types of Andrews curves form an orthonormal system, but two (type %in% c(1,2)
) of them form an orthogonal system:
$$\int_{-\pi}^{\pi} g_i(t)g_j(t) dt = \begin{cases} \pi & \text{ if } i=j \ 0 & \text{ if } i\neq j \end{cases}$$ Thus, property 1 changes to
$$ \int_{-\pi}^{\pi} (f_i(t)-f_j(t))^2 dt = \pi \sum_{k=1}^p (x_{ik}- x_{jk})^2$$
Property 2 - The average curve is the mean value of the observations:
\begin{align} \bar{f}(t) &= \frac1n \sum_{i=1}^n f_i(t) =\frac1n \sum_{i=1}^n \sum_{k=1}^p x_{ik} g_k(t)\ &=\sum_{k=1}^p g_k(t) \left(\frac1n \sum_{i=1}^n x_{ik}\right)\ &=\sum_{k=1}^p g_k(t) \bar{x}_k \end{align}
Property 3 - For a given $t$, the variable $P_t=\sum_{k=1}^p X_k w_{kt}$ can be considered as a univariate projection of the variables $X_1$, $X_2$, ..., $X_p$. If $Var(X_k)=\sigma$ and $Cov(X_k,X_l)=\delta_{kl}$ hold then
$$ Var(P_t)= Var\left(\sum_{k=1}^p X_k w_{kt}\right) = \sum_{k=1}^p w_{kt}^2 Var(X_k)$$
For the curves with type %in% c(1,2,6)
$Var(P_t)\approx \sigma^2 p/2$ holds, since $\sin^2(t)+\cos^2(t)=1$. For these curves, however, we only cover a subset of the possible one-dimensional projections.
For type %in% c(3,5)
, the choice of $\sqrt{p_i}$ ensures that all possible one-dimensional projections are used, but the range $[a,b]$ is not fixed. Usually $a=0$ is chosen and $b$ "appropriate" (in practice $b=4\pi$).
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.