cpop: Find the best segmentation of data for a change-in-slope model

View source: R/cpop.R

cpop    R Documentation

Find the best segmentation of data for a change-in-slope model

Description

The CPOP algorithm fits a change-in-slope model to data.

Usage

cpop(
  y,
  x = 1:length(y) - 1,
  grid = x,
  beta = 2 * log(length(y)),
  sd = sqrt(mean(diff(diff(y))^2)/6),
  minseglen = 0,
  prune.approx = FALSE
)

Arguments

y

A vector of length n containing the data.

x

A vector of length n containing the locations of y. Default value is x = 1:length(y) - 1, i.e. evenly spaced locations starting at 0.

grid

An ordered vector of possible locations for the changepoints. Defaults to x, the vector of times/locations of the data points.
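For long series, a coarser grid can substantially reduce computation. A minimal sketch, assuming candidate locations need only be an ordered vector spanning the range of x (the choice of 20 points is arbitrary):

# restrict candidate changepoint locations to 20 evenly spaced points
res.coarse <- cpop(y, x, grid = seq(min(x), max(x), length.out = 20))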

beta

A positive real value for the penalty incurred for adding a changepoint (prevents over-fitting). Default value is 2*log(length(y)).

sd

Estimate of residual standard deviation. Can be a single numerical value or a vector of values for the case of varying standard deviation. Default value is sd = sqrt(mean(diff(diff(y))^2)/6).
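The rationale for this default: for equally spaced locations, the second difference y[i+2] - 2*y[i+1] + y[i] cancels any locally linear signal, and for independent noise with variance sigma^2 it has variance (1 + 4 + 1) * sigma^2 = 6 * sigma^2. A minimal sketch of the computation:

# difference-based noise estimate: second differences remove a locally
# linear signal and have variance 6 * sigma^2 for iid noise
d2 <- diff(diff(y))
sd.hat <- sqrt(mean(d2^2) / 6)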

minseglen

The minimum allowable segment length, i.e. distance between successive changepoints. Default is 0.

prune.approx

Only relevant if a minimum segment length is set. If TRUE, cpop uses an approximate pruning algorithm that speeds up computation but may occasionally lead to a sub-optimal solution in terms of the estimated changepoint locations. If the minimum segment length is 0, an exact pruning algorithm is possible and is used.
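A minimal sketch combining these two arguments (the minimum segment length of 5 is an arbitrary illustration):

# force successive changepoints to be at least 5 units apart; since
# minseglen > 0 rules out exact pruning, optionally trade optimality
# for speed with the approximate pruning
res.min <- cpop(y, x, minseglen = 5, prune.approx = TRUE)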

Details


The CPOP algorithm fits a change-in-slope model to data. It assumes that we have data points $(y_1, x_1), \ldots, (y_n, x_n)$, ordered so that $x_1 < x_2 < \cdots < x_n$. For example, $x_i$ could be a time-stamp of when the response $y_i$ was obtained. We model the response, $y$, as signal plus noise, where the signal is modelled as a continuous piecewise linear function of $x$. That is,

$$y_i = f(x_i) + \epsilon_i,$$

where $f(x)$ is a continuous piecewise linear function.

To estimate the function $f(x)$ we specify a set of $N$ grid points, $g_{1:N}$, ordered so that $g_i < g_j$ if and only if $i < j$, and allow the slope of $f(x)$ to change only at these grid points. We then estimate the number of changes, the locations of the changes, and hence the resulting function $f(x)$ by minimising a penalised weighted least squares criterion. This criterion includes a term measuring fit to the data, calculated as the sum of the squared residuals of the fitted function, scaled by the variance of the noise. The penalty is proportional to the number of changes.

That is, our estimated function will depend on $K$, the number of changes in slope; their locations, $\tau_1, \ldots, \tau_K$; the value of the function $f(x)$ at these change points, $\alpha_1, \ldots, \alpha_K$; and its values, $\alpha_0$ at some $\tau_0 < x_1$ and $\alpha_{K+1}$ at some $\tau_{K+1} > x_n$. The CPOP algorithm then estimates $K$, $\tau_{1:K}$ and $\alpha_{0:K+1}$ as the values that solve the following minimisation problem:

$$\min_{K,\; \tau_{1:K} \in g_{1:N},\; \alpha_{0:K+1}} \left\{ \sum_{i=1}^{n} \frac{1}{\sigma^2_i} \left( y_i - \alpha_{j(i)} - \left(\alpha_{j(i)+1} - \alpha_{j(i)}\right) \frac{x_i - \tau_{j(i)}}{\tau_{j(i)+1} - \tau_{j(i)}} \right)^2 + K\beta \right\}$$

where $j(i)$ is the index of the segment containing $x_i$, i.e. $\tau_{j(i)} \le x_i < \tau_{j(i)+1}$,

and where $\sigma^2_1, \ldots, \sigma^2_n$ are the variances of the noise $\epsilon_i$ for $i = 1, \ldots, n$, and $\beta$ is the penalty for adding a changepoint. The sum in this expression is the weighted residual sum of squares, and the $K\beta$ term is the penalty for having $K$ changes.
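To make the criterion concrete, here is a minimal sketch that evaluates this objective for one fixed segmentation; cpop_objective is a hypothetical helper, not part of the package, and cpop() itself minimises this quantity over all $K$, $\tau$ and $\alpha$:

# penalised weighted least squares for a fixed segmentation:
# tau must include tau_0 < x[1] and tau_{K+1} > x[n], alpha the fitted
# values at those locations, so the number of changes K = length(tau) - 2
cpop_objective <- function(y, x, tau, alpha, sigma2, beta) {
  f <- approx(tau, alpha, xout = x)$y  # continuous piecewise linear fit
  sum((y - f)^2 / sigma2) + (length(tau) - 2) * beta
}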

If we know, or have a good estimate of, the residual variances, and the noise is (close to) independent over time, then an appropriate choice for the penalty is $\beta = 2 \log n$, and this is the default for CPOP. However, in many applications these assumptions will not hold, and it is advisable to look at segmentations for different values of $\beta$; this is possible using CPOP with the CROPS algorithm, cpop.crops. Larger values of $\beta$ will lead to functions with fewer changes. There is also a trade-off between the variances of the residuals and $\beta$: e.g. if we double the variances and halve the value of $\beta$, we will obtain the same estimates for the number and location of the changes and the underlying function.
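A minimal sketch of that trade-off for data y at locations x, assuming changepoints() extracts the estimated locations (doubling the variance means scaling sd by sqrt(2)):

# halving beta while doubling the noise variance leaves the estimated
# segmentation unchanged, since only their ratio enters the criterion
res.a <- cpop(y, x, beta = 2 * log(length(y)), sd = 1)
res.b <- cpop(y, x, beta = log(length(y)), sd = sqrt(2))
identical(changepoints(res.a), changepoints(res.b))  # expected TRUE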

Value

An instance of an S4 class of type cpop.class.
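The returned object is usually inspected through the package's generics; a short sketch, assuming the changepoints() and estimate() accessors exported by cpop:

res <- cpop(y, x)
changepoints(res)  # estimated changepoint locations
estimate(res)      # estimate of f(x) at the data locations
plot(res)          # data, fitted function and changepoints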

References

Fearnhead, P., Maidstone, R. and Letchford, A. (2019). Detecting Changes in Slope With an L0 Penalty. Journal of Computational and Graphical Statistics, 28(2), 265-275. doi:10.1080/10618600.2018.1512868

Fearnhead, P. and Grose, D. (2024). cpop: Detecting Changes in Piecewise-Linear Signals. Journal of Statistical Software, 109(7). doi:10.18637/jss.v109.i07

Examples

library(cpop)

# simulate data with change in gradient
set.seed(1)
x <- (1:50/5)^2
y <- simchangeslope(x, changepoints = c(10, 50), change.slope = c(0.25, -0.25), sd = 1)
# analyse using cpop
res <- cpop(y, x)
p <- plot(res)
print(p)
 
# generate the "true" mean
mu <- simchangeslope(x, changepoints = c(10, 50), change.slope = c(0.25, -0.25), sd = 0)
# add the true mean to the plot
library(ggplot2)
p <- p + geom_line(aes(y = mu), color = "blue", linetype = "dotted")
print(p)

# heterogeneous data
set.seed(1)
sd <- (1:50)/25
y <- simchangeslope(x, changepoints = c(10, 50), change.slope = c(0.25, -0.25), sd = sd)

# analysis assuming constant noise standard deviation
res <- cpop(y, x, beta = 2 * log(length(y)), sd = sqrt(mean(sd^2)))
p <- plot(res)
print(p)

# analysis with the true noise standard deviation
res.true <- cpop(y, x, beta = 2 * log(length(y)), sd = sd)
p <- plot(res.true)
print(p)

# add the true mean to the plot
p <- p + geom_line(aes(y = mu), color = "blue", linetype = "dotted")
print(p)

