cpop | R Documentation
The CPOP algorithm fits a change-in-slope model to data.
cpop(
  y,
  x = 1:length(y) - 1,
  grid = x,
  beta = 2 * log(length(y)),
  sd = sqrt(mean(diff(diff(y))^2)/6),
  minseglen = 0,
  prune.approx = FALSE
)
y
  A vector of length n containing the data.

x
  A vector of length n containing the locations of y. Default value is NULL, in which case the locations are taken to be equally spaced, i.e. 1:length(y) - 1 as in the usage above.

grid
  An ordered vector of possible locations for the changepoints. If this is NULL, then it is set to x, the vector of times/locations of the data points.

beta
  A positive real value for the penalty incurred for adding a changepoint (prevents over-fitting). Default value is 2 * log(length(y)).

sd
  Estimate of the residual standard deviation. Can be a single numerical value, or a vector of values for the case of varying standard deviation. Default value is sqrt(mean(diff(diff(y))^2)/6), an estimate based on the second differences of the data.

minseglen
  The minimum allowable segment length, i.e. the distance between successive changepoints. Default is 0.

prune.approx
  Only relevant if a minimum segment length is set. If TRUE, cpop uses an approximate pruning algorithm that speeds up computation but may occasionally lead to a sub-optimal solution in terms of the estimated changepoint locations. If the minimum segment length is 0, then an exact pruning algorithm is possible and is used.
The CPOP algorithm fits a change-in-slope model to data. It assumes we have data points (y_1, x_1), ..., (y_n, x_n), ordered so that x_1 < x_2 < ... < x_n. For example, x_i could be the time-stamp at which the response y_i was obtained. We model the response, y, as signal plus noise, where the signal is a continuous piecewise linear function of x. That is, y_i = f(x_i) + ε_i, where f(x) is a continuous piecewise linear function.

To estimate the function f(x) we specify a set of N grid points, g_{1:N}, ordered so that g_i < g_j if and only if i < j, and allow the slope of f(x) to change only at these grid points. We then estimate the number of changes, the locations of the changes, and hence the resulting function f(x) by minimising a penalised weighted least squares criterion. This criterion includes a term measuring fit to the data, calculated as the sum of squared residuals of the fitted function scaled by the variance of the noise, together with a penalty proportional to the number of changes.

That is, our estimated function depends on K, the number of changes in slope; their locations, τ_1, ..., τ_K; the values of f(x) at these changepoints, α_1, ..., α_K; and its values α_0 at some τ_0 < x_1 and α_{K+1} at some τ_{K+1} > x_n. The CPOP algorithm then estimates K, τ_{1:K} and α_{0:K+1} as the values that solve the minimisation problem

  min_{K, τ_{1:K} ∈ g_{1:N}, α_{0:K+1}} { Σ_{i=1}^n (1/σ²_i) ( y_i - α_{j(i)} - (α_{j(i)+1} - α_{j(i)}) (x_i - τ_{j(i)}) / (τ_{j(i)+1} - τ_{j(i)}) )² + Kβ },

where j(i) is the index of the changepoint immediately before x_i, so that τ_{j(i)} ≤ x_i < τ_{j(i)+1}; σ²_1, ..., σ²_n are the variances of the noise ε_i for i = 1, ..., n; and β is the penalty for adding a changepoint. The sum in this expression is the weighted residual sum of squares, and the Kβ term is the penalty for having K changes.
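As a concrete illustration of this criterion, the following base-R sketch scores a single *given* candidate segmentation by the penalised weighted least squares above. This is not the package's internal code, and cpop() searches over all candidates rather than scoring one; the function name cpop_cost and the toy data are invented for illustration.

```r
# Illustrative sketch: evaluate the penalised weighted least squares
# criterion for one candidate segmentation.
cpop_cost <- function(y, x, tau, alpha, sd = 1, beta = 2 * log(length(y))) {
  # tau: changepoint locations padded with tau_0 < x_1 and tau_{K+1} > x_n
  # alpha: values of f at the padded changepoints (same length as tau)
  K <- length(tau) - 2
  # j(i): index of the changepoint interval containing x_i
  j <- findInterval(x, tau)
  # fitted continuous piecewise linear function
  fit <- alpha[j] + (alpha[j + 1] - alpha[j]) * (x - tau[j]) / (tau[j + 1] - tau[j])
  sum((y - fit)^2 / sd^2) + K * beta
}

x <- 1:10
y <- c(1:5, 4:0)  # a peak at x = 5
# candidate with one slope change at tau = 5 (fits these data exactly)
cost1 <- cpop_cost(y, x, tau = c(0, 5, 11), alpha = c(0, 5, -1))
# candidate with no change (a flat line at height 2)
cost0 <- cpop_cost(y, x, tau = c(0, 11), alpha = c(2, 2))
cost1 < cost0  # the change-in-slope model has the lower penalised cost here
```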
If we know, or have a good estimate of, the residual variances, and the noise is (close to) independent over time, then an appropriate choice for the penalty is β = 2 log n, and this is the default for CPOP. However, in many applications these assumptions will not hold, and it is advisable to look at segmentations for different values of β; this is possible by running CPOP with the CROPS algorithm, see cpop.crops. Larger values of β lead to functions with fewer changes. There is also a trade-off between the variances of the residuals and β: for example, if we double the variances and halve the value of β we obtain the same estimates for the number and location of the changes and the same underlying function.
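This trade-off can be checked numerically: scaling every variance by 2 while halving β scales each candidate's penalised cost by exactly 1/2, so the minimising segmentation is unchanged. A minimal base-R sketch (the helper penalised_cost and the numbers below are illustrative, not part of the package):

```r
# Penalised cost of a candidate fit: weighted RSS plus K * beta.
penalised_cost <- function(rss, sigma2, K, beta) rss / sigma2 + K * beta

rss  <- c(25, 0.4)   # residual sums of squares for two candidate fits
K    <- c(0, 1)      # number of slope changes in each candidate
beta <- 2 * log(10)

cost_a <- penalised_cost(rss, sigma2 = 1, K, beta)          # original setting
cost_b <- penalised_cost(rss, sigma2 = 2, K, beta / 2)      # doubled variance, halved beta

all.equal(cost_b, cost_a / 2)           # every candidate's cost is exactly halved
which.min(cost_a) == which.min(cost_b)  # so the chosen model is the same
```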
Returns an instance of an S4 class of type cpop.class.
Fearnhead, P., Maidstone, R. and Letchford, A. (2019). Detecting Changes in Slope With an L0 Penalty. Journal of Computational and Graphical Statistics. doi:10.1080/10618600.2018.1512868
Fearnhead, P. and Grose, D. (2024). cpop: Detecting Changes in Piecewise-Linear Signals. Journal of Statistical Software.
library(cpop)
# simulate data with change in gradient
set.seed(1)
x <- (1:50 / 5)^2
y <- simchangeslope(x, changepoints = c(10, 50), change.slope = c(0.25, -0.25), sd = 1)
# analyse using cpop
res <- cpop(y, x)
p <- plot(res)
print(p)
# generate the "true" mean
mu <- simchangeslope(x, changepoints = c(10, 50), change.slope = c(0.25, -0.25), sd = 0)
# add the true mean to the plot
library(ggplot2)
p <- p + geom_line(aes(y = mu), color = "blue", linetype = "dotted")
print(p)
# heterogeneous data
set.seed(1)
sd <- (1:50)/25
y <- simchangeslope(x, changepoints = c(10, 50), change.slope = c(0.25, -0.25), sd = sd)
# analysis assuming constant noise standard deviation
res <- cpop(y, x, beta = 2 * log(length(y)), sd = sqrt(mean(sd^2)))
p <- plot(res)
print(p)
# analysis with the true noise standard deviation
res.true <- cpop(y, x, beta = 2 * log(length(y)), sd = sd)
p <- plot(res.true)
print(p)
# add the true mean to the plot
p <- p + geom_line(aes(y = mu), color = "blue", linetype = "dotted")
print(p)