ctbi | R Documentation |
Please cite the following companion paper if you're using the ctbi
package: Ritter, F.: Technical note: A procedure to clean, decompose, and aggregate time series, Hydrol. Earth Syst. Sci., 27, 349–361, https://doi.org/10.5194/hess-27-349-2023, 2023.
The goal of ctbi
is to clean, decompose, impute and aggregate univariate time series. ctbi
stands for Cyclic/Trend decomposition using Bin Interpolation: the time series is divided into a sequence of non-overlapping bins (inputs: bin.side
or bin.center
, and bin.period
). Bins with enough data (input: bin.max.f.NA
) are accepted, and otherwise rejected (their values are set to NA
). The long-term trend is a linear interpolation of the mean values between successive bins. The cyclic component is the mean stack of detrended data within all accepted bins.
Outliers present in the residuals are flagged using an enhanced box plot rule (called Logbox, input: coeff.outlier
) that is adapted to non-Gaussian data and keeps the type I error at \frac{0.1}{√{n}} % (percentage of erroneously flagged outliers). Logbox replaces the original α = 1.5 constant of the box plot rule with α = A \times \log(n)+B+\frac{C}{n}. The variable n ≥q 9 is the sample size, C = 36 corrects biases emerging in small samples, and A and B are automatically calculated on a predictor of the maximum tail weight (m_{*}).
The strength of the cyclic pattern within each bin is quantified by a new metric, the Stacked Cycles Index defined as SCI = 1 - \frac{SS_{res}}{SS_{tot}} - \frac{1}{N_{bin}}. The variable SS_{tot} is the sum of the squared detrended data, SS_{res} is the sum of the squared detrended & deseasonalized data, and N_{bin} is the number of accepted bins. A value of SCI ≤q 0 is associated with no cyclicity, while SCI = 1 is associated with a perfectly cyclic signal. Data can be imputed if SCI_{min} ≤q SCI (input: SCI.min
). Finally, data are aggregated in each bin (input: bin.FUN
).
Important functions of the package: ctbi
, ctbi.outlier
(flag outliers in univariate datasets with the Logbox method) and ctbi.plot
(plot the time series).
ctbi( data.input, bin.side, bin.period, bin.center = NULL, bin.FUN = "mean", bin.max.f.NA = 0.2, SCI.min = 0.6, coeff.outlier = "auto", ylim = c(-Inf, +Inf) )
data.input |
two columns data.frame (or data.table) with the first column being the time component (POSIXct, Date or numeric) and the second column the value (numeric) |
bin.side |
one side of any bin (same class as the time component) |
bin.period |
time interval between two sides of a bin. If the time component x.t of data.input is numeric, |
bin.center |
if |
bin.FUN |
character ('mean', 'median' or 'sum') that defines the aggregating operator |
bin.max.f.NA |
numeric between 0 and 1 that specifies the maximum fraction of missing values for a bin to be accepted. The minimum number of non-NA points for a bin to be accepted is bin.size.min.accepted = bin.size*(1- |
SCI.min |
numeric between 0 and 1 that is compared to the Stacked Cycles Index (SCI). If SCI > |
coeff.outlier |
one of |
ylim |
numeric vector of length 2 that defines the range of possible values. Values strictly below ylim[1] or strictly above ylim[2] are set to NA. Values equal to ylim[1] or ylim[2] are discarded from the residuals |
A list that contains:
data0, the raw dataset (same class as data.input
), with 9 columns: (i) time; (ii) outlier-free and imputed data; (iii) index.bin: index of the bins associated with each data points (the index is negative if the bin is rejected); (iv) long.term: long-term trend; (v) cycle: cyclic component; (vi) residuals: residuals including the outliers; (vii) outliers: quarantined outliers; (viii) imputed: value of the imputed data points; (ix) time.bin: relative position of the data points in their bins, between 0 and 1
data1, the aggregated dataset (same class as data.input
), with 10 columns: (i) aggregated time (center of the bins); (ii) aggregated data; (iii) index.bin: index of the bin (negative value if the bin is rejected); (iv) bin.start: start of the bin; (v) bin.end: end of the bin; (vi) n.points: number of points per bin (including NA
values); (vii) n.NA: number of NA values per bin, originally; (viii) n.outliers: number of outliers per bin; (ix) n.imputed: number of imputed points per bin; (x) variability associated with the aggregation (standard deviation for the mean, MAD for the median and nothing for the sum)
mean.cycle, a dataset (same class as data.input
) with bin.size rows and 4 columns: (i) generic.time.bin1: time of the first bin; (ii) mean: the mean stack of detrended data; (iii) sd: the standard deviation on the mean; (iv) time.bin: relative position of the data points in the bin, between 0 and 1
summary.bin, a vector that contains bin.size (median number of points in non-empty bins), bin.size.min.accepted (minimum number of points for a bin to be accepted) and SCI
summary.outlier, a vector that contains A, B, C, m_{*}, the size of the residuals (n), and the lower and upper outlier threshold
# example of the contaminated sunspot data example1 <- data.frame(year = 1700:1988,sunspot = as.numeric(sunspot.year)) example1[sample(1:289,30),'sunspot'] <- NA # contaminate data with missing values example1[c(5,30,50),'sunspot'] <- c(-50,300,400) # contaminate data with outliers example1 <- example1[-(70:100),] # create gap in the data bin.period <- 11 # aggregation performed every 11 years (the year is numeric here) bin.side <- 1989 # give one side of a bin bin.FUN <- 'mean' bin.max.f.NA <- 0.2 # maximum of 20% of missing data per bin ylim <- c(0,Inf) # negative values are impossible list.main <- ctbi(example1,bin.period=bin.period, bin.side=bin.side,bin.FUN=bin.FUN, ylim=ylim,bin.max.f.NA=bin.max.f.NA) data0.example1 <- list.main$data0 # cleaned raw dataset data1.example1 <- list.main$data1 # aggregated dataset. mean.cycle.example1 <- list.main$mean.cycle # this data set shows a moderate seasonality summary.bin.example1 <- list.main$summary.bin # confirmed with SCI = 0.50 summary.outlier.example1 <- list.main$summary.outlier plot(mean.cycle.example1[,'generic.time.bin1'], mean.cycle.example1[,'mean'],type='l',ylim=c(-80,80), ylab='sunspot cycle', xlab='11 years window') lines(mean.cycle.example1[,'generic.time.bin1'], mean.cycle.example1[,'mean']+mean.cycle.example1[,'sd'],type='l',lty=2) lines(mean.cycle.example1[,'generic.time.bin1'], mean.cycle.example1[,'mean']-mean.cycle.example1[,'sd'],type='l',lty=2) title(paste0('mean cycle (moderate cyclicity: SCI = ',summary.bin.example1['SCI'],')')) # plot tool: ctbi.plot(list.main,show.n.bin=10) # example of the beaver data temp.beaver <- beaver1[,'temp'] t.char <- as.character(beaver1[,'time']) minutes <- substr(t.char,nchar(t.char)-1,nchar(t.char)) hours <- substr(t.char,nchar(t.char)-3,nchar(t.char)-2) hours[hours==""] <- '0' days <- c(rep(12,91),rep(13,23)) time.beaver <- as.POSIXct(paste0('2000-12-',days,' ',hours,':',minutes,':00'),tz='UTC') example2 <- data.frame(time=time.beaver,temp=temp.beaver) bin.period <- '1 hour' # aggregation performed every hour bin.side <- as.POSIXct('2000-12-12 00:00:00',tz='UTC') # give one side of a bin bin.FUN <- 'mean' # aggregation operator bin.max.f.NA <- 0.2 # maximum of 20% of missing data per bin ylim <- c(-Inf,Inf) list.main <- ctbi(example2,bin.period=bin.period, bin.side=bin.side,bin.FUN=bin.FUN, ylim=ylim,bin.max.f.NA=bin.max.f.NA) data0.example2 <- list.main$data0 # cleaned raw dataset data1.example2 <- list.main$data1 # aggregated dataset. lower.threshold <- list.main$summary.outlier['lower.outlier.threshold'] upper.threshold <- list.main$summary.outlier['upper.outlier.threshold'] hist(data0.example2[,'residuals'],xlim=c(-0.5,0.5),30,main='beaver residuals') abline(v=c(lower.threshold,upper.threshold),col='red',lwd=2) # show the histogram of the residuals
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.