#' Time Series Artifact Detection and Correction
#'
#' \code{get_artifacts} and \code{fix_artifacts} identify and repair,
#' respectively, signal artifacts within a time series. They are designed to
#' work with pupil measurements obtained from an eye-tracking system, primarily
#' to identify and interpolate over eye blinks, but also any other type of
#' signal artifact that can appear when continuously measuring pupil diameter or
#' area over time.
#'
#' Like similar algorithms, \code{get_artifacts} relies primarily on abrupt
#' jumps in signal to identify the onset and offset of blinks. Unlike other
#' algorithms, it does not require the user to prespecify threshold values that
#' define onsets and offsets for all time series; rather, it adaptively
#' determines the best values based on the distribution of the first-order
#' differences (velocities) of each time series. This procedure was designed to
#' mimic the relativistic way a human observer would visually identify an
#' artifact, i.e., by assessing the pattern of deviation for the candidate
#' artifact relative to that of its nearest neighbors.
#'
#' Using the output of \code{get_artifacts}, \code{fix_artifacts} repairs
#' artifacts using a sequence of linear interpolation for internal artifacts
#' (artifacts sandwiched between good data) followed by matched lag-1
#' differences for external artifacts (artifacts at the start or end of a time
#' series).
#'
#' @section Advice on setting \code{lim} argument:
#'
#' \code{fix_artifacts} uses the \code{lim} argument to impose a floor and
#' ceiling to the values that the time series is allowed to take. Values that
#' exceed this range are then treated like any other artifact. The purpose of
#' the \code{lim} argument is to catch time series that contain "slow drift",
#' meaning the signal gradually drifts into implausible values. In setting the
#' \code{lim} argument, think about what sort of relative change is plausible
#' for your measurement. For example, if you are measuring pupils, the normal
#' pupil size in adults varies from a minimum diameter of about 2 mm (3 square
#' mm area) to a maximum diameter of about 8 mm (50 square mm area) in bright
#' light vs. darkness. This means if the pupil starts off at maximum dilation,
#' it can experience at most a 75\% decrease in diameter (95\% reduction in
#' area). If the pupil starts off at minimum dilation, it can experience at most
#' a 300\% increase in diameter (1500\% increase in area). This would correspond
#' to \code{lim = c(-0.75, 3)} for diameter and \code{lim = c(-0.95, 1500)} for
#' area, but you should set your limits to be more conservative if you do not
#' expect your measurements to span this range.
#'
#' For example, most users will measure pupils under moderate lighting
#' conditions, so baseline pupil readings will start off closer to the center of
#' their physiological range. Even assuming maximum decreases and increases from
#' this point, the range could be narrowed to \code{lim = c(-0.6, 0.6)} for
#' diameter and \code{lim = c(-0.85, 1.5)} for area. We have found that with
#' psychological stimuli using our eye-tracking setup, \code{lim = c(-0.5, 0.5)}
#' and \code{lim = c(-.75, 1.25)} appear to provide liberal coverage for
#' plausible changes in pupil diameter and area, respectively, and you want your
#' \code{lim} setting to err on the side of being too wide. Note that
#' \code{fix_artifacts} calls the \code{get_oor} ("oor" for "out of range")
#' function to identify which periods of the time series violate the \code{lim}
#' argument. If you wish to know which periods are out of range, you can also
#' call this function directly.
#'
#' @section Warning:
#' If the time series contains periods of artifacts that are too long for
#' interpolation defined by the \code{max_gap} argument, or if the total
#' number of artifacts exceeds the proportion specified by the \code{max_loss}
#' argument, \code{fix_artifacts} will silently return a time series consisting
#' of all missing values so that the user can easily identify which trials need
#' to be dropped.
#'
#' @section Disclaimer:
#' This algorithm is intended for the detection of artifacts in relatively short
#' time series corresponding to pupil-dilation measurements. It has only been
#' tested and validated on 6-8 second trials sampled at 60 Hz or 500 Hz. Good
#' artifact detection may or may not generalize to other sampling rates, trial
#' lengths, and types of data. Please use \code{\link{plot_artifacts}} to
#' inspect the performance of \code{get_artifacts} before continuing with
#' \code{fix_artifacts} and any subsequent data cleaning and analysis.
#'
#' @seealso \code{\link{plot_artifacts}}, \code{\link{plot_comparison}},
#' \code{\link{normalize}}, \code{\link{low_pass_filter}}
#'
#' @param ts A time series, passed as a numeric vector of chronologically
#' ordered, positively valued observations separated by equal intervals of time.
#'
#' @param samp_freq Sampling frequency in Hz.
#'
#' @param max_velocity Maximum allowable velocity specified in terms of a
#' quantile of the distribution of absolute values of first-order differences of
#' the time series. Velocities that exceed the value of this quantile will be
#' used to identify onset/offset of artifacts. Default value is 0.9 (90th
#' percentile). Lower values will lead to greater sensitivity but less
#' specificity in identifying signal spikes. Higher values will lead to less
#' sensitivity but greater specificity.
#'
#' @param lim Two-item numeric vector \code{c(neg, pos)} specifying the negative
#' and positive limits, respectively, of relative change from baseline
#' that is plausible for your time series. (See section "Advice on setting
#' \code{lim} argument" for more information and examples.)
#'
#' @param baseline Logical vector indicating which parts of the time series
#' correspond to the baseline period. This is used as a reference for
#' determining whether values exceed the limits of relative change set in the
#' \code{lim} argument. If there is no \code{baseline} provided, the mean of the
#' entire series (excluding signal loss) will be used for this reference.
#'
#' @param max_loss Maximum fraction of time series that is allowed to contain
#' dropped signal and/or artifacts. Default value is 0.5, meaning if more than
#' half of the time series consists of artifacts, \code{fix_artifacts} will not
#' perform interpolation/extrapolation and instead will return a vector of all
#' \code{NA}s.
#'
#' @param max_gap Maximum gap in seconds that an artifact period is allowed to
#' span. Default value is 1 second, meaning if more than half of the time series
#' consists of artifacts, \code{fix_artifacts} will not perform
#' interpolation/extrapolation and instead will return a vector of all
#' \code{NA}s.
#'
#' @param min_cont Minimum continuity in seconds required between artifacts.
#' If the period of "good" data between two artifacts is less than this
#' threshold, the two artifacts are merged into one. Default value is 0.2
#' (200 ms).
#'
#' @param artifacts A logical vector of equal length to the time series that
#' provides logical indexing into which entries of the time series correspond
#' to artifacts. This can be obtained by first calling \code{get_artifacts}. If
#' not supplied by the user, then \code{fix_artifacts} will call
#' \code{get_artifacts} itself.
#'
#' @param ... further arguments passed to \code{get_artifacts} from
#' \code{get_oor} or \code{fix_artifacts} if it has not already been used to
#' create an \code{artifacts} vector.
#'
#' @return \code{get_artifacts} returns a logical vector that can be used for
#' logical indexing into the time series to identify data artifacts.
#' \code{get_oor} returns a logical vector corresponding to elements of the time
#' series that are out of range, as defined by amount of relative change from
#' baseline using the \code{lim} and \code{baseline} arguments.
#' \code{fix_artifacts} returns a copy of the time series with artifacts and
#' missing data replaced by interpolated values, or a copy of the time series
#' with all values changed to \code{NA} in the event that the artifacts are too
#' numerous (exceed \code{max_loss}) or too continuous (exceed \code{max_gap}).
#'
#' @name artifacts
NULL
#' @importFrom stats quantile
#' @export
#' @rdname artifacts
get_artifacts <- function(ts, samp_freq, min_cont = 0.2, max_velocity = 0.9){
#=============================================================================
# check that time series does not have negative values
#-----------------------------------------------------------------------------
ts[is.na(ts)] <- 0
if(any(ts < 0)) stop("Time series has negative values.")
#=============================================================================
# initialize variables
#-----------------------------------------------------------------------------
margin <- ceiling(samp_freq * min_cont)
artifact <- vector("logical", length(ts))
# determine optimal lag for deriving distribution of change values:
## consider lags between 1 and the size of the margin;
## choose smallest lag that yields 95% non-zero differences
## (the minimum interval at which the signal is reliably changing)
lag <- 1
diff_dist <- diff(ts[ts > 0], lag = lag)
p_static <- sum(diff_dist == 0) / length(diff_dist)
while(lag < margin && p_static > 0.05){
lag <- lag + 1
diff_dist <- diff(ts[ts > 0], lag = lag)
p_static <- sum(diff_dist == 0) / length(diff_dist)
}
max_v <- quantile(abs(diff(ts[ts > 0], lag = lag)), max_velocity)
if(is.na(max_v)){
artifact[] <- TRUE; return(artifact)
}
#=============================================================================
# artifact-detection algorithm
#-----------------------------------------------------------------------------
artifact[ts == 0] <- TRUE; forward <- artifact; backward <- artifact
forward[2:length(ts)] <- abs(diff(ts)) > max_v
backward[(length(ts)-1):1] <- abs(diff(ts[length(ts):1])) > max_v
artifact <- merge_artifacts(artifact | forward | backward, margin)
artifact <- buffer_artifacts(ts, artifact, margin)
artifact <- merge_artifacts(artifact, margin)
#=============================================================================
# if there are any artifacts, run the algorithm again on the corrected ts
#-----------------------------------------------------------------------------
if(all(artifact) || all(!artifact)) return(artifact)
new_ts <- fix_artifacts(ts, samp_freq, lim = c(-.99, Inf),
artifacts = artifact, min_cont = min_cont,
max_gap = Inf, max_loss = 1)
if(all(is.na(new_ts))) return(artifact)
artifact <- artifact |
get_artifacts(new_ts, samp_freq, min_cont, max_velocity)
merge_artifacts(artifact, margin)
}
#' @importFrom stats median
#' @export
#' @rdname artifacts
get_oor <- function(ts, samp_freq, lim, ..., min_cont = 0.2, baseline = NULL,
artifacts = NULL){
# if no logical vector of artifacts has been passed, run get_artifacts
if(is.null(artifacts)){
artifacts <- get_artifacts(ts, samp_freq, ..., min_cont = min_cont)
}
# if the entire time series is one big artifact, abort and return all FALSE
if(all(artifacts)) return(!artifacts)
# obtain cleaned time series without any constraints on max_loss or max_gap
new_ts <- fix_artifacts(ts = ts, samp_freq = samp_freq, baseline = baseline,
artifacts = artifacts, max_gap = Inf, max_loss = 1)
if(all(is.na(new_ts))){
ts[artifacts] <- NA_real_
} else {
ts <- new_ts
}
# if no baseline period is specified or if the baseline period consists
# entirely of missing values, use all non-missing values from the time
# series as the basis for calculating min/max limits
if(is.null(baseline) || all(is.na(ts[baseline]))) baseline <- !is.na(ts)
# determine absolute limits based on relative limits and baseline period
min_lim <- median(ts[baseline], na.rm = TRUE) * (1+lim[1])
max_lim <- median(ts[baseline], na.rm = TRUE) * (1+lim[2])
margin <- floor(samp_freq * min_cont)
# return logical test for which parts of time series exceed lim argument
oor <- !artifacts & (ts < min_lim | ts > max_lim)
oor <- merge_artifacts(oor, margin)
oor <- buffer_artifacts(ts, oor, margin)
merge_artifacts(oor, margin)
}
#' @importFrom stats median
#' @export
#' @rdname artifacts
fix_artifacts <- function(ts, samp_freq, lim = NULL, baseline = NULL,
artifacts = NULL, ..., min_cont = 0.2,
max_gap = 1, max_loss = 0.5){
# store the number of observations in the time series as 'n'
n <- length(ts)
# if no logical vector of artifacts has been passed, run get_artifacts
if(is.null(artifacts)){
artifacts <- get_artifacts(ts, samp_freq, ...)
}
# if the proportion of artifacts exceeds the max_loss threshold, set all
# values of the time series to missing and return
if(sum(artifacts) / length(artifacts) > max_loss){
ts[] <- NA_real_
return(ts)
}
# if limits have been passed, run get_oor and label out-of-range samples
# as artifacts
if(!is.null(lim)){
artifacts <- artifacts |
get_oor(ts, samp_freq, lim, ..., min_cont = min_cont,
baseline = baseline, artifacts = artifacts)
artifacts <- artifacts | is.na(artifacts)
# if the proportion of artifacts now exceeds the max_loss threshold, set all
# values of the time series to missing and return
if(sum(artifacts) / length(artifacts) > max_loss){
ts[] <- NA_real_
return(ts)
}
}
# set observations flagged as artifacts to missing
ts[artifacts] <- NA_real_
# calculate the run-lengths of consecutive artifact vs. artifact-free periods
# across the whole time series
runs <- rle(artifacts)
# extract from this the lengths of the artifact periods, i.e., the gaps in
# signal
gap <- runs$lengths[runs$values]
# if any gaps exceed the max_gap threshold, set all values of the time series
# to missing
if(any(gap > samp_freq * max_gap)) ts[] <- NA_real_
# if the entire time series is now missing (i.e., one or more artifact
# periods was greater than the maximum allowable gap or maximum allowable
# loss), do not proceed
if(all(is.na(ts))) return(ts)
# catch rare circumstance in which the entire time series consists of a
# single run of valid data sandwiched between two runs of artifacts, which
# will break the zoo package's na.approx function
will_break_zoo <- length(runs$values) == 3 && all(runs$values == c(T, F, T))
# linear interpolation for internal artifacts
if(!will_break_zoo) ts <- zoo::na.approx(ts, na.rm = F)
# if there is missing data at end of trial
if(is.na(ts[n])){
# h = the length of the final gap in the trial
h <- gap[length(gap)]
# [n-2h : n-h] = the interval immediately preceding the final gap that is
# equal in length to the final gap plus 1.
# if the length of the final gap is more than half the length of the entire
# time series, or if there are any missing values in the matched interval,
# cancel the interpolation.
if(h > n/2 || any(is.na(ts[(n-2*h):(n-h)]))){
ts[] <- NA_real_
return(ts)
}
# otherwise, compute the lag-1 differences over the matched interval
lag1 <- diff(ts[(n-2*h):(n-h)])
# and repeat these differences over the missing interval
for(i in 1:h){
ts[n-h+i] <- ts[n-h+i-1] + lag1[i]
}
}
# if there is missing data at start of trial
if(is.na(ts[1])){
# h = the length of the initial gap in the trial
h = gap[1]
# [h+1 : 2h+1] = the interval immediately following the first gap that is
# equal in length to the first gap plus 1
# if the length of the initial gap is more than half the length of the
# entire time series, or if there are any missing values in the matched
# interval, cancel the interpolation.
if(h > n/2 || any(is.na(ts[(h+1):(2*h+1)]))){
ts[] <- NA_real_
return(ts)
}
# otherwise, compute the lag-1 differences over the matched interval
lag1 <- diff(ts[(h+1):(2*h+1)])
# and repeat these differences over the missing interval
for(i in h:1){
ts[i] <- ts[i+1] - lag1[i]
}
}
# threshold to lim argument, if it exists; otherwise to lower bound of 0
min_lim <- 0; max_lim <- Inf
if(!is.null(lim)){
if(is.null(baseline)) baseline <- rep(TRUE, length(ts))
min_lim <- median(ts[baseline]) * (1+lim[1])
max_lim <- median(ts[baseline]) * (1+lim[2])
}
ts[ts < min_lim] <- min_lim
ts[ts > max_lim] <- max_lim
ts
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.