In yadbor/bluer: Tools to Analyse Instron BlueHill 2.0 RawData Files

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Background

When testing bio-mechanical specimens the first few loading cycles are often different from the stable long-term behaviour, usually due to bedding in to the fixture. It is common to reject these and only analyse the results from subsequent cycles.

Bluehill can write a cycle and segment label to RawData files that could be used to select a particular cycle of a multi-cycle test. Unfortunately the labels are often wrong, and can lag the actual start of a cycle by quite a bit. This leads to turning points being included in the wrong cycle and makes analysis very difficult.

The solution to this issue is to find the turning points with R and segment the data using those. R lacks a standard method for finding peaks from a data series. There have been several proposed solutions to this, but in testing on test data from Bluehill they all have limitations.

prior art

using `max.col()`

The earliest peaks code I can find (and the most robust) appears to be due to Brian Ripley (one of the authors of embed). This is the 2001 version by Petr Pikal, after Brian Ripley, quoted in https://stat.ethz.ch/pipermail/r-help/2005-November/083240.html

peaks1 <- function(series, span=3) {
  z <- stats::embed(series, span)
  result <- max.col(z) == 1 + span %/% 2
  result
}

This function works by taking creating a matrix where each column is the original series shifted by one. Each row of the matrix z consists of series[i - 0], series[i - 1], ... series[i - span+1]. max.col() then returns the number of the column with the biggest value, so result will be TRUE when the biggest value is in column (1 + span %/% 2), i.e. when the middle column is bigger. The columns are all shifted by 1, so the columns to the left will be the previous span %/% 2 points and the columns on the right will be the next span %/% 2 points. If the middle column is bigger than that point is bigger than it's span %/% 2 neighbours on either side. In other words, a peak. The result vector returned by this function will be shorted than the original series, losing peak %/% 2 points from either end as the input series is shifted off the end.

The next function uses the same idea, extended by Petr Pikal to pad result to the same length as the input series.

peaks2<-function(series,span=3) {
  z <- stats::embed(series, span)
  s <- span%/%2
  v<- max.col(z) == 1 + s
  result <- c(rep(FALSE,s),v)
  result <- result[1:(length(result)-s)]
  result
}

using equalities

Both of these functions were cited on r-help in November 2005 r-help. Later in the thread Martin Maechler (maechler@stat.math.ethz.ch) presented a different idea. https://stat.ethz.ch/pipermail/r-help/2005-November/083376.html

peaks3 <- function(series, span = 3, do.pad = TRUE) {
  if ((span <- as.integer(span)) %% 2 != 1) stop("'span' must be odd")
  s1 <- 1:1 + (s <- span %/% 2)
  if (span == 1) return(rep.int(TRUE, length(series)))
  z <- stats::embed(series, span)
  v <- apply(z[, s1] > z[, -s1, drop = FALSE], 1, all)
  if (do.pad) {
    pad <- rep.int(FALSE, s)
    c(pad, v, pad)
  } else v
}

This function uses embed() like the previous functions, but then instead of calling max.col() it builds a matrix of tests if column s1 is greater than each column except s1. Then it checks if all columns in each row are TRUE. It this is the case, then column s must be bigger than all the other columns, so that point must be a peak. The output vector is finally padded to the imput length, if requested. Note that is uses the to my eye odd idiom of adding a sequence 1:1 to span %/% 2 rather than just adding an integer. This may be something common in R, but I have't seen it elsewhere.

Maechler's approach has the advantage that it can be extended to find troughs, where column s1 is less than all other columns. Maechler also provides this function:

peaksign <- function(series, span = 3, do.pad = TRUE) {
  ## Purpose: return (-1 / 0 / 1) if series[i] is ( trough / "normal" / peak )
  ## ----------------------------------------------------------------------
  ## Author: Martin Maechler, Date: 25 Nov 2005

  if ((span <- as.integer(span)) %% 2 != 1 || span == 1)
    stop("'span' must be odd and >= 3")
  s1 <- 1:1 + (s <- span %/% 2)
  z <- stats::embed(series, span)
  d <- z[, s1] - z[, -s1, drop = FALSE]
  ans <- rep.int(0:0, nrow(d))
  ans[apply(d > 0, 1, all)] <- as.integer(1)
  ans[apply(d < 0, 1, all)] <- as.integer(-1)
  if (do.pad) {
    pad <- rep.int(0:0, s)
    c(pad, ans, pad)
  } else ans
}

Both of these functions have a problem with ties, where a peak has several consecutive points all at the same value. In this case they will fail to find the peak as the points on either side are not strictly greater or less than the "peak" point.

This problem can be addressed by making one of the inequalities >= (or <= depending on which is chosen). For peaks alone:

  z <- stats::embed(series, span)
  v <- apply(z[, s1] >= z[, -s1, drop = FALSE], 1, all)

And for peaks & troughs:

  z <- stats::embed(series, span)
  d <- z[, s1] - z[, -s1, drop = FALSE]
  ans <- rep.int(0:0, nrow(d))
  ans[apply(d >= 0, 1, all)] <- as.integer(1)
  ans[apply(d < 0,  1, all)] <- as.integer(-1)

This approach detects peaks with repeated values, but will return runs of TRUE for each point in the peak. The modified peaksign will return runs of +1 or -1.

Using max.col() avoids this as it will only return one value if there are ties. The value is decided by the parameter ties.method = c("random", "first", "last"). The default is random, which Ripley prefers, as in a flat spot no single point could really be called the peak. The drawback of using random is that the chosen peak point can vary between runs of the code, which is inconvenient.

My solution

Finding both peaks and troughs is very convenient for the sort of cyclic data produced by mechanical testing. I considered two approaches:

using peaksign and reducing runs of peaks to a single point
using max.col() to detect peaks and troughs

Unfortunately there is no min.col(), but max.col(-z) achieves the same effect. As max.col() is written in C it is very fast so calling it twice should not be a big performance hit. This approach seemed easier and cleaner than detecting and reducing runs in the Maechler version of peaksigns. I have kept Maechler's idea of using the logical vector max.col() == middle to select where to insert 1L or -1L into a vector of zeros.

#' Find Peaks and Troughs 2
#'
#' My version, starting from Brian Ripley's and informed by all the others
#' @param series the vector of numbers to find peaks in.
#' @param span the window size to use when looking for lower values.
#' @return a logical vector, TRUE for each peak
#'
peaksign2 <- function(series, span=3, do.pad = TRUE) {
  if ((span <- as.integer(span)) %% 2 != 1 || span == 1) {
    stop("'span' must be odd and >= 3")
  }
  z <- stats::embed(series, span)
  s <- span %/% 2
  ans <- rep.int(0L, nrow(z))
  # max is peak
  v <- max.col(z, ties.method = "first")  == (1 + s)
  ans[v] <-  1
  # no min.col so uses max(-v) for trough
  v <- max.col(-z, ties.method = "first")  == (1 + s)
  ans[v] <- -1
  if (do.pad) {
    pad <- rep.int(0L, s)
    c(pad, ans, pad)
  } else {
    ans
  }
}

This final version handles flat peaks with multiple identical values, and detects both peaks and troughs in a single function.

Even more final version

When tested on more data, the method above failed for some tests with 'flat' spots at the peaks. These could be runs of data points with the same y value or very runs of slowly changing y values. Multiple peaks (or troughs) in a row were returned for a single real peak (or trough). These can be common in slow speed tests, so a more robust method was required. Clearly, there can't be multiple peaks without a trough in between (and the same for mutipe trouchs with no intervening peak). The new method uses the peaksign2() function, but then rejects peaks that do not meet a strict alternation of peaks and troughs. The start of the test is effectively a trough, so the first peak/trough must be a peak, follwoed by a trough and so on. To make it easier to find these, all tests should have positive extension & load at a peak. This means that compressive tests (and other tests where the crosshead moves down) shoudl have load and extension inverted before the peak finder is run.

#' Find Peaks and Troughs Robustly (version 3)
#'
#' My version, starting from Brian Ripley's and informed by all the others.
#'   The previous version found peaks & toughs, but could get false hits
#'   when there was a flat run of the same value. This is not uncommon in
#'   testing data, particularly at low speeds and high data sampling rates,
#'   where the crosshead can pause at the top or bottom of travel and generate
#'   several data points all at the same load/extension.
#'
#'   We solve this by recognising that the starting poing must be a trough
#'   (as we are moving away from start towards maximum) and that the peaks
#'   and troughs must come in strict succession. That is, each peak must be
#'   followed by a trough, and so on. Any additional peaks/troughs are false.
#'
#'   This version only works with data that increase away from the start point.
#'   That means that compressive tests, where the absolute Load & Extension go
#'   negative from start, must be inverted before testing for peaks.
#'
#' @param series The vector of numbers to find peaks in.
#'   Assumed to start low and move towards a peak first.
#' @param span The number of points to check looking for each peak.
#'   Must be odd and >= 3.
#'   Default is 5.
#' @return A vector of (-1 / 0 / 1) if series[i] is ( trough / "normal" / peak ).
#'   Always padded to the length of the input, and awlays starts with a trough.
#'
robust_peaks <- function(series, span=5) {
  if ((span <- as.integer(span)) %% 2 != 1 || span == 1) {
    stop("'span' must be odd and >= 3")
  }
  z <- stats::embed(series, span)
  s <- span %/% 2
  ans <- rep.int(0L, nrow(z))
  # max is peak
  v <- max.col(z, ties.method = "first")  == (1 + s)
  ans[v] <-  1
  # no min.col so uses max(-v) for trough
  v <- max.col(-z, ties.method = "first")  == (1 + s)
  ans[v] <- -1
  # pad back to length of input vector
  pad <- rep.int(0L, s)
  ans <- c(pad, ans, pad)
  # and always start from a virtual trough,
  # as always start with a loading phase
  ans[1] <- -1

  # This algorithm gets false positives for runs of equal value, so clean
  # up by only marking peaks and troughs in strict succession. That is, a
  # peak/trough is only marked if it is different from the preceeding
  # peak/trough.
  # Only operate on elements wich are a potential peak or trough.
  maybe_pk <- ans[ans != 0]
  # Check if each value is a real peak/trough, otherwise erase it.
  shifted <- data.table::shift(maybe_pk, type="lag", fill = -1)
  real_pk <- ifelse(maybe_pk == shifted, 0, maybe_pk)
  # Stuff the corrected peaks back in to the right spots in ans.
  ans[which(ans != 0)] <- real_pk

  return(ans)
}

Timing

How do the various versions compare? Using microbenchmark and real sample data that has a "flat" peak in 8800:9200 (second peak)

library(data.table)
# NB - problem with setting knitr to use the Project Directory, 
# so use default to file directory
print(getwd())
test.file <- "../inst/extdata/vero_clear.is_ccyclic_RawData/Specimen_RawData_6.csv"
DT <- data.table::fread (test.file, skip = 5, col.names = c("time", "ext", "load"), select = 1:3)
DT[, ext := - ext] # need to invert as is compression test
DT[, load := -load]

# set a seed to avoid ripley changing between runs
set.seed(007)
if (requireNamespace("microbenchmark")) {
  # we can do benchmarking
  mbm <- microbenchmark::microbenchmark(
    ripley   = p1 <- DT[, peaks1(ext)],
    pikal    = p2 <- DT[, peaks2(ext)],
    maechler = p3 <- DT[, peaks3(ext)],
    peaksign = ps <- DT[, peaksign(ext)],
    mine     = me <- DT[, peaksign2(ext)],
    robust   = ro <- DT[, robust_peaks(ext)]
  )
  print(print(mbm, order = "median", signif = 2)[,-8])

  if (requireNamespace("ggplot2")) {
    ggplot2::autoplot(mbm)
  } else {
    autoplot(mbm)
  }
} else {
  print('install.packages("microbenchmark") if you want to run the benchmarking')
}

It appears the robust method is a bit slower, but not horribly so.

Output

What peaks did each method find? There should be three peaks and two troughs, with the beginning and end of the series being virtual troughs. By inspection the peaks should be near 3000, 9000, 15000 and the troughs near 6000, 12000.

Peaks only