R/build_design_matrix.R

Defines functions get_ts_multipliers get_additional_regressors determine_run_volumes shift_dmat_timing build_design_matrix

Documented in build_design_matrix

#' Creates an fmri design matrix, including timing files for for AFNI or FSL.
#'
#' @param events a data.frame that includes a column for the event type (e.g. outcome vs. cue),
#'           run number (1:n), trial number (nested within run, 1:x), onset, duration of event
#' @param signals expects a list of list. The first level of lists reflects each signal (e.g. pe vs values).
#'           In the second level, BDM expects values and event (e.g. cue vs outcome). Values is a \code{data.frame}
#'           with run number (1:n), trial (1:x), and signal.
#' @param tr The repetition time of your fMRI sequence in seconds. You must specify this.
#'           This is important to specify correctly to get temporal filtering correct.
#' @param center_values A logical (\code{TRUE/FALSE}) indicating whether to center parameteric regressors prior to
#'           convolution. This is usually a good idea to remove collinearity between parametric and task indicator
#'           regressors. The default is \code{TRUE}.
#' @param hrf_parameters A named vector of HRF parameters passed to \code{fmri.stimulus} internally.
#'           The parameters are a1, a2, b1, b2, cc. Equation is (x/d1)^a1 * exp(-(x - d1)/b1) - c * (x/d2)^a2 * exp(-(x - d2)/b2).
#'           Defaults and descriptions are:
#'             a1 = 6. Controls the time delay to the peak of the positive response
#'             a2 = 12. Controls the time delay to the (negative) peak of the undershoot
#'             b1 = 0.9. Controls the dispersion (breadth) of the positive response
#'             b2 = 0.9. Controls the dispersion (breadth) of the undershoot
#'             cc = 0.35. Controls the relative scaling of the positive response versus the undershoot.
#'
#'           Note: These defaults come from Glover 1999.
#'
#'           Note. FSL double gamma has a1 = 6, a2 = 16, cc = 1/6. Not yet sure about b1 and b2.
#' @param baseline_coef_order Default -1 (no baseline). If >= 0, then design will include polynomial trends
#'           within each run (e.g. baseline_coef_order = 1 includes both an intercept and a linear trend as regressors)
#' @param baseline_parameterization Defaults to "Legendre". This adds Legendre polynomials up to
#'           \code{baseline_coef_order} (e.g., 2). The alternative is "orthogonal_polynomials",
#'           which uses \code{fmri.design} from the \code{fmri} package (now internal to fmri.pipeline) to add 
#'           polynomial regressors that are orthogonal to substantive design factors.
#' @param run_data a data.frame containing metadata about the runs for which we want to model the task design. This
#'           data.frame should contain the columns run_number, run_volumes, run_nifti, and drop_volumes. If run_nifti
#'           is provided, but run_volumes is not, then the number of volumes is looked up from the NIfTI header.
#' @param drop_volumes By default, all volumes are retained. If specified, this can be a vector of the number of volumes
#'           that will be removed from the \emph{beginning} of each convolved regressor. If you pass a single number (e.g., 3),
#'           this number of volumes will be dropped from each run. This is useful if you have dropped the first n volumes
#'           of your MR data, for example to handle problems with steady state magnetization. If you include drop_volumes in
#'           the \code{run_data} data.frame, this will be used over this setting.
#' @param runs_to_output A numeric vector of run numbers to be output. By default, all runs are preserved.
#'           This is used to model only a subset such as \code{c(1, 2, 6)}.
#' @param plot By default (\code{TRUE}), \code{build_design_matrix} will plot the design matrix in the plot window of your R session.
#'           If \code{FALSE}, the plot is not displayed, but the ggplot object is still provided in the $design_plot field.
#' @param write_timing_files When NULL (the default), the function does not write timing files to disk.
#'           This argument accepts a character vector that specifies whether to write "AFNI", "FSL",
#'           or "convolved" timing files to disk (in \code{output_directory}). AFNI files follow the
#'           dmBLOCK convention of TIME*PARAMETER:DURATION for each event. FSL files follow the three-column
#'           format of onset, duration, value. And convolved files represent a given signal convolved with the
#'           HRF such that the 1-column output is in volumes (i.e., one row per volume).
#' @param keep_empty_regressors If TRUE, then a regressor that contains no events for a given run is still retained
#'           as an all-zeros regressor in the convolved timing files and the FSL 3-column files. This is useful if
#'           you want to include regressors that only occur for some subjects, but not others. That said, handling
#'           contrasts involving these regressors is a more complicated matter.
#' @param output_directory Where to output the timing files. By default, the function will output the timing files
#'           to a folder called "run_timing" in the current working directory. If such a folder does not exist,
#'           it will make a folder in your R session's current working directory.
#' @param convolve_wi_run If \code{TRUE} (the default), convolution -- and any corresponding mean centering and
#'           normalization of the heights of parametric signals -- is applied to each run separately.
#'           If FALSE, the events across runs are concatenated before convolution is applied
#'           (i.e., treating it as one long time series).
#' @param high_pass By default, \code{NULL}. If desired, pass in a number in Hz that specifies the high pass filter cutoff.
#'           In this case a FIR-based high-pass filter will be applied to remove any low-frequency fluctuations
#'           in the design signals after convolution. This can be useful and necessary if the MRI have been filtered,
#'           but the regressors have not. It is important that the frequency content of both MR and design signals matches.
#'           Some programs, including FEAT, ensure that equivalent filtering is applied to bot the Y and X sides of this
#'           equation, but you should be clear whether your program does so, too. If it doesn't, probably best to use this
#'           argument to filter things yourself. For example, 3dDeconvolve only handles filtering via a set of drift (polort)
#'           regressors. If you have used another tools such as fslmaths -bptf to filter the data, the polort will not necessarily
#'           result in the same removal of drift from regressors as was applied to the MR data.
#' @param iti_post By default, 12. Assumes 12 volumes after each run. Only necessary to specify if not supplying run_volumes and
#'           expecting function to use events information to calculate run_volumes. Wouldn't recommend this, just a default here.
#' @param ts_multipliers By default, \code{NULL}. If specified, expects either a vector of character strings for different .txt files for
#'           the time-based regressors OR a list of data.frames (1 df per run). These data.frames contain
#'           time series regressors (i.e., having the same length as run_volumes) and can be used as multipliers on a stimulus
#'           signal prior to convolution. This is primarily intended for PPI analysis, where the stimulus regressor is multiplied
#'           by a time series from a seed/candidate region. To use columns of \code{ts_multiplier}, you include
#'           a specifier for an element in the signals list: \code{ts_multiplier="vmPFC"} where "vmPFC" is column name in
#'           \code{ts_multipliers}. If you use a list of character vectors (i.e., read from .txt files), make sure that you have
#'           headers in the text files that will become the names of the ts_multiplier signals.
#' @param additional_regressors By default, \code{NULL}. If additional regressors specified, either expects character vector for different .txt
#'           files for the additional regressors (1 txt file per run) OR it expects a
#'           list of data.frames (1 df per run). These values are tacked onto design_convolved
#'           (and not convolved with HRF), so each regressor should be length of the number of
#'           run_volumes within that run. If you pass in a vector of .txt files containing additional regressors,
#'           these will be read into R, truncated to run_volumes, and column-wise concatenated with
#'           substantive regressors.
#' @param lg An lgr object used for logging messages
#'
#' @details
#'
#' The function outputs a list containing key aspects of the fMRI design, including
#' the unconvolved and convolved regressors, collinearity diagnostics, the number of volumes
#' modeled in each run, and a plot of the design.
#'
#' The basic logic of the inputs to build_design_matrix is that task-related fMRI designs are organized around a set of events
#' that occur in time and have a specific duration. Furthermore, for a given event, it could be a 0/1 non-occurrence versus
#' occurrence representation, \emph{or} the event could be associated with a specific parametric value such as working memory
#' load, reward prediction error, or expected value. These parametric effects are aligned in time with an event, but there
#' may be multiple predictions for a given event. For example, we may align a 0/1 regressor and a reward prediction error
#' the outcome phase of a task.
#'
#' Thus, the function abstracts timing-related information into \code{events} and signals, whether parametric or binary, into the \code{signals}.
#'
#' The \code{events} argument expects a \code{data.frame} that has, minimally, the following structure:
#'
#' \preformatted{
#'  > print(events)
#'      event run_number trial onset duration
#'        cue          1     1     4        2
#'        cue          1     2     7        2
#'    outcome          1     1     6      0.5
#'    outcome          1     2   9.5      0.5
#'        cue          2     1   1.2        2
#'        cue          2     2    12        2
#'    outcome          2     1     6      0.5
#'    outcome          2     2   9.5      0.5
#' }
#'
#' Note that you can tack on other columns to \code{events} if it useful to you. Furthermore, if you want to test different
#' durations (e.g., RT-convolved versus fixed duration versus instantaneous), you can add these as additional columns
#' (e.g., \code{duration_1s}, \code{duration_instant}, etc.). To make use of these in the design, specify the column name
#' in events in the \code{$duration} element of a given signal in the \code{signals} list. If you do not specify the
#' \code{$duration} element in \code{signals}, \code{build_design_matrix} will assume that the relevant duration is stored
#' in the \code{$duration} column of \code{events}.
#'
#' The \code{signals} argument expects a list where each element is a given signal that should be aligned with an event and that
#' has some height (e.g., 0/1 or a parametric value) prior to convolution. The signals list should be named by signal and each element should
#' be a list itself, such as the following:
#'
#' \preformatted{
#'   signals <- list(
#'     cue=list(event="cue", duration=0, value=1, normalization="none")
#'   )
#' }
#'
#' The \code{event} element specifies the mapping between a given signal and the corresponding timing in the \code{events} \code{data.frame}.
#' In essence, this is used to merge the event and signal data together. Here, we specify that the cue signal is aligned in time with the cue event.
#'
#' The \code{duration} element can be:
#' \enumerate{
#'   \item A single number, in which case this fixed duration is used for all events
#'   \item A name of the column to be used in the \code{events} \code{data.frame} (e.g., "duration_rtshift")
#'   \item Omitted altogether, in which case \code{build_design_matrix} will default to the "duration" column of \code{events}.
#' }
#'
#' The \code{value} element can be a single number (e.g., 1) in which case this height is used for all corresponding occurrences of a given event.
#' Most commonly, a fixed value is useful for modeling a 'taskness' regressor, which captures a 0/1 representation of whether an event is occurring
#' at a given moment in time. In conventional task-reated fMRI, this task indicator representation is then convolved with the HRF to model expected BOLD
#' activity due to the occurrence of an event. Alternatively, \code{value} can be a data.frame containing \code{$run_number}, \code{$trial}, and \code{$value}
#' columns that specify the height of the regressor at each trial. This specification is more useful for a parametric regressor, as in model-based fMRI.
#'
#' Optionally, one or more within-subject factor columns can be included in the value data.frame and noted in the 
#' \code{wi_factors} element of the signal. In this case, regressors for each level of the wi_factors will be generated as
#' separate regressors.
#' 
#' Here is an example:
#'
#' \preformatted{
#'   signals <- list(
#'     pe=list(event="outcome", normalization="none", convmax_1=TRUE, wi_factors="trustee",
#'     value=data.frame(
#'       run_number=rep(1,5),
#'       trial=1:5,
#'       value=c(10.2, -11.1, 6, 2.4, 1.5),
#'       trustee=c("Good", "Good", "Bad", "Bad", "Neutral")
#'     )
#'   )
#' }
#'
#' Here, the parametrically varying prediction error signal will be aligned at the "outcome" event, have a duration copied 
#' from the \code{$duration} column of \code{events}, and will have parametrically varying heights (e.g., 10.2 at trial 1)
#' prior to convolution. Note that the value \code{data.frame} need not have an entry for every trial in the run. For example,
#' if a given signal is only relevant or only occurs for some "outcome" events, the trial column might be something like
#' \code{c(2, 6, 10)}, indicating that the parametric modulator is only modeled at those trials. This is achieved by joining
#' \code{events} with the relevant signal using \code{trial} as a key.
#'
#' The \code{$normalization} element handles the normalization of the HRF for each regressor. This can be:
#' \enumerate{
#'   \item \code{durmax_1}: pre-convolution, normalize the HRF max to 1.0 for long events (15+ sec) such that
#'             height of HRF is modulated by duration of event but maxes at 1. This is identical to dmUBLOCK(0).
#'   \item \code{evtmax_1}: pre-convolution, normalize the HRF max to 1.0 for each stimulus
#'             regardless of duration. This is identical to dmUBLOCK(1).
#'   \item \code{none}: No normalization of the HRF is performed prior to convolution.
#' }
#'
#' The optional \code{$convmax_1} element handles rescaling the \emph{convolved} regressor to a maximum height of 1.0.
#'   If TRUE for a given signal, the convolved regressor will be divided by its max, leading to a max of 1.0 across
#'   both runs (assuming \code{convolve_wi_run} is \code{TRUE}) and subjects. This may be useful for scaling the regression
#'   coefficients in voxelwise regression across subjects. For example, if the parametric signal captures similar dynamics
#'   within subjects over the experiment, but the scaling varies substantially between subjects, \code{convmax_1} can
#'   help to place the betas on an equivalent scale across subjects (assuming the MR data are also scaled similarly
#'   between subjects).
#'
#' The optional \code{$demean_convolved} element handles whether to demean a convolved regressor.
#'
#' The optional \code{$beta_series} element handles whether to convert a given signal into a set of regressors,
#' one per event. This results in a wide design matrix in which the beta series regressors each reflect a single
#' convolved event. All other signal arguments apply such as HRF normalization. \code{$beta_series} defaults to FALSE.
#'
#' The optional \code{ts_multipliers} element can be added to a given signal in order to multiply the event in the design
#' by a continuous time series such as an ROI. This is primarily used to compute interaction terms between events in the design
#' and activity in a given region in order to examine connectivity using a psychophysiological interaction (PPI) approach.
#' The \code{build_design_matrix} function will mean center the time series prior to multiplying it by the relevant design regressor,
#' then convolve the result with the HRF. Thus, it is typical to provide a *deconvolved* time series to \code{ts_multipliers}, though
#' some packages (e.g., FSL) don't typically use deconvolution in this way.
#'
#' Finally, the optional \code{add_deriv} element determines whether the temporal derivative of a regressor is added to
#'   the design matrix after convolution. Following FSL, the derivatives are computed by a first-order difference and are
#'   then residualized for other regressors in the matrix. That is, the derivatives are orthogonalized with respect to
#'   substantive regressors. By default, derivatives are not added, but if \code{TRUE} for a given signal, this will be added
#'   to the convolved design matrix.
#'
#' This function was adapted from the fitclock package (https://github.com/PennStateDEPENdLab/fitclock.git) to
#'   allow for more general creation of design matrices for fMRI analyses.
#'
#' @return A list of containing different aspects of the design matrix:
#' \itemize{
#'        \item \code{$design}: A runs x signals list containing events before convolution.
#'          Each element is a 2-D matrix containing, minimally, "trial", onset", "duration", and "value" columns.
#'          Onsets and durations are specified in seconds, consistent with FSL's 3-column format.
#'          Within each matrix, the onset, duration and value of the signal is specified.
#'        \item \code{$design_convolved}: The convolved design matrices for each run. Each element in the list contains
#'          a run. Within each design matrix, each column contains a regressor, encompassing substantive regressors,
#'          additional signals, and polynomial baseline regressors, if specified. Each row reflects the predicted value for each volume.
#'        \item \code{$design_unconvolved}: The unconvolved design matrices for each run. Same structure as \code{$design_convolved},
#'          but prior to convolution with the HRF.
#'        \item \code{$collin_events}: A list containing information about the collinearity of regressors before convolution.
#'          At the highest level of the list, each element contains a run. At the second level of the list,
#'          the first element contains the correlation matrix of the regressors and the second element provides
#'          the variance inflation factor (VIF) associated with each regressor. Example: \code{design$collin_events$run1$vif}
#'        \item \code{$collin_convolved}: A list containing information about collinearity of the convolved regressors,
#'          including substantive signals, additional regressors, and polynomial regressors. Follows the same structure as \code{$collin_events}.
#'        \item \code{$concat_onsets}: A list containing concatenated event onset times for each signal. Each signal is an element of the list containing
#'          a vector of all onset times across runs. That is, the total time of run1 is added to onsets of run2 to support a combined analysis of all
#'          runs, which is common in AFNI (e.g., using -concat or multiple files to -input).
#'        \item \code{$run_volumes}: A vector containing the total number of volumes modeled for each run.
#'        \item \code{$design_plot}: A ggplot object showing the design matrix. This is generated by \code{visualize_design_matrix}.
#' }
#' @importFrom data.table as.data.table
#' @importFrom dplyr filter select slice bind_rows arrange "%>%"
#' @importFrom orthopolynom legendre.polynomials polynomial.values
#' @importFrom oro.nifti readNIfTI
#' @importFrom ggplot2 ggplot
#' @importFrom car vif
#' @importFrom stats as.formula cor lm residuals rnorm
#' @importFrom utils read.table write.table
#' @importFrom RNifti niftiHeader
#' @importFrom checkmate assert_file_exists
#' @importFrom rlang flatten
#'
#' @author Michael Hallquist
#' @author Alison Schreiber
#' @examples
#'
#' \dontrun{
#'   data(example_events)
#'   data(example_signals)
#'
#'   #basic convolved design matrix
#'   d <- build_design_matrix(events = example_events, signals = example_signals, tr=1.0, plot=FALSE)
#'
#'   data(example_nuisrun1) #load demo additional signals
#'   data(example_nuisrun2)
#'
#'   #design matrix with 0,1,2 polynomial baseline regressors and a set of additional regressors
#'   #this does not contain a 'taskness' regressor for cue or outcome
#'   dnuis <- build_design_matrix(events = example_events, signals = example_signals, tr=1.0,
#'     additional_regressors = list(example_nuisrun1, example_nuisrun2), baseline_coef_order = 2)
#'
#'   #tweak the design to add temporal derivatives for both ev and pe
#'   example_signals$pe$add_deriv <- TRUE
#'   example_signals$ev$add_deriv <- TRUE
#'
#'   #also use the evtmax_1 normalization method for both parametric signals
#'   example_signals$pe$normalization <- "evtmax_1"
#'   example_signals$ev$normalization <- "durmax_1"
#'
#'   #finally, add a taskness regressor for cue and outcome
#'   example_signals$cue_evt <- list(value=1, event="cue", normalization="none")
#'   example_signals$outcome_evt <- list(value=1, event="outcome", normalization="none")
#'
#'   #include up to quadratic drift terms in the design, drop 3 volumes from the beginning,
#'   #and write timing files in AFNI, FSL, and convolved formats (to the "run_timing" directory).
#'   d_modified <- build_design_matrix(events = example_events, signals = example_signals, tr=1.0,
#'     baseline_coef_order=2, drop_volumes=3, write_timing_files = c("convolved", "AFNI", "FSL"))
#'
#'   #show unconvolved design
#'   plot(visualize_design_matrix(concat_design_runs(d_modified, convolved=FALSE)))
#'
#'   #beta series approach for cues: 1 regressor per cue event
#'   example_signals$cue_evt$beta_series <- TRUE
#'   example_signals$ev <- NULL
#'   example_signals$pe <- NULL
#'
#'   d_beta <- build_design_matrix(events = example_events, signals = example_signals, tr=1.0, plot=FALSE,
#'     baseline_coef_order=2, drop_volumes=3, write_timing_files = c("convolved", "FSL"))
#'
#'   #PPI example
#'
#'   events <- rbind(
#'     data.frame(event="cue", run_number=1, trial=1:5, onset=c(5, 20, 50, 100, 150), duration=4),
#'     data.frame(event="cue", run_number=2, trial=1:5, onset=c(15, 35, 60, 90, 105), duration=4)
#'   )
#'
#'   signals <- list(
#'     load=list(value=rbind(
#'       data.frame(run_number=1, trial=1:5, value=1:5),
#'       data.frame(run_number=2, trial=1:5, value=1:5)
#'     ), event="cue", normalization="none"),
#'     cue_evt=list(value=1, event="cue", normalization="none"),
#'     cue_evt_ppi=list(value=1, event="cue", normalization="none", ts_multipliers="vs")
#'   )
#'
#'   library(neuRosim)
#'   forppi1 <- simTSrestingstate(nscan=328, base=1, TR=1, SNR=6)
#'   forppi2 <- simTSrestingstate(nscan=242, base=1, TR=1, SNR=6)
#'   test_ppi <- list(run1=data.frame(vs=forppi1), run2=data.frame(vs=forppi2))
#'
#'   #In PPI, the time series itself should typically be included as a regressor
#'   d <- build_design_matrix(events = events, signals = signals, tr=0.5, center_values=TRUE,
#'                 write_timing_files = c("convolved", "AFNI", "FSL"),
#'                 drop_volumes = 0, baseline_coef_order = 2, run_volumes = c(328, 242),
#'                 ts_multipliers = test_ppi, additional_regressors = test_ppi)
#' }
#'
#' @export
build_design_matrix <- function(
  events = NULL,
  signals = NULL,
  tr=NULL, #TR of scan in seconds
  center_values=TRUE, #whether to center parametric regressors prior to convolution
  hrf_parameters=c(a1 = 6, a2 = 12, b1 = 0.9, b2 = 0.9, cc = 0.35),
  baseline_coef_order=-1L, #don't include baseline by default
  baseline_parameterization="Legendre",
  run_data = NULL, # data.frame containing run numbers, niftis, volumes, and dropped volumes
  run_nifti_drop_applied = TRUE, # if TRUE, assume that drop_volumes was already removed
  drop_volumes=0L, #vector of how many volumes to drop from the beginning of a given run
  runs_to_output=NULL,
  plot=TRUE,
  write_timing_files=NULL,
  output_directory="run_timing",
  keep_empty_regressors = TRUE,
  convolve_wi_run=TRUE, #whether to mean center parametric regressors within runs before convolution
  high_pass=NULL, #whether to apply a high-pass filter to the design matrix (e.g., to match fmri preprocessing)
  iti_post = 12,
  ts_multipliers=NULL, #time series regressors that can be multiplied against signals prior to convolution
  additional_regressors = NULL, #allow for additional regression text file to be implemented. need separate file for each
  lg = NULL
) {

  checkmate::assert_character(write_timing_files, null.ok=TRUE)
  if (!is.null(write_timing_files)) { write_timing_files <- tolower(write_timing_files) } #always use lower case internally
  checkmate::assert_subset(write_timing_files, c("convolved", "fsl", "afni", "spm"))

  # validate run_data
  checkmate::assert_data_frame(run_data)
  if (!is.null(run_data$run_nifti)) { # verify the NIfTIs exist, if specified
    checkmate::assert_file_exists(run_data$run_nifti)
  }
  checkmate::assert_integerish(run_data$run_volumes, lower = 1, null.ok = TRUE)
  checkmate::assert_integerish(run_data$drop_volumes, lower = 0, null.ok = TRUE)
  if (is.null(run_data$run_number)) {
    message("No run_number column found in run_data. Assuming that runs numbers are sequential ascending and adding this column.")
    run_data$run_number <- seq_len(nrow(run_data))
  }
  checkmate::assert_integerish(run_data$run_number, lower = 1)
  checkmate::assert_integerish(drop_volumes, lower = 0)

  if (is.null(lg)) lg <- lgr::get_logger()

  # If drop_volumes is just 1 in length, assume it applies to all runs
  # This will only have an effect if run_data does not already have a drop_volumes column
  if (is.null(run_data$drop_volumes)) {
    if (length(drop_volumes) == nrow(run_data)) {
      run_data$drop_volumes <- drop_volumes # populate run-wise drops
    } else if (length(drop_volumes) == 1L && is.numeric(drop_volumes) && drop_volumes[1L] > 0) {
      lg$info(glue("Using first element of drop_volumes for all runs: {drop_volumes[1L]}"))
      run_data$drop_volumes <- rep(drop_volumes[1L], nrow(run_data))
    } else {
      run_data$drop_volumes <- 0L
    }
  }

  #take a snapshot of arguments to build_design_matrix that we pass to subsidiary functions
  bdm_args <- as.list(environment(), all.names = TRUE)
  #validate events data.frame

  if (is.null(events)) stop("You must pass in an events data.frame. See ?build_design_matrix for details.")
  if (is.null(signals)) stop("You must pass in a signals list. See ?build_design_matrix for details.")
  if (is.null(tr)) stop("You must pass in the tr (repetition time) in seconds. See ?build_design_matrix for details.")

  checkmate::assert_number(tr, lower=0.01, upper=100) # enforce scalar number

  stopifnot(inherits(events, "data.frame"))
  if (!"event" %in% names(events)) { stop("events data.frame must contain event column with the name of the event") }
  if (!"trial" %in% names(events)) { stop("events data.frame must contain trial column with the trial number for each event") }
  if (!"onset" %in% names(events)) { stop("events data.frame must contain onset column with the onset time in seconds") }
  if (!"duration" %in% names(events)) { stop("events data.frame must contain duration column with the event duration in seconds") }

  if (!"run_number" %in% names(events)) {
    lg$info("No run_number column found in events. Assuming run_number=1 and adding this column")
    events$run_number <- 1
  }

  if (any(is.na(events$duration))) {
    msg <- "Invalid missing (NA) durations included in events data.frame. Cannot continue."
    lg$error(msg)
    lg$error("%s", capture.output(print(subset(events, is.na(duration)))))
    stop(msg)
  } else if (any(events$duration < 0)) {
    msg <- "Invalid negative durations included in events data.frame. Cannot continue."
    lg$error(msg)
    lg$error("%s", print(subset(events, duration < 0)))
    stop(msg)
  }

  if (any(is.na(events$onset))) {
    msg <- "Invalid missing (NA) onsets included in events data.frame. Cannot continue."
    lg$error(msg)
    lg$error("%s", capture.output(print(subset(events, is.na(onset)))))
    stop(msg)
  } else if (any(events$onset < 0)) {
    msg <- "Invalid negative onsets included in events data.frame"
    lg$error(msg)
    lg$error("%s", capture.output(print(subset(events, onset < 0))))
    stop(msg)
  }

  # update run_volumes to reflect drops: elementwise subtraction of dropped volumes from full lengths
  # GENERALLY: want drop_volumes to a) subtract from run_volumes and b) subtract tr*drop_volumes from timing
  #   We should also assume that any volume-level input (ts files and confounds) should be shortened, too.
  # The caveat is how to handle files that are already truncated. We'd need to *add* drop_volumes to offset.

  # Internal flags for tracking how drop_volumes is applied. Not exposed to user for now
  # At present, we only support the following:
  #   - NIfTIs can already have drop_volumes applied (externally), or they can be truncated/dropped later (by other programs)
  #   - timing is always shifted according to drop_volumes
  #   - additional regressors are always assumed to be the original (untruncated) length, and volumes are dropped if requested
  #   - ts_multipliers (PPI-style) are always assumed to be the original (untruncated) length, and volumes are dropped if requested

  shift_nifti <- FALSE # if TRUE, we'd be on the hook for doing an fslroi x <drop_volumes> approach (not supported)
  shift_timing <- TRUE # shift drop_volumes*tr from event onset times (would only be FALSE if user did the subtraction externally, which is not supported)
  # shorten_volumes <- TRUE # subtract drop_volumes from run_volumes.
  shorten_additional <- TRUE # whether to apply drop_volumes to additional regressors (would only be FALSE if user supplied confound files that already dropped these)
  shorten_ts <- TRUE # whether to apply drop_volumes to ts regressors (would only be FALSE if user supplied confound files that already dropped these)

  # make sure beta_series is populated with TRUE or FALSE (default)
  signals <- lapply(signals, function(s) {
    s$beta_series <- ifelse(isTRUE(s$beta_series), TRUE, FALSE) #no beta series by default
    s
  })

  # expand_signal returns a list itself -- use flatten to make one big mega-list
  signals_expanded <- rlang::flatten(unname(lapply(signals, expand_signal)))
  names(signals_expanded) <- sapply(signals_expanded, "[[", "name")

  # merge the trial-indexed signals with the time-indexed events
  # basically: put the events onto the time grid of the run based on the "event" element of the list
  signals_aligned <- lapply(signals_expanded, function(s) {
    # signals_aligned <- lapply(signals, function(s) {
    if (is.null(s$event)) {
      msg <- "Signal does not have event element"
      lg$error(msg)
      lg$error("%s", capture.output(print(s)))
      stop(msg)
    }

    if (is.null(s$value)) {
      lg$info("Signal is missing a 'value' element. Adding 1 for value, assuming a unit-height regressor.")
      s$value <- 1
    }

    join_cols <- c("run_number", "trial")
    df_events <- dplyr::filter(events, event == s$event)
    event_runs <- factor(sort(unique(df_events$run_number)))
    df_signal <- s$value # the signal data.frame for this signal

    # further match-merge on id, session, and block_number, if present
    if ("id" %in% names(df_events)) join_cols <- c(join_cols, "id")
    if ("session" %in% names(df_events)) join_cols <- c(join_cols, "session")
    if ("block_number" %in% names(df_events)) join_cols <- c(join_cols, "block_number")

    if (length(df_signal) == 1L && is.numeric(df_signal)) { #task indicator-type regressor
      s_aligned <- df_events
      s_aligned$value <- df_signal #replicate requested height for all occurrences
    } else if (is.data.frame(df_signal)) {
      s_aligned <- df_signal %>%
        dplyr::left_join(df_events, by = join_cols) %>%
        dplyr::arrange(run_number, trial) # enforce match on signal side
    } else { stop("Unknown data type for signal.") }

    if (length(s$duration) > 1L) {
      stop("Don't know how to interpret multi-element duration argument for signal: ", paste0(s$duration, collapse = ", "))
    }

    if (!is.null(s$duration)) {
      if (is.numeric(s$duration)) {
        s_aligned$duration <- s$duration #replicate the scalar on all rows
      } else {
        s_aligned$duration <- s_aligned[[s$duration]]
      }
    }

    # transform to make dmat happy (runs x regressors 2-d list)
    # dplyr::select will tolerate quoted names, which avoids R CMD CHECK complaints
    retdf <- s_aligned %>%
      dplyr::select("run_number", "trial", "onset", "duration", "value") %>%
      mutate(run_number = factor(run_number, levels = event_runs)) %>%
      setDT()
    # use data.table split method to keep all levels and drop by column. sorted=TRUE also keeps things in run order
    retsplit <- split(retdf, by="run_number", keep.by=FALSE, sorted=TRUE)
    names(retsplit) <- paste0("run_number", names(retsplit))
    #tag the aligned signal with the event element so that we can identify which regressors are aligned to the same event later
    retsplit <- lapply(retsplit, function(rr) { attr(rr, "event") <- s$event; return(rr) })

    return(retsplit)
  })

  #extract the normalization for each regressor into a vector
  bdm_args$normalizations <- sapply(signals_expanded, function(s) {
    ifelse(is.null(s$normalization), "none", s$normalization) #default to none
  })

  #extract whether to divide a given regressor into a beta series (one regressor per event)
  bdm_args$beta_series <- sapply(signals_expanded, function(s) {
    ifelse(isTRUE(s$beta_series), TRUE, FALSE) #no beta series by default
  })

  #Extract whether to remove zero values from the regressor prior to convolution.
  #This is especially useful when mean centering before convolution.
  bdm_args$rm_zeros <- sapply(signals_expanded, function(s) {
    if (is.null(s$rm_zeros)) {
      TRUE #default to removing zeros before convolution
    } else {
      if (!s$rm_zeros %in% c(TRUE, FALSE)) { #NB. R kindly type casts "TRUE" and 1/0 to logical for this comparison
        stop("Don't know how to interpret rm_zeros setting of: ", s$rm_zeros)
      } else {
        as.logical(s$rm_zeros)
      }
    }
  })

  #extract the convmax_1 settings for each regressor into a vector
  bdm_args$convmax_1 <- sapply(signals_expanded, function(s) {
    if (is.null(s$convmax_1)) {
      FALSE
    } else {
      if (!s$convmax_1 %in% c(TRUE, FALSE)) { #NB. R kindly type casts "TRUE" and 1/0 to logical for this comparison
        stop("Don't know how to interpret convmax_1 setting of: ", s$convmax_1)
      } else {
        as.logical(s$convmax_1)
      }
    }
  })

  #determine whether to add a temporal derivative for each signal
  bdm_args$add_derivs <- sapply(signals_expanded, function(s) {
    if (is.null(s$add_deriv)) {
      FALSE
    } else {
      if (!s$add_deriv %in% c(TRUE, FALSE)) { #NB. R kindly type casts "TRUE" and 1/0 to logical for this comparison
        stop("Don't know how to interpret add_deriv setting of: ", s$add_deriv)
      } else {
        as.logical(s$add_deriv)
      }
    }
  })

  #define number of runs based off of the length of unique runs in the events data.frame
  nruns <- length(unique(events$run_number))

  # determine the number of volumes in each run based on inputs
  run_data <- determine_run_volumes(run_data, nruns, run_nifti_drop_applied, tr, signals_aligned, lg)

  #determine which run fits should be output for fmri analysis
  if (is.null(runs_to_output)) {
    message("Assuming that all runs should be fit and run numbers are sequential ascending")
    runs_to_output <- seq_along(run_data$run_volumes) #output each run
  }

  # read and process additional regressors (e.g., confounds) -- if not NULL, this should return a data.frame indexed by run
  additional_regressors <- get_additional_regressors(additional_regressors, run_data$run_volumes, run_data$drop_volumes, shorten_additional, lg)

  ts_multipliers_df <- get_ts_multipliers(ts_multipliers, run_data, shorten_ts)

  #Add ts_multipliers to signals as needed. Will generate a list in which each element is a run
  #NB. This doesn't handle runs_to_output appropriately!!
  bdm_args$ts_multiplier <- lapply(signals_expanded, function(s) {
    if (is.null(s$ts_multiplier) || isFALSE(s$ts_multiplier)) {
      return(NULL)
    } else {
      #split the relevant column of the ts_multipliers_df for this signal at run boundaries
      ss <- split(ts_multipliers_df[[s$ts_multiplier]], ts_multipliers_df$run_number)
      return(ss)
    }
  })

  # Build the runs x signals 2-D list
  # Note that we enforce dropped volumes (from beginning of run) below
  # This is because fmri.stimulus gives odd behaviors (e.g. constant 1s) if an onset time is negative
  dmat <- do.call(cbind, lapply(seq_along(signals_aligned), function(signal) {
    lapply(signals_aligned[[signal]], function(run) {
      mm <- as.matrix(run)
      attr(mm, "event") <- attr(run, "event") # propagate event tag
      return(mm)
    })
  }))

  # enforce that run subsetting depends on having the runs present in the run_data object
  stopifnot(all(runs_to_output %in% run_data$run_number))
  run_data <- run_data %>% dplyr::filter(run_number %in% !!runs_to_output)

  # use row names, rather than numeric positions, to enforce subsetting (e.g., noncontiguous runs in dmat)
  dmat <- dmat[paste0("run_number", runs_to_output), , drop = FALSE]

  # from this point forward, run_volumes and drop_volumes are local variables that reflect the runs that are retained
  run_volumes <- run_data$run_volumes
  drop_volumes <- run_data$drop_volumes
  run_nifti <- run_data$run_nifti

  #run_volumes and drop_volumes are used by convolve_regressor to figure out the appropriate regressor length
  bdm_args$run_volumes <- run_volumes #copy into argument list
  bdm_args$drop_volumes <- drop_volumes #copy into argument list

  # make sure the columns of the 2-D list are named by signal
  dimnames(dmat)[[2L]] <- names(signals_aligned)

  # if volumes are being dropped (and shift-timing is TRUE), subtract the dropped volumes from the onset times.
  dmat <- shift_dmat_timing(dmat, tr, drop_volumes, shift_timing)

  # concatenate regressors across runs by adding timing from MR files.
  run_timing <- cumsum(run_volumes) * tr # timing in seconds of the start of successive runs

  # convert the trial-oriented dmat to a time-oriented dmat_convolved.
  # also get an unconvolved version on the time grid for diagnostics.
  dmat_convolved <- place_dmat_on_time_grid(dmat, convolve = TRUE, run_timing = run_timing, bdm_args, lg)
  dmat_unconvolved <- place_dmat_on_time_grid(dmat, convolve = FALSE, run_timing = run_timing, bdm_args, lg)

  # dmat_convolved should now be a 1-d runs list where each element is a data.frame of convolved regressors.
  names(dmat_convolved) <- names(dmat_unconvolved) <- paste0("run", runs_to_output)

  #add additional regressors to dmat_convolved here so that they are written out as part of the write_timing_files step
  if (!is.null(additional_regressors)) {
    for (i in seq_along(dmat_convolved)) {
      # this is clunky, but necessary to make sure we grab the right additional signals 
      # (would need to refactor dmat_convolved to get it less clunky)
      runnum <- as.numeric(sub("run_number(\\d+)", "\\1", names(dmat_convolved)[i], perl=TRUE))
      additional_regressors_currun <- additional_regressors_df %>% 
        dplyr::filter(run_number == !!runnum) %>%
        dplyr::select(-run_number)

      #ensure that additional regressors are mean-centered
      additional_regressors_currun <- as.data.frame(lapply(additional_regressors_currun, function(x) { x - mean(x, na.rm=TRUE) } ))
      dmat_convolved[[i]] <- dplyr::bind_cols(dmat_convolved[[i]], additional_regressors_currun)
      dmat_unconvolved[[i]] <- dplyr::bind_cols(dmat_unconvolved[[i]], additional_regressors_currun)
    }

    #additional_regressors_df_split <- split(additional_regressors_df, additional_regressors_df$run_number)
    #dmat_changed <- lapply(dmat, function(x) {cbind(x, additional_regressors_df_split[[x]])})
    #dmat_convolved <- cbind(dmat_convolved, additional_regressors_df)
  }

  # Write timing files to disk for analysis by AFNI, FSL, etc.
  tf_convolved <- tf_convolved_concat <- tf_fsl <- tf_afni <- NULL
  
  if (!is.null(write_timing_files)) {
    dir.create(output_directory, recursive=TRUE, showWarnings=FALSE)

    if ("convolved" %in% write_timing_files) {
      # write convolved regressors
      tf_convolved <- matrix(NA_character_, nrow = dim(dmat)[1L], ncol = dim(dmat)[2L], dimnames=dimnames(dmat))

      conv_concat <- list()
      lapply(seq_along(dmat_convolved), function(r) {
        lapply(seq_along(dmat_convolved[[r]]), function(v) {
          reg_name <- names(dmat_convolved[[r]])[v]
          fname <- paste0(names(dmat_convolved)[r], "_", reg_name, ".1D")
          to_write <- round(dmat_convolved[[r]][[v]], 6)
          conv_concat[[reg_name]] <<- c(conv_concat[[reg_name]], to_write) # add for concatenated 1D file
          ofile <- file.path(output_directory, path_sanitize(fname, replacement = ""))
          tf_convolved[r, v] <<- ofile
          write.table(to_write,
            file = ofile,
            sep = "\n", eol = "\n", quote = FALSE, col.names = FALSE, row.names = FALSE
          )
        })
      })

      #tf_convolved_concat <- rep(NA_character_, length(conv_concat) %>% setNames(names(conv_concat))

      #write run-concatenated convolved regressors (for use in AFNI)
      tf_convolved_concat <- sapply(seq_along(conv_concat), function(v) {
        fname <- paste0(names(conv_concat)[v], "_concat.1D")
        ofile <- file.path(output_directory, path_sanitize(fname, replacement = ""))
        #tf_convolved_concat[v] <<- ofile
        write.table(conv_concat[[v]],
          file = ofile,
          sep = "\n", eol = "\n", quote = FALSE, col.names = FALSE, row.names = FALSE
        )
        return(ofile)
      }) %>% setNames(names(conv_concat))

    }

    if ("fsl" %in% write_timing_files) {
      tf_fsl <- matrix(NA_character_, nrow = dim(dmat)[1L], ncol = dim(dmat)[2L], dimnames = dimnames(dmat))

      for (i in seq_len(dim(dmat)[1L])) {
        for (reg in seq_len(dim(dmat)[2L])) {
          regout <- dmat[[i, reg]]
          if (nrow(regout) == 0L) {
            next
          } else {
            regout <- regout[, c("onset", "duration", "value"), drop = FALSE]
          }

          if (center_values && !all(na.omit(regout[, "value"]) == 0.0)) {
            #remove zero-value events from the regressor
            regout <- regout[regout[, "value"] != 0, , drop = FALSE]

            #now mean center values (unless there is no variation, such as a task indicator function)
            if (nrow(regout) > 1L && sd(regout[, "value"], na.rm=TRUE) > 0) {
              regout[, "value"] <- regout[, "value"] - mean(regout[, "value"], na.rm=TRUE)
            }
          }

          fname <- paste0("run", runs_to_output[i], "_", dimnames(dmat)[[2L]][reg], "_FSL3col.txt")
          ofile <- file.path(output_directory, path_sanitize(fname, replacement = ""))
          tf_fsl[i, reg] <- ofile
          write.table(regout, file=ofile, sep="\t", eol="\n", col.names=FALSE, row.names=FALSE)
        }
      }
    }

    if ("afni" %in% write_timing_files) {
      #TODO: Make this work for beta series outputs
      #use dmBLOCK-style regressors: time*modulation:duration. One line per run

      #AFNI amplitude modulation forces a mean and deviation from the mean regressor for each effect
      #as a result, if two parametric influences occur at a given time, it leads to perfect collinearity.

      #need to unify multiple values that share onset time
      #time*modulation1,modulation2:duration

      # regonsets <- lapply(dmat, function(reg) {
      #   reg[,"onset"]
      #   #unname(do.call(c, lapply(reg, function(run) { run[,"onset"]})))
      # })

      #this approach is flawed if the first onsets for a run for two events do not align because one is omitted.
      #for example, if there is no PE on trial 1, but then it is aligned later, the first onsets will differ spuriously
      #seems the only way around this might be to preserve trial as a field in dmat, then merge here.
      # regonsets <- apply(dmat, 1, function(run) {
      #   onsets <- sapply(run, function(df) { df[,"onset"]})
      #   rmat <- matrix(NA_real_, nrow=max(sapply(onsets, length)), ncol=length(onsets)) #use max length of any regressor as target
      #
      #   #unname(do.call(c, lapply(reg, function(run) { run[,"onset"]})))
      # })

      # TODO: make this work for uneven numbers of events across regressors
      # The apply returns a list in the case of uneven regressors, which crashes various things below
      #this will require some sort of outer_join approach using trial. I've now preserved trial as a field in dmat, but haven't solved this
      regonsets <- apply(dmat, 2, function(reg) {
        unname(do.call(c, lapply(reg, function(run) { run[, "onset"]})))
      })

      regdurations <- apply(dmat, 2, function(reg) {
        unname(do.call(c, lapply(reg, function(run) { run[, "duration"]})))
      })

      # magical code from here: http://stackoverflow.com/questions/22993637/efficient-r-code-for-finding-indices-associated-with-unique-values-in-vector
      # use combination of onset and duration to determine unique dmBLOCK regressors
      first_onset_duration <- paste(regonsets[1, ], regdurations[1, ])

      # just matches on first row (should work in general)
      dt <- as.data.table(first_onset_duration)[, list(comb=list(.I)), by=first_onset_duration]

      lapply(dt$comb, function(comb) {
        combmat <- dmat[, comb, drop=F]
        #onsets and durations are constant, amplitudes vary
        runvec <- c() #character vector of AFNI runs (one row per run)
        for (i in seq_len(dim(combmat)[1L])) {
          #just use first regressor of combo to get vector onsets and durations (since combinations, by definition, share these)
          runonsets <- combmat[[i, 1]][, "onset"]
          rundurations <- combmat[[i, 1]][, "duration"]
          runvalues <- do.call(cbind, lapply(combmat[i,], function(reg) { reg[, "value"] }))

          #AFNI doesn't like us if we pass in the boxcar ourselves in the dmBLOCK format (since it creates this internally). Filter out.
          indicator_func <- apply(runvalues, 2, function(col) { all(col == 1.0)} )
          if (any(indicator_func)) runvalues <- runvalues[, -1*which(indicator_func), drop=FALSE]

          # if the indicator regressor was the only thing present, revert to the notation TIME:DURATION notation
          # for dmBLOCK (not TIME*PARAMETER:DURATION)
          if (ncol(runvalues) == 0L) {
            runvec[i] <- paste(sapply(seq_along(runonsets), function(j) {
              paste0(round(runonsets[j], 6), ":", round(rundurations[j], 6))
            }), collapse=" ")
          } else {
            runvec[i] <- paste(sapply(seq_along(runonsets), function(j) {
              paste0(round(runonsets[j], 6), "*", paste(round(runvalues[j, ], 6), collapse=","), ":", round(rundurations[j], 6))
            }), collapse=" ")
          }
        }

        writeLines(runvec, file.path(output_directory, paste0(paste(dimnames(combmat)[[2L]], collapse="_"), "_dmBLOCK.txt")))
      })

    }
  }

  # compute collinearity diagnostics on the unconvolved signals
  collin_diag_events <- get_collin_events(dmat)

  #compute collinearity diagnostics on the convolved signals
  collin_diag_convolved <- lapply(dmat_convolved, function(run) {
    corvals <- cor(run, use="pairwise.complete.obs")
    vif_mat <- data.frame(cbind(dummy=rnorm(nrow(run)), run)) #add dummy constant for vif
    vif_form <- as.formula(paste("dummy ~ 1 +", paste(names(run), collapse=" + ")))

    var_infl <- tryCatch(car::vif(lm(vif_form, data=vif_mat)), error=function(e) { NA }) #return NA if failure
    list(r=corvals, vif=var_infl)
  })

  # add baseline regressors to convolved design matrix, if requested
  dmat_convolved <- add_baseline_regressors(dmat_convolved,
    baseline_coef_order = baseline_coef_order,
    baseline_parameterization = baseline_parameterization
  )

  #Define concatenatenation of all events (e.g., in an AFNI-style setup)
  #Note that we want to add the run timing from the r-1 run to timing for run r.
  #Example: run 1 is 300 volumes with a TR of 2.0
  #  Thus, run 1 has timing 0s .. 298s, and the first volume of run 2 would be 300s
  design_concat <- lapply(seq_len(dim(dmat)[2L]), function(reg) {
    thisreg <- dmat[, reg]
    concat_reg <- do.call(rbind, lapply(seq_along(thisreg), function(run) {
      timing <- thisreg[[run]]
      timing[, "onset"] <- timing[, "onset"] + ifelse(run > 1, run_timing[run-1], 0)
      return(timing)
    }))
    attr(concat_reg, "event") <- attr(thisreg[[1]], "event") #propagate event alignment

    concat_reg
  })
  names(design_concat) <- dimnames(dmat)[[2L]]

  #just the onsets for each event
  concat_onsets <- lapply(design_concat, function(x) { x[, "onset"] })

  timing_files <- list(convolved = tf_convolved, convolved_concat = tf_convolved_concat, fsl = tf_fsl, afni = tf_afni)

  to_return <- list(
    design = dmat, design_concat = design_concat, design_convolved = dmat_convolved,
    design_unconvolved = dmat_unconvolved, collin_events = collin_diag_events,
    collin_convolved = collin_diag_convolved, concat_onsets = concat_onsets, runs_to_output = runs_to_output,
    run_nifti = run_nifti, run_volumes = run_volumes, drop_volumes = drop_volumes, tr = tr,
    output_directory = output_directory, additional_regressors = additional_regressors,
    timing_files = timing_files)

  to_return$design_plot <- visualize_design_matrix(concat_design_runs(to_return))

  class(to_return) <- c("list", "bdm") #tag as bdm object
  if (isTRUE(plot)) { plot(to_return$design_plot) }
  return(to_return)

}

# helper function to shift onset times based on dropped volumes
shift_dmat_timing <- function(dmat, tr, drop_volumes = 0, shift_timing = TRUE) {
  if (isFALSE(shift_timing) || all(drop_volumes == 0L)) {
    return(dmat)
  } # return dmat unchanged

  # how much time is being dropped from the beginning of the run (used to change event onsets)
  time_offset <- tr * drop_volumes

  if (time_offset[1L] > 0) {
    if (length(unique(time_offset)) == 1L) {
      to_print <- time_offset[1L]
    } else {
      to_print <- paste(time_offset, collapse = ", ")
    }
    message("Based on drop_volumes and tr, subtracting ", to_print, "s from event onsets")
  }

  # Shift the timing of the regressors in dmat according to drop_volumes
  for (i in seq_len(dim(dmat)[1])) { # run
    for (j in seq_len(dim(dmat)[2])) { # regressor
      df <- dmat[[i, j]]
      if (nrow(df) == 0L) next # empty regressor
      df[, "onset"] <- df[, "onset"] - time_offset[i]
      if (min(df[, "onset"] < 0)) {
        message(
          "For run ", dimnames(dmat)[[1]][i], ", regressor ", dimnames(dmat)[[2]][j], ", there are ",
          sum(df[, "onset"] < 0), " events before time 0"
        )
      }
      dmat[[i, j]] <- df
    }
  }

  return(dmat)
}

# helper to lookup number of volumes in each run based on whether NIfTIs are passed in versus run_volumes vector
determine_run_volumes <- function(run_data = NULL, nruns = NULL, run_nifti_drop_applied=TRUE, tr = NULL, signals_aligned, lg = NULL) {
  checkmate::assert_data_frame(run_data)
  checkmate::assert_integerish(nruns, lower = 1, len = 1L)

  if (!is.null(run_data$run_nifti)) {
    lg$info("Using NIfTI images to determine run lengths.")
    run_volumes_detected <- sapply(run_data$run_nifti, function(xx) {
      oro.nifti::readNIfTI(xx, read_data = FALSE)@dim_[5L]
    }, USE.NAMES=FALSE)

    # if user specified the number of volumes, check that this matches detected volumes
    if (!is.null(run_data$run_volumes)) {
      if (all.equal(run_data$run_volumes, run_volumes_detected)) {
        # nothing to do at present
      } else if (all.equal(run_volumes_detected + run_data$drop_volumes, run_data$run_volumes)) {
        lg$info("Number of volumes detected + run_data$drop_volumes == run_data$run_volumes. Thus, assuming run_nifti_drop_applied == TRUE.")
        run_nifti_drop_applied <- TRUE
      } else {        
        lg$warn("Detected and provided run volumes do not match.")
        lg$warn("%s", capture.output(print(cbind(run_volumes = run_data$run_volumes, run_volumes_detected = run_volumes_detected))))
      }
    }

    # For NIfTI input, if run_nifti_drop_applied == TRUE, assume that the .nii.gz is already truncated.
    # Therefore, the number of volumes detected in the NIfTI must be adjusted *upward* 
    #   by run_data$drop_volumes so that the drop is carried out properly.
    if (any(run_data$drop_volumes > 0) && isTRUE(run_nifti_drop_applied)) {
      lg$info("NIfTIs already have run_data$drop_volumes removed. Thus, adjusting run_data$run_volumes upward accordingly.")
      run_data$run_volumes <- run_volumes_detected + run_data$drop_volumes
      #print(data.frame(nifti=run_data$run_nifti, run_volumes_detected=run_volumes_detected, run_volumes=run_data$run_volumes))
    } else {
      run_data$run_volumes <- run_volumes_detected
    }
  } else if (is.null(run_data$run_volumes)) {
    #determine the last fMRI volume to be analyzed
    run_data$run_volumes <- rep(0, nruns)

    for (i in 1:nruns) {
      for (j in seq_along(signals_aligned)) {
        currentdf <- signals_aligned[[j]][[i]] #jth signal, ith run (signals_aligned has signals at top level)

        #estimate the last moment
        highesttime <- ceiling(max(currentdf$onset + currentdf$duration + iti_post, na.rm=TRUE)/tr)

        #update run_data$run_volumes for ith run only if this signal has later event
        if (highesttime > run_data$run_volumes[i]) { run_data$run_volumes[i] <- highesttime }
      }
    }

    lg$info(sprintf("Assuming that last fMRI volume was %.1f seconds after the onset of the last event.", iti_post))
    lg$info(paste0("Resulting volumes: ", paste(run_data$run_volumes, collapse=", ")))

  } else if (is.numeric(run_data$run_volumes)) {
    #replicate volumes for each run if a scalar is passed
    if (length(run_data$run_volumes) == 1L) { run_data$run_volumes <- rep(run_data$run_volumes, nruns) }
    stopifnot(length(run_data$run_volumes) == nruns)
  } else {
    msg <- sprintf("Don't know how to handle run_data$run_volumes: %s", run_data$run_volumes)
    lg$error(msg)
    stop(msg)
  }

  # Shorten number of volumes based on the number of volumes that are dropped
  # run_volumes should always be the final number of volumes after drops are applied
  run_data$run_volumes <- run_data$run_volumes - run_data$drop_volumes

  return(run_data)
}

# helper function to process confound/additional regressors into a single data.frame, dropping volumes if requested
get_additional_regressors <- function(additional_regressors, run_volumes, drop_volumes = 0, shorten_additional = TRUE, lg = NULL) {
  # handle additional volume-wise regressors (e.g., confounds)
  if (is.null(additional_regressors)) {
    return(NULL)
  } # nothing to do

  # if user provides a list of read each dataset into a list
  if (is.character(additional_regressors)) {
    stopifnot(all(file.exists(additional_regressors))) # enforce existence of these files
    additional_regressors <- lapply(additional_regressors, data.table::fread, data.table = FALSE)
  }

  if (!is.list(additional_regressors)) {
    msg <- "Cannot process additional_regressors because it is not a list"
    lg$error(msg)
    stop(msg)
  }

  if (length(additional_regressors) != length(run_volumes)) {
    msg <- glue("Number of elements in additional_regressors: {length(additional_regressors)} does not much length of run_volumes: {length(run_volumes)}")
    lg$error(msg)
    stop(msg)
  }

  # process list of regressors (one element per run)
  # all that need to do is concatenate the data frames after filtering any obs that are above run_volumes
  additional_regressors_df <- data.frame()

  for (i in seq_along(additional_regressors)) {
    additional_regressors_currun <- additional_regressors[[i]]
    stopifnot(is.data.frame(additional_regressors_currun))
    additional_regressors_currun$run_number <- i

    # define which rows of the dataset to keep based on drop_volumes settings
    if (isTRUE(shorten_additional)) {
      rv <- 1:run_volumes[i] + drop_volumes[i]
    } else {
      rv <- 1:run_volumes[i]
    }

    # message(paste0("Current run_volumes:", rv))
    if (nrow(additional_regressors_currun) < length(rv)) {
      msg <- glue("additional_regressors have fewer observations: {nrow(additional_regressors_currun)} than run_volumes: {length(rv)}")
      lg$error(msg)
      stop(msg)
    }

    additional_regressors_currun <- additional_regressors_currun %>%
      dplyr::slice(rv) %>%
      as.data.frame()
    additional_regressors_df <- dplyr::bind_rows(additional_regressors_df, additional_regressors_currun)
  }

  return(additional_regressors_df)
}

# helper function to read in run-wise time series that will be multiplied against convolved regressors (PPI)
get_ts_multipliers <- function(ts_multipliers = NULL, run_data, shorten_ts) {
  # handle time series modulator regressors
  if (!is.null(ts_multipliers)) {
    ts_multipliers_df <- data.frame()

    for (i in seq_along(ts_multipliers)) {
      if (is.character(ts_multipliers)) {
        ts_multipliers_currun <- read.table(ts_multipliers[i]) # read in ith text file
      } else {
        ts_multipliers_currun <- ts_multipliers[[i]] # use ith element of list of data.frames
      }

      stopifnot(is.data.frame(ts_multipliers_currun))
      # mean center PPI signals -- crucial for convolution to be sensible
      ts_multipliers_currun <- as.data.frame(lapply(ts_multipliers_currun, function(x) {
        x - mean(x, na.rm = TRUE)
      }))
      ts_multipliers_currun$run_number <- i

      # define which rows of the dataset to keep based on drop_volumes settings
      if (isTRUE(shorten_ts)) {
        rv <- 1:run_data$run_volumes[i] + run_data$drop_volumes[i]
      } else {
        rv <- 1:run_data$run_volumes[i]
      }

      # message(paste0("Current run_volumes:", rv))
      if (nrow(ts_multipliers_currun) < length(rv)) {
        stop("ts_multiplier regressor has fewer observations than run_volumes")
      }
      ts_multipliers_currun <- dplyr::slice(ts_multipliers_currun, rv) %>% as.data.frame()
      ts_multipliers_df <- dplyr::bind_rows(ts_multipliers_df, ts_multipliers_currun)
    }
  } else {
    ts_multipliers_df <- NULL
  }

  return(ts_multipliers_df)
}

#Example Data Set that can be used to visualize what build design matrix expects
# source("fmri_utility_fx.R")
# set.seed(480929)
# events <- dplyr::bind_rows(data.frame(event="cue",
#   rbind(
#     data.frame(
#       run_number=1,
#       trial=1:50,
#       onset=cumsum(rpois(50, lambda = 2)),
#       duration=rep(2, 50), custom_dur=abs(rnorm(50))),
#     data.frame(
#       run_number=2,
#       trial=1:50,
#       onset=cumsum(rpois(50, lambda = 2)),
#       duration=rep(2, 50), custom_dur=abs(rnorm(50)))
#   )),
#   data.frame(event="outcome",
#              rbind(
#     data.frame(
#       run_number=1,
#       trial=1:50,
#       onset=cumsum(rpois(50, lambda = 2)),
#       duration=rep(2, 50)),
#     data.frame(
#       run_number=2,
#       trial=1:50,
#       onset=cumsum(rpois(50, lambda = 2)),
#       duration=rep(2, 50))
#   ))
# )
#
# signals <- list(
#   ev=list(value=rbind(data.frame(run_number=1, trial=1:50, value=rnorm(50)), data.frame(run_number=2, trial=1:50, value=rnorm(50))), event="cue", duration="custom_dur", normalization="evtmax_1"),
#   pe=list(value=rbind(data.frame(run_number=1, trial=1:50, value=rnorm(50)), data.frame(run_number=2, trial=1:50, value=rnorm(50))), event="outcome", duration=1)
# )
#
#
#
# df1 <- data.frame(csf = rnorm(100), wm = rnorm(100))
# df2 <- data.frame(csf = rnorm(121), wm = rnorm(121))
# #For events, BDM expects a single data frame that includes a column for the event type (e.g. outcome vs. cue), run number (1:n), trial number (nested within run, 1:x), onset, duration of event
# #potential custom duration if want to specify specific lengths of events
# # For signals, BDM expects a list of list. The first level of lists reflects each signal (e.g. pe vs value). In the second level, BDM expects value and event (e.g. cue vs outcome).
# #value is a data frame with run number (1:n), trial (1:x), and signal.
# #Additionally will also accept a third element to the list which can specify custom durations
# #Also BDM expects either a character vector of the function nifit for each run OR it expects a numerical vector for the volumes to be analyzed in each run
# #If you don't supply the run_volumes, character string it will infer the number of volumes based off of the last onset within the run and will add 12 ITIs to that by default (very risk; do not recommend!)
#
# #Test Case
#
# d <- build_design_matrix(events = events, signals = signals)
# dnocon <- build_design_matrix(events = events, signals = signals, convolve = FALSE)
# dcon <- d$design_convolved$run1
# dcon$time <- 1:nrow(dcon)
# ggplot(dcon, aes(x = time, y = pe)) + geom_line(size = 0.8)
# dnoconout <- dnocon$design_convolved$run1
# dnoconout$time <- 1:nrow(dcon)
# ggplot(dnoconout, aes(x = time, y = pe)) + geom_line(size = 0.8)
# dnuis <- build_design_matrix(events = events, signals = signals, additional_regressors = list(df1, df2), write_timing_files = c("AFNI", "FSL", "convolved"), baseline_coef_order = 2)
# dnuisrun1 <- dnuis$design_convolved$run1
# dnuisrun1$time <- 1:nrow(dnuisrun1)
# ggplot(dnuisrun1, aes(x = time, y = wm)) + geom_line(size = 0.8)
UNCDEPENdLab/fmri.pipeline documentation built on April 3, 2025, 3:21 p.m.