Purpose of the w4mclassfilter package

The purpose of the w4mclassfilter R package is to provide the computational back-end of the Galaxy tool W4M Data Subset (https://github.com/HegemanLab/w4mclassfilter_galaxy_wrapper).

This package (and the Galaxy tool) performs several steps, either to reduce the number of samples or features to be analyzed or to address data issues that may impede downstream statistical analysis.

How the w4m_filter_by_sample_class function is used

Ordinarily, a Galaxy tool wrapper invokes w4m_filter_by_sample_class. For exploratory or debugging purposes, the package may be installed and loaded into R, and help may then be obtained with the following command:

?w4mclassfilter::w4m_filter_by_sample_class

W4M uses the XCMS and CAMERA packages to preprocess GC-MS or LC-MS data, producing three files that are documented in detail on the Workflow4Metabolomics (W4M) web site. In summary, these are the dataMatrix, sampleMetadata, and variableMetadata files referred to below.

Input- and Output-Format

Ordinarily, the w4mclassfilter::w4m_filter_by_sample_class function reads from and writes to tab-delimited flat files (TSVs) because Galaxy presents datasets to tools as files. However, because general-purpose R packages usually use in-memory data structures for their input and output, this function can work not only with TSVs but also with data structures (data.frame, matrix, list, env); see 'Flexible Input and Output' below for details.
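
The relationship between the file form and the in-memory form of the same data can be sketched in base R. This is only an illustration: the file contents are made up, and the layout (feature names in the first column, sample names in the header row) is an assumption about the dataMatrix file rather than something stated in this document. No w4mclassfilter function is called.

```r
# Hypothetical example data: two features (rows) by three samples (columns).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(
  c("dataMatrix\ts1\ts2\ts3",
    "f1\t1.5\tNA\t3",
    "f2\t0\t4\t5"),
  tsv_path
)

# Read the TSV; the first column is assumed to hold the feature names.
df <- read.delim(tsv_path, row.names = 1, check.names = FALSE)

# Convert to a numeric matrix, the in-memory form of the same data.
m <- as.matrix(df)

# Write the matrix back out as a TSV with the feature names in the first column.
out_path <- tempfile(fileext = ".tsv")
write.table(
  data.frame(dataMatrix = rownames(m), m, check.names = FALSE),
  file = out_path, sep = "\t", quote = FALSE, row.names = FALSE
)
```

Because the input and output paths come from separate tempfile() calls, they are distinct, in keeping with the requirement that file paths be unique.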

For all inputs and outputs that are file paths, those paths must be unique.

Feature- and Sample-Elimination

When w4m_filter_by_sample_class is invoked:

Note that even when no rows or columns of the input dataMatrix have zero variance, eliminating samples or features may leave some rows or columns with zero variance, adversely impacting downstream statistical analysis. Consequently, w4m_filter_by_sample_class eliminates these rows or columns from dataMatrix, along with the corresponding rows from sampleMetadata and variableMetadata.
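
A minimal sketch of this idea follows. It is not the package's implementation: the helper name drop_zero_variance is hypothetical, and repeating the check until nothing more is removed is an assumption about the desired behavior, made because dropping a row can itself leave a column with zero variance.

```r
# Hypothetical helper (not part of w4mclassfilter): drop zero-variance rows
# and columns from a features-by-samples matrix, repeating until stable.
drop_zero_variance <- function(m) {
  repeat {
    row_var <- apply(m, 1, var, na.rm = TRUE)
    col_var <- apply(m, 2, var, na.rm = TRUE)
    keep_rows <- !is.na(row_var) & row_var > 0
    keep_cols <- !is.na(col_var) & col_var > 0
    if (all(keep_rows) && all(keep_cols)) break
    m <- m[keep_rows, keep_cols, drop = FALSE]
    # Variance is undefined with fewer than two values, so stop here.
    if (nrow(m) < 2 || ncol(m) < 2) break
  }
  m
}

# Made-up data: feature f2 is constant across samples.
m <- matrix(
  c(1, 2, 3,
    4, 4, 4,
    7, 2, 9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3"), c("s1", "s2", "s3"))
)
filtered <- drop_zero_variance(m)

# Keep only the metadata rows that still have a counterpart in the matrix.
sample_metadata <- data.frame(
  sampleMetadata = c("s1", "s2", "s3"),
  class = c("a", "a", "b")
)
sample_metadata <-
  sample_metadata[sample_metadata$sampleMetadata %in% colnames(filtered), ]
```

In this example, removing the constant feature f2 leaves sample s2 with the same value for both remaining features, so s2 is removed on the next pass.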

Support for Imputation of Missing Values

w4m_filter_zero_imputation

The w4mclassfilter::w4m_filter_zero_imputation function is the default imputation method used by w4m_filter_by_sample_class. This function imputes negative and NA intensity values as zero.

w4m_filter_zero_imputation <-
  function(m) {
    # replace NA values with zero
    m[is.na(m)] <- 0
    # replace negative values with zero, if applicable
    m[m<0] <- 0
    # return matrix as the result
    return (m)
  }

w4m_filter_median_imputation

The w4mclassfilter::w4m_filter_median_imputation function imputes negative intensity values as zero and NA intensity values as the median value for the corresponding feature.

w4m_filter_median_imputation <-
  function(m) {
    # Substitute NA with median for the row.
    # For W4M datamatrix:
    #   - each row has intensities for one feature
    #   - each column has intensities for one sample
    interpolate_row_median <- function(m) {
      # ref: https://stats.stackexchange.com/a/28578
      #   - Create a data.frame whose columns are features and rows are samples.
      #   - For each feature, substitute NA with the median value for the feature.
      t_result <- sapply(
          as.data.frame(t(m))
        , function(x) {
            x[is.na(x)] <- median(x, na.rm = TRUE)
            x
          }
        , simplify = TRUE
        )
      #   - Recover the rownames discarded by sapply.
      rownames(t_result) <- colnames(m)
      #   - Transform result so that rows are features and columns are samples.
      m <- t(t_result)
      # eliminate negative values
      m[m < 0] <- 0
      return (m)
    }
    return (interpolate_row_median(m))
  }

w4m_filter_no_imputation

The w4mclassfilter::w4m_filter_no_imputation function imputes negative intensity values as zero and leaves NA intensity values unaffected.

w4m_filter_no_imputation <-
  function(m) {
    # replace negative values with zero, if applicable
    m[m < 0] <- 0
    return (m)
  }
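
To compare the three strategies side by side, the sketch below restates them in condensed form and applies them to one small, made-up matrix. The names zero_impute, median_impute, and no_impute are hypothetical stand-ins so that the example runs without the package installed. The median variant uses apply() over rows rather than the sapply() approach shown above, and it assumes a matrix with at least two sample columns.

```r
# Condensed restatements of the three strategies, for illustration only.
zero_impute <- function(m) {
  m[is.na(m)] <- 0   # NA becomes zero
  m[m < 0] <- 0      # negative values become zero
  m
}

median_impute <- function(m) {
  # apply() over rows returns features as columns, hence the transpose.
  filled <- t(apply(m, 1, function(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
  }))
  dimnames(filled) <- dimnames(m)
  filled[filled < 0] <- 0  # negative values become zero after the median fill
  filled
}

no_impute <- function(m) {
  m[m < 0] <- 0      # negative values become zero; NA is left alone
  m
}

# Made-up intensities: rows are features, columns are samples.
m <- matrix(
  c( 1, NA,  3,
    -2,  5, NA),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("f1", "f2"), c("s1", "s2", "s3"))
)

by_zero   <- zero_impute(m)
by_median <- median_impute(m)
by_none   <- no_impute(m)
```

For feature f2, the median is computed from the observed values -2 and 5 (giving 1.5) before the negative value is set to zero, which mirrors the order of operations in the function shown above.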

Support for Regular Expressions

w4mclassfilter::w4m_filter_by_sample_class supports use of R regular expression patterns to select class-names.

The R base::grepl function (at the core of this functionality) uses POSIX 1003.2 standard regular expressions, which allow precise pattern-matching and are exhaustively defined at:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

However, only a few basic building blocks of regular expressions need to be mastered for most cases:

Within square brackets:

Outside of square brackets:

Caveat: The tool wrapper uses the comma (",") to split a list of sample-class names, so commas may not be used within regular expressions for this tool.

First Example: Consider a field of class-names consisting of

                  marq3,marq6,marq9,marq12,front3,front6,front9,front12

| this regular expression | matches this set of sample-class names |
| :--- | :--- |
| `^front[0-9][0-9]*$` | "front3,front6,front9,front12" |
| `^[a-z][a-z]*3$` | "front3,marq3" |
| `^[a-z][a-z]*12$` | "front12,marq12" |
| `^[a-z][a-z]*[0-9]$` | "front3,front6,front9,marq3,marq6,marq9" |
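
The matching in this first example can be checked directly with base::grepl. The following is a minimal sketch: match_classes is a hypothetical helper, and the comma split only imitates what the tool wrapper is described as doing. Each pattern uses "[a-z][a-z]*" to mean "one or more lower-case letters".

```r
# Class names from the first example, as a single comma-separated field.
class_field <- "marq3,marq6,marq9,marq12,front3,front6,front9,front12"
class_names <- strsplit(class_field, ",", fixed = TRUE)[[1]]

# Hypothetical helper: return the class names matched by one pattern.
match_classes <- function(pattern, classes) {
  classes[grepl(pattern, classes)]
}

front_any  <- match_classes("^front[0-9][0-9]*$", class_names)
ends_in_3  <- match_classes("^[a-z][a-z]*3$", class_names)
ends_in_12 <- match_classes("^[a-z][a-z]*12$", class_names)
one_digit  <- match_classes("^[a-z][a-z]*[0-9]$", class_names)
```

Results come back in the order in which the class names appear in the field, so they may be ordered differently from a hand-written list of the same names.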

Second Example: Consider these regular expression patterns as possible matches to a sample-class name

                  AB0123

| this regular expression | matches this set of sample-class names |
| :--- | :--- |
| `^[A-Z][A-Z][0-9][0-9]*$` | AB0123 |
| `^[A-Z][A-Z]*[0-9][0-9]*$` | AB0123 |
| `^[A-Z][0-9]*` | AB0123, see Note 1. |
| `^[A-Z][A-Z][0-9]` | AB0123, see Note 2. |
| `^[A-Z][A-Z]*[0-9][0-9]$` | NO MATCH, see Note 3. |
| `^[A-Z][0-9]*$` | NO MATCH, see Note 4. |
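
The same six patterns can be tested against the name in one call. This is a small sketch using only base R; the variable names are illustrative.

```r
# The six patterns from the second example, tested against one class name.
patterns <- c(
  "^[A-Z][A-Z][0-9][0-9]*$",
  "^[A-Z][A-Z]*[0-9][0-9]*$",
  "^[A-Z][0-9]*",
  "^[A-Z][A-Z][0-9]",
  "^[A-Z][A-Z]*[0-9][0-9]$",
  "^[A-Z][0-9]*$"
)

# grepl() receives each pattern in turn; the result is named by pattern.
matched <- vapply(patterns, grepl, logical(1), x = "AB0123")
```

The patterns without a trailing "$" match because they only need to match the beginning of the name, whereas the last two fail because the trailing "$" requires the pattern to account for the name all the way to its end.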

Flexible Input and Output

To support XCMS outside the context of Galaxy, w4mclassfilter::w4m_filter_by_sample_class supports input from and output to data structures as follows:

Inputs:

Outputs:

Computing Treatment Centers

w4mclassfilter::w4m_filter_by_sample_class provides, as an advanced option, computation of one of three types of centers for each treatment:



HegemanLab/w4mclassfilter documentation built on March 14, 2021, 1:19 a.m.