Working with aggregate functions

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
    ## Related to
    ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("knitcitations")

## Load knitcitations with a clean bibliography
cleanbib()
cite_options(hyperlink = "to.doc", citation_format = "text", style = "html")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitcitations = citation("knitcitations")[1],
    knitr = citation("knitr")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    ISAnalytics = citation("ISAnalytics")[1]
)

write.bibtex(bib, file = "aggregate_function_usage.bib")

Introduction

In this vignette we're going to explain in detail how to use functions of the aggregate family, namely:

  1. aggregate_metadata
  2. aggregate_values_by_key

How to install ISAnalytics

To install the package run the following code:

## For release version
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("ISAnalytics")

## For devel version
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
# The following initializes usage of Bioc devel
BiocManager::install(version = "devel")
BiocManager::install("ISAnalytics")

To install from GitHub:

# For release version
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
    ref = "RELEASE_3_12",
    dependencies = TRUE,
    build_vignettes = TRUE
)

## Safer option for vignette building issue
devtools::install_github("calabrialab/ISAnalytics",
    ref = "RELEASE_3_12"
)

# For devel version
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
    ref = "master",
    dependencies = TRUE,
    build_vignettes = TRUE
)

## Safer option for vignette building issue
devtools::install_github("calabrialab/ISAnalytics",
    ref = "master"
)
library(ISAnalytics)

Setting options

ISAnalytics has a verbose option that allows some functions to print additional information to the console while they're executing. To disable or enable this feature:

# DISABLE
options("ISAnalytics.verbose" = FALSE)

# ENABLE
options("ISAnalytics.verbose" = TRUE)

Some functions also produce reports in a user-friendly HTML format. To disable or enable this feature:

# DISABLE HTML REPORTS
options("ISAnalytics.widgets" = FALSE)

# ENABLE HTML REPORTS
options("ISAnalytics.widgets" = TRUE)
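
If you only want to change these options for part of a session, base R lets you save the previous values and restore them afterwards. The following is a minimal sketch that relies only on base R's options() and getOption(); the variable names old_opts, quiet_verbose and quiet_widgets are illustrative and not part of ISAnalytics.

```r
# Switch both options off and keep the previous values.
# options() invisibly returns the old values as a named list.
old_opts <- options(
    ISAnalytics.verbose = FALSE,
    ISAnalytics.widgets = FALSE
)

# Inspect the values currently in effect.
quiet_verbose <- getOption("ISAnalytics.verbose")
quiet_widgets <- getOption("ISAnalytics.widgets")

# ... calls to ISAnalytics functions would go here ...

# Restore whatever was set before (unset options are removed again).
options(old_opts)
```

The code chunks later in this vignette achieve the same effect with withr::with_options, which restores the options automatically when the wrapped expression finishes.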

Aggregating metadata

We refer to the information contained in the association file as "metadata". Sometimes it is useful to obtain collective information for a certain group of variables of interest, and the function aggregate_metadata does just that: it groups the association file by the grouping variables (the names of the columns to use in a group_by operation) and creates a summary for each group.

Typical workflow

Import the association file via import_association_file. If you need more information on import functions, please see the vignette "How to use import functions".

withr::with_options(list(ISAnalytics.widgets = FALSE), {
    path_AF <- system.file("extdata", "ex_association_file.tsv",
        package = "ISAnalytics"
    )

    root_correct <- system.file("extdata", "fs.zip",
        package = "ISAnalytics"
    )
    root_correct <- unzip_file_system(root_correct, "fs")
    association_file <- import_association_file(path_AF, root_correct)
})

Perform aggregation:

aggregated_meta <- aggregate_metadata(association_file,
    grouping_keys = c(
        "SubjectID",
        "CellMarker",
        "Tissue", "TimePoint"
    ),
    import_stats = FALSE
)
knitr::kable(aggregated_meta)

As you can see, there is an additional parameter you can set, import_stats: if set to TRUE, the function automatically looks into the file system you provided as root when you imported the association file and tries to locate first the iss folder for each project and then all Vispa2 "stats" files.
Vispa2 stats contain useful information that is not included in the association file and is linked to the single Vispa2 run. In some cases it is useful to aggregate that information too. If you set the parameter to TRUE, the summary will contain additional columns derived from the stats files, besides those obtained from the association file alone.

withr::with_options(list(ISAnalytics.widgets = FALSE), {
    aggregated_meta <- aggregate_metadata(association_file,
        grouping_keys = c(
            "SubjectID", "CellMarker",
            "Tissue", "TimePoint"
        ),
        import_stats = TRUE
    )
})
knitr::kable(aggregated_meta)

If you have the option ISAnalytics.widgets set to TRUE, this will produce a report in HTML format that tells you which stats files were imported. To avoid this, you can set the option to FALSE.
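
The idea behind metadata aggregation (group the association file by a set of key columns, then summarise each group) can be illustrated on a toy table with base R. This is only a sketch of the concept and not the implementation used by aggregate_metadata: the data frame toy_af and the numeric column dna_ng are made up for the example, and summing is just one possible summary.

```r
# Toy association file: four samples, two of which share the same key.
toy_af <- data.frame(
    SubjectID = c("P1", "P1", "P1", "P2"),
    CellMarker = c("CD34", "CD34", "CD14", "CD34"),
    Tissue = c("BM", "BM", "PB", "BM"),
    TimePoint = c("03", "03", "06", "03"),
    dna_ng = c(100, 50, 80, 120), # hypothetical numeric metadata column
    stringsAsFactors = FALSE
)

# Group by the four key columns and sum the numeric column in each group.
toy_summary <- aggregate(
    dna_ng ~ SubjectID + CellMarker + Tissue + TimePoint,
    data = toy_af,
    FUN = sum
)

toy_summary
```

The two P1/CD34/BM/03 samples collapse into a single row, so the summary has one row per distinct combination of the grouping columns.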

Aggregation of values by key

ISAnalytics contains useful functions to aggregate the values contained in your imported matrices based on a key, that is, a single column or a combination of columns contained in the association file that are related to the samples.

Typical workflow

Import your association file (see previous section) and then import your matrices:

withr::with_options(list(ISAnalytics.widgets = FALSE), {
    matrices <- import_parallel_Vispa2Matrices_auto(
        association_file = association_file, root = NULL,
        quantification_type = c("fragmentEstimate", "seqCount"),
        matrix_type = "annotated", workers = 2, patterns = NULL,
        matching_opt = "ANY", multi_quant_matrix = FALSE
    )
})

The function aggregate_values_by_key can perform the aggregation either on a list of matrices or on a single matrix.

# Takes the whole list and produces a list in output
aggregated_matrices <- aggregate_values_by_key(matrices, association_file)

# Takes a single matrix and produces a single matrix as output
aggregated_matrices_single <- aggregate_values_by_key(
    matrices$seqCount,
    association_file
)
knitr::kable(head(aggregated_matrices_single))
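
Conceptually, aggregating by key means attaching the key columns of the association file to each row of the matrix and then collapsing rows that share the same integration site and key. The base R sketch below illustrates this on toy data. It is not the package's implementation, and it assumes that matrix and association file share a sample identifier column (called CompleteAmplificationID here) and that the quantification is stored in a column called Value; summing is used as an illustrative aggregation.

```r
# Toy integration matrix in long format: one row per site and sample.
toy_matrix <- data.frame(
    chr = c("1", "1", "1", "2"),
    integration_locus = c(1000L, 1000L, 1000L, 5000L),
    strand = c("+", "+", "+", "-"),
    CompleteAmplificationID = c("s1", "s2", "s3", "s1"), # assumed sample id
    Value = c(10, 5, 7, 3),
    stringsAsFactors = FALSE
)

# Toy association file linking each sample to a subject.
toy_af <- data.frame(
    CompleteAmplificationID = c("s1", "s2", "s3"),
    SubjectID = c("P1", "P1", "P2"),
    stringsAsFactors = FALSE
)

# Step 1: attach the key column to every row of the matrix.
joined <- merge(toy_matrix, toy_af, by = "CompleteAmplificationID")

# Step 2: collapse rows sharing the same site and key, summing the values.
toy_agg <- aggregate(
    Value ~ chr + integration_locus + strand + SubjectID,
    data = joined,
    FUN = sum
)

toy_agg
```

Samples s1 and s2 both belong to subject P1, so their values at site 1:1000:+ are summed into one row, while s3 (subject P2) keeps its own row.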

Changing parameters to obtain different results

The function has several parameters whose default values can be changed according to user preference.

  1. Changing the key value
    You can change the value of the parameter key as you see fit. This parameter should contain one or more columns of the association file that you want to include in the grouping when performing the aggregation. The default value is c("SubjectID", "CellMarker", "Tissue", "TimePoint") (the same default key as the aggregate_metadata function).
agg1 <- aggregate_values_by_key(
    x = matrices$seqCount,
    association_file = association_file,
    key = c("SubjectID", "ProjectID")
)
knitr::kable(head(agg1))
  2. Changing the lambda value
    The lambda parameter indicates the function(s) to be applied to the values for aggregation. lambda must be a named list of either functions or purrr-style lambdas; if you would like to pass additional parameters to the function, the second option is recommended. The only requirement is that each function performs some kind of aggregation on numeric values: in practical terms, it must accept a vector of numeric/integer values as input and produce a SINGLE value as output. Valid options include sum, mean, median, min and max.
agg2 <- aggregate_values_by_key(
    x = matrices$seqCount,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(mean = ~ mean(.x, na.rm = TRUE))
)
knitr::kable(head(agg2))

Note that, when specifying purrr-style lambdas (formulas), the first parameter needs to be set to .x; other parameters can be set as usual.
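
To make the formula notation concrete, the sketch below compares a purrr-style lambda with the equivalent ordinary function. It assumes the purrr package is installed and uses purrr::as_mapper only to turn the formula into a callable function for the comparison; the names mean_lambda, mean_fun and values are illustrative.

```r
library(purrr)

# A purrr-style lambda: .x stands for the vector of values being aggregated.
mean_lambda <- as_mapper(~ mean(.x, na.rm = TRUE))

# The same aggregation written as an ordinary function.
mean_fun <- function(x) mean(x, na.rm = TRUE)

values <- c(2, 4, NA, 6)

mean_lambda(values)
mean_fun(values)
```

Both calls return the same single number, because the missing value is dropped before averaging in each case.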

The lambda list can also contain functions that produce data frames or lists. In this case all variables from the produced data frame will be included in the final data frame. For example:

agg3 <- aggregate_values_by_key(
    x = matrices$seqCount,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(describe = psych::describe)
)

agg3
  3. Changing the value_cols value
    The value_cols parameter tells the function which numeric columns of x the functions should be applied to. Note that every function contained in lambda will be applied to every column in value_cols: resulting columns will be named "original name_function applied".
## Obtaining multi-quantification matrix
comp <- comparison_matrix(matrices)

agg4 <- aggregate_values_by_key(
    x = comp,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(sum = sum, mean = mean),
    value_cols = c("seqCount", "fragmentEstimate")
)
knitr::kable(head(agg4))
  4. Changing the group value
    The group parameter should contain all other variables to include in the grouping besides key. By default this contains c("chr", "integration_locus", "strand", "GeneName", "GeneStrand"). You can change this grouping as you see fit; if you don't want to add any other variable to the key, just set it to NULL.
agg5 <- aggregate_values_by_key(
    x = matrices$seqCount,
    association_file = association_file,
    key = "SubjectID",
    lambda = list(sum = sum, mean = mean),
    group = mandatory_IS_vars()
)
knitr::kable(head(agg5))
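
The interaction between lambda and value_cols described above (every function applied to every column, with results named "original name_function applied") can be illustrated with a small base R sketch. This is not the code used inside aggregate_values_by_key; the data frame toy and the helper variables are made up for the example, and only SubjectID is used for grouping.

```r
# Toy multi-quantification table: two value columns per row.
toy <- data.frame(
    SubjectID = c("P1", "P1", "P2"),
    seqCount = c(10, 20, 5),
    fragmentEstimate = c(1.5, 2.5, 4.0)
)

value_cols <- c("seqCount", "fragmentEstimate")
lambda <- list(sum = sum, mean = mean)

# Split the value columns by subject, then apply every function in
# lambda to every column in value_cols, naming results <column>_<function>.
groups <- split(toy[value_cols], toy$SubjectID)
rows <- lapply(groups, function(g) {
    out <- list()
    for (col in value_cols) {
        for (fn in names(lambda)) {
            out[[paste(col, fn, sep = "_")]] <- lambda[[fn]](g[[col]])
        }
    }
    as.data.frame(out)
})

# Bind the per-subject rows back into a single data frame.
toy_result <- cbind(SubjectID = names(rows), do.call(rbind, rows))
rownames(toy_result) <- NULL

names(toy_result)
```

With two value columns and two functions, the result has four aggregated columns in addition to the grouping column.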

Reproducibility

The r Biocpkg("ISAnalytics") package r citep(bib[["ISAnalytics"]]) was made possible thanks to:

This package was developed using r BiocStyle::Githubpkg("lcolladotor/biocthis").

R session information.

## Session info
library("sessioninfo")
options(width = 120)
session_info()

Bibliography

This vignette was generated using r Biocpkg("BiocStyle") r citep(bib[["BiocStyle"]]) with r CRANpkg("knitr") r citep(bib[["knitr"]]) and r CRANpkg("rmarkdown") r citep(bib[["rmarkdown"]]) running behind the scenes.

Citations made with r CRANpkg("knitcitations") r citep(bib[["knitcitations"]]).

## Print bibliography
bibliography()


ISAnalytics documentation built on April 9, 2021, 6:01 p.m.