knitr::opts_chunk$set(echo = TRUE)

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("curatedTBData")

R packages

library(curatedTBData)
library(dplyr)
library(SummarizedExperiment)
library(sva)

Access to MetaData

The DataSummary table summarized the list of available studies and their metadata information contained in the curatedTBData package. \ The table helps users to query avialable datasets quickly.

# Remove GeographicalRegion, Age, DiagnosisMethod, Notes, Tissue, HIVStatus for concise display
data("DataSummary", package = "curatedTBData")
DataSummary |>
  dplyr::select(-c(`Country/Region`, Age, DiagnosisMethod, Notes,
                   Tissue, HIVStatus)) |>
  DT::datatable()

Access curated tuberculosis gene expression profile

Users can use curatedTBData() function to access data. There are three arguments in the function. The first argument study_name represents the names of the data that are used to determine the resources of interests. Users can find all available resource names from DataSummary$Study. The second argument dry.run enables users to determine the resources's availability before actually downloading them. When dry.run is set to TRUE, the output includes names of the resources. The third argument curated.only allows the users to access the curated ready-to-use data. If curated.only is TRUE, the function only download the curated gene expression profile and the clinical annotation of the corresponding data. If curated.only is FALSE, the function downloads all available resources for input studies.

curatedTBData("GSE19439", dry.run = TRUE, curated.only = FALSE)

To download complete data for a Microarry study (e.g. GSE19439), we set dry.run = FALSE and curated.only = FALSE. There are two experiments assay being included in the output Microarray studies. The first experiment is assay_curated, which is a matrix that represents normalized, curated version of the gene expression profile. The second experiment is assay_raw, which is a r Biocpkg("SummarziedExperiment") object that contains the raw gene expression profile and information about probe features.

GSE19439 <- curatedTBData("GSE19439", dry.run = FALSE, curated.only = FALSE)
GSE19439

When accessing data for RNA sequencing studies, another assay called assay_reprocess is included. This matrix represents the reprocessed version of gene expression profile from the raw .fastq files using r Biocpkg("Rsubread").

GSE79362 <- curatedTBData("GSE79362", dry.run = FALSE, curated.only = FALSE)
GSE79362

A list of r Biocpkg("MultiAssayExperiment") objects is returned when dry.run is FALSE. \ To save running time, in the following example, we set the curated.only = TRUE and selected five studies that belong to GSE19491 from the Gene Expression Omnibus.

myGEO <- c("GSE19435", "GSE19439", "GSE19442", "GSE19444", "GSE22098")
object_list <- curatedTBData(myGEO, dry.run = FALSE, curated.only = TRUE)
object_list[1:2]

A full version example for RNA sequencing data: GSE79362.

GSE79362 <- curatedTBData("GSE79362", dry.run = FALSE, curated.only = FALSE)
GSE79362

Subset/Combine MultiAssayExperiment objects

Subset

The major advantage of using r Biocpkg("MultiAssayExperiment") is the coordination of the meta-data and assays when sub-setting. The MultiAssayExperiment object has built-in function for subsetting samples based on column condition. \ The following code shows how to select samples with only active TB.

GSE19439 <- object_list$GSE19439
GSE19439[, GSE19439$TBStatus == "PTB"]["assay_curated"] # 13 samples

The following example shows how to subset patients with active TB and LTBI

GSE19439[, GSE19439$TBStatus %in% c("PTB", "LTBI")]["assay_curated"] # 30 samples

Dataset integeration

Merge Studies with common gene symbols

The combine_objects() function provides an easy implementation for combining different studies based on common gene symbols. The function returns a r Biocpkg("SummarizedExperiment") object that contains the merged assay and associated clinical annotation. Noticed that GSE74092 is usually removed from merging, because it used quantitative PCR, which did not have enough coverage to capture all genes. \ There are two arguments in the combine_objects() function. The first one is object_list, which takes a list of r Biocpkg("MultiAssayExperiment") objects obtained from curatedTBData(). Notice that the names(object_list) should not be NULL and must be unique for each object within the list, so that we can keep track the original study after merging. The second argument is experiment_name, which can be a string or vector of strings representing the name of the assay from the object.

GSE19491 <- combine_objects(object_list, experiment_name = "assay_curated",
                            update_genes = TRUE)
GSE19491

When the objects are merged, the original individual data tag can be found in the Study section from the metadata.

unique(GSE19491$Study)

It is also possible to combine the given list of objects with different experiment names. In this case, the experiment_name is a vector of string that corresponds to each of object from the input list.

exp <- combine_objects(c(GSE79362[1], object_list[1]),
                       experiment_name = c("assay_reprocess_hg19",
                                           "assay_curated"),
                       update_genes = TRUE)
exp

Batch correction

If datasets are merged, it is typically recommended to remove a very likely batch effect. We use the ComBat() function from r Biocpkg("sva") to remove potential batch effect between studies. \ In the following example, each study is viewed as one batch. The batch corrected assay will be stored in a r Biocpkg("SummarizedExperiment") object.

batch1 <- colData(GSE19491)$Study
combat_edata1 <- sva::ComBat(dat = assay(GSE19491), batch = batch1)
assays(GSE19491)[["Batch_corrected_assay"]] <- combat_edata1
GSE19491

Dataset subset

The function subset_curatedTBData() allows the users to subset a list of r Biocpkg("MultiAssayExperiment") with the output contains the exact conditions given by the annotationCondition. \ With subset_curatedTBData(), users can quickly subset desired results from r Biocpkg("curatedTBData") database without checking individual object. \ There are four arguments in this function. The theObject represents a r Biocpkg("MultiAssayExperiment") or r Biocpkg("SummarizedExperiment") object. The annotationColName is a character that indicates the column name in the metadata. The annotationCondition is a character or vector of characters that the users intend to select.

Select patients with Active TB and LTBI

In the following example, we call subset_curatedTBData() function to subset samples with active TB (PTB) and latent TB infection (LTBI) for binary classification.

multi_set_PTB_LTBI <- lapply(object_list, function(x)
  subset_curatedTBData(x, annotationColName = "TBStatus",
                       annotationCondition = c("LTBI", "PTB"), 
                       assayName = "assay_curated"))
# Remove NULL from the list
multi_set_PTB_LTBI <- multi_set_PTB_LTBI[!sapply(multi_set_PTB_LTBI, is.null)]
multi_set_PTB_LTBI[1:3]

Select other outcome of interests

The HIV status (HIVStatus) and diabetes status (DiabetesStatus) for each subject were also recorded for each study in the r Biocpkg("curatedTBData"). \ In the following example, we select subjects with HIV positive from the input. \ Users can also find HIV status information for each study by looking at the column: HIVStatus from DataSummary. \ When the the length of the annotationCondition equals to 1, we can subset using either MultiAssayExperiment built-in procedure or subset_curatedTBData.

multi_set_HIV <- lapply(object_list, function(x)
  subset_curatedTBData(x, annotationColName = "HIVStatus",
                       annotationCondition = "Negative",
                       assayName = "assay_curated"))
# Remove NULL from the list
multi_set_HIV <- multi_set_HIV[!vapply(multi_set_HIV, is.null, TRUE)]
multi_set_HIV[1:3]

Session Info

sessionInfo()


compbiomed/curatedTBData documentation built on March 14, 2024, 2:08 p.m.