norm_data: Normalize Sequencing Count Matrix

norm_data    R Documentation

Description

This function normalizes sequencing count data in the form of a SummarizedExperiment or data.frame. In either class, the columns and rows of the count matrix should be samples/conditions and genes respectively.

Usage

norm_data(
  data,
  assay.na = NULL,
  norm.fun = "CNF",
  par.list = NULL,
  log2.trans = TRUE,
  data.trans
)

Arguments

data
Terms

spatial features: cells, tissues, organs, etc; variables: experimental variables such as drug dosage, temperature, time points, etc; biomolecules: genes, proteins, metabolites, etc; spatial heatmap: SHM.

'SummarizedExperiment'

The assays slot stores the data matrix, where rows and columns are biomolecules and spatial features respectively. Typically, the colData slot stores at least two columns, one of spatial features and one of variables. When plotting SHMs, only spatial features that are identical between the data and the aSVG will be colored according to the expression values of the chosen biomolecules. Replicates of the same type in these two columns should be identical, e.g. "tissueA", "tissueA" rather than "tissueA1", "tissueA2". If column names in the assays slot follow the "spatialFeature__variable" scheme, i.e. spatial features and variables are concatenated by a double underscore, then the colData slot is not required at all. If the data do not have experimental variables, neither the variable column in colData nor the double underscore scheme is required.

'data.frame'

Rows and columns are biomolecules and spatial features respectively. If there are experimental variables, the column names should follow the naming scheme "spatialFeature__variable". Otherwise, the column names should only include spatial features. The double underscore is a reserved string for specific purposes in spatialHeatmap, and thus should be avoided when naming spatial features or variables. A column of biomolecule descriptions can be included. This is only applicable in the interactive network graph (see network), where mousing over a node displays the corresponding description.
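
The naming scheme can be illustrated with a minimal base R sketch. The column names below are made up for illustration, and the parsing is not spatialHeatmap's internal code; it only shows how the double underscore separates the spatial feature from the variable.

```r
# Hypothetical column names following "spatialFeature__variable".
cna <- c("brain__day10", "brain__day10", "heart__day10", "heart__day12")

# Split each name on the reserved double underscore.
parts <- strsplit(cna, "__", fixed = TRUE)
feature <- vapply(parts, `[`, character(1), 1)
variable <- vapply(parts, `[`, character(1), 2)

# Replicates share an identical name, so they can be counted per name.
rep.count <- table(cna)
data.frame(column = cna, feature = feature, variable = variable)
```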

vector

In the function shm, the data can be provided as a numeric vector for testing with a single gene. If so, the naming scheme of the vector is the same as that of the data.frame.

Multiple variables

For plotting SHMs, multiple variables contained in the data can be combined into a composite one, and the composite variable will be treated as a regular single variable. See the vignette for more details by running browseVignettes('spatialHeatmap') in R.
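
As a rough sketch of what a composite variable could look like, the base R code below pastes two variables together. The targets table, the variable names, and the "." separator are all assumptions made for illustration; the vignette describes the approach actually supported by the package.

```r
# Hypothetical targets table with one spatial feature and two variables.
targets <- data.frame(
  tissue = c("brain", "brain", "heart", "heart"),
  age = c("day10", "day12", "day10", "day12"),
  sex = c("F", "M", "F", "M")
)

# Combine the two variables into one composite variable.
# The separator avoids the reserved double underscore.
targets$composite <- paste(targets$age, targets$sex, sep = ".")

# Column names in the "spatialFeature__variable" scheme.
cna <- paste(targets$tissue, targets$composite, sep = "__")
cna
```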

assay.na

The name of the target assay to use when data is a SummarizedExperiment.

norm.fun

Normalizing function, one of "CNF", "ESF", "VST", "rlog", "none". Specifically, "CNF" stands for calcNormFactors from edgeR (McCarthy et al. 2012), and "ESF", "VST", and "rlog" are equivalent to estimateSizeFactors, varianceStabilizingTransformation, and rlog from DESeq2 respectively (Love, Huber, and Anders 2014). If "none", no normalization is applied. The default is "CNF", and the output data is processed by cpm (Counts Per Million). The parameters of each normalization function are provided through par.list.
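
As a rough illustration of the counts-per-million step, here is a base R sketch. It rests on stated assumptions: the toy matrix is invented, and the scaling factors that calcNormFactors would estimate are fixed at 1, so the result is plain CPM rather than the exact output of norm_data.

```r
# Toy count matrix: rows are genes, columns are samples (made-up values).
counts <- matrix(c(10, 0, 90, 200, 50, 750), nrow = 3,
  dimnames = list(paste0("gene", 1:3), c("brain__day10", "heart__day10")))

# Assumption: scaling factors of 1. In edgeR these come from calcNormFactors.
norm.factors <- c(1, 1)
lib.size <- colSums(counts) * norm.factors

# CPM: divide each column by its effective library size, then scale by 1e6.
cpm.toy <- sweep(counts, 2, lib.size, "/") * 1e6
cpm.toy
```

After this step every column sums to one million, which makes samples with different sequencing depths comparable.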

par.list

A list of parameters for the normalizing function assigned in norm.fun. The default is NULL, in which case list(method='TMM'), list(type='ratio'), list(fitType='parametric', blind=TRUE), and list(fitType='parametric', blind=TRUE) are set internally for "CNF", "ESF", "VST", and "rlog" respectively. Note that the name of each element in the list is required, e.g. list(method='TMM') rather than list('TMM').
Complete parameters of "CNF": https://www.rdocumentation.org/packages/edgeR/versions/3.14.0/topics/calcNormFactors
Complete parameters of "ESF": https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/estimateSizeFactors
Complete parameters of "VST": https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/varianceStabilizingTransformation
Complete parameters of "rlog": https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/rlog
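
One way to see why element names matter: a common R pattern for forwarding a parameter list is do.call, which matches named elements by argument name and unnamed ones by position. The sketch below uses a hypothetical stand-in function, toy_norm; whether norm_data forwards par.list in exactly this way is an assumption.

```r
# Hypothetical stand-in for a normalizing function (not part of any package).
toy_norm <- function(counts, logratioTrim = 0.3, method = "TMM") {
  paste("method:", method, "| logratioTrim:", logratioTrim)
}
counts <- matrix(1:4, nrow = 2)

# Named element: the value reaches the intended argument.
named <- do.call(toy_norm, c(list(counts), list(method = "RLE")))

# Unnamed element: the value is matched by position and lands in the wrong slot.
unnamed <- do.call(toy_norm, c(list(counts), list("RLE")))

named
unnamed
```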

log2.trans

Logical. If TRUE (default) and the selected normalization method does not use log2 scale by default ("ESF"), the output data is log2-transformed after normalization. If FALSE and the selected normalization method uses log2 scale by default ("VST", "rlog"), the output data is 2-exponent transformed after normalization.
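
The two transformations are inverses of each other, which the following base R sketch shows on made-up values. How norm_data handles zeros or pseudocounts is not stated here, so the sketch uses strictly positive values and adds no pseudocount.

```r
# Made-up normalized values on the linear scale (all positive).
linear <- matrix(c(1, 2, 4, 8, 16, 32), nrow = 3)

# log2 transformation, the direction used for linear-scale output.
log.scale <- log2(linear)

# 2-exponent transformation, the direction used for log2-scale output.
back <- 2^log.scale

# The round trip recovers the original values.
all.equal(back, linear)
```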

data.trans

This argument is deprecated and replaced by log2.trans. One of "log2", "exp2", and "none", corresponding to transforming the count matrix by "log2", "2-based exponent", and "no transformation" respectively. The default is "none".

Value

An object of SummarizedExperiment or data.frame, depending on the input data.

Author(s)

Jianhai Zhang jzhan067@ucr.edu
Dr. Thomas Girke thomas.girke@ucr.edu

References

SummarizedExperiment: SummarizedExperiment container. R package version 1.10.1
R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
McCarthy, Davis J., Yunshun Chen, and Gordon K. Smyth. 2012. "Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation." Nucleic Acids Research 40 (10): 4288–97
Keays, Maria. 2019. ExpressionAtlas: Download Datasets from EMBL-EBI Expression Atlas
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. "Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2." Genome Biology 15 (12): 550. doi:10.1186/s13059-014-0550-8
Cardoso-Moreira, Margarida, Jean Halbert, Delphine Valloton, Britta Velten, Chunyan Chen, Yi Shao, Angélica Liechti, et al. 2019. "Gene Expression Across Mammalian Organ Development." Nature 571 (7766): 505–9

See Also

calcNormFactors in edgeR, and estimateSizeFactors, varianceStabilizingTransformation, rlog in DESeq2.

Examples


## Two example data sets are showcased for the data formats of "data.frame" and 
## "SummarizedExperiment" respectively. Both come from an RNA-seq analysis on 
## For convenience, they are included in this package. The complete raw count data are
## downloaded using the R package ExpressionAtlas (Keays 2019) with the accession 
## number "E-MTAB-6769". 

# Access example data 1.
df.chk <- read.table(system.file('extdata/shinyApp/data/count_chicken_simple.txt', 
package='spatialHeatmap'), header=TRUE, row.names=1, sep='\t', check.names=FALSE)

# Column names follow the naming scheme
# "spatialFeature__variable".  
df.chk[1:3, ]

# A column of gene description can be optionally appended.
ann <- paste0('ann', seq_len(nrow(df.chk))); ann[1:3]
df.chk <- cbind(df.chk, ann=ann)
df.chk[1:3, ]

# Access example data 2. 
count.chk <- read.table(system.file('extdata/shinyApp/data/count_chicken.txt', 
package='spatialHeatmap'), header=TRUE, row.names=1, sep='\t')
count.chk[1:3, 1:5]

# A targets file describing spatial features and variables is required for example  
# data 2, which should be made based on the experiment design. 

# Access the targets file. 
target.chk <- read.table(system.file('extdata/shinyApp/data/target_chicken.txt', 
package='spatialHeatmap'), header=TRUE, row.names=1, sep='\t')
# Every column in example data 2 corresponds with a row in the targets file. 
target.chk[1:5, ]
# Store example data 2 in "SummarizedExperiment".
library(SummarizedExperiment)
se.chk <- SummarizedExperiment(assay=count.chk, colData=target.chk)
# The "rowData" slot can optionally store a data frame of gene annotation.
rowData(se.chk) <- DataFrame(ann=ann)

# Normalize data.
df.chk.nor <- norm_data(data=df.chk, norm.fun='CNF', log2.trans=TRUE)
se.chk.nor <- norm_data(data=se.chk, norm.fun='CNF', log2.trans=TRUE)

# Aggregate replicates of "spatialFeature__variable", where spatial features are organs
# and variables are ages.
df.chk.aggr <- aggr_rep(data=df.chk.nor, aggr='mean')
df.chk.aggr[1:3, ]

se.chk.aggr <- aggr_rep(data=se.chk.nor, sam.factor='organism_part', con.factor='age',
aggr='mean')
assay(se.chk.aggr)[1:3, 1:3]

# Genes with expression values >= 5 in at least 1% of all samples (pOA), and coefficient
# of variation (CV) between 0.2 and 100 are retained.
df.chk.fil <- filter_data(data=df.chk.aggr, pOA=c(0.01, 5), CV=c(0.2, 100))
se.chk.fil <- filter_data(data=se.chk.aggr, sam.factor='organism_part', con.factor='age', 
pOA=c(0.01, 5), CV=c(0.2, 100), file=NULL)


jianhaizhang/spatialHeatmap documentation built on April 21, 2024, 7:43 a.m.