metaseqr2: The main metaseqr2 pipeline

View source: R/main.R

metaseqr2R Documentation

The main metaseqr2 pipeline

Description

This function is the main metaseqr2 workhorse and implements the main metaseqr2 workflow which performs data read, filtering, normalization and statistical selection, creates diagnostic plots and exports the results and a report if requested. The metaseqr2 function is responsible for assembling all the steps of the metaseqr2 pipeline which i) reads the input gene or exon read count table ii) performs prelimininary filtering of data by removing chrM and other non-essential information for a typical differential gene expression analysis as well as a preliminary expression filtering based on the exon counts, if an exon read count file is provided. iii) performs data normalization with one of currently widely used algorithms, including EDASeq (Risso et al., 2011), DESeq (Anders and Huber, 2010), edgeR (Robinson et al., 2010), NOISeq (Tarazona et al., 2012) or no normalization iv) performs a second stage of filtering based on the normalized gene expression according to several gene filters v) performs statistical testing with one or more of currently widely used algorithms, including DESeq (Anders and Huber, 2010), edgeR (Robinson et al., 2010), NOISeq (Tarazona et al., 2012), limma (Smyth et al., 2005) for RNA-Seq data vi) in the case of multiple statistical testing algorithms, performs meta-analysis using one of six vailable methods (see the meta.p argument) vii) exports the resulting differentially expressed gene list in text tab-delimited format viii) creates a set of diagnostic plots either available in the aforementioned packages or metaseqr2 specific ones and ix) creates a comprehensive HTML report which summarizes the run information, the results and the diagnostic plots. Certain diagnostic plots (e.g. the volcano plot) can be interactive with the use of the external Highcharts (http://www.highcharts.com) JavaScript library for interactive graphs. Although the inputs to the metaseqr2 workflow are many, in practice, setting only very few of them and accepting the defaults as the rest can result in quite comprehensible results for mainstream organisms like mouse, human, fly and rat.

Usage

    metaseqr2(counts, sampleList, excludeList = NULL,
        fileType = c("auto", "sam", "bam", "bed"),
        path = NULL, contrast = NULL, libsizeList = NULL,
        embedCols = list(idCol = 4, gcCol = NA, nameCol = NA, 
            btCol = NA),
        annotation = NULL, org = c("hg18", "hg19", "hg38", "mm9", 
            "mm10", "rn5", "rn6", "dm3", "dm6", "danrer7", 
            "pantro4", "susscr3", "tair10", "equcab2"),
        refdb = c("ensembl", "ucsc", "refseq"), version = "auto",
        transLevel = c("gene", "transcript", "exon"),
        countType = c("gene", "exon","utr"),
        utrOpts = list(frac = 1, minLength = 300, downstream = 50),
        exonFilters = list(minActiveExons = list(exonsPerGene = 5, 
            minExons = 2, frac = 1/5)),
        geneFilters = list(length = list(length = 500), 
            avgReads = list(averagePerBp = 100, quantile = 0.25), 
            expression = list(median = TRUE, mean = FALSE,  
                quantile = NA, known = NA, custom = NA), 
            biotype = getDefaults("biotypeFilter", org[1]),
            presence = list(frac = 0.25, minCount = 10,
                perCondition = FALSE)),
        whenApplyFilter = c("postnorm", "prenorm"),
        normalization = c("deseq", "deseq2", "edaseq", "edger",
            "noiseq", "nbpseq", "absseq", "dss", "each", "none"),
        normArgs = NULL,
        statistics = c("deseq", "deseq2", "edger", "noiseq", 
            "limma", "nbpseq", "absseq", "dss"),
        statArgs = NULL,
        adjustMethod = sort(c(p.adjust.methods, "qvalue")),
        metaP = if (length(statistics) > 1) c("simes", 
            "bonferroni", "fisher", "dperm_min", "dperm_max",
            "dperm_weight", "fperm", "whitlock", "minp", "maxp",
            "weight", "pandora", "none") else "none",
        weight = rep(1/length(statistics), length(statistics)),
        nperm = 10000, pcut = NA, logOffset = 1, pOffset = NULL, 
        preset = NULL, qcPlots = c("mds", "biodetection", 
            "countsbio", "saturation", "readnoise","filtered", 
            "correl", "pairwise", "boxplot", "gcbias", 
            "lengthbias", "meandiff", "meanvar", "rnacomp", 
            "deheatmap", "volcano", "biodist", "mastat", 
            "statvenn", "foldvenn", "deregulogram"),
        figFormat = c("png", "jpg", "tiff", "bmp", "pdf", "ps"),
        outList = FALSE, exportWhere = NA,
        exportWhat = c("annotation", "p_value", "adj_p_value", 
            "meta_p_value", "adj_meta_p_value", "fold_change", 
            "stats", "counts","flags"),
        exportScale = c("natural", "log2", "log10", "vst", 
            "rpgm"),
        exportValues = c("raw", "normalized"),
        exportStats = c("mean", "median", "sd", "mad", "cv", 
            "rcv"),
        exportCountsTable = FALSE,
        restrictCores = 0.6, report = TRUE, reportTop = 0.1,
        reportTemplate = "default", saveGeneModel = TRUE,
        verbose = TRUE, runLog = TRUE, 
        reportDb = c("dexie", "sqlite"), 
        localDb = file.path(system.file(package = "metaseqR2"),
            "annotation.sqlite"),
        offlineReport = TRUE, 
        createTracks = FALSE, overwriteTracks = FALSE, 
        trackExportPath = file.path(exportWhere, "tracks"),
        trackInfo = list(stranded = FALSE, normTo = 1e+9, 
            urlBase = "http://www.trackserver.me",
            fasta = NULL, gtf = NULL,
            hubInfo = list(name = "MyHub", shortLabel = "My hub",
            longLabel = "My hub long", 
            email = "someone@example.com")), .progressFun = NULL,
            .exportR2C = FALSE,...)

Arguments

counts

a text tab-delimited file containing gene, exon or 3'UTR counts in one of the following formats: i) the first column contains unique gene or exon identifiers and the rest of the columns contain the read counts for each sample. ii) The first n columns should contain only **gene** annotation elements like chromosomal locations, gene accessions, exon accessions, GC content etc. and the rest columns should contain gene read counts, iii) counts can also be an .RData file with previous analysis elements (see Details) and iv) counts can be a list representing the gene model (see Details). Several restrictions apply im each of the four cases. See Details for analytical descriptions.

sampleList

a list containing condition names and the samples under each condition or a small tab-delimited file with the experiment description. Not needed when restoring a previous analysis. See Details for analytical description.

excludeList

a list of samples to exclude, in the same (list) format as sampleList above.

path

an optional path where all the BED/BAM files are placed, to be prepended to the BAM/BED file names in the targets file. See Details for further information.

fileType

the type of raw input files. It can be "auto" for auto-guessing, "bed" for BED files, "sam" for SAM files or "bam" for BAM files.

contrast

a character vector of contrasts to be tested in the statistical testing step(s) of the pipeline. Each element of contrast should STRICTLY have the format "ConditionA_vs_ConditionB_vs_...". Special attention is needed as fold change calculations are based on this argument. If it is NULL, no statistical testing of fold change calculations are performed. See Details for further information.

libsizeList

an optional named list where names represent samples (MUST be the same as the samples in sample.list) and members are the library sizes (the sequencing depth) for each sample. For example libsize.list <- list(Sample_A1=32456913, Sample_A2=4346818).

embedCols

a named list with column numbers to guide the case of embedded annotation. See Details for further information.

annotation

It can be one of i) NULL (default) to use the existing annotation database or fetch on the fly, ii) "embedded" if the annotation elements are embedded in the read counts file (restrictions apply) or iii) a list with a path to a GTF file and certain required metadata. See Details for a full description.

org

the supported organisms by metaseqr2 or a user-named organism which has been imported to the database. See Details for more information.

refdb

the reference annotation repository from which to retrieve annotation elements to use with metaseqr2. It can be one of "ensembl" (default), "ucsc" or "refseq" or a user based one (similar to the org argument).

version

the version of the annotation to use. See Details.

transLevel

perform differential expression analysis at which transcriptional unit, can be one of "gene" (default), "transcript" for reporting differential expression at the transcript level or "exon" for exon level.

countType

the type of reads inside the counts file. It can be one of "gene", "exon" or "utr" for Quant-Seq (Lexogen) protocol. This is a very important and mandatory parameter as it defines the course of the workflow.

utrOpts

a named list with members frac which is the fraction (0-1) of the 3' UTR region to count reads in, minLength the minimum acceptable 3'UTR length irrespective of frac and downstream the number of base pairs to flank the end of the 3' UTR of transcripts when analyzing Quant-Seq data.

exonFilters

a named list whose names are the names of the supported exon filters and its members the filter parameters. See section "Exon filters" below for details.

geneFilters

a named list whose names are the names of the supported gene filters and its members the filter parameters. See section "Gene filters" below for details.

whenApplyFilter

a character string determining when to apply the exon and/or gene filters, relative to normalization. It can be "prenorm" to apply apply the filters and exclude genes from further processing before normalization, or "postnorm" to apply the filters after normalization (default). See also Details.

normalization

the normalization algorithm to be applied on the count data. It can be one of "edaseq" for EDASeq normalization, "deseq" for the normalization algorithm in the DESq package (default), "edger" for the normalization algorithms present in the edgeR package "noiseq" for the normalization algorithms present in the NOISeq package "nbpseq" for the normalization algorithms present in the NBPSeq package or "none" to not normalize the data (highly unrecommended). Algorithm specific arguments can be passed through the normArgs argument).

normArgs

a named list whose names are the names of the normalization algorithm parameters and its members parameter values. See section "Normalization parameters" below for details. Leave NULL for the defaults of normalization.

statistics

one or more statistical analyses to be performed by the metaseqr2 pipeline. It can be one or more of "deseq" (default) to conduct statistical test(s) implemented in the DESeq package, "edger" to conduct statistical test(s) implemented in the edgeR package, "limma" to conduct the RNA-Seq version of statistical test(s) implemented in the limma package, "noiseq" to conduct statistical test(s) implemented in the NOISeq package, "nbpseq" to conduct statistical test(s) implemented in the NBPSeq package, "deseq2" to conduct statistical test(s) implemented in the DESeq2 package, "dss" to conduct statistical test(s) implemented in the DSS package and "absseq" to conduct statistical test(s) implemented in the ABSSeq package. In any case individual algorithm parameters are controlled by the contents of the statArgs list. Finally, it can be NA. In this case no testing is performed and only fold changes are provided if contrast is not NULL.

statArgs

a named list whose names are the names of the statistical algorithms used in the pipeline. Each member is another named list whose names are the algorithm parameters and its members are the parameter values. See section "Statistics parameters" below for details. Leave NULL for the defaults of statistics.

adjustMethod

the multiple testing p-value adjustment method. It can be one of p.adjust.methods or "qvalue" from the qvalue Bioconductor package. Defaults to "BH" for Benjamini-Hochberg correction.

metaP

the meta-analysis method to combine p-values from multiple statistical tests . It can be one of "simes" (default), "bonferroni", "minp", "maxp", "weight", "pandora", "dperm_min", "dperm_max", "dperm_weight", "fisher", "fperm", "whitlock" or "none". See Details for a full description.

weight

a vector of weights with the same length as the statistics vector containing a weight for each statistical test. It should sum to 1. Use with caution with the dperm_weight parameter! Theoretical background is not yet solid and only experience shows improved results!

nperm

the number of permutations performed to derive the meta p-value when metaP="fperm" or metaP="dperm". It defaults to 10000.

pcut

a p-value cutoff for exporting differentially genes, default is to export all the non-filtered genes.

logOffset

an offset to be added to values during logarithmic transformations in order to avoid Infinity (default is 1).

pOffset

a value between 0 and 1 to multiply potential zero p-values with for the combination methods including weighting or NULL (default). See also Details.

preset

an analysis strictness preset. preset can be one of "all_basic", "all_normal", "all_full", "medium_basic", "medium_normal", "medium_full", "strict_basic", "strict_normal" or "strict_full", each of which control the strictness of the analysis and the amount of data to be exported. For an explanation of the presets, see the section "Presets" below.

qcPlots

a set of diagnostic plots to show/create. It can be one or more of "mds", "biodetection", "rnacomp", "countsbio", "saturation", "readnoise", "filtered", "boxplot", "gcbias", "lengthbias", "meandiff", "meanvar", "deheatmap", "volcano", "mastat", "biodist", "statvenn", "foldvenn". See also Details.

figFormat

the format of the output diagnostic plots. It can be one or more of "png", "jpg", "tiff", "bmp", "pdf", "ps". The native format "x11" (for direct display) is not provided as an option as it may not render the proper display of some diagnostic plots in some devices.

outList

a logical controlling whether to export a list with the results in the running environment.

exportWhere

an output directory for the project results (report, lists, diagnostic plots etc.)

exportWhat

the content of the final lists. It can be one or more of "annotation", to bind the annoation elements for each gene, "p_value", to bind the p-values of each method, "adj_p_value", to bind the multiple testing adjusted p-values, "meta_p_value", to bind the combined p-value from the meta-analysis, "adj_meta_p_value", to bind the corrected combined p-value from the meta-analysis, "fold_change", to bind the fold changes of each requested contrast, "stats", to bind several statistics calclulated on raw and normalized counts (see the exportStats argument), "counts", to bind the raw and normalized counts for each sample.

exportScale

export values from one or more transformations applied to the data. It can be one or more of "natural", "log2", "log10", "vst" (Variance Stabilizing Transormation, see the documentation of DESeq package) and "rpgm" which is ratio of mapped reads per gene model (either the gene length or the sum of exon lengths, depending on countType argument). Note that this is not RPKM as reads are already normalized for library size using one of the supported normalization methods. Also, "rpgm" might be misleading when normalization is other than "deseq".

exportValues

It can be one or more of "raw" to export raw values (counts etc.) and "normalized" to export normalized counts.

exportStats

calculate and export several statistics on raw and normalized counts, condition-wise. It can be one or more of "mean", "median", "sd", "mad", "cv" for the Coefficient of Variation, "rcv" for a robust version of CV where the median and the MAD are used instead of the mean and the standard deviation.

exportCountsTable

exports also the calculated read counts table when input is read from bam files and exports also the normalized count table in all cases. Defaults to FALSE.

restrictCores

in case of parallel execution of several subfunctions, the fraction of the available cores to use. In some cases if all available cores are used (restrictCores=1 and the system does not have sufficient RAM, the pipeline running machine might significantly slow down.

report

a logical value controlling whether to produce a summary report or not. Defaults to TRUE.

reportTop

a fraction of top statistically significant genes to append to the HTML report. This helps in keeping the size of the report as small as possible, as appending the total gene list might create a huge HTML file. Users can always retrieve the whole gene lists from the report links. Defaults to 0.1 (top 10 genes). Set to NA or NULL to append all the statistically significant genes to the HTML report.

reportTemplate

an HTML template to use for the report. Do not change this unless you know what you are doing.

saveGeneModel

in case of exon analysis, a list with exon counts for each gene will be saved to the file exportWhere/data/gene_model.RData. This file can be used as input to metaseqR for exon count based analysis, in order to avoid the time consuming step of assembling the counts for each gene from its exons

verbose

print informative messages during execution? Defaults to TRUE.

runLog

write a log file of the metaseqr2 run using package log4r. Defaults to TRUE. The filename will be auto-generated.

reportDb

database system to use for storing the report intereactive graphs. Can be "sqlite" (default) or "dexie". See Details for further explanation on what should be used.

localDb

the metaseqR2 annotaation database location. See also link{buildAnnotationDatabase}.

offlineReport

TRUE (default) to download and include the required JavaScript libraries to properly view the report offline. Ignored if report=FALSE

.

createTracks

option to create normalized bigWig files to display in a genome browser (e.g. UCSC). Defaults to FALSE.

overwriteTracks

overwrite tracks if they already exist? Defaults to FALSE.

trackExportPath

where to export the bigWig files, defaults to file.path(exportWhere,"tracks").

trackInfo

if createTracks=TRUE, a list with additional required information to create the tracks. See Details for further explanation.

.progressFun

a function which updates a Progress object from shiny. This function must accept a detail argument. See http://shiny.rstudio.com/articles/progress.html

.exportR2C

export additional RData along with saveGeneModel.

...

further arguments that may be passed to plotting functions, related to par.

Details

When counts is a tab-delimited file, the following restrictions apply:

  • In the case (i) the first cell of each row is a gene or exon accession and the rest are integers representing the counts for that accession. In that case, the annotation parameter should strictly be NULL or an external file in GTF format.

  • In the case (ii) the annotation parameter can also be "embedded". The ideal embedded annotation contains 8 columns, chromosome, gene or exon start, gene or exon end, gene or exon accession, GC-content (fraction or percentage), strand, HUGO gene symbol and gene biotype (e.g. "protein_coding" or "ncRNA"). When the annotation parameter is "embedded", certain of these features are mandatory (co-ordinates and accessions). If they are not present, the pipeline will not run. If additional elements are not present (e.g. GC content or biotypes), certain features of metaseqr2 will not be available. For example, EDASeq normalization will not be performed based on a GC content covariate but based on gene length which is not what the authors of EDASeq suggest. If biotypes are not present, a lot of diagnostic plots will not be available. If the HUGO gene symbols are missing, the final annotation will contain only gene accessions and thus be less comprehensible. Counts can be a data frame satisfying the above conditions. It is a data frame by default when read2count is used.

  • In the case (iii) the .RData file (output of save function contains static input elements (list containing the gene model (exon counts for each gene), gene and exon annotation to avoid re-(down)loading and/or gene counts depending on countType). This kind of input facilitates the re-analysis of the same experiment, using different filtering, normalization and statistical algorithms. This .RData file is produced when saveGeneModel=TRUE.

  • In the case (iv) counts can be a list representing the gene model (exon/UTR counts for each gene). This .RData file can be generated by setting saveGeneModel=TRUE when performing data analysis for the first time.

Regarding sampleList it should have the format sampleList <- list(ConditionA=c("Sample_A1", "Sample_A2","Sample_A3"), ConditionB=c("Sample_B1","Sample_B2"), ConditionC=c("Sample_C1","Sample_C2")). The names of the samples in list members MUST match the column names containing the read counts in the counts file. If they do not match, the pipeline will either crash or at best, ignore several of your samples. Alternative, sampleList can be a small tab-delimited file structured as follows: the first line of the external tab delimited file should contain column names (names are not important). The first column MUST contain UNIQUE sample names and the second column MUST contain the biological condition where each of the samples in the first column should belong to. If the counts argument is missing, the sampleList argument MUST be a targets text tab-delimited file which contains the sample names, the BAM/BED file names and the biological conditions/groups for each sample/file. The file should be text tab-delimited and structured as follows: the first line of the external tab delimited file should contain column names (names are not important). The first column MUST contain UNIQUE sample names. The second column MUST contain the raw BAM/BED files WITH their full path. Alternatively, the path argument should be provided. If path is not provided and if the files in the second column of the targets file do not contain a path to a directory, the current directory is assumed to be the BAM/BED file container. The third column MUST contain the biological condition where each of the samples in the first column should belong to.

Regarding contrast, a valid example based on the sampleList above is contrast <- c("ConditionA_vs_ConditionB", "ConditionA_vs_ConditionC", "ConditionA_vs_ConditionB_vs_ConditionC"). The first element of pairwise contrasts (e.g. "ConditionA" above) MUST be the control condition or any reference that ConditionB is checked against. metaseqr2 uses this convention to properly calculate fold changes.

Regarding embedCols, this list must contain four members, named idCol, gcCol, nameCol and btCol, which hold the position in the delimited file with embedded annotation, where unique gene ids, GC content, gene names and gene biotypes respectively are located. More specifically:

  • idCol is an integer denoting the column number in the file (or data frame) provided with the counts argument, where the unique gene accessions are. Default to 4 which is the standard feature name column in a BED file.

  • gcCol is an integer denoting the column number in the file (or data frame) provided with the counts argument, where each gene's GC content is given. If not provided, GC content normalization provided by EDASeq will not be available.

  • nameCol is an integer denoting the column number in the file (or data frame) provided with the counts argument, where the HUGO gene symbols are given. If not provided, it will not be available when reporting results. In addition, the "known" gene filter will not be available for application.

  • btCol is an integer denoting the column number in the file (or data frame) provided with the counts argument, where the gene biotypes are given. If not provided, the "biodetection", "countsbio", "saturation", "filtered" and "biodist" plots will not be available.

Regarding annotation instructs metaseqr2 where to find the annotation for the given counts file. It can be one of i) "download" (default) for automatic downloading of the annotation for the organism specified by the org parameter (using biomaRt), ii) "embedded" if the annotation elements are embedded in the read counts file or iv) a file specified by the user which should be as similar as possible to the "download" case, in terms of column structure.

Regarding org, it can be, for human genomes "hg18", "hg19" or "hg38", for mouse genomes "mm9", "mm10", for rat genomes "rn5" or "rn6", for drosophila genome "dm3" or "dm6", for zebrafish genome "danrer7", "danrer10" or "danrer11", for chimpanzee genome "pantro4", "pantro5", for pig genome "susscr3", "susscr11", for Arabidopsis thaliana genome "tair10" and for Equus caballus genome "equcab2". Finally, it can be "USER_NAMED_ORG" with a custom organism which has been imported to the annotation database by the user using a GTF file. For example org="mm10_p1".

Regarding version, this is an integer denoting the version of the annotation to use from the local annotation database or fetch on the fly. For Ensembl, it corresponds to Ensembl releases, while for UCSC/RefSeq, it is the date of creation (locally).

Regarding whenApplyFilter, in the case of whenApplyFilter="prenorm", a first normalization round is applied to a copy of the gene counts matrix in order to derive the proper normalized values that will constitute the several expression-based filtering cutoffs.

Regarding metaP, for the "fisher" and "fperm" methods, see the documentation of the R package MADAM. For the "whitlock" method, see the documentation of the survcomp Bioconductor package. With the "maxp" option, the final p-value is the maximum p-value out of those returned by each statistical test. This is equivalent to an "intersection" of the results derived from each algorithm so as to have a final list with the common genes returned by all statistical tests. Similarly, when meta.p="minp", is equivalent to a "union" of the results derived from each algorithm so as to have a final list with all the genes returned by all statistical tests. The latter can be used as a very lose statistical threshold to aggregate results from all methods regardless of their False Positive Rate. With the "simes" option, the method proposed by Simes (Simes, R. J., 1986) is used. With the "dperm_min", "dperm.max", "dperm.weight" options, a permutation procedure is initialed, where nperm permutations are performed across the samples of the normalized counts matrix, producing nperm permuted instances of the initital dataset. Then, all the chosen statistical tests are re-executed for each permutation. The final p-value is the number of times that the p-value of the permuted datasets is smaller than the original dataset. The p-value of the original dataset is created based on the choice of one of dperm.min, dperm.max or dperm.weight options. In case of dperm.min, the intial p-value vector is consists of the minimum p-value resulted from the applied statistical tests for each gene. The maximum p-value is used with the dperm.max option. With the dperm.weight option, the weight weighting vector for each statistical test is used to weight each p-value according to the power of statistical tests (some might work better for a specific dataset). Be careful as the permutation procedure usually requires a lot of time. However, it should be the most accurate. This method will NOT work when there are no replicated samples across biological conditions. In that case, use meta.p="simes" instead. Finally, there are the "minp", "maxp" and "weight" options which correspond to the latter three methods but without permutations. Generally, permutations would be accurate to use when the experiment includes >5 samples per condition (or even better 7-10) which is rather rare in RNA-Seq experiments. Finally, "pandora" is the same as "weight" and is added to be in accordance with the main algorithm name.

Regarding pOffset, it is used to correct for the case of a p-value which is equal to 0 as a result of internal numerical and approximation procedures. When NULL, random numbers greater than 0 and less than or equal to 0.5 are used to multiply the offending p-values with the lowest provided non-zero p-value, maintaining thus a virtual order of significance, avoiding having the same p-values for two tests and assuming that all zero p-values represent extreme statistical significance. When a numeric between 0 and 1, this number is used for the above multiplication instead.

Regarding qcPlots The "mds" stands for Mutlti-Dimensional Scaling and it creates a PCA-like plot but using the MDS dimensionality reduction instead. It has been succesfully used for NGS data (e.g. see the package htSeqTools) and it shows how well samples from the same condition cluster together. For "biodetection", "countsbio", "saturation", "rnacomp", "readnoise", "biodist" see the vignette of NOISeq package. The "saturation" case has been rewritten in order to display more samples in a more simple way. In addition, the "readnoise" plots represent an older version or the RNA composition plot included in older versions of NOISeq. For "gcbias", "lengthbias", "meandiff", "meanvar" see the vignette of EDASeq package. "lenghtbias" is similar to "gcbias" but using the gene length instead of the GC content as covariate. The "boxplot" option draws boxplots of log2 transformed gene counts. The "filtered" option draws a 4-panel figure with the filtered genes per chromosome and per biotype, as absolute numbers and as fractions of the genome. See also the help page of diagplotFiltered. The "deheatmap" option performs hierarchical clustering and draws a heatmap of differentially expressed genes. In the context of diagnostic plots, it's useful to see if samples from the same groups cluster together after statistical testing. The "volcano" option draws a volcano plot for each contrast and if a report is requested, an interactive volcano plot is presented in the HTML report. The "venn" option will draw an up to 5-way Venn diagram depicting the common and specific to each statistical algorithm genes and for each contrast, when meta-analysis is performed. The "correl" option creates two correlation graphs: the first one is a correlation heatmap (a correlation matrix which depicts all the pairwise correlations between each pair of samples in the counts matrix is drawn as a clustered heatmap) and the second one is a correlogram plot, which summarizes the correlation matrix in the form of ellipses (for an explanation please see the vignette/documentation of the R package corrplot. Set qcPlots=NULL if you don't want any diagnostic plots created.

Regarding reportDb, contrary with the first version of metaseqR, all graphs in the metaseqR2 report are interactive with the usage of the JavaScript libraries Highcharts, plotly (heatmaply) and jvenn.js. However, this adds a great burden regarding rendering the final HTML file and its content, a burden which becomes heavier by the fact the metaseqR2 report is rendered using knitr and rmarkdown instead of raw HTML (previously, brew). Therefore, the pre-calculated JSON objects representing the graphs are stored either in a report-specific IndexedDB (https://javascript.info/indexeddb) flavor called Dexie (https://dexie.org/) (default) or in an SQLite database and then queried using sql.js (https://github.com/kripken/sql.js/). Dexie is prefered because it is very efficient and can produce an independent report that does not need to be served through a web-server and can be viewed locally. Although Dexie is very efficient, some caution is required as knitr and render from rmarkdown are not very memory efficient when rendering larger HTML files. A large HTML file may be produced when analyzing a large dataset with a lot of contrasts that may result in a lot of tables. In such cases, if the report generation crashes with errors related to memory, try lowering the reportTop argument. reportTop does not affect the final lists of differentially expressed genes, only the report tables. The same must be applied also if the report takes too much time to load. If the report is to be served through a web server like Apache (e.g. when the report is provided by a facility to end users), reportDb="sqlite" may be preferred as the total report size will be smaller because of an SQLite database hosting all plots which are queried when required but from the SQLite database and not from the in-browser database (Dexie). **Note** that when using an SQLite database, you will **NOT** be able to view the report in any browser other than Microsoft Edge because of security policies regarding local file access. **Note** also that sql.js is a rather large JavaScript library (around 2.5MB).

Regarding trackInfo, it is a helper list to guide the bigWig track creation and has the following members:

  • stranded, which can be TRUE or FALSE depending on whether you wish to create stranded tracks by separating + and - strand reads. In the case of stranded tracks, a UCSC Genome Brower trackhub is created. Individual tracks can be retrieved from the trackhub.

  • normTo, which is a large integer, denoting the total sum of signal to be used as the normalization target. It defaults to 1e+9. This means that if for a particular sample the sum of signal is 1.5e+9 (sum(sapply(coverage(x),sum)) == 1.5e+9) then this is linearly scaled to 1e+9.

  • urlBase, which is a base url appended to the bigWig files produced (the base path of the bigDataUrl in UCSC Genome Browser track lines).

  • hubInfo, a list with the track hub description in case of stranded tracks. Please see the track hub specifications at the UCSC Genome Browser site.

  • fasta, reference genome in FASTA format for the case of analyzing a custom, non-directly supported organism. It will be converted to the .2bit format and written along with a track hub.

  • gtf, a GTF file describing gene models in the case of analyzing a custom, non-directly supported organism. It will be converted to the .bigBed format and written along with a track hub. Essentially the same as annotation$gtf.

All files (bigWig files, track/trackhub info) are written in the tracks subdirectory of the main path where the report and the outputs are written.

Value

If outList is TRUE, a named list whose length is the same as the number of requested contrasts. Each list member is named according to the corresponding contrast and contains a data frame of differentially expressed genes for that contrast. The contents of the data frame are defined by the exportWhat, exportScale, exportStats, exportValues parameters. If report is TRUE, the output list contains two main elements. The first is described above (the analysis results) and the second contains the same results but in HTML formatted tables.

Exon filters

The exon filters are a set of filters which are applied after the gene models are assembled from the read counts of individual exons and before the gene expression is summarized from the exons belonging to each gene. These filters can be applied when the input read counts file contains exon reads. It is not applicable when the input file already contains gene counts. Such filters can be for example "accept genes where all the exons contain more than x reads" or "accept genes where there is read presence in at least m/n exons, n being the total exons of the gene". Such filters are NOT meant for detecting differential splicing as also the whole metaseqr2 pipeline, thus they should not be used in that context. The exonFilters argument is a named list of filters, where the names are the filter names and the members are the filter parameters (named lists with parameter name, parameter value). See the usage of the metaseqr2 function for an example of how these lists are structured. The supported exon filters in the current version are: i) minActiveExons which implements a filter for demanding m out of n exons of a gene to have a certain read presence with parameters exonsPerGene, minExons and frac. The filter is described as follows: if a gene has up to exonsPerGene exons, then read presence is required in at least minExons of them, else read presence is required in a frac fraction of the total exons. With the default values, the filter instructs that if a gene has up to 5 exons, read presence is required in at least 2, else in at least 20 exons, in order to be accepted. More filters will be implemented in future versions and users are encouraged to propose exon filter ideas to the author by mail. See metaseqr2 usage for the defaults. Set exonFilters=NULL to not apply any exon filtering.

Gene filters

The gene filters are a set of filters applied to gene expression as this is manifested through the read presence on each gene and are preferably applied after normalization. These filters can be applied both when the input file or data frame contains exon read counts and gene read counts. Such filter can be for example "accept all genes above a certain count threshold" or "accept all genes with expression above the median of the normalized counts distribution" or "accept all with length above a certain threshold in kb" or "exclude the 'pseudogene' biotype from further analysis". The supported gene filters in the current version, which have the same structure as the exon filters (named list of lists with filter names, parameter names and parameter arguments) are: i) length which implements a length filter where genes are accepted for further analysis if they are above length (its parameter) kb. ii) avg.reads which implements a filter where a gene is accepted for further analysis if it has more average reads than the quantile of the average count distribution per averagePerBp base pairs. In summary, the reads of each gene are averaged per averagePerBp based on each gene's length (in case of exons, input the "gene's length" is the sum of the lengths of exons) and the quantile quantile of the average counts distribution is calculated for each sample. Genes passing the filter should have an average read count larger than the maximum of the vector of the quantiles calculated above. iii) expression which implements a filter based on the overall expression of a gene. The parameters of this filter are: median, where genes below the median of the overall count distribution are not accepted for further analysis (this filter has been used to distinguish between "expressed" and "not expressed" genes in several cases, e.g. (Mokry et al., NAR, 2011) with a logical as value, mean which is the same as median but using the mean, quantile which is the same as the previous two but using a specific quantile of the total counts distribution, known, where in this case, a set of known not-expressed genes in the system under investigation are used to estimate an expression cutoff. This can be quite useful, as the genes are filtered based on a "true biological" cutoff instead of a statistical cutoff. The value of this filter is a character vector of HUGO gene symbols (MUST be contained in the annotation, thus it's better to use annotation="download") whose counts are used to build a "null" expression distribution. The 90th quantile of this distribution is then the expression cutoff. This filter can be combined with any other filter. Be careful with gene names as they are case sensitive and must match exactly ("Pten" is different from "PTEN"!). iv) biotype where in this case, genes with a certain biotype (MUST be contained in the annotation, thus it's better to use annotation="download") are excluded from the analysis. This filter is a named list of logical, where names are the biotypes in each genome and values are TRUE or FALSE. If the biotype should be excluded, the value should be TRUE else FALSE. See the result of get.defaults("biotype.filter","hg19") for an example. Finally, in future versions there will be support for user-defined filters in the form of a function. v) presence where in this case, a gene is further considered for statistical testing if frac (x100 for a percentage value) have more than minCount reads across all samples (perCondition=FALSE) or across the samples of each condition (perCondition=TRUE).

Normalization parameters

The normalization parameters are passed again as a named list where the names of the members are the normalization parameter names and the values are the normalization parameter values. You should check the documentation of the packages EDASeq, DESeq, edgeR, NOISeq and NBPSeq for the parameter names and parameter values. There are a few exceptions in parameter names: in case of normalization="edaseq" the only parameter names are within.which and between.which, controlling the withing lane/sample and between lanes/samples normalization algorithm. In the case of normalization="nbpseq", there is one additional parameter called main.method which can take the calues "nbpseq" or "nbsmyth". These values correspond to the two different workflows available in the NBPSeq package. Please, consult the NBPSeq package documentation for further details. For the rest of the algorithms, the parameter names are the same as the names used in the respective packages. For examples, please use the getDefaults function.

Statistics parameters

The statistics parameters as passed to statistical algorithms in metaseqr2, exactly with the same way as the normalization parametes above. In this case, there is one more layer in list nesting. Thus, statArgs is a named list whose names are the names the algorithms used (see the statistics parameter). Each member is another named list,with parameters to be used for each statistical algorithm. Again, the names of the member lists are parameter names and the values of the member lists are parameter values. You should check the documentations of DESeq, edgeR, NOISeq, limma and NBPSeq for these parameters. There are a few exceptions in parameter names: In case of statistics="edger", apart from the rest of the edgeR statistical testing arguments, there is the argument mainMethod which can be either "classic" or "glm", again defining whether the binomial test or GLMs will be used for statistical testing. For examples, please use the getDefaults function. When statistics="nbpseq", apart from the rest arguments of the NBPSeq functions estimate.disp and estimate.dispersion, there is the argument mainMethod which can be "nbpseq" or "nbsmyth". This argument determines the parameters to be used by the estimate.dispersion function or by the estimate.disp function to estimate RNA-Seq count dispersions. The difference between the two is that they constitute different starting points for the two workflows in the package NBPSeq. The first worklfow (with mainMethod="nbpseq" and the estimate.dispersion function is NBPSeq package specific, while the second (with mainMethod="nbsmyth" and the estimate.disp function is similar to the workflow of the edgeR package. For additional information regarding the statistical testing in NBPSeq, please consult the documentation of the NBPSeq package.

Presets

The analysis presets are a set of keywords (only one can be used) that predefine some of the parameters of the metaseqr2 pipeline. For the time being they are quite simple and they control i) the strictness of filtering and statistical thresholding with three basic levels ("all", "medium", "strict") and ii) the data columns that are exported, again in three basic ways ("basic", "normal", "full") controlling the amount of data to be exported. These keywords can be combined with a dot in the middle (e.g. "all.basic" to define an analysis preset. When using analysis presets, the following argumentsof metaseqr2 are overriden: exonFilters, geneFilters, pcut, exportWhat, exportScale, exportValues, exonStats. If you want to explicitly control the above arguments, the preset argument should be set to NULL (default). Following is a synopsis of the different presets and the values of the arguments they moderate:

  • "all_basic": use all genes (do not filter) and export all genes and basic annotation and statistics elements. In this case, the above described arguments become:

    • exonFilters=NULL

    • geneFilters=NULL

    • pcut=1

    • exportWhat=c("annotation","p_value", "adj_p_value","meta_p_value", "adj_meta_p_value","fold_change")

    • exportScale=c("natural","log2")

    • exportValues=c("normalized")

    • exportStats=c("mean")

  • "all_normal": use all genes (do not filter) and export all genes and normal annotation and statistics elements. In this case, the above described arguments become:

    • exonFilters=NULL

    • geneFilters=NULL

    • pcut=1

    • exportWhat=c("annotation","p_value", "adj_p_value","meta_p_value", "adj_meta_p_value","fold_change","stats", "counts")

    • exportScale=c("natural","log2")

    • exportValues=c("normalized")

    • exportStats=c("mean","sd","cv")

  • "all_full": use all genes (do not filter) and export all genes and all annotation and statistics elements. In this case, the above described arguments become:

    • exonFilters=NULL

    • geneFilters=NULL

    • pcut=1

    • exportWhat=c("annotation","p_value", "adj_p_value","meta_p_value", "adj_meta_p_value","fold_change","stats","counts")

    • exportScale=c("natural","log2","log10","vst")

    • exportValues=c("raw","normalized")

    • exportStats=c("mean","median","sd","mad", "cv","rcv")

  • "medium_basic": apply a medium set of filters and and export statistically significant genes and basic annotation and statistics elements. In this case, the above above described arguments become:

    • exonFilters=list(minActiveExons= list(exonsPerGene=5,minExons=2,frac=1/5))

    • geneFilters=list(length=list(length=500), avgReads=list(averagePerBp=100,quantile=0.25), expression=list(median=TRUE,mean=FALSE,quantile=NA, known=NA,custom=NA), biotype=getDefaults("biotypeFilter",org[1]))

    • pcut=0.05

    • exportWhat=c("annotation","p_value", "adj_p_value","meta_p_value", "adj_meta_p_value","fold_change")

    • exportScale=c("natural","log2")

    • exportValues=c("normalized")

    • exportStats=c("mean")

  • "medium_normal": apply a medium set of filters and export statistically significant genes and normal annotation and statistics elements. In this case, the above described arguments become:

    • exonFilters=list(minActiveExons= list(exonsPerGene=5,minExons=2,frac=1/5))

    • geneFilters=list(length=list(length=500), avgReads=list(averagePerBp=100,quantile=0.25), expression=list(median=TRUE,mean=FALSE, quantile=NA,known=NA,custom=NA), biotype=getDefaults("biotypeFilter",org[1]))

    • pcut=0.05

    • export.what=c("annotation","p_value", "adj_p_value","meta_p_value", "adj_meta_p_value","fold_change", "stats","counts")

    • exportScale=c("natural","log2")

    • exportValues=c("normalized")

    • exportStats=c("mean","sd","cv")

  • "medium_full": apply a medium set of filters and export statistically significant genes and full annotation and statistics elements. In this case, the above described arguments become:

    • exonFilters=list(minActiveExons= list(exonsPerGene=5,minExons=2,frac=1/5))

    • geneFilters=list(length=list(length=500), avgReads=list(averagePerBp=100,quantile=0.25), expression=list(median=TRUE,mean=FALSE, quantile=NA,known=NA,custom=NA), biotype=getDefaults("biotypeFilter",org[1]))

    • pcut=0.05

    • exportWhat=c("annotation","p_value", "adj_p_value","meta_p_value", "adj_meta_p_value","fold_change", "stats","counts")

    • exportScale=c("natural","log2","log10", "vst")

    • exportValues=c("raw","normalized")

    • exportStats=c("mean","median","sd","mad", "cv","rcv")

  • "strict_basic": apply a strict set of filters and and export statistically significant genes and basic annotation and statistics elements. In this case, the above described arguments become:

    • exonFilters=list(minActiveExons= list(exonsPerGene=4,minExons=2,frac=1/4))

    • geneFilters=list(length=list(length=750), avgReads=list(averagePerBp=100,quantile=0.5), expression=list(median=TRUE,mean=FALSE, quantile=NA,known=NA,custom=NA), biotype=getDefaults("biotypeFilter",org[1]))

    • pcut=0.01

    • exportWhat=c("annotation","p_value", "adj_p_value","meta.p.value", "adj_meta_p_value","fold_change")

    • exportScale=c("natural","log2")

    • exportValues=c("normalized")

    • exportStats=c("mean")

  • "strict_normal": apply a strict set of filters and export statistically significant genes and normal annotation and statistics elements. In this case, the above described arguments become:

    • exonFilters=list(minActiveExons= list(exonsPerGene=4,minExons=2,frac=1/4))

    • geneFilters=list(length=list(length=750), avgReads=list(averagePerBp=100,quantile=0.5), expression=list(median=TRUE,mean=FALSE, quantile=NA,known=NA,custom=NA), biotype=getDefaults("biotypeFilter",org[1]))

    • pcut=0.01

    • exportWhat=c("annotation","p_value", "adj_p_value","meta_p_value", "adj_meta_p_value","fold_change", "stats","counts")

    • exportScale=c("natural","log2")

    • exportValues=c("normalized")

    • exportStats=c("mean","sd","cv")

  • "strict_full": apply a strict set of filters and export statistically significant genes and full annotation and statistics elements. In this case, the above described arguments become:

    • exonFilters=list(minActiveExons= list(exonsPerGene=4,minExons=2,frac=1/4))

    • geneFilters=list(length=list(length=750), avgReads=list(averagePerBp=100,quantile=0.5), expression=list(median=TRUE,mean=FALSE, quantile=NA,known=NA,custom=NA), biotype=getDefaults("biotypeFilter",org[1]))

    • pcut=0.01

    • exportWhat=c("annotation","p_value", "adj_p_value","meta_p_value", "adj_meta_p_value","fold_change", "stats","counts")

    • exportScale=c("natural","log2","log10", "vst")

    • exportValues=c("raw","normalized")

    • exportStats=c("mean","median","sd","mad", "cv","rcv")

Note

Currently only gene and exon annotation from Ensembl (http://www.ensembl.org), UCSC and RefSeq are supported. In addition, the user may choose to use own GTF file on the fly or import to the backend annotation database (see buildAnnotationDatabase). Thus, the unique gene ids in the counts files should correspond to valid Ensembl, UCSC or RefSeq gene or exon accessions for the organism of interest, or according to the user's GTF. If you are not sure about the source of your counts file or do not know how to produce it, it's better to start from the original BAM/BED files (metaseqr2 will use the read2count function to create a counts file). Keep in mind that in the case of BED files, the performance will be significantly lower and the overall running time significanlty higher as the R functions which are used to read BED files to proper structures (GenomicRanges) and calculate the counts are quite slow. An alternative way is maybe the easyRNASeq package (Delhomme et al, 2012). The read2count function does not use this package but rather makes use of standard Bioconductor functions to handle NGS data. If you wish to work outside R, you can work with other popular read counters such as the HTSeq read counter (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html). Please also note that in the current version, the members of the geneFilters and exonFilters lists are not checked for validity so be careful to supply with correct names otherwise the pipeline will crash or at the best case scenario, will ignore the filters. Also note that when you are supplying metaseqr2 wtih an exon counts table, gene annotation is always downloaded so please be sure to have a working internet connection. In addition to the above, if you have a multiple core system, be very careful on how you are using the restrictCores argument and generally how many cores you are using with scripts purely written in R. The analysis with exon read data can very easily cause memory problems, so unless you have more than 64Gb of RAM available, consider setting restrict.cores to something like 0.2 when working with exon data. Finally, if you do not wish to download the same annotation again and again when performing multiple analyses, it is best to use the buildAnnotationDatabase function to download and store the resulting data frames in local SQLite database and then use these files with the org, refdb and version options.

Please note that the meta-analysis feature provided by metaseqr2 does not satisfy the strict definition of "meta-analysis", which is the combination of multiple similar datasets under the same statistical methodology. Instead it is the use of mulitple statistical tests applied to the same data. For the Simes method, please consult also "Simes, R. J. (1986). "An improved Bonferroni procedure for multiple tests of significance". Biometrika 73 (3): 751–754."

Also, if weight="meta_perm" ideally one would want to create the same set of indices for a given dataset so as to create reproducible p-values. To achieve this, use the set.seed function prior to any calculations. When metaseqR2 is loaded, the random seed is set to 42.

Author(s)

Panagiotis Moulos

Examples

# An example pipeline with gene counts
data("mm9GeneData",package="metaseqR2")

result <- metaseqr2(
    counts=mm9GeneCounts,
    sampleList=sampleListMm9,
    contrast=c("adult_8_weeks_vs_e14.5"),
    libsizeList=libsizeListMm9,
    annotation="embedded",
    org="mm9",
    countType="gene",
    normalization="edger",
    statistics="edger",
    pcut=0.05,
    figFormat="png",
    qcPlots="mds",
    exportWhat=c("annotation","p_value","adj_p_value","fold_change"),
    exportScale="natural",
    exportValues="normalized",
    exportStats="mean",
    exportWhere=file.path(tempdir(),"test1"),
    restrictCores=0.01,
    geneFilters=list(
        length=list(
            length=500
        ),
        avgReads=list(
            averagePerBp=100,
            quantile=0.25
        ),
        expression=list(
            median=TRUE,
            mean=FALSE,
            quantile=NA,
            known=NA,
            custom=NA
        ),
        biotype=getDefaults("biotypeFilter","mm9")
    ),
    outList=TRUE
)
head(result$data[["adult_8_weeks_vs_e14.5"]])

pmoulos/metaseqR2 documentation built on May 20, 2024, 5:48 a.m.