buildSummarized: Generate summarized Read File for DE analyses
In consensusDE: RNA-seq analysis using multiple algorithms

Description Usage Arguments Value Examples

View source: R/build_summarized.R

This function will create a summarized experiment, decribing reads from RNA-seq experiments that overlap a set of transcript features. Transcript features can be described as a gtf formatted table that is imported, or using a txdb. The summarized experiment can be build directly from bam files or by reading in counts in htseq format. This is designed to be straightforward and with minimised parameters for batch style RNA-seq analyses.

buildSummarized(sample_table = NULL, bam_dir = NULL,
  htseq_dir = NULL, gtf = NULL, tx_db = NULL,
  technical_reps = FALSE, map_reads = "transcript",
  mapping_mode = "Union", read_format = NULL, strand_mode = 0,
  fragments = FALSE, summarized = NULL, output_log = NULL,
  filter = FALSE, BamFileList_yieldsize = NA_integer_, n_cores = 1,
  force_build = FALSE, verbose = FALSE)

`sample_table`	A data.frame describing samples. For paired mode it must at least 2 columns, "file", "group", and option additional columns, "pairs" and "tech_replicate" for describing sample pairing and instances of technical replicates. The filename "file" must correspong to the name of the file in the directory supplied with the "bam_dir" parameter below - or ar error will be reported and buildSummarized will halt. This is not required if an existing summarized file is provided. Default = NULL
`bam_dir`	Full path to location of bam files listed in the "file" column in the sample_table provided above. This is not required if an existing summarized file is provided. Default = NULL
`htseq_dir`	Full path to location of htseq files listed in the "file" column in the sample_table described above. This is not required if an existing summarized file is provided. Files must end in ".txt". Default = NULL
`gtf`	Full path to a gtf file describing the transcript coordinates to map the RNA-seq reads to. GTF file is not required if providing a pre-computed summarized experiment file previously generated using buildSummarized() OR a tx_db object (below). Default = NULL
`tx_db`	An R txdb object. E.g. TxDb.Dmelanogaster.UCSC.dm3.ensGene. Default = NULL
`technical_reps`	Are there technical replicates to merge counts? I.e. are there multiple technical replicates run accross multiple lanes/sequencing runs. If "TRUE", unique sample names should be provided in a "tech_replicate" column of the "sample_table" for identification. Options are "TRUE" or "FALSE". Default = "FALSE"
`map_reads`	Which features to count reads by. Options are "transcript", "exon" or "cds". This will invoke transcriptsBy(), exonsBy() or cdsBy() respectively. Default = "transcript"
`mapping_mode`	Options are "Union", "IntersectionStrict" and "IntersectionNotEmpty". see "mode" in ?summarizeOverlaps for explanation. Default = "Union"
`read_format`	Are the reads from single-end or paired-end data? Option are "paired" or "single". An option must be selected if htseq_dir is NULL and read are summarized from BAM files. Default = NULL
`strand_mode`	indicates how the reads are stranded. Options are 0 (unstranded); 1 (stranded) and 2 (reverse strandedness). see ?strandMode in Genomic Alignments for explanation. Default = 0
`fragments`	When mapping_mode = "paired", include reads from pairs that do not map with their corresponding pair? see "fragments" in ?summarizeOverlaps for explanation. Default = TRUE
`summarized`	Full path to a summarized experiment file. If buildSummarized() has already been performed, the output summarized file, saved in "/output_log/se.R" can be used as the input (e.g. if filtering is to be done). Default = NULL
`output_log`	Full path to directory for output of log files and saved summarized experiment generated.
`filter`	Perform filtering of low count and missing data from the summarized experiment file? This uses default filtering via "filterByExpr". See ?filterByExpr for further information. Default = FALSE
`BamFileList_yieldsize`	If running into memory problems. Set the number of lines to an integer value. See "yieldSize" description in ?BamFileList for an explanation.
`n_cores`	Number of cores to utilise for reading in Bam files. Use with caution as can create memory issues if BamFileList_yieldsize is not parameterised. Default = 1
`force_build`	If the sample_table contains less than two replicates per group, force a summarizedExperiment object to be built. Otherwise buildSummarized will halt. Default = FALSE.
`verbose`	Verbosity ON/OFF. Default = FALSE

A summarized experiment

## Extract summarized following example in the vignette
## The example below will return a summarized experiment
## tx_db is obtained from TxDb.Dmelanogaster.UCSC.dm3.ensGene library
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
## bam files are obtained from the GenomicAlignments package

## 1. Build a sample table that lists files and groupings
## - obtain list of files
file_list <- list.files(system.file("extdata", package="GenomicAlignments"),
                        recursive = TRUE,
                        pattern = "*bam$",
                        full = TRUE)
bam_dir <- as.character(gsub(basename(file_list)[1], "", file_list[1]))

## - create a sample table to be used with buildSummarized() below
## must be comprised of a minimum of two columns, named "file" and "group",
sample_table <- data.frame("file" = basename(file_list),
                           "group" = c("treat", "untreat"))

summarized_dm3 <- buildSummarized(sample_table = sample_table,
                                  bam_dir = bam_dir,
                                  tx_db = TxDb.Dmelanogaster.UCSC.dm3.ensGene,
                                  read_format = "paired",
                                  force_build = TRUE)