ppAuto: RNA Seq and Alternative Splicing preprocessing function

View source: R/ppAuto.R

ppAuto    R Documentation

RNA Seq and Alternative Splicing preprocessing function

Description

ppAuto is a wrapper around several tools and functions that preprocess RNA sequencing data. It covers mapping of reads, sorting and indexing of BAM files, and summarization of read counts for exons, introns, genes and junctions. ppAuto also creates the matrices required to run the ExonPointer and IntronPointer algorithms: the junction matrix, ReadMembershipMatrix (RMM), IntronMembershipMatrix (iMM) and Gcount matrix.

System requirements for ppAuto include:

  1. fastq-dump (if files='SRA')

  2. tophat2

  3. samtools
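
Before a long run it can help to confirm that these tools are visible from R. The following is a minimal sketch using only base R; it is not part of FASE, and it only checks that each executable is on the PATH, not that its version is suitable.

```r
# Command-line tools named in the requirements list above.
required_tools <- c("fastq-dump", "tophat2", "samtools")

# Sys.which() returns "" for any tool that is not found on the PATH.
tool_paths <- Sys.which(required_tools)
tools_found <- nzchar(tool_paths)
names(tools_found) <- required_tools

# Report anything that is missing instead of failing later inside the pipeline.
missing_tools <- required_tools[!tools_found]
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}
```

fastq-dump only matters when the input files are SRA files, so a missing fastq-dump can be ignored for fastq input.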

Usage

ppAuto(
  folderSRA = FALSE,
  srlist = NULL,
  pairedend = FALSE,
  genomeBI,
  gtf,
  files = "fastq",
  p = 1,
  N = 6,
  r = 44,
  mate_std_dev = 30,
  read_edit_dist = 6,
  max_intron_length = 10000,
  min_intron_length = 50,
  segment_length = NULL,
  ...
)
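
A call might look like the sketch below. Every path is a hypothetical placeholder, and passing gtf as a file path is an assumption based on the argument description. The call itself is guarded so that the snippet does nothing unless FASE is installed and the input directory exists.

```r
# Hypothetical inputs: replace every path with your own.
ppauto_args <- list(
  folderSRA = "path/to/fastq_dir",                # directory holding the read files
  pairedend = TRUE,                               # all samples are paired-end
  genomeBI  = "path/to/bowtie2_index/genome",     # built with bowtie2-build
  gtf       = "path/to/annotation_intron_parsed.gtf",
  files     = "fastq",
  p         = 4                                   # threads for samtools and Rsubread
)

# Run only when the package and the input directory are actually available.
if (requireNamespace("FASE", quietly = TRUE) &&
    dir.exists(ppauto_args$folderSRA)) {
  do.call(FASE::ppAuto, ppauto_args)
}
```

Arguments left out of the list (N, r, mate_std_dev and so on) keep the defaults shown in the usage above.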

Arguments

folderSRA

path of directory containing fastq or SRA files. (default=current directory)

srlist

list of unique sample names of the fastq/SRA files. By default (NULL) the list is created inside the function. Please follow this naming convention for the sample files:
For SRA files : "Sample-S1_1" "Sample-S1_2" (for paired-end reads) and "Sample-S1" (for single-end reads).
For fastq files: "Sample-S1_1.fastq" "Sample-S1_2.fastq" (for paired-end reads) and "Sample-S1.fastq" (for single-end reads).
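
To illustrate the naming convention, the sketch below derives unique sample names from file names. The helper sample_names_from_files is hypothetical and not part of FASE, and whether srlist expects names with the mate suffix stripped is an assumption; check the package source before relying on it.

```r
# Hypothetical helper: derive unique sample names from read file names.
sample_names_from_files <- function(file_names, pairedend = FALSE) {
  # Drop the .fastq extension if present (SRA files have none in the convention).
  stems <- sub("\\.fastq$", "", basename(file_names))
  if (pairedend) {
    # Remove the trailing mate suffix _1 or _2 so both mates map to one sample.
    stems <- sub("_[12]$", "", stems)
  }
  unique(stems)
}

paired <- c("Sample-S1_1.fastq", "Sample-S1_2.fastq",
            "Sample-S2_1.fastq", "Sample-S2_2.fastq")
sample_names_from_files(paired, pairedend = TRUE)
# returns c("Sample-S1", "Sample-S2")
```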

pairedend

boolean, TRUE if reads are paired-end and FALSE if reads are single-end. All files should be either single-end or paired-end. (default=FALSE)

genomeBI

path of genome build of the organism created using bowtie2-build command.
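
bowtie2-build writes six index files that share a common prefix. Assuming genomeBI is that prefix (as tophat2 itself expects), a quick check such as the sketch below can catch a wrong path early. The helper bowtie2_index_exists is hypothetical and not part of FASE.

```r
# Hypothetical helper: TRUE when a complete Bowtie 2 index exists for `prefix`.
bowtie2_index_exists <- function(prefix) {
  suffixes <- c(".1", ".2", ".3", ".4", ".rev.1", ".rev.2")
  small <- paste0(prefix, suffixes, ".bt2")   # standard index
  large <- paste0(prefix, suffixes, ".bt2l")  # large index for big genomes
  all(file.exists(small)) || all(file.exists(large))
}

# Hypothetical prefix; FALSE unless the six index files are present.
bowtie2_index_exists("path/to/bowtie2_index/genome")
```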

gtf

intron-parsed GTF file of the organism, needed to generate intron read counts. See intronGTFparser to generate an intron-parsed GTF file.

files

type of raw read file: fastq or sra (downloaded from NCBI). All files should be in the same format and have the same read length. (default=fastq)

p

number of threads to be used by samtools and the Rsubread package. (default=1)

N

accepted read mismatches. Reads with more than N mismatches are discarded. (default=6) [tophat2 parameter]

r

expected inner distance between mate pairs. (default=44) [tophat2 parameter]

mate_std_dev

the standard deviation of the distribution of inner distances between mate pairs. (default=30) [tophat2 parameter]

read_edit_dist

final read alignments with an edit distance greater than this value are discarded. (default=6) [tophat2 parameter]

max_intron_length

when searching for junctions ab initio, TopHat2 will ignore donor/acceptor pairs farther than this many bases apart, except when such a pair is supported by a split segment alignment of a long read. (default=10000) [tophat2 parameter]

min_intron_length

TopHat2 will ignore donor/acceptor pairs closer than this many bases apart. (default=50) [tophat2 parameter]

segment_length

each read is cut into segments of this length, and the segments are mapped independently to find junctions. [tophat2 parameter]

...

other parameters to be passed to tophat2.
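
The arguments marked [tophat2 parameter] correspond to TopHat's own command-line options. The sketch below shows one way to turn them into an argument vector using TopHat's long option names. The helper tophat2_option_args is hypothetical, and this page does not show how ppAuto assembles its tophat2 call, so treat this as an illustration of the mapping only.

```r
# Hypothetical helper: translate the ppAuto tuning arguments into tophat2 options.
tophat2_option_args <- function(N = 6, r = 44, mate_std_dev = 30,
                                read_edit_dist = 6, max_intron_length = 10000,
                                min_intron_length = 50, segment_length = NULL) {
  flags <- list(
    "--read-mismatches"   = N,
    "--mate-inner-dist"   = r,
    "--mate-std-dev"      = mate_std_dev,
    "--read-edit-dist"    = read_edit_dist,
    "--max-intron-length" = max_intron_length,
    "--min-intron-length" = min_intron_length,
    "--segment-length"    = segment_length
  )
  # Options left as NULL are omitted so that tophat2 applies its own default.
  flags <- flags[!vapply(flags, is.null, logical(1))]
  # sprintf("%d") avoids scientific notation such as "1e+05" for large values.
  values <- sprintf("%d", as.integer(unlist(flags)))
  # Interleave option names and values: name1, value1, name2, value2, ...
  as.vector(rbind(names(flags), values))
}

tophat2_option_args(segment_length = 25)
```

A vector in this form can be passed as the args argument of system2() together with the index prefix and read files.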

Value

  1. Mapped, sorted and indexed bam files. (Can be run separately using tophat2 and samtools or wrapper function: ppRawData)

  2. Lists of gene counts, exon counts and intron counts saved in folderSRA directory as respective Rdata files. (Can be run separately using featureCounts or wrapper function: ppSumEIG)

  3. Junction Matrix: Matrix of read counts for annotated junctions. (Can be run separately using getJunctionCountMatrix or wrapper function: ppRawData)

  4. RMM : ReadMembershipMatrix. (Can be run separately using readMembershipMatrix or wrapper function: ppFASE)

  5. iMM : intronMembershipMatrix. (Can be run separately using intronMembershipMatrix or wrapper function: ppFASE)

  6. Gcount : A list of gene-wise read count summaries, each a matrix of meta-features by samples in the study. (Can be run separately using countMatrixGenes or wrapper function: ppFASE)

References

  1. Liao Y, Smyth GK, Shi W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research, 47, e47 (2019).


harshsharma-cb/FASE documentation built on Aug. 6, 2023, 1:37 a.m.