qAlign: Align reads

qAlign    R Documentation

Align reads

Description

Create read alignments against a reference genome and optional auxiliary targets if they do not yet exist. If necessary, also build target indices for the aligner.

Usage

qAlign(
  sampleFile,
  genome,
  auxiliaryFile = NULL,
  aligner = "Rbowtie",
  maxHits = 1,
  paired = NULL,
  splicedAlignment = FALSE,
  snpFile = NULL,
  bisulfite = "no",
  alignmentParameter = NULL,
  projectName = "qProject",
  alignmentsDir = NULL,
  lib.loc = NULL,
  cacheDir = NULL,
  clObj = NULL,
  checkOnly = FALSE,
  geneAnnotation = NULL
)

Arguments

sampleFile

The name of a text file listing input sequence files and sample names (see ‘Details’).

genome

The reference genome for primary alignments, one of:

  • a string referring to a “BSgenome” package (e.g. "BSgenome.Hsapiens.UCSC.hg19"), which will be downloaded automatically from Bioconductor if not present

  • the name of a fasta sequence file containing one or several sequences (chromosomes) to be used as a reference. The aligner index will be created when necessary and stored in a default location (see ‘Details’).

auxiliaryFile

The name of a text file listing sequences to be used as additional targets for alignment of reads not mapping to the reference genome (see ‘Details’).

aligner

selects the aligner program to be used for aligning the reads. Currently, only “Rbowtie” and “Rhisat2” are supported, which are R wrapper packages for ‘bowtie’ / ‘SpliceMap’ and ‘hisat2’, respectively (see Rbowtie-package and Rhisat2-package).

maxHits

sets the maximal number of allowed mapping positions per read (default: 1). If a read produces more than maxHits alignments, no alignments will be reported for it. For a multi-mapping read with at most maxHits alignments, a single alignment is selected at random.

paired

defines the type of paired-end library and can be set to one of no (single read experiment, default), fr (fw/rev), ff (fw/fw) or rf (rev/fw).

splicedAlignment

If TRUE, reads will be aligned using a spliced aligner, depending on the value of aligner described above:

aligner="Rhisat2": This is the recommended setting for spliced alignments and will use hisat2 from the Rhisat2-package. See also the geneAnnotation argument below for providing known exon-exon junctions.

aligner="Rbowtie": This is not recommended and only available for legacy reasons. It will use SpliceMap to produce spliced alignments (without using a database of known exon-exon junctions). Compared to the alternative alignment modes (non-spliced or spliced using Rhisat2 as aligner), this alignment mode is about ten-fold slower and also less sensitive. Furthermore, SpliceMap can only be used for reads with a minimal length of 50 nt; SpliceMap ignores shorter reads, and these reads will not be contained in the BAM file, neither as mapped nor as unmapped reads.

snpFile

The name of a text file listing single nucleotide polymorphisms to be used for allele-specific alignment and quantification (see ‘Details’).

bisulfite

For bisulfite-converted samples (Bis-seq), the type of bisulfite library (“dir” for directional libraries, “undir” for undirectional libraries). The default “no” is used for samples that are not bisulfite-converted.

alignmentParameter

An optional string containing command line parameters to be used for the aligner, to overrule the default alignment parameters used by QuasR. Please use with caution; some alignment parameters may break assumptions made by QuasR. Default parameters are listed in ‘Details’.

projectName

An optional name for the alignment project.

alignmentsDir

The directory to be used for storing alignments (bam files). If set to NULL (default), bam files will be generated at the location of the input sequence files.

lib.loc

can be used to change the default library path of R. The library path is used by QuasR to store aligner index packages created from BSgenome reference genomes.

cacheDir

specifies the location to store (potentially huge) temporary files. If set to NULL (default), the temporary directory of the current R session as returned by tempdir() will be used.

clObj

A cluster object, created by the package parallel, to enable parallel processing and speed up the alignment process.
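
A minimal sketch of creating such a cluster object with the parallel package. The worker count of two is an illustrative assumption, not a QuasR recommendation; choose it according to the available cores and memory.

```r
library(parallel)

# Start a cluster with two local worker processes (illustrative count).
cl <- makeCluster(2)

# An object of this class is what would be passed as `clObj` to qAlign().
print(class(cl))

# Release the worker processes once the analysis is finished.
stopCluster(cl)
```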

checkOnly

If TRUE, prevents the automatic creation of alignments or aligner indices. This allows a quick check for missing alignment files without starting the potentially long process of creating them. If alignments or indices are missing, an exception is thrown.
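
Because checkOnly = TRUE signals missing files by throwing an exception, the check can be wrapped in tryCatch. The following is only a sketch: the helper name alignments_complete is hypothetical and not part of QuasR, and it assumes the QuasR package is installed.

```r
# Hypothetical helper (not part of QuasR): returns TRUE if qAlign() finds all
# alignments and indices, and FALSE if qAlign() throws an error.
alignments_complete <- function(sampleFile, genome, ...) {
  if (!requireNamespace("QuasR", quietly = TRUE)) {
    stop("the QuasR package is required for this check")
  }
  tryCatch({
    # With checkOnly = TRUE nothing is created; missing files raise an error.
    QuasR::qAlign(sampleFile, genome, checkOnly = TRUE, ...)
    TRUE
  }, error = function(e) {
    message("qAlign check failed: ", conditionMessage(e))
    FALSE
  })
}
```

Note that this sketch treats every error from qAlign as a failed check, including errors unrelated to missing files (for example a malformed sampleFile), so the printed message should be inspected.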

geneAnnotation

Only used if aligner is "Rhisat2". The path to either a gtf file or a sqlite database generated by exporting a TxDb object. This file is used to generate a splice site file for Rhisat2, which is then used to guide the spliced alignment. Please note that when using a sqlite database file, you should not use the one contained in the installed package folder of a TxDb package. QuasR (through Rhisat2) creates additional files in that folder, which would interfere with the loading of the TxDb package.

Details

Before generating new alignments, qAlign looks for previously generated alignments as well as for an aligner index. If no aligner index exists, it will be automatically created and stored in the same directory as the provided fasta file, or as an R package in the case of a BSgenome reference. The name of this R package will be the same as the BSgenome package name, with an additional suffix from the aligner (e.g. BSgenome.Hsapiens.UCSC.hg19.Rbowtie). The generated bam files contain both aligned and unaligned reads. For paired-end samples, by default no alignments will be reported for read pairs where only one of the reads could be aligned.
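
The naming scheme for the index package can be illustrated with a small sketch. The helper index_package_name is hypothetical and only mirrors the rule stated above (BSgenome package name plus aligner suffix); it is not a QuasR function.

```r
# Hypothetical helper illustrating the index package naming rule:
# <BSgenome package name>.<aligner>
index_package_name <- function(genome, aligner = "Rbowtie") {
  paste(genome, aligner, sep = ".")
}

index_package_name("BSgenome.Hsapiens.UCSC.hg19")
# "BSgenome.Hsapiens.UCSC.hg19.Rbowtie"
```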

sampleFile is a tab-delimited text file listing all the input sequences to be included in a given analysis. The file has either two (single-end) or three columns (paired-end). The first row contains the column names, and additional rows contain the relative or absolute path and name of the input sequence file(s), as well as the corresponding sample name. Three input file formats are supported (fastq, fasta and bam). All input files in one sampleFile need to be in the same format, and are recognized by their extension (.fq, .fastq, .fa, .fasta, .fna, .bam), in raw or compressed form (e.g. .fastq.gz). If bam files are provided, then no alignments are generated by qAlign, and the alignments contained in the bam files will be used instead.

The column names in sampleFile must match those in the examples below, for a single-read experiment:

FileName SampleName
chip_1_1.fq.bz2 Sample1
chip_2_1.fq.bz2 Sample2

and for a paired-end experiment:

FileName1 FileName2 SampleName
rna_1_1.fq.bz2 rna_1_2.fq.bz2 Sample1
rna_2_1.fq.bz2 rna_2_2.fq.bz2 Sample2

The “SampleName” column contains the human-readable name of each sample, which will be used as the sample label. Multiple sequence files may be associated with the same sample name, which instructs QuasR to combine those files.
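
The sample files shown above can be written from R with write.table. This is a sketch using base R only; the output file names and the use of tempdir() are illustrative assumptions.

```r
# Single-read experiment: columns FileName and SampleName.
single <- data.frame(
  FileName   = c("chip_1_1.fq.bz2", "chip_2_1.fq.bz2"),
  SampleName = c("Sample1", "Sample2")
)

# Paired-end experiment: columns FileName1, FileName2 and SampleName.
paired <- data.frame(
  FileName1  = c("rna_1_1.fq.bz2", "rna_2_1.fq.bz2"),
  FileName2  = c("rna_1_2.fq.bz2", "rna_2_2.fq.bz2"),
  SampleName = c("Sample1", "Sample2")
)

# Illustrative output locations.
single_path <- file.path(tempdir(), "samples_single.txt")
paired_path <- file.path(tempdir(), "samples_paired.txt")

# Tab-delimited, with a header row and without quotes or row names.
write.table(single, single_path, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(paired, paired_path, sep = "\t", quote = FALSE, row.names = FALSE)

cat(readLines(paired_path), sep = "\n")
```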

auxiliaryFile is a tab-delimited text file listing one or several additional target sequence files in fasta format. Reads that do not map against the reference genome will be aligned against each of these target sequence files. The first row contains the column names, which must match those in the example below:

FileName AuxName
NC_001422.1.fa phiX174
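
An auxiliary file with this layout could be written as follows. This is a sketch using base R; the output path is an illustrative assumption.

```r
# One row per auxiliary target: fasta file name and a short target name.
aux <- data.frame(
  FileName = "NC_001422.1.fa",
  AuxName  = "phiX174"
)

# Illustrative output location.
aux_path <- file.path(tempdir(), "auxiliaries.txt")

# Tab-delimited, with a header row and without quotes or row names.
write.table(aux, aux_path, sep = "\t", quote = FALSE, row.names = FALSE)

cat(readLines(aux_path), sep = "\n")
```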

snpFile is a tab-delimited text file without a header and contains four columns with chromosome name, position, reference allele and alternative allele, as in the example below:

chr1 8596 G A
chr1 18443 G A
chr1 18981 C T
chr1 19341 G A

The reference and alternative alleles will be injected into the reference genome, resulting in two separate genomes. All reads will be aligned separately to both of these genomes, and the alignments will be combined, only retaining the best alignment for each read. In the final alignment, each read will be marked with a tag that classifies it as reference (R), alternative (A) or unknown (U), the latter if the read maps equally well to both genomes.
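
A snpFile in the four-column, header-less layout described above could be produced as sketched below. The column names in the data frame and the output path are illustrative assumptions; the names are not written to the file.

```r
# Chromosome name, position, reference allele, alternative allele.
snps <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr1"),
  pos   = c(8596L, 18443L, 18981L, 19341L),
  ref   = c("G", "G", "C", "G"),
  alt   = c("A", "A", "T", "A")
)

# Basic sanity check: the two alleles of each SNP must differ.
stopifnot(all(snps$ref != snps$alt))

# Illustrative output location.
snp_path <- file.path(tempdir(), "snps.txt")

# Tab-delimited, without a header row, quotes or row names.
write.table(snps, snp_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

cat(readLines(snp_path), sep = "\n")
```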

If bisulfite is set to “dir” or “undir”, reads will be C-to-T converted and aligned to a similarly converted genome.

If alignmentParameter is NULL (recommended), qAlign will select default parameters that are suitable for the experiment type. Please note that for bisulfite or allele-specific experiments, each read is aligned multiple times, and the resulting alignments need to be combined. This requires special settings for the alignment parameters, which should not be changed. For ‘simple’ experiments (neither bisulfite, allele-specific, nor spliced), alignments are generated using the parameters -m maxHits --best --strata. This aligns reads with up to “maxHits” best hits in the genome and selects one of them randomly.

Value

A qProject object.

Author(s)

Anita Lerch, Dimos Gaidatzis, Charlotte Soneson and Michael Stadler

See Also

qProject, makeCluster from package parallel, Rbowtie-package, Rhisat2-package

Examples

## Not run: 
# see qCount, qMeth and qProfile manual pages for examples
example(qCount)
example(qMeth)
example(qProfile)

## End(Not run)


fmicompbio/QuasR documentation built on Dec. 11, 2024, 11:22 p.m.