processAmplicons: Process raw fastq data from pooled genetic sequencing screens

View source: R/processAmplicons.R

Description

Given a list of sample-specific index (barcode) sequences and hairpin/sgRNA-specific sequences from an amplicon sequencing screen, generate a DGEList of counts from the raw fastq file(s) containing the sequence reads. The positions of the index sequences and hairpin/sgRNA sequences are considered variable, with the hairpin/sgRNA sequences assumed to be located after the index sequences in the read.

Usage

processAmplicons(readfile, readfile2=NULL, barcodefile, hairpinfile,
                    allowMismatch=FALSE, barcodeMismatchBase=1,
                    hairpinMismatchBase=2, dualIndexForwardRead=FALSE,
                    verbose=FALSE, barcodesInHeader=FALSE,
                    plotPositions=FALSE)

Arguments

readfile

character vector giving one or more fastq filenames

readfile2

character vector giving one or more fastq filenames for the reverse reads, defaults to NULL

barcodefile

filename containing sample-specific barcode ids and sequences

hairpinfile

filename containing hairpin/sgRNA-specific ids and sequences

allowMismatch

logical, indicates whether sequence mismatch is allowed

barcodeMismatchBase

numeric value of maximum number of base sequence mismatches allowed in a barcode sequence when allowMismatch is TRUE

hairpinMismatchBase

numeric value of maximum number of base sequence mismatches allowed in a hairpin/sgRNA sequence when allowMismatch is TRUE

dualIndexForwardRead

logical, indicates whether the forward reads contain a second barcode sequence (must be present in barcodefile) which should be matched

verbose

if TRUE, output program progress

barcodesInHeader

logical, indicates if barcode sequences should be matched in the header (sequence identifier) of each read (i.e. the first of every group of four lines in the fastq files)

plotPositions

logical, indicates if a density plot displaying the position of each barcode and hairpin/sgRNA sequence in the reads should be created. If dualIndexForwardRead is TRUE or readfile2 is not NULL, plotPositions will generate two density plots, side by side, indicating the positions of the first barcodes and hairpins in the first plot, and second barcodes in the second.

Details

The processAmplicons function allows for hairpins/sgRNAs/sample index sequences to be in variable positions within each read.

The input barcode and hairpin/sgRNA files are tab-separated text files with at least two columns, named 'ID' and 'Sequences'. The 'ID' column contains the sample or hairpin/sgRNA ids, and the 'Sequences' column contains the sample index or hairpin/sgRNA sequences to be matched. If dualIndexForwardRead is TRUE, a third column 'Sequences2' is expected in the barcode file. If readfile2 is specified, another column 'SequencesReverse' is expected in the barcode file. The barcode file may also contain a 'group' column that indicates which experimental group a sample belongs to. Additional columns in each file will be included in the respective $samples or $genes data.frames of the final DGEList object. These files, along with the fastq file(s), are assumed to be in the current working directory.
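As a minimal sketch of the file layout described above, the following base R code writes a barcode file and a hairpin file and reads them back to check the required columns. All ids, sequences, the 'Gene' column, and the file names Barcodes.txt, Hairpins.txt and reads.fastq are hypothetical and chosen only for illustration; the files are written to a temporary directory here, whereas the function assumes they are in the working directory.

```r
# Hypothetical sample and hairpin tables; all ids and sequences are made up.
barcodes <- data.frame(
  ID        = c("Sample1", "Sample2"),
  Sequences = c("ATCACG", "CGATGT"),
  group     = c("control", "treated"),
  stringsAsFactors = FALSE
)
hairpins <- data.frame(
  ID        = c("Hairpin1", "Hairpin2"),
  Sequences = c("TTGACCTGAAGCTAGGCATCG", "GGCATTCAGGATCCAAGTCTA"),
  Gene      = c("GeneA", "GeneB"),   # extra column, carried into $genes
  stringsAsFactors = FALSE
)

# Write tab-separated files with a header row and no quoting.
barcodefile <- file.path(tempdir(), "Barcodes.txt")
hairpinfile <- file.path(tempdir(), "Hairpins.txt")
write.table(barcodes, barcodefile, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(hairpins, hairpinfile, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the files back and confirm the two required columns are present.
barcodes_check <- read.delim(barcodefile, stringsAsFactors = FALSE)
hairpins_check <- read.delim(hairpinfile, stringsAsFactors = FALSE)
stopifnot(all(c("ID", "Sequences") %in% colnames(barcodes_check)))
stopifnot(all(c("ID", "Sequences") %in% colnames(hairpins_check)))

# With edgeR loaded and the three files in the working directory, a call
# might then look like this (not run; reads.fastq is a hypothetical file):
# x <- processAmplicons("reads.fastq", barcodefile = "Barcodes.txt",
#                       hairpinfile = "Hairpins.txt", verbose = TRUE)
```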

To compute the count matrix, matching to the given barcodes and hairpins/sgRNAs is conducted in two rounds. The first round looks for an exact sequence match for the given barcode sequences and hairpin/sgRNA sequences through the entire read, returning the first match found. If no match is found and allowMismatch is TRUE, the program performs a second round of matching which allows for sequence mismatches. The maximum number of mismatched bases permitted in barcode and hairpin/sgRNA sequences is specified by the parameters barcodeMismatchBase and hairpinMismatchBase, respectively.
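The two-round idea can be illustrated with a short base R sketch. This is not the package's implementation: the helper names exact_pos, mismatch_pos and match_sequence are hypothetical, mismatches are assumed to be substitutions only, and ties are resolved by taking the first candidate in the list, which may differ from what processAmplicons does internally.

```r
# Position of the first exact occurrence of `pattern` in `read`, or NA.
exact_pos <- function(pattern, read) {
  pos <- regexpr(pattern, read, fixed = TRUE)[1]
  if (pos == -1) NA_integer_ else as.integer(pos)
}

# Position of the first window of `read` that differs from `pattern`
# by at most `max_mismatch` substituted bases, or NA.
mismatch_pos <- function(pattern, read, max_mismatch) {
  n <- nchar(pattern)
  if (nchar(read) < n) return(NA_integer_)
  p <- strsplit(pattern, "")[[1]]
  r <- strsplit(read, "")[[1]]
  for (start in seq_len(nchar(read) - n + 1)) {
    window <- r[start:(start + n - 1)]
    if (sum(window != p) <= max_mismatch) return(as.integer(start))
  }
  NA_integer_
}

# Round 1 tries exact matches; round 2 runs only if round 1 fails
# and allow_mismatch is TRUE.
match_sequence <- function(read, candidates, allow_mismatch = FALSE,
                           max_mismatch = 1) {
  for (i in seq_along(candidates)) {
    pos <- exact_pos(candidates[[i]], read)
    if (!is.na(pos)) return(list(index = i, position = pos, exact = TRUE))
  }
  if (allow_mismatch) {
    for (i in seq_along(candidates)) {
      pos <- mismatch_pos(candidates[[i]], read, max_mismatch)
      if (!is.na(pos)) return(list(index = i, position = pos, exact = FALSE))
    }
  }
  NULL
}

# Made-up hairpin sequences and reads.
hp <- c(H1 = "ACGTACGT", H2 = "TTTTGGGG")
exact_hit <- match_sequence("CCCACGTACGTCC", hp)
no_hit    <- match_sequence("CCCACGAACGTCC", hp)
fuzzy_hit <- match_sequence("CCCACGAACGTCC", hp, allow_mismatch = TRUE,
                            max_mismatch = 1)
```

In this sketch the second read carries a single substituted base, so it is rejected in the exact round and recovered only when mismatches are allowed.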

The program outputs a DGEList object, with a count matrix indicating the number of times each barcode and hairpin/sgRNA combination could be matched in reads from input fastq file(s).

For further examples and data, refer to the case studies available from http://bioinf.wehi.edu.au/shRNAseq.

Value

Returns a DGEList object with the following components:

counts

read count matrix tallying up the number of reads with particular barcode and hairpin/sgRNA matches. Each row is a hairpin/sgRNA and each column is a sample

genes

data.frame in which hairpin/sgRNA-specific information (ID, sequences, corresponding target gene) may be recorded

lib.size

auto-calculated column sum of the counts matrix
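The relationship between the counts and lib.size components can be sketched in base R. The per-read match table below is hypothetical; it stands in for the result of matching each read to one sample barcode and one hairpin/sgRNA.

```r
# Hypothetical match results: one row per successfully matched read.
matched <- data.frame(
  sample  = c("Sample1", "Sample1", "Sample2", "Sample2", "Sample2"),
  hairpin = c("Hairpin1", "Hairpin2", "Hairpin1", "Hairpin1", "Hairpin2"),
  stringsAsFactors = FALSE
)
sample_ids  <- c("Sample1", "Sample2")
hairpin_ids <- c("Hairpin1", "Hairpin2", "Hairpin3")

# Tally reads per hairpin (rows) and sample (columns); fixing the factor
# levels keeps hairpins with zero reads in the matrix.
tab <- table(factor(matched$hairpin, levels = hairpin_ids),
             factor(matched$sample, levels = sample_ids))
counts <- matrix(as.integer(tab), nrow = length(hairpin_ids),
                 dimnames = list(hairpin_ids, sample_ids))

# Library size is the column sum of the count matrix.
lib.size <- colSums(counts)
```

Here Hairpin3 is never matched, so its row is all zeros, and the library sizes are the total number of matched reads per sample.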

Note

This function replaced the earlier function processHairpinReads in edgeR 3.7.17.

This function replaces the previous processAmplicons function, which expected the sequences in the fastq files to have a fixed structure (as per Figure 1A of Dai et al., 2014). This function is intended for reads where hairpin/sgRNA/sample index sequences can be in variable positions within each read. When plotPositions=TRUE, a density plot of the match positions is created to allow the user to assess whether they occur in the expected positions.

Author(s)

Oliver Voogd, Zhiyin Dai, Shian Su and Matthew Ritchie

References

Dai Z, Sheridan JM, Gearing LJ, Moore DL, Su S, Wormald S, Wilcox S, O'Connor L, Dickins RA, Blewitt ME, Ritchie ME (2014). edgeR: a versatile tool for the analysis of shRNA-seq and CRISPR-Cas9 genetic screens. F1000Research 3, 95. http://f1000research.com/articles/3-95


hiraksarkar/edgeR_fork documentation built on Dec. 20, 2021, 3:52 p.m.