knitr::opts_chunk$set(echo = TRUE)

Oligo Games

This vignette documents the functions provided in oligoGames. I start with a function to design the oligo pool and take the user through the end of the game - when you learn kmers important for a particular cellular role. Here, I provide mainly high-level overview of the functions and I recommend the user to browse the low-level details using ?function for example: ?modelNucCounts to learn more about how we model nucleotide level counts from oligo level counts.. If something is confusing, please let me know and I will make every attempt to clarify it.

Now, let's play a simple game.

Level 0 - Set up the environment to start playing the oligo games

We will use different packages in this vignette so to begin we must load all of them in the current session.

library(oligoGames)
library(readr)
library(ggplot2)
library(dplyr)

Level 1 - Design an Oligo Pool from a set of sequences

In level one, we design the oligo pool by tiling regions whose activity we wish to understand. For example, lets say you want to find nuclear localization elements in long non coding RNAs (lncRNAs). In this case, your oligo pool would invovle tiling cDNAs of lncRNA genes with oligos of fixed length. To each oligo we need to add a unique barcode which we can sequence and match to the oligo as well as forward and reverse universal primers for making sequencing libraries using PCR. We provide a simple function - designOligoPool for this purpose.

We use this function by providing a fasta file of the regions we wish to tile, a gzipped file with seeds for microRNA sequences (recall that we don't want any barcode to match a miRNA seed otherwise the transcript will be destroyed in vivo), size of each tile, overlap between successive tiles, forward and reverse universal primer sequences (recommended that you use the default ones if the experiment is in human/mouse), barcode length and directory to write the output files. In case we want to, we can also specify number of barcodes per tiled sequence, number of scramble sequences for each tiled sequence. As an example:

regionsFile <- system.file("extdata", "testRegions.fa", package='oligoGames')
microSeedsFile <- system.file('extdata', 'hg19miRSeeds.txt.gz', package='oligoGames')
outDir <- 'demoOligoGame'
designOligoPool(regionsFile = regionsFile, tileSize = 110, overlap = 10, 
                barcodesPerSequence = 1, barcodesFile = '', microSeedsFile = microSeedsFile, 
                reSeq = '', numScrambles = 0, barcodeLen = 10, outDir = outDir)

Level 2 - Uniquely map FASTQ reads to barcodes and obtain oligo counts

Level 3 - Normalize oligo counts

Once we uniquely map the reads to the barcodes, our next challenge is to normalize the read counts across samples to account for difference in library sizes i.e. sequencing depth. For this, we provide the normCounts function which inputs a matrix with raw oligo counts and outputs a matrix of normalized oligo counts.

Also, please note the format of rawCounts which we input to this function. It contains one column for the names of transcripts i.e. oligos and then 1 column each for every sample we sequenced for this game. While it is not important for the first column to be named "Transcript" for this particular function, it is crucial for the downstream functions and so we recommend you to adhere to this formatting strictly. The replicate number is directly appended to the sample type i.e. "Nucleus1", "Nucleus2" etc. We also request the user to first provide all replicates of a given condition and only then proceed to the next condition. For example, we prefer columns be ordered "Nucleus1", "Nucleus2", "Nucleus3", "Total1", "Total2", "Total3" rather than "Nucleu1", "Total1", "Nucleus2", "Total2", "Nucleus3", "Total3".

rawCounts <- read_tsv(system.file("extdata", "allTranscriptsCounts_Raw.tsv", package = "oligoGames"), col_names = T)
tbl_df(rawCounts)
normalizedCounts <- normCounts(rawCounts, normType='median')

Level 4 - Model nucleotide counts from oligo counts

Next, we go from oligo counts to nucleotide counts. In our package, we use the modelNucCounts function for this functionality. This function inputs the matrix obtained by normCounts function along with a meta data file containing high level information about our oligo pool to obtain the modeled nucleotide counts. Let us take a look at the meta data file.

metaData <- system.file("extdata", "oligoMeta.tsv", package = "oligoGames")
meta <- read_tsv(metaData, col_names = T)
meta

We see that we need to given the name of each region in our oligo pool in a column titled name. Besides, we need to specify the starting barcode index for each region - startBarcodeIndex, number of oligos tiling the region - numOfOligos, the length of the region - seqLen as well as the overlap b/w neighboring oligos window and the length of the variable sequence in each oligo - oligoLen. We urge the user to be careful and provide the meta data in exactly the format described here.

Recall that in our oligo design, each nucleotide is tiled by several different oligos and hence we can use different approaches to model nucleotide counts from the corresponding oligo counts. A simple approach might be to simply sum the counts for each oligo tiling a particular nucleotide and assign the sum as the nucleotide count. Another approach would be to take the median of counts of all oligos tiling a given nucleotide. We can specify the exact method we wish to use to model the nucleotide counts using the modelMethods parameter. In addition, we need to provide a parameter called conditionLabels to define the sample labels for our 2 conditions, for example: RNA, DNA or Nuclei, Total. For full documentation on the function, please read ?modelNucCounts.

oligoLen <- 110
conditionLabels = c("Nuclei", "Total")
modeledNucs <- modelNucCounts(normalizedCounts, metaData, conditionLabels, 
                              modelMethod = "median", oligoLen = 110)
tbl_df(modeledNucs)

The output of modelNucCounts is a matrix whose row names indicate the tiled region, a column for the position in the tiled region and 1 column each for every sample present in our experiment.

Level 5 - Identify regions with differential nucleotide counts

Once we model nucleotide counts from our sequencing data, we want to infer differential regions i.e. regions which are enriched or depleted in one condition compared to the other. For this, we use the DRfinder function. You can take a look at all the parameters given to the function using ?DRfinder.

Mainly - you need to provide a matrix of nucleotide counts for each region (OligoSignal), the sample labels for the 2 conditions conditionLabels, minimum number of consecutive positions required for a region (minNumRegion), the threshold for the difference between the 2 conditions (cutoff) and whether you want to smooth the data (smooth).

If you have tiled your oligos deeply, we recommend to turn the smoothing off. Additionally, often times you are only interested in cases where condition1 > condition2. For example, to find nuclear localization sequences you only want regions where counts in nucleus > counts in total. In these situations, we recommend you set onlyUp parameter to be TRUE.

DRregions <- DRfinder(modeledNucs, conditionLabels, minNumRegion = 3, cutoff = 0.5, 
                      smooth = FALSE, verbose = TRUE, workers = 1, maxPerms = 50, onlyUp = TRUE)

Finally, you can subset the DRregions based on p-val or q-val to get the significant hits. For example, to get differential regions with 10% FDR,

filteredRegions <- filter(DRregions, qval<=0.1)

Level 6 - Learn k-mers enriched in regions identified in previous level



cshukla/oligoGames documentation built on May 27, 2019, 8:44 a.m.