(Optional) preprocessing of FASTQ files

Before applying DEUS, we strongly recommend trimming your input FASTQ files. Trim Galore can be used for adapter and quality trimming. The --max_length parameter should be set to (read length - 1) to remove non-small-RNA sequences.

# remove adapters (automatically recognized by Trim Galore)
/software/trim_galore_v0.x.x/trim_galore -q 0 --length 16 --max_length 49 $in_file -o $out_dir
# quality trimming (use dummy adapter to suppress adapter trimming)
/software/trim_galore_v0.x.x/trim_galore -q 20 --length 1 -a CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC --stringency 50 $in_file -o $out_dir
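If you have many samples, the Trim Galore calls can also be scripted from R. The following is only a minimal sketch with hypothetical paths and 50 bp reads; adjust the binary location, the input directory and --max_length to your setup.

# hypothetical paths; --max_length = read length - 1 (here 49 for 50 bp reads)
trim_galore <- "/software/trim_galore_v0.x.x/trim_galore"
in_files <- list.files("raw_fastq", pattern = "\\.fastq(\\.gz)?$", full.names = TRUE)
trim_out_dir <- "trimmed_fastq"
dir.create(trim_out_dir, showWarnings = FALSE)
for (f in in_files) {
  # adapter trimming as in the command above
  system2(trim_galore, c("-q", "0", "--length", "16", "--max_length", "49", f, "-o", trim_out_dir))
}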

Installing DEUS

You can install DEUS from GitHub with:

if (!require("devtools")) install.packages("devtools", repos='http://cran.us.r-project.org')
devtools::install_github("timjeske/DEUS")
library(DEUS)

Alternatively, you can install DEUS from a local copy.

First, you need to unzip it:

unzip DEUS-master.zip

Then you can install it in R:

if (!require("devtools")) install.packages("devtools", repos='http://cran.us.r-project.org')
devtools::install_local("./DEUS-master")
library(DEUS)

Setting parameters

Before running the individual steps of the DEUS pipeline, it is helpful to set some required parameters. in_dir is the folder where the (trimmed) FASTQ files are stored. out_dir is the folder where plots of the differential expression analysis (DEA), clustering results and summary output files will be written. phenofile is the condition file for the DEA; an exemplary condition file is included in the package.

in_dir <- system.file("extdata", package = "DEUS")
out_dir <- tempdir()
phenofile <- system.file("extdata", "condition.tsv", package = "DEUS")
pheno_info <- read.table(phenofile, header=T, row.names=1, check.names=FALSE)
pheno_info
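If you need to build your own condition file, it is a tab-separated table with one row per sample and a condition column (referenced by the design formula used below). The following sketch uses hypothetical sample names and assumes that the row names match the FASTQ file names in in_dir; compare your own file against the provided example.

# hypothetical condition table; row names are assumed to match the FASTQ files
my_pheno <- data.frame(
  condition = c("control", "control", "case", "case"),
  row.names = c("sample1", "sample2", "sample3", "sample4")
)
write.table(my_pheno, "my_condition.tsv", sep="\t", quote=FALSE, col.names=NA)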

Counting unique sequences

To create the count table for the differential expression analysis, createCountTableFromFastQs counts all unique sequences in each FASTQ file listed in pheno_info. As an optional step, we keep only sequences with an average count greater than 1 to remove lowly expressed sequences and reduce the size of the count table.

count_table <- createCountTableFromFastQs(in_dir, pheno_info=pheno_info)
count_table_filtered <- count_table[rowMeans(count_table)>1,]
head(count_table_filtered, n=2)
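The mean-based cutoff is only one possible pre-filter. As an alternative sketch (not part of DEUS itself), you could instead require a minimum count in a minimum number of samples:

# alternative pre-filter: keep sequences with a count of at least 5 in at least 2 samples
keep <- rowSums(count_table >= 5) >= 2
count_table_alt <- count_table[keep, ]
nrow(count_table_alt)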

Creating identifiers for each sequence

createMap creates a map that assigns a unique identifier to each sequence.

map <- createMap(count_table_filtered)
head(map, n=2)

Running differential expression analysis

Differential expression analysis is performed by calling the function runDESeq2. Results are filtered using an IHW p-value cutoff of 0.05.

design <- ~ condition
de_results <- runDESeq2(count_table_filtered, pheno_info, design, out_dir=out_dir)
sig_results <- de_results$de_result
sig_results <- sig_results[!is.na(sig_results$IHWPvalue) & sig_results$IHWPvalue < 0.05,]
head(sig_results, n=2)
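For a quick overview of the significant sequences, you can count them and split them by direction of change. The log2FoldChange column is assumed here because runDESeq2 wraps DESeq2; check the column names of your result table first.

# number of significant sequences and up-/down-regulation split
# (assumes a DESeq2-style log2FoldChange column)
nrow(sig_results)
table(sign(sig_results$log2FoldChange))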

Running BLAST

The significant sequences are then written to a FASTA file for BLAST annotation using sequencesAsFasta. BLAST is executed via runBlast. You need to set the path to the BLAST binary, your BLAST database and the number of threads to use.

blast_exec <- "/your/path/to/ncbi-blast-x.x.x/bin/blastn"
blast_db <- system.file("extdata", "blastdb/DASHR_subset.fa", package = "DEUS")
blast_ncores <- 2
sig_seq_fasta <- sequencesAsFasta(sig_results,map)
blast_result <- runBlast(blast_exec, blast_db, blast_ncores, sig_seq_fasta) 
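If you do not yet have a BLAST database, one can be built from a FASTA file of reference small RNA sequences with makeblastdb from the NCBI BLAST+ suite. The paths and the input FASTA below are placeholders.

# build a nucleotide BLAST database from a reference FASTA (hypothetical paths)
makeblastdb <- "/your/path/to/ncbi-blast-x.x.x/bin/makeblastdb"
system2(makeblastdb, c("-in", "known_smallRNAs.fa", "-dbtype", "nucl"))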

In this example, we load precomputed BLAST results instead of running BLAST.

blast_int <- system.file("extdata", "results/blast_result_sig_sequences.tsv", package="DEUS")
blast_result <- read.table(blast_int, header=T, sep="\t")
head(blast_result)

Computing counts per condition

The results of runDESeq2 are used as input for getConditionCountStats to compute the mean and the standard deviation of the normalized counts for each analyzed condition.

count_stats <- getConditionCountStats(de_results$norm_counts, pheno_info)
head(count_stats, n=2)

Running clustering

Clustering with the CD-HIT algorithm is executed via runClustering. The function requires the path to the cd-hit-est binary.

cd_hit <- "/your/path/to/cdhit/cd-hit-est"
sig_seq_fasta <- sequencesAsFasta(sig_results,map)
clust_result <- runClustering(cd_hit, sig_seq_fasta, out_dir, identity_cutoff=0.9, length_cutoff=0.9, wordlength=9, map)

For this manual, we again load the provided example data.

clust_int <- system.file("extdata", "results/clust_result_sig_sequences.tsv", package="DEUS")
clust_result <- read.table(clust_int, header=T, sep="\t")
rownames(clust_result) <- clust_result$SequenceID
head(clust_result)

Merging the results

Finally, the group-wise mean and standard deviation of the normalized counts and the results of the DEA, BLAST annotation and clustering are merged by mergeResults. Additionally, addCountsOfFeatureClasses counts the number of BLAST hits grouped by user-defined feature classes. Summary files are then written to out_dir by writeSummaryFiles.

classes <- c("tRNA","[Hh]sa","^U")
summary <- mergeResults(sig_results, count_stats, blast_result, clust_result, map)
summary <- addCountsOfFeatureClasses(summary, classes)
writeSummaryFiles(summary, out_dir)
head(summary, n=2)
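To check which plots and summary files have been written, you can simply list the contents of out_dir:

list.files(out_dir)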

Computing the NA fraction

The function getNoBlastHitFraction can be used to compute the fraction of reads without annotation for each sample. This fraction corresponds to the fraction of reads that are likely to be ignored in mapping approaches because they map to regions that are not annotated.

getNoBlastHitFraction(summary, de_results$norm_counts)
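The per-sample fractions can also be visualized, for example with a simple barplot. This sketch assumes that getNoBlastHitFraction returns a named numeric vector with one value per sample.

# visualize the fraction of reads without BLAST hit per sample
# (assumes a named numeric vector is returned)
no_hit_frac <- getNoBlastHitFraction(summary, de_results$norm_counts)
barplot(no_hit_frac, las=2, ylab="Fraction of reads without BLAST hit")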

Extended expression analysis using sequence clusters

In addition to the default DEUS pipeline, we developed an adjusted approach that aggregates sequence counts per sequence cluster and uses this information to find differentially expressed sequence clusters. With a few small adjustments, this cluster-based expression analysis can be integrated into the default DEUS pipeline.

Instead of creating the sequence map for the differentially expressed sequences only, it is now created for all input sequences, since all sequences will be used for clustering.

map <- createMap(count_table)

Afterwards, all sequences can be clustered. Based on the clustering results, sequence counts are summed per cluster via mergeAndAggregate.

cd_hit <- "/your/path/to/cdhit/cd-hit-est"
all_seq_fasta <- sequencesAsFasta(count_table,map)
clust_result <- runClustering(cd_hit, all_seq_fasta, out_dir, identity_cutoff=0.9, length_cutoff=0.9, wordlength=9, map)
cl_counts <- mergeAndAggregate(map, count_table, clust_result)

For this manual, we will load the provided clustering results.

clust_int <- system.file("extdata", "results/clust_result_clust_sequences.tsv", package="DEUS")
clust_result <- read.table(clust_int, header=T, sep="\t")
rownames(clust_result) <- clust_result$SequenceID
cl_counts <- mergeAndAggregate(map, count_table, clust_result)
head(cl_counts, n=2)

The resulting count matrix is then used as input for the DE analysis.

design <- ~ condition
cl_de_results <- runDESeq2(cl_counts, pheno_info, design, out_dir = out_dir, prefix = "Cluster")
cl_sig_results <- cl_de_results$de_result
head(cl_sig_results, n=2)
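To see how many clusters are significant on their own before merging, they can be counted with the same IHW p-value cutoff. The IHWPvalue column name is assumed to match the sequence-level results above.

# number of significant clusters (assumes the same IHWPvalue column as above)
sum(!is.na(cl_sig_results$IHWPvalue) & cl_sig_results$IHWPvalue < 0.05)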

Subsequently, the existing DEA results based on individual sequence counts and the new DEA results based on sequence clusters can be merged into a single data frame via mergeSingleAndClusterResults. Based on the individual and cluster p-values, the resulting data frame is filtered to detect sequences that are differentially expressed based on their own counts or that are part of a differentially expressed sequence cluster.

sig_results <- mergeSingleAndClusterResults(cl_sig_results,clust_result,sig_results,map)
sig_results <- sig_results[((!is.na(sig_results$IHWPval) & sig_results$IHWPval < 0.05) | (!is.na(sig_results$Cl_IHWPval) & sig_results$Cl_IHWPval < 0.05)),]
head(sig_results, n=2)

Next, we run BLAST on all selected sequences.

blast_exec <- "/your/path/to/ncbi-blast-x.x.x/bin/blastn"
blast_db <- system.file("extdata", "blastdb/DASHR_subset.fa", package = "DEUS")
blast_ncores <- 2
sig_seq_fasta <- sequencesAsFasta(sig_results,map)
blast_result <- runBlast(blast_exec, blast_db, blast_ncores, sig_seq_fasta) 

As before, we load precomputed BLAST results.

blast_int <- system.file("extdata", "results/blast_result_clust_sequences.tsv", package="DEUS")
blast_result <- read.table(blast_int, header=T, sep="\t")
head(blast_result)

Finally, the summary table is generated again.

# merge results
classes <- c("tRNA","[Hh]sa","^U")
summary <- mergeResults(sig_results, count_stats, blast_result, clust_result, map)
summary <- addCountsOfFeatureClasses(summary, classes)
writeSummaryFiles(summary, out_dir)
head(summary, n=2)

In order to compare both approaches, the function printClusterSummary can be used to get detailed information about the number of significant sequences and clusters.

printClusterSummary(summary)

