Before applying DEUS, we strongly recommend trimming your input FASTQ files.
Trim Galore can be used for adapter and quality trimming. The --max_length
parameter should be set to (read length - 1) to remove non-small-RNA sequences; for a run with 50 bp reads, for example, this means --max_length 49.
# remove adapters (automatically recognized by Trim Galore)
/software/trim_galore_v0.x.x/trim_galore -q 0 --length 16 --max_length 49 $in_file -o $out_dir

# quality trimming (use dummy adapter to suppress adapter trimming)
/software/trim_galore_v0.x.x/trim_galore -q 20 --length 1 -a CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC --stringency 50 $in_file -o $out_dir
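To process a whole directory of samples, the adapter-trimming call above can be wrapped in a small shell loop. The sketch below is a dry run that only prints the command for each file; the paths, the 50 bp read length and the derived --max_length 49 are assumptions to adjust to your own setup (remove the echo to actually run Trim Galore):

```shell
#!/bin/sh
# Dry-run wrapper around the adapter-trimming step (hypothetical paths).
in_dir="${TMPDIR:-/tmp}/deus_fastq"    # folder with raw FASTQ files (assumption)
out_dir="${TMPDIR:-/tmp}/deus_trimmed" # folder for trimmed output (assumption)
read_len=50                            # read length of the sequencing run (assumption)
max_len=$((read_len - 1))              # read length - 1, per the recommendation above

mkdir -p "$in_dir" "$out_dir"
for in_file in "$in_dir"/*.fastq; do
  [ -e "$in_file" ] || continue        # skip when the glob matches nothing
  # print the command only; drop "echo" to execute it
  echo /software/trim_galore_v0.x.x/trim_galore -q 0 --length 16 \
       --max_length "$max_len" "$in_file" -o "$out_dir"
done
echo "using --max_length $max_len"
```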
You can install DEUS from GitHub with:
if (!require("devtools")) install.packages("devtools", repos='http://cran.us.r-project.org')
devtools::install_github("timjeske/DEUS")
library(DEUS)
Alternatively, you can install DEUS from a local copy.
First, you need to unzip it:
unzip DEUS-master.zip
Then you can install it in R:
if (!require("devtools")) install.packages("devtools", repos='http://cran.us.r-project.org')
devtools::install_local("./DEUS-master")
library(DEUS)
Before running the individual steps of the DEUS pipeline, it is helpful to set some required parameters.
The in_dir is the folder where the (trimmed) FASTQ files are stored.
The out_dir is the folder where the plots of the differential expression analysis (DEA), the clustering results and the summary output files will be stored.
The phenofile is the condition file for the DEA (an exemplary condition file is shipped with the package).
in_dir <- system.file("extdata", package = "DEUS")
out_dir <- tempdir()
phenofile <- system.file("extdata", "condition.tsv", package = "DEUS")
pheno_info <- read.table(phenofile, header=T, row.names=1, check.names=FALSE)
pheno_info
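For orientation, the condition file is a tab-separated table with sample identifiers as row names and at least a condition column, since the DESeq2 design used below is ~ condition. The following is a hypothetical sketch; the sample names and condition labels are made up, so consult the condition.tsv shipped with the package for the actual format:

```
	condition
sample1.fastq	case
sample2.fastq	control
```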
To create the count table for the differential expression analysis, createCountTableFromFastQs counts all unique sequences in each FASTQ file listed in your pheno_info. As an optional step, we keep only sequences with an average count greater than 1 to remove lowly expressed sequences and reduce the size of the count table.
count_table <- createCountTableFromFastQs(in_dir, pheno_info=pheno_info)
count_table_filtered <- count_table[rowMeans(count_table)>1,]
head(count_table_filtered, n=2)
A map is created by createMap which assigns a unique identifier to each sequence.
map <- createMap(count_table_filtered)
head(map, n=2)
Differential expression analysis is done by calling the function runDESeq2. Results are filtered using a 0.05 cutoff on the IHW p-values.
design <- ~ condition
de_results <- runDESeq2(count_table_filtered, pheno_info, design, out_dir=out_dir)
sig_results <- de_results$de_result
sig_results <- sig_results[!is.na(sig_results$IHWPvalue) & sig_results$IHWPvalue < 0.05,]
head(sig_results, n=2)
The significant sequences are subsequently stored in a FASTA file for BLAST annotation using sequencesAsFasta. BLAST is executed via runBlast, which requires the path to the BLAST binary, your BLAST database and the number of threads to use.
blast_exec <- "/your/path/to/ncbi-blast-x.x.x/bin/blastn"
blast_db <- system.file("extdata", "blastdb/DASHR_subset.fa", package = "DEUS")
blast_ncores <- 2
sig_seq_fasta <- sequencesAsFasta(sig_results, map)
blast_result <- runBlast(blast_exec, blast_db, blast_ncores, sig_seq_fasta)
In our example, we load pre-compiled data.
blast_int <- system.file("extdata", "results/blast_result_sig_sequences.tsv", package="DEUS")
blast_result <- read.table(blast_int, header=T, sep="\t")
head(blast_result)
The results of runDESeq2 are used as input for getConditionCountStats to compute the mean and the standard deviation of the normalized counts for each analyzed condition.
count_stats <- getConditionCountStats(de_results$norm_counts, pheno_info)
head(count_stats, n=2)
The CD-Hit algorithm is executed via runClustering. The function requires the cd-hit-est binary.
cd_hit <- "/your/path/to/cdhit/cd-hit-est"
sig_seq_fasta <- sequencesAsFasta(sig_results, map)
clust_result <- runClustering(cd_hit, sig_seq_fasta, out_dir, identity_cutoff=0.9, length_cutoff=0.9, wordlength=9, map)
For this manual, we will load available example data again.
clust_int <- system.file("extdata", "results/clust_result_sig_sequences.tsv", package="DEUS")
clust_result <- read.table(clust_int, header=T, sep="\t")
rownames(clust_result) <- clust_result$SequenceID
head(clust_result)
Finally, the group-wise mean and standard deviation of the normalized counts, the results of DEA, BLASTing and clustering are merged together by mergeResults.
Additionally, addCountsOfFeatureClasses is used to count the number of BLAST hits grouped by user defined feature classes.
Summary files are then written to the out_dir by writeSummaryFiles.
classes <- c("tRNA","[Hh]sa","^U")
summary <- mergeResults(sig_results, count_stats, blast_result, clust_result, map)
summary <- addCountsOfFeatureClasses(summary, classes)
writeSummaryFiles(summary, out_dir)
head(summary, n=2)
The function getNoBlastHitFraction can be used to compute, for each sample, the fraction of reads without annotation. This corresponds to the fraction of reads that would likely be ignored by mapping-based approaches because they stem from unannotated regions.
getNoBlastHitFraction(summary, de_results$norm_counts)
In addition to the default DEUS pipeline, we developed an adjusted approach that aggregates sequence counts for each sequence cluster and uses this information to find differentially expressed sequence clusters. With a few small adjustments, the cluster-based expression analysis can be integrated into the default DEUS pipeline.
Instead of creating the sequence map for the differentially expressed sequences only, it is now created from the raw input sequences, since all sequences will be used for clustering.
map <- createMap(count_table)
Afterwards, all sequences can be clustered. Based on the clustering results, the sequence counts are summed up for each cluster via mergeAndAggregate.
cd_hit <- "/your/path/to/cdhit/cd-hit-est"
all_seq_fasta <- sequencesAsFasta(count_table, map)
clust_result <- runClustering(cd_hit, all_seq_fasta, out_dir, identity_cutoff=0.9, length_cutoff=0.9, wordlength=9, map)
cl_counts <- mergeAndAggregate(map, count_table, clust_result)
For this manual, we will load the provided clustering results.
clust_int <- system.file("extdata", "results/clust_result_clust_sequences.tsv", package="DEUS")
clust_result <- read.table(clust_int, header=T, sep="\t")
rownames(clust_result) <- clust_result$SequenceID
cl_counts <- mergeAndAggregate(map, count_table, clust_result)
head(cl_counts, n=2)
The resulting count matrix is then used as input for the DE analysis.
design <- ~ condition
cl_de_results <- runDESeq2(cl_counts, pheno_info, design, out_dir = out_dir, prefix = "Cluster")
cl_sig_results <- cl_de_results$de_result
head(cl_sig_results, n=2)
Subsequently, the existing DEA results based on individual sequence counts and the new DEA results based on sequence clusters can be merged into a single data frame via mergeSingleAndClusterResults. Based on the individual and cluster p-values, the resulting data frame is filtered to detect sequences that are either differentially expressed based on their own sequence count or part of a differentially expressed sequence cluster.
sig_results <- mergeSingleAndClusterResults(cl_sig_results, clust_result, sig_results, map)
sig_results <- sig_results[((!is.na(sig_results$IHWPval) & sig_results$IHWPval < 0.05) | (!is.na(sig_results$Cl_IHWPval) & sig_results$Cl_IHWPval < 0.05)),]
head(sig_results, n=2)
Next, we run BLAST on all selected sequences.
blast_exec <- "/your/path/to/ncbi-blast-x.x.x/bin/blastn"
blast_db <- system.file("extdata", "blastdb/DASHR_subset.fa", package = "DEUS")
blast_ncores <- 2
sig_seq_fasta <- sequencesAsFasta(sig_results, map)
blast_result <- runBlast(blast_exec, blast_db, blast_ncores, sig_seq_fasta)
In our example, we load pre-compiled data.
blast_int <- system.file("extdata", "results/blast_result_clust_sequences.tsv", package="DEUS")
blast_result <- read.table(blast_int, header=T, sep="\t")
head(blast_result)
Finally, the summary table is generated again.
# merge results
classes <- c("tRNA","[Hh]sa","^U")
summary <- mergeResults(sig_results, count_stats, blast_result, clust_result, map)
summary <- addCountsOfFeatureClasses(summary, classes)
writeSummaryFiles(summary, out_dir)
head(summary, n=2)
To compare the two approaches, the function printClusterSummary can be used to obtain detailed information about the number of significant sequences and clusters.
printClusterSummary(summary)