run_deseq_analysis: Run differential expression (DE) analysis for several...
In cparsania/parcutils: Utility Functions To Deal Day To Day Bioinformatics Tasks.

run_deseq_analysis

R Documentation

Run differential expression (DE) analysis for several comparisosns at a time. Internally it uses `DESeq2::DESeq()` to perform DE analysis.

Description

DESeq2 is a popular method to perform DE analysis for RNA-seq data. Pre and post DESeq run, however, involves several data wrangling steps. For example, prior to run DESeq genes may need to be filtered out based on number of reads mapped to genes across replicates. Similarly, post DESeq run user needs to set cutoffs (log2fc, pvalue and padj) to define up and down regulated genes. For large RNA-seq experiments involving several DE comparison such as sub-setting and different cutoffs creates messy and less readable code. This function helps to make things bit tidy and increase the code readability. Besides that, output of this function can subsequently used for several other functions of this package.

Usage

run_deseq_analysis(
  counts,
  column_geneid,
  sample_info,
  group_numerator,
  group_denominator,
  delim = "\t",
  comment_char = "#",
  min_counts = 10,
  min_replicates = 1,
  cutoff_lfc = 1,
  cutoff_pval = 0.05,
  cutoff_padj = 0.01,
  regul_based_upon = 1,
  print_rows_all_zero = FALSE
)

Arguments

counts

a character string of the path to a count file or an object of dataframe having raw counts for each gene. See details below to know more about format of the count file and count dataframe.

column_geneid

a character string denoting a column of geneid in counts

sample_info

a character string denoting a name of sample information file or a dataframe. A file or a dataframe both must have at least two columns WITHOUT column names. First column denotes to samples names and second column denotes group name for each sample in first column. For e.g.


control_rep_1	Control
control_rep_2	Control
control_rep_3	Control
treat_1_rep_1	Treatment_1
treat_1_rep_2	Treatment_1
treat_1_rep_3	Treatment_1
treat_2_rep_1	Treatment_2
treat_2_rep_2	Treatment_2
treat_2_rep_3	Treatment_2

group_numerator

a character vector denoting sample groups to use in numerator to calculate fold change.

group_denominator

a character vector denoting sample groups to use in denominator to calculate fold change.

delim

a character denoting deliminator for count file. Only valid if count is a file path.

comment_char

a character denoting comments line in count file. Only valid if count is a file path.

min_counts

a numeric value, default 10, denoting minimum counts for a gene to be used to consider a gene for differential expression analysis.

min_replicates

a numeric value, default 1, denoting minimum samples within a sample group must have minimum_counts. Value provided must not be higher than number of samples in each group. For example for given values min_replicates = 2 and minimum_counts = 10 the genes which have minimum counts 10 in at least 2 samples within a groups will be accounted for DEG. Rest will be filtered out prior to run DESeq.

cutoff_lfc

minimal threshold for log2fold change, default 1 (2 fold).

cutoff_pval

minimal threshold for pvalue, default 0.05. P-value threshold will be applied only when regul_based_upon is either 1 or 3.

cutoff_padj

minimal threshold for Padj, default 0.01. Padj threshold will be applied only when regul_based_upon is either 2 or 3.

regul_based_upon

one of the numeric choices 1, 2, or 3.

if 1 ...

Up : log2fc >= cutoff_lfc & pvalue <= cutoff_pval
Down : log2fc <= (-1) * cutoff_lfc & pvalue <= cutoff_pval
Other : remaining genes

if 2 ...

Up : log2fc >= cutoff_lfc & padj <= cutoff_padj
Down : log2fc <= (-1) * cutoff_lfc & padj <= cutoff_padj
Other : remaining genes

if 3 ...

Up : log2fc >= cutoff_lfc & pvalue <= cutoff_pval & padj <= cutoff_padj
Down : log2fc <= (-1) * cutoff_lfc pvalue <= cutoff_pval & padj <= cutoff_padj
Other : remaining genes

print_rows_all_zero

logical, default FALSE, denoting whether to print genes with value 0 in all columns.

Details

: For the argument count user can either provide a character string denoting a file or an object of dataframe. In each case required format is explained below.

⁠Count file⁠: Count file is a table of row read counts usually derived from .bam file for different genomic features (e.g. genes, transcripts etc.). Data in the count file must be in a tabular format with a valid column deliminator (e.g., tab, comma etc.). First row and first column will be considered as column names and row names respectively. Values for column names and row names are usually character string or combination of character, numbers and special characters such as ⁠_⁠, or .. Both row names and column names must have unique values.
⁠Count dataframe⁠: Count data in a dataframe format having same requirement of row names and column names explained for ⁠count file⁠.

Value

Return object is a dataframe (tibble) having each row denoting a unique differential comparison. There are total 8 columns as explained below.

de_comparisons : It stores the name of differential comparison for each row.
numerator : It stores name of samples which were used as numerator for the differential comparison in each row.
denominator : It stores name of samples which were used as denominator for the differential comparison in each row.
norm_counts : This is a named-list column. Each row in this column is a list of two containing normalised genes expression values in a dataframe for the samples - numerator and denominator. The first column of the dataframe is gene_id and subsequent columns are gene expression values in replicates of corresponding samples. This normalised gene expression values are obtained using counts slot of a DESeqDataSet object. e.g.: counts(dds,normalized=TRUE)
dsr : This is a named-list column stores an object of class DESeq2::DESeqResults() for the differential comparison in each row.
dsr_tibble: This is a named-listcolumn stores and an output of DESeq2::DESeqResults() in the dataframe format for the differential comparison in each row.
dsr_tibble_deg: The data in this column is same as in the column dsr_tibble except it contains two extra columns signif and regul. Values in the signif specifies statistical and fold change significance of the gene while values in the regul denotes whether the gene is up or down regulated.
deg_summmary : This is a named-list column. Each element of the list is a dataframe summarizing number of differential expressed gene for the differential comparison for each row.

Examples


count_file <- system.file("extdata","toy_counts.txt" , package = "parcutils")
count_data <- readr::read_delim(count_file, delim = "\t",show_col_types = FALSE)

sample_info <- count_data %>% colnames() %>% .[-1]  %>%
 tibble::tibble(samples = . , groups = rep(c("control" ,"treatment1" , "treatment2"), each = 3) )


res <- run_deseq_analysis(counts = count_data ,
                         sample_info = sample_info,
                         column_geneid = "gene_id" ,
                         cutoff_lfc = 1 ,
                         cutoff_pval = 0.05,
                         group_numerator = c("treatment1", "treatment2") ,
                         group_denominator = c("control"))

res

## all comparisons

print(res$de_comparisons)

## DESEq result object(s)

print(res$dsr)

## DESEq result data frame

print(res$dsr_tibble)

## DESEq result data frame  DEG assigned, look at the columns 'signif' and 'regul'

print(res$dsr_tibble_deg)

## DEG summary

print(res$deg_summmary)

cparsania/parcutils documentation built on June 11, 2025, 1:01 p.m.