library("hpgltools")
knitr::opts_knit$set(progress = TRUE,
                     verbose = TRUE,
                     width = 90,
                     echo = TRUE)
knitr::opts_chunk$set(error = TRUE,
                      fig.width = 8,
                      fig.height = 8,
                      dpi = 96)
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 10))
set.seed(1)
ver <- "20170820"
rmd_file <- "d-04_pasilla.Rmd"

Pasilla vignette: r ver

Example hpgltool usage with a real data set (pasilla)

In this document, I am hoping to mostly copy/paste material from the tests/ tree and explain the various functionalities therein. It is my hope therefore to step from data loading all the way through ontology searching with appropriate visualizations at each stage.

Load Data

In test_01load_data.R I perform load some data into an expressionset and get ready to play with it.

## I use sm to keep functions from printing too much (well, anything really)
tt <- sm(library(hpgltools))
tt <- sm(library(pasilla))
tt <- sm(data(pasillaGenes))

Gather annotation data

biomart is an excellent resource for annotation data, but it is entirely too complex. The following function 'get_biomart_annotations()' attempts to make that relatively simple.

## Try loading some annotation information for this species.
gene_info_lst <- sm(load_biomart_annotations(species = "dmelanogaster",
                                             host = "useast.ensembl.org"))
gene_info <- gene_info_lst[["annotation"]]
info_idx <- gene_info[["gene_biotype"]] == "protein_coding"
gene_info <- gene_info[info_idx, ]
rownames(gene_info) <- make.names(gene_info[["ensembl_gene_id"]], unique = TRUE)
head(gene_info)

Load count tables

The pasilla data set provides count tables in a tab separated file, let us read them into an expressionset in the following block along with creating an experimental design. create_expt() will then merge the annotations, experimental design, and count tables into an expressionset.

## This section is copy/pasted to all of these tests, that is dumb.
datafile <- system.file("extdata/pasilla_gene_counts.tsv", package = "pasilla")
## Load the counts and drop super-low counts genes
counts <- read.table(datafile, header = TRUE, row.names = 1)
counts <- counts[rowSums(counts) > ncol(counts),]
## Set up a quick design to be used by cbcbSEQ and hpgltools
design <- data.frame(row.names = colnames(counts),
    condition = c("untreated","untreated","untreated",
        "untreated","treated","treated","treated"),
    libType = c("single_end","single_end","paired_end",
        "paired_end","single_end","paired_end","paired_end"))
metadata <- design
colnames(metadata) <- c("condition", "batch")
metadata[["sampleid"]] <- rownames(metadata)

## Make sure it is still possible to create an expt
pasilla_expt <- sm(create_expt(count_dataframe = counts, metadata = metadata,
                               savefile = "pasilla", gene_info = gene_info))

Graph metrics

In this block I will use a single function graph_metrics() to plot them all. And then follow up with the one at a time. Many functions in hpgltools are quite chatty with liberal usage of message(), as a result I will sm() this call to silence it.

pasilla_metrics <- sm(graph_metrics(pasilla_expt, ma = TRUE, qq = TRUE))
summary(pasilla_metrics)

Print some plots!

pasilla_metrics$libsize
## The library sizes range from 8-21 million reads, this might be a problem for
## some analyses, but it should be ok
pasilla_metrics$nonzero
## Ergo, the lower abundance libraries have more genes of counts == 0 (bottom
## left).
pasilla_metrics$boxplot
## And a boxplot downshifts them (but not that much because it decided to put
## the data on the log scale).
pasilla_metrics$density
## Similarly, one can see those samples are a bit lower with respect to density

## Unless the data is very well behaved, the rest of the plots are not likely to
## look good until the data is normalized, nonetheless, lets see
pasilla_metrics$corheat
pasilla_metrics$disheat
pasilla_metrics$pc_plot
## So the above 3 plots are pretty much the worst case scenario for this data.

Normalize and replot

The most common normalization suggested by Najib is a cpm(quantile(filter(data))). On top of that we often do log2() and/or a batch adjustment. default_norm() does the first and may be supplemented with other arguments.

norm <- default_norm(pasilla_expt, transform = "log2")
norm_metrics <- graph_metrics(norm)
norm_metrics$corheat
norm_metrics$smc
norm_metrics$disheat
norm_metrics$smd
## some samples look a little troublesome here.
norm_metrics$pc_plot

Try a pairwise comparison

With the above metrics in mind, we may perform a pairwise comparison of the data. By default, all_pairwise() performs every possible pairwise contrast, which in the case is comprised of just treated vs. untreated.

pasilla_pairwise <- sm(all_pairwise(pasilla_expt))
pasilla_tables <- sm(combine_de_tables(
  pasilla_pairwise,
  excel = "pasilla_tables.xlsx"))
pasilla_sig <- sm(extract_significant_genes(
  pasilla_tables,
  excel = "pasilla_sig.xlsx"))
pasilla_ab <- sm(extract_abundant_genes(
  pasilla_pairwise,
  excel = "pasilla_abundant.xlsx"))
pasilla_tables[["plots"]][["untreated_vs_treated"]][["deseq_ma_plots"]]$plot
pasilla_tables[["plots"]][["untreated_vs_treated"]][["edger_ma_plots"]]$plot
pasilla_tables[["plots"]][["untreated_vs_treated"]][["limma_ma_plots"]]$plot
up_genes <- pasilla_sig[["deseq"]][["ups"]][["untreated_vs_treated"]]
down_genes <- pasilla_sig[["deseq"]][["downs"]][["untreated_vs_treated"]]
pasilla_go <- load_biomart_go(species = "dmelanogaster")$go
pasilla_length <- fData(pasilla_expt)[, c("ensembl_gene_id", "cds_length")]
colnames(pasilla_length) <- c("ID", "length")

pasilla_up_goseq <- simple_goseq(sig_genes = up_genes, go_db = pasilla_go,
                                 length_db = pasilla_length)
pasilla_up_goseq[["pvalue_plots"]][["bpp_plot_over"]]

pasilla_down_goseq <- simple_goseq(sig_genes = down_genes, go_db = pasilla_go,
                                   length_db = pasilla_length)
pasilla_down_goseq[["pvalue_plots"]][["bpp_plot_over"]]

high_genes <- names(pasilla_ab[["abundances"]][["deseq"]][["high"]][["treated"]])
pasilla_high_goseq <- simple_goseq(sig_genes = high_genes, go_db = pasilla_go,
                                   length_db = pasilla_length)
pasilla_high_goseq[["pvalue_plots"]][["bpp_plot_over"]]

low_genes <- names(pasilla_ab[["abundances"]][["deseq"]][["low"]][["treated"]])
pasilla_low_goseq <- simple_goseq(sig_genes = low_genes, go_db = pasilla_go,
                                  length_db = pasilla_length)
pasilla_low_goseq[["pvalue_plots"]][["bpp_plot_over"]]
pander::pander(sessionInfo())
message(paste0("This is hpgltools commit: ", get_git_commit()))


elsayed-lab/hpgltools documentation built on May 9, 2024, 5:02 a.m.