library(hpgltools)
library(hpgldata)
knitr::opts_knit$set(progress = TRUE,
                     verbose = TRUE,
                     width = 90,
                     echo = TRUE)
knitr::opts_chunk$set(error = TRUE,
                      fig.width = 8,
                      fig.height = 8,
                      dpi = 96)
old_options <- options(digits = 4,
                       stringsAsFactors = FALSE,
                       knitr.duplicate.label = "allow")
ggplot2::theme_set(ggplot2::theme_bw(base_size = 10))
set.seed(1)
rmd_file <- "a-01_bacterial_example.Rmd"

Introduction

This document seeks to show how to load some bacterial data and play with it. There is a caveat, for well over a year I have been saying I am going to move to doing everything in a summarizedExperiment (se) and haven't done it. I think these vignettes will be the place I start that process in a real sense.

Loading (meta)data and annotations

The following examples will use a real data set from an experiment in our lab. The raw data was processed using a mix of trimmomatic, biopieces, bowtie, samtools, and htseq. The final count tables were deposited into the 'preprocessing/count_tables/' tree. The resulting data structure was named 'most_v0M1,' named because it is comprised of count tables with 0 mismatches and 1 randomly-placed multi-match.

The annotation file was mgas_5005.gff.xz resides in 'reference/gff/'.

The count tables and meta-data were loaded through the create_expt() function and the genome annotations were loaded with gff2df().

data_file <- system.file("share/cdm_expt.rda", package = "hpgldata")
cdm <- new.env()
load(data_file, envir = cdm)
rm(data_file)

ls()

Two variables should exist now: rmd_file in case I want to knitr this file, cdm which is a list including the data required to make an expressionset. Using this information, I can create an expressionset.

expt <- create_expt(count_dataframe = cdm$cdm_counts,
                    metadata = cdm$cdm_metadata,
                    gene_info = cdm$gene_info)
cdm_se <- create_se(metadata = cdm$cdm_metadata,
                    gene_info = cdm$gene_info,
                    count_dataframe = cdm$cdm_counts)

knitr::kable(head(pData(expt)))

Introduction

hpgltools was written to make working with high-throughput data analyses easier. These analyses generally fall into a few stages:

  1. Data visualization and outlier/batch evaluation
  2. Differential expression analyses a. Visualization and export of these results
  3. Gene ontology/KEGG analyses a. Visualization and export of these results
  4. Genome visualizations a. With circos b. With genoplotR

Before any of these tasks may be performed, the data must be loaded into memory. hpgltools attempts to make this easier with create_expt() and subset_expt().

A little side-track

I want to use this vignette to explore the relationship between expressionSets and SummarizedExperiments. I use the former extensively, and have noticed but not paid strong attention to the latter.

Working with SEs

The data structure generated by create_se() contains the experimental metadata, counts, and gene information. Other information may be included as well.

Raw metrics

With the above in mind, once we have the various metadata and count data loaded into memory, a most common next step is examine the data before molesting it:

raw_metrics <- graph_metrics(expt, qq = TRUE, cis = NULL)

The function graph_metrics() performs all of the likely plots one might want. Some of which are not really appropriate for non-normalized data unless it is incredibly well behaved (after 30 years, I still want to spell behaved 'behaived', why is that?).

## View a raw library size plot
raw_metrics$libsize
## Or boxplot to see the data distribution
raw_metrics$boxplot
## The warning is because it automatically uses a log scale and there are some 0 count genes.
## Perhaps you prefer density plots
raw_metrics$density
## quantile/quantile plots compared to the median of all samples
raw_metrics$qqrat
raw_metrics$tsne_plot
## Here we can see some samples are differently 'shaped' compared to the median than others
## There are other plots one may view, but this data set is a bit too crowded as is.
## The following summary shows the other available plots:
summary(raw_metrics)

The plots are all generated by calling plot_something() where the somethings are:

Subsetting data

On the other hand, we might take a subset of the data to focus on the late-log vs. early-log samples.

The expt_subset() function allows one to pull material from the experimental design.

Once we have a smaller data set, we can more easily use PCA to see how the sample separate.

head(expt$design)
## elt stands for: "early/late in thy"
batch_a <- subset_se(expt, subset = "batch=='a'")
batch_b <- subset_expt(expt, subset = "batch=='b'")

a_metrics <- sm(graph_metrics(batch_a, cis = NULL))
b_metrics <- sm(graph_metrics(batch_b, cis = NULL))

PCA plots of these subsetted data are not likely to be very interesting, as it has been reduced to a single sample per condition, but we can look at them anyhow.

a_metrics$pc_plot
b_metrics$pc_plot
a_metrics$tsne_plot
b_metrics$tsne_plot

Normalizing data

It is pretty obvious that the raw data is a bit jumbled according to PCA. This is not paricularly suprising since we didn't normalize it at all. Therefore, in this block I will normalize it a few ways and follow up with some visualizations of showing how the apparent relationships change in the data.

## doing nothing to the data except log2 transforming it has a surprisingly large effect
norm_test <- normalize_expt(expt, transform = "log2")
l2_metrics <- sm(graph_metrics(norm_test, cis = NULL))
## a quantile normalization alone affect some, but not all of the data
norm_test <- sm(normalize_expt(expt, norm = "quant"))
q_metrics <- sm(graph_metrics(norm_test, cis = NULL))  ## q for quant, who quaffed nightshade.
## cpm alone brings out some samples, too
norm_test <- sm(normalize_expt(expt, convert = "cpm"))
c_metrics <- sm(graph_metrics(norm_test, cis = NULL))  ## c for cpm, who could not see the train.
## low count filtering has some effect, too
norm_test <- sm(normalize_expt(expt, filter = "pofa"))
f_metrics <- sm(graph_metrics(norm_test, cis = NULL))  ## f for filter, who was hit with a spade.
## how about if we mix and match methods?
norm_test <- sm(normalize_expt(expt, transform = "log2", convert = "cpm",
                               norm = "quant", batch = "combat_scale", filter = TRUE,
                               batch_step = 4, low_to_zero = TRUE))
## Some metrics are not very useful on (especially quantile) normalized data
norm_graphs <- sm(graph_metrics(norm_test, cis = NULL))

Now lets see some of the resulting metrics, in this case I will just compare some pca plots, as they are good at fooling our silly visual brains into seeing patterns.

l2_metrics$pc_plot
## Also viewable with plot_pca()$plot
## PCA plots seem (to me) to prefer log2 scale data.
q_metrics$pc_plot
## only normalizing on the quantiles leaves the data open to scale effects.
c_metrics$pc_plot
## but cpm alone is insufficient
f_metrics$pc_plot
## only filtering out low-count genes is helpful as well
norm_graphs$pc_plot
## The different batch effect testing methods have a pretty widely ranging effect on the clustering
## play with them by changing the batch= parameter to:
## "limma", "sva", "svaseq", "limmaresid", "ruvg", "combat", combatmod"
knitr::kable(norm_graphs$pc_summary)
## Thus we see a dramatic decrease in variance accounted for
## by batch after applying limma's 'removebatcheffect'
## (see batch.R2 here vs. above)
norm_graphs$smc
norm_graphs$disheat  ## svaseq's batch correction seems to draw out the signal quite nicely.
## It is worth noting that the wt, early log, thy, replicate c samples are still a bit weird.
norm_graphs$tsne_plot

Performing DE analyses

This is a relatively small data set, so performing some differential expression analyses really should not take long at all.

When performing these analyses with hpgltools, it will attempt to perform similar analyses with limma, edgeR, and DESeq2 via the all_pairwise() function. The most likely argument is 'model_batch' which may be used to explicitly include/exclude a batch factor in the model, or ask it to attempt including batch factors from sva/ruv/etc. By default it will attempt to include a column from the experimental design named 'batch'.

spyogenes_de <- sm(all_pairwise(expt))
## Even the lowest correlations are quite high.

The result of all_pairwise() is a list of the results from limma, edger, and deseq. In addition, I implemented a very simplistic, differential expression function named 'basic()'. It also provides some simple measurements of how well the various analyses agree (ergo the black and white heatmap).

Working with these separate tables can be more than a little annoying, combine_de_tables() attempts to simplify this. It will bring together the various tables, and if asked attempt to bring them together into a pretty-ified excel workbook.

all_pairwise() arbitrarily performs all possible pairwise comparisons. This is not necessarily what one actually wishes to see. Therefore, the argument keepers takes a list of contrasts:

my_keepers <- list(
  ## name    =   numerator / denominator
  "wt_media" = c("wt_ll_cf", "wt_ll_cg"),
  "mga_media" = c("mga_ll_cf", "mga_ll_cg"))

In the above example, if the keepers argument to combine_de_tables() is given as my_keepers, then the resulting table will not have the set of 6 possible comparisons, but instead will only have 2 tables named 'wt_media' and 'mga_media', which if printed to excel will be sheets named accordingly.

spyogenes_tables <- sm(combine_de_tables(spyogenes_de, excel = FALSE))
summary(spyogenes_tables)
## Try changing the p-adjustment
spyogenes_tables <- sm(combine_de_tables(spyogenes_de, excel = FALSE, padj_type = "BH"))
knitr::kable(head(spyogenes_tables$data[[1]]))

Finally, extract_significant_genes() may choose 'significant' genes based upon a few metrics including z-score vs. the distribution of logFC; a logFC cutoff, (ajusted)p-value cutoff, and/or top/bottom n genes.

spyogenes_sig <- sm(extract_significant_genes(spyogenes_tables, excel = FALSE))
knitr::kable(head(spyogenes_sig$limma$ups[[1]]))

Make pretty circos graphs

Since most of my circos graphs are for pyogenes, it is likely that the defaults are appropriate for this particular organism.

Much(all) of the following is taken from the material in tests/testthat/test_70mga.R

##microbe_ids <- as.character(sm(get_microbesonline_ids("pyogenes MGAS5005")))
## A caveat!  The new version of microbesonline changed the IDs so that they no longer
## match my old rnaseq analysis!!  Thus I put my old gff file used for mapping into inst/
## and will load the annotation data from that; but I will use this data to gather
## the COG information.
mgas_gff_df <- sm(load_gff_annotations(gff = system.file("share/gas.gff", package = "hpgldata")))
mgas_gff_df <- mgas_gff_df[-1, ]
mgas_gff_df[["sysName"]] <- gsub(pattern = "Spy_", replacement = "Spy",
                                 x = mgas_gff_df[["locus_tag"]])
rownames(mgas_gff_df) <- make.names(mgas_gff_df[["sysName"]], unique = TRUE)

mgas_microbes_df <- sm(load_microbesonline_annotations(id = 293653))
mgas_microbes_df$sysName <- gsub(pattern = "Spy_", replacement = "Spy",
                                 x = mgas_microbes_df$sysName)
rownames(mgas_microbes_df) <- make.names(mgas_microbes_df$sysName, unique = TRUE)

mgas_df <- merge(x = mgas_gff_df, y = mgas_microbes_df, by = "row.names")
rownames(mgas_df) <- mgas_df[["Row.names"]]
mgas_df <- mgas_df[, -1]
colnames(mgas_df) <- c("seqnames", "start", "end", "width", "strand", "source", "type",
                       "score", "phase", "ID", "Dbxref", "Is_circular", "gbkey", "genome",
                       "mol_type", "strain", "Name", "Note", "gene", "locus_tag",
                       "Parent", "product", "protein_id", "transl_table", "gene_synonym",
                       "sysName_again", "locusId", "accession", "GI", "scaffoldId",
                       "start_again", "stop", "strand_again", "sysName_again", "name",
                       "desc", "COG", "COGFun", "COGDesc", "TIGRFam", "TIGRRoles",
                       "GO", "EC", "ECDesc")

## First make a template configuration
circos_test <- circos_prefix(annotation = mgas_df)
## Fill it in with the data for s.pyogenes
lengths <- 1838600
names(lengths) <- "NC_007297"

circos_kary <- circos_karyotype(cfg = circos_test, lengths = lengths)
## Fill in the gene category annotations by gene-strand
circos_plus <- circos_plus_minus(cfg = circos_test)
circos_limma_hist <- circos_hist(cfg = circos_test,
                                 df = spyogenes_de$limma$all_tables[[1]],
                                 basename = "limma",
                                 colname = "logFC",
                                 outer = circos_plus)
circos_deseq_hist <- circos_hist(cfg = circos_test,
                                 df = spyogenes_de$deseq$all_tables[[1]],
                                 basename = "deseq",
                                 colname = "logFC",
                                 outer = circos_limma_hist)
circos_edger_hist <- circos_hist(cfg = circos_test,
                                 df = spyogenes_de$edger$all_tables[[1]],
                                 basename = "edger",
                                 colname = "logFC",
                                 outer = circos_deseq_hist)
circos_suffix(cfg = circos_test)
circos_made <- sm(circos_make(cfg = circos_test, target = "mgas"))
getwd()

circos result

GenoplotR is new to me, but it seems to work?

genoplot_chromosome()

Change conditions

Perhaps instead of looking at subsets of the data, we may want to consider wt/mga. If so, we can just change the condition to any column in the design matrix.

wt_mga_expt <- set_expt_conditions(expt = expt, fact = "type")
wt_mga_plots <- sm(graph_metrics(wt_mga_expt))
wt_mga_norm <- sm(normalize_expt(wt_mga_expt, transform = "log2", convert = "raw", filter = TRUE, norm = "quant"))
wt_mga_nplots <- sm(graph_metrics(wt_mga_norm))
wt_mga_de <- sm(all_pairwise(input = wt_mga_expt,
                          combined_excel = "wt_mga.xlsx",
                          sig_excel = "wt_mga_sig.xlsx",
                          abundant_excel = "wt_mga_abundant.xlsx"))
wt_mga_de$combined$comp_plot
## How well do the various DE tools agree on this data?

wt_mga_plots$tsne_plot
wt_mga_nplots$pc_plot
wt_mga_de$combined$limma_plots$WT_vs_mga$scatter
wt_mga_de$combined$limma_ma_plots$WT_vs_mga$plot
wt_mga_de$combined$limma_vol_plots$WT_vs_mga$plot
pander::pander(sessionInfo())


elsayed-lab/hpgltools documentation built on May 9, 2024, 5:02 a.m.