In pkimes/upbm: Tools for the analysis of universal protein binding microarrays

knitr::opts_chunk$set(tidy = FALSE, cache = TRUE, dev = "png",
                      message = FALSE, error = FALSE, warning = TRUE)

Introduction

While a large collection of methods and software packages have been developed for processing and analyzing traditional DNA microarrays, the development of tools for protein binding microarrays (PBMs) has been limited. PBMs and DNA microarrays measure fundamentally different things, with PBMs detecting protein-DNA interactions and DNA microarrays assaying nucleic acid hybridization. Blindly applying DNA microarray tools to PBM data can lead to problems.

This package provides tools and methods for reading, processing, and analyzing PBM data using standard Bioconductor data structures. The primary focus is on universal-design PBMs (uPBMs) which provide approximately uniform coverage of all 8-mer DNA sequences across probes on the array [@berger2006]. The methods in this package have not been tested with other array designs.

suppressPackageStartupMessages(library("upbm"))

In this quickstart guide, we demonstrate the basic functionality of the upbm package. More details on the individual steps for pre-processing and inference, as well as the specialized classes (PBMExperiment and PBMDesign) are provided in other vignettes (vignette("upbm-classes")).

HOXC9 Dataset

The input to the upbm package is raw GenePix Results (GPR) data for Alexa488 and Cy3 scans. Functions for reading GPR data into R are included with this package (see vignette("upbm-classes")). For this quickstart guide, we will use just the Alexa488 scans for a PBM dataset included in the upbmData package. Details on how Cy3 scans can also be used provided in a separate vignette (see vignette("upbm-preprocessing")).

data(hoxc9alexa, package = "upbmData")

The dataset includes PBM data for four allelic variants of the HOXC9 human TF (wild type, K195R, R193K, R222W). For simplicity, we will subset this data to scans at a single PMT gain (450) and keep just the wild type and R222W alleles.

alexa_subset <- hoxc9alexa[, colData(hoxc9alexa)$pmt == 450]
alexa_subset <- alexa_subset[, colData(alexa_subset)$condition %in%
                                                    c("HOXC9-REF", "HOXC9-R222W")]

The data is stored as a PBMExperiment object, which extends the standard Bioconductor SummarizedExperiment class. Each sample is stored as a column, with rows corresponding to probes on the array.

alexa_subset

Sample Pre-Processing

Analysis of uPBM data is performed in three steps: sample pre-processing, 8-mer summarization, and 8-mer inference. First, samples are preprocessed using upbm::upbmPreprocess. This wrapper functions performs various pre-processing steps, including Cy3 normalization. In this quickstart, we will not be using Cy3 scans and instead set the cy3pe= option to NULL. We also silence verbose output here.

alexa_normed <- upbmPreprocess(pe = alexa_subset, cy3pe = NULL,
                               verbose = FALSE)

The returned object is a PBMExperiment object with the same number of columns as the original PBMExperiment, with normalized probe intensities now stored in the normalized assay. Rows have been filtered to remove background and control probes.

alexa_normed

8-mer Summarization

After normalization, probe-level intensities are aggregated across replicates by fitting a probe-level model to the data. Since the dataset includes samples for different conditions, this must be specified. In our dataset, this information is included in the "condition" column of the colData of the PBMExperiment. This is passed to the stratify= parameter.

alexa_pfit <- probeFit(alexa_normed, stratify = "condition")

The resulting object is a PBMExperiment with the same number of rows, but only one column for each unique value in the column specified by stratify=. In this case, we have two conditions, HOXC9-REF and HOXC9-R222W. Aggregated probe summaries are stored in the beta assay.

alexa_pfit

Next, probe-level summaries are used to obtain 8-mer-level summaries. A helper function, upbm::uniqueKmers is provided for generating the full set of K-mers of a specified length (unique to reverse complementation). During this step, parameters are also estimated for downstream differential testing and inference. The baseline condition must be specified to the baseline= parameter.

alexa_kfit <- kmerFit(alexa_pfit, kmers = uniqueKmers(8L),
                      baseline = "HOXC9-REF")

The result is a SummarizedExperiment with K-mer-level affinity and variance estimates for each condition, with rows now corresponding to K-mers rather than probes.

alexa_kfit

8-mer Inference

To test for differential 8-mer affinity across conditions between a baseline condition (e.g. the wild type HOXC9 TF) and all other conditions, we make a call to upbm::kmerTestContrast. Other methods for inference are also defined in upbm::kmerTestAffinity and upbm::kmerTestSpecificity. Note that all differential affinity and specificity calls are relative to the baseline condition specified above.

alexa_kdiff <- kmerTestContrast(alexa_kfit)

The resulting object is again a SummarizedExperiment with assays containing the results of differential affinity testing.

alexa_kdiff

K-mer level averages and differences between the baseline and all other conditions are included in the contrastAverage and contrastDifference assays. FDR-adjusted p-values are included in the contrastQ assay. We can tidy the results from a PBMExperiment to a tibble using the broom::tidy function.

diffdat <- broom::tidy(alexa_kdiff, assay = c("contrastAverage",
                                              "contrastDifference",
                                              "contrastQ"))
diffdat <- dplyr::filter(diffdat, !is.na(contrastAverage))
diffdat

Assays are now contained in columns with rows corresponding to single samples and rows from the PBMExperiment. The tidy function can be used with any of the PBMExperiment objects above. More details are provided in vignette("other-tidydata"). Using the tidy data, plot the results of differential testing is now relatively straightforward. We can create a standard MA plot by plotting the contrastAverage and contrastDifference columns as the X and Y axes.

suppressPackageStartupMessages(library("ggplot2"))

ggplot(diffdat, aes(x = contrastAverage,
                    y = contrastDifference,
                    color = contrastQ < 0.001)) +
    geom_point(alpha = 1/10) +
    geom_hline(yintercept = 0) + 
    scale_color_brewer(palette = "Set1", direction = -1) +
    theme_bw() +
    guides(color = guide_legend(override.aes = list(alpha = 1))) +
    facet_grid(. ~ cname) +
    ggtitle("HOXC9 differential 8-mer affinity MA plot")

Additionally, we can also take a look at the distribution of 8-mer-level differential affinities between conditions.

ggplot(diffdat, aes(x = contrastDifference,
                    fill = contrastQ < 0.001)) +
    geom_histogram(color = 'black', binwidth = .05, boundary = 0) +
    geom_vline(xintercept = 0) + 
    scale_fill_brewer(palette = "Set1", direction = -1) +
    theme_bw() +
    guides(color = guide_legend(override.aes = list(alpha = 1))) +
    facet_grid(. ~ cname) +
    ggtitle("HOXC9 differential 8-mer-level affinity")

Using the tabular data, we can also identify the set of 8-mers with largest differences between the two conditions.

suppressPackageStartupMessages(library("dplyr"))

diffdat %>%
    dplyr::top_n(10, abs(contrastDifference)) %>%
    dplyr::arrange(contrastDifference) %>%
    dplyr::select(cname, seq, contrastDifference, contrastQ)

Other Vignettes

More details on various components of the package are provided in several other vignettes included with the package.

vignette("upbm-classes"): Details on the two uPBM classes defined in the package.
vignette("analysis-preprocessing"): Details on the pre-processing procedures.
vignette("analysis-summarization"): Details on the steps for summarizing probes to 8-mers.
vignette("analysis-inference"): Details on the various K-mer-level inference procedures.
vignette("other-qcplots"): Details on various QC plots useful when evaluating the quality of PBM data.
vignette("other-tidydata"): Details on how data can be tidied from PBMExperiment to tibble format.