knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)

Setup

The BioPlex project uses affinity-purification mass spectrometry to profile protein-protein interactions (PPIs) in human cell lines.

To date, the BioPlex project has created two proteome-scale, cell-line-specific PPI networks. The first, BioPlex 3.0, results from affinity purification of 10,128 human proteins —- half the proteome —- in 293T cells and includes 118,162 interactions among 14,586 proteins. The second results from 5,522 immunoprecipitations in HCT116 cells and includes 70,966 interactions between 10,531 proteins.

For more information, please see:

The BioPlex R package implements access to the BioPlex protein-protein interaction networks and related resources from within R. Besides protein-protein interaction networks for 293T and HCT116 cells, this includes access to CORUM protein complex data, and transcriptome and proteome data for the two cell lines.

Functionality focuses on importing these data resources and storing them in dedicated Bioconductor data structures, as a foundation for integrative downstream analysis of the data. For a set of downstream analyses and applications, please see the BioPlexAnalysis package and analysis vignettes.

Installation

To install the package, start R and enter:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BioPlex")

After the installation, we proceed by loading the package and additional packages used in the vignette.

library(BioPlex)
library(AnnotationHub)
library(ExperimentHub)
library(graph)

Data resources

General

Connect to AnnotationHub:

ah <- AnnotationHub::AnnotationHub()

Connect to ExperimentHub:

eh <- ExperimentHub::ExperimentHub()

OrgDb package for human:

orgdb <- AnnotationHub::query(ah, c("orgDb", "Homo sapiens"))
orgdb <- orgdb[[1]]
orgdb
keytypes(orgdb)

BioPlex PPIs

Available networks include:

Let's get the latest version of the 293T PPI network:

bp.293t <- getBioPlex(cell.line = "293T", version = "3.0")
head(bp.293t)
nrow(bp.293t)

Each row corresponds to a PPI between a bait protein A and a prey protein B, for which NCBI Entrez Gene IDs, Uniprot IDs, and gene symbols are annotated. The last three columns reflect the likelihood that each interaction resulted from either an incorrect protein identification (pW), background (pNI), or a bona fide interacting partner (pInt) as determined using the CompPASS algorithm.

Analgously, we can obtain the latest version of the HCT116 PPI network:

bp.hct116 <- getBioPlex(cell.line = "HCT116", version = "1.0")
head(bp.hct116)
nrow(bp.hct116)

ID mapping

The protein-to-gene mappings from BioPlex (i.e. UNIPROT-to-SYMBOL and UNIPROT-to-ENTREZID) are based on the mappings available from Uniprot at the time of publication of the BioPlex 3.0 networks.

We can update those based on Bioc annotation functionality:

bp.293t.remapped <- getBioPlex(cell.line = "293T",
                               version = "3.0",
                               remap.uniprot.ids = TRUE)

Data structures for BioPlex PPIs

We can also represent a given version of the BioPlex PPI network for a given cell line as one big graph where bait and prey relationship are represented by directed edges from bait to prey.

bp.gr <- bioplex2graph(bp.293t)
bp.gr
head(graph::nodeData(bp.gr))
head(graph::edgeData(bp.gr))

PFAM domains

We can easily add PFAM domain annotations to the node metadata:

bp.gr <- annotatePFAM(bp.gr, orgdb)
head(graph::nodeData(bp.gr, graph::nodes(bp.gr), "PFAM"))

CORUM complexes

Obtain the complete set of human protein complexes from CORUM:

all <- getCorum(set = "all", organism = "Human")
dim(all)
colnames(all)
all[1:5, 1:5]

Core set of complexes:

core <- getCorum(set = "core", organism = "Human")
dim(core)

Complexes with splice variants:

splice <- getCorum(set = "splice", organism = "Human")
dim(splice)

ID mapping

The protein-to-gene mappings from CORUM (i.e. UNIPROT-to-SYMBOL and UNIPROT-to-ENTREZID) might not be fully up-to-date.

We can update those based on Bioc annotation functionality:

core.remapped <- getCorum(set = "core", 
                          organism = "Human",
                          remap.uniprot.ids = TRUE)

Data structures for CORUM complexes

We can represent the CORUM complexes as a list of character vectors. The names of the list are the complex IDs/names, and each element of the list is a vector of UniProt IDs for each complex.

core.list <- corum2list(core, subunit.id.type = "UNIPROT")
head(core.list)
length(core.list)

We can also represent the CORUM complexes as a list of graph instances, where all nodes of a complex are connected to all other nodes of that complex with undirected edges.

core.glist <- corum2graphlist(core, subunit.id.type = "UNIPROT")
head(core.glist)
length(core.glist)
core.glist[[1]]@graphData
graph::nodeData(core.glist[[1]])

Note that we can easily convert a graph object into an igraph object using igraph::graph_from_graphnel.

CNV data

HEK293T cells

Genomic data from whole-genome sequencing for six different lineages of the human embryonic kidney HEK293 cell line can be obtained from hek293genome.org.

This includes copy number variation (CNV) data for the 293T cell line. Available CNV tracks include (i) CNV regions inferred from sequencing read-depth analysis, and (ii) CNV regions inferred from Illumina SNP arrays.

Here, we obtain CNV segments obtained from applying a hidden Markov model (HMM) to sequencing-inferred copy numbers in 2kbp windows. More details on how copy numbers were calculated can be obtained from the primary publication.

cnv.hmm <- getHEK293GenomeTrack(track = "cnv.hmm", cell.line = "293T")
cnv.hmm

See also the data checks vignette, Section 5 for an exploration of the agreement between inferred copy numbers from both assay types (SNP arrays vs. sequencing).

Transcriptome data

HEK293T cells

GSE122425

Obtain transcriptome data for 293T cells from GEO dataset: GSE122425.

se <- getGSE122425()
se
head(assay(se, "raw"))
head(assay(se, "rpkm"))
colData(se)
rowData(se)

The dataset includes three wild type samples and three NSUN2 knockout samples.

See also the data checks vignette, Section 7 for an exploration of the relationship between expression level and the frequency of a protein being detected as prey.

HCT116 cells

Cancer Cell Line Encyclopedia (CCLE)

RNA-seq data for 934 cancer cell lines (incl. HCT116) from the Cancer Cell Line Encyclopedia is available from the ArrayExpress-ExpressionAtlas (Accession: E-MTAB-2770).

The data can be obtained as a SummarizedExperiment using the ExpressionAtlas package.

ccle.trans <- ExpressionAtlas::getAtlasExperiment("E-MTAB-2770")

See also the Transcriptome-Proteome analysis vignette for further exploration of the correlation between CCLE HCT116 transcript and protein expression.

Klijn et al., 2015

RNA-seq data of 675 commonly used human cancer cell lines (incl. HCT116) from Klijn et al., 2015 is available from the ArrayExpress-ExpressionAtlas (Accession: E-MTAB-2706)

The data can be obtained as a SummarizedExperiment using the ExpressionAtlas package.

klijn <- ExpressionAtlas::getAtlasExperiment("E-MTAB-2706")

See also the Transcriptome-Proteome analysis vignette for further exploration of differential transcript and protein expression between 293T and HCT116 cells.

Splicing data

For the inference of differential exon usage between cell lines, raw RNA-seq read counts on exon level can be obtained from ExperimentHub.

RNA-seq data for 293T cells was obtained from GEO accession GSE122633 and RNA-seq data for HCT116 cells was obtained from GEO accession GSE52429.

The data can be obtained as a DEXSeqDataSet which is a SummarizedExperiment-derivative and can be accessed and manipulated very much like a DESeqDataSet.

AnnotationHub::query(eh, c("BioPlex"))
dex <- eh[["EH7563"]]
dex

We take a closer look at the sample annotation, the counts for each exon for both cell lines, and the genomic coordinates and additional annotation for each exon.

DEXSeq::sampleAnnotation(dex)
head(DEXSeq::featureCounts(dex))
rowRanges(dex)

Proteome data

CCLE

Pull the CCLE proteome data from ExperimentHub. The dataset profiles 12,755 proteins by mass spectrometry across 375 cancer cell lines.

AnnotationHub::query(eh, c("gygi", "depmap"))
ccle.prot <- eh[["EH3459"]]
ccle.prot <- as.data.frame(ccle.prot)

Explore the data:

dim(ccle.prot)
colnames(ccle.prot)
head(ccle.prot)

Restrict to HCT116:

ccle.prot.hct116 <- subset(ccle.prot, cell_line == "HCT116_LARGE_INTESTINE")
dim(ccle.prot.hct116)
head(ccle.prot.hct116)

Or turn into a SummarizedExperiment for convenience (we can restrict this to selected cell lines, but here we keep all cell lines):

se <- ccleProteome2SummarizedExperiment(ccle.prot, cell.line = NULL)
assay(se)[1:5, 1:5]
assay(se)[1:5, "HCT116"]
rowData(se)

Relative protein expression data from BioPlex3.0

The BioPlex 3.0 publication, Supplementary Table S4A, provides relative protein expression data comparing 293T and HCT116 cells based on tandem mass tag analysis.

bp.prot <- getBioplexProteome()
assay(bp.prot)[1:5,1:5]
colData(bp.prot)
rowData(bp.prot)

The data contains 5 replicates each for 293T and for HCT116 cells. As a result of the data collection process, the data represent relative protein abundance scaled to add up to 100% in each row.

See also the data checks vignette, Section 8 for a basic exploration of the annotated differential expression measures.

Caching

Note that calling functions like getCorum or getBioPlex with argument cache = FALSE will automatically overwrite the corresponding object in your cache. It is thus typically not required for a user to interact with the cache.

For more extended control of the cache, use from within R:

cache.dir <- tools::R_user_dir("BioPlex", which = "cache") 
bfc <- BiocFileCache::BiocFileCache(cache.dir)

and then proceed as described in the BiocFileCache vignette, Section 1.10 either via cleanbfc() to clean or removebfc() to remove your cache.

To do a hard reset (use with caution!):

BiocFileCache::removebfc(bfc)

SessionInfo

sessionInfo()


ccb-hms/BioPlex documentation built on Aug. 5, 2023, 9:15 p.m.