knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
The BioPlex project uses affinity-purification mass spectrometry to profile protein-protein interactions (PPIs) in human cell lines.
To date, the BioPlex project has created two proteome-scale, cell-line-specific PPI networks. The first, BioPlex 3.0, results from affinity purification of 10,128 human proteins (half the proteome) in 293T cells and includes 118,162 interactions among 14,586 proteins. The second results from 5,522 immunoprecipitations in HCT116 cells and includes 70,966 interactions between 10,531 proteins.
For more information, please see:
The BioPlex R package implements access to the BioPlex protein-protein interaction networks and related resources from within R. Besides protein-protein interaction networks for 293T and HCT116 cells, this includes access to CORUM protein complex data, and transcriptome and proteome data for the two cell lines.
Functionality focuses on importing these data resources and storing them in dedicated Bioconductor data structures, as a foundation for integrative downstream analysis of the data. For a set of downstream analyses and applications, please see the BioPlexAnalysis package and analysis vignettes.
To install the package, start R and enter:
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BioPlex")
After the installation, we proceed by loading the package and additional packages used in the vignette.
library(BioPlex)
library(AnnotationHub)
library(ExperimentHub)
library(graph)
Connect to AnnotationHub:
ah <- AnnotationHub::AnnotationHub()
Connect to ExperimentHub:
eh <- ExperimentHub::ExperimentHub()
OrgDb package for human:
orgdb <- AnnotationHub::query(ah, c("orgDb", "Homo sapiens"))
orgdb <- orgdb[[1]]
orgdb
keytypes(orgdb)
Available networks include:
Let's get the latest version of the 293T PPI network:
bp.293t <- getBioPlex(cell.line = "293T", version = "3.0")
head(bp.293t)
nrow(bp.293t)
Each row corresponds to a PPI between a bait protein A and a prey protein B, for which NCBI Entrez Gene IDs, Uniprot IDs, and gene symbols are annotated. The last three columns reflect the likelihood that each interaction resulted from either an incorrect protein identification (pW), background (pNI), or a bona fide interacting partner (pInt) as determined using the CompPASS algorithm.
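The pInt column can serve as a simple confidence filter. The sketch below uses a small made-up table rather than the real network; the column names SymbolA, SymbolB, pW, pNI, and pInt are assumed to match the output of getBioPlex, and the cutoff of 0.9 is an arbitrary choice for illustration.

```r
# Toy stand-in for a BioPlex interaction table (all values are made up).
toy.ppi <- data.frame(
    SymbolA = c("BAIT1", "BAIT1", "BAIT2"),
    SymbolB = c("PREY1", "PREY2", "PREY3"),
    pW      = c(0.01, 0.20, 0.00),
    pNI     = c(0.04, 0.50, 0.02),
    pInt    = c(0.95, 0.30, 0.98),
    stringsAsFactors = FALSE
)

# Keep only interactions with a high probability of being bona fide.
# The threshold is hypothetical; choose one that suits the analysis.
high.conf <- subset(toy.ppi, pInt >= 0.9)
high.conf
```

The same subset call should work on the real table, provided the column names are as assumed.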
Analogously, we can obtain the latest version of the HCT116 PPI network:
bp.hct116 <- getBioPlex(cell.line = "HCT116", version = "1.0")
head(bp.hct116)
nrow(bp.hct116)
The protein-to-gene mappings from BioPlex (i.e. UNIPROT-to-SYMBOL and UNIPROT-to-ENTREZID) are based on the mappings available from Uniprot at the time of publication of the BioPlex 3.0 networks.
We can update those based on Bioc annotation functionality:
bp.293t.remapped <- getBioPlex(cell.line = "293T", version = "3.0", remap.uniprot.ids = TRUE)
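Conceptually, remapping replaces the stored gene annotation with a fresh lookup keyed on the UniProt ID. The following base R sketch illustrates the idea with a made-up lookup table; it is not the package's implementation, and all IDs, symbols, and object names are hypothetical.

```r
# Toy interaction table carrying a possibly outdated symbol (made-up IDs).
toy.remap <- data.frame(
    UniprotA = c("U0001", "U0002", "U0003"),
    SymbolA  = c("OLDNAME1", "GENE2", "GENE3"),
    stringsAsFactors = FALSE
)

# Hypothetical up-to-date UniProt-to-symbol mapping (U0003 is missing).
current.map <- data.frame(
    UNIPROT = c("U0001", "U0002"),
    SYMBOL  = c("NEWNAME1", "GENE2"),
    stringsAsFactors = FALSE
)

# Look up the current symbol for each UniProt ID;
# IDs without a match keep their original symbol.
ind <- match(toy.remap$UniprotA, current.map$UNIPROT)
toy.remap$SymbolA <- ifelse(is.na(ind),
                            toy.remap$SymbolA,
                            current.map$SYMBOL[ind])
toy.remap
```

Keeping the original symbol for unmatched IDs is a design choice of this sketch; setting them to NA instead would make unmapped proteins easier to spot.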
We can also represent a given version of the BioPlex PPI network for a given cell line as one big graph in which each bait-prey relationship is represented by a directed edge from bait to prey.
bp.gr <- bioplex2graph(bp.293t)
bp.gr
head(graph::nodeData(bp.gr))
head(graph::edgeData(bp.gr))
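To make the bait-to-prey representation concrete, here is a minimal sketch that builds a directed graphNEL from a toy from-to matrix using ftM2graphNEL from the graph package. The protein IDs are made up, and the sketch omits the node and edge annotation that bioplex2graph attaches.

```r
library(graph)

# Toy bait-prey pairs (made-up protein IDs), one row per interaction.
toy.edges <- cbind(
    bait = c("B1", "B1", "B2"),
    prey = c("P1", "P2", "B1")
)

# Build a directed graph with one edge from each bait to its prey.
toy.gr <- ftM2graphNEL(toy.edges, edgemode = "directed")

nodes(toy.gr)
edges(toy.gr)[["B1"]]   # preys detected for bait B1
```

Note that a protein can be a bait in one pull-down and a prey in another (B1 above), which is why edge direction carries information.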
We can easily add PFAM domain annotations to the node metadata:
bp.gr <- annotatePFAM(bp.gr, orgdb)
head(graph::nodeData(bp.gr, graph::nodes(bp.gr), "PFAM"))
Obtain the complete set of human protein complexes from CORUM:
all <- getCorum(set = "all", organism = "Human")
dim(all)
colnames(all)
all[1:5, 1:5]
Core set of complexes:
core <- getCorum(set = "core", organism = "Human")
dim(core)
Complexes with splice variants:
splice <- getCorum(set = "splice", organism = "Human")
dim(splice)
The protein-to-gene mappings from CORUM (i.e. UNIPROT-to-SYMBOL and UNIPROT-to-ENTREZID) might not be fully up-to-date.
We can update those based on Bioc annotation functionality:
core.remapped <- getCorum(set = "core", organism = "Human", remap.uniprot.ids = TRUE)
We can represent the CORUM complexes as a list of character vectors. The names of the list are the complex IDs/names, and each element of the list is a vector of UniProt IDs for each complex.
core.list <- corum2list(core, subunit.id.type = "UNIPROT")
head(core.list)
length(core.list)
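The list representation can be pictured with a small base R sketch. The table below is made up, and both the semicolon-separated subunit column and the naming scheme for list elements are assumptions for illustration, not necessarily what corum2list does internally.

```r
# Toy CORUM-like table (made-up complexes and subunit IDs).
toy.corum <- data.frame(
    ComplexID   = c(1, 2),
    ComplexName = c("ComplexA", "ComplexB"),
    subunits    = c("U1;U2;U3", "U2;U4"),
    stringsAsFactors = FALSE
)

# Split the subunit string of each complex into a character vector.
toy.list <- strsplit(toy.corum$subunits, ";", fixed = TRUE)

# Name each element by complex ID and name (hypothetical naming scheme).
names(toy.list) <- paste0("CORUM", toy.corum$ComplexID, "_",
                          toy.corum$ComplexName)

toy.list
lengths(toy.list)   # number of subunits per complex
```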
We can also represent the CORUM complexes as a list of graph instances, where all nodes of a complex are connected to all other nodes of that complex with undirected edges.
core.glist <- corum2graphlist(core, subunit.id.type = "UNIPROT")
head(core.glist)
length(core.glist)
core.glist[[1]]@graphData
graph::nodeData(core.glist[[1]])
Note that we can easily convert a graph object into an igraph object using igraph::graph_from_graphnel.
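As a rough illustration of the all-against-all wiring, the following sketch builds a complete undirected graph for one toy complex using ftM2graphNEL from the graph package. The subunit IDs are made up, and the sketch leaves out the complex-level and node-level annotation that corum2graphlist adds.

```r
library(graph)

# Made-up subunit IDs of one toy complex.
subunits <- c("S1", "S2", "S3", "S4")

# Enumerate every unordered pair of subunits (one row per pair).
subunit.pairs <- t(combn(subunits, 2))

# Connect each pair with an undirected edge.
toy.complex <- ftM2graphNEL(subunit.pairs, V = subunits,
                            edgemode = "undirected")

numNodes(toy.complex)
numEdges(toy.complex)
```

A complex with n subunits yields choose(n, 2) edges under this representation, so large complexes contribute many edges.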
Genomic data from whole-genome sequencing for six different lineages of the human embryonic kidney HEK293 cell line can be obtained from hek293genome.org.
This includes copy number variation (CNV) data for the 293T cell line. Available CNV tracks include (i) CNV regions inferred from sequencing read-depth analysis, and (ii) CNV regions inferred from Illumina SNP arrays.
Here, we retrieve CNV segments derived by applying a hidden Markov model (HMM) to sequencing-inferred copy numbers in 2 kbp windows. More details on how copy numbers were calculated can be found in the primary publication.
cnv.hmm <- getHEK293GenomeTrack(track = "cnv.hmm", cell.line = "293T")
cnv.hmm
See also the data checks vignette, Section 5, for an exploration of the agreement between inferred copy numbers from both assay types (SNP arrays vs. sequencing).
Obtain transcriptome data for 293T cells from GEO dataset: GSE122425.
se <- getGSE122425()
se
head(assay(se, "raw"))
head(assay(se, "rpkm"))
colData(se)
rowData(se)
The dataset includes three wild type samples and three NSUN2 knockout samples.
See also the data checks vignette, Section 7, for an exploration of the relationship between expression level and the frequency of a protein being detected as prey.
RNA-seq data for 934 cancer cell lines (incl. HCT116) from the Cancer Cell Line Encyclopedia is available from the ArrayExpress-ExpressionAtlas (Accession: E-MTAB-2770).
The data can be obtained as a SummarizedExperiment using the ExpressionAtlas package.
ccle.trans <- ExpressionAtlas::getAtlasExperiment("E-MTAB-2770")
See also the Transcriptome-Proteome analysis vignette for further exploration of the correlation between CCLE HCT116 transcript and protein expression.
RNA-seq data for 675 commonly used human cancer cell lines (incl. HCT116) from Klijn et al., 2015 is available from the ArrayExpress-ExpressionAtlas (Accession: E-MTAB-2706).
The data can be obtained as a SummarizedExperiment using the ExpressionAtlas package.
klijn <- ExpressionAtlas::getAtlasExperiment("E-MTAB-2706")
See also the Transcriptome-Proteome analysis vignette for further exploration of differential transcript and protein expression between 293T and HCT116 cells.
For the inference of differential exon usage between cell lines, raw RNA-seq read counts on exon level can be obtained from ExperimentHub.
RNA-seq data for 293T cells was obtained from GEO accession GSE122633 and RNA-seq data for HCT116 cells was obtained from GEO accession GSE52429.
The data can be obtained as a DEXSeqDataSet which is a SummarizedExperiment-derivative and can be accessed and manipulated very much like a DESeqDataSet.
AnnotationHub::query(eh, c("BioPlex"))
dex <- eh[["EH7563"]]
dex
We take a closer look at the sample annotation, the counts for each exon for both cell lines, and the genomic coordinates and additional annotation for each exon.
DEXSeq::sampleAnnotation(dex)
head(DEXSeq::featureCounts(dex))
rowRanges(dex)
Pull the CCLE proteome data from ExperimentHub. The dataset profiles 12,755 proteins by mass spectrometry across 375 cancer cell lines.
AnnotationHub::query(eh, c("gygi", "depmap"))
ccle.prot <- eh[["EH3459"]]
ccle.prot <- as.data.frame(ccle.prot)
Explore the data:
dim(ccle.prot)
colnames(ccle.prot)
head(ccle.prot)
Restrict to HCT116:
ccle.prot.hct116 <- subset(ccle.prot, cell_line == "HCT116_LARGE_INTESTINE")
dim(ccle.prot.hct116)
head(ccle.prot.hct116)
Or turn into a SummarizedExperiment for convenience (we can restrict this to selected cell lines, but here we keep all cell lines):
se <- ccleProteome2SummarizedExperiment(ccle.prot, cell.line = NULL)
assay(se)[1:5, 1:5]
assay(se)[1:5, "HCT116"]
rowData(se)
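At its core, this conversion reshapes a long table (one row per protein and cell line) into a protein-by-cell-line matrix. A rough base R sketch on made-up data follows; the column names protein and expression are hypothetical, and the package function additionally wraps the result in a SummarizedExperiment with row annotation.

```r
# Toy long-format table: one row per protein and cell line (made-up values).
toy.long <- data.frame(
    protein    = c("PROT1", "PROT2", "PROT1", "PROT2"),
    cell_line  = c("CL_A", "CL_A", "CL_B", "CL_B"),
    expression = c(0.5, -1.2, 0.1, 2.0),
    stringsAsFactors = FALSE
)

# Rows become proteins, columns become cell lines.
# Repeated measurements of the same pair would be averaged.
toy.wide <- tapply(toy.long$expression,
                   list(toy.long$protein, toy.long$cell_line),
                   FUN = mean)

toy.wide
toy.wide[, "CL_B"]   # expression profile of one cell line
```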
The BioPlex 3.0 publication, Supplementary Table S4A, provides relative protein expression data comparing 293T and HCT116 cells based on tandem mass tag analysis.
bp.prot <- getBioplexProteome()
assay(bp.prot)[1:5, 1:5]
colData(bp.prot)
rowData(bp.prot)
The data contain five replicates each for 293T and for HCT116 cells. As a result of the data collection process, the data represent relative protein abundance scaled to add up to 100% in each row.
See also the data checks vignette, Section 8, for a basic exploration of the annotated differential expression measures.
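The row-wise scaling can be illustrated with a small made-up matrix. This is only a sketch of what scaling each row to 100% means; the sample names and intensities are hypothetical, and the actual data come already scaled.

```r
# Made-up raw intensities: 2 proteins x 4 samples.
toy.raw <- matrix(c(10, 30, 20, 40,
                     5,  5,  5, 35),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("PROT1", "PROT2"),
                                  c("293T_1", "293T_2",
                                    "HCT116_1", "HCT116_2")))

# Scale each row so that its values add up to 100.
# Division recycles rowSums() down the columns, i.e. per row.
toy.rel <- 100 * toy.raw / rowSums(toy.raw)

toy.rel
rowSums(toy.rel)
```

Because each row is scaled independently, values are comparable across samples for the same protein, but not between different proteins.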
Note that calling functions like getCorum or getBioPlex with argument cache = FALSE will automatically overwrite the corresponding object in your cache. It is thus typically not required for a user to interact with the cache.
For more extended control of the cache, use from within R:
cache.dir <- tools::R_user_dir("BioPlex", which = "cache")
bfc <- BiocFileCache::BiocFileCache(cache.dir)
and then proceed as described in the BiocFileCache vignette, Section 1.10, either via cleanbfc() to clean or removebfc() to remove your cache.
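To get a feel for these cache operations without touching the real BioPlex cache, the sketch below creates a throwaway cache in a temporary directory, inspects it, and removes it again. The directory and object names are made up for illustration.

```r
library(BiocFileCache)

# Use a throwaway cache in a temporary directory (not the BioPlex cache).
tmp.dir <- tempfile("toy-cache-")
toy.bfc <- BiocFileCache(tmp.dir, ask = FALSE)

# List the cached resources and count them.
bfcinfo(toy.bfc)
n.resources <- bfccount(toy.bfc)
n.resources

# Remove the whole cache without prompting.
removebfc(toy.bfc, ask = FALSE)
```

Calling bfcinfo on the real cache before removing anything is a cheap way to check what would be lost.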
To do a hard reset (use with caution!):
BiocFileCache::removebfc(bfc)
sessionInfo()