SubCellBarCode can be installed through BiocManager package as follows:
``` {r install,eval = FALSE} if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install("SubCellBarCode")
# Load the package ```r library(SubCellBarCode)
As example data we here provide the publicly available HCC827 (human lung 
adenocarcinoma cell line) TMT10plex labelled proteomics dataset 
(Orre et al. 2019, Molecular Cell). The data.frame consists of 
10480 proteins as rows (rownames are gene -centric protein ids) 
and 5 fractions with duplicates as columns 
(replicates must be named ".A." and ".B.", repectively). 
head(hcc827Ctrl)
The classification of protein localisation using the SubCellBarCode 
method is dependent on 3365 marker proteins as defined in Orre et al. 
The markerProteins data.frame contain protein names (gene symbol), 
associated subcellular localization (compartment), color code for 
the compartment and the median normalized fractionation
profile (log2) based on five different human cell lines 
(NCI-H322, HCC827, MCF7, A431 and U251) here called the 
“5CL marker profile”.
head(markerProteins)
Input data.frame is checked with "NA" values and for the correct 
format. If there is any "NA" value, corresponding row is deleted. 
Then, data frame is log2 transformmed. 
df <- loadData(protein.data = hcc827Ctrl)
cat(dim(df)) head(df)
Additional step:
We use gene symbols for the protein identification. Therefore, we require 
gene symbols for the identifiaction. However, if the input data has 
other identifier e.g. UNIPROT, IPI, Entrez ID, you can convert it to gene
symbol by our defined function. 
Please be aware of possible (most likely few) id loss during the 
conversion to one another.
##Run if you have another identifier than gene symbols. ##Function will convert UNIPROT identifier to gene symbols. ##Deafult id is "UNIPROT", make sure you change it if you use another. #df <- convert2symbol(df = df, id = "UNIPROT")
For the downstream analysis, we used the randomly selected subset data.
set.seed(2) df <- df[sample(nrow(df), 6000),]
The overlap between marker proteins (3365) and input data.frame is calculated and visualized for each compartment by a bar plot.
Note that we recommend at least 20% coverage of marker proteins for each compartment. If certain compartments are underrrepresented we recommend you to perform the cell fractionation again. If all compartments are low in coverage we recommend increasing the analytical depth of the MS-analysis.
c.prots <- calculateCoveredProtein(proteinIDs = rownames(df), markerproteins = markerProteins[,1])
To avoid reduced classification accuracy, marker proteins with noisy quantification and marker proteins that are not representative of their associated compartment (e.g.due to cell type specific localization) are filtered out by a two-step quality control.
Marker proteins with pearson correlations less than 0.8 between A and B duplicates for each cell line were filtered out (Figure A).
Pairwise correlations between 5CL marker profile and input data for each protein (A and B replicate experiments separately) were calculated using both Pearson and Spearman correlation. The lowest value for each method were then used for filtering with cut-offs set to 0.8 and 0.6 respectively, to exclude non-representative marker proteins (Figure B).
r.markers <- markerQualityControl(coveredProteins = c.prots,protein.data = df) r.markers[1:5]
Optional step: After removing non-marker proteins, you can re-calculate and visualize the final coverage of the marker proteins.
## uncomment the function when running # f.prots <- calculateCoveredProtein(r.markers, markerProteins[,1])
The spatial distribution of the marker proteins is visualized in t-SNE map. This plot will be informative for the quality control of the generated data as it offers evaluation of the spatial distribution and separation of marker proteins.
#Default parameters #Run the broad-range parameters to optimize well !!! #Output dimensionality #dims = 3 #Speed/accuracy trade-off (increase for less accuracy) #theta = c(0.1, 0.2, 0.3, 0.4, 0.5) #Perplexity parameter #perplexity = c(5, 10, 20, 30, 40, 50, 60)
Information about the different t-SNE parameters that can be 
modified by the user is available by typing ?Rtsne in the console.  
Although the applications of t-SNE is widespread in the field 
of machine learning, it can be misleading if it is not well 
optimized. Therefore, we optimize t-SNE map by grid search, 
a process that can take some time 
set.seed(6) tsne.map <- tsneVisualization(protein.data = df, markerProteins = r.markers, dims = 3, theta = c(0.1), perplexity = c(60))
We recommend 3D vizualisation by setting dims = 3, 
for optimal evaluation of marker protein cluster separation 
and data modularity. You can also visualize the marker 
proteins in 2 dimensional space by setting dims = 2, although 
reducing the dimensionality results in loss of information and 
underestimation of data resolution. 
set.seed(9) tsne.map2 <- tsneVisualization(protein.data = df, markerProteins = r.markers, dims = 2, theta = c(0.5), perplexity = c(60))
For replicate A and B separately, marker proteins are used for training a Support Vector Machine (SVM) classifier with a Gaussian radial basis function kernel algorithm. After tuning the parameters, the SVM model predicts (classifies) the subcellular localization for all proteins in the input data with corresponding probabilities for A and B replicate classification.
set.seed(4) cls <- svmClassification(markerProteins = r.markers, protein.data = df, markerprot.df = markerProteins)
# testing data predictions for replicate A and B test.A <- cls[[1]]$svm.test.prob.out test.B <- cls[[2]]$svm.test.prob.out head(test.A)
# all predictions for replicate A and B all.A <- cls[[1]]$all.prot.pred all.B <- cls[[2]]$all.prot.pred
Classification probabilities close to 1 indicate high confidence predictions, whereas probabilities close to 0 indicate low confidence predictions. To increase the overal prediction accuracy and to filter out poor predictions, one criterion and two cut-offs are defined.
The criterion is the consensus of preliminary predictions between biological duplicates. Proteins are kept in the analysis, if there is an agreement between biological duplicates. Subsequently, prediction probabilities from the two duplicates are averaged for each protein.
Cut-off is (precision - based) set when precision reach 0.9 in the test data.
Cut-off is (recall - based) set as the probability of the lowest true positive in the test data.
t.c.df <- computeThresholdCompartment(test.repA = test.A, test.repB = test.B)
head(t.c.df)
The determined thresholds for the compartment levels are applied to all classifications.
c.cls.df <- applyThresholdCompartment(all.repA = all.A, all.repB = all.B, threshold.df = t.c.df)
head(c.cls.df)
Compartment level classification probabilities are summed to neighborhood probabilities and thresholds for neighborhood analysis are estimated as described above for compartment level analysis except precision based cut-off is set to 0.95.
t.n.df <- computeThresholdNeighborhood(test.repA = test.A, test.repB = test.B)
head(t.n.df)
The determined thresholds for the neighborhood levels are applied to all classifications.
n.cls.df <- applyThresholdNeighborhood(all.repA = all.A, all.repB = all.B, threshold.df = t.n.df)
head(n.cls.df)
Individual classifications (compartment and neighborhood) are merged into one data frame.
cls.df <- mergeCls(compartmentCls = c.cls.df, neighborhoodCls = n.cls.df)
head(cls.df)
You can query one protein at a time to plot barcode of the protein of the interest.
PSM (Peptide-spectra-matching) count table is required for the plotting
SubCellBarCode. It is in data.frame format;
head(hcc827CtrlPSMCount)
plotBarcode(sampleClassification = cls.df, protein = "NLRP4", s1PSM = hcc827CtrlPSMCount)
To evaluate localization and of multiple proteins at the same time, a vector of proteins (identified by gene symbols) can be prepared and used to create a barplot showing the distribution of classifications across compartments and neighborhoods. This analysis could be helpful when evaluating co-localization of proteins, protein complex formation and compartmentalized protein level regulation.
# 26S proteasome complex (26s proteasome regulatory complex) proteasome26s <- c("PSMA7", "PSMC3","PSMA4", "PSMB4", "PSMB6", "PSMB5", "PSMC2","PSMC4", "PSMB3", "PSMA6","PSMC5","PSMC6") plotMultipleProtein(sampleClassification = cls.df, proteinList = proteasome26s)
Regulation of  protein localization is a the key 
process in cellular signalling. The SubCellBarCode 
method can be used for differential localization analysis 
given two conditions such as control vs treatment, cancer
cell vs normal cell, cell state A vs cell state B, etc. 
As example, we compared untreated and gefitinib (EGFR inhibitor) 
treated HCC827 cells (for details, see Orre et al.). 
Neighborhood classifications  for condition 1 (untreated) and 
condition 2 (gefitinib) is first done separately, and 
classifications for overlapping proteins are then vizualized by a 
sankey plot.   
The HCC827 gefitinib cell lines classification was embedded 
into the package for example analysis.
head(hcc827GEFClass)
sankeyPlot(sampleCls1 = cls.df, sampleCls2 = hcc827GEFClass)
As the differential localization analysis is an outlier analysis, 
it will include analytical noise. To filter out such noise, 
PSM (Peptide-spectra-matching) counts and fractionation profile 
correlation analysis (Pearson) was done to identify strong 
candidates. The PSM count format for the input have to be the 
same between the compared conditions;
head(hcc827CtrlPSMCount)
For each protein, the minimum PSM count between the two conditions 
is plotted against the fractionation profile (median) correlation 
between the two conditions. For proteins with different localizations 
between conditions, the fractionation profile differs and therefore 
we are expecting a low fractionation profile correlation. As a standard 
setting for filtering of analytical noise in the differential 
localization analysis we suggest to demand a fractionation profile 
correlation below 0.8, and a minimum PSM count of at least 3. 
However, for the exploratory analysis, you can adjust
the min.psm and pearson.cor parameters. 
##parameters #sampleCls1 = sample 1 classification output #s1PSM = sample 2 PSM count #s1Quant = Sample 1 Quantification data #sampleCls2 = sample 2 classification output #s2PSM = sample 2 classification output #sample2Quant = Sample 2 Quantification data #min.psm = minumum psm count #pearson.cor = perason correlation coefficient
candidate.df <- candidateRelocatedProteins(sampleCls1 = cls.df, s1PSM = hcc827CtrlPSMCount, s1Quant = hcc827Ctrl, sampleCls2 = hcc827GEFClass, s2PSM = hcc827GefPSMCount, s2Quant = hcc827GEF, min.psm = 2, pearson.cor = 0.8)
cat(dim(candidate.df))
head(candidate.df)
Candidate subset of differentially localizing proteins can 
be annotated with names by setting annotation = TRUE, min.psm 
and pearson.cor 
candidate2.df <- candidateRelocatedProteins(sampleCls1 = cls.df, s1PSM = hcc827CtrlPSMCount, s1Quant = hcc827Ctrl, sampleCls2 = hcc827GEFClass, s2PSM = hcc827GefPSMCount, s2Quant = hcc827GEF, annotation = TRUE, min.psm = 9, pearson.cor = 0.05)
Our main analysis is based on gene-centric. However, based on the analytical
depth of the data, peptide/exon/transcript centric classifications can
be performed. Moreover, this approach is also applicable for
post translation modification (PTM) enriched data.
Fundamentally, we will use the same classifiers, compartment 
and neighborhood levels thresholds and apply 
them to peptide/exon/transcript data. 
Here, we have provided subset of the HCC827 exon-centric data.
head(hcc827exon)
For the classification, we will use the gene-centric model that we built in section 3.7.
##recall the models modelA <- cls[[1]]$model modelB <- cls[[2]]$model
svmExternalData function greps replicates A and B, repectively. So 
the input data can include other features like, here, peptide counts. 
exon.cls <- svmExternalData(df = hcc827exon, modelA = modelA, modelB= modelB)
exon.A <- exon.cls[[1]] exon.B <- exon.cls[[2]]
head(exon.A)
Next steps are exactly same as what we did in section 3.7. We will use same thresholdsthat we have estimated in 3.7.1 and 3.7.3. Finally, we will merge two classifications same function used in 3.7.5.
exon.comp.cls <- applyThresholdCompartment(all.repA = exon.A[,2:17], all.repB = exon.B[,2:17], threshold.df = t.c.df) exon.neigh.cls <- applyThresholdNeighborhood(all.repA = exon.A[,2:17], all.repB = exon.B[,2:17], threshold.df = t.n.df) exon.cls.df <- mergeCls(compartmentCls = exon.comp.cls, neighborhoodCls = exon.neigh.cls) #same order exon.cls.df <- exon.cls.df[rownames(exon.A),] # we will add gene symbols as well as peptide count # (PSM count is also accepted) in case for comparing with # gene-centric classifications exon.cls.df$GeneSymbol <- exon.A$Gene_Symbol exon.cls.df$PeptideCount <- hcc827exon$PeptideCount
head(exon.cls.df)
The insteresting part of the exon classification is to detect deviated 
classification between gene centric classification. However, we enrich
more noise than gene centric classification due to analtyical dept of the exon
centric data. Therefore, we, additioanlly, internally calculate correlation
between gene-centric data hcc827Ctrl and exon-centric data hcc827exon, 
and report
number of peptides (PSM count works, too) that match to corresponding exon 
as an indication of the confidence of the results.
comp.df <- compareCls(geneCls = cls.df, exonCls = exon.cls.df)
head(comp.df)
You can easily vizualize/filter the results using correlations and number of peptides.
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.