knitr::opts_chunk$set(
  #collapse = TRUE,
  comment = "#>",
  fig.width = 4,
  fig.height = 4,
  message = FALSE,
  warning = FALSE,
  tidy.opts = list(
    keep.blank.line = TRUE,
    width.cutoff = 150
    ),
  options(width = 150),
  eval = TRUE
)

\newpage

1. Getting started with the assignment of proteins to subcellular locations using protlocassign

Introduction

This vignette provides an overview of using the protlocassign package for constrained proportional assignment. A more detailed guide is available in a companion set of seven tutorials in the package "protlocassigndoc"; details on installing that are given in the next section. Additional details may be found in a companion paper (DF Moore, D Sleat, and P Lobel, 2022).

Determining the locations of proteins in the cell is an important but complex problem. A frequently employed approach for this involves centrifugation-based methods to partially separate different organelles and other cellular compartments and then to determine the relative distribution of different proteins among the centrifugation fractions. The location of proteins of interest in the cellular compartments are then inferred by comparing their distributions across the centrifugation fractions to the distributions of a set of reference proteins (markers) with known cellular locations.

This package implements a subcellular protein assignment procedure known as “constrained proportional assignment”, or CPA (Jadot et al, 2017). The basic concepts involved in CPA and analysis of subcellular proteomics data are described in the main paper (Moore et al., 2022). This vignette serves as a basis for implementing CPA and various utilities for both experienced and beginner R users.

There are two fundamental inputs for CPA. The first is a file with information regarding the distribution of each protein or other species of interest to be analyzed across centrifugation fractions. Each row has a name that serves as a unique identifier, which we here refer to as a protein name, but one can use other identifiers (e.g., protein group, protein isoform, gene name, accession number, etc.). For each protein (or other identifier), there are data that reflect its distribution among the centrifugal fractions, which represent a profile. The second is a file containing a list of single-compartment reference proteins (markers) and their associated subcellular locations. As an initial step, the package uses the markers to compute profiles for individual cellular compartments. Then, for each protein, it finds the best match for its profile using a linear combination of compartment profiles. The relative weights of the linear combination in principle reflect the relative abundance of a given protein among different compartments. The CPA method can thus account for proteins that have multiple residences, and estimate the relative proportion among these residences.

We illustrate with an example. Tannous et al. (2020) presented an experiment (designated here as AT5) analyzing abundance levels of proteins across a total of nine fractions: six fractions (N, M, L1, L2, P, and S) from differential centrifugation of a rat liver homogenate and three fractions (Nyc1, Nyc2, and Nyc3) from a Nycodenz density gradient centrifugation of the differential fraction L1. Eight subcellular compartments were considered: nucleus (Nuc), mitochondria (Mito), lysosomes (Lyso), peroxisomes (Perox), endoplasmic reticulum (ER), Golgi apparatus (Golgi), plasma membrane (PM), and cytosol (Cyto). The CPA method assigns each protein to one or more of these compartments, based on profiles from the set of reference proteins.

Other Bioconductor packages are available for working with proteomics data. The "HPAanalyze" package (Tran et al., 2019) provides tools for accessing the Human Protein Atlas and for visualizing the data. The "pRoloc" package (Breckels et al.) provides tools for studying localisation of proteins inside cells and for manipulating and visualizing relevant data from quantitative fractionation experiments. In this vignette and in the tutorials in the accompanying "protlocassigndoc" package we provide detailed instructions on how to format data in ".csv" files for import in this package. Data stored in "pRoloc" or "hpa" R formats may be converted into the formats required by "protlocassign" in a similar fashion.

Installing the protlocassign package

First install the Bioconductor utility BiocManager from CRAN:

install.packages("BiocManager")

Next, install the protlocassign package from Bioconductor:

BiocManager::install("protlocassign")

Alternatively one may install a development version directly from Github as follows:

install.packages("devtools")
devtools::install_github("mooredf22/protlocassign")

An extensive set of companion supplementary tutorials may be installed directly from github:

devtools::install_github("mooredf22/protlocassigndoc")

Finally, install other required packages:

BiocManager::install(c("BB", "pracma", "lme4", "outliers", "BiocParallel"))

Once you have installed these packages, you do not need to re-install them the next time you start up R. However, you will need to load them using the library function before use.

Working with data expressed as normalized specific amounts (NSA)

To use this package, you will need two data sets as described above. One, the protein profile data set, contains rows with a unique identifier followed by data describing the profile associated with each identifier across a series of centrifugation fractions. In this tutorial, the profiles associated with each identifier are specific amounts with sums constrained to 1 (NSA, see Moore et al., 2002 and Tutorial 3) but can be in a different form or further transformed to improve the quality of the fit (see Tutorials 2-4). The protein profile data set may contain two additional values representing the numbers of peptides and spectra that were used to compute the profile (see below). The other data set consists of the list of reference proteins and their associated known subcellular compartments.

Consider for example a test data set drawn from the TMT MS2 data from Tannous et al. (2020) experiment AT5. The full data set has 7894 proteins

but we use a small subset of these for purposes of illustration and to reduce processing time. To get started, load the package and attach the embedded test protein profile data set; we see that it has 40 rows and 11 columns:

library(protlocassign)

data(protNSA_test)
dim(protNSA_test)

For the sake of brevity, we rename it protNSA:

protNSA <- protNSA_test

The first few rows of the data can be examined using the head command, rounding to improve legibility. The first nine columns represent the protein profile. The last two columns, which are optional, give the number of spectra and the number of peptides (sequences) for each protein. Note that while we use a nested random effects model described in Jadot et al., 2017 to compute the means across spectra (also see Tutorial 6), other methods, including taking a straight average or weighted average (e.g., based on reporter ion intensities or peak areas), may be appropriate for other applications.

round(head(protNSA), digits=2)

We can also use the “str” command as an alternate way to examine the structure of the protNSA data set. The first line returned indicates that protNSA is represented in R as a data frame with 40 rows and 11 variables. The remaining lines show each variable labeled according to its name (column name) followed by its type (here “num”) and the first few values, omitting row names that are present within the data frame.

str(protNSA)

The list containing reference proteins is derived from Jadot et al (2017) and examining the dimensions of the data reveals that it contains 37 rows and two columns, the first for the protein names and the second for their respective subcellular compartments.

data(markerListJadot)
dim(markerListJadot)

The data set can be viewed by entering its name as follows:

markerListJadot

While the protNSA_test and markerListJadot data sets are included in the package, they were initially read into R from external files and automatically converted to data frames. This will need to be done for any new data set. As an example, we demonstrate this for a case where one is working with the Windows operating system and the data sets reside in the directory C:\temp\myproteindata. (If one is working with either the Linux or Mac OS, appropriate changes to these procedures will be needed to access files.) One can set the working directory to point there, either by navigating to it in R studio (session menu pane) or by entering:

setwd("C:\\temp\\myproteindata")

Here, for for illustration, we shall instead create a temporary directory as follows:

currentDir <- tempdir()
setwd(currentDir)
currentDir

To illustrate how to read in the protein profiles and list of marker proteins, we first write out our protein profiles and marker list as comma-delimited files. For the protein profiles, we write out the protein names (which are row names in R) by specifying row.names=TRUE. Note that we first alter the protein names so that they are preceded by a single quote since, if one subsequently opens the file with Microsoft Excel, it could automatically reformat some names to dates (e.g., March1 to 1-Mar).
To write protein profiles and marker proteins to .csv files, input:

setwd(currentDir)
protNSAout <- protNSA
rownames(protNSAout) <- paste("'", rownames(protNSA), sep="")
markerListJadotOut <- markerListJadot
markerListJadotOut$protName <- paste("'", markerListJadot$protName, sep="")
write.csv(protNSAout, file="protNSAout.csv", row.names=TRUE)
write.csv(markerListJadotOut, file="markerListJadotOut.csv", row.names=FALSE)

We may examine these files by importing them into Excel. For the marker protein data file, the first row must contain the two column names (protein unique identifier and compartment designation). Note that the marker proteins must be specified precisely as listed in the protein profile data set.

We then read in the two data sets (protein profile data and reference protein list), which must be in .csv format, with the first row containing column names. The option row.names=1 takes the first column of the protNSAout.csv file and uses it as row names for the R file MyProtNSAin.

setwd(currentDir)
MyProtNSAin <- read.csv(file="protNSAout.csv", row.names=1)
MyMarkerListIn <- read.csv(file="markerListJadotOut.csv")

We then remove the preceding single quotes from the protein names.

rownames(MyProtNSAin) <- sub("^'", "", rownames(MyProtNSAin))
MyMarkerListIn[,1] <- sub("^'", "", MyMarkerListIn[,1])

We man examine the files as follows:

round(head(MyProtNSAin), digits=3)
head(MyMarkerListIn)

These new R data frames are identical to protNSA_test and markerListJadot. If the user reads in their own data, be sure that the first row in each of the .csv files contains column names. In particular, the names in the first row of the markers file must be "protName" and "referenceCompartment" to ensure that the resulting R data frame is in the proper format.

If one desires to work with the full set of 7894 proteins in the Tannous et al. data set, it may be downloaded from the MassIVE repository. To do so, execute the following code:

urlProtProfiles <-  "https://massive.ucsd.edu/ProteoSAFe/DownloadResultFile?file=f.MSV000083842/updates/2022-03-14_ablatannous_c39637a3/quant/protNSA_AT5tmtMS2.csv&forceDownload=true"

protNSA_AT5tmtMS2 <- read.csv(urlProtProfiles)
dim(protNSA_AT5tmtMS2)
head(protNSA_AT5tmtMS2)

To run the examples in this vignette using the full data set protNSA_AT5tmtMS2 in place of the test data set, one may rename the full data set as follows:

protNSA <- protNSA_AT5tmtMS2

Creating and viewing compartment and marker protein profiles

In order to assign proteins proportionately to their respective compartments, we first use the function locationProfileSetup to obtain profiles for the compartments based on the means of the individual reference proteins that represent each compartment. This function produces a matrix that has one row for each compartment and one column for each fraction that comprises the profile.

refLocationProfilesNSA <- locationProfileSetup(profile=protNSA, 
                                    markerList=markerListJadot, numDataCols=9)

We display the data, using rounding to improve readability.

round(refLocationProfilesNSA, digits=4)

To examine the available compartments, view the row names of refLocationProfilesNSA.

rownames(refLocationProfilesNSA)

To graphically display a particular compartment profile and its component proteins, use markerProfilePlot. For example, to plot the profile for plasma membrane, input:

markerProfilePlot(refLoc="PM", profile=protNSA, markerList=markerListJadot,
                     refLocationProfiles=refLocationProfilesNSA, ylab="NSA")

This displays the PM compartmental profile (dashed yellow-black line) and its five component reference proteins (red lines). To plot, for example, the second PM marker protein, use the option refProtPlot=2 in the markerProfilePlot function:

markerProfilePlot(refLoc="PM", profile=protNSA, 
                  markerList=markerListJadot,
                  refLocationProfiles=refLocationProfilesNSA, ylab="NSA",
                  refProtPlot=2)

To show individually each of the five reference proteins that are used to calculate the PM compartmental profile, use par(mfrow=c(3,2)) to set up a plot array with 3 rows and 2 columns which will accommodate up to six plots on a single page. Then plot all five of the them one-by-one by looping through the five PM marker proteins by first specifying for (j in 1:5) and invoking the option refProtPlot=j in the markerProfilePlot function. (To make them legible, you may first need to open a new window by entering windows(width=5, height=7).) The parameters in par(mfrow=c(3,2)) and for (j in 1:5) can be adjusted to display any number of plots.

par(mfrow=c(3,2)) # this will be new default layout for subsequent plots.  
# Will need to reset par(mfrow=c(1,1)) for single graph layouts

for (j in 1:5) {
  markerProfilePlot(refLoc="PM", profile=protNSA,
                    markerList=markerListJadot,
                    refLocationProfiles=refLocationProfilesNSA, 
                    ylab="NSA", refProtPlot=j)
  }

To plot all of the compartment and individual marker protein profiles, we may set up a plot window with four rows and two columns and then loop through the eight subcellular compartments:

loc.list <- rownames(refLocationProfilesNSA)
n.loc <- length(loc.list)
par(mfrow=c(4,2))
for (i in 1:n.loc) {
  markerProfilePlot(refLoc=loc.list[i], profile=protNSA,
                     markerList=markerListJadot,
                     refLocationProfiles=refLocationProfilesNSA, ylab="NSA")
  }

Obtaining CPA of proteins to compartments

Once we are satisfied with the marker set and compartment profiles, we can run the CPA routine which, for each protein in our data set, apportions its residency among the different specified compartments. Note that the command below uses default options. It is also possible to specify optional parameters including writing out the obtained goodness of fit minimum (minVal=TRUE, default is minVal=FALSE), setting initial parameters for each CPA value using an eight element vector specifying the starting proportions (startProp, if not specified the function will assign a value of 1/8 for each of the eight compartments), and by constraining the output CPA values for specified compartments to be zero (see help file, ?fitCPA). The optimization procedure applies a projection onto a subspace in order to enforce the constraint that the estimated proportions add up to one. (Chen and Ye, 2011).

protCPAfromNSA <- fitCPA(profile=protNSA,
                          refLocationProfiles=refLocationProfilesNSA, 
                          numDataCols=9)

We can view the structure of the output data file and see that the rows contain the protein names, and the next eight columns contain the allocation proportions of each protein to the eight compartments. The last two columns are integers representing the numbers of peptides and spectra assigned to each respective protein; these are carried over from the input protProfileSummary profile data and are not essential for the CPA analysis.

round(head(protCPAfromNSA), digits=2)

Note that the protein "AIF1" has all missing values, which is why an error is reported for that one protein. While AIF1 is the only one of the 7894 proteins in the full AT5tmtMS2 data set that has no profile, we include it in our 40 protein test data set to show that such cases can occur.

We can look at the profile of, for example, the protein "TLN1". We first ensure that the protein is in the dataset:

protIndex("TLN1", profile=protNSA)

This function also accepts partial matching of the first few letters of a protein. For example, we can find the indices of the proteins starting with "TLN":

protIndex("TLN", profile=protNSA)

Now we plot the results for protein TLN1:

protPlotfun(protName="TLN1", profile=protNSA, 
numDataCols=9, n.compartments=8,
  refLocationProfiles=refLocationProfilesNSA, 
  assignPropsMat=protCPAfromNSA, 
  yAxisLabel="Normalized Specific Amount")

The x-axis represents the nine fractions, which are N, M, L1, L2, P, S, Nyc.1, Nyc.2, and Nyc.3. In each of the eight plots, the red line is the average profile of the protein. The dashed yellow-black lines show the expected profile for a protein entirely resident in the respective subcellular location. In this set of plots, we see that the CPA procedure assigns a 57 percent residence proportion to plasma membrane and 36 percent residence to cytosol. The observed red profile would closely resemble a mixture of the yellow-black lines weighted by the indicated proportions.

The protPlotfun function is designed to plot profiles of eight subcellular locations. If a data set has more than eight of these, it will be necessary to modify the code to accommodate the larger number.

Saving the CPA output

Data output

To save the results of the CPA procedure, as before, we prepend the protein names with a single quote and write out a .csv file to a specified directory.

protCPAfromNSAout <- protCPAfromNSA
rownames(protCPAfromNSAout) <- paste("'", rownames(protCPAfromNSAout), sep="")

setwd("C:/temp/myProteinOutput")

write.csv(protCPAfromNSAout, file="protCPAfromNSAout_AT5tmtMS2.csv", 
      row.names=TRUE)

Plot output as pdf files

To save the plot of a protein (TLN1 for example) as a pdf file, we first specify a pdf plot window, call the protPlotfun function as before, and then close the plot window using dev.off() to allow R to complete producing the file:

pdf(file="myPlotPDFfile.pdf", width=7, height=10)
protPlotfun(protName="TLN1", profile=protNSA, 
    numDataCols=9, n.compartments=8,
    refLocationProfiles=refLocationProfilesNSA, 
    assignPropsMat=protCPAfromNSA,
    yAxisLabel="Normalized Specific Amount")
dev.off()

To output plots all of the protein profiles into a single pdf file, one can set up a loop as follows:

pdf(file="allPlotsPDFfile.pdf", width=7, height=10)
n.prots <- nrow(protCPAfromNSA)
for (i in 1:n.prots) {
   protPlotfun(protName=rownames(protCPAfromNSA)[i],
       profile=protNSA, numDataCols=9, n.compartments=8,
       refLocationProfiles=refLocationProfilesNSA, 
       assignPropsMat=protCPAfromNSA,
       yAxisLabel="Normalized Specific Amount")
}
dev.off()

This will result in a single file, allPlotsPDFfile.pdf, with a page for each protein plotted. Note that one protein in our data set, AIF1, does not have a profile as all peptides were outliers (see Tutorial 6).

\newpage

2. Relative specific amount (RSA) transformation and CPA

Introduction

In Section 1 (“Getting Started”), we illustrated the principles of CPA using protein profiles that represent mass spectrometry data from a set of subcellular fractions in the form of normalized specific amounts (NSA). NSA profiles have equivalent amounts of total protein analyzed per fraction, with the sum of all fractions constrained to 1. This section describes how to use functions in the protlocassign package to transform NSA profiles in ways that may provide more accurate assignments of proportional residence to subcellular compartments.

Setting up the data and reference protein files and transforming to RSA

As explained in the Section 1, we will demonstrate using two R data sets that are included in the protlocassign package.
One, protNSA_test, consists of row names that indicate protein identifiers, each of which is followed by data describing the identifier profile across the nine normalized specific amounts derived from a subcellular fractionation experiment. The other data set, markerListJadot, consists of a list of reference proteins and their associated known subcellular compartments.
As before, to run the program, the protlocassign library must be installed.

library(protlocassign)

In Tutorial 1, we use protNSA_test and an untransformed average of reference protein profiles in the form of NSA for each compartment to conduct CPA. However, it may be advantageous to transform the data prior to conducting CPA to yield a more accurate prediction of cellular location.
For this purpose, we express profile data as RSAs.
As explained in the main text and elaborated in Tutorial 3, RSA is the ratio of two ratios: the numerator is the amount of a given protein in a particular fraction divided by the amount of that given protein in the starting material while the denominator is amount of total protein in a particular fraction divided by the amount of total protein in the starting material. The RSA describes the fold-enrichment (RSA>1) or depletion (RSA<1) of a protein during the fractionation process, and is analogous to the relative specific activity term used in classical analytical subcellular fractionation.
Be aware that to perform this transformation, one needs to have estimates of all these quantities, and this was incorporated into our experimental design. In our example, the first six fractions (the differential fractions) can be used to estimate amounts in the starting material. We also measured total protein in each fraction, and these are contained in the 9-element vector totProtAT5 which is preloaded in protlocassign. Note that the order and numbers of the measurements for total protein (e.g., N, M, L1, L2, P, S, Nyc1, Nyc2 and Nyc3 in totProtAT5) must correspond to those in the data set containing individual protein profiles (e.g., protNSA_test). For clarity of presentation, we rename totProtAT5 and protNSA_test to totProt and protNSA, respectively.

data(protNSA_test)
protNSA <- protNSA_test 
head(round(protNSA, digits=3))

data(totProtAT5)
totProt <- totProtAT5
round(totProt, digits=4)

The function RSAfromNSA calculates transformed profiles from individual and total protein measurements. This requires specifying which values are used to estimate the amount in the starting material (typically the homogenate) and the values used to construct the profile. In our case, the first six fractions of the nine-fraction profile are summed to estimate the starting material. Our code requires that the fractions representing the starting material are contiguous and are located at the beginning of the profile. Note that the function RSAfromNSA can use protein profiles expressed either as NSA or as specific amounts (see Tutorial 3). Thus we select the first nine columns of protNSA:

protRSA  <- RSAfromNSA(NSA=protNSA[,1:9],
                         NstartMaterialFractions=6, totProt=totProt)
dim(protRSA)
round(head(protRSA), digits=3)
str(protRSA)

Since there is additional information in the last two columns of protNSA that we want to include in the new file, specifically the numbers of spectra and peptides (Nspectra and Nseq), we add them to the output as follows:

protRSA <- data.frame(protRSA, protNSA[,10:11])
#note data frame is being overwritten
dim(protRSA)
str(protRSA)

We also need to transform the profiles of the markers for each compartment. As in Tutorial 1, we use the function locationProfilesetup to average the profiles (which must be normalized specific amounts) to obtain profiles for the reference proteins:

data(markerListJadot)
refLocationProfilesNSA <- locationProfileSetup(profile=protNSA,
                          markerList=markerListJadot, numDataCols=9)
round(refLocationProfilesNSA, digits=4)

We then use RSAfromNSA to transform these reference profiles.

refLocationProfilesRSA <- RSAfromNSA(NSA=refLocationProfilesNSA, 
                      NstartMaterialFractions=6,
                      totProt=totProt)
round(refLocationProfilesRSA, digits=4)

We computed the RSA reference profiles above from the NSA reference profiles. Note that, as an alternative, one could compute the RSA reference profiles directly from protRSA. These two approaches yield similar but non-identical results, and we typically use the first procedure to generate RSA-transformed reference location profiles.

refLocationProfilesRSA_2 <- locationProfileSetup(profile=protRSA,
                          markerList=markerListJadot, numDataCols=9)
round(refLocationProfilesRSA_2, digits=4)
# we use the `as.matrix` function for display purposes in the tutorial
as.matrix(all.equal(refLocationProfilesRSA, refLocationProfilesRSA_2, 
                    precision=0, countEQ=TRUE))

Plotting RSA-transformed profiles, and finding RSA-based CPA

As in Tutorial 1, we can plot reference profiles, but this time using the RSA-transformed data. For example, here is a plot for all markers:

loc.list <- rownames(refLocationProfilesRSA)
n.loc <- length(loc.list)
par(mfrow=c(4,2))
for (i in 1:n.loc) {
  markerProfilePlot(refLoc=loc.list[i], profile=protRSA,
                     markerList=markerListJadot,
                     refLocationProfiles=refLocationProfilesRSA, ylab="RSA")
  }

Now we can run the CPA routine on the RSA-transformed levels. The result is a matrix with protein identifiers as row names and data indicating the estimated proportional assignments of each protein among the eight subcellular locations.

protCPAfromRSA <- fitCPA(profile=protRSA,
                        refLocationProfiles=refLocationProfilesRSA, 
                        numDataCols=9)

Note that an error is reported; this is from the attempted fit of the protein “AIF1” which doesn’t have a profile (see Tutorial 1). We examine the output data set as follows:

round(head(protCPAfromRSA), digits=3)

The protPlotfun function is designed to plot profiles of eight subcellular locations. If a data set has more than eight of these, it will be necessary to modify the code to accommodate the larger number.

Now we plot the results for protein TLN1:

protPlotfun(protName="TLN1", profile=protRSA, numDataCols=9,
                        refLocationProfiles=refLocationProfilesRSA,
                        assignPropsMat=protCPAfromRSA,
                        yAxisLabel="Relative Specific Amount")

The x-axis represents the nine fractions, which are N, M, L1, L2, P, S, Nyc.1, Nyc.2, and Nyc.3. In each of the eight plots, the red line is the average profile of the protein and the dashed yellow-black lines show the expected profile for a protein entirely resident in the respective subcellular location. We see that the CPA procedure assigns a 35 percent residence proportion to plasma membrane and 53 percent residence to cytosol. As in Tutorial 1, the observed red profile is a weighted mixture of the expected yellow-black lines.

\newpage

3. Data transformations: Notation and mixtures using protlocassign

Introduction

As explained in the main text and in Section 2, there are different ways to transform protein profile data. Here, we describe a way to explore the effect of using transformed data with CPA.
Briefly, we conduct different data transformations on a set of theoretical proteins that have a range of distributions between two cellular compartments, and then conduct CPA and determine how well it predicts the original distribution. To create the theoretical proteins for our simulations, we use data from the experiment from Tannous et al. which consists of a TMT-MS2 analysis of six differential fractions (N, M, L1, L2, P, and S) obtained from centrifugation of a rat liver homogenate and three fractions from a Nycodenz density gradient separation of the differential fraction L1 (Nyc1, Nyc2, and Nyc3).

In our procedure, we first use the eight compartment profiles generated from the marker protein set to simulate a set of eight theoretical proteins that wholly reside in each of the respective compartments. We then then create binary mixtures with defined combinations of the eight theoretical proteins to simulate proteins that are distributed in varying proportions between two compartments. Note that data from mass spectrometry experiments generally represent specific amounts ${s_{\alpha ,l}}$ of a protein $\alpha$ in fraction $l$, with the same amount of total protein being analyzed for each sample (fraction). For conducting our simulations, we first must transform this data into relative amounts, so that each protein has precisely the same total amount in the initial starting material used for fractionation.

Computation of Relative Amounts (Acup)

Consider an $n$ by 9 matrix of specific amounts that details the average distribution of $n$ proteins among 9 fractions: $$ \mathbf{S}=\left[ \begin{matrix} {{s}{11}} & {{s}{12}} & \cdots & {{s}{19}} \ {{s}{21}} & {{s}{22}} & \cdots & {{s}{29}} \ \vdots & \vdots & {} & \vdots \ {{s}{n1}} & {{s}{n2}} & \cdots & {{s}_{n9}} \ \end{matrix} \right] $$

Each row $\alpha$ represents a mean profile for a protein $\alpha$, with each profile consisting of $f=9$ fractions.

In Section 1 we used a normalized specific amount (NSA), denoted here as $\tilde{s}_{\alpha,l}$ calculated as follows:

$$ {{\tilde{s}}{\alpha ,l}}=\frac{{{s}{\alpha ,l}}}{\sum\limits_{j=1}^{f}{{{s}_{\alpha ,j}}}} $$ In some experiments, these are the only values available for using the CPA procedure. However, with appropriate experimental design and execution, using balance sheet analysis (“bookkeeping”), one can estimate the amounts of total protein present in different samples obtained from a set amount of starting material. These include the total protein content of the starting material (designated $t_h$) and the total protein content of any given fraction (designated $t_l$ for fraction $l$). We can use these to calculate the amount of protein in arbitrary units in fraction $l$ derived from a set amount of starting material as $a_l = s_l t_l$. Note that as $a_l$ is in arbitrary units, one can do the same calculation using either $s_l$ or ${\tilde s_l}$, as these will yield the same values when calculating relative amounts (Acup) (see below).

To see how this works in protlocassign, let us consider the Jadot et al. reference protein profiles, which we generate using the locationProfileSetup function. As in earlier sections, we rename protNSA_test as protNSA and we tailor the output using the round function.

library(protlocassign)
data(protNSA_test)
data(markerListJadot)
protNSA <- protNSA_test
refLocationProfilesNSA <- locationProfileSetup(profile=protNSA, 
                           markerList=markerListJadot, numDataCols=9)
round(refLocationProfilesNSA, digits=3)

Here, each row represents the profile $\tilde{s}_l$ for a protein resident solely in a particular cellular compartment. The amount of total protein derived from a set amount of starting material from all fractions in the experiment used for this vignette is in totProtAT5, a vector supplied with the protassign package. For convenience we rename it totProt.

data(totProtAT5)
totProt <- totProtAT5
round(totProt, digits=3)

We denote these values using a vector $\mathbf{t}= ({t_1},{t_2}, \ldots ,{t_9})$.

As noted above, for protein $\alpha$ in fraction $l$, we have ${{a}{\alpha ,l}}={{\tilde{s}}{\alpha ,l}}{{t}_{l}}$. In matrix form,

$\mathbf{A}=\widetilde{\mathbf{S}}\cdot\mathrm{diag}(\mathbf{t})=\left[\begin{matrix}{\widetilde{s}}{11}t_1&{\widetilde{s}}{12}t_2&\cdots&{\widetilde{s}}{19}t_9\{\widetilde{s}}{21}t_1&{\widetilde{s}}{22}t_2&\cdots&{\widetilde{s}}{29}t_9\\vdots&\vdots&&\vdots\{\widetilde{s}}{n1}t_1&{\widetilde{s}}{n2}t_2&\cdots&{\widetilde{s}}_{n9}t_9\\end{matrix}\right]$

We need this matrix to do the next step, which is to convert to a common scale for all proteins. We do this by normalizing to the amount of protein $\alpha$ in the starting material, which we denote as $a_{\alpha, h}$. If the homogenate is measured directly, this can be calculated from ${\widetilde{s}{\alpha ,h}}{{t}{h}}$. Alternatively, if a complete set of fractions that entirely represent the homogenate are available, it is preferable to calculate this by summing $\widetilde{s}_{\alpha,l}t_l$ over these fractions. In our case, the first six fractions (N, M, L1, L2, P, and S) are a complete set of differential fractions that represent the starting material, and thus:

$${a_{\alpha ,h}} = \sum\nolimits_{i = 1}^6 {{\tilde{s}{\alpha ,i}}{t_i}} $$ and $t_h=\sum{i=1}^{6}t_i$. Note that this is readily calculated by summing the first six elements of $\mathbf{t}$.

sum(totProt[1:6])

(The last three "Nyc" columns were derived from L1, so we do not include them in the sum.)

Finally, we normalize amounts in any given fraction to amounts in starting material, which we call the relative amount, designated here as Acup, and denote as ${\breve{a}}_{\alpha,l}$

In our example,

$${\overset{\smile}{a}{\alpha ,l}} =\frac{a{\alpha,l}}{a_{\alpha,h}} = \frac{{{s_{\alpha ,l}}{t_l}}}{{{s_{\alpha ,h}}{t_h}}} = \frac{{{s_{\alpha ,l}}{t_l}}}{{\sum\nolimits_{i=1}^6 {{s_{\alpha ,i}}{t_i}} }} = \frac{{{{\tilde s}{\alpha ,l}}{t_l}}}{{\sum\nolimits{i = 1}^6 {{{\tilde s}_{\alpha ,i}}{t_i}} }}$$

In matrix form, we may write this as:

$\breve{\mathbf{A}}=\mathbf{A}\cdot\mathrm{diag}(1/a_{{\alpha},h})=\left[\begin{matrix}{ \tilde{s}}{11}t_1/a{1,h}&{ \tilde{s}}{12}t_2/a{1,h}&\cdots&{ \tilde{s}}{19}t_9/a{1,h}\{ \tilde{s}}{21}t_1/a{2,h}&{ \tilde{s}}{22}t_2/a{2,h}&\cdots&{ \tilde{s}}{29}t_9/a{2,h}\\vdots&\vdots&&\vdots\{ \tilde{s}}{n1}t_1/a{n,h}&{ \tilde{s}}{n2}t_2/a{n,h}&\cdots&{ \tilde{s}}{n9}t_9/a{n,h}\\end{matrix}\right]$

or simply as ${\breve{\mathbf{A}}} = [\breve{a}_{\alpha ,l}]$.

The function AcupFromNSA computes this:

refLocationProfilesAcup <- AcupFromNSA(NSA=refLocationProfilesNSA, 
                               NstartMaterialFractions=6, 
                               totProt=totProt)
round(refLocationProfilesAcup, digits=4)

The values in refLocationProfilesAcup represent, in principle, the relative amount (Acup) of each cellular compartment distributed to each centrifugation fraction.

Computation of RSA from Acup

When examining the distribution of a given protein in different fractions, it is particularly useful to consider its abundance relative to that of total protein. We refer to this as a RSA or $r$, which can be calculated as ${r_{\alpha ,l}} = \breve{a}{\alpha ,l}/ \breve{t}_l$, where the vector $\breve{\mathbf{t}}=\frac{\mathbf{t}}{t_h}=\frac{\mathbf{t}}{\sum{1}^{6}t_i}$. For a protein $\alpha$ the RSA in fraction $l$ is given by:

$r_{\alpha,l}=\frac{{\breve{a}}{\alpha,l}}{{\breve{t}}_l}=\frac{\frac{{\widetilde{s}}{\alpha,l}t_l}{\sum_{i=1}^{6}{{\widetilde{s}}{\alpha,i}t_i}}}{\frac{t_l}{\sum{i=1}^{6}t_i}}=\frac{s_{\alpha,l}\cdot\sum_{i=1}^{6}t_i}{\sum_{i=1}^{6}s_{\alpha,i}t_i}=\frac{{\widetilde{s}}{\alpha,l}\cdot\sum{i=1}^{6}t_i}{\sum_{i=1}^{6}{\widetilde{s}}_{\alpha,i}t_i}$

In matrix form, this is $\mathbf{R}=\left[ {{r}_{\alpha ,l}} \right]$.

We can get the RSA matrix from refLocationProfilesAcup and totProt as follows:

refLocationProfilesRSA <- RSAfromAcup(refLocationProfilesAcup, 
                              NstartMaterialFractions=6, totProt=totProt)
round(refLocationProfilesRSA, digits=3)

Note that we can also obtain the RSA transformed data directly from NSA data as described in Section 2:

refLocationProfilesRSA_2 <- RSAfromNSA(NSA=refLocationProfilesNSA,
                                NstartMaterialFractions=6, totProt=totProt)

This yields precisely the same values as calculated previously using locationProfileSetup.

identical(refLocationProfilesRSA, refLocationProfilesRSA_2)

Finally, if we normalize an RSA profile so that the rows sum to one, this yields a normalized specific amount profile (see Appendix of main paper, Moore et al., 2022). Consider, for example, the matrix refLocationProfilesRSA, which contains the RSA-transformed compartment profiles. We normalize the rows using the apply function, and then transpose using the t function to yield a matrix of the normalized specific amounts; these values are essentially identical to those that we started with in refLocationProfiles. This is performed using the function NSAfromRSA:

refLocationProfilesNSA_2 <- NSAfromRSA(refLocationProfilesRSA_2)

Note that refLocationProfilesNSA_2 is not identical to the values obtained previously using the locationProfileSetup function with the protein profiles containing NSA data because of internal precision issues, but both are essentially equivalent.

as.matrix(all.equal(refLocationProfilesNSA_2, refLocationProfilesNSA, 
                    tolerance=0, countEQ=TRUE))

The available transformation functions are AcupFromNSA, RSAfromAcup, RSAfromNSA (a combination of the previous two), and NSAfromRSA. For completeness, we also include the functions NSAfromAcup and AcupFromRSA. All functions except NSAfromRSA require arguments for NstartMaterialFractions and totProt. These functions allow any profile to be expressed in the form of NSA, Acup, and RSA.

Simulating proteins resident in multiple subcellular locations

We may simulate data from proteins with multiple residences using the proteinMix function. For example, to simulate data from proteins that are distributed in a range of proportions between cytosol and lysosomes, we use this function with relative amounts (Acup) of their single-compartment profiles. Note that we need to use Acup-transformed data to create the mixtures so that the total amount of the given protein summed across all fractions will be invariant for all mixtures. By default, we vary the proportions by increments of 0.1, but different values can be specified using the argument increment.prop (e.g., increment.prop=0.2 or increment.prop=0.05). We specify the mixing locations by the arguments Loc1=1 and Loc2=4, which are the row numbers of the desired compartment profiles in refLocationProfiles (i.e. Cyto and Lyso):

refLocationProfilesAcup <- AcupFromNSA(NSA=refLocationProfilesNSA, 
                                       NstartMaterialFractions=6, 
                                       totProt=totProt)
data.frame(rownames(refLocationProfilesAcup))
mixCytoLysoAcup <- proteinMix(AcupRef=refLocationProfilesAcup, 
                              increment.prop=0.1,
                              Loc1=1, Loc2=4)
# Note that the default value of increment.prop=0.1. 
# This does not need to be explicitly 
#  specified unless a different increment is desired.

The result is a matrix that contains the Acup values (relative amounts) for the simulated proteins:

round(mixCytoLysoAcup, digits=3)

Then we can test the CPA algorithm by first converting this mixture data to RSA:

mixCytoLysoRSA <- RSAfromAcup(Acup=mixCytoLysoAcup, 
                              NstartMaterialFractions=6, totProt=totProt)

round(mixCytoLysoRSA, digits=3)

Finally, we fit the CPA algorithm to this RSA-transformed, simulated data using the previously generated RSA-transformed marker protein profiles:

mixCytoLysoCPAfromRSA <- fitCPA(profile=mixCytoLysoRSA,
                             refLocationProfiles=refLocationProfilesRSA,
                            numDataCols=9)
round(mixCytoLysoCPAfromRSA, digits=3)

The estimated proportions correspond closely to the proportions used in the simulation (see below).

Plotting these estimates against the “true” distributions of the simulated proteins provides insights into the effects of different transformations on the goodness of fit. As an introduction, we first illustrate this using plots of simulated proteins distributed in varying amounts between the cytoplasm and the lysosome with CPA conducted on the RSA-transformed data. We then extend this to simulated proteins distributed between the cytoplasm and each of the other compartments. In Tutorial 4 in the supplemental documentation we use the simulated mixtures to evaluate the effect of various transformations on the CPA procedure.

Plotting mixtures of proteins with transformations

We can use the mixturePlot function to evaluate the effect of different data transformations used to represent protein profiles on the compartmental distributions estimated by CPA. The function produces graphical representations by plotting the theoretical distribution (based on simulation parameters, x-coordinate) versus the predicted values (based on CPA, y-coordinate).

This function also evaluates the prediction error by computing the area separating the predicted and expected protein mixtures via the trapezoidal rule; this is done with the trapz function in the pracma library, which must have been previously installed. We also need to tell the program which two locations were used to generate the mixtures using Loc1 and Loc2. As our first example, we examine the results CPA using the RSA transformation on simulated proteins distributed between Cyto and Lyso.

library(pracma)
par(mfrow=c(1,1))  # reset window for a single plot
# The argument increment.prop needs to match the value used  
#   in creating the mixture using proteinMix. This does
#   not need to be specified if using the value of 0.1
mixturePlot(mixProtiProtjCPA=mixCytoLysoCPAfromRSA, 
            NstartMaterialFractions=6, Loc1=1, Loc2=4,
            increment.prop=0.1, xaxisLab=TRUE, yaxisLab=TRUE)

Here we see visually that the estimated proportions match the simulated ones. The prediction error, i.e., the area separated by the observed and expected CPA estimates, is nearly zero, and is shown in parentheses.

Next, we consider a mixture of Cyto with each of the other seven compartments using RSA-based transformations. We begin by setting up the plot area for a 4 by 2 array of plots. Optionally, to control the size of the window, we may want to explicitly open a window using windows(height=10, width=7). Next, we fix one component of the mixture to the first, which is Cyto, and loop over the other 7 subcellular compartments, creating mixtures of the refLocationProfilesAcup values. For each case, we transform these mixtures to RSA and obtain CPA mixing proportion estimates from these RSA-transformed mixtures and compute the area-based prediction errors. These values are stored in a data frame mixErrorMat which is then renamed to avoid overwriting since multiple mixtures and transformations are being explored.

par(mfrow=c(4,2))
i <- 1
mixErrorMat <- NULL
for (j in 2:8) {   
   # Create the mixture of Cyto (i = 1) with compartment j
   mixProtiProtjAcup <- proteinMix(Acup=refLocationProfilesAcup, 
                                   Loc1=i, Loc2=j)

   # Tranform the mixtures to RSAs
   mixProtiProtjRSA <- RSAfromAcup(Acup=mixProtiProtjAcup, 
                        NstartMaterialFractions=6, totProt=totProt)

   # Find the CPAs                
   mixProtiProtjCPAfromRSA <- fitCPA(profile=mixProtiProtjRSA,
                    refLocationProfiles=refLocationProfilesRSA, 
                    numDataCols=9)

   # Plot the results, including the area-based error estimate, 
   #    and collect the area-based errors (errorReturn=TRUE)                         
    mixResult <- mixturePlot(mixProtiProtjCPA=mixProtiProtjCPAfromRSA, 
                             NstartMaterialFractions=6, Loc1=i, Loc2=j, 
                             increment.prop=0.1, errorReturn = TRUE)
    mixErrorMat <- rbind(mixErrorMat, mixResult)         
}
mixErrorAllCytoRSA <- mixErrorMat

All seven mixtures have essentially zero area-based error, as we expect. We can examine these errors with more precision as follows:

mixErrorAllCytoRSA

\newpage

4. Interface with pRoloc and MSnbase data structures

The Bioconductor "pRoloc" package provides methods to analyze organelle proteomics data and the "MSnbase" package, also from Bioconductor, provides a flexible structure to store this data. In this section we show how to convert the rectanular proteomics data structures of "protlocassign" into "MSnbase" data structures on which "pRoloc" functions may operate.

We illustrate the interface using the "protRSAtest" data discussed above, which we renamed "protRSA".

First, attach the necessary libraries and extract the protein names and data:

library(pRoloc)
library(MSnbase)
ProteinID <- as.factor(row.names(protRSA[-2,]))
protRSAdata <- protRSA[-2,1:9]
rownames(protRSAdata) <- 1:nrow(protRSAdata)

Next create an expression data frame containing protein names and RSA abundance levels:

exprsProt <- data.frame(ProteinID, protRSAdata)
names(exprsProt) <- as.factor(names(exprsProt))

To create a file of metadata, we first need a list of organelle locations, and then create assignments as follows:

Locations <- c(unique(markerListJadot$referenceCompartment), "unclassified")
assign0p8 <- apply(protCPAfromRSA[-2,1:8], 1, assignCPAloc, 
                   cutoff=0.8, Locations=Locations)

Finally we create the metadata file with organelle assignments and spectral and peptide counts:

fdataProt <- data.frame(ProteinID, protRSA[-2,10:11], assign0p8)
#fdataProt <- data.frame(ProteinID, assign0p8)
rownames(fdataProt) <- 1:nrow(fdataProt)

fdataProt

To create an MSnSet data structure we need to create a temporary folder and write out these two files in comma-separated format:

currentDir <- tempdir()
setwd(currentDir)
currentDir

write.csv(exprsProt, file="exprs.csv", row.names=FALSE)
write.csv(fdataProt, file="fdata.csv", row.names=FALSE)

Finally, we point to these files and use the "readMSnSet" function to create the data structure

f1p <- paste(currentDir, "\\exprs.csv", sep="")
f2p <- paste(currentDir, "\\fdata.csv", sep="")

protRSAstruct <- readMSnSet(exprsFile=f1p,
                       featureDataFile=f2p,
                       sep=",", header=TRUE, check.names=FALSE)

We may examine the expression data and meta data as follows:

head(exprs(protRSAstruct))
head(fData(protRSAstruct))

Having created the data structure "protRSAstruct" we may apply pRoloc functions to it. For example, to create a sub-cellular dendrogram, we may use the mrkHClust function, after first using "fvarLabels" to identify the column names:

fvarLabels(protRSAstruct)
mrkHClust(protRSAstruct, fcol="\"assign0p8\"")

Similarly, we may create a heat map of average oranelle class profiles as follows:

fmat <- mrkConsProfiles(protRSAstruct, fcol="\"assign0p8\"")
plotConsProfiles(fmat)

One may also extract data from a MSnbase proteomics data structure for use by protlocassign. We illustrate with the "protRSAstruct" that we just created:

exprsDataTemp <- exprs(protRSAstruct)
fDataTemp <- fData(protRSAstruct)
exprsData <- data.frame(exprsDataTemp, fDataTemp[,1:2])
head(exprsData)

This data frame is in the same format as the original file, "protRSA", which we defined previously.

5. Summary of tutorials for more detailed information on the protlocassign package.

The current vignette provides an introduction to the basic concepts of using the protlocassign package. The more extensive set of tutorials is available in the "protlocassigndoc" package; details on obtaining this were given in Section 1 of this vignette. The first three tutorials are similar to the contents of this vignette. Tutorial 4 discusses using CPA on simulations of proteins that are resident in two compartments, and Tutorial 5 discusses how these results change when we reformat the data into the form of a classical five-fraction differential fractionation experiment. Tutorial 6 presents details on preparing spectral-level data for further CPA analysis. Finally, Tutorial 7 presents methods of finding proteins that have similar profiles to a given protein.

Details of how data are pre-processed for input into the protlocassign package may be found in the protlocassignDataDescription.md file in the \inst subdirectory of the R package. We note that many of these pre-processing steps could, as an alternative, be carried out within R using the MSnSet facilities also used by the pRoloc package. An advantage of this approach is that these packages provide a natural way to store metadata such as protein markers and their localization as well including built-in validity checks and visualization tools. See the references by L Gatto, LM Breckels, KS Lilley and colleagues for details.

References

Chen, Yunmei and Ye, Xiaojing. Projection Onto A Simplex. arXiv 2011 https://arxiv.org/abs/1101.6081

Gatto L, Breckels LM, Wieczorek S, Burger T, Lilley KS. Mass-spectrometry-based spatial proteomics data analysis using pRoloc and pRolocdata. Bioinformatics. 2014 May 1;30(9):1322-4. doi:10.1093/bioinformatics/btu013. Epub 2014 Jan 11. PubMed PMID: 24413670; PubMed Central PMCID: PMC3998135.

Gatto L, Gibb S. Rainer J. MSnbase, efficient and elegant R-based processing and visualisation of raw mass spectrometry data. bioRxiv 2020.04.29.067868; doi: https://doi.org/10.1101/2020.04.29.067868

Gatto L, Lilley KS. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics. 2012 Jan 15;28(2):288-9. doi: 10.1093/bioinformatics/btr645. PMID: 22113085.

Jadot, M.; Boonen, M.; Thirion, J.; Wang, N.; Xing, J.; Zhao, C.; Tannous, A.; Qian, M.; Zheng, H.; Everett, J. K., Accounting for protein subcellular localization: A compartmental map of the rat liver proteome. Molecular & Cellular Proteomics 2017, 16, (2), 194-212. https://doi.org/10.1074/mcp.M116.064527

Moore DF, Sleat D, Lobel P. A method to estimate the distribution of proteins across multiple compartments using data from quantitative proteomics subcellular fractionation experiments. Journal of Proteome Research, 2022, 21, 6, 1371-1381 https://doi.org/10.1021/acs.jproteome.1c00781

Tannous, A.; Boonen, M.; Zheng, H.; Zhao, C.; Germain, C. J.; Moore, D. F.; Sleat, D. E.; Jadot, M.; Lobel, P., Comparative analysis of quantitative mass spectrometric methods for subcellular proteomics. J Proteome Res 2020, 19, (4), 1718-1730. https://doi.org/10.1021/acs.jproteome.9b00862

Tran, A.N., Dussaq, A.M., Kennell, T. et al. HPAanalyze: an R package that facilitates the retrieval and analysis of the Human Protein Atlas data. BMC Bioinformatics 20, 463 (2019) https://doi:10.1186/s12859-019-3059-z

Reproducibility

Following is the output of the utility sessionInfo. This output contains details of the packages and version numbers used to generate these tutorials.

print(utils::sessionInfo(), width=80)


mooredf22/protlocassign documentation built on Sept. 13, 2023, 3:57 p.m.