knitr::opts_chunk$set(
  #collapse = TRUE,
  comment = "#>",
  fig.width = 4,
  fig.height = 4,
  message = FALSE,
  warning = FALSE,
  tidy.opts = list(
    keep.blank.line = TRUE,
    width.cutoff = 150
    ),
  options(width = 150),
  eval = TRUE
)

Introduction

As explained in the main text and in Vignette 2, there are different ways to transform protein profile data. Here, we describe a way to explore the effect of using transformed data with CPA. Briefly, we conduct different data transformations on a set of theoretical proteins that have a range of distributions between two cellular compartments, and then conduct CPA and determine how well it predicts the original distribution. To create the theoretical proteins for our simulations, we use data from the experiment from Tannous et al which consists of a TMT-MS2 analysis of six differential fraction (N, M, L1, L2, P, and S) obtained from centrifugation of a rat liver homogenate and three fractions from a Nycodenz density gradient separation of the differential fraction L1 (Nyc1, Nyc2, and Nyc3).

In our procedure, we first use the eight compartment profiles generated from the marker protein set to simulate a set of eight theoretical proteins that wholly reside in each of the respective compartments. We then then create binary mixtures with defined combinations of the eight theoretical proteins to simulate proteins that are distributed in varying proportions between two compartments. Note that data from mass spectrometry experiments generally represent specific amounts ${s_{\alpha ,l}}$ of a protein $\alpha$ in fraction $l$, with the same amount of total protein being analyzed for each sample (fraction). For conducting our simulations, we first must transform this data into relative amounts, so that each protein has precisely the same total amount in the initial starting material used for fractionation.

Computation of Relative Amounts

Consider an $n$ by 9 matrix of specific amounts that details the average distribution of $n$ proteins among 9 fractions: $$ \mathbf{S}=\left[ \begin{matrix} {{s}{11}} & {{s}{12}} & \cdots & {{s}{19}} \ {{s}{21}} & {{s}{22}} & \cdots & {{s}{29}} \ \vdots & \vdots & {} & \vdots \ {{s}{n1}} & {{s}{n2}} & \cdots & {{s}_{n9}} \ \end{matrix} \right] $$

Each row $\alpha$ represents a mean profile for a protein $\alpha$, with each profile consisting of $f=9$ fractions.

In the Vignette 1 we actually used a normalized specific amount (NSA), denoted here as $\tilde{s}_{\alpha,l}$ calculated as follows:

$$ {{\tilde{s}}{\alpha ,l}}=\frac{{{s}{\alpha ,l}}}{\sum\limits_{j=1}^{f}{{{s}_{\alpha ,j}}}} $$ In some experiments, these are the only values available for using the CPA procedure. However, with appropriate experimental design and execution, using balance sheet analysis (“bookkeeping”), one can estimate the amounts of total protein present in different samples obtained from a set amount of starting material. These include the total protein content of the starting material (designated $t_h$) and the total protein content of any given fraction (designated $t_l$ for fraction $l$). We can use these to calculate the amount of protein in arbitrary units in fraction $l$ derived from a set amount of starting material as $a_l = s_l t_l$. Note that as $a_l$ is in arbitrary units one can do the same calculation using either $s_l$ or ${\tilde s_l}$, as later, these will yield the same values when calculating relative amounts (see below).

To see how this works in protlocassign, let us consider the Jadot et al. reference protein profiles which we generate using the locationProfileSetup function. As in earlier vignettes, we rename protNSA_AT5tmtMS2 as protNSA and we tailor the output using the round function.

library(protlocassign)
data(protNSA_AT5tmtMS2)
data(markerListJadot)
protNSA <- protNSA_AT5tmtMS2
refLocationProfilesNSA <- locationProfileSetup(profile=protNSA, 
                           markerList=markerListJadot, numDataCols=9)
round(refLocationProfilesNSA, digits=3)

Here, each row represents the profile $\tilde{s}_l$ for a protein resident solely in a particular cellular compartment. The amount of total protein derived from a set amount of starting material from all fractions in the experiment used for these vignettes is in totProtAT5, a vector supplied with the protassign package. For convenience we rename it totProt.

data(totProtAT5)
totProt <- totProtAT5
round(totProt, digits=3)

We denote these values using a vector $\mathbf{t}= ({t_1},{t_2}, \ldots ,{t_9})$.

As noted above, for protein $\alpha$ in fraction $l$, we have ${{a}{\alpha ,l}}={{\tilde{s}}{\alpha ,l}}{{t}_{l}}$. In matrix form,

$\mathbf{A}=\widetilde{\mathbf{S}}\cdot\mathrm{diag}(\mathbf{t})=\left[\begin{matrix}{\widetilde{s}}{11}t_1&{\widetilde{s}}{12}t_2&\cdots&{\widetilde{s}}{19}t_9\{\widetilde{s}}{21}t_1&{\widetilde{s}}{22}t_2&\cdots&{\widetilde{s}}{29}t_9\\vdots&\vdots&&\vdots\{\widetilde{s}}{n1}t_1&{\widetilde{s}}{n2}t_2&\cdots&{\widetilde{s}}_{n9}t_9\\end{matrix}\right]$

We need this matrix to do the next step, which is to convert to a common scale for all proteins. We do this by normalizing to the amount of protein $\alpha$ in the starting material, which we denote as $a_{\alpha, h}$. If the homogenate is measured directly, this can be calculated from ${\widetilde{s}{\alpha ,h}}{{t}{h}}$. Alternatively, if a complete set of fractions that entirely represent the homogenate are available, it is preferable to calculate this by summing $\widetilde{s}_{\alpha,l}t_l$ over these fractions. In our case, the first six fractions (N, M, L1, L2, P, and S) are a complete set of differential fractions that represent the starting material, and thus

$${a_{\alpha ,h}} = \sum\nolimits_{i = 1}^6 {{\tilde{s}{\alpha ,i}}{t_i}} $$ and $t_h=\sum{i=1}^{6}t_i$. Note that this is readily calculated by summing the first six elements of $\mathbf{t}$.

sum(totProt[1:6])

(The last three "Nyc" columns were derived from L1, so we do not include them in the sum.)

Finally, we normalize amounts in any given fraction to amounts in starting material, which we call the relative amount, designated here as Acup, and denote as ${\breve{a}}_{\alpha,l}$

In our example,

$${\overset{\smile}{a}{\alpha ,l}} =\frac{a{\alpha,l}}{a_{\alpha,h}} = \frac{{{s_{\alpha ,l}}{t_l}}}{{{s_{\alpha ,h}}{t_h}}} = \frac{{{s_{\alpha ,l}}{t_l}}}{{\sum\nolimits_{i=1}^6 {{s_{\alpha ,i}}{t_i}} }} = \frac{{{{\tilde s}{\alpha ,l}}{t_l}}}{{\sum\nolimits{i = 1}^6 {{{\tilde s}_{\alpha ,i}}{t_i}} }}$$

In matrix form, we may write this as:

$\breve{\mathbf{A}}=\mathbf{A}\cdot\mathrm{diag}(1/a_{{\alpha},h})=\left[\begin{matrix}{ \tilde{s}}{11}t_1/a{1,h}&{ \tilde{s}}{12}t_2/a{1,h}&\cdots&{ \tilde{s}}{19}t_9/a{1,h}\{ \tilde{s}}{21}t_1/a{2,h}&{ \tilde{s}}{22}t_2/a{2,h}&\cdots&{ \tilde{s}}{29}t_9/a{2,h}\\vdots&\vdots&&\vdots\{ \tilde{s}}{n1}t_1/a{n,h}&{ \tilde{s}}{n2}t_2/a{n,h}&\cdots&{ \tilde{s}}{n9}t_9/a{n,h}\\end{matrix}\right]$

or simply as ${\breve{\mathbf{A}}} = [\breve{a}_{\alpha ,l}]$.

The function AcupFromNSA computes this:

refLocationProfilesAcup <- AcupFromNSA(NSA=refLocationProfilesNSA, NstartMaterialFractions=6, 
                               totProt=totProt)
round(refLocationProfilesAcup, digits=4)

The values in refLocationProfilesAcup represent in principle the relative amount of a given cellular compartment that ends up in a given centrifugation fraction.

Computation of Relative Specific Amounts from Relative Amounts

When examining the distribution of a given protein in different fractions, it is particularly useful to consider its abundance relative to that of total protein. We refer to this as a Relative Specific Amount (RSA or $r$), which can be calculated as ${r_{\alpha ,l}} = \breve{a}{\alpha ,l}/ \breve{t}_l$, where the vector $\breve{\mathbf{t}}=\frac{\mathbf{t}}{t_h}=\frac{\mathbf{t}}{\sum{1}^{6}t_i}$. For a protein $\alpha$ the RSA in fraction $l$ is given by:

$r_{\alpha,l}=\frac{{\breve{a}}{\alpha,l}}{{\breve{t}}_l}=\frac{\frac{{\widetilde{s}}{\alpha,l}t_l}{\sum_{i=1}^{6}{{\widetilde{s}}{\alpha,i}t_i}}}{\frac{t_l}{\sum{i=1}^{6}t_i}}=\frac{s_{\alpha,l}\cdot\sum_{i=1}^{6}t_i}{\sum_{i=1}^{6}s_{\alpha,i}t_i}=\frac{{\widetilde{s}}{\alpha,l}\cdot\sum{i=1}^{6}t_i}{\sum_{i=1}^{6}{\widetilde{s}}_{\alpha,i}t_i}$

In matrix form, this is $\mathbf{R}=\left[ {{r}_{\alpha ,l}} \right]$.

We can get the RSA matrix from refLocationProfilesAcup and totProt as follows:

refLocationProfilesRSA <- RSAfromAcup(refLocationProfilesAcup, NstartMaterialFractions=6, totProt=totProt)
round(refLocationProfilesRSA, digits=3)

Note that we can also obtain the RSA transformed data directly from NSA data as described in Vignette 2:

refLocationProfilesRSA_2 <- RSAfromNSA(NSA=refLocationProfilesNSA,
                                NstartMaterialFractions=6, totProt=totProt)

This yields precisely the same values as calculated previously using locationProfileSetup.

identical(refLocationProfilesRSA, refLocationProfilesRSA_2)

Finally, if we normalize an RSA profile so that the rows sum to one, this yields a normalized specific amount profile (see Appendix of main paper). Consider example the matrix refLocationProfilesRSA, which contains the RSA-transformed compartment profiles. We normalize the rows using the apply function, and then transpose using the t function to yield a matrix of the normalized specific amounts; these values are essentially identical to those that we started with in refLocationProfiles. This is performed using the function NSAfromRSA:

refLocationProfilesNSA_2 <- NSAfromRSA(refLocationProfilesRSA_2)

Note that refLocationProfilesNSA_2 is not identical to the values obtained previously using the locationProfileSetup function with the protein profiles containing NSA data because of internal precision issues, but both are essentially equivalent.

as.matrix(all.equal(refLocationProfilesNSA_2, refLocationProfilesNSA, 
                    tolerance=0, countEQ=TRUE))

The available transformation functions are AcupFromNSA, RSAfromAcup, RSAfromNSA (a combination of the previous two), and NSAfromRSA. For completeness, we also include the functions NSAfromAcup and AcupFromRSA. All functions except NSAfromRSA require arguments for NstartMaterialFractions and totProt. These functions allow any profile to be expressed in the form of NSA, Acup, and RSA.

Simulating proteins resident in multiple subcellular locations

We may simulate data from proteins with multiple residences using the proteinMix function. For example, to simulate data from proteins that are distributed in a range of proportions between cytosol and lysosomes, we use this function with relative amounts of their single-compartment profiles. Note that we need to use Acup-transformed data to create the mixtures so that the total amount of the given protein summed across all fractions will be invariant for all mixtures. By default, we vary the proportions by increments of 0.1, but different values can be specified using the argument increment.prop (e.g., increment.prop=0.2 or increment.prop=0.05). We specify the mixing locations by the arguments Loc1=1 and Loc2=4, which are the row numbers of the desired compartment profiles in refLocationProfiles:

refLocationProfilesAcup <- AcupFromNSA(NSA=refLocationProfilesNSA, 
                                       NstartMaterialFractions=6, 
                                       totProt=totProt)
data.frame(rownames(refLocationProfilesAcup))
mixCytoLysoAcup <- proteinMix(AcupRef=refLocationProfilesAcup, 
                              increment.prop=0.1,
                              Loc1=1, Loc2=4)
# Note that the default value of increment.prop=0.1. 
# This does not need to be explicitly 
#  specified unless a different increment is desired.

The result is a matrix that contains the Acup values (relative amounts) for the simulated proteins:

round(mixCytoLysoAcup, digits=3)

Then we can test the CPA algorithm by first converting this mixture data to RSAs:

mixCytoLysoRSA <- RSAfromAcup(Acup=mixCytoLysoAcup, 
                              NstartMaterialFractions=6, totProt=totProt)

round(mixCytoLysoRSA, digits=3)

Finally, we fit the CPA algorithm to this RSA-transformed, simulated data using the previously generated RSA-transformed marker protein profiles:

mixCytoLysoCPAfromRSA <- fitCPA(profile=mixCytoLysoRSA,
                             refLocationProfiles=refLocationProfilesRSA,
                            numDataCols=9)
round(mixCytoLysoCPAfromRSA, digits=3)

The estimated proportions correspond closely to the proportions used in the simulation (see below).

Plotting these estimates against the “true” distributions of the simulated proteins provides insights into different transformations on the goodness of fit. As an introduction, we first illustrate this using plots of simulated proteins distributed in varying amounts between the cytoplasm and the lysosome with CPA conducted on the RSA-transformed data. We then extend this to simulated proteins distributed cytoplasm and each of the other compartments. In Vignette 4, we use the simulated mixtures to evaluate the effect of various transformations on the CPA procedure.

Plotting mixtures of proteins with transformations

We can use the mixturePlot function to evaluate the effect of different data transformations used to represent protein profiles on the compartmental distributions estimated by CPA. The function produces graphical representations by plotting the theoretical distribution (based on simulation parameters, x-coordinate) versus the predicted values (based on CPA, y-coordinate).

This function also evaluates the prediction error by computing the area separating the predicted and expected protein mixtures via the trapezoidal rule; this is done with the trapz function in the pracma library, which must have been previously installed. We also need to tell the program which two locations were used to generate the mixtures using Loc1 and Loc2. As our first example, we examine the results CPA using the RSA transformation on simulated proteins distributed between Cyto and Lyso.

library(pracma)
par(mfrow=c(1,1))  # reset window for a single plot
# The argument increment.prop needs to match the value used  
#   in creating the mixture using proteinMix. This does
#   not need to be specified if using the value of 0.1
mixturePlot(mixProtiProtjCPA=mixCytoLysoCPAfromRSA, 
            NstartMaterialFractions=6, Loc1=1, Loc2=4,
            increment.prop=0.1, xaxisLab=TRUE, yaxisLab=TRUE)

Here we see visually that the estimated proportions match the simulated ones. The prediction error, the area separated by the observed and expected CPA estimates, is nearly zero, and is shown in parentheses.

Next, we consider a mixture of Cyto with each of the other seven compartments using RSA-based transformations. We begin by setting up the plot area for a 4 by 2 array of plots. Optionally, to control the size of the window, we may want to explicitly open a window using windows(height=10, width=7). Next we fix one component of the mixture to the first, which is Cyto, and loop over the other 7 subcellular compartments, creating mixtures of the refLocationProfilesAcup values. For each case, we transform these mixtures to RSA and obtain CPA mixing proportion estimates from these RSA-transformed mixtures and compute the area-based prediction errors. These values are stored in a data frame mixErrorMat which is then renamed to avoid overwriting since multiple mixtures and transformations are being explored.

par(mfrow=c(4,2))
i <- 1
mixErrorMat <- NULL
for (j in 2:8) {   
   # Create the mixture of Cyto (i = 1) with compartment j
   mixProtiProtjAcup <- proteinMix(Acup=refLocationProfilesAcup, 
                                   Loc1=i, Loc2=j)

   # Tranform the mixtures to relative specific amounts
   mixProtiProtjRSA <- RSAfromAcup(Acup=mixProtiProtjAcup, 
                        NstartMaterialFractions=6, totProt=totProt)

   # Find the constrained proportional assigments (CPA)                 
   mixProtiProtjCPAfromRSA <- fitCPA(profile=mixProtiProtjRSA,
                    refLocationProfiles=refLocationProfilesRSA, 
                    numDataCols=9)

   # Plot the results, including the area-based error estimate, 
   #    and collect the area-based errors (errorReturn=TRUE)                         
    mixResult <- mixturePlot(mixProtiProtjCPA=mixProtiProtjCPAfromRSA, 
                             NstartMaterialFractions=6, Loc1=i, Loc2=j, 
                             increment.prop=0.1, errorReturn = TRUE)
    mixErrorMat <- rbind(mixErrorMat, mixResult)         
}
mixErrorAllCytoRSA <- mixErrorMat

All seven mixtures have essentially zero area-based error, as we expect. We can examine these errors with more precision as follows:

mixErrorAllCytoRSA

References

Jadot, M.; Boonen, M.; Thirion, J.; Wang, N.; Xing, J.; Zhao, C.; Tannous, A.; Qian, M.; Zheng, H.; Everett, J. K., Accounting for protein subcellular localization: A compartmental map of the rat liver proteome. Molecular & Cellular Proteomics 2017, 16, (2), 194-212.

Tannous, A.; Boonen, M.; Zheng, H.; Zhao, C.; Germain, C. J.; Moore, D. F.; Sleat, D. E.; Jadot, M.; Lobel, P., Comparative Analysis of Quantitative Mass Spectrometric Methods for Subcellular Proteomics. J Proteome Res 2020, 19, (4), 1718-1730



mooredf22/protlocassign0p1p1 documentation built on Feb. 7, 2022, 1:55 a.m.