Introduction

This script is designed to process MS-CETSA melt curve result from TMT labelling and quantification scheme.
The main tasks include:
1. Read in the quantitative CETSA data (Search result txt.file from Proteome Discoverer, PD) into R/Rstudio
2. Clean up data, mainly to remove the entries without quantitative information
3. Normalize data based on the principle that the overall median of relative protein fold changes across the heating temperature range should follow a sigmoidal trend
4. Try to fit each entry of data into a temperature-response relationship and capture the fitting parameters
5. Visualize the data using customizable plot layout, for either all the data or a subset


Proceessing

  1. For better organization of data analysisk, it is highly recommended to use the Project management feature in Rstudio.
    For each project, you are suggested to create a brand new working folder as your local working directory.
    Save the PD derived data files to this working directory folder.
    At the right upper corner of RStudio, click Project -> New Project -> Existing Directory -> browse and choose the designated working directory folder.
    By establishing a specfic project for each experiment, all the data content and execution history will be saved in the project for future re-analysis even when R session is terminated.

  2. Activate mineCETSA package by library it.

library("mineCETSA")
  1. Read in data from result files, ms_rawread() is designed for read in one or many files in a vector of file names.
    By default, PDversion=21 assume the data is searched using PD2.1; otherwise, specify the exact PD version used.
    Refer to Slides "Data searching using PD2.1" for how-to.
    Note that for simplicity, always use channel 126 as control channel for data search in PD, regardless of the labelling arrangement, which would be resolved in the downstream processing as shown below.
    You need to specify a temp (short for temperature) vector that was the heat challenge the samples/cells were exposed to in CETSA treatment, corresponding to the labelling done from 126 up to 131, of which the function default is temp=c(37,40,43,46,49,52,55,58,61,64).
LY <- ms_rawread(c("LY_K562_Ctrl_rep1_PD21_Proteins.txt","LY_K562_Ctrl_rep2_PD21_Proteins.txt","LY_K562_Treatment_rep1_PD21_Proteins.txt","LY_K562_Treatment_rep2_PD21_Proteins.txt"))
  1. Customize the condition names in dataset using ms_conditionrename().
    When read in multiple files, a number suffix is added to the condition of each data file in sequence. This feature seems redundanct but could allow retrieval of file read in sequence and more inportantly make sure the condition names are always unique and distinguishable.
    For proper assignment of condition names, a format of "Experimental conditon.Replicate number" is suggested. More specifically, the following three formats are supported: 1) condition names followed by dot and number (e.g., DMSO.1, 37C.2); 2) condition names followed by dot and rep and number (e.g., DMSO.rep2, 52C.rep1); 3) condition names followed by dot and letter r and number (e.g., Ctrl.r1, dTITTR37C.r2).
    It is recommended that you should try to omit using dot in the experimental condition names, you can segregate your long condition name with underline (_) if necessary.
    The argument incondition corresponses to the vector of current condition naming, whereas outcondition correponses to vector of new condition naming, both order and length should match exactly.
LY1 <- ms_conditionrename(LY, incondition=c("Treatmentrep1.3", "Treatmentrep2.4"), outcondition=c("Treatment.1","Treatment.2"))
  1. Remove proteins without quantitative information by default using ms_clean().
    By default the Keratin family proteins are also removed.
    By default the orphan proteins that only appear in one of the datasets are not removed, however you can specify remsinglecondprot=TRUE, these orphan proteins will be excluded from downstream analysis.
LY_cleaned <- ms_clean(LY1)
  1. Resolve isoform ambiguity from multiple runs of searches using ms_isoform_resolve().
    When two or more isoforms of the same parental proteins are present in the combined datasets but within each individual condition, only one isoform is identified, according to the principle of parsimony, the minor isoforms will be automatically adjusted to be the corresponding major or more frequent oberserved isoform. At the same time, a match table will be suggested to further consolidate the cases that when two or more isoforms of the same parental proteins are identified within the same individual conditioned dataset, which could then be achieved by using ms_isoform_consolidate().
LY_cleaned1 <- ms_isoform_resolve(LY_cleaned)
# The next isoform cleaning step is optional. You are encouraged to double check and adjust the match table entries. 
LY_cleaned2 <- ms_isoform_consolidate(LY_cleaned1, matchtable="./subfolder/tobe_consolidated.txt")
  1. Apply systematic scaling to the dataset using either ms_scaling() or ms_forcescaling().
    The ms_scaling() function achieves: The medians of soluble protein ratios are fitted into a sigmoidal function of the temperature range
    Output in working directory:
  2. Scaling factors used are written into a local file
  3. Medians of soluble protein ratios pre- and post- data normalization is plotted
  4. The distribution of soluble protein ratio pre- and post- normalization is plotted
LY_scaled <- ms_scaling(LY_cleaned2)

The ms_forcescaling() fucntion achieves a systematic scaling by force scaling the medians to be same as the medians from a reference subset/data. However, make sure you have enough good reason to perform forcescaling.

LY_fs_all <- ms_forcescaling(LY_cleaned, refcondition="all")
LY_fs <- ms_forcescaling(LY_cleaned, refcondition=c("Ctrl.1", "Ctrl.2"))
  1. Try the temperature-response function fitting to generate fitting parameters for each entry/curve using ms_fitting().
    The returned data contains the input scaled data plus fitting parameters (including Tm, R2, Slope, Plateau(bottom plateau)).
    Output in working directory:
  2. A copy of the fitted dataset are written into a local file
LY_fitted <- ms_fitting(LY_scaled)
  1. Plot out melt curves using ms_ggplotting().
    This function is previously designed for single replicated experiment, mainly for the simplicity and generality.
    The resulting plots were sorted based on the difference of AUCs of the same protein under different experimental conditions, i.e., the proteins showing most significant "shift"/change would be ranked at the top of the file. However, it has been observed that noisy readings tend to mess up the rankings.
ms_ggplotting(LY_scaled)
  1. Plot out replicated melt curves using ms_ggplotting_rep().
    This function mainly cater for the (technical or biological) replicated runs of CETSA melt curves, the ranking system not only considers the apparent shifts between different experimental conditions, but also the internal reproducibility among the replicated runs.
    The arguments and its default value include:
    • nread=10, how many readings per curve (from lowest heating temperature) were used for plotting, default is 10
    • levelvector, a vector of conditions to be specified, preferably starting with Ctrl or Reference sets
    • remsinglecondprot=TRUE, whether orphan proteins to be plotted, default to exclude them from plotting
    • nreplicate=2, how many replicates there are for each condition, default is duplicate thus 2, change the value accordingly, this is relavent to automatic coloring scheme, up to 4 is possible for now
    • presetcolor=TRUE, whether automatically apply the preset coloring scheme for each condition, otherwise, provide a vector for example colorpanel=c("black", "red", "blue", "orange") to customize the colors
    • pfdatabase=FALSE, only set to TRUE when the input dataset is searched with malaria database
    • layout=c(5,5), the default multiplot layout is 5 times 5 in each page
    • pdfheight=12, set the height of pdf file
    • pdfwidth=12, set the width of pdf file
ms_ggplotting_rep(LY_scaled, levelvector=c("Ctrl.1", "Ctrl.2", "Treatment.1","Treatment.2"))
ms_ggplotting_rep(LY_fitted, levelvector=c("Ctrl.1", "Ctrl.2", "Treatment.1","Treatment.2"))
  1. Export the dataset (in the current working space) using ms_filewrite() to a local txt file.
ms_filewrite(LY_scaled, "CETSA_data_scaled.txt")
ms_filewrite(LY_fitted, "CETSA_data_with_fitting_parameters.txt")
  1. Read in the dataset using ms_fileread() from a saved txt file.
MyCETSAdata <- ms_fileread("./subfolder_name/myCETSAdata_to_readin.txt")
# . indicate the current working directory
# Here the subfolder_name refer to a first level folder name directly under current working directory
# Remember that typically now you are still working in the main working directory
# The subfolder should contain the data file to read in
  1. Focus on a list of target proteins to study using ms_subsetting().
    You need to first make an excel file with a list of Uniprot IDs you want to study (for example, P00000, P12345-6) under the header “id”, then save as txt file in working directory, say under the name “hit_list.txt”. The alternative for a short Uniprot ID list, you can simply provide a vector of Uniprot IDs.
    The arguments and its default value include:
    • isfile=TRUE, whether the provided hitidlist is in a txt. file, default set to TRUE
    • allisoform=TRUE, whethre to retrive other isoforms of the same parental Uniprot ID, default set to TRUE
    • revsel=FALSE, short for reverse selection, when set to TRUE, the output is the dataset after removing the targets, default set to FALSE
CETSAdata_subset <- ms_subsetting(LY_fitted, hitidlist="hit_list.txt", isfile=TRUE)
CETSAdata_subset <- ms_subsetting(LY_fitted, hitidlist=c("P00000", "P12345-6"), isfile=FALSE)


nkdailingyun/mineCETSA documentation built on Feb. 27, 2021, 8:26 p.m.