Prerequisites

The DIAgui package is currently on github, so for installing it you'll also need the "devtools" package. You can install the DIAgui package with this commands:

if(!requireNamespace("devtools", quietly = TRUE)){ #check if you already have the devtools package
 install.packages("devtools")  #if not, install it
}
devtools::install_github("marseille-proteomique/DIAgui")

You can access the same informations (and the source code) on the github repository


Set up

  1. For better organization of data analysis, it is highly recommended to use the Project management feature in Rstudio.
    For each project, you are suggested to create a brand new working folder as your local working directory.
    When you will use the package and/or the app inside this directory, all your results will be saved in this directory. Indeed, when you will load the package, it will save the path of your working directory under the variable WD. You can modify it if you want to quickly change your saving directory, but only do it if you are sure and of course if your file exists.

  2. Activate DIAgui package by library it.

library("DIAgui")

Analysis

Run the shiny application


runDIAgui()      #this function will directly start the app

If you want more informations on how to use the app, I recommend you to watch the tutorial.

An overview from the functions


$\underline{1.\space The\space data\space}$

DIAgui contains a small report from DIA-NN, which is also available in diann-rpcakge from V. Demichev. This report was obtained from proteomics data from the SACCHAROMYCES CEREVISIAE.

report <- small_report
View(report) #take a look at the data

With this small dataset, you can easily test the functions from DIAgui package and also save it as a tsv file to test the shiny application.

readr::write_tsv(report, "small_report.tsv") # save the data as a tsv file

$\underline{2.\space Analyze\space DIA-NN\space report\space}$

If you want you can try the analysis on you own file. If so, you can import it with diann_load function:

report <- diann_load("small_report.tsv") 

From this file, you can extract quantitative information on either the precursor level, peptide level, protein level or gene level. On the peptide level, you have the choice to use the MaxLFQ algorithm or not. For the protein level, you will have to apply the MaxLFQ algorithm.

First we will extract the quantitative data on the precursor level. For this, we will need the diann_matrix function. It will filter the data according specific q-values and extract the quantification for each identifier (here the precursors). For more information, you can check the documentation of the function diann_matrix.

precursor <- diann_matrix(report, proteotypic.only = TRUE, method = "max")

head(precursor)  # check the results

We can apply the same method on the peptide level; we just need to precise the new id.header. You can either choose "Stripped.Sequence" or "Modified.Sequence". In order to avoid any duplicates and not loose any informations, I strongly advised to only use the identifier "Modified.Sequence"

peptide <- diann_matrix(report, id.header = "Modified.Sequence",
                        proteotypic.only = TRUE, method = "max")

head(peptide)  # check the results

You can also choose to use the MaxLFQ algorithm to extract the quantification at the peptide level. The iq R package is the fastest way to apply this algorithm. Before running it, we will need to filter our data according q-values.

peptide_maxlfq <- report %>% dplyr::filter(Q.Value <= 0.01 & PG.Q.Value <= 0.01 & Protein.Q.Value <= 1 & GG.Q.Value <= 1)
peptide_maxlfq <- iq::preprocess(peptide_maxlfq,
                            intensity_col = "Precursor.Normalised",
                            primary_id = "Modified.Sequence.",
                            sample_id  = "File.Name",
                            secondary_id = "Precursor.Id",
                            median_normalization = FALSE,
                            pdf_out = NULL)
peptide_maxlfq <- iq::fast_MaxLFQ(peptide_maxlfq)
peptide_maxlfq <- peptide_maxlfq$estimate
peptide_maxlfq <- as.data.frame(peptide_maxlfq)     

head(peptide_maxlfq)  # check the results

You can also use the diann_maxlfq function which calls C++ script which equivalent to the function from the diann-rpcakge from V. Demichev. It gives equivalent results but it is 10 times slower.

peptide_maxlfq <- report %>% dplyr::filter(Q.Value <= 0.01 & PG.Q.Value <= 0.01 & Protein.Q.Value <= 1 & GG.Q.Value <= 1)
peptide_maxlfq <- diann_maxlfq(peptide_maxlfq,
                          group.header = "Modified.Sequence",
                          id.header = "Precursor.Id",
                          quantity.header = "Precursor.Normalised",
                          count_pep = FALSE
        )   

head(peptide_maxlfq)  # check the results

You have the exact same two options for the protein level. However, we will focus here on the use of diann_matrix as iq R package is already well documented. We will see how to extract the number of peptides used for the quantification and estimate abundances with the Top3 and iBAQ methods. This step is meant for people who are already confident with R and don't want to use the shiny app.

protein <- re %>% report %>% dplyr::filter(Q.Value <= 0.01 & PG.Q.Value <= 0.01 & Protein.Q.Value <= 1 & GG.Q.Value <= 1)
n_cond <- length(unique(df$File.Name))

protein_maxlfq <- diann_maxlfq(protein,
                  group.header="Protein.Group",
                  id.header = "Precursor.Id",
                  quantity.header = "Precursor.Normalised",
                  only_countsall = FALSE,
                  Top3 = TRUE
                  )

# extract useful information from report
nc <- ncol(protein_maxlfq)
protein_maxlfq$Protein.Group <- rownames(protein_maxlfq)
rownames(protein_maxlfq) <- 1:nrow(protein_maxlfq)
protein <- protein[(protein$Protein.Group %in% protein_maxlfq$Protein.Group),]
protein <- protein[order(protein$Protein.Group),]
protein_maxlfq$Protein.Names <- unique(protein[,c("Protein.Group", "Protein.Names")])$Protein.Names
protein_maxlfq$First.Protein.Description <- unique(protein[,c("Protein.Group", "First.Protein.Description")])$First.Protein.Description
protein_maxlfq$Genes <- unique(protein[,c("Protein.Group", "Genes")])$Genes
protein_maxlfq <- protein_maxlfq[,c((nc+1):ncol(protein_maxlfq), 1:nc)]

# get iBAQ quantification
protein_seq <- getallseq(pr_id = protein_maxlfq$Protein.Group,
                         spec = "SACCHAROMYCES CEREVISIAE")
# here, the function makes a query to swissprot to get the amino-acid sequence from each protein, so it can be long
# you can also put one or several FASTA files (go check getallseq documentation)

# to compute iBAQ quantification you'll the raw intensities
raw <- diann_matrix(report, id.header = "Protein.Group",
                             quantity.header = "Precursor.Quantity",
                             method = "sum")
raw$Genes <- NULL
raw$Protein.Names <- NULL
# compute iBAQ quantification 
raw <- get_iBAQ(raw, proteinDB = protein_seq,
                id_name = "Protein.Group",
                ecol = 2:(n_cond+1),
                peptideLength = c(5,36),
                proteaseRegExp = DIAgui:::getProtease("trypsin"),
                log2_transformed = FALSE)

raw <- raw[,-c(2:(n_cond+1))]

protein_maxlfq <- dplyr::left_join(protein_maxlfq, raw, by = "Protein.Group")


head(protein_maxlfq)  # check the results

Finally, you can also extract the quantification on the gene level with diann_matrix with which you can extract the number of peptides used for the quantification and the Top3 quantification.

genes <- diann_matrix(report,
                   id.header="Genes",
                   quantity.header="Genes.MaxLFQ.Unique",
                   proteotypic.only = TRUE,
                   get_pep = TRUE, only_pepall = TRUE,
                   Top3 = TRUE,
                   method = "max") 

head(genes)  # check the results

As you can see, it can be a lot of writing. That is why we also provide a function to all this step at once, using the iq package to be faster to compute the MaxLFQ algorithm. It also plot a density plot, an MDS plot and the retention time vs the i-retention time. The results are save in one folder and the data can be saved as an xlsx, txt or csv file. For example, if you want to extract the protein.group quantification and the iBAQ from your file:

report_process("small_report.tsv", # needs to be path to a report file
               get_iBAQ = TRUE, get_Top3 = FALSE,
               species = "SACCHAROMYCES CEREVISIAE")

Once you obtained the data you wanted, you can save it as csv, txt or xlsx file. You can always use these files in the shiny app to visualize your data.

$\underline{3.\space Data\space visualization\space}$

DIAgui contains several functions to check the quality of your data. But first they need to be in the right format. Let's take the genes dataset. Your data needs to have a columns that corresponds to the id you're interested in (in our example the genes). If you don't, the rownames will be taken but the rownames have to be non numerical. To avoid potential confusion, the best is to name one column 'id':

names(genes)[1] <- "id"

You can either plot a density plot, an MDS plot or an interactive heatmap:

# density plot
densityDIA(genes, transformation =  "log2", area = TRUE, data_type = "intensity", tit = "My plot")

# MDS plot
MDS_DIA(genes, transformation =  "log2", data_type = "intensity", tit = "My plot")

# interactive heatmap
heatmapDIA(genes, transformation = "log2", print_val = FALSE, data_type = "intensity")

You can also visualize the proportion of non-missing values according the percentage of identifiers (here the genes) for an accepted specific percentage of missing value:

validDIA(genes, "log2", data_type = "intensity", prop_cut = 0.75) # we keep approximately 90% of the genes

validDIA(genes, "log2", data_type = "intensity", prop_cut = 0.3) # we keep all genes

For the last function, get_bestwind, you'll need the report-lib file from the DIA-NN's outputs; it needs to have "FileName" and "ProductMz" columns. From this file (so from a first experiment), you can extract what would be the best m/z windows. For example, if the m/z range is between 0 and 1000; you could divide this range in 10 windows of same width, however it might not be the optimal one to identify a maximum of protein. For more details, you can check the documentation from the function.



mgerault/DIAgui documentation built on Nov. 7, 2024, 12:12 a.m.