addIdentificationData-methods: Adds Identification Data
In MSnbase: Base Functions and Classes for Mass Spectrometry and Proteomics

These methods add identification data to a raw MS experiment (an "MSnExp" object) or to quantitative data (an "MSnSet" object). The identification data needs to be available as an mzIdentML file (passed as filenames or directly as an identification object) or, alternatively, can be passed as an arbitrary data.frame. See details in the Methods section.

The featureData slot of an "MSnExp" or "MSnSet" instance provides only one row per MS2 spectrum, but the mapping between spectra and identifications is not always one-to-one. Prior to addition, the identification data is filtered as documented in the filterIdentificationDataFrame function: (1) only PSMs matching the regular (non-decoy) database are retained; (2) PSMs of rank greater than 1 are discarded; and (3) only proteotypic peptides are kept.

If, after filtering, more than one PSM per spectrum is still present, these are combined (reduced, see reduce,data.frame-method) into a single row, with their values separated by semicolons. As a side effect, the feature variables that are reduced are converted to characters. See the reduce manual page for examples.
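The filtering and reduction steps described above can be sketched in base R. This is only an illustration on an invented PSM table, not the code used by filterIdentificationDataFrame or reduce; the column names follow the defaults documented for filenames and mzRident objects, and "proteotypic" is assumed here to mean a peptide sequence that matches a single accession.

```r
## Toy PSM table (invented values, for illustration only)
psms <- data.frame(
    spectrumID     = c("s1", "s1", "s2", "s2", "s3"),
    sequence       = c("PEPA", "PEPB", "PEPC", "PEPD", "PEPE"),
    DatabaseAccess = c("P1", "P2", "P3", "P4", "P5"),
    rank           = c(1, 1, 1, 2, 1),
    isDecoy        = c(FALSE, FALSE, FALSE, FALSE, TRUE),
    stringsAsFactors = FALSE)

## (1) drop decoy hits and (2) keep only rank 1 PSMs
kept <- psms[!psms$isDecoy & psms$rank == 1, ]

## (3) keep peptides whose sequence maps to a single accession
##     (assumed definition of proteotypic for this sketch)
nacc <- tapply(kept$DatabaseAccess, kept$sequence,
               function(x) length(unique(x)))
kept <- kept[nacc[kept$sequence] == 1, ]

## Reduce: collapse the remaining PSMs of each spectrum into one row,
## separating distinct values with a semicolon
collapse <- function(x) paste(unique(x), collapse = ";")
reduced <- aggregate(kept[, c("sequence", "DatabaseAccess", "rank")],
                     by = list(spectrumID = kept$spectrumID),
                     FUN = collapse)
reduced
```

In this sketch, spectrum s1 keeps two PSMs after filtering, so its sequence and accession values are pasted together, and the numeric rank column becomes a character column, which mirrors the side effect mentioned above.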

See also the section about identification data in the MSnbase-demo vignette for details and additional examples.

After addition of the identification data, new feature variables are created. The column nprot contains the number of members in the protein group; the columns accession and description contain a semicolon-separated list of all matches. The columns npsm.prot and npep.prot give the number of PSMs and peptides that were matched to a particular protein group. The column npsm.pep indicates how many PSMs were attributed to a peptide (as defined by its sequence pepseq). All these values are recalculated after filtering and reduction.
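To make the meaning of these counts concrete, the following base R sketch derives them from a small invented table. It follows the definitions given above but is not taken from the MSnbase sources; the data frame ids and its values are hypothetical.

```r
## Toy identification table after filtering and reduction (invented values)
ids <- data.frame(
    pepseq    = c("PEPA", "PEPA", "PEPB", "PEPC"),
    accession = c("P1", "P1", "P1", "P2;P3"),
    stringsAsFactors = FALSE)

## nprot: number of members in each semicolon-separated protein group
ids$nprot <- lengths(strsplit(ids$accession, ";", fixed = TRUE))

## npsm.prot: number of PSMs (rows) matched to each protein group
ids$npsm.prot <- ave(seq_len(nrow(ids)), ids$accession, FUN = length)

## npep.prot: number of distinct peptide sequences per protein group
## (peptides are recoded as integers so that ave() returns integers)
pep_id <- match(ids$pepseq, unique(ids$pepseq))
ids$npep.prot <- ave(pep_id, ids$accession,
                     FUN = function(x) length(unique(x)))

## npsm.pep: number of PSMs attributed to each peptide sequence
ids$npsm.pep <- ave(seq_len(nrow(ids)), ids$pepseq, FUN = length)

ids
```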

signature(object = "MSnExp", id = "character", ...): Adds the identification data stored in mzIdentML files to an "MSnExp" instance. The method handles one or multiple mzIdentML files provided via id. id has to be a character vector of valid filenames. See below for additional arguments.
signature(object = "MSnExp", id = "mzID", ...): Same as above but id is an mzID object generated by mzID::mzID. See below for additional arguments.
signature(object = "MSnExp", id = "mzIDCollection", ...): Same as above but id is an mzIDCollection object. See below for additional arguments.
signature(object = "MSnExp", id = "mzRident", ...): Same as above but id is an mzRident object generated by mzR::openIdfile. See below for additional arguments.
signature(object = "MSnExp", id = "data.frame", ...): Same as above but id is a data.frame. See below for additional arguments.
signature(object = "MSnSet", id = "character", ...): Adds the identification data stored in mzIdentML files to an "MSnSet" instance. The method handles one or multiple mzIdentML files provided via id. id has to be a character vector of valid filenames. See below for additional arguments.
signature(object = "MSnSet", id = "mzID", ...): Same as above but id is an mzID object. See below for additional arguments.
signature(object = "MSnSet", id = "mzIDCollection", ...): Same as above but id is an mzIDCollection object. See below for additional arguments.
signature(object = "MSnSet", id = "data.frame", ...): Same as above but id is a data.frame. See below for additional arguments.

The methods above take the following additional arguments. These need to be set when adding identification data as a data.frame. In all other cases, the defaults are set automatically.

fcol: The matching between the features (raw spectra or quantitative features) and identification results is done by matching columns in the feature data (the featureData slot) and the identification data. These values are the spectrum file index and the acquisition number, passed as a character of length 2. The default values for these variables in the object's feature data are "spectrum.file" and "acquisition.num". Values need to be provided when id is a data.frame.
icol: The default values for the spectrum file and acquisition numbers in the identification data (the id argument) are "spectrumFile" and "acquisitionNum". Values need to be provided when id is a data.frame.
acc: The protein (group) accession number or identifier. Defaults are "DatabaseAccess" when passing filenames or mzRident objects and "accession" when passing mzID or mzIDCollection objects. A value needs to be provided when id is a data.frame.
desc: The protein (group) description. Defaults are "DatabaseDescription" when passing filenames or mzRident objects and "description" when passing mzID or mzIDCollection objects. A value needs to be provided when id is a data.frame.
pepseq: The peptide sequence variable name. Defaults are "sequence" when passing filenames or mzRident objects and "pepseq" when passing mzID or mzIDCollection objects. A value needs to be provided when id is a data.frame.
key: The key to be used when the identification data need to be reduced (see details section). Defaults are "spectrumID" when passing filenames or mzRident objects and "spectrumid" when passing mzID or mzIDCollection objects. A value needs to be provided when id is a data.frame.
decoy: The feature variable used to define whether the PSM was matched in the decoy or the regular fasta database for PSM filtering. Defaults are "isDecoy" when passing filenames or mzRident objects and "isdecoy" when passing mzID or mzIDCollection objects. A value needs to be provided when id is a data.frame. See filterIdentificationDataFrame for details.
rank: The feature variable used to define the rank of the PSM for filtering. Default is "rank". A value needs to be provided when id is a data.frame. See filterIdentificationDataFrame for details.
accession: The feature variable used to define the protein (group) accession or identifier for PSM filtering. Default is to use the same value as acc. A value needs to be provided when id is a data.frame. See filterIdentificationDataFrame for details.
verbose: A logical defining whether to print out messages or not. Default is to use the session-wide option from isMSnbaseVerbose.
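The role of fcol and icol can be pictured as a two-column join between the feature data and the identification data. The base R sketch below illustrates that idea with invented tables; it is not the MSnbase implementation, and it assumes that both tables already encode the spectrum file with the same index. The helper column feature is hypothetical and only used to restore the original row order.

```r
## Feature data of the object: one row per spectrum (invented values)
fd <- data.frame(spectrum.file   = c(1, 1, 1),
                 acquisition.num = c(1, 2, 3),
                 row.names = c("F1.S1", "F1.S2", "F1.S3"))

## Identification data: spectrum 2 has no identification
id <- data.frame(spectrumFile   = c(1, 1),
                 acquisitionNum = c(1, 3),
                 sequence       = c("PEPA", "PEPB"),
                 stringsAsFactors = FALSE)

fcol <- c("spectrum.file", "acquisition.num")  # columns in the feature data
icol <- c("spectrumFile", "acquisitionNum")    # columns in the identification data

## Left join on (file, acquisition number); keep every feature
fd$feature <- rownames(fd)
merged <- merge(fd, id, by.x = fcol, by.y = icol, all.x = TRUE, sort = FALSE)

## Restore the original feature order and row names
merged <- merged[match(fd$feature, merged$feature), ]
rownames(merged) <- merged$feature
merged$feature <- NULL
merged
```

Features without a matching identification keep their row and receive NA in the identification columns.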

Sebastian Gibb <mail@sebastiangibb.de> and Laurent Gatto

filterIdentificationDataFrame for the function that filters identification data, readMzIdData to read the identification data as an unfiltered data.frame, and reduce,data.frame-method to reduce it to a data.frame that contains only one PSM per row.

library("MSnbase")

## find path to an mzXML file
quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.names = TRUE, pattern = "mzXML$")
## find path to an mzIdentML file
identFile <- dir(system.file(package = "MSnbase", dir = "extdata"),
                 full.names = TRUE, pattern = "dummyiTRAQ.mzid")

## create basic MSnExp
msexp <- readMSData(quantFile)

## add identification information
msexp <- addIdentificationData(msexp, identFile)

## access featureData
fData(msexp)

## summarise the identification data (one row per spectrum/identification file pair)
idSummary(msexp)

This is MSnbase version 2.16.0 
  Visit https://lgatto.github.io/MSnbase/ to get started.


      spectrum acquisition.number          sequence chargeState rank
F1.S1        1                  1 VESITARHGEVLQLRPK           3    1
F1.S2        2                  2     IDGQWVTHQWLKK           3    1
F1.S3        3                  3              <NA>          NA   NA
F1.S4        4                  4              <NA>          NA   NA
F1.S5        5                  5           LVILLFR           2    1
      passThreshold experimentalMassToCharge calculatedMassToCharge peptideRef
F1.S1          TRUE                 645.3741               645.0375       Pep2
F1.S2          TRUE                 546.9586               546.9633       Pep1
F1.S3            NA                       NA                     NA       <NA>
F1.S4            NA                       NA                     NA       <NA>
F1.S5          TRUE                 437.8040               437.2997       Pep4
      modNum isDecoy post  pre start end DatabaseAccess DBseqLength DatabaseSeq
F1.S1      0   FALSE    A    R   170 186        ECA0984         231            
F1.S2      0   FALSE    A    K    50  62        ECA1028         275            
F1.S3     NA      NA <NA> <NA>    NA  NA           <NA>          NA        <NA>
F1.S4     NA      NA <NA> <NA>    NA  NA           <NA>          NA        <NA>
F1.S5      0   FALSE    L    K    22  28        ECA0510         166            
                                                             DatabaseDescription
F1.S1                                        ECA0984 DNA mismatch repair protein
F1.S2 ECA1028 2,3,4,5-tetrahydropyridine-2,6-dicarboxylate N-succinyltransferase
F1.S3                                                                       <NA>
F1.S4                                                                       <NA>
F1.S5           ECA0510 putative capsular polysacharide biosynthesis transferase
      scan.number.s.          idFile MS.GF.RawScore MS.GF.DeNovoScore
F1.S1              1 dummyiTRAQ.mzid            -39                77
F1.S2              2 dummyiTRAQ.mzid            -30                39
F1.S3             NA            <NA>             NA                NA
F1.S4             NA            <NA>             NA                NA
F1.S5              5 dummyiTRAQ.mzid            -42                 5
      MS.GF.SpecEValue MS.GF.EValue modPeptideRef modName modMass modLocation
F1.S1     5.527468e-05     79.36958          <NA>    <NA>      NA          NA
F1.S2     9.399048e-06     13.46615          <NA>    <NA>      NA          NA
F1.S3               NA           NA          <NA>    <NA>      NA          NA
F1.S4               NA           NA          <NA>    <NA>      NA          NA
F1.S5     2.577830e-04    366.38422          <NA>    <NA>      NA          NA
      subOriginalResidue subReplacementResidue subLocation nprot npep.prot
F1.S1               <NA>                  <NA>          NA     1         1
F1.S2               <NA>                  <NA>          NA     1         1
F1.S3               <NA>                  <NA>          NA    NA        NA
F1.S4               <NA>                  <NA>          NA    NA        NA
F1.S5               <NA>                  <NA>          NA     1         1
      npsm.prot npsm.pep
F1.S1         1        1
F1.S2         1        1
F1.S3        NA       NA
F1.S4        NA       NA
F1.S5         1        1
      spectrumFile          idFile coverage
1 dummyiTRAQ.mzXML dummyiTRAQ.mzid      0.6