Get UniProt Protein Info (GUPPI)

Process TDPortal and ProSightPD top-down reports (.tdReport files) by retrieving information from UniProt and filtering by false discovery rate (FDR). TDViewer is also highly recommended for viewing top-down reports.

The easiest way to use GUPPI is through the GUPPI web application.

Installation

Install from GitHub with:

remotes::install_github("davidsbutcher/GUPPI")

Running with Docker

A pre-built Docker image is available on Docker Hub. Alternatively, after downloading the source code, the Dockerfile found in /inst/shiny can be used to build the Docker image.

Docker on Windows and macOS

To download the image and start a container, install Docker Desktop, start it, and run the following commands in the appropriate shell (e.g. Windows PowerShell). NOTE: Check Docker Hub for the latest release version before pulling an image.

docker pull davidsbutcher/guppi:release6

docker run -p 3838:3838 davidsbutcher/guppi:release6

The application can be accessed in a web browser at http://localhost:3838 after the Docker container is started.

Input

The processing of tdReports is carried out by the guppi() function. An example of running the function:

library(GUPPI)

guppi(
   "C:/Users/David Butcher/TDReports",
   c(
      "20200420_Excellent_TDReport_01.tdReport",
      "20200420_Excellent_TDReport_02.tdReport"
   ),
   83333,
   GOLocType = "bacteria",
   fractionAssignments = NULL,
   outputdir = "C:/Users/David Butcher/guppi_output",
   fdr = 0.01,
   saveOutput = TRUE,
   makeDashboard = TRUE
)

Arguments to the guppi function are as follows:

Mandatory arguments

Optional arguments

Analysis of tdReports

A connection is established to the SQLite database in the tdReport using RSQLite. All protein- and proteoform-level IDs and other relevant data for each ID are extracted. The taxon number is checked against files in the package directory to see whether a corresponding UniProt taxon database has already been downloaded. The following UniProt databases are included with the package: Escherichia coli, Saccharomyces cerevisiae, Chlamydomonas reinhardtii, Mycoplasma genitalium, Caenorhabditis elegans, Homo sapiens, and Mus musculus.
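Because a tdReport is an SQLite file, it can also be opened directly for inspection. The following is a minimal sketch, not GUPPI's internal code: the helper name list_tdreport_tables is hypothetical, and it only lists table names, since the tdReport schema is not described here. It assumes the DBI and RSQLite packages are installed.

```r
library(DBI)

# Hypothetical helper (not part of GUPPI): open a tdReport as an SQLite
# database and return the names of the tables it contains.
list_tdreport_tables <- function(path) {
  # dbConnect() would silently create an empty database for a missing
  # file, so check that the file exists first.
  stopifnot(file.exists(path))
  con <- DBI::dbConnect(RSQLite::SQLite(), path)
  # Close the connection when the function exits, even on error.
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbListTables(con)
}

# Example call, using a path from the Input section:
# list_tdreport_tables(
#   "C:/Users/David Butcher/TDReports/20200420_Excellent_TDReport_01.tdReport"
# )
```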

If a database is not available, the UniProt web service is queried for all UniProt accession numbers in the taxon using the package UniProt.ws. Protein name, organism, organism taxon ID, protein sequence, protein function, subcellular location, and any associated GO IDs are returned. Note that some of these values may not be found and come back as empty or NA. This process can take a long time, owing to limitations with the UniProt web service.

The UniProt taxon database is used to add information for all IDs extracted from the tdReport. GO terms are obtained for all GO IDs using the GO.db package and terms corresponding to subcellular locations are saved in column “GO_subcellular_locations”.
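As an illustration of the GO lookup, the sketch below maps a few GO IDs to their terms with GO.db and keeps only those in the cellular component ontology. Treating the cellular component ontology as the source of subcellular locations is an assumption made for this example, and the variable names are hypothetical; GUPPI's own selection logic may differ. It assumes the Bioconductor package GO.db is installed.

```r
library(GO.db)  # Bioconductor annotation package; attaches AnnotationDbi

# Example GO IDs chosen for illustration only.
go_ids <- c("GO:0005737", "GO:0005886", "GO:0006412")

# Look up the term name and ontology (BP, MF, or CC) for each GO ID.
go_terms <- AnnotationDbi::select(
  GO.db,
  keys = go_ids,
  columns = c("TERM", "ONTOLOGY"),
  keytype = "GOID"
)

# Assumption: keep only cellular component (CC) terms as locations.
cc_terms <- go_terms[go_terms$ONTOLOGY == "CC", ]

# Collapse the location terms into a single string per entry.
locations <- paste(cc_terms$TERM, collapse = "; ")
print(locations)
```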

For each proteoform, the minimum Q value among all hits, the average and monoisotopic masses, and the data file containing the lowest-Q-value hit are obtained. Proteoforms whose Q values are above the FDR cutoff are removed, as are proteoforms whose corresponding protein entry is above the FDR cutoff.
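The filtering step can be sketched in base R on a toy table. The data and column names below (proteoform_id, protein_id, q_value, data_file) are invented for illustration and do not reflect the tdReport schema or GUPPI's internal code.

```r
# Toy proteoform-level hits; all values and column names are hypothetical.
hits <- data.frame(
  proteoform_id = c("PFR1", "PFR1", "PFR2", "PFR3", "PFR3"),
  protein_id    = c("P1", "P1", "P1", "P2", "P2"),
  q_value       = c(0.020, 0.004, 0.030, 0.002, 0.008),
  data_file     = c("run_a", "run_b", "run_a", "run_a", "run_b"),
  stringsAsFactors = FALSE
)

# Toy protein-level Q values, also hypothetical.
protein_q <- data.frame(
  protein_id = c("P1", "P2"),
  q_value    = c(0.005, 0.020),
  stringsAsFactors = FALSE
)

fdr <- 0.01

# For each proteoform, keep the hit with the lowest Q value
# (and therefore the data file that hit came from).
ordered_hits <- hits[order(hits$proteoform_id, hits$q_value), ]
best_hits <- ordered_hits[!duplicated(ordered_hits$proteoform_id), ]

# Attach the Q value of the corresponding protein entry.
best_hits$protein_q <- protein_q$q_value[
  match(best_hits$protein_id, protein_q$protein_id)
]

# Drop proteoforms whose own Q value, or whose protein's Q value,
# is missing or above the cutoff.
kept <- best_hits[
  !is.na(best_hits$q_value) & best_hits$q_value <= fdr &
    !is.na(best_hits$protein_q) & best_hits$protein_q <= fdr,
]
print(kept)
```

In this toy data only PFR1 survives: PFR2 fails the proteoform-level cutoff, and PFR3 passes it but belongs to a protein whose Q value is above the cutoff.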

Output

The “main” output includes Q values, observed precursor masses, data files, subcellular locations from the GO database, and a variety of other parameters for all protein and proteoform IDs. All protein and proteoform IDs whose Q values are missing or greater than the cutoff value (fdr) are removed.

Output files are saved to the output directory (outputdir). Files are timestamped with the time the script was initialized or share the same name as the input file.

Dependencies

Package dependencies are listed in the Imports section of the DESCRIPTION file and include packages from CRAN, Bioconductor, and GitHub.

License and attribution

Package developed by David S. Butcher and available under the MIT license.



davidsbutcher/GUPPI documentation built on Feb. 26, 2021, 5:41 a.m.