

The queryup R package facilitates retrieving information from the UniProt database from within R. It accesses the database programmatically by submitting queries to the UniProt REST API.

Install

You can install the package from CRAN using:

install.packages("queryup")

Alternatively, you can install the package from GitHub using devtools:

devtools::install_github("VoisinneG/queryup")

Queries

Queries combine different fields to identify matching database entries. Queries are submitted using the function query_uniprot(). In queryup, a query must be formatted as a list of character vectors, each named after an existing UniProt query field (available query fields are listed in the API documentation and in the package data query_fields$field). An entry must match all query fields simultaneously. For instance, the following query uses the field gene_exact to return the UniProt entries of all proteins encoded by the gene Pik3r1:

library(queryup)
query <- list("gene_exact" = "Pik3r1")
df <- query_uniprot(query, show_progress = FALSE)
head(df)

Available query fields can be listed using the package data query_fields:

query_fields$field

Columns

By default, query_uniprot() returns a data.frame with UniProt accession IDs, gene names, organism, and Swiss-Prot review status. Use the columns parameter to choose which data columns to retrieve:

df <- query_uniprot(query, 
                    columns = c("id", "sequence", "keyword", "gene_primary"),
                    show_progress = FALSE)
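To make the link between a query, the requested columns, and the underlying REST request more concrete, here is a minimal sketch that builds a search URL by hand. The helper build_search_url() is hypothetical, and the endpoint and the parameter names (query, fields, format) are assumptions about the public UniProt REST API; this is not necessarily how queryup builds its requests.

```r
# Illustrative only: build a UniProtKB search URL by hand.
# The endpoint and parameter names are assumptions about the public
# UniProt REST API, not a description of queryup's internals.
build_search_url <- function(query_string, columns) {
  base_url <- "https://rest.uniprot.org/uniprotkb/search"
  paste0(
    base_url,
    # Percent-encode reserved characters such as ":" in the query string.
    "?query=", utils::URLencode(query_string, reserved = TRUE),
    # Requested return fields are passed as a comma-separated list.
    "&fields=", paste(columns, collapse = ","),
    "&format=tsv"
  )
}

url <- build_search_url(
  query_string = "gene_exact:Pik3r1",
  columns = c("id", "sequence", "keyword", "gene_primary")
)
print(url)
```

In practice query_uniprot() takes care of this step, so the sketch is only useful for understanding or debugging what a request might look like.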

All available columns are listed in the API documentation and in the package data return_fields:

head(return_fields)

Note that the values passed to the columns parameter do not necessarily match the names of the corresponding columns in the output data frame (they correspond to the columns "field" and "label", respectively, of the package data return_fields).

names(df)
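The field-to-label correspondence can be looked up with match(). The sketch below uses a small toy table, toy_return_fields, as a stand-in for the package data return_fields so that it runs without the package; its rows are illustrative assumptions, and only the column names "field" and "label" come from the description above.

```r
# Toy stand-in for the package data `return_fields` (hypothetical rows;
# consult the real table for the actual field/label pairs).
toy_return_fields <- data.frame(
  field = c("id", "sequence", "keyword", "gene_primary"),
  label = c("Entry Name", "Sequence", "Keywords", "Gene Names (primary)"),
  stringsAsFactors = FALSE
)

columns <- c("sequence", "keyword")

# Look up the output column label for each requested field.
labels <- toy_return_fields$label[match(columns, toy_return_fields$field)]
print(setNames(labels, columns))
```

With the package loaded, the same lookup should work by replacing toy_return_fields with return_fields.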

Let's check the sequence and the UniProt keywords of the first entry:

as.character(df$Sequence[1])
as.character(df$Keywords[1])

Combining query fields

Our first query returned many matches. We can build more specific queries by using more than one query field. By default, matching entries must satisfy all query fields simultaneously. Let's retrieve the only Swiss-Prot reviewed protein entry encoded by gene Pik3r1 in Homo sapiens (taxon: 9606):

query <- list("gene_exact" = "Pik3r1", 
              "reviewed" = "true", 
              "organism_id" = "9606")
df <- query_uniprot(query, show_progress = FALSE)
print(df)

Multiple items per query field

It is also possible to look for entries that match any of several items within a single query field. Items within a given query field are matched independently, so an entry needs to match only one of them. Hence, the following query will return all Swiss-Prot reviewed proteins encoded by either Pik3r1 or Pik3r2 in either Mus musculus (taxon: 10090) or Homo sapiens (taxon: 9606):

query <- list("gene_exact" = c("Pik3r1", "Pik3r2"), 
              "reviewed" = "true", 
              "organism_id" = c("9606", "10090"))
df <- query_uniprot(query, show_progress = FALSE)
print(df)
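The matching logic described above (all fields must match, any item within a field may match) can be pictured as a boolean expression. The following sketch writes it out with a hypothetical helper, build_query_string(), which is not part of queryup; the field:value, AND, and OR notation is assumed to follow the UniProt query syntax.

```r
# Illustrative only: turn a named list into a boolean query string.
# `build_query_string()` is a hypothetical helper, not a queryup function.
build_query_string <- function(query) {
  clauses <- vapply(names(query), function(field) {
    items <- paste0(field, ":", query[[field]])
    # Items within one field are alternatives (OR).
    paste0("(", paste(items, collapse = " OR "), ")")
  }, character(1))
  # Different fields must all match (AND).
  paste(clauses, collapse = " AND ")
}

query <- list("gene_exact" = c("Pik3r1", "Pik3r2"),
              "reviewed" = "true",
              "organism_id" = c("9606", "10090"))
query_string <- build_query_string(query)
print(query_string)
```

Reading the printed expression makes it clear why adding a field narrows the result set while adding an item to a field widens it.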

Queries with invalid entries

If a query containing invalid entries is sent to the UniProt REST API, an error message is returned and no information about the other, potentially valid, entries can be retrieved. To overcome this limitation, queryup parses the error messages and removes invalid entries from the query. Hence, query_uniprot() will return information for valid entries only:

invalid_ids <- c("P226", "CON_P22682", "REV_P47941")
valid_ids <- c("A0A0U1ZFN5", "P22682")
ids <- c(invalid_ids, valid_ids)
query <- list("accession_id" = ids)
query_uniprot(query)
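As a complement, malformed accession IDs can also be spotted on the client side before any request is sent. The sketch below is not what queryup does (the package relies on the API's error messages, as described above); it assumes the accession pattern published in the UniProt help pages is still current.

```r
# Accession pattern as documented by UniProt (assumption: still current).
accession_pattern <- paste0(
  "^([OPQ][0-9][A-Z0-9]{3}[0-9]|",
  "[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

ids <- c("P226", "CON_P22682", "REV_P47941", "A0A0U1ZFN5", "P22682")

# TRUE for IDs whose shape matches a UniProt accession.
is_well_formed <- grepl(accession_pattern, ids)

well_formed_ids <- ids[is_well_formed]
malformed_ids <- ids[!is_well_formed]
print(well_formed_ids)
print(malformed_ids)
```

A well-formed accession may still not exist in the database, so a check like this can only reduce, not replace, the handling of invalid entries by query_uniprot().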

Long queries

Because the UniProt REST API limits the size of queries, long queries containing more than a few hundred entries cannot be passed in a single request. To overcome this limitation, the queryup package splits long queries into smaller ones. For instance, the dataset uniprot_entries bundled with the queryup package contains information for 1000 UniProt entries. We can retrieve the Ensembl IDs corresponding to these entries using:

ids <- uniprot_entries$Entry
query <- list("accession_id" = ids)
columns <- c("gene_names", "xref_ensembl")
df <- query_uniprot(query, columns = columns, show_progress = FALSE)
head(df)
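The idea of splitting a long query into smaller ones can be sketched in base R. The helper split_into_chunks() is hypothetical and the chunk size of 300 is an arbitrary assumption; neither reflects queryup's actual implementation or settings.

```r
# Illustrative only: split a long vector of IDs into fixed-size chunks.
# The chunk size of 300 is an arbitrary assumption, not queryup's setting.
split_into_chunks <- function(ids, chunk_size = 300) {
  # Assign each ID a chunk index (1, 1, ..., 2, 2, ...) and split on it.
  unname(split(ids, ceiling(seq_along(ids) / chunk_size)))
}

# Placeholder IDs standing in for uniprot_entries$Entry.
ids <- sprintf("ID%04d", 1:1000)
chunks <- split_into_chunks(ids)

# Number of IDs in each chunk.
print(lengths(chunks))
```

Each chunk could then be submitted as its own query and the resulting data frames combined, for example with rbind().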

Protein-protein interactions

Another use case is retrieving protein-protein interactions among a set of UniProt entries:

ids <- sample(uniprot_entries$Entry, 400)
query <- list("accession_id" = ids, 
              "interactor" = ids)
columns <- "cc_interaction"
df <- query_uniprot(query = query, columns = columns, show_progress = FALSE)
head(df)


VoisinneG/queryup documentation built on June 30, 2023, 2:05 a.m.