keep_longest_isoform_per_gene: Resolving ambiguity in PTM mapping.


Description

Downstream analyses are typically based on gene symbols. One gene symbol may map to multiple UniProt or RefSeq IDs that correspond to different isoforms. This utility retains only the longest isoform per gene.

Usage

keep_longest_isoform_per_gene(
  ids,
  gene_id_col,
  isoform_id_col,
  isoform_len_col
)

Arguments

ids

data.frame object. Must contain the three columns described below.

gene_id_col

character. Name of the column with gene IDs in the 'ids' object.

isoform_id_col

character. Name of the column with protein isoform IDs in the 'ids' object.

isoform_len_col

character. Name of the column with protein isoform lengths in the 'ids' object.
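
To illustrate how the three column-name arguments fit together, here is a minimal base R sketch of the "longest isoform per gene" selection on a toy table. The function `keep_longest_sketch`, the toy column names and the toy values are all hypothetical. This is not the vp.misc implementation, and details such as tie-breaking between isoforms of equal length are assumptions.

```r
# Hypothetical stand-in for illustration only; NOT the vp.misc implementation.
keep_longest_sketch <- function(ids, gene_id_col, isoform_id_col, isoform_len_col) {
  # One row per distinct gene / isoform / length combination
  iso <- unique(ids[, c(gene_id_col, isoform_id_col, isoform_len_col)])
  # Order so that the longest isoform comes first within each gene
  iso <- iso[order(iso[[gene_id_col]], -iso[[isoform_len_col]]), ]
  # Keep the first (longest) isoform per gene;
  # ties are resolved by original row order (an assumption of this sketch)
  keep <- iso[!duplicated(iso[[gene_id_col]]), c(gene_id_col, isoform_id_col)]
  # Retain only the input rows that belong to the selected isoforms
  merge(ids, keep, by = c(gene_id_col, isoform_id_col))
}

# Toy input: GeneA has two isoforms (A-1 is longer), GeneB has one
toy <- data.frame(
  Gene    = c("GeneA", "GeneA", "GeneA", "GeneB"),
  Isoform = c("A-1", "A-2", "A-1", "B-1"),
  Length  = c(480L, 418L, 480L, 358L),
  Site    = c("S10", "S20", "T30", "Y40"),
  stringsAsFactors = FALSE
)

res <- keep_longest_sketch(toy, "Gene", "Isoform", "Length")
# Rows mapped to the shorter isoform A-2 are dropped; all A-1 and B-1 rows remain
print(res)
```

In this sketch, every row of the input that maps to the selected isoform is kept, so multiple sites on the same isoform survive the filtering.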

Examples

fasta_file_name <- system.file("extdata/FASTAs", 
                               "rattus_norvegics_uniprot_2018_09.fasta.gz", 
                               package = "vp.misc")
library(Biostrings)
# FASTA
fst <- readAAStringSet(fasta_file_name, format="fasta", 
                       nrec=-1L, skip=0L, use.names=TRUE)
# extracting UniProt Accessions
names(fst) <- sub("^.*\\|(.*)\\|.*$","\\1",names(fst))

data(phospho_identifications_rat)

ids_with_sites <- map_PTM_sites(ids, fst, "UniProtAccFull", "Peptide", "*")

# Adding gene annotation. Note, this is rat data searched against UniProt.
library(dplyr)
# 10116 is rat taxonomy ID
URL <- "http://www.uniprot.org/uniprot/?query=organism:10116&columns=id,genes(PREFERRED)&format=tab"
ids_with_sites <- read.delim(URL, check.names = FALSE, stringsAsFactors = FALSE) %>%
   rename(GeneMain = "Gene names  (primary )",
          UniProtAcc = "Entry") %>%
   inner_join(ids_with_sites, ., by="UniProtAcc")
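# number of rows before filtering
nrow(ids_with_sites)
ids_with_sites <- keep_longest_isoform_per_gene(ids_with_sites, 
                                                "GeneMain", "UniProtAccFull", "ProtLength")
# number of rows after retaining only the longest isoform per gene
nrow(ids_with_sites)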

vladpetyuk/vp.misc documentation built on June 25, 2021, 6:35 a.m.