add_taxonomy_columns: Add NCBI taxonomy levels to NCBI protein accessions


Description

Given a tbl with a column of valid NCBI protein accessions, this function assigns the requested NCBI taxonomy level to each accession.

Usage

add_taxonomy_columns(
  tbl,
  ncbi_accession_colname = "ncbi_accession",
  ncbi_acc_key = NULL,
  taxonomy_level = "kingdom",
  map_superkindom = FALSE,
  batch_size = 20
)

Arguments

tbl

an object of class tbl

ncbi_accession_colname

A string (default: "ncbi_accession") giving the name of the column that holds the NCBI accessions.

ncbi_acc_key

A user-specific Entrez API key (default: NULL). Get one via taxize::use_entrez().

taxonomy_level

A string (default: "kingdom") indicating the NCBI taxonomy level to assign to each NCBI protein accession. It must be one of the following:

  1. superkingdom

  2. kingdom

  3. phylum

  4. subphylum

  5. class

  6. subclass

  7. infraclass

  8. cohort

  9. order

  10. suborder

  11. infraorder

  12. superfamily

  13. family

  14. subfamily

  15. genus

  16. species

  17. tribe

  18. no rank

map_superkindom

Logical (default: FALSE). If TRUE, assign the superkingdom when no kingdom is found. Valid only when taxonomy_level == "kingdom".

batch_size

The number of queries to submit at a time (default: 20).
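As a rough illustration of the last two arguments, the sketch below shows one way a superkingdom fallback and query batching could work in base R. It is an assumption about the general idea, not phyloR's actual internals: `fill_kingdom` and `chunk_accessions` are hypothetical helpers, and the accession strings are made up.

```r
# Hypothetical helper, not part of phyloR: use the superkingdom wherever
# the kingdom is missing, but only when the fallback is switched on.
fill_kingdom <- function(kingdom, superkingdom, map_superkingdom = FALSE) {
  if (!map_superkingdom) return(kingdom)
  ifelse(is.na(kingdom), superkingdom, kingdom)
}

# Hypothetical helper, not part of phyloR: split a vector of accessions
# into consecutive chunks of at most `batch_size` elements.
chunk_accessions <- function(accessions, batch_size = 20) {
  stopifnot(batch_size >= 1)
  if (length(accessions) == 0) return(list())
  # Chunk index for each element: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_along(accessions) / batch_size)
  unname(split(accessions, chunk_id))
}

# Illustrative values only: the second record has no kingdom assigned.
filled <- fill_kingdom(
  kingdom = c("Metazoa", NA),
  superkingdom = c("Eukaryota", "Bacteria"),
  map_superkingdom = TRUE
)

# Made-up accession-like strings, used only to show the chunking.
acc <- paste0("ACC", sprintf("%03d", 1:45))
batches <- chunk_accessions(acc, batch_size = 20)
```

With 45 accessions and a batch size of 20, this yields three chunks of 20, 20 and 5 elements, which is the sense in which a smaller batch_size means more, smaller requests.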

Details

This function assigns a specific level of NCBI taxonomy to each NCBI (protein) accession. It requires a tibble with at least one column of NCBI (protein) accessions. The taxonomy columns are added to the input tibble, and the original columns are kept as they were.

Internally, the function first finds the NCBI taxonomy id for each accession and then maps the requested taxonomy level to that id. Finding taxonomy ids may take time, depending on the number of input accessions. On the first run or any subsequent run, you may supply a taxonomy id column ('taxid') in the input tibble; the function then skips the taxonomy id lookup and maps the taxonomy level directly from the given ids.

To map taxonomy levels for a large number of NCBI accessions, you may use a parallel processing approach, as shown in the example.
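The two-step mapping and the 'taxid' shortcut described above can be sketched with toy lookup tables in place of NCBI queries. This is a minimal illustration under stated assumptions, not the package's implementation: `add_rank_sketch`, `acc_to_taxid` and `taxid_to_rank` are hypothetical names, and the accessions are made up.

```r
# Toy lookup tables standing in for NCBI queries (made-up accessions).
acc_to_taxid <- c(ACC001 = "9606", ACC002 = "10090")
taxid_to_rank <- data.frame(
  taxid   = c("9606", "10090"),
  kingdom = c("Metazoa", "Metazoa"),
  species = c("Homo sapiens", "Mus musculus"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the two-step mapping, not phyloR code.
add_rank_sketch <- function(tbl, acc_col, rank) {
  # Step 1: look up taxonomy ids only when no 'taxid' column was supplied.
  if (!"taxid" %in% names(tbl)) {
    tbl$taxid <- unname(acc_to_taxid[tbl[[acc_col]]])
  }
  # Step 2: map each taxonomy id to the requested rank.
  tbl[[rank]] <- taxid_to_rank[[rank]][match(tbl$taxid, taxid_to_rank$taxid)]
  tbl
}

d <- data.frame(acc = c("ACC002", "ACC001"), stringsAsFactors = FALSE)

# First call performs both steps and leaves a 'taxid' column behind.
with_kingdom <- add_rank_sketch(d, "acc", "kingdom")

# Second call finds 'taxid' already present and goes straight to step 2.
with_species <- add_rank_sketch(with_kingdom, "acc", "species")
```

In the sketch, the second call is cheaper because the slow step (accession to taxonomy id) runs only once; the original columns and the earlier kingdom column are left in place.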

Value

A tbl: the input with the requested taxonomy column added and the original columns unchanged.

Examples

## Not run: 
library(magrittr)  # provides %>% in case it is not already attached

## read the example BLAST output (outfmt 7) shipped with phyloR
f <- system.file("extdata", "blast_output_01.txt", package = "phyloR")
d <- readr::read_delim(f, delim = "\t", col_names = FALSE, comment = "#")
colnames(d) <- phyloR::get_blast_outformat_7_colnames()

## add kingdom
with_kingdom <- d %>%
        dplyr::slice(1:50) %>%
        add_taxonomy_columns(ncbi_accession_colname = "subject_acc_ver")

## add species
with_kingdom_and_species <- with_kingdom %>%
        add_taxonomy_columns(ncbi_accession_colname = "subject_acc_ver",
                             taxonomy_level = "species")
dplyr::glimpse(with_kingdom_and_species)

#------------------------------------
## using parallel processing approach

library(furrr)
num_of_splits <- 10
d <- d %>% dplyr::slice(1:100)

## assign each row to one of num_of_splits groups, then split by group
split_vec <- rep(1:num_of_splits, length.out = nrow(d))
qq_split <- d %>%
        dplyr::mutate(split_vec = split_vec) %>%
        dplyr::group_by(split_vec) %>%
        dplyr::group_split()

## "multiprocess" is deprecated in recent versions of future; use multisession
future::plan(future::multisession)
out <- qq_split[1:num_of_splits] %>%
        future_map(~ phyloR::add_taxonomy_columns(
                        tbl = .x,
                        taxonomy_level = "species",
                        map_superkindom = FALSE,
                        ncbi_accession_colname = "subject_acc_ver",
                        batch_size = 20,
                        ## replace with your own ENTREZ API key
                        ncbi_acc_key = "YOUR_ENTREZ_API_KEY"),
                   .progress = TRUE) %>%
        dplyr::bind_rows()

## End(Not run)

cparsania/phyloR documentation built on Aug. 6, 2020, 7:28 a.m.