knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(onekp)
library(knitr)
library(magrittr)

Accessing the OneKP metadata

All projects that use the onekp R package start at the same place:

onekp <- retrieve_onekp()
class(onekp)

The retrieve_onekp function scrapes the metadata associated with each transcriptome project from the 1KP public data page. It also links each species to its NCBI taxonomy ID (which is used later to filter by clade).

The only part of the OneKP object that you will need to interact with directly is the @table slot, a data.frame with the form:

onekp@table %>% head(3) %>% knitr::kable()
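For readers unfamiliar with S4 objects, slot access with `@` can be sketched without the package or a network connection. The class `MockOneKP` and its two rows are hypothetical stand-ins invented for this illustration. The real object has more columns, and only the column names `species` and `tax_id` are taken from this vignette.

```r
library(methods)

# Hypothetical S4 class that mimics an object with a single 'table' slot
setClass("MockOneKP", representation(table = "data.frame"))

# Illustrative rows only; not actual 1KP metadata
mock <- new("MockOneKP", table = data.frame(
  species = c("Arabidopsis thaliana", "Brassica napus"),
  tax_id  = c(3702L, 3708L),
  stringsAsFactors = FALSE
))

# List the slots, then read the table slot with '@'
slot_names <- slotNames(mock)
first_row  <- head(mock@table, 1)

# Slots can be reassigned in place, which is how the table is subset later
mock@table <- mock@table[mock@table$tax_id == 3702L, ]
```

The same `@` reads and assignments apply to the object returned by `retrieve_onekp()`, assuming its class exposes `table` as an ordinary slot, as the text describes.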

Retrieving sequences

To get sequences, first subset onekp@table until it contains only the species you want. There are several ways to do this.

You can use all the normal tools for subsetting the table directly, e.g.

onekp@table <- subset(onekp@table, family == 'Nymphaeaceae')
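A few other base R idioms work the same way. The sketch below runs on a small mock data frame rather than the real `onekp@table`, so the rows are illustrative only. The column names `species` and `family` are the ones used in this vignette.

```r
# Hypothetical stand-in for onekp@table; the real table has more columns
tbl <- data.frame(
  species = c("Nuphar advena", "Pinus radiata", "Arabidopsis thaliana"),
  family  = c("Nymphaeaceae", "Pinaceae", "Brassicaceae"),
  stringsAsFactors = FALSE
)

# Logical indexing, equivalent to subset(tbl, family == "Nymphaeaceae")
by_family <- tbl[tbl$family == "Nymphaeaceae", ]

# Keep several families at once with %in%
wanted <- tbl[tbl$family %in% c("Nymphaeaceae", "Pinaceae"), ]

# Pattern match on the species column to keep one genus
pines <- tbl[grepl("^Pinus ", tbl$species), ]
```

Any of these results could be assigned back to `onekp@table` in the same way as the `subset()` call above.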

onekp also has a few built-in tools for taxonomic selection:

# filter by species name ('species' column of onekp@table)
filter_by_species(onekp, 'Pinus radiata')

# filter by species NCBI taxon ID ('tax_id' column of onekp@table)
filter_by_species(onekp, 3347)

# filter by clade scientific name (get all data for the Brassicaceae family)
filter_by_clade(onekp, 'Brassicaceae')

# filter by clade NCBI taxon ID
filter_by_clade(onekp, 3700)

Once you have chosen the studies you want, you can retrieve the protein or transcript FASTA files:

download_peptides(filter_by_clade(onekp, 'Brassicaceae'))
download_nucleotides(filter_by_clade(onekp, 'Brassicaceae'))

This will download the files into a temporary directory. Alternatively, you may set your own directory with the dir argument. The downloaded protein FASTA files have the extension .faa and the DNA files the extension .fna. The basename for each file is the 1KP 4-letter code.
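Once the files are on disk, the naming convention makes it easy to map each file back to its 1KP code. The following is a minimal sketch using base R only. It simulates a download directory with empty placeholder files, and the names `AAAA` and `BBBB` are made up for illustration, not real 1KP codes. With the real package, `out_dir` would be whatever directory was passed through the dir argument.

```r
# Simulated download directory with placeholder files (not real 1KP codes)
out_dir <- file.path(tempdir(), "onekp_demo")
dir.create(out_dir, showWarnings = FALSE)
file.create(file.path(out_dir, c("AAAA.faa", "BBBB.faa", "AAAA.fna")))

# List only the protein FASTA files
faa_files <- list.files(out_dir, pattern = "\\.faa$", full.names = TRUE)

# Recover the 4-letter code from each basename and use it to name the paths
codes <- tools::file_path_sans_ext(basename(faa_files))
names(faa_files) <- codes
```

The named vector can then be passed to whichever FASTA reader you prefer, and the names can be matched against the metadata table to recover species information.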



arendsee/onekp documentation built on April 9, 2023, 7:55 a.m.