Introduction

The r Biocpkg("Organism.dplyr") creates an on disk sqlite database to hold data of an organism combined from an 'org' package (e.g., r Biocpkg("org.Hs.eg.db")) and a genome coordinate functionality of the 'TxDb' package (e.g., r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")). It aims to provide an integrated presentation of identifiers and genomic coordinates. And a src_organism object is created to point to the database.

The src_organism object is created as an extension of src_sql and src_sqlite from dplyr, which inherited all dplyr methods. It also implements the select() interface from r Biocpkg("AnnotationDbi") and genomic coordinates extractors from r Biocpkg("GenomicFeatures").

Constructing a src_organism

Make sqlite datebase from 'TxDb' package

The src_organism() constructor creates an on disk sqlite database file with data from a given 'TxDb' package and corresponding 'org' package. When dbpath is given, file is created at the given path, otherwise temporary file is created.

suppressPackageStartupMessages({
    library(Organism.dplyr)
    library(GenomicRanges)
    library(ggplot2)
})
library(Organism.dplyr)

Running src_organism() without a given path will save the sqlite file to a tempdir():

src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")

Alternatively you can provide explicit path to where the sqlite file should be saved, and re-use the data base at a later date.

path <- "path/to/my.sqlite"
src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene", path)

supportedOrganisms() provides a list of organisms with corresponding 'org' and 'TxDb' packages being supported.

supportedOrganisms()

Make sqlite datebase from organism name

Organism name, genome and id could be specified to create sqlite database. Organism name (either Organism or common name) must be provided to create the database, if genome and/or id are not provided, most recent 'TxDb' package is used.

src <- src_ucsc("human", path)

Access existing sqlite file

An existing on-disk sqlite file can be accessed without recreating the database. A version of the database created with TxDb.Hsapiens.UCSC.hg38.knownGene, with just 50 Entrez gene identifiers, is distributed with the Organism.dplyr package

src <- src_organism(dbpath = hg38light())
src

The "dplyr" interface

All methods from package dplyr can be used for a src_organism object.

Look at all available tables.

src_tbls(src)

Look at data from one specific table.

tbl(src, "id")

Look at fields of one table.

colnames(tbl(src, "id"))

Below are some examples of querying tables using dplyr.

  1. Gene symbol starting with "SNORD" (the notation SNORD% is from SQL, with % representing a wild-card match to any string)
tbl(src, "id") %>%
    filter(symbol %like% "SNORD%") %>%
    dplyr::select(entrez, map, ensembl, symbol) %>%
    distinct() %>% arrange(symbol) %>% collect()
  1. Gene ontology (GO) info for gene symbol "ADA"
inner_join(tbl(src, "id"), tbl(src, "id_go")) %>%
    filter(symbol == "ADA") %>%
    dplyr::select(entrez, ensembl, symbol, go, evidence, ontology)
  1. Gene transcript counts per gene symbol
txcount <- inner_join(tbl(src, "id"), tbl(src, "ranges_tx")) %>%
    dplyr::select(symbol, tx_id) %>%
    group_by(symbol) %>%
    summarize(count = n()) %>%
    dplyr::select(symbol, count) %>%
    arrange(desc(count)) %>%
    collect(n=Inf)

txcount

library(ggplot2)
ggplot(txcount, aes(x = symbol, y = count)) + 
    geom_bar(stat="identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle("Transcript count") +
    labs(x = "Symbol") +
    labs(y = "Count")
  1. Gene coordinates of symbol "ADA" and "NAT2" as GRanges
inner_join(tbl(src, "id"), tbl(src, "ranges_gene")) %>%
    filter(symbol %in% c("ADA", "NAT2")) %>%
    dplyr::select(gene_chrom, gene_start, gene_end, gene_strand,
                  symbol, map) %>%
    collect() %>% GenomicRanges::GRanges()

The "select" interface

Methods select(), keytypes(), keys(), columns() and mapIds from r Biocpkg("AnnotationDbi") are implemented for src_organism objects.

Use keytypes() to discover which keytypes can be passed to keytype argument of methods select() or keys().

keytypes(src)

Use columns() to discover which kinds of data can be returned for the src_organism object.

columns(src)

keys() returns keys for the src_organism object. By default it returns the primary keys for the database, and returns the keys from that keytype when the keytype argument is used.

Keys of entrez

head(keys(src))

Keys of symbol

head(keys(src, "symbol"))

select() retrieves the data as a tibble based on parameters for selected keys columns and keytype arguments. If requested columns that have multiple matches for the keys, select_tbl() will return a tibble with one row for each possible match, and select() will return a data frame.

keytype <- "symbol"
keys <- c("ADA", "NAT2")
columns <- c("entrez", "tx_id", "tx_name","exon_id")
select_tbl(src, keys, columns, keytype)

mapIds() gets the mapped ids (column) for a set of keys that are of a particular keytype. Usually returned as a named character vector.

mapIds(src, keys, column = "tx_name", keytype)
mapIds(src, keys, column = "tx_name", keytype, multiVals="CharacterList")

Genomic Coordinates Extractor Interfaces

Eleven genomic coordinates extractor methods are available in this package: transcripts(), exons(), cds(), genes(), promoters(), transcriptsBy(), exonsBy(), cdsBy(), intronsByTranscript(), fiveUTRsByTranscript(), threeUTRsByTranscript(). Data can be returned in two versions, for instance tibble (transcripts_tbl()) and GRanges or GRangesList (transcripts()).

Filters can be applied to all extractor functions. The output can be resctricted by an AnnotationFilter, an AnnotationFilterList, or an expression that can be tranlated into an AnnotationFilterList. Valid filters can be retrieved by supportedFilters(src).

supportedFilters(src)

All filters take two parameters: value and condition, condition could be one of "==", "!=", "startsWith", "endsWith", ">", "<", ">=" and "<=", default condition is "==".

EnsemblFilter("ENSG00000196839")
SymbolFilter("SNORD", "startsWith")

The following illustrates several ways of inputting filters to an extractor function.

smbl <- SymbolFilter("SNORD", "startsWith")
transcripts_tbl(src, filter=smbl)
filter <- AnnotationFilterList(smbl)
transcripts_tbl(src, filter=filter)
transcripts_tbl(src, filter=~smbl) 

A GRangesFilter() can also be used as a filter for the methods with result displaying as GRanges or GRangesList.

gr <- GRangesFilter(GenomicRanges::GRanges("chr15:25062333-25065121"))
transcripts(src, filter=~smbl & gr)

Filters in extractor functions support &, |, !(negation), and ()(grouping). Transcript coordinates of gene symbol equal to "ADA" and transcript start position less than 44619810.

transcripts_tbl(src, filter = AnnotationFilterList(
    SymbolFilter("ADA"),
    TxStartFilter(44619810,"<"),
    logicOp="&")
)
## Equivalent to
transcripts_tbl(src, filter = ~symbol == "ADA" & tx_start < 44619810)

Transcripts coordinates of gene symbol equal to "ADA" or transcript end position equal to 243843236.

txend <- TxEndFilter(243843236, '==')
transcripts_tbl(src, filter = ~symbol == "ADA" | txend)

Using negation to find transcript coordinates of gene symbol equal to "ADA" and transcript start positions NOT less than 44619810.

transcripts_tbl(src, filter = ~symbol == "ADA" & !tx_start < 44618910)

Using grouping to find transcript coordinates of a long filter statement.

transcripts_tbl(src,
    filter = ~(symbol == 'ADA' & !(tx_start >= 44619810 | tx_end < 44651742)) | 
              (smbl & !tx_end > 25056954)
)
sessionInfo()


Bioconductor/Organism.dplyr documentation built on May 2, 2024, 4:22 p.m.