knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(genofeatutil)

Introduction to genofeatutil

This is a collection of functions for working with Drosophila genomic datasets. Currently, it is mainly used to fetch gene names from objects created by Seurat and converting gene names, aliases, and FlyBase gene numbers to designated version of symbols of interest to sychronize the use of gene symbols in different datasets.

General mode of action

Prepare FlyBase and Reference Genome Annotation in an Acessible Format: make_database()

Gene name / number conversion functions rely on FlyBase. make_database() fetches the conversion table of aliases to up-to-date gene names and of deprecated FlyBase gene numbers from FlyBase, and creates a list object that can be used to efficently convert gene symbols or FlyBase numbers. Additionally, make_database() also requires a GTF file to specify the standard version of gene names to use, with which every conversion of gene names will end up (In constrast, conversion to FlyBase gene number gives gene numbers to the FlyBase database version specified in make_database(), which by default is the current version.) Available arguments including: 1. species: Currently only "dmel" is supported, so it's not really on option. 2. version: The version of FlyBase to use (e.g., "FB2019_01").

# Fetch the current version of gene name and number from FlyBase
# Use the short BDGP6 GTF file in extdata folder as 
# an example target for later conversion
flybasedb <- make_database(species = "dmel", 
                           gtf.path = system.file("extdata", "short_bdgp6.gtf", 
                                                  package = "genofeatutil"))

str(flybasedb)

Conversion

To standardize gene names

Using the database generated by make_database(), convert_to_genename() converts gene names (including aliases) and FlyBase gene number to the gene name provided with the GTF file. Two arguments provide additional control to its behavior: 1. normalize: If TRUE, the output will be prefixed with gn_, and punctuations not allowed (like parentheses or dash) in variable names will be replaced by ".". This behavior is nice when your expression matrix is a data frame of which the column names are gene names, because column names are automatically converted to legit variable names. To apply the same strategy of modifying gene names, see ?normalize_genename and ?denormalize_genename for more information. 2. remove.dup: If TRUE, duplicated items will be removed from the output. This is useful when you are integrating many genes of interest without knowing if there are duplicated names under dinstinct aliases. If there are genes removed, a warning message will remind you.

# No normalization and keep duplicates
convert_to_genename(c("Cyp307a2", "FBgn0086917", "spo2"), 
                    db = flybasedb, 
                    normalize = FALSE, 
                    remove.dup = FALSE)
# Remove duplicates
convert_to_genename(c("Cyp307a2", "FBgn0086917", "spo2"), 
                    db = flybasedb, 
                    normalize = FALSE, 
                    remove.dup = TRUE)
# Remove duplicates and normalize the names
convert_to_genename(c("Cyp307a2", "FBgn0086917", "spo2"), 
                    db = flybasedb, 
                    normalize = TRUE, 
                    remove.dup = TRUE)

From gene names to FlyBase gene numbers

convert_gene_to_fbgn() converts gene names and aliases to current version of FlyBase gene number if the database is fetched in default setting and the version argument is not customized. If the alias provided matches multiple FlyBase gene numbers, all matching numbers will be returned with a warning reminding you some gene names are not mapped 1-vs-1.

convert_gene_to_fbgn(c("tj", "ap", "VGlut"), db = flybasedb)
convert_gene_to_fbgn(c("Cha", "spo2", "side"), db = flybasedb)

Update outdated FlyBase gene names with update_fbgn()

update_fbgn() converts older FlyBase IDs to current counterpart. This is a partial equivalent of Tools > Query by symbols/IDs > Upload/Convert IDs. If there's no matching information for update, it would be assumed that the provided FlyBase gene number is uptodate.

# dachs used to be FBgn0032045, but now it is FBgn0262029.
update_fbgn("FBgn0032045", db = flybasedb)

Feature Selection: Scoring, then exploration

genofeatutil assumes feature selection will be done as:

  1. Assess the importance of features in several ways (score_predictors())
  2. Integrate scoring of the same feature from datasets with different context (integrate_score())
  3. Visualize and explore to make sense of the features (plot_score())

score_predictors()

score_predictors() is a versatile score calculator. Currently, it takes the gene symbols of the predictors, targets, and the increment of mean squared error when removed in random forest prediction.

Besides feature importance, it evaluates predictors and see if the predictor has a motif near the regulated target with the motif score files from databases with package SCENIC (using argument score.path and motif.path; files can be downloaded from database link) either by reading the raw file, or by importing separatedly with get_motif_info. The scores are calculated by taking the top N motifs per target (by setting top.motif.number) or motifs above certain threshold (by setting motif.threshold).

One might expect a regulator will be co-expressed with its targets, so correlation of expression levels could also provide information in how good the suggested regulation is. score_predictors() takes an expression matrix, and examine the correlation coefficient and uses both a raw value or whether the raw value passes certain threshold (by setting cor.threshold).

In short, score_predictors() takes a data frame containing predictor-target pairs and importance score in a random forest model, and from this, if provided a motif score database (for example from SCENIC) or an expression matrix, further scoring considering whether a motif resides close to the target (if the predictor is a transcription factor) or whether the expression of a predictor and its target is correlated.

# Loading dummydb
dummypath <- system.file("extdata", "dummy.gtf", package = "genofeatutil")
testdb <- make_database(species = "test", gtf.path = dummypath)

# Generate dummy result for demo
dummyresult <- data.frame("predictor" = c("tf_1", "tf_2", "tf_3"),
                          "target" = rep("alias1.2", 3),
                          "Raw_%IncMSE" = c(0.3, 0.1, 1e-3),
                          row.names = NULL, stringsAsFactors = FALSE)

dummyexpMat <- data.frame("sample1" = c(3, 1, 1, 3),
                          "sample2" = c(1, 1, 3, 2),
                          "sample3" = c(1, 3, 1, 1),
                          row.names = c("tf_1", "tf_2", "tf_3", "alias1.2"),
                          stringsAsFactors = FALSE)

# Example score data frame
score_result <- score_predictors(x = dummyresult,
                                 expMat = dummyexpMat,
                                 use.row = TRUE,
                                 score.path = system.file(
                                   "extdata",
                                   "scoremat.feather",
                                   package = "genofeatutil"),
                                 motif.path = system.file(
                                   "extdata",
                                   "dummy_motif.tbl",
                                   package = "genofeatutil"),
                                 db = testdb)

integrate_score()

integrate_scores() integrates scores generated from random forest models of different conditions or time points. It takes a column specified by name from each score data frame provided and merge them. It is mostly a wrapper of merge() that takes multiple data frames at a time with a default of all = TRUE.

# Generate dummy data
t1 <- data.frame("predictor" = c("tf_1", "tf_2", "tf_3"),
                 "target" = c("gene_1", "gene_2", "gene_3"),
                 "MSE" = c(1, 1, 0))
t2 <- data.frame("predictor" = c("tf_1", "tf_2", "tf_3"),
                 "target" = c("gene_1", "gene_2", "gene_3"),
                 "MSE" = c(1, 0, 1))
t3 <- data.frame("predictor" = c("tf_1", "tf_2", "tf_3", "tf_4"),
                 "target" = c("gene_1", "gene_2", "gene_3", "gene_1"),
                 "MSE" = c(0, 1, 1, 1))

integrated_table <- integrate_score(t1 = t1, t2 = t2, t3 = t3,
                                    column.name = "MSE")

plot_score()

plot_score() anticipates a data frame fro integrate_score(), or any data frame whose rows are predictor-target pairs with columns containing scores from different conditions or time points. It plots either a heatmap (plot.type = "heatmap") or a line plot (plot.type = "line") to visualize the trend of score. Additionally, a hierarchy tree can be plotted (plot.type = "tree") to show the similarities between predictor-target pairs.

# Generating a dummy data frame containing predictor, target, and columns of
# scores
intedf <- data.frame(
  "predictor" = c("tf_1", "tf_2", "tf_3", "tf_4"),
  "target" = c("gene_1", "gene_2", "gene_1", "gene_2"),
  "t1" = c(0, 5, 1, 5),
  "t2" = c(0, 3, 0, 1),
  "t3" = c(0, 6, 2, 6),
  "t4" = c(2, 7, 8, 9),
  "t5" = c(9, 2, 2, 7),
  stringsAsFactors = FALSE
)

# Plot a line plot
plot_score(x = intedf, plot.type = "line")

# Plot a heatmap
plot_score(x = intedf, plot.type = "heatmap")

# Plot a tree
plot_score(x = intedf, plot.type = "tree")


chenyenchung/genofeatutil documentation built on May 15, 2019, 10:38 p.m.