README.md

SynExtend

Table of contents: - Introduction - Installation - Usage - Reporting Bugs and Errors

Introduction

SynExtend is a package of tools for working with synteny objects generated from the Bioconductor Package DECIPHER.

Installation

SynExtend is present in the package repository Bioconductor and the current release version can be installed via:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

BiocManager::install("SynExtend")
library("SynExtend")

The devel version in Bioconductor can be installed via:

BiocManager::install("SynExtend", version = "devel")
library("SynExtend")

The version in this repository (which should always be up to date with devel and frequently ahead of release) can be installed with devtools via:

devtools::install_github(repo = "npcooley/synextend")
library("SynExtend")

Additionally, SynExtend is maintained in a Docker container on Dockerhub that is here.

Usage

Prediction of Orthologous Gene Pairs

SynExtend facilitates the prediction of orthologous gene pairs from synteny maps. The functions herein rely on synteny maps built by the FindSynteny() function in the package DECIPHER. The process is relatively straightforward and only requires assemblies and genecalls, which can be conveniently pulled from the NCBI using Entrez Direct. However, as long as given assemblies have associated gene calls, or can have gene calls generated, they will work just fine. DECIPHER now contains the function FindGenes() that produces high quality de novo gene calls.

Using Entrez Direct, this process would start like this:

library(SynExtend)

EntrezQuery <- paste("esearch -db assembly ",
                     "-query '",
                     "kitasatospora[organism] ",
                     'AND "complete genome"[filter] ', # only complete genomes
                     'AND "refseq has annotation"[properties] ', # only genomes with annotations
                     'AND "latest refseq"[filter] ', # only latest
                     "NOT anomalous[filter]' ",
                     '| ',
                     'esummary ',
                     '| ',
                     'xtract -pattern DocumentSummary -element FtpPath_RefSeq',
                     sep = "")

FtPPaths <- system(command = EntrezQuery,
                   intern = TRUE,
                   timeout = 300L) # timeout argument is required for RStudio only

FNAs <- unname(sapply(FtPPaths,
                      function(x) paste(x,
                                        "/",
                                        strsplit(x,
                                                 split = "/",
                                                 fixed = TRUE)[[1]][10],
                                        "_genomic.fna.gz",
                                        sep = "")))

GFFs <- unname(sapply(FtPPaths,
                      function(x) paste(x,
                                        "/",
                                        strsplit(x,
                                                 split = "/",
                                                 fixed = TRUE)[[1]][10],
                                        "_genomic.gff.gz",
                                        sep = "")))

With these matched character vectors describing files on the NCBI FTP site, we can pull given assemblies, and their associated GFF files.

Assemblies are pulled using functions within the package Biostrings and placed in a sqlite database using the package DECIPHER. Note, DBPATH can be a specific local file location, though a tempfile works fine for this example. The tempfile in this example will be destroyed upon closure of the active R session, so specific file paths are preferable for most user activity. The code chunk below provides an example of pulling down assemblies from the NCBI and using DECIPHER's built in gene finder to generate gene calls.

DBPATH <- tempfile()

GC01 <- vector(mode = "list",
               length = length(FNAs))

for (m1 in seq_along(FNAs)) {
  X <- readDNAStringSet(FNAs[m1])
  Seqs2DB(seqs = X,
          type = "XStringSet",
          dbFile = DBPATH,
          identifier = as.character(m1),
          verbose = TRUE)

  GC01[[m1]] <- FindGenes(myDNAStringSet = X)
}

names(GC01) <- seq(length(FNAs))

A current quirk to this process is that genecalls must be present in a named list, where the names can be matched to integer identifiers in the DECIPHER database used to construct a synteny object. Users have the option of using the function FindGenes in DECIPHER, using the rtracklayer function import to import a valid GFF file, or importing a GFF with the SynExtend function gffToDataFrame which performs some data munging to convert the GFF into a DataFrame for user convenience.

GC02 <- vector(mode = "list",
               length = length(GFFs))

for (m1 in seq_along(GFFs)) {
  GC02[[m1]] <- gffToDataFrame(GFF = GFFs[m1],
                               Verbose = TRUE)

}

names(GC02) <- seq(length(GC02))

With assemblies and genecalls now in a user's workspace, users can build a synteny map with FindSynteny. See ?FindSynteny in R for further reading. Finding and tabulating where genes are connected by syntenic k-mers is then performed with NucleotideOverlap, and that raw data is converted into a convenient data.frame with the function PairSummariers.

Syn <- FindSynteny(dbFile = DBPATH)

Links <- NucleotideOverlap(SyntenyObject = Syn,
                           GeneCalls = GC01,
                           Verbose = TRUE)

Pairs01 <- PairSummaries(SyntenyLinks = Links,
                         PIDs = TRUE,
                         Verbose = TRUE)

Clusters01 <- DisjointSet(Pairs = Pairs01,
                          Verbose = TRUE)

Pairs where predictions create conflicts can be removed by a simple subsetting function.

Pairs02 <- SubSetPairs(CurrentPairs = Pairs01,
                       Verbose = TRUE)

Clusters02 <- DisjointSet(Pairs = Pairs02,
                          Verbose = TRUE)

A more self-contained example is maintained in this package's Bioconductor vignette, and can be found here

Creating Functional Association Networks

SynExtend also allows for prediction of functional association networks from evidence of correlated selective pressures between of pairs of clusters of orthologous genes (COGs). Given a list of gene trees or presence/absence data, the EvoWeaver class in SynExtend uses multiple algorithms to compute likelihood of pairwise functional association. This list can be generated with DisjointSet(...), or can be created by the user.

An example of using EvoWeaver can be done using the built-in example Streptomyces dataset. This will predict functional association of 200 genes from 301 genomes.

exData <- get(data("ExampleStreptomycesData"))
ew <- EvoWeaver(exData$Genes)

# For faster runtime, set subset as follows:
# predictions <- predict(ew, mySpeciesTree=exData$Tree, subset=1:20)
predictions <- predict(ew, mySpeciesTree=exData$Tree)

# View number of genes and number of predictions:

# Print out the adjacency matrix:
predictions

# Print out pairwise associations:
as.data.frame(predictions)

# Plot genes using force-directed embedding:
plot(predictions)

Current algorithms used for prediction are Jaccard distance, Hamming distance, Mutual Information, Direct Coupling Analysis, Phylogenetic Gain/Loss events, Co-localization, MirrorTree, and ContextTree. See ?EvoWeaver for more details.

Planned Updates

SynExtend currently has an extensive TODO list that includes: 1. Parsing communities of predicted pairs. 2. User friendly accessor functions to help with printing and plotting data.

If users have specific requests I would love to incorporate them.

Reporting Bugs and Errors

Feel free to email me at npc19@pitt.edu or submit issues here on GitHub. Questions related to EvoWeaver can be sent to ahl27@pitt.edu or submitted similarly on GitHub.



npcooley/SynExtend documentation built on May 2, 2024, 7:28 p.m.