fCAT is a feature-aware completeness assessment tool, which provide a solution to assess the completeness of a genome assembly or a newly sequenced genome, based on the ortholog prediction with reciprocity criterion and the feature architecture similarity between the ortholog sequences of the interested genome and the training sequences.
To install fCAT, open R in your terminal
Using devtools to install fCAT
if (!requireNamespace("devtools"))
install.packages("devtools")
devtools::install_github("giangnguyen0709/fCAT")
The function to check the completeness of an interested genome
The function returns two reports. A detailed report of the completeness of the interested genome and a frequency table of all taxa, which were checked completeness with fCAT with option extend = TRUE. The frequency table show how many core genes "similar", "dissimilar", "duplicated", "missing" and "ignored" in each taxon.
genome <- "/path/to/query/genome.fa"
fasAnno <- "/path/to/fas/annotation/genome.json"
coreDir <- "/path/to/the/core/directory"
coreSet <- "name of the core set"
extend <- TRUE #by default is FALSE
redo <- TRUE #by default is FALSE
scoreMode <- 2 #Choices: 1,2,3, "len"
priorityList <- c("HUMAN@9606@1", "ECOLI@511145@1")
cpu <- 4
blastDir <- "/path/to/blast_dir" #Optional
weightDir <- "/path/to/weight_dir" #Optional
cleanup <- TRUE #by default is FASLE
output <- "path/to/location/to/save/output"
checkCompleteness <- function(genome, fasAnno, coreDir, coreSet, extend, redo, scoreMode, refSpecList, cpu, blastDir, weightDir, cleanup, output)
The function to compute the original phylogenetic profile, which will contains the phylogenetic profile of all core taxa of the core set. This phylogenetic profile can be used to assess the completeness of the core taxa and their'completeness will be reported together with the interested genome in the frequency table. It is optional, the tool can still check the completeness of a genome, even the orginal phylogenetic profile was not computed
This function will check completeness for all core genomes in the core set and store the output in the output's folder
coreDir <- "/path/to/the/core/directory"
coreSet <- "name of the core set"
scoreMode <- 2 #Choices: 1,2,3, "busco"
cpu <- 4
fCAT::computeOriginal(coreDir, coreSet, scoreMode, cpu)
The function calculate all cutoff values for all mode in the set. For score mode 1 it will calculate the avarage of all vs all FAS scores between the training sequences in the core gene. For score mode 2 it will calculate the avarage of the FAS score between each sequence against all training sequences in the core gene. The scores will be writen in a table with a column is the ID of the sequences and a column is the corresponding value. For score mode 3, the function will calculate the avarage of 1 vs all FAS scores for each training sequence in the core gene. The avarages build a distribution, the function will calculate the confidence interval of this distribution and write the upper value and the lower value of the interval in a file in the core gene folder.
coreDir <- "/path/to/the/core/directory"
coreSet <- "name of the core set"
fCAT::processCoreSet(coreDir, coreSet)
The test data is the eukaryota_busco set, which can be downloaded in https://applbio.biologie.uni-frankfurt.de/download/core-sets/BUSCO_Eukaryota/
The core set folder has some conflicts with the input of fCAT, which must be removed first. All the core gene folders in the folder core_orthologs must be contained in a subfolder (The name of the subfolder is the core set argument of fCAT) and this subfolder must be stored in core orthologs. In this document I will set the name of this subfolder eukaryota_busco. In the blast dir folder of CRYNE, the symbolic link of the fasta file of CRYNE was directed by a mistake to the symbolic link of CHRLE. This must be corrected before testing
The folder weight_dir of the core set contains the the xml files, which can not be run with fCAT. Please download the annotation files of the core set from https://drive.google.com/file/d/113MBwT1n7E64Xk54Ul-_r82aEK2v8jdl/view?usp=sharing and replace the xml files with the json files
In all following examples, I assumed that I has a genome fasta file and its FAS annotation file, which named HUMAN@9606@3.fa and HUMAN@9606@3.json, the core folder named eukaryota_busco, the core set named eukaryota_busco and all this data is placed in the home folder. You can replace them by your corresponding path and names
genome <- "/home/user/HUMAN@9606@3.fa"
fasAnno <- "/home/user/HUMAN@9606@3.json"
coreDir <- "/home/user/eukaryota_busco"
coreSet <- "eukaryota_busco"
extend <- TRUE
refSpecList <- c("HOMSA@9606@2")
scoreMode <- 1
cpu <- 4
fCAT::checkCompleteness(genome = genome, fasAnno = fasAnno, coreDir = coreDir, coreSet = coreSet, extend = extend, refSpecList = refSpecList, scoreMode = scoreMode, cpu = cpu)
The report will be storede by default in /home/user/eukaryota_busco/output/eukaryota_busco/1/report
coreDir <- "/home/user/eukaryota_busco"
coreSet <- "eukaryota_busco"
scoreMode <- 1
cpu <- 4
fCAT::computeOriginal(coreDir = coreDir, coreSet = coreSet, scoreMode = scoreMode, cpu = cpu)
The phylogenetic profile of all core taxa will be computed and be stored by default in /home/user/eukaryota_busco/output/eukaryota_busco/1
coreDir <- "/home/user/eukaryota_busco"
coreSet <- "eukaryota_busco"
fCAT::processCoreSet(coreDir = coreDir, coreSet = coreSet)
The function will calculate all cutoff values for all core genes in the set and write them in a text file, which will be stored in /home/user/eukaryota_busco/core_orthologs/eukaryota_busco/core_gene/fas_dir/score_dir
fCAT is depended on some tools.
R.utils
taxize
EnvStats
Any bug reports or comments, suggestions are highly appreciated. Please open an issue on GitHub or be in touch via email.
Thanh-Giang Nguyen > giangnguyen0709@gmail.com
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.