README.md
In lijing28101/fagin2: Annotate Orphan Genes

Fagin2

A pipeline for the classification of orphans into origin classes using a syntenic filter.

This work is funded by the National Science Foundation grant:

NSF-IOS 1546858 Orphan Genes: An Untapped Genetic Reservoir of Novel Traits

# Clone the repository
git clone https://github.com/lijing28101/fagin2.git

cd fagin2

# Create a new environment for Fagin2
conda create --name fagin2 --file requirements.txt

# Activate the new environment
conda activate fagin2

# Go back to the parent dir and build fagin2
cd ..
R CMD build fagin2

# Install
R CMD INSTALL fagin2_2.0.0.tar.gz

Potential error for installation

Error in dyn.load(file, DLLpath = DLLpath, ...) :
  unable to load shared object '/work/LAS/mash-lab/jing/bin/Anaconda3/envs/fagin2/lib/R/library/L1pack/libs/L1pack.so':
  libRlapack.so: cannot open shared object file: No such file or directory
Calls: <Anonymous> ... asNamespace -> loadNamespace -> library.dynam -> dyn.load
Execution halted
ERROR: lazy loading failed for package ‘fagin2’

Method to solve the problem (not the best way)

# change to your lib path
cd Anaconda3/envs/fagin2/lib/

ls -ltrh 'liblapack.so'
ls -ltrh 'libblas.so'
mv liblapack.so libRlapack.so
mv libblas.so libRblas.so

Phylogeny tree for all included species, e.g. ((Saccharomyces_cerevisiae,Saccharomyces_kudriavzevii)s3,Saccharomyces_arboricola)s2;
Genome sequence for each species
GFF file for each species
Must include "gene", "mRNA", "CDS", and "exon" in column 3
Must include "Parent", "ID" or "Name" (or both) in column 9

# sample GFF file NC_001133.9 RefSeq gene 1807 2169 . - . ID=PAU8;Parent=- NC_001133.9 RefSeq mRNA 1807 2169 . - . ID=NM_001180043.1;Parent=PAU8 NC_001133.9 RefSeq exon 1807 2169 . - . ID=id3;Parent=NM_001180043.1 NC_001133.9 RefSeq CDS 1807 2169 . - 0 ID=NP_009332.1;Parent=NM_001180043.1 NC_001133.9 RefSeq gene 2480 2707 . + . ID=YAL067W-A;Parent=- NC_001133.9 RefSeq mRNA 2480 2707 . + . ID=NM_001184582.1;Parent=YAL067W-A NC_001133.9 RefSeq exon 2480 2707 . + . ID=id4;Parent=NM_001184582.1 NC_001133.9 RefSeq CDS 2480 2707 . + 0 ID=NP_878038.1;Parent=NM_001184582.1 NC_001133.9 RefSeq gene 7235 9016 . - . ID=SEO1;Parent=- NC_001133.9 RefSeq mRNA 7235 9016 . - . ID=NM_001178208.1;Parent=SEO1 NC_001133.9 RefSeq exon 7235 9016 . - . ID=id5;Parent=NM_001178208.1 NC_001133.9 RefSeq CDS 7235 9016 . - 0 ID=NP_009333.1;Parent=NM_001178208.1

Synteny map for the focal species versus each other species
Suggest obtain from mummer4 bash # Name of query and target species query=saccharomyces_cerevisiae target=saccharomyces_arboricola prefix=${query}_${target} # # ${query}_genome.fna and ${target}_genome.fna are the path of genome sequences nucmer --mum -t 36 -p ${prefix} ${query}_genome.fna ${target}_genome.fna show-coords -T -r -c -l ${prefix}.delta > ${prefix}.coords # # Convert coords format to the syn format required for fagin awk 'BEGIN{FS="\t"; OFS="\t"} {if ($4>$3) $8="+"; else $8="-"; print $12,$1,$2,$13,$3,$4,$7,$8}' ${prefix}.coords | awk 'BEGIN{OFS="\t"}{if($5<$6){start=$5;stop=$6}else{start=$6;stop=$5};print $1,$2,$3,$4,start,stop,$7,$8}' | sed '1,4d' > ${prefix}.syn
Synteny format as below (without header):

diff ! start coordinate should be always smaller than end even if the orientation is - ! use tab as delimiter - query_chromosome query_start query_end focal_chromosome focal_start focal_end orientation NC_001133.9 1783 2167 CM001577.1 1020875 1021256 90.65 - NC_001133.9 1801 2167 CM001574.1 205548 205924 85.68 + NC_001133.9 1804 2167 CM001576.1 770149 770512 89.29 - NC_001133.9 1805 2167 CM001568.1 106576 106941 90.44 + NC_001133.9 1805 2165 CM001574.1 491 860 87.57 + NC_001133.9 1806 2167 CM001574.1 84073 84443 79.78 -

A gene list for each focal species which will compare to target species
must use the Name but not the ID of mRNA in GFF file if both Name and ID existing on column 9

! Important notes:
! Chromosome ID in GFF, fna, and syn files should be consistent.
! The number of chromosomes or scaffolds in GFF, fna, and syn files should be same.
! You may need to request large memory to run the program if the synteny map is large

To run and configure fagin, you need to set paths to your data in a configuration object. The default configuration can be generated

config()

Example to generate a configuration object:

library(fagin2)

get_yeast_config <- function(){

  # create a default configuration
  con <- config()

  # set parameter
  con@synder@offsets = c(1L,1L) # offsets for mummer, c(0L,1L) for satsuma
  con@synder@trans = "p" # percent identity transform (0-100), or "d" for proportion transform (0-1)
  # alnrate is the proportion of target sequence match to query sequence for alignment
  # default cutoff for alnrate are 0
  con@alignment@alnrate@prot2prot=0.5 # protein vs protein
  con@alignment@alnrate@prot2allorf=0.5 # protein vs all ORF
  con@alignment@alnrate@prot2transorf=0.5 #protein vs ORF on the mRNA
  con@alignment@alnrate@dna2dna=0.5 # DNA vs DNA

  # focal species can be one or more species (max: the total number of species in tree file)
  con@input@focal_species = c("Saccharomyces_cerevisiae","Saccharomyces_arboricola")

  # set path for input files
  # name of list should be same as the species in tree
  con@input@gff <- list(
    "Saccharomyces_arboricola"   = "saccharomyces_arboricola_annotation.gff"
    , "Saccharomyces_cerevisiae"   = "saccharomyces_cerevisiae_annotation.gff"
    , "Saccharomyces_kudriavzevii" = "saccharomyces_kudriavzevii_annotation.gff"
  )
  con@input@fna <- list(
    "Saccharomyces_arboricola"   = "saccharomyces_arboricola_genome.fna"
    , "Saccharomyces_cerevisiae"   = "saccharomyces_cerevisiae_genome.fna"
    , "Saccharomyces_kudriavzevii" = "saccharomyces_kudriavzevii_genome.fna"
  )

  # syn path is a two-dimensional list
  # first level: focal species; second level: target species
  con@input@syn <- list(
    Saccharomyces_cerevisiae = list(
      "Saccharomyces_arboricola"   = "saccharomyces_cerevisiae.vs.saccharomyces_arboricola.syn"
      , "Saccharomyces_kudriavzevii" = "saccharomyces_cerevisiae.vs.saccharomyces_kudriavzevii.syn"
    ),
    Saccharomyces_arboricola = list(
      "Saccharomyces_cerevisiae"   = "saccharomyces_arboricola_saccharomyces_cerevisiae.syn"
      , "Saccharomyces_kudriavzevii" = "saccharomyces_arboricola_saccharomyces_kudriavzevii.syn"
    )
  )
  con@input@tree <- "tree"
  con@input@gene_list <- list(
    "Saccharomyces_cerevisiae" = "sc.list.txt"
    , "Saccharomyces_arboricola"   =  "sa.list.txt"
  )
  con@archive = "yeast-archive" # dir for output files

  validate_config(con)
  con
}

con <- get_yeast_config() # create revised configuration

Load input genome sequence and GFF file for each species, then processing and summarize the input data (save as [species_name]_data.rds for reuse).
Do alignment and comparison for each pair of focal and target species (save as [focal_species]-[target_species].rds for reuse)
Summarize the comparison and export as xlsx file for each focal species (also save as [focal_species]_result.rds)

# con is your configuration object
# overwrite.result: whether rerun the summarize step for each focal species
# overwrite.result=TRUE, when you add new target species in con for comparison
# overwrite.result=FALSE, only when you add new focal species, but not add new target species for previous focal

m <- run_fagin(con, overwrite.result=T) # without parallel, 30 min for sample data

or

# cores: number of cores used for parallel steps, recommend for large data
# type: "FORK" shared memory, only for linux and OS system, "PSOCK" for all system. Default: "FORK"
m <- run_fagin_parallel(con, cores=16, cl.type="FORK") # run loading data and comparison in parallel, 13 min for sample data

Note: Fagin will save the input data for each individual species, and pair of focal-target comparison as rds file for reuse. run_fagin and run_fagin_parallel will skip to load input data or pairwise comparison if the rds file already existed in the archive dir. You should delete the rds file for a specific species or pairwise comparison if you want to re-analysis that step.

The output m is a multi-dimensional list. First level is focal species. Second level include 5 data frames: