fagin: Trace the origins of orphan genes
In arendsee/fagin: Annotate Orphan Genes

Sections: Description, GFF input, Genome input, Input synteny maps, Input Tree, Synder parameters, Fagin parameters

This documentation is a work in progress ...

GFF input

Absolute path to a directory containing one GFF file for each species used in the pipeline. Each GFF file must contain at minimum mRNA and coding sequence (CDS) features. All start and stop positions must be relative to the reference genomes in FNA_DIR (see argument -n).

The following must be true of all GFF files:

Contain CDS, exon and mRNA entries
All CDS and exon fields contain Parent tags in the 9th column
All features contain an ID or Name field

Chr1   .   mRNA   3631   5899   .   +   .   ID=AT1G01010.1
Chr1   .    CDS   3760   3913   .   +   .   ID=AT1G01010.1.CDS-1;Parent=AT1G01010.1
Chr1   .    CDS   3996   4276   .   +   .   ID=AT1G01010.1.CDS-2;Parent=AT1G01010.1

Expected extension: *.gff
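
The requirements listed above can be checked before running the pipeline. The following is a minimal sketch, not part of fagin: the function name check_gff is hypothetical, and it assumes tab-delimited files with the usual nine GFF columns and semicolon-separated tag=value pairs in the 9th column.

```python
def check_gff(lines):
    """Return a list of problems found in GFF lines (an empty list if none)."""
    problems = []
    seen_types = set()
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/pragma lines
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append(f"line {lineno}: expected 9 columns, found {len(fields)}")
            continue
        ftype = fields[2]
        seen_types.add(ftype)
        # Parse the 9th column into a tag -> value mapping
        tags = dict(
            pair.strip().split("=", 1)
            for pair in fields[8].split(";")
            if "=" in pair
        )
        if "ID" not in tags and "Name" not in tags:
            problems.append(f"line {lineno}: {ftype} has no ID or Name tag")
        if ftype in ("CDS", "exon") and "Parent" not in tags:
            problems.append(f"line {lineno}: {ftype} has no Parent tag")
    # The file as a whole must contain all three feature types
    for required in ("mRNA", "CDS", "exon"):
        if required not in seen_types:
            problems.append(f"no {required} entries found")
    return problems
```

Calling check_gff(open(path)) on each file in the GFF directory and printing any returned messages is one way to catch formatting problems early.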

Genome input

Each genome must be a FASTA file (extension 'fna', for FASTA Nucleic Acid). The headers must contain sequence ids that match those of the GFF.
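
Whether the sequence ids agree can be tested with a few lines of code. This is a rough sketch under two assumptions that the text above does not confirm: the sequence id is taken to be the first whitespace-delimited token of each FASTA header, and the GFF is tab-delimited with the sequence id in column 1. All function names are hypothetical.

```python
def fasta_ids(lines):
    """Collect sequence ids: the first token of every '>' header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def gff_seqids(lines):
    """Collect the values of column 1 from non-comment GFF lines."""
    ids = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            ids.add(line.split("\t")[0])
    return ids


def missing_from_fasta(fasta_lines, gff_lines):
    """Sequence ids used in the GFF that have no matching FASTA header."""
    return gff_seqids(gff_lines) - fasta_ids(fasta_lines)
```

An empty result means every contig named in the GFF has a sequence in the genome file.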

Input synteny maps

Absolute path to a directory containing one synteny map for each species that will be compared. Each synteny map should consist of a single file named according to the pattern "<query>.vs.<target>.syn", for example, "arabidopsis_thaliana.vs.arabidopsis_lyrata.syn". These files should contain the following columns:

query contig name (e.g. chromosome or scaffold)
query start position
query stop position
target contig name
target start position
target stop position
score (not necessarily used)
strand relative to query

Example:

chr2   193631   201899   tchr2   193631   201899   100   +
chr2   225899   235899   tchr2   201999   202999   100   +
chr1   5999     6099     tchr1   6099     6199     100   +
chr1   5999     6099     tchr1   8099     8199     100   +
chr1   17714    18714    tchr2   17714    18714    100   +
chr2   325899   335899   tchr2   301999   302999   100   +

A synteny map like this can be created using a whole-genome synteny program, such as Satsuma (highly recommended). Building a single synteny map requires hundreds of CPU hours, so it is best done on a cluster. An example PBS script is provided; see src/satsuma.pbs.

Expected filename format: <query_sciname>.vs.<target_sciname>.syn
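
As an illustration of the file naming and the eight-column layout described above, here is a small parser sketch. It is not fagin's own reader: the names SynLink, parse_syn_name and parse_syn_line are hypothetical, columns are assumed to be whitespace-delimited, and species names are assumed to contain no dots.

```python
import re
from typing import NamedTuple

# <query>.vs.<target>.syn, assuming species names contain no dots or spaces
SYN_NAME = re.compile(r"^(?P<query>[^.\s]+)\.vs\.(?P<target>[^.\s]+)\.syn$")


class SynLink(NamedTuple):
    qcontig: str
    qstart: int
    qstop: int
    tcontig: str
    tstart: int
    tstop: int
    score: float
    strand: str


def parse_syn_name(filename):
    """Split a synteny map filename into (query, target) species names."""
    match = SYN_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a <query>.vs.<target>.syn filename: {filename!r}")
    return match["query"], match["target"]


def parse_syn_line(line):
    """Parse one synteny map row into a SynLink record."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")
    if fields[7] not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', found {fields[7]!r}")
    return SynLink(
        fields[0], int(fields[1]), int(fields[2]),
        fields[3], int(fields[4]), int(fields[5]),
        float(fields[6]), fields[7],
    )
```

For example, parse_syn_line applied to the first row of the example above yields a record whose qcontig is "chr2" and whose tcontig is "tchr2".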

Input Tree

Absolute path to a Newick-format file specifying the topology of the species tree. It must contain all species used in the pipeline AND NO OTHERS (I may relax this restriction later).

NOTE: There must be no spaces in the species names.

Here is an example tree:

(Brassica_rapa,(Capsella_rubella,(Arabidopsis_lyrata,Arabidopsis_thaliana)));
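
The two tree constraints (exactly the pipeline's species, and no spaces in names) can be verified with a short script. This sketch uses a regular expression rather than a full Newick parser, so it assumes unquoted leaf names; the function names are hypothetical.

```python
import re


def newick_leaves(newick):
    """Return leaf names from a Newick string, ignoring any branch lengths."""
    # A leaf name directly follows "(" or "," and ends at ",", ")", ":" or ";".
    # Internal node labels follow ")" and are therefore not matched.
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


def check_tree(newick, species):
    """List the ways the tree disagrees with the expected species names."""
    leaves = newick_leaves(newick)
    problems = [f"species name contains a space: {n!r}" for n in leaves if " " in n]
    missing = set(species) - set(leaves)
    extra = set(leaves) - set(species)
    if missing:
        problems.append(f"missing from tree: {sorted(missing)}")
    if extra:
        problems.append(f"in tree but not in pipeline: {sorted(extra)}")
    return problems
```

An empty list from check_tree indicates that the tree's leaves and the species list agree.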

Synder parameters

See the synder documentation.

Fagin parameters

The *_PVAL parameters are base p-value cutoffs. These will be adjusted for multiple testing.

PROT2PROT_PVAL: default=0.05 - query protein versus target gene alignments
PROT2ALLORF_PVAL: default=0.05 - query protein versus all translated ORFs in the search interval (SI)
PROT2TRANSORF_PVAL: default=0.05 - query protein versus translated ORFs from spliced transcripts
DNA2DNA_PVAL: default=0.05 - query genes versus the entire SI (nucleotide match)
PROT2PROT_NSIMS: default=1000 - number of simulations
PROT2ALLORF_NSIMS: default=1000 - number of simulations
PROT2TRANSORF_NSIMS: default=1000 - number of simulations
DNA2DNA_MAXSPACE: default=1e8 - Maximum value of m*n that will be searched
INDEL_THRESHOLD: default=0.25 - Ratio of search interval to query interval below which an indel is called
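
One possible reading of INDEL_THRESHOLD and DNA2DNA_MAXSPACE is sketched below. This is an interpretation, not fagin's implementation: the function names are hypothetical, the ratio is assumed to compare interval lengths, m and n are assumed to be the lengths of the two sequences being compared, and whether the MAXSPACE bound is inclusive is a guess.

```python
INDEL_THRESHOLD = 0.25   # default listed above
DNA2DNA_MAXSPACE = 1e8   # default listed above


def is_indel(si_length, query_length, threshold=INDEL_THRESHOLD):
    """Call an indel when the search interval is short relative to the query.

    Assumption: both arguments are lengths in bases.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return si_length / query_length < threshold


def within_search_space(m, n, max_space=DNA2DNA_MAXSPACE):
    """True if an m-by-n comparison is small enough to be searched.

    Assumption: m and n are sequence lengths and the bound is inclusive.
    """
    return m * n <= max_space
```

Under these assumptions, a 100-base search interval for a 1000-base query has a ratio of 0.1 and would be called an indel, while a 900-base search interval would not.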