fagin: Trace the origins of orphan genes


Description

This documentation is a work in progress ...

GFF input

Absolute path to a directory containing a GFF file for each species used in the pipeline. This GFF file must contain at minimum mRNA and coding sequence (CDS) features. All start and stop positions must be relative to the reference genomes in FNA_DIR (see argument -n).

The following must be true of all GFF files:

Chr1   .   mRNA   3631   5899   .   +   .   ID=AT1G01010.1
Chr1   .    CDS   3760   3913   .   +   .   ID=AT1G01010.1.CDS-1;Parent=AT1G01010.1
Chr1   .    CDS   3996   4276   .   +   .   ID=AT1G01010.1.CDS-2;Parent=AT1G01010.1

Expected extension: *.gff
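As a rough illustration of the requirements above, the following Python sketch checks GFF lines for the presence of mRNA and CDS features. It is not part of fagin: the function names are hypothetical, and the rules that start must not exceed stop and that each CDS `Parent` must match an mRNA `ID` are assumptions inferred from the example rows rather than documented behavior.

```python
def parse_attributes(text):
    """Turn 'ID=a;Parent=b' into {'ID': 'a', 'Parent': 'b'}."""
    pairs = (item.split("=", 1) for item in text.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def check_gff_lines(lines):
    """Return a list of problems found in GFF lines (empty if none were found)."""
    problems = []
    mrna_ids = set()
    cds_parents = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on any whitespace, keeping the attribute column whole.
        fields = line.split(None, 8)
        if len(fields) != 9:
            problems.append(f"line {number}: expected 9 columns, found {len(fields)}")
            continue
        feature, start, stop = fields[2], fields[3], fields[4]
        # Assumption: positions are integers with start <= stop.
        if not (start.isdigit() and stop.isdigit()) or int(start) > int(stop):
            problems.append(f"line {number}: bad start/stop '{start}', '{stop}'")
        attrs = parse_attributes(fields[8])
        if feature == "mRNA" and "ID" in attrs:
            mrna_ids.add(attrs["ID"])
        elif feature == "CDS":
            cds_parents.append((number, attrs.get("Parent")))
    if not mrna_ids:
        problems.append("no mRNA features with an ID attribute")
    if not cds_parents:
        problems.append("no CDS features")
    # Assumption: every CDS names an mRNA in this file as its Parent.
    for number, parent in cds_parents:
        if parent not in mrna_ids:
            problems.append(f"line {number}: CDS parent {parent!r} is not a known mRNA")
    return problems


example = [
    "Chr1   .   mRNA   3631   5899   .   +   .   ID=AT1G01010.1",
    "Chr1   .    CDS   3760   3913   .   +   .   ID=AT1G01010.1.CDS-1;Parent=AT1G01010.1",
    "Chr1   .    CDS   3996   4276   .   +   .   ID=AT1G01010.1.CDS-2;Parent=AT1G01010.1",
]
print(check_gff_lines(example))
```

The check for parents runs after all lines are read, so the sketch does not depend on mRNA rows appearing before their CDS rows.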

Genome input

This must be a FASTA file (extension 'fna', for FASTA Nucleic Acid). The headers must contain sequence IDs that match those of the GFF.
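One way to test the header requirement before running the pipeline is sketched below in Python. The function names are hypothetical, and the sketch assumes the sequence ID is the first whitespace-delimited token of each FASTA header, which this documentation does not state explicitly.

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs, taken as the first token after '>' in each header."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def gff_seqids(gff_lines):
    """Collect the first column of every non-blank, non-comment GFF line."""
    return {
        line.split()[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }


def missing_from_fasta(gff_lines, fasta_lines):
    """Return GFF sequence IDs that have no matching FASTA header, sorted."""
    return sorted(gff_seqids(gff_lines) - fasta_ids(fasta_lines))


fasta = [">Chr1 some description", "ACGT", ">Chr2", "GGCC"]
gff = ["Chr1 . mRNA 1 4 . + . ID=a", "Chr3 . mRNA 1 4 . + . ID=b"]
print(missing_from_fasta(gff, fasta))
```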

Input synteny maps

Absolute path to a directory containing one synteny map for each species that will be compared. Each synteny map should consist of a single file named according to the pattern "<query>.vs.<target>.syn", for example, "arabidopsis_thaliana.vs.arabidopsis_lyrata.syn". These files should contain the following columns:

  1. query contig name (e.g. chromosome or scaffold)

  2. query start position

  3. query stop position

  4. target contig name

  5. target start position

  6. target stop position

  7. score (not necessarily used)

  8. strand relative to query

Example:

chr2   193631   201899   tchr2   193631   201899   100   +
chr2   225899   235899   tchr2   201999   202999   100   +
chr1   5999     6099     tchr1   6099     6199     100   +
chr1   5999     6099     tchr1   8099     8199     100   +
chr1   17714    18714    tchr2   17714    18714    100   +
chr2   325899   335899   tchr2   301999   302999   100   +

A synteny map like this can be created using a whole-genome synteny program, such as Satsuma (highly recommended). Building a single synteny map requires hundreds of CPU hours, so it is best done on a cluster. An example PBS script is provided; see src/satsuma.pbs.

Expected filename format: <query_sciname>.vs.<target_sciname>.syn
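The filename pattern and the eight-column layout described above can be illustrated with a short Python sketch. This is not fagin's own parser: the names `SyntenyRow`, `parse_filename`, and `parse_row` are hypothetical, and restricting the strand to "+" or "-" is an assumption (the example rows only show "+").

```python
import re
from typing import NamedTuple


class SyntenyRow(NamedTuple):
    query_contig: str
    query_start: int
    query_stop: int
    target_contig: str
    target_start: int
    target_stop: int
    score: float
    strand: str


# <query>.vs.<target>.syn, split at the first ".vs." in the name
FILENAME = re.compile(r"^(?P<query>.+?)\.vs\.(?P<target>.+)\.syn$")


def parse_filename(name):
    """Return (query, target) from a synteny map filename."""
    match = FILENAME.match(name)
    if match is None:
        raise ValueError(f"not a '<query>.vs.<target>.syn' filename: {name!r}")
    return match.group("query"), match.group("target")


def parse_row(line):
    """Parse one whitespace-delimited synteny map row into a SyntenyRow."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")
    q, q_start, q_stop, t, t_start, t_stop, score, strand = fields
    if strand not in ("+", "-"):  # assumption: only these two values occur
        raise ValueError(f"unexpected strand: {strand!r}")
    return SyntenyRow(q, int(q_start), int(q_stop),
                      t, int(t_start), int(t_stop), float(score), strand)


print(parse_filename("arabidopsis_thaliana.vs.arabidopsis_lyrata.syn"))
print(parse_row("chr2   193631   201899   tchr2   193631   201899   100   +"))
```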

Input Tree

Absolute path to a newick format file specifying the topology of the species tree. It must contain all species used in the pipeline AND NO OTHERS (I may relax this restriction later).

NOTE: There must be no spaces in the species names.

Here is an example tree:

(Brassica_rapa,(Capsella_rubella,(Arabidopsis_lyrata,Arabidopsis_thaliana)));
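The two constraints on the tree (exactly the pipeline's species, and no spaces in names) can be checked ahead of time. The Python sketch below uses a simple regular expression rather than a full Newick parser, so it assumes an unquoted tree with more than one leaf; the function names are hypothetical and not part of fagin.

```python
import re


def newick_leaves(newick):
    """Return leaf names: labels that directly follow '(' or ','."""
    names = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [name.strip() for name in names if name.strip()]


def check_tree(newick, species):
    """Return a list of problems comparing tree leaves to the expected species."""
    leaves = newick_leaves(newick)
    problems = [
        f"species name contains whitespace: {name!r}"
        for name in leaves
        if re.search(r"\s", name)
    ]
    for name in sorted(set(species) - set(leaves)):
        problems.append(f"species missing from tree: {name}")
    for name in sorted(set(leaves) - set(species)):
        problems.append(f"tree contains unexpected species: {name}")
    return problems


tree = "(Brassica_rapa,(Capsella_rubella,(Arabidopsis_lyrata,Arabidopsis_thaliana)));"
species = ["Brassica_rapa", "Capsella_rubella",
           "Arabidopsis_lyrata", "Arabidopsis_thaliana"]
print(newick_leaves(tree))
print(check_tree(tree, species))
```

Internal node labels and branch lengths, if present, are ignored because they follow ")" or ":" rather than "(" or ",".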

Synder parameters

See the synder documentation.

Fagin parameters

PROT2PROT_PVAL

default=0.05 - Base p-value cutoff (adjusted for multiple testing) for query protein versus target gene alignments.

PROT2ALLORF_PVAL

default=0.05 - query protein versus all translated ORFs in the search interval (SI)

PROT2TRANSORF_PVAL

default=0.05 - query protein versus translated ORFs from spliced transcripts

DNA2DNA_PVAL

default=0.05 - query genes versus the entire search interval (nucleotide match)

PROT2PROT_NSIMS

default=1000 - number of simulations

PROT2ALLORF_NSIMS

default=1000 - number of simulations

PROT2TRANSORF_NSIMS

default=1000 - number of simulations

DNA2DNA_MAXSPACE

default=1e8 - Maximum value of m*n that will be searched

INDEL_THRESHOLD

default=0.25 - Ratio of search interval to query interval below which an indel is called
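To make the last two thresholds concrete, here is a minimal Python sketch of how they might be applied. It is an interpretation, not fagin's implementation: it assumes m and n are the lengths of the two sequences being compared, that the DNA2DNA_MAXSPACE limit is inclusive, and that the INDEL_THRESHOLD ratio is computed from interval lengths. The function names are hypothetical.

```python
DNA2DNA_MAXSPACE = 1e8   # default from the parameter list
INDEL_THRESHOLD = 0.25   # default from the parameter list


def within_search_space(m, n, maxspace=DNA2DNA_MAXSPACE):
    """True if the m*n search space is no larger than the limit (assumed inclusive)."""
    return m * n <= maxspace


def is_indel(search_interval_length, query_length, threshold=INDEL_THRESHOLD):
    """True if the search interval is short enough, relative to the query, to call an indel."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return search_interval_length / query_length < threshold


print(within_search_space(10_000, 10_000))  # product equals the limit
print(is_indel(20, 100))                    # ratio 0.2, below the threshold
```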


arendsee/fagin documentation built on Aug. 27, 2019, 11:58 a.m.