findGSE: Estimating genome size by fitting k-mer frequencies in short...

View source: R/findGSE_v1.94.R

findGSE R Documentation

Estimating genome size by fitting k-mer frequencies in short reads with a skew normal distribution model.

Description

findGSE is a function for (heterozygous diploid or homozygous) genome size estimation by fitting k-mer frequencies iteratively with a skew normal distribution model. (version still under testing)

To use findGSE, one needs to prepare a histo file, which contains two tab-separated columns. The first column gives the frequencies at which k-mers occur in the reads, while the second column gives the number of distinct k-mers observed at each frequency. Both the k-mer size k and the corresponding histo file are required for any estimation.
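As a minimal sketch of the input layout described above, the following R snippet writes a toy histo file and reads it back. The file name, the column names used on reading, and all counts are made up for illustration, and the snippet assumes the file carries no header line; a real histogram would come from a k-mer counting tool run on the reads.

```r
# Toy k-mer histogram: illustrative values only, not real data.
toy_histo <- data.frame(
  freq  = 1:6,                               # column 1: frequency at which k-mers occur in reads
  count = c(5000, 800, 120, 300, 950, 400)   # column 2: number of distinct k-mers at that frequency
)

# Write two tab-separated columns, without header or row names (assumed layout).
histo_path <- file.path(tempdir(), "toy.histo")   # hypothetical file name
write.table(toy_histo, histo_path, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the file back to confirm the two-column layout.
histo <- read.table(histo_path, sep = "\t", col.names = c("freq", "count"))
str(histo)
```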

Dependencies (R library) required: pracma, fGarch - see INSTALL.

For heterozygous genomes, an additional parameter (exp_hom) giving the approximate average k-mer coverage of the homozygous regions must be provided.

Usage

findGSE(histo = "", sizek = 0, outdir = "", exp_hom = 0, species = "")
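A hedged sketch of how the call might look in practice is given below. The file name, output folder, k = 21 and exp_hom = 200 are placeholder values chosen for illustration, not recommendations; the call is guarded so that it only runs when the package is installed and the file exists.

```r
# Placeholder inputs: replace with your own histo file, k, and output folder.
histo_file <- "sample_21mer.histo"   # hypothetical file name
out_folder <- "findGSE_out"          # hypothetical output folder

can_run <- requireNamespace("findGSE", quietly = TRUE) && file.exists(histo_file)

if (can_run) {
  # Homozygous genome: exp_hom is left at its default of 0.
  findGSE::findGSE(histo = histo_file, sizek = 21, outdir = out_folder)

  # Heterozygous genome: exp_hom is a rough homozygous-peak coverage
  # (200 is an illustrative value only).
  findGSE::findGSE(histo = histo_file, sizek = 21, outdir = out_folder,
                   exp_hom = 200)
} else {
  message("findGSE is not installed or the histo file was not found; skipping.")
}
```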

Arguments

histo

is the histo file (mandatory).

sizek

is the size of k used to generate the histo file (mandatory). k is also used in calculating heterozygosity if the genome is heterozygous.

outdir

is the path to write output files (optional). If not provided, by default results will be written in the folder where the histo file is.

exp_hom

a rough average k-mer coverage of the homozygous regions (optional). A k-mer frequency histogram generally shows several peaks, and one has to determine which peak corresponds to the homozygous regions and which to the heterozygous regions. Although optional, exp_hom must be provided to estimate the size of a heterozygous genome. VALUE for exp_hom must satisfy fp < VALUE < 2*fp, where fp is the frequency of the homozygous peak. If not provided, the default of 0 assumes the genome is homozygous.

species

an optional parameter applied only when calculating heterozygosity for human. It indicates that (lx-ly)*hom_c/2 k-mers should be removed from het-kmers, where lx is the length of chromosome X, ly is the length of chromosome Y, and hom_c is the average k-mer coverage of the homozygous k-mers. Two estimates will be provided as ORIGINAL_EST CORRECTED_EST: for males, select the second (CORRECTED_EST); for females, select the first (ORIGINAL_EST).
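The two numeric conditions stated for exp_hom and species can be sketched as small helper functions. Both helpers are hypothetical and are not part of findGSE; fp has to be read off the histogram by the user, and the chromosome lengths below are rounded, approximate human values assumed purely for illustration. How findGSE applies the correction internally is not shown here.

```r
# Hypothetical helper: check the documented constraint fp < exp_hom < 2*fp,
# where fp is the frequency of the homozygous peak.
check_exp_hom <- function(exp_hom, fp) {
  exp_hom > fp && exp_hom < 2 * fp
}

# Hypothetical helper: the quantity (lx - ly) * hom_c / 2 from the `species` argument.
xy_kmer_correction <- function(lx, ly, hom_c) {
  (lx - ly) * hom_c / 2
}

check_exp_hom(exp_hom = 150, fp = 100)   # TRUE
check_exp_hom(exp_hom = 200, fp = 100)   # FALSE: the upper bound is strict

# Assumed, rounded lengths: chromosome X about 156 Mb, chromosome Y about 57 Mb;
# hom_c = 30 is an illustrative coverage.
xy_kmer_correction(lx = 156e6, ly = 57e6, hom_c = 30)   # 1.485e9
```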


schneebergerlab/findGSE documentation built on Jan. 26, 2024, 8:10 a.m.