haplofreq: Evaluates the Occurrence of Haplotype Segments in Particular...
In optiSel: Optimum Contribution Selection and Population Genetics

haplofreq

R Documentation

Evaluates the Occurrence of Haplotype Segments in Particular Breeds

Description

For each haplotype from thisBreed and every SNP, the occurrence of the haplotype segment containing the SNP in a set of reference breeds is evaluated. The maximum frequency each segment has in one of these reference breeds is computed, and the breed in which the segment has maximum frequency is identified. Results are either returned in a list or saved to files.

Usage

haplofreq(files, phen, map, thisBreed, refBreeds="others", minSNP=20, minL=1.0, 
  unitL="Mb", ubFreq=0.01, keep=NULL, skip=NA, cskip=NA, w.dir=NA, 
  what=c("freq", "match"), cores=1, quiet=FALSE)

Arguments

`files`	Either a character vector with file names, or a list containing character vectors with file names. The files contain phased genotypes, one file for each chromosome. File names must contain the chromosome name as specified in the `map` in the form `"ChrNAME."`, e.g. `"Breed2.Chr1.phased"`. The required format of the marker files is described under `Details`. If `files` is a character vector, then the genotypes of all animals must be in the same files. Alternatively, `files` can be a list with the following two components: `hap.thisBreed`: Character vector with names of the phased marker files for the individuals from `thisBreed`, one file for each chromosome. `hap.refBreeds`: Character vector with names of the phased marker files for the individuals from the reference breeds (`refBreeds`), one file for each chromosome. If this component is missing, then it is assumed that the haplotypes of these animals are also included in `hap.thisBreed`.
`phen`	Data frame containing the ID (column `"Indiv"`) and the breed name (column `"Breed"`) of each genotyped individual.
`map`	Data frame providing the marker map with columns including marker name `'Name'`, chromosome number `'Chr'`, and possibly the position on the chromosome in mega base pairs `'Mb'`, and the position in centimorgan `'cM'`. The order of the markers must be the same as in the files `files`. Marker names must have no white spaces.
`thisBreed`	Name of a breed from column `Breed` in `phen`: The occurrence of each haplotype segment from this breed in the reference breeds will be evaluated.
`refBreeds`	Vector with names of breeds from column `Breed` in `phen`. These breeds are used as reference breeds. The occurrence of haplotype segments in these breeds will be evaluated. By default, all breeds in `phen` except `thisBreed` are used as reference breeds. In contrast, for `refBreeds="all"`, all genotyped breeds are used as reference breeds.
`minSNP`	Minimum number of marker SNPs included in a segment.
`minL`	Minimum length of a segment in `unitL` (e.g. in cM or Mb).
`unitL`	The unit for measuring the length of a segment. Possible units are the number of marker SNPs included in the segment (`'SNP'`), the number of mega base pairs (`'Mb'`), and the genetic distances between the first and the last marker in centiMorgan (`'cM'`). In the last two cases the map must include columns with the respective names.
`ubFreq`	If a haplotype segment has frequency smaller than `ubFreq` in all reference breeds then the breed name is replaced by `'1'`, which indicates that the segment is native.
`keep`	Subset of the IDs of the individuals from data frame `phen`, or a logical vector indicating the animals in data frame `phen` that should be used. The default `keep=NULL` means that all individuals included in `phen` will be considered.
`skip`	Take line `skip+1` of the files as the line with column names. By default, the number is determined automatically.
`cskip`	Take column `cskip+1` of the files as the first column with genotypes. By default, the number is determined automatically.
`w.dir`	Output file directory. Writing results to files has the advantage that much less working memory is required. By default, no files are created. The function returns only the file names if files are created.
`what`	For `what="freq"`, the maximum frequency each haplotype segment has in the reference breeds will be computed. For `what="match"`, the name of the reference breed in which the segment has maximum frequency will be determined. By default, the frequencies and the breed names both are determined.
`cores`	Number of cores to be used for parallel processing of chromosomes. By default one core is used. For `cores=NA` the number of cores will be chosen automatically. Using more than one core increases execution time if the function is already fast.
`quiet`	Should console output be suppressed?
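
As a minimal sketch of the input objects described above, the following base R code builds toy `phen` and `map` data frames and matching file names. All IDs, breed assignments, marker names, positions, and the `"data"` directory are made up for illustration; only the column names and the `"ChrNAME."` naming rule are taken from the argument descriptions.

```r
# Hypothetical toy inputs: IDs, marker names, and positions are invented.
phen <- data.frame(
  Indiv = c("A1", "A2", "R1", "R2"),
  Breed = c("Angler", "Angler", "Rotbunt", "Rotbunt"),
  stringsAsFactors = FALSE
)

map <- data.frame(
  Name = paste0("snp", 1:6),
  Chr  = rep(c("1", "2"), each = 3),
  Mb   = c(0.5, 1.2, 2.9, 0.3, 1.1, 4.0),  # needed for unitL = "Mb"
  cM   = c(0.6, 1.3, 3.1, 0.4, 1.0, 4.2),  # needed for unitL = "cM"
  stringsAsFactors = FALSE
)

# One file per chromosome; each name contains "Chr<NAME>." (assumed directory "data").
chr   <- unique(map$Chr)
files <- file.path("data", paste0("Breeds.Chr", chr, ".phased"))

# Alternative list form, with separate (hypothetical) files for the reference breeds.
filesList <- list(
  hap.thisBreed = file.path("data", paste0("Angler.Chr", chr, ".phased")),
  hap.refBreeds = file.path("data", paste0("Others.Chr", chr, ".phased"))
)

# Simple checks mirroring the documented constraints.
stopifnot(all(c("Indiv", "Breed") %in% names(phen)))
stopifnot(!any(grepl("[[:space:]]", map$Name)))
stopifnot(all(mapply(function(f, ch) grepl(paste0("Chr", ch, "."), f, fixed = TRUE),
                     files, chr)))
```

These objects only illustrate the expected shapes; the files named here do not exist, so they cannot be passed to `haplofreq` as they stand.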

Details

Marker file format: Each marker file containing phased genotypes has a header and no row names. Cells are separated by blank spaces. The number of rows is equal to the number of markers from the respective chromosome, and the markers are in the same order as in the map. The first cskip columns are ignored. The remaining columns contain genotypes of individuals written as two alleles separated by a character, e.g. A/B, 0/1, A|B, A B, or 0 1. The same two symbols must be used for all markers. Column names are the IDs of the individuals. If the blank space is used as separator, then the ID of each individual should be repeated in the header to get a regular delimited file. The columns to be skipped and the individual IDs must have no white spaces.
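
The layout described above can be sketched by writing two toy files with base R. This is an illustration of the file format as read from the description, not a tested input for `haplofreq`; the IDs, marker names, allele values, and file names are invented, and the single leading `Name` column is an assumption (it would correspond to one skipped column).

```r
# Toy phased data: 3 markers x 2 individuals, alleles coded 0/1 (made-up values).
ids  <- c("A1", "R1")
hap1 <- matrix(c(0, 1, 1,  1, 0, 0), nrow = 3)  # first haplotype of each individual
hap2 <- matrix(c(1, 1, 0,  0, 0, 1), nrow = 3)  # second haplotype of each individual

# Variant 1: alleles joined by "|", one column per individual.
geno <- matrix(paste(hap1, hap2, sep = "|"), nrow = 3, dimnames = list(NULL, ids))
tab1 <- data.frame(Name = paste0("snp", 1:3), geno,
                   check.names = FALSE, stringsAsFactors = FALSE)
f1 <- file.path(tempdir(), "Toy.Chr1.phased")
write.table(tab1, f1, quote = FALSE, row.names = FALSE, sep = " ")

# Variant 2: alleles separated by a blank, so each ID appears twice in the header.
geno2 <- matrix(NA, nrow = 3, ncol = 4)
geno2[, c(1, 3)] <- hap1
geno2[, c(2, 4)] <- hap2
tab2 <- data.frame(paste0("snp", 1:3), geno2, stringsAsFactors = FALSE)
names(tab2) <- c("Name", rep(ids, each = 2))
f2 <- file.path(tempdir(), "Toy2.Chr1.phased")
write.table(tab2, f2, quote = FALSE, row.names = FALSE, sep = " ")

# Inspect the header and first marker row of each variant.
print(readLines(f1)[1:2])
print(readLines(f2)[1:2])
```

In the second variant the repeated IDs keep the number of header fields equal to the number of fields in each data row, which is what makes the file regularly delimited.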

Value

If w.dir=NA then a list is returned. The list may have the following components:

`freq`	Mx(2N) - matrix containing for every SNP and for each of the 2N haplotypes from `thisBreed` the maximum frequency the segment containing the SNP has in the reference breeds.
`match`	Mx(2N) - matrix containing for every SNP and for each of the 2N haplotypes from `thisBreed` the first letter of the name of the reference breed in which the segment containing the SNP has maximum frequency. Segments with frequencies smaller than `ubFreq` in all reference breeds are marked as `'1'`, which indicates that the segment is native for `thisBreed`.

The list has attributes thisBreed and map.

If w.dir is the name of a directory, then results are written to files, whereby each file corresponds to one chromosome, and a data frame with file names is returned.
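
To illustrate how the `freq` and `match` components relate to each other and to `ubFreq`, the following base R sketch derives both from made-up per-breed segment frequencies. It mirrors only the documented meaning of the output, not the way the function computes segment frequencies internally; the object names `freqByBreed` and `origin` and all numbers are hypothetical.

```r
# Toy segment frequencies: rows = SNPs, columns = haplotypes from thisBreed.
# One matrix per reference breed; all values are invented.
ubFreq <- 0.01
freqByBreed <- list(
  Rotbunt  = matrix(c(0.200, 0.000, 0.005, 0.100), nrow = 2),
  Holstein = matrix(c(0.050, 0.004, 0.000, 0.300), nrow = 2)
)

# Stack the matrices into a SNP x haplotype x breed array.
arr <- simplify2array(freqByBreed)

# Maximum frequency over reference breeds (analogue of component 'freq').
freq <- apply(arr, c(1, 2), max)

# First letter of the breed attaining the maximum (analogue of component 'match').
best   <- apply(arr, c(1, 2), which.max)
origin <- matrix(substr(names(freqByBreed)[as.vector(best)], 1, 1),
                 nrow = nrow(freq))

# Segments rarer than ubFreq in all reference breeds are marked as native.
origin[freq < ubFreq] <- "1"

print(freq)
print(origin)
mean(origin == "1")  # share of toy SNP-by-haplotype cells marked as native
```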

Author(s)

Robin Wellmann

Examples

data(map)
data(Cattle)
dir   <- system.file("extdata", package="optiSel")
files <- file.path(dir, paste("Chr", 1:2, ".phased", sep=""))

Freq <- freqlist(
 haplofreq(files, Cattle, map, thisBreed="Angler", refBreeds="Rotbunt",   minL=2.0),
 haplofreq(files, Cattle, map, thisBreed="Angler", refBreeds="Holstein",  minL=2.0),
 haplofreq(files, Cattle, map, thisBreed="Angler", refBreeds="Fleckvieh", minL=2.0)
  )

plot(Freq, ID=1, hap=2, refBreed="Rotbunt")
plot(Freq, ID=1, hap=2, refBreed="Holstein", Chr=1)


## Test for using multiple cores:

Freq1 <- haplofreq(files, Cattle, map, thisBreed="Angler", refBreeds="Rotbunt", 
                   minL=2.0, cores=NA)$freq
range(Freq[[1]]-Freq1)
#[1] 0 0


## Creating output files with allele frequencies and allele origins:

rdir  <- system.file("extdata", package = "optiSel")
wdir  <- file.path(tempdir(), "HaplotypeEval")
chr   <- unique(map$Chr)
files <- file.path(rdir, paste("Chr", chr, ".phased", sep=""))
wfile <- haplofreq(files, Cattle, map, thisBreed="Angler", minL=2.0, w.dir=wdir)

View(read.table(wfile$match[1],skip=1))
#unlink(wdir, recursive = TRUE)

optiSel documentation built on Sept. 11, 2024, 7:19 p.m.