data2haplohh: Convert data from input file to an object of class haplohh

Description

Convert input data files to an object of haplohh-class.

Usage

data2haplohh(
  hap_file,
  map_file = NA,
  min_perc_geno.hap = NA,
  min_perc_geno.mrk = 100,
  min_maf = NA,
  chr.name = NA,
  popsel = NA,
  recode.allele = FALSE,
  allele_coding = "12",
  haplotype.in.columns = FALSE,
  remove_multiple_markers = FALSE,
  polarize_vcf = TRUE,
  capitalize_AA = TRUE,
  vcf_reader = "data.table",
  position_scaling_factor = NA,
  verbose = TRUE
)

Arguments

hap_file

file containing haplotype data (see details below).

map_file

file containing map information (see details below).

min_perc_geno.hap

minimum percentage of genotyped markers per haplotype (haplotypes with less than min_perc_geno.hap percent of markers genotyped are discarded). Default is NA, hence no constraint.

min_perc_geno.mrk

minimum percentage of genotyped haplotypes per marker (markers genotyped on less than min_perc_geno.mrk percent of haplotypes are discarded). By default, min_perc_geno.mrk=100, hence only fully genotyped markers are retained. This value cannot be set to NA or zero.

min_maf

threshold on the Minor Allele Frequency (MAF). Markers with a MAF lower than or equal to min_maf are discarded. For multi-allelic markers, the second-most frequent allele is taken as the minor allele. Setting this value to zero eliminates monomorphic sites. Default is NA, hence no constraint.
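
The min_maf rule can be illustrated with a minimal base-R sketch. It is not the package's implementation: the toy matrix and the helper second_allele_freq() are hypothetical, and it assumes frequencies are computed over non-missing alleles only.

```r
# Toy haplotype matrix (made-up data): rows are haplotypes, columns are markers.
haplo <- matrix(c(1, 1, 1, 1,   # marker 1: monomorphic
                  1, 2, 2, 2,   # marker 2: bi-allelic
                  1, 2, 3, 3),  # marker 3: three alleles
                nrow = 4)

# Hypothetical helper: frequency of the second-most frequent allele,
# computed over non-missing entries (an assumption of this sketch).
second_allele_freq <- function(x) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(0)  # monomorphic marker
  as.numeric(counts[2]) / length(x)
}

maf <- apply(haplo, 2, second_allele_freq)

# Markers with MAF lower than or equal to min_maf are discarded,
# so min_maf = 0 removes exactly the monomorphic markers.
min_maf <- 0
keep <- maf > min_maf
```

With min_maf = 0, only the first (monomorphic) marker of this toy matrix is dropped.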

chr.name

name of the chromosome considered (relevant if data for several chromosomes is contained in the haplotype or map file).

popsel

code of the population considered (relevant for fastPHASE output which can contain haplotypes from various populations).

recode.allele

*Deprecated*. logical. FALSE by default. TRUE forces parameter allele_coding to "map", FALSE leaves it unchanged.

allele_coding

the allele coding provided by the user. Either "12" (default), "01", "map" or "none". The option is irrelevant for vcf files and ms output.

haplotype.in.columns

logical. If TRUE, phased input haplotypes are assumed to be in columns (as produced by the SHAPEIT2 program; O'Connell et al., 2014).

remove_multiple_markers

logical. If FALSE (default), conversion stops if multiple markers with the same chromosomal position are encountered. If TRUE, duplicated markers are removed (for each position, all but the first marker are discarded).
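
As an illustration of the "all but the first" rule, the following base-R sketch uses duplicated() on a made-up position vector. It only mimics the described behaviour and is not taken from the package source.

```r
# Made-up marker positions containing repeated values.
positions <- c(100, 250, 250, 400, 400, 400, 900)

# duplicated() flags every occurrence after the first one.
dup <- duplicated(positions)

# Corresponds to remove_multiple_markers = TRUE: keep the first marker
# at each position and drop the rest.
kept <- positions[!dup]
n_removed <- sum(dup)
```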

polarize_vcf

logical. Only of relevance for vcf files. If TRUE (default), tries to polarize variants with the help of the AA entry in the INFO field. Alleles that cannot be polarized are discarded. If FALSE, the allele coding of the vcf file is used unchanged as internal coding.

capitalize_AA

logical. Only of relevance for vcf files with ancestral allele information. Low confidence ancestral alleles are usually coded by lower-case letters. If TRUE (default), these are changed to upper case before the alleles of the sample are matched for polarization.
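
The effect of capitalization on matching can be sketched in base R. The allele vectors below are invented, and the matching rule (an ancestral allele must equal one of the sample's alleles) is a simplifying assumption for illustration rather than the package's actual code.

```r
# Invented example: ancestral alleles (lower case = low confidence)
# and the two alleles observed at each of four variants.
aa  <- c("a", "G", "t", "C")
ref <- c("A", "A", "T", "G")
alt <- c("C", "G", "C", "C")

# Without capitalization, lower-case ancestral alleles match nothing.
match_raw <- aa == ref | aa == alt

# With capitalization (as with capitalize_AA = TRUE), they can be matched.
aa_upper  <- toupper(aa)
match_cap <- aa_upper == ref | aa_upper == alt
```

In this toy case, two of the four variants can only be matched after conversion to upper case.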

vcf_reader

library used to read vcf. By default, low-level parsing is performed using the generic package data.table. In order to read compressed files, the package R.utils must be installed, too. If the specialized package vcfR is available, set this parameter to "vcfR".

position_scaling_factor

intended primarily for output of ms where positions lie in the interval [0,1]. These can be rescaled to sizes of typical markers in real data.
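
A minimal sketch of such a rescaling, assuming the factor is applied by plain multiplication (the positions and the factor are made up):

```r
# Made-up ms-style positions in the interval [0,1].
ms_positions <- c(0.0123, 0.2500, 0.9871)

# Assumed semantics: each position is multiplied by the scaling factor,
# here mapping the unit interval onto a region of one million base pairs.
position_scaling_factor <- 1e6
scaled_positions <- ms_positions * position_scaling_factor
```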

verbose

logical. If TRUE (default), report verbose progress.

Details

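Five haplotype input formats are supported: the "standard format", the "transposed format" (haplotypes in columns, as produced by SHAPEIT2), fastPHASE output, vcf files and ms output.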

The "transposed format" has to be set explicitly (by haplotype.in.columns = TRUE), while the other formats are recognized automatically.

The map file contains marker information in three columns (marker name, chromosome, position) or, if it is used for polarization (see below), in five columns (additionally the ancestral and the derived allele).

The markers must be in the same order as in the haplotype file. If several chromosomes are represented in the map file, the one corresponding to the haplotype file must be selected with the parameter chr.name.

Haplotypes can be given either with alleles already coded as numbers (in two possible ways) or with the actual alleles (e.g. nucleotides), which can be translated into numbers either using the fourth and fifth column of the map file or by their alpha-numeric order. Correspondingly, the parameter allele_coding has to be set to "12" or "01" (alleles already coded as numbers), "map" (translation by the map file) or "none" (coding by alpha-numeric order).

The information on allelic ancestry is exploited only in the frequency-bin-wise standardization of iHS (see ihh2ihs). However, although ancestry status does not figure in the formulas of the cross-population statistics Rsb and XP-EHH, their values do depend on the assigned status.

The arguments min_perc_geno.hap, min_perc_geno.mrk and min_maf are evaluated in this order.
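
Why the order matters can be seen in a small base-R sketch. The toy matrix is invented and the code only mimics the documented thresholds; it is not the package's implementation.

```r
# Invented data: rows are haplotypes, columns are markers, NA = missing.
haplo <- rbind(
  h1 = c(1, 2, 1),
  h2 = c(1, 1, 2),
  h3 = c(NA, NA, 1),
  h4 = c(2, 1, 1)
)
min_perc_geno.hap <- 50
min_perc_geno.mrk <- 100

# Step 1: discard haplotypes genotyped on less than
# min_perc_geno.hap percent of markers (here h3, with 1 of 3 markers).
hap_perc <- 100 * rowMeans(!is.na(haplo))
haplo <- haplo[hap_perc >= min_perc_geno.hap, , drop = FALSE]

# Step 2: discard markers genotyped on less than
# min_perc_geno.mrk percent of the remaining haplotypes.
mrk_perc <- 100 * colMeans(!is.na(haplo))
haplo <- haplo[, mrk_perc >= min_perc_geno.mrk, drop = FALSE]

# A min_maf filter, if requested, would be applied last on this matrix.
```

Because the poorly genotyped haplotype h3 is removed first, all three markers survive the marker filter. Had the marker filter run first, the first two markers would have been discarded.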

Value

The returned value is an object of haplohh-class.

References

Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet, 78, 629-644.

O'Connell J, Gurdasani D, Delaneau O, et al (2014) A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet, 10, e1004234.

Examples

#copy example files into the current working directory.
make.example.files()
#create object using a haplotype file in "standard format"
hap <- data2haplohh(hap_file = "bta12_cgu.hap",
                   map_file = "map.inp",
                   chr.name = 12,
                   allele_coding = "map")
#create object using fastPHASE output
hap <- data2haplohh(hap_file = "bta12_hapguess_switch.out",
                   map_file = "map.inp",
                   chr.name = 12,
                   popsel = 7,
                   allele_coding = "map")
#clean up demo files
remove.example.files()

Example output

* Reading input file(s) *
Map info: 1424 markers declared for chromosome 12 .
Haplotype input file in standard format assumed.
Alleles are being recoded according to fourth and fifth column of map file.
* Filtering data *
Discard markers genotyped on less than 100 % of haplotypes.
No marker discarded.
Data consists of 280 haplotypes and 1424 markers.
Number of mono-, bi-, multi-allelic markers:
1 2 
27 1397 
* Reading input file(s) *
Map info: 1424 markers declared for chromosome 12 .
Haplotype input file in fastPHASE format assumed.
Haplotypes in the fastPHASE output file originate from 8 populations.
Alleles are being recoded according to fourth and fifth column of map file.
* Filtering data *
Discard markers genotyped on less than 100 % of haplotypes.
No marker discarded.
Data consists of 280 haplotypes and 1424 markers.
Number of mono-, bi-, multi-allelic markers:
1 2 
27 1397 

rehh documentation built on Sept. 15, 2021, 5:06 p.m.