getGenotypes: XIBD Pre-Analysis Data Processing
In bahlolab/XIBD: Identity-by-Descent Analysis of the Human Genome

Performs pre-analysis processing of PLINK-formatted unphased haplotype data. Processing removes SNPs and samples with high proportions of missing data, SNPs with low minor allele frequencies, and SNPs in high linkage disequilibrium (LD, based on R^2 if `model=2`). It also calculates population allele frequencies and haplotype frequencies between pairs of SNPs (based on R^2 if `model=2`).
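To make the missing-data and minor allele frequency filters concrete, here is a minimal sketch on a toy allele matrix. It is not XIBD's implementation: the object names, the inclusive thresholds (`<=`, `>=`) and the order of the steps are assumptions made for illustration only.

```r
# Toy PED-style allele matrix: 4 samples (rows) x 3 SNPs, two allele columns per SNP.
# Alleles are coded 1 (A), 2 (B) and 0 (missing), as in the PED format described on this page.
alleles <- matrix(c(1, 1,  1, 2,  0, 0,
                    1, 2,  2, 2,  1, 1,
                    1, 1,  0, 0,  1, 1,
                    2, 2,  1, 2,  1, 1),
                  nrow = 4, byrow = TRUE)

n_snps <- ncol(alleles) / 2
first  <- seq(1, ncol(alleles), by = 2)   # column holding the first allele of each SNP

# A genotype counts as missing when its alleles are coded 0
missing <- alleles[, first, drop = FALSE] == 0

snp_missing    <- colMeans(missing)   # proportion of samples missing, per SNP
sample_missing <- rowMeans(missing)   # proportion of SNPs missing, per sample

# Frequency of allele 2 (B) per SNP, ignoring missing alleles
freq_b <- sapply(seq_len(n_snps), function(j) {
  a <- alleles[, c(first[j], first[j] + 1)]
  a <- a[a != 0]
  mean(a == 2)
})
maf <- pmin(freq_b, 1 - freq_b)       # minor allele frequency

# Assumed (hypothetical) filter logic using the default thresholds
keep_snps    <- snp_missing <= 0.1 & maf >= 0.01
keep_samples <- sample_missing <= 0.1
```

In this toy data only the first SNP passes both filters: the second SNP is missing in one of four samples, and the third is both partly missing and monomorphic.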

getGenotypes(ped.map, reference.ped.map = NULL, snp.ld = NULL, model = 1,
  maf = 0.01, sample.max.missing = 0.1, snp.max.missing = 0.1,
  maximum.ld.r2 = 0.99, chromosomes = NULL, input.map.distance = "M",
  reference.map.distance = "M")

`ped.map`	a list with 2 objects:
A data frame which contains the PLINK PED information. The first six columns of this data frame are:
1. Family ID (type `"character"`, `"numeric"` or `"integer"`)
2. Individual ID (type `"character"`, `"numeric"` or `"integer"`)
3. Paternal ID (type `"character"`, `"numeric"` or `"integer"`)
4. Maternal ID (type `"character"`, `"numeric"` or `"integer"`)
5. Gender (1 = male, 2 = female)
6. Phenotype (1 = unaffected, 2 = affected, 0 = unknown)
where each row describes a single sample. The IDs are alphanumeric: the combination of family and individual ID should uniquely identify a sample. The phenotype column is not used in XIBD analyses; however, it is required for completeness of a standard pedigree. Columns 7 onwards are the sample haplotypes, where the A and B alleles are coded as 1 and 2 respectively and missing data is coded as 0. All SNPs (whether haploid or not) must have two alleles specified and each allele should be in a separate column. For example, the alleles in columns 7 and 8 correspond to the unphased haplotypes of SNP 1 in the map file. For haploid chromosomes, haplotypes should be specified as homozygous. Either both alleles should be missing (i.e. 0) or neither. No header row should be given.
A data frame which contains the PLINK MAP information. This data frame contains exactly four columns of information:
1. Chromosome (type `"numeric"` or `"integer"`)
2. SNP identifier (type `"character"`)
3. Genetic map distance (centi-Morgans, cM, or Morgans, M, the default) (type `"numeric"`)
4. Base-pair position (type `"numeric"` or `"integer"`)
where each row describes a single marker. Genetic map distance and base-pair positions are expected to be positive values. The MAP file must be ordered by increasing chromosomes and positions. SNP identifiers can contain any characters except spaces or tabs; `*` symbols should also be avoided in the names. The MAP file must contain as many markers as are in the PED file. No header row should be given.
`reference.ped.map`	a list containing reference data used to calculate population allele frequencies and haplotype frequencies, in the same format as `ped.map`. The default value is `reference.ped.map=NULL`, in which case XIBD calculates the population allele frequencies and haplotype frequencies from `ped.map`. This is not recommended for small datasets or datasets of mixed populations. Genetic map positions and base-pair positions in this dataset are used in the analysis. HapMap phase 2 and 3 PED and MAP data (hg19/build 37) for the 11 HapMap populations can be downloaded from http://bioinf.wehi.edu.au/software/XIBD.
`snp.ld`	optional for `model=1`; compulsory for `model=2`. A data frame generated from PLINK containing information on LD between pairs of SNPs (A and B). This data frame contains exactly 7 columns of information:
1. Chromosome of SNP A (type `"numeric"` or `"integer"`)
2. Base-pair position of SNP A (type `"numeric"` or `"integer"`)
3. SNP A identifier (type `"character"`)
4. Chromosome of SNP B (type `"numeric"` or `"integer"`)
5. Base-pair position of SNP B (type `"numeric"` or `"integer"`)
6. SNP B identifier (type `"character"`)
7. R-squared LD statistic between SNP A and SNP B (type `"numeric"`)
where each row contains the LD information for a single pair of SNPs. The data frame should contain the header `CHR_A, BP_A, SNP_A, CHR_B, BP_B, SNP_B` and `R2`. HapMap phase 2 and 3 PED and MAP data (hg19/build 37) for the 11 HapMap populations can be downloaded from http://bioinf.wehi.edu.au/software/XIBD. Alternatively, an LD file can be created using PLINK (http://www.cog-genomics.org/plink2).
`model`	an integer of either 1 or 2 denoting which of the two models should be run. `model=1` is based on the HMM implemented in PLINK (Purcell et al., 2007) which assumes the SNPs are in linkage equilibrium (LE). This often requires thinning of markers prior to use. `model=2` is based on the HMM implemented in RELATE (Albrechtsen et al., 2009) which allows SNPs to be in LD and implicitly accounts for the LD through conditional emission probabilities where the current genotype probability is conditioned on the genotype of a single previous SNP (haplotype frequencies). `model=1` requires significantly less time to run than `model=2` due to the reduced number of SNPs and simplified emission probabilities. It may be wise to use `model=1` when there are many SNPs and many samples.
`maf`	the smallest minor allele frequency allowed in the analysis. The default value is 0.01.
`sample.max.missing`	the maximum proportion of missing data allowed for each sample. The default value is 0.1.
`snp.max.missing`	the maximum proportion of missing data allowed for each SNP. The default value is 0.1.
`maximum.ld.r2`	the maximum linkage disequilibrium R^2 value allowed between pairs of SNPs. The default value is 0.99.
`chromosomes`	a numeric vector containing a subset of chromosomes to perform genotype filtering on. The default is `chromosomes=NULL` which will format genotypes for all chromosomes in `ped.map`. Autosomes are represented by numbers 1-22 and the X chromosome is denoted 23.
`input.map.distance`	either `"M"` or `"cM"` denoting whether the genetic map distances in `ped.map` are in Morgans (M) or centi-Morgans (cM). The default is Morgans.
`reference.map.distance`	either `"M"` or `"cM"` denoting whether the genetic map distances in `reference.ped.map` are in Morgans (M) or centi-Morgans (cM). The default is Morgans. HapMap reference data is in Morgans.
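The input layout described for `ped.map` and `snp.ld` can be sketched with small in-memory data frames. Everything below is illustrative: the column names, sample and SNP identifiers and values are hypothetical, coding founders' parents as 0 follows the usual PLINK convention rather than anything stated on this page, and placing the PED data frame first in the list is an assumption based on the order in which the two objects are described.

```r
# PED data frame: six pedigree columns, then two allele columns per SNP
ped <- data.frame(
  fid = c("F1", "F1"), iid = c("S1", "S2"),
  pid = c("0", "0"),   mid = c("0", "0"),
  sex = c(1, 2),       aff = c(0, 0),
  snp1_a = c(1, 2), snp1_b = c(1, 2),   # SNP 1: columns 7 and 8
  snp2_a = c(1, 0), snp2_b = c(2, 0),   # SNP 2: sample S2 missing at both alleles
  stringsAsFactors = FALSE
)

# MAP data frame with genetic distances recorded in centi-Morgans
map <- data.frame(
  chr = c(1, 1), snp_id = c("rs1", "rs2"),
  pos_cM = c(0.5, 1.2), pos_bp = c(10000, 25000),
  stringsAsFactors = FALSE
)

# Either pass input.map.distance = "cM", or convert to Morgans first (1 M = 100 cM)
map_M <- map
map_M[[3]] <- map$pos_cM / 100
names(map_M)[3] <- "pos_M"

# LD table with the header PLINK writes for pairwise R^2 reports
ld_text <- "CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
1 10000 rs1 1 25000 rs2 0.42"
snp_ld <- read.table(text = ld_text, header = TRUE, stringsAsFactors = FALSE)

# Sanity checks mirroring the requirements listed in the arguments
allele_cols <- as.matrix(ped[, -(1:6)])
first <- seq(1, ncol(allele_cols), by = 2)
a1 <- allele_cols[, first, drop = FALSE]
a2 <- allele_cols[, first + 1, drop = FALSE]
stopifnot(
  ncol(allele_cols) == 2 * nrow(map_M),          # two allele columns per marker
  all((a1 == 0) == (a2 == 0)),                   # both alleles missing, or neither
  !is.unsorted(order(map_M$chr, map_M$pos_bp)),  # ordered by chromosome, then position
  identical(names(snp_ld),
            c("CHR_A", "BP_A", "SNP_A", "CHR_B", "BP_B", "SNP_B", "R2"))
)

ped_map <- list(ped, map_M)   # assumed order: PED data frame, then MAP data frame
```

With real data, the two data frames would typically come from `read.table()` calls on the PED and MAP files with `header = FALSE`, since neither file carries a header row.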

A named list of three objects:

A pedigree containing the samples that remain after filtering. The pedigree is the first six columns of the PED file and these columns are headed fid, iid, pid, mid, sex and aff, respectively.
A data frame whose first five columns are:
1. Chromosome (type "character", "numeric" or "integer")
2. SNP identifiers (type "character")
3. Genetic map distance (Morgans, M) (type "numeric")
4. Base-pair position (type "numeric" or "integer")
5. Population allele frequency (type "numeric")
where each row describes a single marker. These columns are headed chr, snp_id, pos_M, pos_bp and freq respectively. If model=2 then the following columns are also included:
1. Numeric ID of condition SNP (type "numeric" or "integer")
2. Haplotype probability: pba (type "numeric")
3. Haplotype probability: pbA (type "numeric")
4. Haplotype probability: pBa (type "numeric")
5. Haplotype probability: pBA (type "numeric")
6. Population allele frequency on the condition SNP (type "numeric")
with the headers condition_snp, pba, pbA, pBa, pBA and freq_condition_snp. The remaining columns contain the genotype data for each sample, where a single column corresponds to a single sample. These columns are labeled with merged family IDs and individual IDs separated by a slash symbol (/).
The model selected.

The list elements are named pedigree, genotypes and model, respectively.
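The layout of the returned list can be explored as in the following sketch, which uses a hand-built mock object rather than real `getGenotypes` output. The sample IDs, marker values and genotype values are placeholders (the genotype coding is not specified on this page), and splitting column labels on `/` assumes that the IDs themselves contain no slash.

```r
# Mock object following the documented layout for model = 1 (values are placeholders)
mock <- list(
  pedigree = data.frame(fid = c("F1", "F1"), iid = c("S1", "S2"),
                        pid = c("0", "0"),   mid = c("0", "0"),
                        sex = c(1, 2),       aff = c(0, 0),
                        stringsAsFactors = FALSE),
  genotypes = data.frame(chr = c(1, 1), snp_id = c("rs1", "rs2"),
                         pos_M = c(0.005, 0.012), pos_bp = c(10000, 25000),
                         freq = c(0.30, 0.45),
                         `F1/S1` = c(0, 1), `F1/S2` = c(2, 1),
                         check.names = FALSE, stringsAsFactors = FALSE),
  model = 1
)

# Five marker columns for model 1; six more haplotype columns for model 2
n_info <- if (mock$model == 2) 11 else 5

marker_info      <- mock$genotypes[, seq_len(n_info)]
sample_genotypes <- mock$genotypes[, -seq_len(n_info), drop = FALSE]

# Recover family and individual IDs from the "fid/iid" column labels
sample_ids <- do.call(rbind, strsplit(colnames(sample_genotypes), "/", fixed = TRUE))
colnames(sample_ids) <- c("fid", "iid")
```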

# load the package that provides getGenotypes and the example data
library(XIBD)

# look at the simulated data
str(example_pedmap)

# format and filter the example data using model 2 and reference data
my_genotypes <- getGenotypes(ped.map = example_pedmap,
                             reference.ped.map = example_reference_pedmap,
                             snp.ld = example_reference_ld,
                             model = 2,
                             maf = 0.01,
                             sample.max.missing = 0.1,
                             snp.max.missing = 0.1,
                             maximum.ld.r2 = 0.99,
                             chromosomes = NULL,
                             input.map.distance = "M",
                             reference.map.distance = "M")