getGenotypes: XIBD Pre-Analysis Data Processing


Description

Perform pre-analysis data processing of PLINK-formatted unphased haplotype data, including removal of SNPs and samples with high proportions of missing data, SNPs with low minor allele frequencies, and SNPs in high linkage disequilibrium (LD, based on R^2 if model=2). Also calculate population allele frequencies and haplotype frequencies between pairs of SNPs (based on R^2 if model=2).

Usage

getGenotypes(ped.map, reference.ped.map = NULL, snp.ld = NULL, model = 1,
  maf = 0.01, sample.max.missing = 0.1, snp.max.missing = 0.1,
  maximum.ld.r2 = 0.99, chromosomes = NULL, input.map.distance = "M",
  reference.map.distance = "M")

Arguments

ped.map

a list with 2 objects:

  1. a data frame which contains the PLINK PED information. The first six columns of this data frame are:

    1. Family ID (type "character", "numeric" or "integer")

    2. Individual ID (type "character", "numeric" or "integer")

    3. Paternal ID (type "character", "numeric" or "integer")

    4. Maternal ID (type "character", "numeric" or "integer")

    5. Gender (1 = male, 2 = female)

    6. Phenotype (1 = unaffected, 2 = affected, 0 = unknown)

    where each row describes a single sample. The IDs are alphanumeric: the combination of family and individual ID should uniquely identify a sample. The phenotype column is not used in XIBD analyses; however, it is required for completeness of a standard pedigree. Columns 7 onwards are the sample haplotypes, where the A and B alleles are coded as 1 and 2 respectively and missing data is coded as 0. All SNPs (whether haploid or not) must have two alleles specified, and each allele should be in a separate column. For example, the alleles in columns 7 and 8 correspond to the unphased haplotypes of SNP 1 in the map file. For haploid chromosomes, haplotypes should be specified as homozygous. Either both alleles should be missing (i.e. 0) or neither. No header row should be given.

  2. a data frame which contains the PLINK MAP information. This data frame contains exactly four columns of information:

    1. Chromosome ("numeric" or "integer")

    2. SNP identifier (type "character")

    3. Genetic map distance in centimorgans (cM) or Morgans (M, the default) (type "numeric")

    4. Base-pair position (type "numeric" or "integer")

    where each row describes a single marker. Genetic map distances and base-pair positions are expected to be positive values. The MAP file must be ordered by increasing chromosome and position. SNP identifiers can contain any characters except spaces or tabs; you should also avoid * symbols in the names. The MAP file must contain as many markers as are in the PED file. No header row should be given.
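The layout described above can be sketched with a toy two-sample, three-SNP dataset. This is a minimal illustration in base R, not data shipped with XIBD: every value is invented, and the column names (`fid`, `snp1_a1`, `chr`, and so on) are placeholders chosen here for readability, since the format itself is defined by column position.

```r
# Toy PED data: 2 samples, 3 SNPs, two allele columns per SNP.
# All values and column names are invented for illustration.
ped <- data.frame(
  fid = c("fam1", "fam1"),
  iid = c("ind1", "ind2"),
  pid = c("0", "0"),
  mid = c("0", "0"),
  sex = c(1, 2),                          # 1 = male, 2 = female
  aff = c(0, 0),                          # phenotype: 0 = unknown
  snp1_a1 = c(1, 2), snp1_a2 = c(1, 2),
  snp2_a1 = c(1, 0), snp2_a2 = c(2, 0),   # ind2 is missing at SNP 2: both alleles 0
  snp3_a1 = c(2, 1), snp3_a2 = c(2, 2),
  stringsAsFactors = FALSE
)

# Toy MAP data: one row per SNP, ordered by chromosome and position.
map <- data.frame(
  chr    = c(1, 1, 1),
  snp_id = c("rs_a", "rs_b", "rs_c"),
  pos_M  = c(0.010, 0.015, 0.021),        # genetic distance in Morgans
  pos_bp = c(10000, 15500, 21000),
  stringsAsFactors = FALSE
)

# ped.map is a list holding the PED data frame first and the MAP data frame second.
toy_pedmap <- list(ped, map)

# Sanity checks mirroring the rules stated above.
n_snps <- (ncol(ped) - 6) / 2
a1 <- as.matrix(ped[, seq(7, ncol(ped), by = 2)])
a2 <- as.matrix(ped[, seq(8, ncol(ped), by = 2)])
alleles_paired <- all((a1 == 0) == (a2 == 0))   # both alleles missing, or neither
map_sorted <- !is.unsorted(order(map$chr, map$pos_bp))
```

Running checks like these before calling getGenotypes can catch a PED/MAP marker-count mismatch or a half-missing genotype early.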

reference.ped.map

a list containing reference data used to calculate population allele frequencies and haplotype frequencies, in the same format as ped.map. The default value is reference.ped.map=NULL, in which case XIBD will calculate the population allele frequencies and haplotype frequencies from ped.map. This is not recommended for small datasets or datasets of mixed populations. Genetic map positions and base-pair positions in this dataset are used in the analysis. HapMap phase 2 and 3 PED and MAP data (hg19/build 37) for the 11 HapMap populations can be downloaded from http://bioinf.wehi.edu.au/software/XIBD.

snp.ld

optional for model=1; compulsory for model=2. A data frame generated from PLINK containing information on LD between pairs of SNPs (A and B). This data frame contains exactly 7 columns of information:

  1. Chromosome of SNP A (type "numeric" or "integer")

  2. Base-pair position of SNP A (type "numeric" or "integer")

  3. SNP A identifier (type "character")

  4. Chromosome of SNP B (type "numeric" or "integer")

  5. Base-pair position of SNP B (type "numeric" or "integer")

  6. SNP B identifier (type "character")

  7. R-squared LD statistic between SNP A and SNP B (type "numeric")

where each row contains the LD information for a single pair of SNPs. The data frame should contain the header CHR_A, BP_A, SNP_A, CHR_B, BP_B, SNP_B and R2. HapMap phase 2 and 3 PED and MAP data (hg19/build 37) for the 11 HapMap populations can be downloaded from http://bioinf.wehi.edu.au/software/XIBD. Alternatively, an LD file can be created using PLINK (http://www.cog-genomics.org/plink2).
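As a rough sketch of the expected shape, the snippet below builds a toy LD table with the seven required columns and checks its header. The SNP names and R2 values are invented. The final line flags pairs above the default maximum.ld.r2 of 0.99; whether XIBD treats that threshold as inclusive is not stated here, so the strict comparison is an assumption.

```r
# Toy LD table with the seven columns listed above (values are invented).
# A real table would typically be read from a PLINK LD file, e.g.
#   read.table("my_ld_file.ld", header = TRUE, stringsAsFactors = FALSE)
# where "my_ld_file.ld" is a hypothetical file name.
toy_ld <- data.frame(
  CHR_A = c(1, 1),
  BP_A  = c(10000, 15500),
  SNP_A = c("rs_a", "rs_b"),
  CHR_B = c(1, 1),
  BP_B  = c(15500, 21000),
  SNP_B = c("rs_b", "rs_c"),
  R2    = c(0.42, 0.995),
  stringsAsFactors = FALSE
)

# Confirm the header matches the documented names, in order.
expected_header <- c("CHR_A", "BP_A", "SNP_A", "CHR_B", "BP_B", "SNP_B", "R2")
header_ok <- identical(names(toy_ld), expected_header)

# SNP pairs whose R2 exceeds a threshold of 0.99 (strict ">" is an assumption).
high_ld_pairs <- toy_ld[toy_ld$R2 > 0.99, c("SNP_A", "SNP_B")]
```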

model

an integer of either 1 or 2 denoting which of the two models should be run.

  1. model=1 is based on the HMM implemented in PLINK (Purcell et al., 2007) which assumes the SNPs are in linkage equilibrium (LE). This often requires thinning of markers prior to use.

  2. model=2 is based on the HMM implemented in RELATE (Albrechtsen et al., 2009) which allows SNPs to be in LD and implicitly accounts for the LD through conditional emission probabilities where the current genotype probability is conditioned on the genotype of a single previous SNP (haplotype frequencies).

model=1 requires significantly less time to run than model=2 due to the reduced number of SNPs and simplified emission probabilities. It may be wise to use model=1 when there are many SNPs and many samples.

maf

the smallest minor allele frequency allowed in the analysis. The default value is 0.01.

sample.max.missing

the maximum proportion of missing data allowed for each sample. The default value is 0.1.

snp.max.missing

the maximum proportion of missing data allowed for each SNP. The default value is 0.1.

maximum.ld.r2

the maximum linkage disequilibrium R2 value allowed between pairs of SNPs. The default value is 0.99.
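To make the maf, sample.max.missing and snp.max.missing thresholds concrete, here is a small base R sketch that computes the corresponding quantities on a toy allele matrix. It is an illustration under stated assumptions, not XIBD's internal code: the order in which XIBD applies its filters is not described here, and the sketch estimates allele frequencies from the toy data itself rather than from a reference dataset.

```r
# Toy allele matrix: rows = samples, columns = alleles (two per SNP).
# Coding follows the PED convention above: 1 and 2 are alleles, 0 is missing.
alleles <- matrix(c(
  1, 1,  1, 2,  2, 2,
  1, 2,  0, 0,  2, 2,
  1, 1,  1, 1,  2, 2,
  2, 2,  0, 0,  2, 2
), nrow = 4, byrow = TRUE)

first  <- alleles[, seq(1, ncol(alleles), by = 2), drop = FALSE]
second <- alleles[, seq(2, ncol(alleles), by = 2), drop = FALSE]

# Both alleles of a genotype are missing together, so checking one suffices.
missing <- first == 0

snp_missing    <- colMeans(missing)   # proportion of samples missing per SNP
sample_missing <- rowMeans(missing)   # proportion of SNPs missing per sample

# Frequency of the allele coded 1 among non-missing alleles, then the MAF.
count_a   <- colSums(first == 1) + colSums(second == 1)
n_alleles <- 2 * colSums(!missing)
freq_a    <- count_a / n_alleles
maf_obs   <- pmin(freq_a, 1 - freq_a)

# Apply the default thresholds (maf = 0.01, both missingness limits = 0.1).
keep_snps    <- snp_missing <= 0.1 & maf_obs >= 0.01
keep_samples <- sample_missing <= 0.1
```

In this toy data the second SNP fails on missingness (half the samples are missing) and the third fails on MAF (it is monomorphic), so only the first SNP would be retained.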

chromosomes

a numeric vector containing a subset of chromosomes to perform genotype filtering on. The default is chromosomes=NULL which will format genotypes for all chromosomes in ped.map. Autosomes are represented by numbers 1-22 and the X chromosome is denoted 23.

input.map.distance

either "M" or "cM" denoting whether the genetic map distances in ped.map are in Morgans (M) or centi-Morgans (cM). The default is Morgans.

reference.map.distance

either "M" or "cM" denoting whether the genetic map distances in reference.ped.map are in Morgans (M) or centi-Morgans (cM). The default is Morgans. HapMap reference data is in Morgans.
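For reference, one Morgan equals 100 centimorgans, so the two units differ only by a factor of 100. The sketch below shows the conversion on invented positions. Because the returned map is documented as being in Morgans, setting input.map.distance = "cM" is presumably enough and no manual conversion should be needed; the snippet is only meant to help check which unit a MAP file uses.

```r
# Invented genetic map positions in centimorgans.
pos_cM <- c(1.0, 1.5, 2.1)

# 1 Morgan = 100 centimorgans.
pos_M <- pos_cM / 100

# A quick heuristic (an assumption, not an XIBD rule): human chromosomes span
# at most a few Morgans, so values well above that are probably in cM.
looks_like_cM <- max(pos_cM) > 5
```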

Value

A named list of three objects:

  1. A pedigree containing the samples that remain after filtering. The pedigree is the first six columns of the PED file and these columns are headed fid, iid, pid, mid, sex and aff, respectively.

  2. A data frame whose first five columns are:

    1. Chromosome (type "character", "numeric" or "integer")

    2. SNP identifiers (type "character")

    3. Genetic map distance (Morgans, M) (type "numeric")

    4. Base-pair position (type "numeric" or "integer")

    5. Population allele frequency (type "numeric")

    where each row describes a single marker. These columns are headed chr, snp_id, pos_M, pos_bp and freq respectively. If model=2 then the following columns are also included:

    1. Numeric ID of condition SNP (type "numeric" or "integer")

    2. Haplotype probability: pba (type "numeric")

    3. Haplotype probability: pbA (type "numeric")

    4. Haplotype probability: pBa (type "numeric")

    5. Haplotype probability: pBA (type "numeric")

    6. Population allele frequency on the condition SNP (type "numeric")

    with the headers condition_snp, pba, pbA, pBa, pBA and freq_condition_snp. The remaining columns contain the genotype data for each sample, where a single column corresponds to a single sample. These columns are labeled with merged family IDs and individual IDs separated by a slash symbol (/).

  3. The model selected.

The list is named pedigree, genotypes and model respectively.
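The structure of the returned list can be explored as sketched below. The object `mock_result` is a hand-built stand-in that follows the documented names and column headers, so the snippet runs without XIBD installed; the genotype values in it are placeholders, since the genotype coding is not described in this section.

```r
# Hand-built stand-in for a getGenotypes() result with model = 1.
# Names follow the documentation above; all values are invented.
mock_result <- list(
  pedigree = data.frame(
    fid = c("fam1", "fam1"), iid = c("ind1", "ind2"),
    pid = c("0", "0"), mid = c("0", "0"),
    sex = c(1, 2), aff = c(0, 0),
    stringsAsFactors = FALSE
  ),
  genotypes = data.frame(
    chr = c(1, 1, 1),
    snp_id = c("rs_a", "rs_b", "rs_c"),
    pos_M = c(0.010, 0.015, 0.021),
    pos_bp = c(10000, 15500, 21000),
    freq = c(0.40, 0.25, 0.10),
    "fam1/ind1" = c(0, 1, 2),   # placeholder genotype values
    "fam1/ind2" = c(2, 1, 0),   # placeholder genotype values
    check.names = FALSE,        # keep the "/" in sample column names
    stringsAsFactors = FALSE
  ),
  model = 1
)

# Annotation columns precede the per-sample columns:
# 5 for model 1, plus 6 haplotype-related columns for model 2.
n_annot <- if (mock_result$model == 2) 11 else 5
sample_cols <- names(mock_result$genotypes)[-(1:n_annot)]

# Sample columns are labelled "<family ID>/<individual ID>".
expected_labels <- paste(mock_result$pedigree$fid,
                         mock_result$pedigree$iid, sep = "/")
```

With a real result, the same indexing would apply to `my_genotypes$genotypes`, using 11 annotation columns when model = 2.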

Examples

# look at the simulated data
str(example_pedmap)

# format and filter the example data using model 2 and reference data
my_genotypes <- getGenotypes(ped.map = example_pedmap,
                             reference.ped.map = example_reference_pedmap,
                             snp.ld = example_reference_ld,
                             model = 2,
                             maf = 0.01,
                             sample.max.missing = 0.1,
                             snp.max.missing = 0.1,
                             maximum.ld.r2 = 0.99,
                             chromosomes = NULL,
                             input.map.distance = "M",
                             reference.map.distance = "M")

bahlolab/XIBD documentation built on May 11, 2019, 5:24 p.m.