getGenotypes: Pre-Analysis Data Processing


Description

getGenotypes() performs pre-analysis data processing of PLINK-formatted unphased genotype data. It removes SNPs and isolates with high proportions of missing data, and SNPs with low minor allele frequencies. It also calculates SNP allele frequencies from either the input dataset or a specified reference dataset.

Usage

getGenotypes(ped.map, reference.ped.map = NULL, maf = 0.01,
  isolate.max.missing = 0.1, snp.max.missing = 0.1, chromosomes = NULL,
  input.map.distance = "cM", reference.map.distance = "cM")

Arguments

ped.map

A list with 2 objects:

  1. A data frame which contains the PLINK PED information. The first six columns of this data frame are:

    1. Family ID (type "character", "numeric" or "integer")

    2. Isolate ID (type "character", "numeric" or "integer")

    3. Paternal ID (type "character", "numeric" or "integer")

    4. Maternal ID (type "character", "numeric" or "integer")

    5. Multiplicity of infection (MOI) (1 = single infection or haploid, 2 = multiple infections or diploid)

    6. Phenotype (type "character", "numeric" or "integer")

    where each row describes a single isolate. The IDs are alphanumeric, and the combination of family ID and isolate ID should uniquely identify a sample. The paternal, maternal and phenotype columns are not used by isoRelate; however, they are required for completeness of a standard pedigree and are typically filled with the numeric value zero. Columns 7 onwards are the isolate genotypes, where the A and B alleles are coded as 1 and 2 respectively and missing data is coded as 0. All SNPs must have two alleles specified, and each allele should be in a separate column. For example, the alleles in columns 7 and 8 correspond to the unphased genotype of SNP 1 in the map file. For single infections, genotypes should be specified as homozygous. For each SNP, either both alleles should be missing (i.e. 0) or neither. Column names are not required.

  2. A data frame which contains the PLINK MAP information. This data frame contains exactly four columns of information:

    1. Chromosome (type "character", "numeric" or "integer")

    2. SNP identifier (type "character")

    3. Genetic map distance (centimorgans, cM, or Morgans, M) (type "numeric")

    4. Base-pair position (type "numeric" or "integer")

    where each row describes a single SNP. Genetic map distances and base-pair positions are expected to be positive values. The MAP file must be ordered by increasing genetic map distance. SNP identifiers can contain any characters except spaces or tabs; you should also avoid * symbols in the names. The MAP file must contain as many markers as are in the PED file. Column names are not required.
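As an illustration of the layout described above, the following is a minimal sketch that builds a toy ped.map list in base R and checks it against the stated rules. The object name toy_pedmap, the column names and all values are hypothetical; column names are optional according to the description above, and the checks are not part of isoRelate.

```r
# Hypothetical toy data: 3 isolates and 2 SNPs (values are illustrative only)
ped <- data.frame(
  fid = c("F1", "F1", "F2"),        # family ID
  iid = c("I1", "I2", "I1"),        # isolate ID
  pid = 0, mid = 0,                 # unused, filled with zero
  moi = c(1, 2, 1),                 # 1 = single infection, 2 = multiple infections
  aff = 0,                          # unused phenotype
  snp1_a = c(1, 1, 0), snp1_b = c(1, 2, 0),   # columns 7 and 8: SNP 1
  snp2_a = c(2, 1, 2), snp2_b = c(2, 1, 2),   # columns 9 and 10: SNP 2
  stringsAsFactors = FALSE
)
map <- data.frame(
  chr    = c(1, 1),
  snp_id = c("snp1", "snp2"),
  pos_cM = c(0.5, 1.2),
  pos_bp = c(1000L, 25000L),
  stringsAsFactors = FALSE
)
toy_pedmap <- list(ped, map)

# Sanity checks mirroring the rules in the text
n_snps <- (ncol(ped) - 6) / 2
a1 <- as.matrix(ped[, seq(7, ncol(ped), by = 2)])
a2 <- as.matrix(ped[, seq(8, ncol(ped), by = 2)])
stopifnot(n_snps == nrow(map))                             # one MAP row per SNP
stopifnot(all((a1 == 0) == (a2 == 0)))                     # both alleles missing or neither
stopifnot(all(a1[ped$moi == 1, ] == a2[ped$moi == 1, ]))   # MOI 1 must be homozygous
stopifnot(!is.unsorted(map$pos_cM))                        # ordered by map distance
stopifnot(!any(duplicated(paste(ped$fid, ped$iid))))       # unique family/isolate pairs
```

A list built this way has the same two-element shape that the ped.map argument expects.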

reference.ped.map

An optional list containing reference data used to calculate SNP allele frequencies. The list has 2 objects in the same format as ped.map. The default value is reference.ped.map=NULL, in which case isoRelate calculates the SNP allele frequencies from the input data. Calculating frequencies from the input data is not recommended for small datasets or datasets of mixed populations.

maf

A numeric value denoting the smallest minor allele frequency allowed in the analysis. The default value is 0.01.

isolate.max.missing

A numeric value denoting the maximum proportion of missing data allowed for each isolate. The default value is 0.1.

snp.max.missing

A numeric value denoting the maximum proportion of missing data allowed for each SNP. The default value is 0.1.
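To make the three thresholds (maf, isolate.max.missing and snp.max.missing) concrete, here is a rough base R sketch of how such filters could be computed on a toy allele matrix. It is not isoRelate's implementation: the toy values are hypothetical, every non-missing allele is assumed to count equally towards the frequency, and the use of non-strict comparisons and the order of filtering are assumptions.

```r
# Toy allele matrix: rows are isolates, two columns per SNP (hypothetical values)
alleles <- matrix(c(1, 1,  2, 2,  0, 0,
                    1, 2,  2, 2,  1, 1,
                    1, 1,  2, 2,  1, 1,
                    1, 1,  0, 0,  0, 0),
                  nrow = 4, byrow = TRUE)
a1 <- alleles[, seq(1, ncol(alleles), by = 2), drop = FALSE]
a2 <- alleles[, seq(2, ncol(alleles), by = 2), drop = FALSE]

# Both alleles of a SNP are missing together, so one column is enough
missing <- a1 == 0

isolate_missing <- rowMeans(missing)   # proportion of missing SNPs per isolate
snp_missing     <- colMeans(missing)   # proportion of missing isolates per SNP

# Frequency of the allele coded 1 among called alleles
# (assumption: each called allele counts equally)
n_called <- colSums(a1 != 0) + colSums(a2 != 0)
freq_a   <- (colSums(a1 == 1) + colSums(a2 == 1)) / n_called
maf      <- pmin(freq_a, 1 - freq_a)

# Apply the default thresholds from the Usage section
keep_isolate <- isolate_missing <= 0.1
keep_snp     <- snp_missing <= 0.1 & maf >= 0.01
```

In this toy example only the first SNP and the two middle isolates pass, which shows how quickly the defaults remove sparse data.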

chromosomes

A vector containing a subset of chromosomes to perform formatting on. The default value is chromosomes=NULL which will reformat all genotypes for all chromosomes in the MAP data frame.

input.map.distance

A character string of either "M" or "cM" denoting whether the genetic map distances in the input MAP data frame are in Morgans (M) or centimorgans (cM). The default is "cM".

reference.map.distance

A character string of either "M" or "cM" denoting whether the genetic map distances in the reference MAP data frame are in Morgans (M) or centimorgans (cM). The default is "cM".
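The two distance arguments matter because one Morgan equals 100 centimorgans. The sketch below shows the unit conversion this implies; the helper to_morgans() is hypothetical and is not an isoRelate function.

```r
# Hypothetical helper: convert genetic map distances to Morgans
to_morgans <- function(pos, unit = c("cM", "M")) {
  unit <- match.arg(unit)
  if (unit == "cM") pos / 100 else pos   # 1 M = 100 cM
}

map_cM <- c(0.5, 1.2, 150)               # illustrative distances in centimorgans
pos_M  <- to_morgans(map_cM, "cM")
```

Passing the wrong unit would scale every distance by a factor of 100, so it is worth checking the MAP data before calling getGenotypes().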

Value

A list of two objects named pedigree and genotypes:

  1. A pedigree containing the isolates that remain after filtering. The pedigree is the first six columns of the PED file and these columns are headed fid, iid, pid, mid, moi and aff respectively.

  2. A data frame with the first five columns:

    1. Chromosome (type "character", "numeric" or "integer")

    2. SNP identifiers (type "character")

    3. Genetic map distance (Morgans, M) (type "numeric")

    4. Base-pair position (type "integer")

    5. Population allele frequency (type "numeric")

    where each row describes a single SNP. These columns are headed chr, snp_id, pos_M, pos_bp and freq respectively. Columns 6 onwards contain the genotype data for each isolate, where a single column corresponds to a single isolate. These columns are labeled with merged family IDs and isolate IDs separated by a slash symbol (/).
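The relationship between the two returned objects can be sketched with a hand-built mock that follows the column headings described above. Everything here is hypothetical: mock_result is not produced by getGenotypes(), the genotype values are placeholders, and the matching order of isolates between the two objects is a property of the mock only.

```r
# Mock pedigree with the documented headings
pedigree <- data.frame(
  fid = c("F1", "F2"), iid = c("I1", "I1"),
  pid = 0, mid = 0, moi = c(1, 2), aff = 0,
  stringsAsFactors = FALSE
)

# Mock genotypes: five SNP columns, then one column per isolate
genotypes <- data.frame(
  chr = c(1, 1), snp_id = c("snp1", "snp2"),
  pos_M = c(0.005, 0.012), pos_bp = c(1000L, 25000L),
  freq = c(0.4, 0.7),
  stringsAsFactors = FALSE
)
genotypes[["F1/I1"]] <- c(0, 2)   # placeholder genotype values
genotypes[["F2/I1"]] <- c(1, 2)

mock_result <- list(pedigree = pedigree, genotypes = genotypes)

# Rebuild the isolate column labels from the pedigree: family ID, slash, isolate ID
isolate_labels <- paste(mock_result$pedigree$fid, mock_result$pedigree$iid, sep = "/")
geno_cols      <- colnames(mock_result$genotypes)[-(1:5)]
```

Comparing isolate_labels with geno_cols is a quick way to link a genotype column back to its pedigree row.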

See Also

getIBDparameters and getIBDsegments.

Examples

# take a look at the data
str(png_pedmap)

# reformat and filter to call genotypes
my_genotypes <- getGenotypes(ped.map = png_pedmap,
                             reference.ped.map = NULL,
                             maf = 0.01,
                             isolate.max.missing = 0.1,
                             snp.max.missing = 0.1,
                             chromosomes = NULL,
                             input.map.distance = "cM",
                             reference.map.distance = "cM")

bahlolab/isoRelate documentation built on May 11, 2019, 5:25 p.m.