genotype_filter: Filter SNP data


genotype_filter    R Documentation

Filter SNP data

Description

This function applies different filters to SNP data in order to generate a set of markers suitable for haplotype analysis. Although the function can be called separately for filtering purposes, it was designed specifically for the needs of package HaplotypeMiner and might not suit the needs of general users. See section Details for a discussion of the mandatory and optional filters applied by this function.

Usage

genotype_filter(snp_data, chrom, center_pos, max_distance_to_gene = 10^9,
  max_missing_threshold = NULL, max_het_threshold = NULL,
  min_alt_threshold = NULL, min_allele_count = NULL, verbose = TRUE)

Arguments

snp_data

A list of data pertaining to SNP markers and having at least elements Markers and Genotypes. A more detailed account of the contents of this object can be found in the documentation for functions read_hapmap and read_vcf.

chrom

A character of length one. The name of the chromosome for which markers should be kept.

center_pos

A numeric of length one. The central position (in base pairs) of the gene of interest.

max_distance_to_gene

A numeric of length one. The maximum distance (in base pairs) between center_pos and the position of a marker for this marker to be kept in the analysis.

max_missing_threshold

A numeric of length one between 0 and 1, or NULL. If NULL (default), no such filter is applied. Otherwise, all markers with a missing data rate over this value are removed.

max_het_threshold

A numeric of length one between 0 and 1, or NULL. If NULL (default), no such filter is applied. Otherwise, all markers with a heterozygosity rate over this value are removed.

min_alt_threshold

A numeric of length one between 0 and 1, or NULL. If NULL (default), no such filter is applied. Otherwise, all markers with a minor allele frequency below this value are removed.

min_allele_count

A positive numeric value, or NULL. If NULL (default), no such filter is applied. Otherwise, all markers with a minor allele count below this threshold are removed.

verbose

Logical. Should information regarding the filtering process be printed to screen? Defaults to TRUE.

Details

genotype_filter automatically applies three filters to the snp_data object provided as an argument. These filters are applied both when the function is used separately and when it is used internally inside a call to haplo_selection. The three filters are:

  • chrom Only markers lying on the chromosome of interest are kept for further analysis.

  • distance Only markers less than max_distance_to_gene base pairs away from the central position of interest (usually the center of the gene for which haplotypes are being generated) are kept for further analysis. The value of this distance is 1 Gb by default, which should mean that all markers on a given chromosome are kept by default (a more sensible default value could be used).

  • multiallelism Markers that are not biallelic (i.e. either triallelic or tetraallelic) are automatically removed from the dataset as package HaplotypeMiner does not know how to handle these markers yet.
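The three mandatory filters can be illustrated with a minimal base R sketch on a toy marker table. This is not the package's implementation: the column names (name, chrom, pos, alleles) and the "/"-separated allele coding are assumptions made for the example, and the actual layout of snp_data$Markers is described in the documentation of read_hapmap and read_vcf.

```r
# Toy marker table; the column names and allele coding are hypothetical.
markers <- data.frame(
  name    = c("m1", "m2", "m3", "m4", "m5"),
  chrom   = c("Chr01", "Chr01", "Chr01", "Chr02", "Chr01"),
  pos     = c(1000, 5000, 90000, 1200, 6000),
  alleles = c("A/G", "A/C/T", "C/T", "G/T", "G/C"),
  stringsAsFactors = FALSE
)

chrom <- "Chr01"
center_pos <- 4000
max_distance_to_gene <- 10000

# 1. chrom: keep markers on the chromosome of interest
on_chrom <- markers$chrom == chrom

# 2. distance: keep markers less than max_distance_to_gene bp from center_pos
close_enough <- abs(markers$pos - center_pos) < max_distance_to_gene

# 3. multiallelism: keep markers listing exactly two alleles
n_alleles <- lengths(strsplit(markers$alleles, "/", fixed = TRUE))
biallelic <- n_alleles == 2

kept <- markers[on_chrom & close_enough & biallelic, ]
print(kept$name)
```

In this toy data, m2 is dropped for being triallelic, m3 for lying too far from center_pos, and m4 for being on another chromosome.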

Other filters are optional and are not applied by default, although it is recommended that users apply these filters either prior to the analysis, externally to package HaplotypeMiner, or as part of the analysis pipeline implemented by function haplo_selection. These four filters are:

  • Missing data Markers harbouring a missing data rate higher than max_missing_threshold can be selectively removed during the analysis.

  • Heterozygosity Markers harbouring a heterozygosity rate higher than max_het_threshold can be selectively removed during the analysis. This may not be relevant for species found in the wild, but is relevant e.g. for crop species which are expected to be homozygous at all loci.

  • Minor allele frequency Markers harbouring a minor allele frequency (MAF) lower than min_alt_threshold can be selectively removed from the analysis.

  • Minor allele count Markers harbouring a minor allele count (MAC) lower than min_allele_count can be selectively removed from the analysis.
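As a rough illustration of what the four optional filters measure, the following base R sketch computes each statistic on a toy genotype matrix. Several things here are assumptions rather than the package's behaviour: genotypes are coded as alternate allele counts (0, 1, 2, with NA for missing data), markers are stored in columns, and heterozygosity and allele frequencies are computed over non-missing calls only. How genotype_filter computes these quantities internally is not documented here.

```r
# Toy genotypes: rows are individuals, columns are markers (assumed layout).
# Values are alternate allele counts; NA marks a missing call.
geno <- matrix(
  c(0, 0, 2, 2, NA,     # m1
    0, 1, 1, 2, 2,      # m2
    0, 0, 0, 0, 2,      # m3
    NA, NA, NA, 0, 2),  # m4
  nrow = 5,
  dimnames = list(paste0("ind", 1:5), paste0("m", 1:4))
)

# Per-marker statistics
missing_rate <- colMeans(is.na(geno))
het_rate     <- colMeans(geno == 1, na.rm = TRUE)
alt_count    <- colSums(geno, na.rm = TRUE)
n_alleles    <- 2 * colSums(!is.na(geno))   # called alleles per marker
mac          <- pmin(alt_count, n_alleles - alt_count)
maf          <- mac / n_alleles

# Example thresholds, named after the arguments of genotype_filter
max_missing_threshold <- 0.5
max_het_threshold     <- 0.3
min_alt_threshold     <- 0.25
min_allele_count      <- 3

keep <- missing_rate <= max_missing_threshold &
  het_rate <= max_het_threshold &
  maf >= min_alt_threshold &
  mac >= min_allele_count

print(round(rbind(missing_rate, het_rate, maf, mac), 2))
print(names(which(keep)))
```

With these thresholds, m2 fails the heterozygosity filter, m3 fails both the MAF and MAC filters, and m4 fails the missing data and MAC filters, so only m1 is kept.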

Value

A list containing 3 or 4 elements depending on the snp_data object used as input:

  • Genotypes An object of class snpMatrix containing the genotypes corresponding to the various markers for every individual. This is essentially a subset of snp_data$Genotypes that contains only markers that have been selected.

  • Markers A data.frame containing metadata relative to the genotyped markers. This is essentially a subset of snp_data$Markers that contains only markers that have been selected.

  • Filters A list of eight integer vectors indicating how many markers remained following the different filtering steps: (1) the total number of markers, (2) the number of markers located on the chromosome of interest, (3) the number of markers located close enough to the central gene position, (4) the number of biallelic markers, (5) the number of markers passing the missing data filter, (6) the number of markers passing the heterozygosity filter, (7) the number of markers passing the MAF filter, and (8) the number of markers passing the MAC filter. Each number is the count of markers remaining after that step and all preceding steps, not the absolute number of markers passing that filter on its own.

  • VCF If a VCF element was present in the initial snp_data object, this element is a subset of it containing only the markers remaining following filtering.
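The distinction between cumulative and absolute counts in the Filters element can be made concrete with a small base R sketch. The pass/fail flags below are made up for six hypothetical markers and only the first three steps are shown. The result is stored in a named integer vector purely for illustration, which is not necessarily the structure returned by genotype_filter.

```r
# Hypothetical pass/fail flags for six markers at three filtering steps
passes <- list(
  chrom     = c(TRUE, TRUE,  TRUE, TRUE,  FALSE, TRUE),
  distance  = c(TRUE, TRUE,  TRUE, FALSE, TRUE,  TRUE),
  biallelic = c(TRUE, FALSE, TRUE, TRUE,  TRUE,  TRUE)
)

# Cumulative counts: a marker must pass the current step and all earlier ones
remaining <- rep(TRUE, 6)
counts <- c(total = length(remaining))
for (step in names(passes)) {
  remaining <- remaining & passes[[step]]
  counts[step] <- sum(remaining)
}
print(counts)

# Absolute counts, for contrast: each filter considered on its own
absolute <- vapply(passes, sum, integer(1))
print(absolute)
```

Each filter taken alone lets five markers through, whereas the cumulative counts decrease from six to three as the steps are chained.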

Examples

NULL


malemay/HaplotypeMiner documentation built on Feb. 6, 2024, 3:29 a.m.