replace_miss_genos: Replace missing genotypes

View source: R/replace_miss_genos.R

replace_miss_genos    R Documentation

Replace missing genotypes

Description

For each locus, missing genotypes are replaced with the most common genotype at that locus. Replacement can be done across all sampled individuals or within each population. Loci must be biallelic.
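As a rough illustration of the idea, the sketch below fills missing genotypes with the per-locus mode on a toy data table. It is not the package's implementation: the helper `most_common` and the column `GT.FILL` are hypothetical names, and ties are broken by taking the first genotype in sorted order, which may differ from what replace_miss_genos does.

```r
library(data.table)

# Hypothetical helper (not part of genomalicious):
# returns the most common non-missing value of x.
most_common <- function(x) {
  tab <- table(x, useNA = "no")
  names(tab)[which.max(tab)]
}

# Toy long-format data: one locus, five individuals, one missing genotype.
toy <- data.table(
  SAMPLE = paste0("Ind", 1:5),
  LOCUS  = "Locus1",
  GT     = c("0/0", "0/1", "0/0", NA, "1/1")
)

# Fill missing genotypes with the most common genotype at each locus.
toy[, GT.FILL := ifelse(is.na(GT), most_common(GT), GT), by = LOCUS]
```

Here the missing genotype for Ind4 is filled with '0/0', the most frequent observed genotype at Locus1.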

Usage

replace_miss_genos(
  dat,
  sampCol = "SAMPLE",
  locusCol = "LOCUS",
  genoCol = "GT",
  popCol = NULL
)

Arguments

dat

Data table: A long-format data table, e.g. as imported with vcf2DT. Genotypes can be coded as '/'-separated characters (e.g. '0/0', '0/1', '1/1') or as integer counts of the Alt allele (e.g. 0, 1, 2). Must contain the following columns:

  1. The sampled individuals (see param sampCol).

  2. The locus ID (see param locusCol).

  3. The genotype column (see param genoCol).

sampCol

Character: The column name with the sampled individual information. Default is 'SAMPLE'.

locusCol

Character: The column name with the locus information. Default is 'LOCUS'.

genoCol

Character: The column name with the genotype information. Default is 'GT'.

popCol

Character: An optional argument. The column name with the population information. Default is NULL. If specified, genotype replacement at each locus is done per population, not across all sampled individuals.
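To show why the choice of popCol can matter, the sketch below contrasts grouping by locus only with grouping by locus and population on toy integer genotypes. It is an illustration under stated assumptions, not the function's actual code: `mode_int`, `GT.INDS` and `GT.POPS` are hypothetical names chosen for this example.

```r
library(data.table)

# Hypothetical helper (not part of genomalicious):
# most common non-missing integer genotype.
mode_int <- function(x) {
  tab <- table(x)
  as.integer(names(tab)[which.max(tab)])
}

# Toy data: one locus, two populations with different common genotypes.
toy <- data.table(
  SAMPLE = paste0("Ind", 1:8),
  POP    = rep(c("PopA", "PopB"), each = 4),
  LOCUS  = "Locus1",
  GT     = c(0L, 0L, 1L, NA, 2L, 2L, NA, 2L)
)

# Analogous to popCol = NULL: the mode is taken across all individuals.
toy[, GT.INDS := ifelse(is.na(GT), mode_int(GT), GT), by = LOCUS]

# Analogous to popCol = 'POP': the mode is taken within each population.
toy[, GT.POPS := ifelse(is.na(GT), mode_int(GT), GT), by = .(LOCUS, POP)]
```

In this toy case the missing genotype in PopA (Ind4) becomes 2 when the mode is taken across all individuals, but 0 when it is taken within PopA.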

Details

NOTE: It is recommended that missing genotypes are imputed with methods that use linkage information and genotype likelihoods. However, if you need a quick-and-dirty approach, this function might be useful for preliminary analyses, or when the proportion of missing data is very low.

If genotypes are coded as characters, missing genotypes should be coded as NA or './.'. If genotypes are coded as integers, missing genotypes should be coded as NA. Whether the most common genotype is estimated across all individuals or within each population depends on whether popCol is specified.
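The two missing-data codes for character genotypes can be handled as in the sketch below, which flags both NA and './.' as missing, recodes './.' to NA, and converts the result to Alt allele counts. This is only a sketch: the column names `IS.MISS` and `GT.INT` are hypothetical, and summing the allele indices assumes biallelic loci with alleles coded 0 and 1.

```r
library(data.table)

# Toy character genotypes with both missing-data codes.
toy <- data.table(
  SAMPLE = paste0("Ind", 1:4),
  LOCUS  = "Locus1",
  GT     = c("0/0", "./.", NA, "0/1")
)

# Flag genotypes that are missing under either coding.
toy[, IS.MISS := is.na(GT) | GT == "./."]

# Recode './.' to NA so there is a single missing-data code.
toy[GT == "./.", GT := NA_character_]

# Convert to Alt allele counts by summing the allele indices
# (assumes biallelic loci; NA stays NA).
toy[, GT.INT := sapply(strsplit(GT, "/"), function(a) sum(as.integer(a)))]
```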

Examples

library(genomalicious)

data(data_Genos)

D <- data_Genos %>% copy

# Randomly set 10% of genotypes to missing, then rename the column
D[sample(1:nrow(D), round(0.1*nrow(D)), FALSE), GT:=NA] %>%
 setnames(., 'GT', 'GT.MISS')

# Replace across individuals
D.rep.inds <- replace_miss_genos(
   dat=D, sampCol='SAMPLE', locusCol='LOCUS', genoCol='GT.MISS'
) %>%
   setnames(., 'GT', 'GT.INDS')

# Replace within populations
D.rep.pops <- replace_miss_genos(
   dat=D, sampCol='SAMPLE', locusCol='LOCUS', genoCol='GT.MISS', popCol='POP'
) %>%
   setnames(., 'GT', 'GT.POPS')

# Tabulate comparisons between methods
compReplace <- left_join(
   data_Genos[, c('LOCUS','SAMPLE','POP','GT')],
   D[, c('LOCUS','SAMPLE','POP','GT.MISS')]
) %>%
   .[is.na(GT.MISS), !'GT.MISS'] %>%
   left_join(., D.rep.inds[,c('LOCUS','SAMPLE','POP','GT.INDS')]) %>%
   left_join(., D.rep.pops[,c('LOCUS','SAMPLE','POP','GT.POPS')])

# Number of correct matches is typically slightly higher when using the
# most common genotype within populations (results vary with the random
# sample of missing genotypes)
compReplace[GT==GT.INDS] %>% nrow
compReplace[GT==GT.POPS] %>% nrow



j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.