runFindhap: Genotype imputation via findhap
In timflutre/rutilstimflutre: Timothee Flutre's personal R code

runFindhap

R Documentation

Genotype imputation via findhap

Description

Impute SNP genotypes via findhap (VanRaden et al, 2013).

Usage

runFindhap(
  X = NULL,
  chips = NULL,
  snp.coords = NULL,
  vcf.file = NULL,
  yieldSize = 10000,
  ped = NULL,
  work.dir = getwd(),
  task.id = "findhap",
  params.findhap = NULL,
  nb.threads = 1,
  clean = "none",
  verbose = 1
)

Arguments

`X`	matrix of bi-allelic SNP genotypes encoded in number of copies of the 2nd allele, i.e. as allele doses in 0,1,2, with genotypes in rows and SNPs in columns; missing values should be encoded as NA; if not NULL, will be used in priority even if `vcf.file` is not NULL
`chips`	if several microarrays were used, provide a vector with chip numbers which names are genotype identifiers (same as in `X`)
`snp.coords`	data.frame with SNP identifiers as row names, and two compulsory columns in that order, chromosome identifiers and coordinate/position identifiers; the maximum length of SNP identifiers is 50 characters; chromosome identifiers should be numeric; other optional column(s) contain the SNP order on the microarray(s) (maximum 10); compulsory if `X` is specified
`vcf.file`	path to the VCF file (if the bgzip index doesn't exist in the same directory, it will be created); used only if `X=NULL`
`yieldSize`	number of records to yield each time the file is read from (see `?TabixFile`)
`ped`	data frame of pedigree with four columns in that order, genotype identifiers (same as in `X`), parent1 identifiers (considered as father/sire), parent2 identifiers (considered as mother/dam), and sex (as M or F)
`work.dir`	directory in which the input and output files will be saved
`task.id`	identifier of the task (used in temporary and output file names)
`params.findhap`	list of additional parameters to pass to findhap
`nb.threads`	number of threads to pass to findhap
`clean`	remove files: none, some (temporary only), all (temporary and results)
`verbose`	verbosity level (0/1)

Value

list

Author(s)

Timothee Flutre

Examples

## Not run: ## simulate haplotypes and genotypes in a single population
set.seed(1859)
nb.inds <- 5*10^2
nb.chrs <- 1
Ne <- 10^4
chrom.len <- 1*10^6
mu <- 10^(-8)
c.rec <- 10^(-7)
genomes <- simulCoalescent(nb.inds=nb.inds,
                           nb.reps=nb.chrs,
                           pop.mut.rate=4 * Ne * mu * chrom.len,
                           pop.recomb.rate=4 * Ne * c.rec * chrom.len,
                           chrom.len=chrom.len,
                           get.alleles=TRUE)
nb.snps <- nrow(genomes$snp.coords)
plotHaplosMatrix(genomes$haplos[[1]]) # quick view of the amount of LD

## discard some genotypes according to a "microarray" design:
## some inds with high density of genotyped SNPs, and the others with
## low density of SNPs, these being on both microarrays
ind.names <- rownames(genomes$genos)
inds.high <- sample(x=ind.names, size=floor(0.4 * nb.inds))
inds.low <- setdiff(ind.names, inds.high)
snp.names <- colnames(genomes$genos)
mafs <- estimSnpMaf(X=genomes$genos)
length(idx <- which(mafs >= 0.05))
snps.high <- names(idx)
snps.high.only <- sample(x=snps.high, size=floor(0.7*length(snps.high)))
dim(X <- genomes$genos[ind.names, snps.high])
X.na <- X
X.na[inds.low, snps.high.only] <- NA
sum(is.na(X.na)) / length(X.na)
plotGridMissGenos(X=X.na)

## perform imputation
snp.coords <- genomes$snp.coords[colnames(X.na),]
snp.coords$chr <- as.numeric(sub("chr", "", snp.coords$chr))
out.fh <- runFindhap(X=X.na, snp.coords=snp.coords, clean="all")
X.impfh <- outfh$genos.imp

## assess imputation accuracy
genomes$haplos[[1]][1:4, 1:7]
genomes$genos[1:2, 1:7]
X.na[1:2, 1:6]
X.impfh[1:2, 1:6]
100 * sum(X.impfh != X) / sum(is.na(X.na))

## End(Not run)

timflutre/rutilstimflutre documentation built on Feb. 12, 2025, 11:35 p.m.