runFastphase: Genotype imputation via fastPHASE
In timflutre/rutilstimflutre: Timothee Flutre's personal R code

runFastphase

R Documentation

Genotype imputation via fastPHASE

Description

Impute SNP genotypes via fastPHASE (Scheet and Stephens, 2006).

Usage

runFastphase(
  X,
  snp.coords,
  alleles,
  out.dir = getwd(),
  task.id = "fastphase",
  nb.starts = 20,
  nb.iters = 25,
  nb.samp.haplos = 50,
  estim.haplos = TRUE,
  nb.clusters = 10,
  seed = 1859,
  clean = FALSE,
  verbose = 1
)

Arguments

`X`	matrix of bi-allelic SNP genotypes encoded in number of copies of the 2nd allele, i.e. as allele doses in 0,1,2, with genotypes in rows and SNPs in columns
`snp.coords`	data.frame with SNP identifiers as row names, and two columns, "chr" and "coord" or "pos"
`alleles`	data.frame with SNPs in rows (names as row names) and alleles in columns, named "minor" and "major", in whatever order as long as the second column corresponds to the allele which number of copies is counted at each SNP in `X`
`out.dir`	directory in which the output files will be saved
`task.id`	identifier of the task (used in temporary and output file names)
`nb.starts`	number of random starts of the EM algorithm
`nb.iters`	number of iterations of the EM algorithm
`nb.samp.haplos`	number of haplotypes sampled from the posterior distribution obtained from a particular random start of the EM algorithm
`estim.haplos`	estimate haplotypes by minimizing genotype error
`nb.clusters`	number of clusters
`seed`	seed for the pseudo-random number generator for fastPHASE
`clean`	remove files
`verbose`	verbosity level (0/1)

Value

list with matrix of genotypes before imputation, matrix of haplotypes (2 per genotype, in rows) after imputation, stdout/stderr

Author(s)

Timothee Flutre

Examples

## Not run: ## simulate haplotypes and genotypes in a single population
set.seed(1859)
nb.inds <- 5*10^2
nb.chrs <- 1
Ne <- 10^4
chrom.len <- 1*10^6
mu <- 10^(-8)
c.rec <- 10^(-7)
genomes <- simulCoalescent(nb.inds=nb.inds,
                           nb.reps=nb.chrs,
                           pop.mut.rate=4 * Ne * mu * chrom.len,
                           pop.recomb.rate=4 * Ne * c.rec * chrom.len,
                           chrom.len=chrom.len,
                           get.alleles=TRUE)
nb.snps <- nrow(genomes$snp.coords)
plotHaplosMatrix(genomes$haplos[[1]]) # quick view of the amount of LD

## discard some genotypes according to a "microarray" design:
## some inds with high density of genotyped SNPs, and the others with
## low density of SNPs, these being on both microarrays
ind.names <- rownames(genomes$genos)
inds.high <- sample(x=ind.names, size=floor(0.4 * nb.inds))
inds.low <- setdiff(ind.names, inds.high)
snp.names <- colnames(genomes$genos)
mafs <- estimSnpMaf(X=genomes$genos)
length(idx <- which(mafs >= 0.05))
snps.high <- names(idx)
snps.high.only <- sample(x=snps.high, size=floor(0.7*length(snps.high)))
dim(X <- genomes$genos[ind.names, snps.high])
X.na <- X
X.na[inds.low, snps.high.only] <- NA
sum(is.na(X.na)) / length(X.na)
plotGridMissGenos(X=X.na)

## perform imputation
snp.coords <- genomes$snp.coords[colnames(X.na),]
alleles <- genomes$alleles[colnames(X.na),]
out.fP <- runFastphase(X=X.na, snp.coords=snp.coords, alleles=alleles,
                       nb.starts=3, clean=TRUE)
out.fP$stdouterr
tmp <- haplosAlleles2num(haplos=out.fP$haplos.switch, alleles=alleles)
X.imp.fP <- segSites2allDoses(seg.sites=list(tmp), ind.ids=ind.names)

## assess imputation accuracy
genomes$haplos[[1]][1:4, 1:7]
genomes$genos[1:2, 1:7]
X.na[1:2, 1:6]
head(alleles)
out.fP$genos[1:4, 1:6]
out.fP$haplos.switch[1:4, 1:6]
X.imp.fP[1:2, 1:6]
100 * sum(X.imp.fP != X) / sum(is.na(X.na))

## End(Not run)

timflutre/rutilstimflutre documentation built on Feb. 12, 2025, 11:35 p.m.