runFastphase: Genotype imputation via fastPHASE

View source: R/fastPHASE.R

runFastphaseR Documentation

Genotype imputation via fastPHASE

Description

Impute SNP genotypes via fastPHASE (Scheet and Stephens, 2006).

Usage

runFastphase(
  X,
  snp.coords,
  alleles,
  out.dir = getwd(),
  task.id = "fastphase",
  nb.starts = 20,
  nb.iters = 25,
  nb.samp.haplos = 50,
  estim.haplos = TRUE,
  nb.clusters = 10,
  seed = 1859,
  clean = FALSE,
  verbose = 1
)

Arguments

X

matrix of bi-allelic SNP genotypes encoded in number of copies of the 2nd allele, i.e. as allele doses in 0,1,2, with genotypes in rows and SNPs in columns

snp.coords

data.frame with SNP identifiers as row names, and two columns, "chr" and "coord" or "pos"

alleles

data.frame with SNPs in rows (names as row names) and alleles in columns, named "minor" and "major", in whatever order as long as the second column corresponds to the allele which number of copies is counted at each SNP in X

out.dir

directory in which the output files will be saved

task.id

identifier of the task (used in temporary and output file names)

nb.starts

number of random starts of the EM algorithm

nb.iters

number of iterations of the EM algorithm

nb.samp.haplos

number of haplotypes sampled from the posterior distribution obtained from a particular random start of the EM algorithm

estim.haplos

estimate haplotypes by minimizing genotype error

nb.clusters

number of clusters

seed

seed for the pseudo-random number generator for fastPHASE

clean

remove files

verbose

verbosity level (0/1)

Value

list with matrix of genotypes before imputation, matrix of haplotypes (2 per genotype, in rows) after imputation, stdout/stderr

Author(s)

Timothee Flutre

See Also

writeInputsFastphase, readGenosFastphase, readHaplosFastphase, runFimpute

Examples

## Not run: ## simulate haplotypes and genotypes in a single population
set.seed(1859)
nb.inds <- 5*10^2
nb.chrs <- 1
Ne <- 10^4
chrom.len <- 1*10^6
mu <- 10^(-8)
c.rec <- 10^(-7)
genomes <- simulCoalescent(nb.inds=nb.inds,
                           nb.reps=nb.chrs,
                           pop.mut.rate=4 * Ne * mu * chrom.len,
                           pop.recomb.rate=4 * Ne * c.rec * chrom.len,
                           chrom.len=chrom.len,
                           get.alleles=TRUE)
nb.snps <- nrow(genomes$snp.coords)
plotHaplosMatrix(genomes$haplos[[1]]) # quick view of the amount of LD

## discard some genotypes according to a "microarray" design:
## some inds with high density of genotyped SNPs, and the others with
## low density of SNPs, these being on both microarrays
ind.names <- rownames(genomes$genos)
inds.high <- sample(x=ind.names, size=floor(0.4 * nb.inds))
inds.low <- setdiff(ind.names, inds.high)
snp.names <- colnames(genomes$genos)
mafs <- estimSnpMaf(X=genomes$genos)
length(idx <- which(mafs >= 0.05))
snps.high <- names(idx)
snps.high.only <- sample(x=snps.high, size=floor(0.7*length(snps.high)))
dim(X <- genomes$genos[ind.names, snps.high])
X.na <- X
X.na[inds.low, snps.high.only] <- NA
sum(is.na(X.na)) / length(X.na)
plotGridMissGenos(X=X.na)

## perform imputation
snp.coords <- genomes$snp.coords[colnames(X.na),]
alleles <- genomes$alleles[colnames(X.na),]
out.fP <- runFastphase(X=X.na, snp.coords=snp.coords, alleles=alleles,
                       nb.starts=3, clean=TRUE)
out.fP$stdouterr
tmp <- haplosAlleles2num(haplos=out.fP$haplos.switch, alleles=alleles)
X.imp.fP <- segSites2allDoses(seg.sites=list(tmp), ind.ids=ind.names)

## assess imputation accuracy
genomes$haplos[[1]][1:4, 1:7]
genomes$genos[1:2, 1:7]
X.na[1:2, 1:6]
head(alleles)
out.fP$genos[1:4, 1:6]
out.fP$haplos.switch[1:4, 1:6]
X.imp.fP[1:2, 1:6]
100 * sum(X.imp.fP != X) / sum(is.na(X.na))

## End(Not run)

timflutre/rutilstimflutre documentation built on Feb. 7, 2024, 8:17 a.m.