invClust: 'invClust' Class and Methods


View source: R/invClust.R

Description

Mixture model fitting with Hardy Weinberg Equilibrium and population stratification to infer haplotype (inversion) alleles.

Usage

invClust(roi, wh = 1, geno, annot, SNPtagg="n", SNPsel=1:ncol(geno), method=1, dim = 1, pc = 0, ngroups = 1, ...)

Arguments

roi

text file or data.frame with Region of Interest (ROI) information. Four columns are required: chr, LBP, RBP and reg, giving the chromosome, the left break point, the right break point and a character label "reg" that identifies the inversion

wh

which ROI (row of roi) to be considered for computation.

geno

genotypes as a SnpMatrix object from snpStats

annot

SNP annotation in PLINK .map format, or read.plink()$map from snpStats. The columns "chromosome", "names" and "position" are required and must use exactly these names.

SNPtagg

set SNPtagg="y" to use the tag SNPs in roi for tagging the haplotype groups

SNPsel

vector of SNPs to be selected for computation

method

method=1 performs the EM algorithm for three genotypes; method=2 performs a clustering within the genotypes for an additional third haplotype

dim

either 1 or 2, indicating the number of MDS components to be used

pc

if population clustering is to be performed, first component of a genome-wide PCA of geno

ngroups

maximum number of subpopulations to be considered

...

control arguments for the EM algorithm: it (it=1000), the maximum number of iterations, and tol (tol=10e-5), the convergence tolerance
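The layout expected for roi and annot can be sketched in base R as follows. All coordinates, SNP names and the "inv1" label are made up for illustration, and the filtering step only shows how the columns relate to each other; it is not necessarily how invClust selects SNPs internally.

```r
# Hypothetical region of interest: one row per inversion, four required columns.
roi <- data.frame(chr = 8, LBP = 7900000, RBP = 11800000, reg = "inv1",
                  stringsAsFactors = FALSE)

# Hypothetical SNP annotation with the three required column names.
annot <- data.frame(chromosome = c(8, 8, 8),
                    names      = c("rs1", "rs2", "rs3"),
                    position   = c(7950000, 9000000, 12000000),
                    stringsAsFactors = FALSE)

# SNPs that fall between the break points of the ROI chosen by wh.
wh <- 1
inside <- annot$chromosome == roi$chr[wh] &
  annot$position >= roi$LBP[wh] &
  annot$position <= roi$RBP[wh]
annot$names[inside]
```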

Details

invClust computes the biallelic haplotypes in Hardy Weinberg Equilibrium (with the possibility of clustering by geographical subpopulations) that may underlie an inversion event. It fits a mixture model with an expectation maximization routine, controlled only by convergence criteria. The initial conditions are general enough for a wide range of cases. Clustering is performed in 1 or 2 dimensions of multidimensional scaling (argument dim) and, if geographical subpopulation is considered, the first component of a genome-wide PCA (argument pc). In this last case, a visualization of the PCA can inform on the suitable number of groups to be considered (argument ngroups).
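The idea of a mixture model constrained by Hardy Weinberg Equilibrium can be illustrated with a small base-R sketch. This is a toy one-dimensional version written for this page, not the invClust implementation: the function name em_hwe, the Gaussian components, the starting values and the simulated data are all assumptions. The arguments it and tol only mirror the control arguments described above.

```r
# Toy EM for a 1-D three-component Gaussian mixture whose mixing weights
# follow HWE: ((1-p)^2, 2p(1-p), p^2) for genotypes NN, NI, II.
# Illustrative sketch only; not the invClust code.
em_hwe <- function(y, mu, sigma = rep(sd(y) / 3, 3), p = 0.5,
                   it = 1000, tol = 1e-4) {
  loglik_old <- -Inf
  for (iter in seq_len(it)) {
    w <- c((1 - p)^2, 2 * p * (1 - p), p^2)
    # E-step: posterior probability of each genotype for each subject
    dens <- sapply(1:3, function(k) w[k] * dnorm(y, mu[k], sigma[k]))
    total <- rowSums(dens)
    r <- dens / total
    loglik <- sum(log(total))
    # M-step: allele frequency by weighted allele counting, then
    # weighted means and standard deviations per component
    nk <- colSums(r)
    p <- unname((nk[2] + 2 * nk[3]) / (2 * length(y)))
    mu <- colSums(r * y) / nk
    sigma <- sqrt(colSums(r * outer(y, mu, "-")^2) / nk)
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  colnames(r) <- c("NN", "NI", "II")
  list(p = p, mu = mu, sigma = sigma, prob = r,
       loglik = loglik, iterations = iter)
}

# Simulated first MDS component with inverted-allele frequency 0.4
set.seed(1)
g <- sample(0:2, 300, replace = TRUE, prob = c(0.36, 0.48, 0.16))
y <- rnorm(300, mean = c(-2, 0, 2)[g + 1], sd = 0.3)
fit <- em_hwe(y, mu = c(min(y), median(y), max(y)))
fit$p
```

Estimating a single allele frequency p, rather than three free mixing weights, is what ties the cluster sizes to Hardy Weinberg proportions in this sketch.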

Each subject in the sample is assigned a probability for each genotype (NN, NI, II), which can be recovered with x["genotypes"], where x is of class invClust (e.g. the result of an invClust call). The most probable genotypes can be extracted with getGenotypes(x). Similarly, if subpopulation classification is considered, the probability of group membership is recovered with x["groups"].
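What "most probable genotype" means can be shown on a small probability matrix in base R. The matrix below is invented for illustration and only assumes one row per subject and one column per genotype; the code is a sketch of the idea and is not claimed to be how getGenotypes is implemented.

```r
# Hypothetical genotype probabilities for three subjects.
prob <- matrix(c(0.90, 0.08, 0.02,
                 0.10, 0.85, 0.05,
                 0.01, 0.09, 0.90),
               ncol = 3, byrow = TRUE,
               dimnames = list(c("id1", "id2", "id3"), c("NN", "NI", "II")))

# Pick, for each subject, the genotype with the highest probability.
calls <- colnames(prob)[max.col(prob, ties.method = "first")]
names(calls) <- rownames(prob)

# The probability of the chosen genotype gives a per-subject certainty.
certainty <- apply(prob, 1, max)
calls
```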

Plots are also implemented for this class: plot(x) will display the clustered data on the fitted distribution, according to the dimensions used. When subpopulation classification is included, the marginals can be selected through the plot argument wh=c("yy","xy"). wh="like" plots the likelihood against the number of iterations, for dim=1 only.

A useful quantity is the quality score, getQuality(x), which is computed from the overlap integral of the cluster components: a value of 1 means no overlap, while 0 means complete overlap.
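One way to read such a score is sketched below for two one-dimensional normal components. The exact formula used by getQuality is not given here, so this example assumes one plausible definition, 1 minus the integral of the pointwise minimum of the two densities; the function name overlap_score is hypothetical.

```r
# Assumed definition: score = 1 - integral of min(f1, f2).
# Illustrative only; getQuality may use a different formula.
overlap_score <- function(m1, s1, m2, s2) {
  lower <- min(m1, m2) - 8 * max(s1, s2)
  upper <- max(m1, m2) + 8 * max(s1, s2)
  shared <- integrate(function(x) pmin(dnorm(x, m1, s1), dnorm(x, m2, s2)),
                      lower, upper)$value
  1 - shared
}

overlap_score(0, 1, 0, 1)    # identical components: score close to 0
overlap_score(0, 1, 10, 1)   # well separated components: score close to 1
```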

Value

EMestimate

List with fitted parameters

datin

List with the data used to fit the model: x for the first PCA component and y for the MDS components

Author(s)

Alejandro Caceres

Examples

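library(invClust)
# load the example data
data(geno)
# fit the mixture model for the first ROI using one MDS component
inv <- invClust(roi=roi, wh=1, geno=geno, annot=annot, dim=1)
plot(inv)
# genotype probabilities for the first subjects
head(inv["genotypes"])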

isglobal-brge/invClust documentation built on May 19, 2020, 5:19 a.m.