hapFabia: IBD segment extraction by FABIA
In hapFabia: hapFabia: Identification of very short segments of identity by descent (IBD) characterized by rare variants in large sequencing data

Description Usage Arguments Details Value Author(s) References See Also Examples

hapFabia: R implementation of the hapFabia method.

hapFabia extracts short IBD segments tagged by rare variants from phased or unphased genotypes. hapFabia is designed for rare variants in very large sequencing data. The method is based on FABIA biclustering and utilizes the package fabia.

hapFabia(fileName,prefixPath="",sparseMatrixPostfix="_mat",
   annotPostfix="_annot.txt",individualsPostfix="_individuals.txt",
   labelsA=NULL,pRange="",individuals=0,lowerBP=0,upperBP=0.05,
   p=10,iter=40,quant=0.01,eps=1e-5,alpha=0.03,cyc=50,non_negative=1,
   write_file=0,norm=0,lap=100.0,IBDsegmentLength=50,Lt = 0.1,
   Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.03,
   haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
   simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

`fileName`	name of the file that contains the genotype matrix in sparse format.
`prefixPath`	path to the genotype file.
`sparseMatrixPostfix`	postfix string for the sparse matrix.
`annotPostfix`	postfix string for the SNV annotation file.
`individualsPostfix`	postfix string for the file containing the names of the individuals.
`labelsA`	annotation of the individuals as matrix individuals x 4 (individual ID, subpopulation, population, genotyping platform).
`pRange`	indicates which DNA interval is processed.
`individuals`	vector of individuals that are included into the analysis; default = 0 (all individuals).
`lowerBP`	lower bound on minor allele frequencies (MAF); however at least two occurrences are required to remove private SNVs.
`upperBP`	upper bound on minor allele frequencies (MAF) to extract rare variants.
`p`	number of biclusters per fabia iteration.
`iter`	number of fabia iterations.
`quant`	percentage of fabia loadings L that are removed after each iteration.
`eps`	lower bound on fabia variational parameter lapla; default 1e-5.
`alpha`	fabia sparseness of the loadings; default = 0.03.
`cyc`	number of cycles per fabia iteration; default 50.
`non_negative`	non-negative fabia factors and loadings if non_negative = 1; default = 1 (yes).
`write_file`	fabia results are written to files (L in sparse format), default = 0 (not written).
`norm`	fabia data normalization; default 1 (no normalization).
`lap`	minimal value of fabia's variational parameter; default 100.0.
`IBDsegmentLength`	expected typical IBD segment length in kbp.
`Lt`	percentage of largest fabia L values to consider for IBD segment extraction.
`Zt`	percentage of largest fabia Z values to consider for IBD segment extraction.
`thresCount`	p-value of random histogram hit; default 1e-5.
`mintagSNVsFactor`	percentage of the histogram tagSNVs count threshold (`mintagSNVs` in `extractIBDsegments`); used to define minimal overlap of individual intervals in an IBD segment; default 3/4.
`pMAF`	averaged and corrected (for non-uniform distributions) minor allele frequency.
`haplotypes`	`haplotypes = TRUE` indicates phased genotypes that is two chromosomes per individual otherwise unphased genotypes.
`cut`	cutoff for merging IBD segments after a hierarchical clustering; default 0.8.
`procMinIndivids`	Percentage of cluster individuals a tagSNV must tag to be considered as tagSNV for the IBD segment.
`thresPrune`	Threshold for pruning border tagSNVs based on an exponential distribution where border tagSNVs with large distances to the next tagSNV are pruned.
`simv`	Similarity measure for merging clusters: `"minD"` (percentage of smaller explained by larger set), `"jaccard"` (Jaccard index), `"dice"` (Dice index), or `"maxD"`; default `"minD"`.
`minTagSNVs`	Minimum matching tagSNVs for cluster similarity; otherwise the similarity is set to zero.
`minIndivid`	Minimum matching individuals for cluster similarity; otherwise the similarity is set to zero.
`avSNVsDist`	average distance between SNVs in base pairs - used together with `IBDsegmentLength` to compute the number of SNVs in the histogram bins; default=100.
`SNVclusterLength`	if `IBDsegmentLength=0` then the number of SNVs in the histogram bins can be given directly; default 100.

This function uses a genotype matrix in sparse matrix format and extracts IBD segments by biclustering. First, it performs Fabia biclustering and then extracts local accumulations of loadings. Then it disentangles IBD segments and prunes off spurious correlated SNVs. Finally, it merges similar IBD segments to account for larger IBD segments that were broken during the analysis.

Annotation file ..._annot.txt for SNVs:

first line: number individuals;
second line: number SNVs;
for each SNV a line containing following field that are blank separated: "chromosome", "physical position", "snvNames", "snvMajor", "snvMinor", "quality", "pass", "info of vcf file", "fields in vcf file", "frequency", "0/1: 1 is changed if major allele is actually minor allele".

labelsA is a matrix ("number individuals" x 4), where for each individual following characteristics are given:

id;
subPopulation;
population;
platform.

The probability of observing k or more correlated SNVs in a histogram bin is computed. The minimal k which pushes the probability below the threshold thresCount is used to find accumulation of correlated SNVs that are assumed to belong to a IBD segment.

Let p be the probability of a random minor allele match between t individuals. The probability of observing k or more matches for n SNVs in a histogram bin is given by one minus the binomial distribution F(k;n,p):

1 - F(k-1;n,p) = Pr(K >= k) = SUM(i = k to infinity; (n over i) power(p,i) power(1-p,n-i))

If q is the minor allele frequency (MAF) for one SNV, the probability p of observing the minor allele of this SNV in all t individuals is p=power(q,t). The value q is given by the parameter pMAF.

Implementation in R.

List containing

`mergedIBDsegmentList`	an object of the class `IBDsegmentList` that contains the extracted IBD segments that were extracted from two histograms with different offset.
`res`	the result of FABIA.
`sPF`	individuals per loading of this FABIA result.
`annot`	annotation for the genotype data.
`IBDsegmentList1`	an object of the class `IBDsegmentList` that contains the result of IBD segment extraction from the first histogram.
`IBDsegmentList2`	an object of the class `IBDsegmentList` that contains the result of IBD segment extraction from the second histogram.
`mergedIBDsegmentList1`	an object of the class `IBDsegmentList` that contains the merged result of the first IBD segment extraction (redundancies removed).
`mergedIBDsegmentList2`	an object of the class `IBDsegmentList` that contains the merged result of the second IBD segment extraction (redundancies removed).

Sepp Hochreiter

S. Hochreiter et al., ‘FABIA: Factor Analysis for Bicluster Acquisition’, Bioinformatics 26(12):1520-1527, 2010.

IBDsegment-class, IBDsegmentList-class, analyzeIBDsegments, compareIBDsegmentLists, extractIBDsegments, findDenseRegions, hapFabia, hapFabiaVersion, hapRes, chr1ASW1000G, IBDsegmentList2excel, identifyDuplicates, iterateIntervals, makePipelineFile, matrixPlot, mergeIBDsegmentLists, mergedIBDsegmentList, plotIBDsegment, res, setAnnotation, setStatistics, sim, simu, simulateIBDsegmentsFabia, simulateIBDsegments, split_sparse_matrix, toolsFactorizationClass, vcftoFABIA

old_dir <- getwd()
setwd(tempdir())

data(simu)

namesL <- simu[["namesL"]]
haploN <- simu[["haploN"]]
snvs <- simu[["snvs"]]
annot <- simu[["annot"]]
alleleIimp <- simu[["alleleIimp"]]
write.table(namesL,file="dataSim1fabia_individuals.txt",
   quote = FALSE,row.names = FALSE,col.names = FALSE)
write(as.integer(haploN),file="dataSim1fabia_annot.txt",
   ncolumns=100)
write(as.integer(snvs),file="dataSim1fabia_annot.txt",
   append=TRUE,ncolumns=100)
write.table(annot,file="dataSim1fabia_annot.txt",
   sep = " ",quote = FALSE,row.names = FALSE,
   col.names = FALSE,append=TRUE)
write(as.integer(haploN),file="dataSim1fabia_mat.txt",
   ncolumns=100)
write(as.integer(snvs),file="dataSim1fabia_mat.txt",
   append=TRUE,ncolumns=100)

for (i in 1:haploN) {

  a1 <- which(alleleIimp[i,]>0.01)

  al <- length(a1)
  b1 <- alleleIimp[i,a1]

  a1 <- a1 - 1
  dim(a1) <- c(1,al)
  b1 <- format(as.double(b1),nsmall=1)
  dim(b1) <- c(1,al)

  write.table(al,file="dataSim1fabia_mat.txt",
     sep = " ", quote = FALSE,row.names = FALSE,
     col.names = FALSE,append=TRUE)
  write.table(a1,file="dataSim1fabia_mat.txt",
     sep = " ", quote = FALSE,row.names = FALSE,
     col.names = FALSE,append=TRUE)
  write.table(b1,file="dataSim1fabia_mat.txt",
     sep = " ", quote = FALSE,row.names = FALSE,
     col.names = FALSE,append=TRUE)

}


hapRes <- hapFabia(fileName="dataSim1fabia",prefixPath="",
   sparseMatrixPostfix="_mat",
   annotPostfix="_annot.txt",individualsPostfix="_individuals.txt",
   labelsA=NULL,pRange="",individuals=0,lowerBP=0,upperBP=0.15,
   p=10,iter=1,quant=0.01,eps=1e-5,alpha=0.03,cyc=50,non_negative=1,
   write_file=0,norm=0,lap=100.0,IBDsegmentLength=10,Lt = 0.1,
   Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.1,
   haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
   simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

summary(hapRes$mergedIBDsegmentList)

plot(hapRes$mergedIBDsegmentList[[1]],filename="dataSim1fabia_mat")


### Another Example
simulateIBDsegmentsFabia(fileprefix="dataSim",
   minruns=2,maxruns=2,snvs=1000,individualsN=100,
   avDistSnvs=100,avDistMinor=10,noImplanted=1,
   implanted=10,length=50,minors=30,mismatches=0,
   mismatchImplanted=0.5,overlap=50)

hapRes <- hapFabia(fileName="dataSim2fabia",prefixPath="",
   sparseMatrixPostfix="_mat",
   annotPostfix="_annot.txt",individualsPostfix="_individuals.txt",
   labelsA=NULL,pRange="",individuals=0,lowerBP=0,upperBP=0.15,
   p=10,iter=1,quant=0.01,eps=1e-5,alpha=0.03,cyc=50,non_negative=1,
   write_file=0,norm=0,lap=100.0,IBDsegmentLength=10,Lt = 0.1,
   Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.1,
   haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
   simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

## Summary of the IBD segment list
summary(hapRes$mergedIBDsegmentList)

## Summary of the IBD segment 
summary(hapRes$mergedIBDsegmentList[[1]])


## Plot an IBD segment
plot(hapRes$mergedIBDsegmentList[[1]],filename="dataSim2fabia_mat")

## Not run: 
## It is interactive, thus dontrun!

## Plot an IBD segment list
plot(hapRes$mergedIBDsegmentList,filename="dataSim2fabia_mat")


## End(Not run)

setwd(old_dir)