iterateIntervals: Loop over DNA intervals with a call of 'hapFabia'



Description

iterateIntervals: R implementation of iterateIntervals.

Loops over all intervals, calls hapFabia on each, and stores the results. The intervals must have been generated beforehand by split_sparse_matrix.

Usage

iterateIntervals(startRun=1,endRun,shift=5000,intervalSize=10000,
   annotationFile=NULL,fileName,prefixPath="",
   sparseMatrixPostfix="_mat",annotPostfix="_annot.txt",
   individualsPostfix="_individuals.txt",individuals=0,
   lowerBP=0,upperBP=0.05,p=10,iter=40,quant=0.01,eps=1e-5,
   alpha=0.03,cyc=50,non_negative=1,write_file=0,norm=0,
   lap=100.0,IBDsegmentLength=50,Lt = 0.1,Zt = 0.2,
   thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.03,
   haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
   simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

Arguments

startRun

first interval.

endRun

last interval.

shift

distance between starts of adjacent intervals.

intervalSize

number of SNVs in an interval.

annotationFile

file name of the annotation file for the individuals.

fileName

passed to hapFabia: file name of the genotype matrix in sparse format.

prefixPath

passed to hapFabia: path to the genotype file.

sparseMatrixPostfix

passed to hapFabia: postfix string for the sparse matrix.

annotPostfix

passed to hapFabia: postfix string for the SNV annotation file.

individualsPostfix

passed to hapFabia: postfix string for the file containing the names of the individuals.

individuals

passed to hapFabia: vector of individuals which are included into the analysis; default = 0 (all individuals).

lowerBP

passed to hapFabia: lower bound on the minor allele frequency (MAF); regardless of this bound, at least two occurrences are required, so private SNVs are removed.

upperBP

passed to hapFabia: upper bound on minor allele frequencies (MAF) to extract rare variants.

p

passed to hapFabia: number of biclusters per iteration.

iter

passed to hapFabia: number of iterations.

quant

passed to hapFabia: percentage of loadings L to remove in each iteration.

eps

passed to hapFabia: lower bound for variational parameter lapla; default 1e-5.

alpha

passed to hapFabia: sparseness of the loadings; default = 0.03.

cyc

passed to hapFabia: number of cycles per iteration; default 50.

non_negative

passed to hapFabia: non-negative factors and loadings if non_negative = 1; default = 1 (yes).

write_file

passed to hapFabia: results are written to files (L in sparse format), default = 0 (not written).

norm

passed to hapFabia: data normalization; default 0 (no normalization).

lap

passed to hapFabia: minimal value of the variational parameter; default 100.0.

IBDsegmentLength

passed to hapFabia: typical IBD segment length in kbp.

Lt

passed to hapFabia: percentage of largest Ls to consider for IBD segment extraction.

Zt

passed to hapFabia: percentage of largest Zs to consider for IBD segment extraction.

thresCount

passed to hapFabia: p-value of random histogram hit; default 1e-5.

mintagSNVsFactor

passed to hapFabia: percentage of IBD segment overlap; default 3/4.

pMAF

passed to hapFabia: averaged and corrected (for non-uniform distributions) minor allele frequency.

haplotypes

passed to hapFabia: if haplotypes = TRUE, phased genotypes are used (two chromosomes per individual); otherwise unphased genotypes are used.

cut

passed to hapFabia: cutoff for merging IBD segments after a hierarchical clustering; default 0.8.

procMinIndivids

passed to hapFabia: percentage of cluster individuals a tagSNV must tag to be considered as tagSNV for the IBD segment.

thresPrune

passed to hapFabia: threshold for pruning border tagSNVs based on an exponential distribution where border tagSNVs with large distances to the next tagSNV are pruned.

simv

passed to hapFabia: similarity measure for merging clusters: "minD" (percentage of smaller explained by larger set), "jaccard" (Jaccard index), "dice" (Dice index), or "maxD"; default "minD".

minTagSNVs

passed to hapFabia: minimum matching tagSNVs for cluster similarity; otherwise the similarity is set to zero.

minIndivid

passed to hapFabia: minimum matching individuals for cluster similarity; otherwise the similarity is set to zero.

avSNVsDist

passed to hapFabia: average distance between SNVs in base pairs - used together with IBDsegmentLength to compute the number of SNVs in the histogram bins; default=100.

SNVclusterLength

passed to hapFabia: if IBDsegmentLength=0 then the number of SNVs in the histogram bins can be given directly; default 100.

Details

Implementation in R. Reads annotation of the individuals if available, then calls hapFabia and stores its results. Results are saved in EXCEL format and as R binaries.
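
For one interval, the stored R binary can be loaded as sketched below (the file-name construction mirrors the Examples section; fileName, shiftSize, intervalSize and the interval index posAll are placeholders that must match your run):

  posAll <- 3                                  # placeholder: index of the interval to inspect
  start <- (posAll-1)*shiftSize
  end <- start + intervalSize
  pRange <- paste("_",format(start,scientific=FALSE),"_",
     format(end,scientific=FALSE),sep="")
  load(file=paste(fileName,pRange,"_resAnno",".Rda",sep=""))  # provides resHapFabia
  IBDsegmentList <- resHapFabia$mergedIBDsegmentList
  summary(IBDsegmentList)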

iterateIntervals loops over all intervals, calls hapFabia, and stores the results. The intervals have been generated by split_sparse_matrix. The results are the identified IBD segments, which are stored separately per interval. A subsequent analysis first calls identifyDuplicates to find IBD segments that were detected more than once and then analyzes the IBD segments with analyzeIBDsegments; a minimal sketch of these downstream calls follows.
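
A minimal sketch of the downstream calls, taken from the Examples section (startRun, endRunA and the remaining variables are placeholders and must match the values used for iterateIntervals):

  identifyDuplicates(fileName=fileName,startRun=1,endRun=endRunA,
     shift=shiftSize,intervalSize=intervalSize)
  anaRes <- analyzeIBDsegments(fileName=fileName,startRun=1,endRun=endRunA,
     shift=shiftSize,intervalSize=intervalSize)
  anaRes$noIBDsegments   # number of IBD segments found over all intervals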

The SNV annotation file ..._annot.txt contains (a hypothetical layout is sketched after the following list):

  1. first line: number individuals;

  2. second line: number SNVs;

  3. for each SNV, a line with the following blank-separated fields: "chromosome", "physical position", "snvNames", "snvMajor", "snvMinor", "quality", "pass", "info of vcf file", "fields in vcf file", "frequency", "0/1: 1 is changed if major allele is actually minor allele".
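
A hypothetical illustration of this layout (all values are made up and only the first two SNV lines are shown; the real file is written by vcftoFABIA):

  61
  3022
  1 1000153 rs12345 T C 100 PASS . GT 0.016 0
  1 1001845 rs67890 A G 100 PASS . GT 0.008 0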

The individuals annotation file, whose name is passed via annotationFile, contains one tab-separated line per individual with the following fields (a hypothetical example follows the list):

  1. id;

  2. subPopulation;

  3. population;

  4. platform.
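
A hypothetical example of such a file (tab-separated; names and labels are made up):

  indivA01	subPop1	pop1	platform1
  indivA02	subPop1	pop1	platform2
  indivB01	subPop2	pop1	platform1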

Value

Loop over DNA intervals with a call of hapFabia; the function is called for its side effects, writing the per-interval results to files (see Details).

Author(s)

Sepp Hochreiter

References

S. Hochreiter et al., ‘FABIA: Factor Analysis for Bicluster Acquisition’, Bioinformatics 26(12):1520-1527, 2010.

See Also

IBDsegment-class, IBDsegmentList-class, analyzeIBDsegments, compareIBDsegmentLists, extractIBDsegments, findDenseRegions, hapFabia, hapFabiaVersion, hapRes, chr1ASW1000G, IBDsegmentList2excel, identifyDuplicates, iterateIntervals, makePipelineFile, matrixPlot, mergeIBDsegmentLists, mergedIBDsegmentList, plotIBDsegment, res, setAnnotation, setStatistics, sim, simu, simulateIBDsegmentsFabia, simulateIBDsegments, split_sparse_matrix, toolsFactorizationClass, vcftoFABIA

Examples

## Not run: 

### here an example of the automatically generated pipeline
### with: shiftSize=5000,intervalSize=10000,fileName="filename"


#####define intervals, overlap, filename #######
shiftSize <- 5000
intervalSize <- 10000
fileName="filename" # without type
haplotypes <- TRUE
dosage <- FALSE

#####load library#######
library(hapFabia)

#####convert from .vcf to _mat.txt#######
vcftoFABIA(fileName=fileName)

#####copy haplotype, genotype, or dosage matrix to matrix#######
if (haplotypes) {
    file.copy(paste(fileName,"_matH.txt",sep=""), paste(fileName,"_mat.txt",sep=""))
} else {
    if (dosage) {
        file.copy(paste(fileName,"_matD.txt",sep=""), paste(fileName,"_mat.txt",sep=""))
    } else {
        file.copy(paste(fileName,"_matG.txt",sep=""), paste(fileName,"_mat.txt",sep=""))
    }
}

#####split/ generate intervals#######
split_sparse_matrix(fileName=fileName,intervalSize=intervalSize,
shiftSize=shiftSize,annotation=TRUE)

#####compute how many intervals we have#######
ina <- as.numeric(readLines(paste(fileName,"_mat.txt",sep=""),n=2)) # header: number of individuals, number of SNVs
noSNVs <- ina[2]
over <- intervalSize%/%shiftSize   # shift steps covered by one interval
N1 <- noSNVs%/%shiftSize           # shift steps over all SNVs
endRunA <- (N1-over+2)             # index of the last interval

#####analyze each interval#######
#####may be done by parallel runs#######
iterateIntervals(startRun=1,endRun=endRunA,shift=shiftSize,
intervalSize=intervalSize,fileName=fileName,individuals=0,
upperBP=0.05,p=10,iter=40,alpha=0.03,cyc=50,IBDsegmentLength=50,
Lt = 0.1,Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,
pMAF=0.035,haplotypes=haplotypes,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

#####identify duplicates#######
identifyDuplicates(fileName=fileName,startRun=1,endRun=endRunA,
shift=shiftSize,intervalSize=intervalSize)

#####analyze results; parallel#######
anaRes <- analyzeIBDsegments(fileName=fileName,startRun=1,endRun=endRunA,
shift=shiftSize,intervalSize=intervalSize)
print("Number IBD segments:")
print(anaRes$noIBDsegments)
print("Statistics on IBD segment length in SNVs (all SNVs in the IBD segment):")
print(anaRes$avIBDsegmentLengthSNVS)
print("Statistics on IBD segment length in bp:")
print(anaRes$avIBDsegmentLengthS)
print("Statistics on number of individuals belonging to IBD segments:")
print(anaRes$avnoIndividS)
print("Statistics on number of tagSNVs of IBD segments:")
print(anaRes$avnoTagSNVsS)
print("Statistics on MAF of tagSNVs of IBD segments:")
print(anaRes$avnoFreqS)
print("Statistics on MAF within the group of tagSNVs of IBD segments:")
print(anaRes$avnoGroupFreqS)
print("Statistics on number of changes between major and minor allele frequency:")
print(anaRes$avnotagSNVChangeS)
print("Statistics on number of tagSNVs per individual of an IBD segment:")
print(anaRes$avnotagSNVsPerIndividualS)
print("Statistics on number of individuals that have the minor allele of tagSNVs:")
print(anaRes$avnoindividualPerTagSNVS)

#####load result for interval 50#######
posAll <- 50 # (50-1)*5000 = 245000: interval 245000 to 255000
start <- (posAll-1)*shiftSize
end <- start + intervalSize
pRange <- paste("_",format(start,scientific=FALSE),"_",
format(end,scientific=FALSE),sep="")
load(file=paste(fileName,pRange,"_resAnno",".Rda",sep=""))
IBDsegmentList <- resHapFabia$mergedIBDsegmentList

summary(IBDsegmentList)
#####plot IBD segments in interval 50#######
plot(IBDsegmentList,filename=paste(fileName,pRange,"_mat",sep=""))
   ##attention: filename without type ".txt"

#####plot the first IBD segment in interval 50#######

IBDsegment <- IBDsegmentList[[1]]
plot(IBDsegment,filename=paste(fileName,pRange,"_mat",sep=""))
   ##attention: filename without type ".txt"


## End(Not run)

#Work in a temporary directory.

old_dir <- getwd()
setwd(tempdir())


# Load data and write to vcf file.
data(chr1ASW1000G)
write(chr1ASW1000G,file="chr1ASW1000G.vcf")

#Create the analysis pipeline for haplotype data (1000Genomes)
makePipelineFile(fileName="chr1ASW1000G",shiftSize=500,intervalSize=1000,haplotypes=TRUE)

source("pipeline.R")

# The following files are produced:
list.files(pattern="chr1")



# Next we load interval 5 and extract its first and second IBD segments
posAll <- 5
start <- (posAll-1)*shiftSize
end <- start + intervalSize
pRange <- paste("_",format(start,scientific=FALSE),"_",format(end,scientific=FALSE),sep="")
load(file=paste(fileName,pRange,"_resAnno",".Rda",sep=""))
IBDsegmentList <- resHapFabia$mergedIBDsegmentList
summary(IBDsegmentList)
IBDsegment1 <- IBDsegmentList[[1]]
summary(IBDsegment1)
IBDsegment2 <- IBDsegmentList[[2]]
summary(IBDsegment2)




#Plot the first IBD segment in interval 5
plot(IBDsegment1,filename=paste(fileName,pRange,"_mat",sep=""))


#Plot the second IBD segment in interval 5
plot(IBDsegment2,filename=paste(fileName,pRange,"_mat",sep=""))

setwd(old_dir)
