iterateIntervals: Loop over DNA intervals with a call of 'hapFabia'



Description

iterateIntervals: R implementation of iterateIntervals.

Loops over all intervals, calls hapFabia on each, and stores the results. The intervals must have been generated beforehand by split_sparse_matrix.

Usage

iterateIntervals(startRun=1,endRun,shift=5000,intervalSize=10000,
   annotationFile=NULL,fileName,prefixPath="",
   sparseMatrixPostfix="_mat",annotPostfix="_annot.txt",
   individualsPostfix="_individuals.txt",individuals=0,
   lowerBP=0,upperBP=0.05,p=10,iter=40,quant=0.01,eps=1e-5,
   alpha=0.03,cyc=50,non_negative=1,write_file=0,norm=0,
   lap=100.0,IBDsegmentLength=50,Lt = 0.1,Zt = 0.2,
   thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.03,
   haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
   simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

Arguments

startRun

first interval.

endRun

last interval.

shift

distance between starts of adjacent intervals.

intervalSize

number of SNVs in an interval.

annotationFile

file name of the annotation file for the individuals.

fileName

passed to hapFabia: file name of the genotype matrix in sparse format.

prefixPath

passed to hapFabia: path to the genotype file.

sparseMatrixPostfix

passed to hapFabia: postfix string for the sparse matrix.

annotPostfix

passed to hapFabia: postfix string for the SNV annotation file.

individualsPostfix

passed to hapFabia: postfix string for the file containing the names of the individuals.

individuals

passed to hapFabia: vector of individuals which are included into the analysis; default = 0 (all individuals).

lowerBP

passed to hapFabia: lower bound on the minor allele frequency (MAF); regardless of this bound, at least two occurrences are required, so private SNVs are removed.

upperBP

passed to hapFabia: upper bound on minor allele frequencies (MAF) to extract rare variants.

p

passed to hapFabia: number of biclusters per iteration.

iter

passed to hapFabia: number of iterations.

quant

passed to hapFabia: percentage of loadings L to remove in each iteration.

eps

passed to hapFabia: lower bound for variational parameter lapla; default 1e-5.

alpha

passed to hapFabia: sparseness of the loadings; default = 0.03.

cyc

passed to hapFabia: number of cycles per iteration; default 50.

non_negative

passed to hapFabia: non-negative factors and loadings if non_negative = 1; default = 1 (yes).

write_file

passed to hapFabia: results are written to files (L in sparse format), default = 0 (not written).

norm

passed to hapFabia: data normalization; default 0 (no normalization).

lap

passed to hapFabia: minimal value of the variational parameter; default 100.0.

IBDsegmentLength

passed to hapFabia: typical IBD segment length in kbp.

Lt

passed to hapFabia: percentage of largest Ls to consider for IBD segment extraction.

Zt

passed to hapFabia: percentage of largest Zs to consider for IBD segment extraction.

thresCount

passed to hapFabia: p-value of random histogram hit; default 1e-5.

mintagSNVsFactor

passed to hapFabia: percentage of IBD segment overlap; default 3/4.

pMAF

passed to hapFabia: averaged and corrected (for non-uniform distributions) minor allele frequency.

haplotypes

passed to hapFabia: if haplotypes = TRUE, phased genotypes are used (two chromosomes per individual); otherwise unphased genotypes are used.

cut

passed to hapFabia: cutoff for merging IBD segments after a hierarchical clustering; default 0.8.

procMinIndivids

passed to hapFabia: percentage of cluster individuals a tagSNV must tag to be considered as tagSNV for the IBD segment.

thresPrune

passed to hapFabia: threshold for pruning border tagSNVs based on an exponential distribution where border tagSNVs with large distances to the next tagSNV are pruned.

simv

passed to hapFabia: similarity measure for merging clusters: "minD" (percentage of smaller explained by larger set), "jaccard" (Jaccard index), "dice" (Dice index), or "maxD"; default "minD".

minTagSNVs

passed to hapFabia: minimum matching tagSNVs for cluster similarity; otherwise the similarity is set to zero.

minIndivid

passed to hapFabia: minimum matching individuals for cluster similarity; otherwise the similarity is set to zero.

avSNVsDist

passed to hapFabia: average distance between SNVs in base pairs - used together with IBDsegmentLength to compute the number of SNVs in the histogram bins; default=100.

SNVclusterLength

passed to hapFabia: if IBDsegmentLength=0 then the number of SNVs in the histogram bins can be given directly; default 100.

Details

Implementation in R. Reads annotation of the individuals if available, then calls hapFabia and stores its results. Results are saved in EXCEL format and as R binaries.
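
For one interval, the stored R binary can be loaded as sketched below (the file-name construction mirrors the Examples section; fileName, shiftSize, intervalSize and the interval index posAll are placeholders that must match your run):

  posAll <- 3                                  # placeholder: index of the interval to inspect
  start <- (posAll-1)*shiftSize
  end <- start + intervalSize
  pRange <- paste("_",format(start,scientific=FALSE),"_",
     format(end,scientific=FALSE),sep="")
  load(file=paste(fileName,pRange,"_resAnno",".Rda",sep=""))  # provides resHapFabia
  IBDsegmentList <- resHapFabia$mergedIBDsegmentList
  summary(IBDsegmentList)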

iterateIntervals loops over all intervals, calls hapFabia, and stores the results. The intervals have been generated by split_sparse_matrix. The results are the identified IBD segments, which are stored separately per interval. A subsequent analysis first calls identifyDuplicates to find IBD segments that were detected more than once and then analyzes the IBD segments with analyzeIBDsegments; a minimal sketch of these downstream calls follows.
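
A minimal sketch of the downstream calls, taken from the Examples section (startRun, endRunA and the remaining variables are placeholders and must match the values used for iterateIntervals):

  identifyDuplicates(fileName=fileName,startRun=1,endRun=endRunA,
     shift=shiftSize,intervalSize=intervalSize)
  anaRes <- analyzeIBDsegments(fileName=fileName,startRun=1,endRun=endRunA,
     shift=shiftSize,intervalSize=intervalSize)
  anaRes$noIBDsegments   # number of IBD segments found over all intervals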

The SNV annotation file ..._annot.txt contains (a hypothetical layout is sketched after the following list):

  1. first line: number individuals;

  2. second line: number SNVs;

  3. for each SNV, a line with the following blank-separated fields: "chromosome", "physical position", "snvNames", "snvMajor", "snvMinor", "quality", "pass", "info of vcf file", "fields in vcf file", "frequency", "0/1: 1 is changed if major allele is actually minor allele".
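
A hypothetical illustration of this layout (all values are made up and only the first two SNV lines are shown; the real file is written by vcftoFABIA):

  61
  3022
  1 1000153 rs12345 T C 100 PASS . GT 0.016 0
  1 1001845 rs67890 A G 100 PASS . GT 0.008 0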

The individuals annotation file, whose name is passed via annotationFile, contains one tab-separated line per individual with the following fields (a hypothetical example follows the list):

  1. id;

  2. subPopulation;

  3. population;

  4. platform.
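
A hypothetical example of such a file (tab-separated; names and labels are made up):

  indivA01	subPop1	pop1	platform1
  indivA02	subPop1	pop1	platform2
  indivB01	subPop2	pop1	platform1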

Value

Loop over DNA intervals with a call of hapFabia; the function is called for its side effects, writing the per-interval results to files (see Details).

Author(s)

Sepp Hochreiter

References

S. Hochreiter et al., ‘FABIA: Factor Analysis for Bicluster Acquisition’, Bioinformatics 26(12):1520-1527, 2010.

See Also

IBDsegment-class, IBDsegmentList-class, analyzeIBDsegments, compareIBDsegmentLists, extractIBDsegments, findDenseRegions, hapFabia, hapFabiaVersion, hapRes, chr1ASW1000G, IBDsegmentList2excel, identifyDuplicates, iterateIntervals, makePipelineFile, matrixPlot, mergeIBDsegmentLists, mergedIBDsegmentList, plotIBDsegment, res, setAnnotation, setStatistics, sim, simu, simulateIBDsegmentsFabia, simulateIBDsegments, split_sparse_matrix, toolsFactorizationClass, vcftoFABIA

Examples

## Not run: 

### here an example of the automatically generated pipeline
### with: shiftSize=5000,intervalSize=10000,fileName="filename"


#####define intervals, overlap, filename #######
shiftSize <- 5000
intervalSize <- 10000
fileName="filename" # without type
haplotypes <- TRUE
dosage <- FALSE

#####load library#######
library(hapFabia)

#####convert from .vcf to _mat.txt#######
vcftoFABIA(fileName=fileName)

#####copy haplotype, genotype, or dosage matrix to matrix#######
if (haplotypes) {
    file.copy(paste(fileName,"_matH.txt",sep=""), paste(fileName,"_mat.txt",sep=""))
} else {
    if (dosage) {
        file.copy(paste(fileName,"_matD.txt",sep=""), paste(fileName,"_mat.txt",sep=""))
    } else {
        file.copy(paste(fileName,"_matG.txt",sep=""), paste(fileName,"_mat.txt",sep=""))
    }
}

#####split/ generate intervals#######
split_sparse_matrix(fileName=fileName,intervalSize=intervalSize,
shiftSize=shiftSize,annotation=TRUE)

#####compute how many intervals we have#######
ina <- as.numeric(readLines(paste(fileName,"_mat.txt",sep=""),n=2)) # header: number of individuals, number of SNVs
noSNVs <- ina[2]
over <- intervalSize%/%shiftSize   # shift steps covered by one interval
N1 <- noSNVs%/%shiftSize           # shift steps over all SNVs
endRunA <- (N1-over+2)             # index of the last interval

#####analyze each interval#######
#####may be done by parallel runs#######
iterateIntervals(startRun=1,endRun=endRunA,shift=shiftSize,
intervalSize=intervalSize,fileName=fileName,individuals=0,
upperBP=0.05,p=10,iter=40,alpha=0.03,cyc=50,IBDsegmentLength=50,
Lt = 0.1,Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,
pMAF=0.035,haplotypes=haplotypes,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

#####identify duplicates#######
identifyDuplicates(fileName=fileName,startRun=1,endRun=endRunA,
shift=shiftSize,intervalSize=intervalSize)

#####analyze results; parallel#######
anaRes <- analyzeIBDsegments(fileName=fileName,startRun=1,endRun=endRunA,
shift=shiftSize,intervalSize=intervalSize)
print("Number IBD segments:")
print(anaRes$noIBDsegments)
print("Statistics on IBD segment length in SNVs (all SNVs in the IBD segment):")
print(anaRes$avIBDsegmentLengthSNVS)
print("Statistics on IBD segment length in bp:")
print(anaRes$avIBDsegmentLengthS)
print("Statistics on number of individuals belonging to IBD segments:")
print(anaRes$avnoIndividS)
print("Statistics on number of tagSNVs of IBD segments:")
print(anaRes$avnoTagSNVsS)
print("Statistics on MAF of tagSNVs of IBD segments:")
print(anaRes$avnoFreqS)
print("Statistics on MAF within the group of tagSNVs of IBD segments:")
print(anaRes$avnoGroupFreqS)
print("Statistics on number of changes between major and minor allele frequency:")
print(anaRes$avnotagSNVChangeS)
print("Statistics on number of tagSNVs per individual of an IBD segment:")
print(anaRes$avnotagSNVsPerIndividualS)
print("Statistics on number of individuals that have the minor allele of tagSNVs:")
print(anaRes$avnoindividualPerTagSNVS)

#####load result for interval 50#######
posAll <- 50 # (50-1)*5000 = 245000: interval 245000 to 255000
start <- (posAll-1)*shiftSize
end <- start + intervalSize
pRange <- paste("_",format(start,scientific=FALSE),"_",
format(end,scientific=FALSE),sep="")
load(file=paste(fileName,pRange,"_resAnno",".Rda",sep=""))
IBDsegmentList <- resHapFabia$mergedIBDsegmentList

summary(IBDsegmentList)
#####plot IBD segments in interval 50#######
plot(IBDsegmentList,filename=paste(fileName,pRange,"_mat",sep=""))
   ##attention: filename without type ".txt"

#####plot the first IBD segment in interval 50#######

IBDsegment <- IBDsegmentList[[1]]
plot(IBDsegment,filename=paste(fileName,pRange,"_mat",sep=""))
   ##attention: filename without type ".txt"


## End(Not run)

#Work in a temporary directory.

old_dir <- getwd()
setwd(tempdir())


# Load data and write to vcf file.
data(chr1ASW1000G)
write(chr1ASW1000G,file="chr1ASW1000G.vcf")

#Create the analysis pipeline for haplotype data (1000Genomes)
makePipelineFile(fileName="chr1ASW1000G",shiftSize=500,intervalSize=1000,haplotypes=TRUE)

source("pipeline.R")

# The following files are produced:
list.files(pattern="chr1")



# Next we load interval 5 and extract its first and second IBD segments
posAll <- 5
start <- (posAll-1)*shiftSize
end <- start + intervalSize
pRange <- paste("_",format(start,scientific=FALSE),"_",format(end,scientific=FALSE),sep="")
load(file=paste(fileName,pRange,"_resAnno",".Rda",sep=""))
IBDsegmentList <- resHapFabia$mergedIBDsegmentList
summary(IBDsegmentList)
IBDsegment1 <- IBDsegmentList[[1]]
summary(IBDsegment1)
IBDsegment2 <- IBDsegmentList[[2]]
summary(IBDsegment2)




#Plot the first IBD segment in interval 5
plot(IBDsegment1,filename=paste(fileName,pRange,"_mat",sep=""))


#Plot the second IBD segment in interval 5
plot(IBDsegment2,filename=paste(fileName,pRange,"_mat",sep=""))

setwd(old_dir)
