hlaAttrBagging: Build a HIBAG model
In zhengxwen/HIBAG: HLA Genotype Imputation with Attribute Bagging

hlaAttrBagging

R Documentation

Build a HIBAG model

Description

To build a HIBAG model for predicting HLA types with SNP markers.

Usage

hlaAttrBagging(hla, snp, nclassifier=100L, mtry=c("sqrt", "all", "one"),
    prune=TRUE, na.rm=TRUE, mono.rm=TRUE, maf=NaN, nthread=1L, verbose=TRUE,
    verbose.detail=FALSE)

Arguments

`hla`	the training HLA types, an object of `hlaAlleleClass`
`snp`	the training SNP genotypes, an object of `hlaSNPGenoClass`
`nclassifier`	the total number of individual classifiers
`mtry`	a character or a numeric value, the number of variables randomly sampled as candidates for each selection. See details
`prune`	if TRUE, to perform a parsimonious forward variable selection, otherwise, exhaustive forward variable selection. See details
`na.rm`	if TRUE, remove the samples with missing HLA alleles
`mono.rm`	if TRUE, remove monomorphic SNPs
`maf`	MAF threshold for SNP filter, excluding any SNP with MAF < `maf`
`nthread`	specify the number of threads used in the model building; if `TRUE`, use the number of threads returned from `RcppParallel::defaultNumThreads()` (by default using all threads)
`verbose`	if `TRUE`, show information
`verbose.detail`	if `TRUE`, show more information

Details

mtry (the number of variables randomly sampled as candidates for each selection, "sqrt" by default): "sqrt", using the square root of the total number of candidate SNPs; "all", using all candidate SNPs; "one", using one SNP; an integer, specifying the number of candidate SNPs; 0 < r < 1, the number of candidate SNPs is "r * the total number of SNPs".

prune: there is no significant difference on accuracy between parsimonious and exhaustive forward variable selections. If prune=TRUE, the searching algorithm performs a parsimonious forward variable selection: if a new SNP predictor reduces the current out-of-bag accuracy, then it is removed from the candidate SNP set for future searching. Parsimonious selection helps to improve the computational efficiency by reducing the searching times on non-informative SNP markers.

hlaParallelAttrBagging extends hlaAttrBagging to allow parallel computing with multiple compute nodes in a cluster. An autosave function is available in hlaParallelAttrBagging when an new individual classifier is built internally without completing the ensemble.

Value

Return an object of hlaAttrBagClass:

`n.samp`	the total number of training samples
`n.snp`	the total number of candidate SNP predictors
`sample.id`	the sample IDs
`snp.id`	the SNP IDs
`snp.position`	SNP position in basepair
`snp.allele`	a vector of characters with the format of “A allele/B allele”
`snp.allele.freq`	the allele frequencies
`hla.locus`	the name of HLA locus
`hla.allele`	the HLA alleles used in the model
`hla.freq`	the HLA allele frequencies
`assembly`	the human genome reference, such like "hg19"
`model`	internal use
`matching`	matching proportion in the training set

Author(s)

Xiuwen Zheng

References

Zheng X, Shen J, Cox C, Wakefield J, Ehm M, Nelson M, Weir BS; HIBAG – HLA Genotype Imputation with Attribute Bagging. Pharmacogenomics Journal. doi: 10.1038/tpj.2013.18. https://www.nature.com/articles/tpj201318

Examples

# make a "hlaAlleleClass" object
hla.id <- "A"
hla <- hlaAllele(HLA_Type_Table$sample.id,
    H1 = HLA_Type_Table[, paste(hla.id, ".1", sep="")],
    H2 = HLA_Type_Table[, paste(hla.id, ".2", sep="")],
    locus=hla.id, assembly="hg19")

# divide HLA types randomly
set.seed(100)
hlatab <- hlaSplitAllele(hla, train.prop=0.5)
names(hlatab)
# "training"   "validation"
summary(hlatab$training)
summary(hlatab$validation)

# SNP predictors within the flanking region on each side
region <- 500   # kb
snpid <- hlaFlankingSNP(HapMap_CEU_Geno$snp.id, HapMap_CEU_Geno$snp.position,
    hla.id, region*1000, assembly="hg19")
length(snpid)  # 275

# training and validation genotypes
train.geno <- hlaGenoSubset(HapMap_CEU_Geno,
    snp.sel=match(snpid, HapMap_CEU_Geno$snp.id),
    samp.sel=match(hlatab$training$value$sample.id,
    HapMap_CEU_Geno$sample.id))
test.geno <- hlaGenoSubset(HapMap_CEU_Geno,
    samp.sel=match(hlatab$validation$value$sample.id,
    HapMap_CEU_Geno$sample.id))

# train a HIBAG model
set.seed(100)
# please use "nclassifier=100" when you use HIBAG for real data
model <- hlaAttrBagging(hlatab$training, train.geno, nclassifier=4,
    verbose.detail=TRUE)
summary(model)

# validation
pred <- hlaPredict(model, test.geno)
summary(pred)

# compare
(comp <- hlaCompareAllele(hlatab$validation, pred, allele.limit=model,
    call.threshold=0))
(comp <- hlaCompareAllele(hlatab$validation, pred, allele.limit=model,
    call.threshold=0.5))


# save the parameter file
mobj <- hlaModelToObj(model)
save(mobj, file="HIBAG_model.RData")
save(test.geno, file="testgeno.RData")
save(hlatab, file="HLASplit.RData")

# Clear Workspace
hlaClose(model)  # release all resources of model
rm(list = ls())


######################################################################

# NOW, load a HIBAG model from the parameter file
mobj <- get(load("HIBAG_model.RData"))
model <- hlaModelFromObj(mobj)

# validation
test.geno <- get(load("testgeno.RData"))
hlatab <- get(load("HLASplit.RData"))

pred <- hlaPredict(model, test.geno, type="response")
summary(pred)

# compare
(comp <- hlaCompareAllele(hlatab$validation, pred, allele.limit=model,
    call.threshold=0))
(comp <- hlaCompareAllele(hlatab$validation, pred, allele.limit=model,
    call.threshold=0.5))


# delete the temporary files
unlink(c("HIBAG_model.RData", "testgeno.RData", "HLASplit.RData"), force=TRUE)

zhengxwen/HIBAG documentation built on Nov. 24, 2024, 5:24 a.m.