hlaAttrBagging: Build a HIBAG model
In HIBAG: HIBAG -- HLA Genotype Imputation with Attribute Bagging

Description Usage Arguments Details Value Author(s) References See Also Examples

To build a HIBAG model for predicting HLA types.

1 2	hlaAttrBagging(hla, snp, nclassifier=100, mtry=c("sqrt", "all", "one"), prune=TRUE, rm.na=TRUE, verbose=TRUE, verbose.detail=FALSE)

`hla`	the training HLA types, an object of `hlaAlleleClass`
`snp`	the training SNP genotypes, an object of `hlaSNPGenoClass`
`nclassifier`	the total number of individual classifiers
`mtry`	a character or a numeric value, the number of variables randomly sampled as candidates for each selection. See details
`prune`	if TRUE, to perform a parsimonious forward variable selection, otherwise, exhaustive forward variable selection. See details
`rm.na`	if TRUE, remove the samples with missing HLA types
`verbose`	if TRUE, show information
`verbose.detail`	if TRUE, show more information

mtry (the number of variables randomly sampled as candidates for each selection): "sqrt", using the square root of the total number of candidate SNPs; "all", using all candidate SNPs; "one", using one SNP; an integer, specifying the number of candidate SNPs; 0 < r < 1, the number of candidate SNPs is "r * the total number of SNPs".

prune: there is no significant difference on accuracy between parsimonious and exhaustive forward variable selections. If prune=TRUE, the searching algorithm performs a parsimonious forward variable selection: if a new SNP predictor reduces the current out-of-bag accuracy, then it is removed from the candidate SNP set for future searching. Parsimonious selection helps to improve the computational efficiency by reducing the searching times on non-informative SNP markers.

A parallel version of hlaAttrBagging is hlaParallelAttrBagging.

Return an object of hlaAttrBagClass:

`n.samp`	the total number of training samples
`n.snp`	the total number of candidate SNP predictors
`sample.id`	the sample IDs
`snp.id`	the SNP IDs
`snp.position`	SNP position in basepair
`snp.allele`	a vector of characters with the format of “A allele/B allele”
`snp.allele.freq`	the allele frequencies
`hla.locus`	the name of HLA locus
`hla.allele`	the HLA alleles used in the model
`hla.freq`	the HLA allele frequencies
`assembly`	the human genome reference, such like "hg19"
`model`	internal use

Xiuwen Zheng

Zheng X, Shen J, Cox C, Wakefield J, Ehm M, Nelson M, Weir BS; HIBAG – HLA Genotype Imputation with Attribute Bagging; (Abstract 294, Platform/Oral Talk); Present at the 62nd Annual Meeting of the American Society of Human Genetics, November 9, 2012 in San Francisco, California.

Zheng X, Shen J, Cox C, Wakefield J, Ehm M, Nelson M, Weir BS; HIBAG – HLA Genotype Imputation with Attribute Bagging. Pharmacogenomics Journal. doi: 10.1038/tpj.2013.18. http://dx.doi.org/10.1038/tpj.2013.18

hlaClose, hlaParallelAttrBagging, summary.hlaAttrBagClass, predict.hlaAttrBagClass

# load HLA types and SNP genotypes
data(HLA_Type_Table, package="HIBAG")
data(HapMap_CEU_Geno, package="HIBAG")

# make a "hlaAlleleClass" object
hla.id <- "A"
hla <- hlaAllele(HLA_Type_Table$sample.id,
	H1 = HLA_Type_Table[, paste(hla.id, ".1", sep="")],
	H2 = HLA_Type_Table[, paste(hla.id, ".2", sep="")],
	locus=hla.id, assembly="hg19")

# divide HLA types randomly
set.seed(100)
hlatab <- hlaSplitAllele(hla, train.prop=0.5)
names(hlatab)
# "training"   "validation"
summary(hlatab$training)
summary(hlatab$validation)

# SNP predictors within the flanking region on each side
region <- 500   # kb
snpid <- hlaFlankingSNP(HapMap_CEU_Geno$snp.id, HapMap_CEU_Geno$snp.position,
	hla.id, region*1000, assembly="hg19")
length(snpid)  # 275

# training and validation genotypes
train.geno <- hlaGenoSubset(HapMap_CEU_Geno,
	snp.sel=match(snpid, HapMap_CEU_Geno$snp.id),
	samp.sel=match(hlatab$training$value$sample.id, HapMap_CEU_Geno$sample.id))
test.geno <- hlaGenoSubset(HapMap_CEU_Geno,
	samp.sel=match(hlatab$validation$value$sample.id, HapMap_CEU_Geno$sample.id))

# train a HIBAG model
set.seed(100)
# please use "nclassifier=100" when you use HIBAG for real data
model <- hlaAttrBagging(hlatab$training, train.geno, nclassifier=4,
	verbose.detail=TRUE)
summary(model)

# validation
pred <- predict(model, test.geno)
# compare
(comp <- hlaCompareAllele(hlatab$validation, pred, allele.limit=model,
	call.threshold=0))
(comp <- hlaCompareAllele(hlatab$validation, pred, allele.limit=model,
	call.threshold=0.5))


# save the parameter file
mobj <- hlaModelToObj(model)
save(mobj, file="HIBAG_model.RData")
save(test.geno, file="testgeno.RData")
save(hlatab, file="HLASplit.RData")

# Clear Workspace
hlaClose(model)  # release all resources of model
rm(list = ls())


######################################################################

# NOW, load a HIBAG model from the parameter file
mobj <- get(load("HIBAG_model.RData"))
model <- hlaModelFromObj(mobj)

# validation
test.geno <- get(load("testgeno.RData"))
hlatab <- get(load("HLASplit.RData"))

pred <- predict(model, test.geno, type="response")
summary(pred)

# compare
(comp <- hlaCompareAllele(hlatab$validation, pred, allele.limit=model,
	call.threshold=0))
(comp <- hlaCompareAllele(hlatab$validation, pred, allele.limit=model,
	call.threshold=0.5))