GraB: Prediction using the Optimal Gradient Boosted Regression...

Description Usage Arguments Details Value References Examples

Description

The function returns the prediction of the polygenic gene score weights based on the optimal gradient boosted regression trees model.

Usage

1
2
3
4
GraB(betas, annotations, pos = 2, pos_sign = 3, abs_effect = 2:5,
  trait_name = NULL, which.var = 2:5, steps = 1, validation = 5,
  verbose = FALSE, interval = 200, sig = 1e-05, interact_depth = 5,
  shrink = 0.001, bag_frac = 0.5, max_tree = 2000, WRITE = FALSE)

Arguments

betas

a matrix of regression coefficients from association analysis in the target population. The first column is the chromosome for each SNP, and the column with the regression coefficient should be specified by setting pos. The default value for pos is 2. The SNP IDs or other information could be present as additional columns. Users need to prepare univariate association beta file without headers. The betas were generated from the model:
coef(summary(lm(pheno_data ~ geno[,j])))[2,1]

Both genotype data and phenotype data over individuals need to be standardized to have mean = 0 and variance = 1.

annotations

a matrix of annotation variables used to update the betas values through gradient boosted regression tree models. Usually, this can be taken from the summary-level test statistics of matching traits from genome-wide consortia available online. The first column of the matrix must be the SNP IDs and the remaining columns could be additional annotation information. The SNP IDs must be in the same order as those in betas.

pos

an integer indicating which columns of the data matrix annotations is the corresponding consortium value and additionally which columns should also be included.

pos_sign

an integer indicating which column of the data matrix annotations should be used to update the sign of the univariate regression coefficient. Usually, it is set to be the consortium univariate regression coefficient of the same trait.

abs_effect

a vector of integers indicating which columns of the data matrix annotations should be used as absolute effect by taking the absolute sign. For example, when only the strength of the effect rather than the direction of the effect is informative for improving the polygenic score weights.

trait_name

a character for the name of the quantitative trait, assuming the file is named as trait_name_univariate_beta.txt.

which.var

a vector of integers indicating which columns of annotations should be included in the formula to update the weights. Should be have at least length of two and can not take value 1 (the column for SNP ID). If not provided, it will be set to abs_effect.

steps

an integer indicating the current cross-fold

validation

an integer indicating the total number of cross folds. The default and recommended number of cross-fold is 5.

verbose

a logic indicating whether the adjusted prediction r-squared for each tested model with different number of trees should be returned.

interval

an integer indicating the number of iterated cycles to calculate the best trees using gbm. The default value is 200.

sig

the significance level for including a predictor to build the regression trees model.

interact_depth

an integer for the maximum depth of variable interactions used in gbm. See ?gbm. The default here is 5.

shrink

a shrinkage parameter or the learning rate of the tree models in gbm. See ?gbm. The default is 0.001.

bag_frac

a numeric between 0 and 1, controls the fraction of the training set observations randomly selected to propose the next tree. See ?gbm. The default value is 0.5.

max_tree

an integer indicating the total number of trees to fit in gbm. See ?gbm. The default used here is 2000.

WRITE

a logic indicating whether the results of the GraBLD weights should be written to a file with file name trait_name_gbm_beta.txt.

Details

For large datasets, it is recommended to run from the command line with

1
2
3
4
5
6
7
8
9
validation=5
for (( i = 1; i <= $validation; i++))
do
Rscript calculate_gbm.R geno_data
   trait_name annotations_file pos ${i}
   $validation interaction_depth
   shrinkage_parameter bag_fraction
   maximum_tree &
done

where the R script calculate_gbm.R might look something like this, while additional options can be added to the argument list:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
#!/bin/sh
rm(list = ls())
library('GraBLD')
args = (commandArgs(TRUE))
geno_data = args[1]
 trait_name = args[2]
 annotations_file = args[3]
 pos = eval(parse(text=args[4]))
 steps = eval(parse(text=args[5]))
 validation = eval(parse(text=args[6]))
 p1 = eval(parse(text=args[7]))
 p2 = eval(parse(text=args[8]))
 p3 = eval(parse(text=args[9]))
 p4 = eval(parse(text=args[10]))
 betas = load_beta(trait_name)
 annotation = load_database(annotations_file, pos = 2:3)
 geno <- load_geno(geno_data)
 GraB(betas = betas, annotations = annotation,
   trait_name = trait_name, steps = steps, validation = validation,
   interval = 200, sig = 1e-05, interact_depth = p1, shrink = p2,
   bag_frac = p3, max_tree = p4, WRITE = TRUE)

Value

a numeric vector of updated weights with length matching the number of SNPs.

References

Greg Ridgeway with contributions from others (2015). gbm: Generalized Boosted Regression Models. R package version 2.1.1. https://CRAN.R-project.org/package=gbm

Guillaume Pare, Shihong Mao, Wei Q Deng (2017) A machine-learning heuristic to improve gene score prediction of polygenic traits Short title: Machine-learning boosted gene scores, bioRxiv 107409; doi: https://doi.org/10.1101/107409; http://www.biorxiv.org/content/early/2017/02/09/107409

Examples

1
2
3
4
data(univariate_beta)
data(annotation_data)
GraB(betas = univariate_beta, annotations = annotation_data,
   trait_name = 'BMI', steps = 2, validation = 5)

GMELab/GraBLD documentation built on May 4, 2019, 3:20 p.m.