GraB: Prediction using the Optimal Gradient Boosted Regression...
In GMELab/GraBLD: Gradient Boosted and LD adjusted

Description Usage Arguments Details Value References Examples

The function returns the prediction of the polygenic gene score weights based on the optimal gradient boosted regression trees model.

GraB(betas, annotations, pos = 2, pos_sign = 3, abs_effect = 2:5,
  trait_name = NULL, which.var = 2:5, steps = 1, validation = 5,
  verbose = FALSE, interval = 200, sig = 1e-05, interact_depth = 5,
  shrink = 0.001, bag_frac = 0.5, max_tree = 2000, WRITE = FALSE)

`betas`	a matrix of regression coefficients from association analysis in the target population. The first column is the chromosome for each SNP, and the column with the regression coefficient should be specified by setting `pos`. The default value for `pos` is 2. The SNP IDs or other information could be present as additional columns. Users need to prepare univariate association beta file without headers. The betas were generated from the model: `coef(summary(lm(pheno_data ~ geno[,j])))[2,1]` Both genotype data and phenotype data over individuals need to be standardized to have `mean` = 0 and `variance` = 1.
`annotations`	a matrix of annotation variables used to update the `betas` values through gradient boosted regression tree models. Usually, this can be taken from the summary-level test statistics of matching traits from genome-wide consortia available online. The first column of the matrix must be the SNP IDs and the remaining columns could be additional annotation information. The SNP IDs must be in the same order as those in `betas`.
`pos`	an integer indicating which columns of the data matrix `annotations` is the corresponding consortium value and additionally which columns should also be included.
`pos_sign`	an integer indicating which column of the data matrix `annotations` should be used to update the sign of the univariate regression coefficient. Usually, it is set to be the consortium univariate regression coefficient of the same trait.
`abs_effect`	a vector of integers indicating which columns of the data matrix `annotations` should be used as absolute effect by taking the absolute sign. For example, when only the strength of the effect rather than the direction of the effect is informative for improving the polygenic score weights.
`trait_name`	a character for the name of the quantitative trait, assuming the file is named as `trait_name_univariate_beta.txt`.
`which.var`	a vector of integers indicating which columns of `annotations` should be included in the formula to update the weights. Should be have at least length of two and can not take value 1 (the column for SNP ID). If not provided, it will be set to `abs_effect`.
`steps`	an integer indicating the current cross-fold
`validation`	an integer indicating the total number of cross folds. The default and recommended number of cross-fold is 5.
`verbose`	a logic indicating whether the adjusted prediction r-squared for each tested model with different number of trees should be returned.
`interval`	an integer indicating the number of iterated cycles to calculate the best trees using `gbm`. The default value is 200.
`sig`	the significance level for including a predictor to build the regression trees model.
`interact_depth`	an integer for the maximum depth of variable interactions used in `gbm`. See ?`gbm`. The default here is 5.
`shrink`	a shrinkage parameter or the learning rate of the tree models in `gbm`. See ?`gbm`. The default is 0.001.
`bag_frac`	a numeric between 0 and 1, controls the fraction of the training set observations randomly selected to propose the next tree. See ?`gbm`. The default value is 0.5.
`max_tree`	an integer indicating the total number of trees to fit in `gbm`. See ?`gbm`. The default used here is 2000.
`WRITE`	a logic indicating whether the results of the GraBLD weights should be written to a file with file name `trait_name_gbm_beta.txt`.

For large datasets, it is recommended to run from the command line with

validation=5
for (( i = 1; i <= $validation; i++))
do
Rscript calculate_gbm.R geno_data
   trait_name annotations_file pos ${i}
   $validation interaction_depth
   shrinkage_parameter bag_fraction
   maximum_tree &
done

where the R script calculate_gbm.R might look something like this, while additional options can be added to the argument list:

#!/bin/sh
rm(list = ls())
library('GraBLD')
args = (commandArgs(TRUE))
geno_data = args[1]
 trait_name = args[2]
 annotations_file = args[3]
 pos = eval(parse(text=args[4]))
 steps = eval(parse(text=args[5]))
 validation = eval(parse(text=args[6]))
 p1 = eval(parse(text=args[7]))
 p2 = eval(parse(text=args[8]))
 p3 = eval(parse(text=args[9]))
 p4 = eval(parse(text=args[10]))
 betas = load_beta(trait_name)
 annotation = load_database(annotations_file, pos = 2:3)
 geno <- load_geno(geno_data)
 GraB(betas = betas, annotations = annotation,
   trait_name = trait_name, steps = steps, validation = validation,
   interval = 200, sig = 1e-05, interact_depth = p1, shrink = p2,
   bag_frac = p3, max_tree = p4, WRITE = TRUE)

a numeric vector of updated weights with length matching the number of SNPs.

Greg Ridgeway with contributions from others (2015). gbm: Generalized Boosted Regression Models. R package version 2.1.1. https://CRAN.R-project.org/package=gbm

Guillaume Pare, Shihong Mao, Wei Q Deng (2017) A machine-learning heuristic to improve gene score prediction of polygenic traits Short title: Machine-learning boosted gene scores, bioRxiv 107409; doi: https://doi.org/10.1101/107409; http://www.biorxiv.org/content/early/2017/02/09/107409

data(univariate_beta)
data(annotation_data)
GraB(betas = univariate_beta, annotations = annotation_data,
   trait_name = 'BMI', steps = 2, validation = 5)