Returns the predicted polygenic gene score weights from the optimal gradient boosted regression trees model.
betas |
a matrix of regression coefficients from association analysis in the target population.
The first column is the chromosome for each SNP, and the column with the regression coefficient should be
specified by setting Both genotype data and phenotype data over individuals need to be standardized
to have |
annotations |
a matrix of annotation variables used to update the |
pos |
an integer indicating which columns of the data matrix |
pos_sign |
an integer indicating which column of the data matrix |
abs_effect |
a vector of integers indicating which columns of the data matrix |
trait_name |
a character string giving the name of the quantitative trait, assuming the file is named as |
which.var |
a vector of integers indicating which columns of |
steps |
an integer indicating the current cross-fold |
validation |
an integer indicating the total number of cross-validation folds. The default and recommended number of folds is 5. |
verbose |
a logical value indicating whether the adjusted prediction r-squared for each tested model with a different number of trees should be returned. |
interval |
an integer indicating the number of iterated cycles to calculate the best trees using |
sig |
the significance level for including a predictor to build the regression trees model. |
interact_depth |
an integer for the maximum depth of variable interactions used in |
shrink |
a shrinkage parameter or the learning rate of the tree models in |
bag_frac |
a numeric value between 0 and 1 that controls the fraction of the training set
observations randomly selected to propose the next tree. See ? |
max_tree |
an integer indicating the total number of trees to fit in |
WRITE |
a logical value indicating whether the GraBLD weights should be written to a file with
file name |
For large datasets, it is recommended to run from the command line with
    validation=5
    for (( i = 1; i <= $validation; i++ ))
    do
        # Launch one background job per cross-validation fold.
        Rscript calculate_gbm.R geno_data \
            trait_name annotations_file pos ${i} \
            $validation interaction_depth \
            shrinkage_parameter bag_fraction \
            maximum_tree &
    done
where the R script calculate_gbm.R
might look something like this, while additional options can be
added to the argument list:
    #!/usr/bin/env Rscript
    rm(list = ls())
    library('GraBLD')
    # Positional arguments passed in by the shell loop.
    args = commandArgs(TRUE)
    geno_data = args[1]
    trait_name = args[2]
    annotations_file = args[3]
    # Arguments arrive as strings; parse and evaluate them to get R values.
    pos = eval(parse(text=args[4]))
    steps = eval(parse(text=args[5]))        # current fold
    validation = eval(parse(text=args[6]))   # total number of folds
    p1 = eval(parse(text=args[7]))           # interaction depth
    p2 = eval(parse(text=args[8]))           # shrinkage parameter
    p3 = eval(parse(text=args[9]))           # bag fraction
    p4 = eval(parse(text=args[10]))          # maximum number of trees
    betas = load_beta(trait_name)
    annotation = load_database(annotations_file, pos = 2:3)
    geno <- load_geno(geno_data)
    GraB(betas = betas, annotations = annotation,
         trait_name = trait_name, steps = steps, validation = validation,
         interval = 200, sig = 1e-05, interact_depth = p1, shrink = p2,
         bag_frac = p3, max_tree = p4, WRITE = TRUE)
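The loop above starts each fold in the background with `&`, so the calling shell returns before any fold has finished. Below is a minimal sketch of one way to wait for every fold and count failures before using the results. The function `run_fold` is a hypothetical stand-in introduced only for this illustration: it prints a message instead of calling `Rscript calculate_gbm.R`, so the sketch runs without R installed. The names `pids` and `failed` are likewise assumptions of this sketch, not part of GraBLD.

```shell
#!/bin/sh
# Sketch: run one job per cross-validation fold in parallel, then wait
# for every job and count how many exited with a non-zero status.

validation=5

# run_fold is a hypothetical placeholder for the real per-fold command,
# e.g. the Rscript call shown in the loop above with "$1" as the fold.
# Here it only echoes, so the sketch does not depend on R or on any files.
run_fold() {
    echo "fold $1 of $validation"
}

pids=""
i=1
while [ "$i" -le "$validation" ]; do
    run_fold "$i" &
    pids="$pids $!"          # remember the process ID of this background job
    i=$((i + 1))
done

failed=0
for pid in $pids; do
    # wait returns the exit status of the job with the given process ID
    wait "$pid" || failed=$((failed + 1))
done

echo "folds launched: $validation, failed: $failed"
```

Because the folds run concurrently, the per-fold lines may appear in any order. Replacing the body of `run_fold` with the actual `Rscript` call would leave the waiting logic unchanged.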
a numeric vector of updated weights with length matching the number of SNPs.
Greg Ridgeway with contributions from others (2015). gbm: Generalized Boosted Regression Models. R package version 2.1.1. https://CRAN.R-project.org/package=gbm
Guillaume Pare, Shihong Mao, Wei Q Deng (2017). A machine-learning heuristic to improve gene score prediction of polygenic traits. bioRxiv 107409. https://doi.org/10.1101/107409; http://www.biorxiv.org/content/early/2017/02/09/107409
    data(univariate_beta)
    data(annotation_data)
    GraB(betas = univariate_beta, annotations = annotation_data,
         trait_name = 'BMI', steps = 2, validation = 5)