Description Usage Arguments Details Value Author(s) References Examples
Ensemble predictor comprised of individual generalized linear model predictors.
randomGLM(
# Input data
x, y, xtest = NULL,
# Include interactions?
maxInteractionOrder = 1,
# Prediction type
classify = is.factor(y) | length(unique(y)) < 4,
# Multilevel classification options - only apply to classification with multilevel response
multiClass.global = TRUE,
multiClass.pairwise = FALSE,
multiClass.minObs = 1,
multiClass.ignoreLevels = NULL,
# Sampling options
nBags = 100,
replace = TRUE,
sampleWeight=NULL,
nObsInBag = if (replace) nrow(x) else as.integer(0.632 * nrow(x)),
nFeaturesInBag = ceiling(ifelse(ncol(x)<=10, ncol(x),
ifelse(ncol(x)<=300, (1.0276-0.00276*ncol(x))*ncol(x), ncol(x)/5))),
minInBagObs = min( max( nrow(x)/2, 5), 2*nrow(x)/3),
# Individual ensemble member predictor options
nCandidateCovariates=50,
corFncForCandidateCovariates= cor,
corOptionsForCandidateCovariates = list(method = "pearson", use="p"),
mandatoryCovariates = NULL,
interactionsMandatory = FALSE,
keepModels = is.null(xtest),
# Miscellaneous options
thresholdClassProb = 0.5,
interactionSeparatorForCoefNames = ".times.",
randomSeed = 12345,
nThreads = NULL,
verbose = 0)

x 
a matrix whose rows correspond to observations (samples) and whose columns correspond to features (also known as covariates or variables). 
y 
outcome variable corresponding to the rows of x. 
xtest 
an optional matrix of a second data set (referred to as the test data set, while the data in x and y are interpreted as the training data). The columns of xtest must correspond to the columns of x. 

maxInteractionOrder 
integer specifying the maximum interaction level. The default is to have no interactions; numbers higher than 1 specify interactions up to that order. For example, 3 means quadratic and cubic interactions will be included. Warning: higher order interactions greatly increase the computation time. We see no benefit of using maxInteractionOrder>2. 
classify 
logical: should classification be performed? If TRUE, the predictor performs (possibly multilevel) classification using logistic regression models; if FALSE, it predicts a quantitative outcome using linear regression models. The default is to classify when y is a factor or takes fewer than 4 distinct values. 
multiClass.global 
for multilevel classification, this logical argument controls whether binary variables of the type "level vs. all others" are included in the series of binary variables to which classification is applied. 
multiClass.pairwise 
for multilevel classification, this logical argument controls whether binary variables of the type "level A vs. level B" are included in the series of binary variables to which classification is applied. 
multiClass.minObs 
an integer specifying the minimum number of observations a level must have for it to be considered when creating "level vs. all" and "level vs. level" binary variables. 
multiClass.ignoreLevels 
optional specification of the values (levels) of the input response y that should be ignored when constructing the "level vs. all" and "level vs. level" binary variables. 

nBags 
number of bags (bootstrap samples) for defining the ensemble predictor, i.e. this also corresponds to the number of individual GLMs. 
replace 
logical: should the bootstrap sampling of observations be performed with replacement? 
sampleWeight 
weight assigned to each observation (sample) during bootstrap sampling. The default NULL means all observations are weighted equally. 
nObsInBag 
number of observations selected for each bag. Typically, a bootstrap sample (bag) has the
same number of observations as the original data set (i.e., the number of rows of x). 
nFeaturesInBag 
number of features randomly selected for each bag. Features are randomly selected
without replacement. If there are no interaction terms, this number should be smaller than or equal to
the number of columns of x. 
minInBagObs 
minimum number of unique observations that constitute a valid bag. If the sampling
produces a bag with fewer than this number of unique observations, the bag is discarded and resampled
until the number of unique observations is at least minInBagObs. 
nCandidateCovariates 
positive integer. The number of features considered for forward selection in each GLM (and in each bag). For each bag, the covariates are chosen according to their highest absolute correlation with the outcome. A binary outcome is first turned into a binary numeric variable. 
corFncForCandidateCovariates 
the correlation function used to select candidate covariates. Choices
include cor and the biweight midcorrelation bicor from the WGCNA package. 
corOptionsForCandidateCovariates 
list of arguments to the correlation function. Note that robust correlations are sometimes problematic for binary class outcomes. When using the robust
correlation bicor, use the argument robustY = FALSE. 
mandatoryCovariates 
indices of features that are included as mandatory covariates in each GLM model. The default is no mandatory features. This allows the user to "force" variables into each GLM. 
interactionsMandatory 
logical: should interactions of mandatory covariates be mandatory as well? Interactions are only included up to the level specified in maxInteractionOrder. 
keepModels 
logical: should the regression models for each bag be kept? The models are necessary for future predictions using the predict method. 
thresholdClassProb 
number in the interval [0,1]. Recommended value 0.5. This parameter is only relevant for a binary outcome, i.e. a logistic regression model; the threshold is applied to the predicted class probabilities to arrive at the binary class outcome. 
interactionSeparatorForCoefNames 
a character string used to separate feature names when
forming names of interaction terms. This is only used when interactions are actually taken into account (see
maxInteractionOrder above). 
randomSeed 
NULL or integer. The seed for the random number generator. If NULL, the seed will not be set. If non-NULL and the random generator has been initialized prior to the function call, the latter's state is saved and restored upon exit. 
nThreads 
number of threads (worker processes) to perform the calculation. If not given, will be determined automatically as the number of available cores if the latter is 3 or less, and number of cores minus 1 if the number of available cores is 4 or more. Invalid entries (missing value, zero or negative values etc.) are changed to 1, with a warning. 
verbose 
value 0 or 1 which determines the level of verbosity. Zero means silent, 1 reports the bag number the function is working on. At this point verbose output only works if nThreads is 1. 
At this point, the function randomGLM can be used to predict a binary outcome or a
quantitative numeric outcome. This ensemble predictor proceeds along the following steps.
Step 1 (bagging): nBags bootstrapped data sets are generated by random sampling from the
original training data set (x, y). If a bag contains fewer than minInBagObs unique
observations, or contains all observations, it is discarded and resampled.
Step 2 (random subspace): For each bag, nFeaturesInBag features are randomly selected (without
replacement) from the columns of x. Optionally, interaction terms between the selected features can
be formed (see the argument maxInteractionOrder).
Step 3 (feature ranking): In each bag, features are ranked according to their correlation with the outcome
measure. The top nCandidateCovariates are then considered for forward selection in each GLM
(and in each bag).
Step 4 (forward selection): Forward variable selection is employed to define a multivariate GLM model of the outcome in each bag.
Step 5 (aggregating the predictions): Predictions from each bag are aggregated. In the case of a quantitative outcome, the predictions are simply averaged across the bags.
Generally, nCandidateCovariates > 100 is not recommended, because the forward
selection process is time-consuming. If the arguments "nBags=1, replace=FALSE, nObsInBag=nrow(x)"
are used, the function becomes a forward selection GLM predictor without bagging.
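The five steps above can be sketched in base R as follows. This is a simplified illustration on made-up data, not the actual randomGLM implementation: it omits interactions, out-of-bag bookkeeping, and classification, and all parameter values here are arbitrary.

```r
# Toy training data: 60 observations, 10 features, outcome driven by F1 and F3
set.seed(12345)
n <- 60; p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("F", 1:p)))
y <- x[, 1] - 2 * x[, 3] + rnorm(n)

nBags <- 20; nFeaturesInBag <- 5; nCandidateCovariates <- 3; minInBagObs <- 10

bagPredictions <- matrix(NA_real_, n, nBags)
for (b in 1:nBags) {
  # Step 1 (bagging): resample until the bag has enough, but not all, unique observations
  repeat {
    obs <- sample(n, replace = TRUE)
    if (length(unique(obs)) >= minInBagObs && length(unique(obs)) < n) break
  }
  # Step 2 (random subspace): select features without replacement
  feats <- sample(p, nFeaturesInBag)
  # Step 3 (feature ranking): keep the features most correlated with the outcome
  absCor <- abs(cor(x[obs, feats], y[obs], use = "p"))
  candidates <- feats[order(absCor, decreasing = TRUE)][1:nCandidateCovariates]
  d <- data.frame(y = y[obs], x[obs, candidates, drop = FALSE])
  # Step 4 (forward selection): grow a GLM from the intercept-only model
  full <- formula(paste("y ~", paste(colnames(x)[candidates], collapse = " + ")))
  fit <- step(glm(y ~ 1, data = d), scope = full, direction = "forward", trace = 0)
  bagPredictions[, b] <- predict(fit, newdata = data.frame(x))
}
# Step 5 (aggregation): average predictions across bags (quantitative outcome)
ensemblePrediction <- rowMeans(bagPredictions)
```

For a binary outcome, the analogous sketch would fit glm(..., family = binomial), average the predicted class probabilities across bags, and threshold them at thresholdClassProb.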
Classification of multilevel categorical responses is performed indirectly by turning the single
multiclass response into a set of binary variables. The set can include two types of binary variables:
Level vs. all others (this binary variable is 1 when the original response equals the level and zero
otherwise), and level A vs. level B (this binary variable is 0 when the response equals level A, 1 when the
response equals level B, and NA otherwise).
For example, if the input response y
contains observations with values (levels) "A", "B",
"C", the binary variables
will have names "all.vs.A" (1 means "A", 0 means all others), "all.vs.B",
"all.vs.C", and optionally also "A.vs.B" (0 means "A", 1 means "B", NA means neither "A" nor "B"), "A.vs.C",
and "B.vs.C".
Note that using pairwise level vs. level binary variables can be
very time-consuming since the number of such binary variables grows quadratically with the number of levels
in the response. The user has the option to limit which levels of the original response will have their
"own" binary variables, by setting the minimum observations a level must have to qualify for its own binary
variable, and by explicitly enumerating levels that should not have their own binary variables. Note that
such "ignored" levels are still included on the "all" side of "level vs. all" binary variables.
At this time the predictor does not attempt to summarize the binary variable classifications into a single multilevel classification.
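For illustration, the two kinds of binary variables described above can be constructed by hand (a minimal sketch on made-up data; the package builds these internally):

```r
# A multilevel response with levels "A", "B", "C"
y <- factor(c("A", "A", "B", "C", "B", "C"))
# "level vs. all": 1 when the response equals the level, 0 otherwise
all.vs.A <- as.numeric(y == "A")
# "level A vs. level B": 0 for "A", 1 for "B", NA for any other level
A.vs.B <- ifelse(y == "A", 0, ifelse(y == "B", 1, NA))
```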
Training this predictor on data with fewer than 8 observations is not recommended (and the function will warn about it). Due to the bagging step, the number of unique observations in each bag is less than the number of observations in the input data; the low number of unique observations can (and often will) lead to an essentially perfect fit, which makes it impossible to perform meaningful stepwise model selection.
Feature names: In general, the column names of the input x are assumed to be the feature names. If
x has no column names (i.e., colnames(x) is NULL), standard column names of the form
"F01", "F02", ... are used. If x has non-NULL column names, they are turned into valid and
unique names using the function make.names. If make.names returns
names that are not the same as the column names of x, the component featureNamesChanged will
be TRUE and the component nameTranslationTable contains the information about input and actually
used feature names. The feature names are used as predictor names in the individual models in each bag.
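As an illustration of the renaming, make.names (base R) turns invalid or duplicated column names into valid, unique ones:

```r
# Column names with a space, a hyphen, and a duplicate
rawNames <- c("gene 1", "gene-1", "gene 1")
fixed <- make.names(rawNames, unique = TRUE)
# fixed is c("gene.1", "gene.1.1", "gene.1.2")
```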
The function returns an object of class randomGLM. For continuous prediction or two-level
classification, this is a list with the following components:
predictedOOB 
the continuous prediction (if classify is FALSE) or predicted classification (if classify is TRUE) of the input data, based on out-of-bag samples. 
predictedOOB.response 
in case of a binary outcome, this is the predicted probability of each outcome
specified by y, based on out-of-bag samples. 
predictedTest.cont 
if a test set is given, the predicted probability of each outcome specified by y for the test data. 
predictedTest 
if a test set is given, the predicted classification for the test data. Only for binary outcomes. 
candidateFeatures 
candidate features in each bag. A list with one component per bag. Each component
is a matrix with maxInteractionOrder rows and nCandidateCovariates columns; each column represents one candidate term, encoded by the indices of the interacting features. 
featuresInForwardRegression 
features selected by forward selection in each bag. A list with one
component per bag. Each component
is a matrix with maxInteractionOrder rows; each column represents one term of the model, encoded as in candidateFeatures. 
coefOfForwardRegression 
coefficients of forward regression. A list with one
component per bag. Each component is a vector giving the coefficients of the model determined by forward
selection in the corresponding bag. The order of the coefficients is the same as the order of the terms in
the corresponding component of featuresInForwardRegression. 
interceptOfForwardRegression 
a vector with one component per bag giving the intercept of the regression model in each bag. 
bagObsIndx 
a matrix with nObsInBag rows and nBags columns, giving the indices of the observations selected for each bag. 
timesSelectedByForwardRegression 
a matrix with maxInteractionOrder rows and one column per feature, giving the number of times each feature (at each interaction order) was selected by forward regression across the bags. 
models 
the regression models for each bag. Predictor features in each bag's model are named using their feature names (see above). 
featureNamesChanged 
logical indicating whether feature names were copied verbatim from the column names
of x (FALSE) or had to be changed to make them valid and unique (TRUE). 
nameTranslationTable 
only present if the above featureNamesChanged is TRUE; a table giving the correspondence between the input feature names and the names actually used. 
In addition, the output value contains a copy of several input arguments. These are included to facilitate
prediction using the predict
method. These returned values should be considered undocumented and may
change in the future.
In the multilevel classification case, the returned list (still considered a valid
randomGLM object) contains the following components:
binaryPredictors 
a list with one component per binary variable, containing the randomGLM predictor trained on that binary variable. The names of the components are the names of the binary variables. 
predictedOOB 
a matrix in which columns correspond to the binary variables and rows to samples, containing the predicted binary classification for each binary variable. Column names and the meaning of 0 and 1 are described above. 
predictedOOB.response 
a matrix with two columns per binary variable, giving the class probabilities for each of the two classes in each binary variable. Column names contain the variable and class names. 
levelMatrix 
a character matrix with two rows and one column per binary variable, giving the level
corresponding to value 0 (row 1) and the level corresponding to value 1 (row 2). This encodes the same
information as the names of the binaryPredictors components. 
If the input xtest is non-NULL, the components predictedTest and predictedTest.response
contain test set predictions analogous to predictedOOB and predictedOOB.response.
Lin Song, Steve Horvath, Peter Langfelder.
The function makes use of the glm
function and other standard R functions.
Lin Song, Peter Langfelder, Steve Horvath: Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics (2013)
## binary outcome prediction
# data generation
data(iris)
# Restrict data to first 100 observations
iris=iris[1:100,]
# Turn Species into a factor
iris$Species = as.factor(as.character(iris$Species))
# Select a training and a test subset of the 100 observations
set.seed(1)
indx = sample(100, 67, replace=FALSE)
xyTrain = iris[indx,]
xyTest = iris[-indx,]
xTrain = xyTrain[, -5]
yTrain = xyTrain[, 5]
xTest = xyTest[, -5]
yTest = xyTest[, 5]
# predict with a small number of bags - normally nBags should be at least 100.
RGLM = randomGLM(xTrain, yTrain, xTest, nCandidateCovariates=ncol(xTrain), nBags=30, nThreads = 1)
yPredicted = RGLM$predictedTest
table(yPredicted, yTest)
## continuous outcome prediction
x=matrix(rnorm(100*20),100,20)
y=rnorm(100)
xTrain = x[1:50,]
yTrain = y[1:50]
xTest = x[51:100,]
yTest = y[51:100]
RGLM = randomGLM(xTrain, yTrain, xTest, classify=FALSE, nCandidateCovariates=ncol(xTrain), nBags=10,
keepModels = TRUE, nThreads = 1)
