madlib.randomForest: MADlib wrapper function for Random Forest


View source: R/madlib-randomForest.R

Description

This function wraps MADlib's random forest model training function. The resulting forest is stored in a table in the database, and the result can also be viewed from R using print.rf.madlib.

Usage

madlib.randomForest(formula, data, id = NULL, ntree = 100, mtry = NULL,
importance = FALSE, nPerm = 1, na.action = NULL, control,
na.as.level = FALSE, verbose = FALSE, ...) 

Arguments

formula

A formula object. The intercept term is removed automatically, and factors are not expanded into dummy variables. Grouping syntax is also supported; see madlib.lm and madlib.glm for more details.

data

A db.obj object, which wraps the data in the database.

id

A string, the name of the column that indexes each row. If a key has been specified for data, the key is used as the ID unless this argument is also specified. An ID is required so that the result of predict.rf.madlib can be compared with the original data.

ntree

An integer, maximum number of trees to grow in the random forest model, default is 100.

mtry

An integer, number of features randomly selected for each split.

importance

A boolean, whether or not to calculate variable importance, default is FALSE.

nPerm

An integer, number of times to permute each feature value while calculating variable importance, default is 1.

na.action

A function, which filters the NULL values from the data. Not implemented yet.

control

A list of parameters for the fit. Supported parameters include:

'minsplit' - Minimum number of observations that must be present in a node for a split to be attempted, default is minsplit=20

'minbucket' - Minimum number of observations in any terminal node, default is minsplit/3

'maxdepth' - Maximum depth of any node, default is maxdepth=10

'nbins' - Number of bins used to find candidate split threshold values for continuous variables, default is 100 (must be greater than 1)

'max_surrogates' - Number of surrogate splits at each node in the trees constructed.

na.as.level

A boolean, indicating whether a NULL value for a categorical variable is treated as a distinct level, default is na.as.level=FALSE

verbose

A boolean, indicating whether or not to print more information, default is verbose=FALSE

...

Arguments to be passed to or from other methods.
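
As a rough illustration of how the arguments above fit together, the sketch below builds a control list and wraps a call that also requests variable importance. The parameter values are illustrative only, and fit_forest is a hypothetical helper, not part of PivotalR; it assumes x is a db.obj with an "id" column, as in the Examples section, and it needs PivotalR and a live database connection when actually called.

```r
# Control parameters named in the 'control' argument description;
# the values here are illustrative, not recommendations.
ctrl <- list(
  minsplit = 10,       # attempt a split only in nodes with at least 10 rows
  minbucket = 4,       # keep at least 4 rows in every terminal node
  maxdepth = 5,        # limit the depth of each tree
  nbins = 50,          # bins for continuous split thresholds (must be > 1)
  max_surrogates = 0   # no surrogate splits
)

# Hypothetical helper (assumption): 'x' is a db.obj wrapping the abalone
# table. The function body only runs when called, so defining it does not
# require a database connection.
fit_forest <- function(x, control = ctrl, n_perm = 3) {
  madlib.randomForest(
    rings < 10 ~ length + diameter + height + whole + shell,
    data = x, id = "id", ntree = 50,
    importance = TRUE, nPerm = n_perm,
    control = control
  )
}
```

With a connection set up as in the Examples section, fit_forest(x) would train the forest with these settings.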

Value

An S3 object of class rf.madlib when no grouping is used, and of class rf.madlib.grp when grouping is used.
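
Because the class of the returned object depends on whether grouping was used, downstream code may want to branch on it. A minimal sketch, assuming only the two class names stated above; is_grouped_fit is a hypothetical helper, not a PivotalR function, and the mock objects stand in for real fits.

```r
# Hypothetical helper: TRUE when the fit came from a grouped formula
# (class rf.madlib.grp), FALSE otherwise.
is_grouped_fit <- function(fit) {
  inherits(fit, "rf.madlib.grp")
}

# Mock objects carrying only the class attribute, for illustration.
mock_single  <- structure(list(), class = "rf.madlib")
mock_grouped <- structure(list(), class = "rf.madlib.grp")

is_grouped_fit(mock_single)   # FALSE
is_grouped_fit(mock_grouped)  # TRUE
```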

Author(s)

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc.

References

[1] Documentation of random forest in MADlib 1.7, http://doc.madlib.net/latest/

See Also

print.rf.madlib prints a summary of a model fitted through madlib.randomForest.

predict.rf.madlib is a wrapper for MADlib's predict function for random forests.

madlib.lm, madlib.glm, madlib.summary, madlib.arima, madlib.elnet, madlib.rpart are all MADlib wrapper functions.

Examples

## Not run: 


## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

x <- as.db.data.frame(abalone, conn.id = cid, verbose = FALSE)
lk(x, 10)

## random forest using abalone data, using default values of minsplit,
## maxdepth etc.
key(x) <- "id"
fit <- madlib.randomForest(rings < 10 ~ length + diameter + height + whole + shell,
       data=x)
fit

## Another example, using grouping
fit <- madlib.randomForest(rings < 10 ~ length + diameter + height + whole + shell | sex,
       data=x)
fit

db.disconnect(cid)

## End(Not run)

PivotalR documentation built on May 30, 2017, 8:18 a.m.