madlib.rpart: MADlib wrapper function for Decision Tree


View source: R/madlib-rpart.R

Description

This function is a wrapper for MADlib's decision tree model training function. The resulting tree is stored in a table in the database, and the result can also be viewed from R using plot.dt.madlib, text.dt.madlib and print.dt.madlib.

Usage

madlib.rpart(formula, data, weights = NULL, id = NULL, na.action = NULL, parms,
control, na.as.level = FALSE, verbose = FALSE, ...) 

Arguments

formula

A formula object. The intercept term is removed automatically. Factors are not expanded into their dummy variables. Grouping syntax is also supported; see madlib.lm and madlib.glm for more details.

data

A db.obj object, which wraps the data in the database.

weights

A string, the column name for the weights.

id

A string, the name of the column that identifies each row. If a key has been specified for data, the key is used as the ID unless this argument is also specified. An ID is required so that the result of predict.dt.madlib can be compared with the original data.

na.action

A function, which filters the NULL values from the data. Not implemented yet.

parms

A list of parameters for the splitting function. Supported parameters include: 'split', which specifies the split function to use. Options are 'gini', 'misclassification' and 'entropy' for classification, and 'mse' for regression. Default is 'gini' for classification and 'mse' for regression.
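
As a rough illustration of how a 'split' entry might be assembled and checked on the R side before it is passed to madlib.rpart, consider the sketch below. The helper make_parms is hypothetical and not part of PivotalR; it only encodes the options and defaults listed above.

```r
# Hypothetical helper (not part of PivotalR): build a 'parms' list and check
# the 'split' value against the options documented above.
make_parms <- function(split = NULL, classification = TRUE) {
    allowed <- if (classification) {
        c("gini", "misclassification", "entropy")
    } else {
        "mse"
    }
    # The first allowed value is the documented default for each case
    if (is.null(split)) split <- allowed[1]
    if (!split %in% allowed) {
        stop("unsupported split '", split, "'; expected one of: ",
             paste(allowed, collapse = ", "))
    }
    list(split = split)
}

parms_cls <- make_parms("entropy")               # classification tree
parms_reg <- make_parms(classification = FALSE)  # regression tree, uses 'mse'
```

The resulting list can then be supplied as the parms argument, for example parms = parms_cls.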

control

A list of parameters that control the fit. Supported parameters include:

'minsplit' - Minimum number of observations that must be present in a node for a split to be attempted, default is minsplit=20

'minbucket' - Minimum number of observations in any terminal node, default is minsplit/3

'maxdepth' - Maximum depth of any node, default is maxdepth=10

'nbins' - Number of bins used to find possible node split threshold values for continuous variables, default is nbins=100 (must be greater than 1)

'cp' - Cost complexity parameter, default is cp=0.01

'n_folds' - Number of cross-validation folds

'max_surrogates' - Number of surrogates to store for each node
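
The defaults described above can be sketched in plain R as follows. The helper make_control is hypothetical and not part of PivotalR. Rounding minsplit/3 to a whole number is an assumption made here, and 'n_folds' and 'max_surrogates' are left out because their defaults are not stated on this page.

```r
# Hypothetical helper (not part of PivotalR): start from the documented
# defaults and override only the entries the caller supplies.
make_control <- function(...) {
    defaults <- list(minsplit = 20, maxdepth = 10, nbins = 100, cp = 0.01)
    ctrl <- modifyList(defaults, list(...))
    # minbucket follows minsplit unless given explicitly
    # (rounding to a whole number is an assumption of this sketch)
    if (is.null(ctrl[["minbucket"]])) {
        ctrl[["minbucket"]] <- round(ctrl[["minsplit"]] / 3)
    }
    # nbins must be greater than 1, as documented
    stopifnot(ctrl[["nbins"]] > 1)
    ctrl
}

ctrl <- make_control(cp = 0.005, minsplit = 30)
```

A list built this way could be passed as control = ctrl. In practice it is enough to pass only the entries that differ from the defaults, as the Examples section does with control = list(cp=0.005).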

na.as.level

A boolean, indicating whether a NULL value for a categorical variable is treated as a distinct level, default is na.as.level=FALSE

verbose

A boolean, indicating whether to print additional information, default is verbose=FALSE

...

Arguments to be passed to or from other methods.

Value

An S3 object of class dt.madlib when no grouping is used, and of class dt.madlib.grp when grouping is used.
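
Code that handles both cases may need to tell the two classes apart. A minimal sketch, assuming only the class names given above, is shown below. The helper is_grouped_fit is hypothetical, and the objects are empty stand-ins rather than real fits.

```r
# Hypothetical helper (not part of PivotalR): report whether a fitted model
# came from a grouped formula, based only on its S3 class.
is_grouped_fit <- function(fit) {
    if (inherits(fit, "dt.madlib.grp")) return(TRUE)
    if (inherits(fit, "dt.madlib")) return(FALSE)
    stop("not a madlib.rpart result")
}

# Stand-in objects that carry only the class attribute; a real fit would
# be returned by madlib.rpart
single_fit  <- structure(list(), class = "dt.madlib")
grouped_fit <- structure(list(), class = "dt.madlib.grp")

is_grouped_fit(single_fit)
is_grouped_fit(grouped_fit)
```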

Author(s)

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io

References

[1] Documentation of decision tree in MADlib 1.6, https://madlib.apache.org/docs/latest/

See Also

plot.dt.madlib, text.dt.madlib and print.dt.madlib are visualization functions for a model fitted through madlib.rpart.

predict.dt.madlib is a wrapper for MADlib's predict function for decision trees.

madlib.lm, madlib.glm, madlib.summary, madlib.arima, madlib.elnet are all MADlib wrapper functions.

Examples

## Not run: 


## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

x <- as.db.data.frame(abalone, conn.id = cid, verbose = FALSE)
lk(x, 10)

## decision tree using abalone data, using default values of minsplit,
## maxdepth etc. The "id" column is set as the key, so it is used as the row ID.
key(x) <- "id"
fit <- madlib.rpart(rings < 10 ~ length + diameter + height + whole + shell,
       data=x, parms = list(split='gini'), control = list(cp=0.005))
fit

## Another example, using grouping
fit <- madlib.rpart(rings < 10 ~ length + diameter + height + whole + shell | sex,
       data=x, parms = list(split='gini'), control = list(cp=0.005))
fit

db.disconnect(cid)

## End(Not run)

PivotalR documentation built on March 13, 2021, 1:06 a.m.