The wrapper function for MADlib linear regression. Heteroskedasticity can be detected using the Breusch-Pagan test. One or more columns can be used to separate the data set into groups according to the values of the grouping columns. Linear regression is then applied to each group, within which the grouping columns have fixed values. Categorical variables are supported; see the details below. The computation is parallelized by MADlib if the connected database is a Greenplum database. The regression can also be done on a column of the data table that is an array.
formula |
an object of class " |
data |
An object of |
na.action |
A string which indicates what should happen when the data
contain |
hetero |
A logical value with default value |
na.as.level |
A logical value, default is |
... |
Further parameters can be passed to this function. Currently this is only a placeholder, and any parameter passed here is ignored. |
For details about how to write a formula, see formula. "|" can be used at the end of the formula to denote that the fitting is done conditioned on the values of one or more variables. For example, y ~ x + sin(z) | v + w fits a separate model for each distinct combination of the values of v and w.
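The conditioning described above can be pictured with a local, base-R analogue. This is only a sketch of the idea, not how PivotalR or MADlib implements grouping inside the database; the data frame d and its columns are made up for illustration.

```r
## Local analogue of y ~ x + sin(z) | v + w, using base R only.
## The data frame d and its columns are hypothetical.
set.seed(1)
d <- data.frame(
  x = rnorm(40),
  z = runif(40),
  v = rep(c("a", "b"), each = 20),
  w = rep(c(1, 2), times = 20)
)
d$y <- 1 + 2 * d$x + sin(d$z) + rnorm(40, sd = 0.1)

## One data frame per distinct (v, w) combination
groups <- split(d, list(d$v, d$w), drop = TRUE)

## One ordinary least-squares fit per group
fits <- lapply(groups, function(g) lm(y ~ x + sin(z), data = g))

## One row of coefficients per group
coefs <- t(sapply(fits, coef))
print(coefs)
```

Each row of coefs holds the coefficients for one combination of v and w, which mirrors the layout of the coef matrix that madlib.lm returns when grouping is used.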
Both the linear regression (this function) and the logistic regression (madlib.glm) support categorical variables. Use as.factor,db.obj-method to denote that a variable is categorical, and the corresponding dummy variables are created and fitted. See as.factor,db.obj-method for more.
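As a rough illustration of what "dummy variables" means here, the base-R sketch below expands a factor into indicator columns with model.matrix. It assumes R's default treatment contrasts; the column names that PivotalR generates in the database, and the level it treats as the reference, may differ. The small data frame is made up for illustration.

```r
## Base-R illustration of dummy coding for a categorical variable.
## The data frame d is hypothetical.
d <- data.frame(
  sex = factor(c("F", "I", "M", "M", "F")),
  length = c(0.45, 0.33, 0.44, 0.53, 0.53)
)

## With default treatment contrasts the first level ("F") is the
## reference, and every other level gets its own 0/1 column.
mm <- model.matrix(~ sex + length, data = d)
print(colnames(mm))
print(mm)
```

A factor with k levels contributes k - 1 columns when an intercept is present, so the fitted model has one coefficient per non-reference level.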
If there is no grouping (i.e. no | in the formula), the result is an lm.madlib object. Otherwise, it is an lm.madlib.grps object, which is just a list of lm.madlib objects.
An lm.madlib object is a list which contains the following items:
grouping column(s) |
When there are grouping columns in the formula, the resulting list has multiple items, each of which has the same name as one of the grouping columns. All of these items are vectors of the same length, which is equal to the number of distinct combinations of all the grouping column values. Taken together, each row of these items is one distinct combination of the grouping values. When there is no grouping column in the formula, no such items appear in the resulting list. |
coef |
A numeric matrix, the fitted coefficients. Each row contains the coefficients of the linear regression for one group of data, so the number of rows is equal to the number of distinct combinations of all the grouping column values. The number of columns is equal to the number of features (including the intercept if it is present in the formula). |
r2 |
A numeric array. R2 values for all combinations of the grouping column values. |
std_err |
A numeric matrix, the standard error for each coefficient. |
t_stats |
A numeric matrix, the t-statistics for each coefficient, which is
the absolute value of the ratio of |
p_values |
A numeric matrix, the p-values of |
condition_no |
A numeric array, the condition number for all combinations of the grouping column values. |
bp_stats |
A numeric array when |
bp_p_value |
A numeric array when |
grps |
An integer, the number of groups that the data is divided into according to the grouping columns in the formula. |
grp.cols |
An array of strings. The column names of the grouping columns. |
has.intercept |
A logical, whether the intercept is included in the fitting. |
ind.vars |
An array of strings, all the different terms used as independent variables in the fitting. |
ind.str |
A string. The independent variables in an array format string. |
call |
A language object. The function call that generates this result. |
col.name |
An array of strings. The column names used in the fitting. |
appear |
An array of strings, the same length as the number of independent
variables. The strings are used to print a clean result, especially when
we are dealing with the factor variables, where the dummy variable
names can be very long due to the inserting of a random string to
avoid naming conflicts, see |
model |
A |
terms |
A |
nobs |
The number of observations used to fit the model. |
data |
A |
origin.data |
The original |
Note that if grouping is done and the final result contains multiple lm.madlib objects, each of them contains the same copy of model.
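The bp_stats and bp_p_value items refer to the Breusch-Pagan test. The base-R sketch below computes the studentized (Koenker) form of the statistic on simulated data, to show what kind of quantity these items hold. It is an assumption that this form matches the variant MADlib computes; the simulated x and y are made up for illustration.

```r
## Studentized Breusch-Pagan statistic in base R (illustrative sketch;
## MADlib's exact variant may differ).
set.seed(42)
n <- 200
x <- runif(n, 1, 5)
## The error standard deviation grows with x, so the data are
## heteroskedastic by construction.
y <- 1 + 2 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)
e2 <- residuals(fit)^2

## Auxiliary regression of the squared residuals on the same predictors
aux <- lm(e2 ~ x)

## Test statistic: n times the R-squared of the auxiliary regression
bp_stat <- n * summary(aux)$r.squared

## Under homoskedasticity the statistic is asymptotically chi-squared,
## with degrees of freedom equal to the number of predictors
## (excluding the intercept), here 1.
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(bp_stat = bp_stat, bp_p = bp_p))
```

A small p-value suggests that the residual variance depends on the predictors, in which case the usual standard errors should be interpreted with care.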
| is not part of the standard R formula object, but many R packages use | to add their own functionality to formula objects. However, | has different meanings and usages in different packages. Users should be aware that the usage of | in PivotalR-package may not be the same as in other packages.
Author: Predictive Analytics Team at Pivotal Inc.
Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io
[1] Wikipedia: Breusch-Pagan test, https://en.wikipedia.org/wiki/Breusch-Pagan_test
[2] Documentation of linear regression in MADlib v0.6, https://madlib.apache.org/docs/latest/group__grp__linreg.html
madlib.glm, madlib.summary, madlib.arima are MADlib wrapper functions.
as.factor creates categorical variables for fitting.
delete safely deletes the result of this function.
## Not run:
## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)
x <- as.db.data.frame(abalone, conn.id = cid, verbose = FALSE)
lk(x, 10)
## linear regression conditioned on the value of sex
## i.e. grouping
fit <- madlib.lm(rings ~ . - id | sex, data = x, hetero = TRUE)
fit
## use I(.) for expressions
fit <- madlib.lm(rings ~ length + diameter + shell + I(diameter^2),
                 data = x, hetero = TRUE)
fit # display the result
## Another example
fit <- madlib.lm(rings ~ . - id | sex + (id < 2000), data = x)
## 3rd example
## The table has two columns: x is an array, y is double precision
dat <- x
dat$arr <- db.array(x[,-c(1,2)])
array.data <- as.db.data.frame(dat)
## Fit to y using every element of x
## This does not work in R's lm, but works in madlib.lm
fit <- madlib.lm(rings ~ arr, data = array.data)
fit <- madlib.lm(rings ~ arr - arr[1], data = array.data)
fit <- madlib.lm(rings ~ . - arr[1:2], data = array.data)
fit <- madlib.lm(as.integer(rings < 10) ~ . - arr[1:2], data = array.data)
## 4th example
## Step-wise feature selection
start <- madlib.lm(rings ~ . - id - sex, data = x)
## step(start)
db.disconnect(cid)
## End(Not run)