speedlm    R Documentation
The functions of class 'speedlm' may speed up the fitting of linear models to large data sets. High performance can be obtained especially if R is linked against an optimized BLAS, such as ATLAS.
# S3 method of class 'data.frame'
speedlm(formula, data, weights = NULL, offset = NULL, sparse = NULL,
set.default = list(), method=c('eigen','Cholesky','qr'),
model = FALSE, y = FALSE, fitted = FALSE, subset=NULL, ...)
# S3 method of class 'matrix'
speedlm.fit(y, X, intercept = FALSE, offset = NULL, row.chunk = NULL,
sparselim = 0.9, camp = 0.01, eigendec = TRUE,
tol.solve = .Machine$double.eps, sparse = NULL, tol.values = 1e-07,
tol.vectors = 1e-07, method=c('eigen','Cholesky','qr'), ...)
speedlm.wfit(y, X, w, intercept = FALSE, offset = NULL, row.chunk = NULL,
sparselim = 0.9, camp = 0.01, eigendec = TRUE,
tol.solve = .Machine$double.eps, sparse = NULL, tol.values = 1e-07,
tol.vectors = 1e-07, method=c('eigen','Cholesky','qr'), ...)
# S3 method of class 'speedlm' (object) and 'data.frame' (data)
## S3 method for class 'speedlm'
update(object, formula, data, add=TRUE, evaluate=TRUE,
subset=NULL, offset=NULL, weights=NULL,...)
# S3 method of class 'speedlm' (object) and 'data.frame' (data)
updateWithMoreData(object, data, weights = NULL, offset = NULL, sparse = NULL,
all.levels = FALSE, set.default = list(), subset=NULL,...)
Most of the arguments are the same as those of lm, but with some differences.
formula: the same as in lm.
data: the same as in lm.
weights: the same as in lm.
w: model weights.
intercept: a logical value indicating whether an intercept is used.
offset: the same as in lm.
X: the model matrix (a usage sketch follows this list).
y: for speedlm.fit and speedlm.wfit, the response vector; in speedlm, a logical indicating whether the response should be returned (see Value).
sparse: logical. Is the model matrix sparse? By default it is NULL, in which case a quick sample survey of the matrix is made.
set.default: a list in which to specify the parameters to pass to the functions cp, control and is.sparse.
sparselim: a value in the interval [0, 1] indicating the minimal proportion of zeroes in the model matrix X required to consider X as sparse.
camp: see the function is.sparse.
eigendec: logical. Should the rank of X be investigated? You may set it to FALSE if you are sure that X is of full rank.
row.chunk: an integer; see the function cp for details.
tol.solve: see the function solve.
tol.values: see the function control.
tol.vectors: see the function control.
method: the method used to check for singularity. The default is 'eigen'. See Details.
object: an object of class 'speedlm'.
all.levels: are all levels of any factors present in each data chunk? If so, set all.levels to TRUE.
model: logical. Should the model frame be returned?
fitted: logical. Should the fitted values be returned?
subset: the same as in lm.
add: logical. Are additional data coming from a new chunk provided?
evaluate: logical. If TRUE, evaluate the new call; otherwise return the call.
...: further optional arguments.
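Since the Examples section below only uses the formula interface, here is a minimal, hedged sketch of the matrix interface (the package is assumed to be attached). The data and variable names are illustrative, and it is assumed that X already contains the intercept column produced by model.matrix(), with intercept = TRUE merely recording its presence.

# hypothetical sketch of speedlm.fit(): the model matrix is built explicitly
set.seed(10)
d <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))
X <- model.matrix(~ x1 + x2, data = d)   # includes the intercept column
fit_mat <- speedlm.fit(y = d$y, X = X, intercept = TRUE)
fit_mat$coefficients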
Unlike lm and biglm, the functions of class 'speedlm' do not use the QR decomposition of the X matrix, but directly solve the normal equations. In some extreme cases this may cause problems of numerical stability, but it can take advantage of an optimized BLAS. To check for singularity, three options are available. Method 'eigen' performs an eigendecomposition of X'X. The 'Cholesky' method is faster than 'eigen' and does not use tolerance values, but 'eigen' appears to be more stable when suitable tolerance values are chosen. Method 'qr' does not perform a QR decomposition directly on the matrix X, but on X'WX.
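To make the distinction concrete, the following base-R sketch (illustrative only, not package code) contrasts a normal-equations solution with a QR-based one on simulated data.

set.seed(1)
X <- cbind(1, matrix(rnorm(500 * 3), ncol = 3))       # design matrix with intercept
y <- X %*% c(2, 1, -1, 0.5) + rnorm(500)
beta_normal <- solve(crossprod(X), crossprod(X, y))   # solve (X'X) beta = X'y
beta_qr <- qr.coef(qr(X), y)                          # QR-based, as in lm
all.equal(as.numeric(beta_normal), as.numeric(beta_qr))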
The memory size of an object of class 'speedlm' is O(p^2), where p is the number of covariates. If an optimized BLAS library is not installed, calculations may be sped up by setting row.chunk to some value, usually less than 1000, in set.default. See the function cp for details. Factors are permitted without limitations.
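A hedged sketch of this tuning, on simulated data; as stated above, row.chunk is passed through the set.default list.

set.seed(2)
dd <- data.frame(y = rnorm(5000), x1 = rnorm(5000), x2 = rnorm(5000))
fit_chunked <- speedlm(y ~ x1 + x2, data = dd,
                       set.default = list(row.chunk = 500))   # block size for crossproducts
coef(fit_chunked)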
In the most recent versions, the function update.speedlm is now a wrapper that calls either updateWithMoreData (the new name of the old update.speedlm, for additional data chunks) or update from package stats.
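A minimal sketch of the chunked-update path just described, on simulated data; the nobs component consulted at the end is documented in the Value section below.

set.seed(3)
d1 <- data.frame(y = rnorm(50), x = rnorm(50))
d2 <- data.frame(y = rnorm(50), x = rnorm(50))
fit <- speedlm(y ~ x, data = d1)
fit <- update(fit, data = d2, add = TRUE)   # forwarded to updateWithMoreData
fit$nobs                                    # number of observations after both chunks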
coefficients: the estimated coefficients.
df.residual: the residual degrees of freedom.
XTX: the product X'X (weighted, if weights are used).
A: the product X'X (weighted, if weights are used), not checked for singularity.
Xy: the product X'y (weighted, if weights are used).
ok: the set of column indices of the model matrix where the model has been fitted.
rank: the numeric rank of the fitted linear model.
pivot: see the function control.
RSS: the estimated residual sum of squares of the fitted model.
sparse: a logical value indicating whether the model matrix is sparse.
deviance: the estimated deviance of the fitted model.
weights: the weights used in the last updating.
zero.w: the number of non-zero weighted observations.
nobs: the number of observations.
nvar: the number of independent variables.
terms: the terms object used.
intercept: a logical value indicating whether an intercept has been used.
call: the matched call.
model: either NULL or the model frame, if model = TRUE.
y: either NULL or the response variable, if y = TRUE.
fitted.values: either NULL or the fitted values, if fitted = TRUE.
offset: the model offset.
...: other values necessary to update the estimation.
All the above functions return an object of class 'speedlm'.
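As an illustration (simulated data, not taken from the package examples), a few of the components listed above can be inspected directly on a fitted object:

set.seed(4)
d <- data.frame(y = rnorm(100), x = rnorm(100))
fm <- speedlm(y ~ x, data = d)
fm$coefficients   # estimated coefficients
fm$RSS            # residual sum of squares
fm$rank           # numeric rank
fm$nobs           # number of observations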
Marco Enea, with a contribution from Ronen Meiri.
Enea, M. (2009) Fitting Linear Models and Generalized Linear Models With Large Data Sets in R.
In book of short papers, conference on “Statistical Methods for the analysis of large data-sets”,
Italian Statistical Society, Chieti-Pescara, 23-25 September 2009, 411-414. ISBN: 978-88-6129-425-7.
Klotz, J. H. (1995) Updating Simple Linear Regression. Statistica Sinica, 5, 399-403.
Bates, D. (2009) Comparing Least Squares Calculations. Technical report.
Lumley, T. (2009) biglm: bounded memory linear and generalized linear models. R package version 0.7, https://CRAN.R-project.org/package=biglm.
summary.speedlm, speedglm, lm, and biglm
data(data1)
da <- data1
do1 <- da[1:30,]
do2 <- da[31:70,]
do3 <- da[71:100,]
m1 <- speedlm(y ~ factor(fat1) + x1 + x2, data = do1)
m1 <- update(m1, data = do2)
m1 <- update(m1, data = do3)
m2 <- lm(y ~ factor(fat1) + x1 + x2, data = data1)
summary(m1)
summary(m2)
# as before but recursively
make.data <- function(filename, chunksize, ...){
  conn <- NULL
  function(reset = FALSE, header = TRUE){
    if (reset){
      if (!is.null(conn)) close(conn)
      conn <<- file(filename, open = "r")
    } else {
      rval <- read.table(conn, nrows = chunksize, header = header, ...)
      if (nrow(rval) == 0) {
        close(conn)
        conn <<- NULL
        rval <- NULL
      }
      return(rval)
    }
  }
}
tmp_da<-tempfile("da",fileext=".txt")
write.table(da,tmp_da,col.names=TRUE,row.names=FALSE,quote=FALSE)
dat <- make.data(tmp_da,chunksize=30,col.names=c("y","fat1","x1", "x2"))
dat(reset=TRUE)
da2 <- dat(reset=FALSE)
# the first model runs on the first 30 rows.
m3 <- speedlm(y ~ factor(fat1) + x1 + x2, data=da2)
# the last three models run on the subsequent 30, 30 and 10 rows, respectively
for (i in 1:3){
  da2 <- dat(reset = FALSE, header = FALSE)
  m3 <- update(m3, data = da2, add = TRUE)
}
all.equal(coef(m1),coef(m3))