Fitting Generalized Linear Models for Large Data Sets

Description

speedglm and speedglm.wfit fit GLMs to medium-large data sets, that is those storable into the R memory. The highest performances, in terms of computation time, are obtained when R is linked against an optimized BLAS, such as ATLAS. The function shglm is for a data set stored into a file of size greater than the available memory, and takes as argument a function to manipulate connections.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
## S3 method for class 'data.frame':
speedglm(formula,data,family=gaussian(),weights=NULL,start=NULL,
         etastart=NULL,mustart=NULL,offset=NULL,maxit=25, k=2, 
         sparse=NULL,set.default=list(), trace=FALSE,
         method=c('eigen','Cholesky','qr'), model=FALSE, y=FALSE, 
         fitted=FALSE,...)

## S3 method for class 'matrix':       
speedglm.wfit(y, X, intercept=TRUE, weights=NULL,row.chunk=NULL,
              family=gaussian(), start=NULL, etastart=NULL,
              mustart=NULL, offset=NULL, acc=1e-08, maxit=25, k=2,
              sparselim=.9,camp=.01, eigendec=TRUE, tol.values=1e-7,
              tol.vectors=1e-7, tol.solve=.Machine$double.eps,
              sparse=NULL,method = c('eigen','Cholesky','qr'), 
              trace=FALSE,...)
              
## S3 method for class 'function':              
shglm(formula, datafun, family = gaussian(), weights.fo = NULL, start = NULL, 
      etastart = NULL, mustart = NULL, offset = NULL, maxit = 25, k = 2, 
      chunksize = 5000, sparse = NULL, trace = FALSE, all.levels = FALSE,
      set.default = list(),...)
     

Arguments

Most of arguments are the same of glm or bigglm but with some difference.

formula

the same of glm.

data

a data frame.

datafun

a function which uses connections. See the example below.

family

the same of glm, but it must be specified with brackets.

start

the same of glm.

weights

the same of glm, but it must be specified as data$weights.

weights.fo

weights for the response. It must be specified as a formula (see the example below).

etastart

the same of glm.

mustart

the same of glm.

offset

the same of glm.

intercept

the same of glm.

X

the same of x in glm.fit.

y

the same of glm and glm.fit.

maxit

the same of glm.

k

numeric, the penalty per parameter to be used; the default k = 2 is the classical AIC.

trace

logical. Do you want to be informed about the model estimation progress?

sparse

logical. Is the model matrix sparse? By default is NULL, so a quickly sample survey will be made.

chunksize

an integer indicates the number of rows of the data file to read at time.

all.levels

logical, are all factor's levels present in each data chunk?

set.default

a list in which to specify the below parameters.

sparselim

a real in the interval [0, 1]. It indicates the minimal proportion of zeroes in the data matrix X in order to consider X as sparse.

camp

see the function is.sparse.

eigendec

logical. Do you want to check the rank of X? You may set it to false if you are sure that X is full rank.

row.chunk

an integer, see the function cp for details.

acc

tolerance to be used for the estimation.

tol.solve

see the function solve.

tol.values

see the function control.

tol.vectors

see the function control.

method

the chosen method to detect for singulatity.

model

logical. If TRUE the model frame will be returned.

fitted

logical. If TRUE the fitted values will be returned.

...

further optional arguments.

Details

The function shglm works like biglm, but it checks for singularity and does not impose restrictions on factors. Since during the IWLS estimation shglm uses repeated accesses to data file stored, for example, into the hard disk, the estimation time could be very long. Unlike from glm or biglm, the functions of class 'speedglm' do not use the QR decomposition, but directly solve the equations in the form of Iterative(-ly) (Re-)Weighted Least Squares (IWLS). The memory size of an object of class 'speedglm' is O(p^2), where p is the number of covariates, unless one or more of argument model, y and fitted are set to TRUE. If an optimized BLAS is not installed, an attempt to speed up calculations might be done by setting row.chunk to some value, usually less than 1000, in set.default. See the function cp for details.
If the model matrix is (very) sparse, the package Matrix could be used. Note that if method 'qr' is chosen, then the qr decomposition will not be applied on matrix X, as in lm, but on X'WX.

Value

coefficients

the estimated coefficients.

logLik

the log likelihood of the fitted model.

iter

the number of iterations of IWLS used.

tol

the maximal value of tolerance reached.

convergence

a logical value which indicates if convergence was reached.

family

the family object used.

link

the link function used.

df

the degrees of freedom of the model.

XTX

the product X'X (weighted, if the case).

dispersion

the estimated dispersion parameter of the model.

ok

the set of column indeces of the model matrix where the model has been fitted.

rank

the rank of the model matrix.

RSS

the estimated residual sum of squares of the fitted model.

aic

the estimated Akaike Information Criterion.

sparse

a logical value which indicates if the model matrix is sparse.

deviance

the estimated deviance of the fitted model.

nulldf

the degrees of freedom of the null model.

nulldev

the estimated deviance of the null model.

ngoodobs

the number of non-zero weighted observations.

n

the number of observations.

intercept

a logical value which indicates if an intercept has been used.

terms

the terms object used.

call

the matched call.

model

Either NULL or, if model was previously set to TRUE, the model frame.

y

Either NULL or, if y was previously set to TRUE, the response variable.

linear.predictors

Either NULL or, if fitted was previously set to TRUE, the fitted values.

Note

All the above functions make an object of class 'speedglm'.
In the current package version, arguments start, mustart and etastart of function shglm have been disabled. These will be restored in future.

Author(s)

Marco Enea. Ronen Meiri contributed with method 'qr'

References

Enea, M. (2009) Fitting Linear Models and Generalized Linear Models with large data sets in R. In book of short papers, conference on “Statistical Methods for the analysis of large data-sets”, Italian Statistical Society, Chieti-Pescara, 23-25 September 2009, 411-414.

Bates, D. (2009) Comparing Least Square Calculations. Technical report. Available at http://cran.rakanu.com/web/packages/Matrix/vignettes/Comparisons.pdf

Lumley, T. (2009) biglm: bounded memory linear and generalized linear models. R package version 0.7. http://CRAN.R-project.org/package=biglm.

See Also

speedlm, bigglm, glm

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
## Not run: 
# The following comparison among glm(), bigglm() and speedglm() cannot be considered rigorous 
# and exhaustive, but it is only to give an idea of the computation time. 
# It may take a long time.
require(biglm)
n<-50000
k<-80
y <- rgamma(n,1.5,1)
x <-round( matrix(rnorm(n*k),n,k),digits=3)
colnames(x) <-paste("s",1:k,sep = "")
da<- data.frame(y,x)
fo <- as.formula(paste("y~",paste(paste("s",1:k,sep=""),collapse="+")))   

system.time(m1 <- glm(fo,data=da,family=Gamma(log)))
system.time(m2 <- bigglm(fo,data=da,family=Gamma(log)))
system.time(m3 <- speedglm(fo,data=da,family=Gamma(log)))

# You may also try speedglm when R is linked against an optimized BLAS,
# otherwise try to run the following function. In some computers, it is
# faster for large data sets.
system.time(m4 <- speedglm(fo,data=da,family=Gamma(log),set.default=list(row.chunk=1000)))

## End(Not run)
##################
## Not run: 
## An example of function using a connection to an out-memory file  
## This is a slightly modified version of the function from the bigglm's help page 
 make.data<-function(filename, chunksize,...){       
     conn<-NULL
     function(reset=FALSE){
     if(reset){
       if(!is.null(conn)) close(conn)
       conn<<-file(filename,open="r")
     } else{
       rval<-read.table(conn, nrows=chunksize,...)
       if ((nrow(rval)==0)) {
            close(conn)
            conn<<-NULL
            rval<-NULL
       }
       return(rval)
     }
  }
}


# data1 is a small toy dataset
data(data1)
write.table(data1,"data1.txt",row.names=FALSE,col.names=FALSE)
rm(data1) 

da<-make.data("data1.txt",chunksize=50,col.names=c("y","fat1","x1","x2"))

# Caution! make sure to close the connection once you have run command #1
da(reset=T) #1: opens the connection to "data1.txt"
da(reset=F) #2: reads the first 50 rows (out of 100) of the dataset
da(reset=F) #3: reads the second 50 rows (out of 100) of the dataset
da(reset=F) #4: is NULL: this latter command closes the connection


require(biglm)        
# fat1 is a factor with four levels                                    
b1<-shglm(y~factor(fat1)+x1,weights=~I(x2^2),datafun=da,family=Gamma(log))
b2<-bigglm(y~factor(fat1)+x1,weights=~I(x2^2),data=da,family=Gamma(log))
summary(b1) 
summary(b2) 

file.remove("data1.txt")

## End(Not run)