# Fitting Generalized Linear Models for Large Data Sets

### Description

`speedglm`

and `speedglm.wfit`

fit GLMs to medium-large
data sets, that is those storable into the R memory. The highest performances, in terms of computation time,
are obtained when R is linked against an optimized BLAS, such as ATLAS. The function `shglm`

is for a data set stored into a file of size greater than the available memory, and takes as
argument a function to manipulate connections.

### Usage

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | ```
## S3 method for class 'data.frame':
speedglm(formula,data,family=gaussian(),weights=NULL,start=NULL,
etastart=NULL,mustart=NULL,offset=NULL,maxit=25, k=2,
sparse=NULL,set.default=list(), trace=FALSE,
method=c('eigen','Cholesky','qr'), model=FALSE, y=FALSE,
fitted=FALSE,...)
## S3 method for class 'matrix':
speedglm.wfit(y, X, intercept=TRUE, weights=NULL,row.chunk=NULL,
family=gaussian(), start=NULL, etastart=NULL,
mustart=NULL, offset=NULL, acc=1e-08, maxit=25, k=2,
sparselim=.9,camp=.01, eigendec=TRUE, tol.values=1e-7,
tol.vectors=1e-7, tol.solve=.Machine$double.eps,
sparse=NULL,method = c('eigen','Cholesky','qr'),
trace=FALSE,...)
## S3 method for class 'function':
shglm(formula, datafun, family = gaussian(), weights.fo = NULL, start = NULL,
etastart = NULL, mustart = NULL, offset = NULL, maxit = 25, k = 2,
chunksize = 5000, sparse = NULL, trace = FALSE, all.levels = FALSE,
set.default = list(),...)
``` |

### Arguments

Most of arguments are the same of glm or bigglm but with some difference.

`formula` |
the same of |

`data` |
a data frame. |

`datafun` |
a function which uses connections. See the example below. |

`family` |
the same of |

`start` |
the same of |

`weights` |
the same of |

`weights.fo` |
weights for the response. It must be specified as a formula (see the example below). |

`etastart` |
the same of |

`mustart` |
the same of |

`offset` |
the same of |

`intercept` |
the same of |

`X` |
the same of |

`y` |
the same of |

`maxit` |
the same of |

`k` |
numeric, the penalty per parameter to be used; the default k = 2 is the classical AIC. |

`trace` |
logical. Do you want to be informed about the model estimation progress? |

`sparse` |
logical. Is the model matrix sparse? By default is NULL, so a quickly sample survey will be made. |

`chunksize` |
an integer indicates the number of rows of the data file to read at time. |

`all.levels` |
logical, are all factor's levels present in each data chunk? |

`set.default` |
a list in which to specify the below parameters. |

`sparselim` |
a real in the interval [0, 1]. It indicates the minimal proportion of zeroes in the data matrix X in order to consider X as sparse. |

`camp` |
see the function is.sparse. |

`eigendec` |
logical. Do you want to check the rank of X? You may set it to false if you are sure that X is full rank. |

`row.chunk` |
an integer, see the function cp for details. |

`acc` |
tolerance to be used for the estimation. |

`tol.solve` |
see the function solve. |

`tol.values` |
see the function control. |

`tol.vectors` |
see the function control. |

`method` |
the chosen method to detect for singulatity. |

`model` |
logical. If TRUE the model frame will be returned. |

`fitted` |
logical. If TRUE the fitted values will be returned. |

`...` |
further optional arguments. |

### Details

The function `shglm`

works like
`biglm`

, but it checks for singularity and does not impose restrictions on
factors. Since during the IWLS estimation `shglm`

uses repeated accesses to data file stored, for example,
into the hard disk, the estimation time could be very long.
Unlike from `glm`

or `biglm`

, the functions of class 'speedglm' do not use
the QR decomposition, but directly solve the equations in the form of Iterative(-ly) (Re-)Weighted Least Squares (IWLS).
The memory size of an object of class 'speedglm' is *O(p^2)*, where *p* is the number of covariates, unless one or more of argument `model`

, `y`

and `fitted`

are set to TRUE. If an optimized BLAS
is not installed, an attempt to speed up calculations might be done by setting `row.chunk`

to some value, usually less than 1000, in `set.default`

. See the function cp for details.

If the model matrix is (very) sparse, the package `Matrix`

could be used.
Note that if method 'qr' is chosen, then the qr decomposition will not be applied on matrix X, as in `lm`

,
but on X'WX.

### Value

`coefficients` |
the estimated coefficients. |

`logLik` |
the log likelihood of the fitted model. |

`iter` |
the number of iterations of IWLS used. |

`tol` |
the maximal value of tolerance reached. |

`convergence` |
a logical value which indicates if convergence was reached. |

`family` |
the family object used. |

`link` |
the link function used. |

`df` |
the degrees of freedom of the model. |

`XTX` |
the product X'X (weighted, if the case). |

`dispersion` |
the estimated dispersion parameter of the model. |

`ok` |
the set of column indeces of the model matrix where the model has been fitted. |

`rank` |
the rank of the model matrix. |

`RSS` |
the estimated residual sum of squares of the fitted model. |

`aic` |
the estimated Akaike Information Criterion. |

`sparse` |
a logical value which indicates if the model matrix is sparse. |

`deviance` |
the estimated deviance of the fitted model. |

`nulldf` |
the degrees of freedom of the null model. |

`nulldev` |
the estimated deviance of the null model. |

`ngoodobs` |
the number of non-zero weighted observations. |

`n` |
the number of observations. |

`intercept` |
a logical value which indicates if an intercept has been used. |

`terms` |
the terms object used. |

`call` |
the matched call. |

`model` |
Either NULL or, if |

`y` |
Either NULL or, if |

`linear.predictors` |
Either NULL or, if |

### Note

All the above functions make an object of class 'speedglm'.

In the current package version, arguments `start`

, `mustart`

and `etastart`

of function `shglm`

have been disabled.
These will be restored in future.

### Author(s)

Marco Enea. Ronen Meiri contributed with method 'qr'

### References

Enea, M. (2009) Fitting Linear Models and Generalized Linear Models with large data sets in R.
In *book of short papers, conference on “Statistical Methods for the analysis of large data-sets”,
Italian Statistical Society*, Chieti-Pescara, 23-25 September 2009, 411-414.

Bates, D. (2009) Comparing Least Square Calculations. Technical report. Available at
http://cran.rakanu.com/web/packages/Matrix/vignettes/Comparisons.pdf

Lumley, T. (2009) biglm: bounded memory linear and generalized linear models. *R package version 0.7*.
http://CRAN.R-project.org/package=biglm.

### See Also

speedlm, bigglm, glm

### Examples

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 | ```
## Not run:
# The following comparison among glm(), bigglm() and speedglm() cannot be considered rigorous
# and exhaustive, but it is only to give an idea of the computation time.
# It may take a long time.
require(biglm)
n<-50000
k<-80
y <- rgamma(n,1.5,1)
x <-round( matrix(rnorm(n*k),n,k),digits=3)
colnames(x) <-paste("s",1:k,sep = "")
da<- data.frame(y,x)
fo <- as.formula(paste("y~",paste(paste("s",1:k,sep=""),collapse="+")))
system.time(m1 <- glm(fo,data=da,family=Gamma(log)))
system.time(m2 <- bigglm(fo,data=da,family=Gamma(log)))
system.time(m3 <- speedglm(fo,data=da,family=Gamma(log)))
# You may also try speedglm when R is linked against an optimized BLAS,
# otherwise try to run the following function. In some computers, it is
# faster for large data sets.
system.time(m4 <- speedglm(fo,data=da,family=Gamma(log),set.default=list(row.chunk=1000)))
## End(Not run)
##################
## Not run:
## An example of function using a connection to an out-memory file
## This is a slightly modified version of the function from the bigglm's help page
make.data<-function(filename, chunksize,...){
conn<-NULL
function(reset=FALSE){
if(reset){
if(!is.null(conn)) close(conn)
conn<<-file(filename,open="r")
} else{
rval<-read.table(conn, nrows=chunksize,...)
if ((nrow(rval)==0)) {
close(conn)
conn<<-NULL
rval<-NULL
}
return(rval)
}
}
}
# data1 is a small toy dataset
data(data1)
write.table(data1,"data1.txt",row.names=FALSE,col.names=FALSE)
rm(data1)
da<-make.data("data1.txt",chunksize=50,col.names=c("y","fat1","x1","x2"))
# Caution! make sure to close the connection once you have run command #1
da(reset=T) #1: opens the connection to "data1.txt"
da(reset=F) #2: reads the first 50 rows (out of 100) of the dataset
da(reset=F) #3: reads the second 50 rows (out of 100) of the dataset
da(reset=F) #4: is NULL: this latter command closes the connection
require(biglm)
# fat1 is a factor with four levels
b1<-shglm(y~factor(fat1)+x1,weights=~I(x2^2),datafun=da,family=Gamma(log))
b2<-bigglm(y~factor(fat1)+x1,weights=~I(x2^2),data=da,family=Gamma(log))
summary(b1)
summary(b2)
file.remove("data1.txt")
## End(Not run)
``` |