glm: Fitting Large Scale Generalized Linear Models

Description Usage Arguments Details Algorithm Author(s)

Description

gtGlm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution. The syntax is modeled after that of glm.

For more information regarding the helper functions, see the description of Make and Argument functions in the general documentation on the Grokit system.

Usage

1
2
3
4
5
6
7
8
glm.data(data, ..., outputs = result, force.frame = FALSE)

GLM(data, ..., model = NULL, outputs = result)

GLMMake(formula, family = gaussian, weights = NULL, start = NULL,
        eta.start = NULL, mu.start = NULL, offset = NULL,
        maxit = 25, epsilon = 1e-8, trace = FALSE, debug = FALSE,
        convergence = "relative", model = TRUE,  ...)

Arguments

data

an object of class "data".

formula

an object of class formula. The specification of formula differs slightly from the stats library glm (see ‘Details’ for specifications).

family

the expected error distribution of the data and the link function. This argument can be a character string naming a family (e.g. ‘gaussian’), a family function (e.g. gaussian), or a list that is the result of a call to a family funtion (e.g. gaussian(link = ‘inverse’)). See family for further information.

weights

an optional description of ‘prior weights’ to be used. Unlike glm, this should be NULL or an expression.

start

an optional vector that specifies starting values for the parameters in the linear predictor.

etastart

an optional specification of starting values for the linear predictor. The required format is equivalent to that of weights. If start is non-NULL, this argument is unused.

mustart

an optional specification of starting values for the predicted means. The required format is equivalent to that of weights. If either start or etastart is non-NULL, this argument is unused.

offset

an optional specification of a known component to be added to the linear predictor during fitting. The required format is equivalent to that of weights.

maxit

the maximal number of IWLS iterations to be used. In glm, this argument is specified within the control argument.

epsilon

the maximum absolute change of parameters allowed for convergence. See section ‘Algorithm’ for more information.

trace

logical indicating whether every iteration of the coefficient vector should be returned.

debug

logical indicating whether sample calculations should be outputed to the terminal.

convergence

the type of convergence to be tested for, either "relative" or "absolute".

Details

A typical formula has the form response ~ terms where both response and terms are expressions with the additional function I allowed. Additionally, terms is allowed the additional binary operator :. See the ‘Details’ of glm for more information about I, :, +, and *.

Unlike glm, a binomial model is specified exactly the same as other models. To specify number of trials, simply include them in the weights, which is easily accomplished because weights is allowed to be a mathematical expression. Furthermore, the response must be the proportion of successes.

For example, let S denote a vector of number of successes; F, a vector of number of failures; W, a vector of weights. In the implementation of glm, this could be called as glm(cbind(S, F) ~ [formula], family = binomial, weights = W). To form an equivalent model using gtGlm, use gtGlm(S/(S + F) ~ [formula], family = binomial, weights = (F + S) * W). Here, [formula] represents an arbitrary formula of covariates.

Algorithm

As the algorithm for maximimizing the log likelihood is different from that of glm, a brief outline is given:

1. η_i = \textbf{x}_i \cdot \textbf{β}_j - The linear predictor is the dot product of the input vector (which is formed from the formula and the data) and the current iteration of the coefficient vector.

2. \hat{μ}_i = g^{-1}≤ft(η_i\right) - the predict mean is the inverse link function of the linear predictor.

3. z_i = η_i + ≤ft(y_i - \hat{μ}_i \right) \cdot \frac{dη_i}{d μ_i} - The working dependent variable is computed.

4. w_i = \frac{p_i}{\text{var}≤ft(μ_i\right) \cdot ≤ft(\frac{dη_i}{d μ_i}\right)^2} - the iterative weight calculated from the prior weights (the weights argument) and the variance function specified by the family.

5. \textbf{X}^\textbf{T} \textbf{WX} \stackrel{+}{=} w_i \cdot \textbf{x}_i^\textbf{T} \textbf{x} - The weights matrix is updated for that item.

6. \textbf{X}^\textbf{T} \textbf{Wz} \stackrel{+}{=} w_i \cdot z_i \cdot \textbf{x}_i - The response matrix is updated for that item.

7. \boldsymbol β_{j+1} = ≤ft( \textbf{X}^\textbf{T} \textbf{WX} \right)^{-1} \cdot \textbf{X}^\textbf{T} \textbf{Wz} - the next iteration of the coefficient vector is computed. Note: ( \cdot )^{-1} denotes the Moore-Penrose psuedoinverse, not the standard matrix inverse.

8. If \max\limits_i ≤ft|β_j[i] - β_{j+1}[i]\right| < ε, then the IWLS has converged and deviance is calculated. Otherwise, another iteration is performed.

Author(s)

Jon Claus, <jonterainsights@gmail.com>, Tera Insights LLC


tera-insights/gtStats documentation built on May 31, 2019, 8:36 a.m.