glm: Fitting Large Scale Generalized Linear Models
In tera-insights/gtStats: Higher Level Statistical Functions for Grokit

Description Usage Arguments Details Algorithm Author(s)

gtGlm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution. The syntax is modeled after that of glm.

For more information regarding the helper functions, see the description of Make and Argument functions in the general documentation on the Grokit system.

glm.data(data, ..., outputs = result, force.frame = FALSE)

GLM(data, ..., model = NULL, outputs = result)

GLMMake(formula, family = gaussian, weights = NULL, start = NULL,
        eta.start = NULL, mu.start = NULL, offset = NULL,
        maxit = 25, epsilon = 1e-8, trace = FALSE, debug = FALSE,
        convergence = "relative", model = TRUE,  ...)

`data`	an object of class `"data"`.
`formula`	an object of class `formula`. The specification of formula differs slightly from the stats library `glm` (see ‘Details’ for specifications).
`family`	the expected error distribution of the data and the link function. This argument can be a character string naming a family (e.g. `‘gaussian’`), a family function (e.g. `gaussian`), or a list that is the result of a call to a family funtion (e.g. `gaussian(link = ‘inverse’)`). See `family` for further information.
`weights`	an optional description of ‘prior weights’ to be used. Unlike `glm`, this should be NULL or an `expression`.
`start`	an optional vector that specifies starting values for the parameters in the linear predictor.
`etastart`	an optional specification of starting values for the linear predictor. The required format is equivalent to that of `weights`. If `start` is non-`NULL`, this argument is unused.
`mustart`	an optional specification of starting values for the predicted means. The required format is equivalent to that of `weights`. If either `start` or `etastart` is non-NULL, this argument is unused.
`offset`	an optional specification of a known component to be added to the linear predictor during fitting. The required format is equivalent to that of `weights`.
`maxit`	the maximal number of IWLS iterations to be used. In `glm`, this argument is specified within the control argument.
`epsilon`	the maximum absolute change of parameters allowed for convergence. See section ‘Algorithm’ for more information.
`trace`	logical indicating whether every iteration of the coefficient vector should be returned.
`debug`	logical indicating whether sample calculations should be outputed to the terminal.
`convergence`	the type of convergence to be tested for, either `"relative"` or `"absolute"`.

A typical formula has the form response ~ terms where both response and terms are expressions with the additional function I allowed. Additionally, terms is allowed the additional binary operator :. See the ‘Details’ of glm for more information about I, :, +, and *.

Unlike glm, a binomial model is specified exactly the same as other models. To specify number of trials, simply include them in the weights, which is easily accomplished because weights is allowed to be a mathematical expression. Furthermore, the response must be the proportion of successes.

For example, let S denote a vector of number of successes; F, a vector of number of failures; W, a vector of weights. In the implementation of glm, this could be called as glm(cbind(S, F) ~ [formula], family = binomial, weights = W). To form an equivalent model using gtGlm, use gtGlm(S/(S + F) ~ [formula], family = binomial, weights = (F + S) * W). Here, [formula] represents an arbitrary formula of covariates.

As the algorithm for maximimizing the log likelihood is different from that of glm, a brief outline is given:

1. η_i = \textbf{x}_i \cdot \textbf{β}_j - The linear predictor is the dot product of the input vector (which is formed from the formula and the data) and the current iteration of the coefficient vector.

2. \hat{μ}_i = g^{-1}≤ft(η_i\right) - the predict mean is the inverse link function of the linear predictor.

3. z_i = η_i + ≤ft(y_i - \hat{μ}_i \right) \cdot \frac{dη_i}{d μ_i} - The working dependent variable is computed.

4. w_i = \frac{p_i}{\text{var}≤ft(μ_i\right) \cdot ≤ft(\frac{dη_i}{d μ_i}\right)^2} - the iterative weight calculated from the prior weights (the weights argument) and the variance function specified by the family.

5. \textbf{X}^\textbf{T} \textbf{WX} \stackrel{+}{=} w_i \cdot \textbf{x}_i^\textbf{T} \textbf{x} - The weights matrix is updated for that item.

6. \textbf{X}^\textbf{T} \textbf{Wz} \stackrel{+}{=} w_i \cdot z_i \cdot \textbf{x}_i - The response matrix is updated for that item.

7. \boldsymbol β_{j+1} = ≤ft( \textbf{X}^\textbf{T} \textbf{WX} \right)^{-1} \cdot \textbf{X}^\textbf{T} \textbf{Wz} - the next iteration of the coefficient vector is computed. Note: ( \cdot )^{-1} denotes the Moore-Penrose psuedoinverse, not the standard matrix inverse.

8. If \max\limits_i ≤ft|β_j[i] - β_{j+1}[i]\right| < ε, then the IWLS has converged and deviance is calculated. Otherwise, another iteration is performed.

Jon Claus, <jonterainsights@gmail.com>, Tera Insights LLC

tera-insights/gtStats documentation built on May 31, 2019, 8:36 a.m.