Logistic regression is most commonly used with a Bernoulli response. If $Y \sim Bernoulli(p)$, then $P(Y = 1) = p = 1 - P(Y = 0)$. Logistic regression can be used to estimate $p|\textbf{x}$ where $\textbf{x}$ is a collection of predictor variables. This is accomplished by assuming the relationship
$$log(\frac{p}{1-p}) = \textbf{x}\beta$$
and then choosing $\beta$ to maximize the joint likelihood:
$$\Pi_i p_i^{y_i}(1-p_i)^{(1-y_i)}$$
Note that the function linking the predictor variables to $p$ is
called the logit function, which is what gives logistic regression
its name. In $\textbf{R}$, this model can be fit using glm
by
specifying the options family = "binomial"
.
The binomial distribution is a generalization of the Bernoulli, typically conceptualized as describing the number of "successes" (here denoted $y$) out of some number of attempts or trials (here denoted $n$). The probability mass function for a binomial random variable, $Y$, is
$$P(Y = y|n, p) = {x \choose n} p^x(1 - p)^{n - x}$$
From this, we can see that by setting $n = 1$, we recover the Bernoulli PMF.
Since the binomial distribution is a generalization of the Bernoulli distribution, it stands to reason that it may be possible to generalize logistic regression for Bernoulli variables to binomial variables. Indeed, this is the case, and it turns out it is relatively easy. Using the logit link function again, maximize the joint likelihood:
$$\Pi_i {y_i \choose n_i} p_i^{y_i}(1-p_i)^{(1-y_i)}$$
R is also capable of fitting binomial regression models using glm
with family = "binomial"
. However, there are a few syntactic
quirks. First, in the model statement, the response variable must
be given as a proportion of successes for a given run. This can be
done either by providing the proportions as a column in the data
matrix or by specifying a ratio of the number of successes to the
number of attempts. An example here might be instructive.
set.seed(20171011) library(dplyr) library(ciTools)
df <- data.frame( x = runif(30, -1, 1), n = rbinom(30, 6, .8)) %>% mutate(y = rbinom(30, n, (exp(x)/(1 + exp(x)))))
Consider the data set, df
, given below.
head(df)
Here, $x$ is some predictor variable related to the number of successes, $y$, for the given number of attempts, $n$. We can fit a logistic regression to this data to estimate this relationship:
glm(y/n ~ x, data = df, family = "binomial", weights = n)
We get the same results using different syntax:
dfProb <- mutate(df, prob = y/n) head(dfProb) glm(prob ~ x, data = dfProb, family = "binomial", weights = n)
Notably, these fitted values are equivalent to what we would get if
we were to transform each of our binomial responses into equivalent
sets of Bernoulli trials. For example, the first observation in
df
is a single success out of 5 attempts with a covariate of
r round(df$x[1], 3)
. Another way to think of this is five Bernoulli
trials, one successful, four unsuccessful, all with a covariate
value of r round(df$x[1], 3)
.
get_box <- function(myRow){ data.frame(x = rep(myRow$x, myRow$n), y = c(rep(1, myRow$y), rep(0, myRow$n - myRow$y))) } out <- NULL for(i in 1:nrow(df)){ out <- bind_rows(out, get_box(df[i,])) } dfTall <- out
For example:
head(dfTall) glm(y ~ x, data = dfTall, family = "binomial")
Note that while the degrees of freedom, information criteria, etc. are different, the coefficient estimates are the same, as are the standard errors for these coefficients.
ciTools
for binomial regressionciTools
supports logistic regression with both Bernoulli and
binomial response variables. For both types, add_ci
has intuitive
and expected functionality. It produces a point estimate and
confidence intervals around the estimated probability of
success. However, the other functions of ciTools
(add_pi
,
add_quantile
, and add_probs
) produce different behaviors,
because prediction intervals, quantiles, and probability
values are not very useful for Bernoulli variables.
For example, consider prediction intervals. These are used to
quantify the observation-to-observation variability and estimate its
range. For a Bernoulli variable, only two values are possible, so
this is not particularly meaningful -- the prediction interval will
always be $[0,1]$. Quantiles are similarly problematic. And
probability estimates (that is, $P(Y>y|x)$) are
equivalent to estimates of $p$ or $1-p$ . Thus, asking for any of
these will produce either an error (in the case of
add_pi.glm(family =
"binomial)
or add_quantiles(family = "binomial)
) or a warning
(in the case of add_probs.glm(family = "binomial")
).
For a binomial response, on the other hand, each of these functions produce meaningful output. Binomial response variables can take more than two values, making it sensible to consider prediction intervals, quantiles, or probabilities. For example, it may be interesting to consider the 90 percent quantile for a binomial regression:
fit <- glm(y/n ~ x, data = df, family = "binomial", weights = n)
head(add_quantile(df, fit, p = 0.9))
There is a lot to unpack here, including three warning
messages. Let's consider the output first. Unlike when we use
add_ci.glm
, the pred
column includes the predicted values
$E(Y|\boldsymbol{x})$ rather than the estimated probability of success
$\hat{p}|\boldsymbol{x}$.
Next, there are three warnings. None of these indicate that
anything has gone wrong. Rather, they are included to ensure that
users understand the output they're given. The first warning makes
it clear to the user that the weights
argument is assumed to
indicate that they're doing binomial rather than weighted Bernoulli
regression. The second warning informs users that the estimated
interval is approximate. This is because the current method used by
ciTools
for GLMs is based on simulation, and because ciTools
forces quantile estimates to lie in the support set of the response
distribution. The final warning points out that the pred
column
refers to the fitted values rather than estimated probability of
success, which was mentioned in the previous paragraph.
Logistic regression can be used for both Bernoulli and Binomial
response variables, and glm
supports both. ciTools
differentiates between the two based on whether the user has
included a column in the weights
argument. Aside from add_ci
,
none of the functions in ciTools
produce useful information for
Bernoulli response variables. For Binomial response variables, all
of these functions produce useful information, though error
messages are included to ensure that users understand the output
presented.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.