The usual method of finding maximum likelihood estimates involves deriving the log likelihood function with respect to each parameter we want to estimate, then setting this equation to 0. But there are times when this is not possible. For example, in the logistic regression scenario, closed form solutions cannot be obtained. Thus we need to resort to some iterative method in order to compute these estimates. In this way, we usually say the MLE is found when the improvement after each iteration becomes negligible.
This package currently implements the Newton-Raphson method for simple logistic regression, and the EM algorithm for estimating the frequency of blood alleles.
First load the package via:
library(naim)
The NR_logit
function finds estimated parameters under simple logistic
regression, where closed form solutions are not obtainable. Using the
Newton-Raphson method, we supply an initial feasible estimate, and
iteratively improve the estimate.
There are 3 required arguments: covariates, number of trials, and
successes. More details are in the documentation via ?NR_logit
. Let's
see an example:
set.seed(547)
# Covariates
x <- rnorm(100, mean = 3, sd = 0.2)
# Number of trials
n <- sample(1:100, replace = TRUE)
# Successes
y <- rbinom(100, size = n, prob = 0.6)
# Data set
head(data.frame(x, n, y))
## x n y
## 1 3.067176 61 41
## 2 3.088659 23 18
## 3 2.958531 61 38
## 4 2.952617 83 50
## 5 3.324626 49 24
## 6 3.019444 90 49
The default call to NR_logit
returns a data.frame with the intercept
and slope MLE.
NR_logit(x, y, n)
## intercept slope
## 1 0.5229187 -0.03921141
Suppose we wanted to see the iterative process, i.e. the incremental
improvements in the estimated parameters. This can be accomplished by
setting the argument verbose = TRUE
(which is FALSE
by default).
NR_logit(x, y, n, verbose = TRUE)
## [1] "Initial Value: intercept = 0 slope = 0"
## [1] "Iteration 1 : intercept = 0.5127 , slope = -0.037627"
## [1] "Iteration 2 : intercept = 0.52291 , slope = -0.039209"
## [1] "Iteration 3 : intercept = 0.522919 , slope = -0.039211"
## [1] "Iteration 4 : intercept = 0.522919 , slope = -0.039211"
## [1] "MLE are intercept = 0.522919 and slope = -0.039211"
## intercept slope
## 1 0.5229187 -0.03921141
It shows the initial value used (which is always (0, 0)), the updated estimates at each iteration, and the final MLE.
The stopping criterion occurs when the absolute error of either
parameter with its previous value falls below a certain tolerance level.
By default the tolerance is set to 1e-6. If we wanted to increase
the accuracy of our estimate, we can manually set a lower tolerance by
adding the argument tolerance = 1e-10
, for example.
NR_logit(x, y, n, tol = 1e-15, verbose = TRUE)
## [1] "Initial Value: intercept = 0 slope = 0"
## [1] "Iteration 1 : intercept = 0.5127 , slope = -0.037627"
## [1] "Iteration 2 : intercept = 0.52291 , slope = -0.039209"
## [1] "Iteration 3 : intercept = 0.522919 , slope = -0.039211"
## [1] "Iteration 4 : intercept = 0.522919 , slope = -0.039211"
## [1] "Iteration 5 : intercept = 0.522919 , slope = -0.039211"
## [1] "MLE are intercept = 0.522919 and slope = -0.039211"
## intercept slope
## 1 0.5229187 -0.03921141
Notice that the Newton-Raphson takes an additional iteration to converge because the tolerance has been set to a smaller value. In the context of this example, there is no significant difference from our previous estimate.
The second function of this package uses the EM algorithm to estimate
the frequency of blood alleles (A, B, or O) in a population given their
phenotypic frequencies. An iterative method is needed because closed
form solutions cannot be obtained. The allele frequencies are
intertwined within the six possible blood genotypes, forming the latent
variables in our problem. More details are in the documentation via
?EM_blood
.
Using EM_blood
, you can quickly obtain the three MLE by providing the
total number of people who have A, B, AB, and O blood types. The
arguments tol
and verbose
are also implemented and function the
exact same way as in NR_logit
.
Let's work through a simple example. Suppose the phenotypic frequencies are as given below:
# Number of people who have each blood type in population
A <- 80; B <- 45; AB <- 13; O <- 100
Then the allele frequencies are:
EM_blood(A, B, AB, O)
## pA_hat pB_hat pO_hat
## 1 0.219678 0.130473 0.649849
The prevalence of the B antigen is the lowest in this population whereas having no antigen has the highest probability. Note that the sum of the probabilities is always 1 (with some rounding error).
Again, we can inspect the iterative process using verbose = TRUE
:
EM_blood(A, B, AB, O, verbose = TRUE)
## [1] "Initial Probs: pA = 0.333333 pB = 0.333333 pO = 0.333333"
## [1] "Iteration 1 : pA = 0.251401 pB = 0.153361 pO = 0.595238"
## [1] "Iteration 2 : pA = 0.224682 pB = 0.132638 pO = 0.642681"
## [1] "Iteration 3 : pA = 0.220385 pB = 0.130692 pO = 0.648923"
## [1] "Iteration 4 : pA = 0.219775 pB = 0.130498 pO = 0.649728"
## [1] "Iteration 5 : pA = 0.219691 pB = 0.130476 pO = 0.649833"
## [1] "Iteration 6 : pA = 0.21968 pB = 0.130474 pO = 0.649846"
## [1] "Iteration 7 : pA = 0.219678 pB = 0.130473 pO = 0.649848"
## [1] "Iteration 8 : pA = 0.219678 pB = 0.130473 pO = 0.649849"
## [1] "MLE are pA = 0.219678 pB = 0.130473 pO = 0.649849"
## pA_hat pB_hat pO_hat
## 1 0.219678 0.130473 0.649849
Note that the initial probabilities correspond to the situation of equally likely allele frequency.
Note: this section would not normally appear in a vignette.
Trying to figure out how to get an overview.md
to be produced from
the vignette YAML. Ended up manually calling
rmarkdown::render("~/GitHub/naim/vignettes/overview.Rmd", "md_document")
.
No conflicts with R CMD Check
and Build & Reload
. Many thanks to
the course instructor and TAs for help on this!
Sometimes my Build & Reload
doesn't work unless I delete a
lock00
file in my home directory. Not sure why this is, but the
package builds successfully thereafter.
Uncertainty about LaTeX
style markup in R documentation. Seems
like Greek letters like \epsilon
work, but accents (\hat
),
operators (\frac
), etc. do not. Wondering how to make the help
files look as professional as possible.
My R CMD Check
passed everything except for one NOTE:
* checking top-level files ... NOTE Non-standard file/directory found at top level: 'README.Rmd' 'README.html'
.
To "fix" this I added README.Rmd
and README.html
to the
.Rbuildignore
file. Now the R CMD Check
succeeds without any
notes or warnings. My question is whether this workflow would
coincide with CRAN guidelines after my modification to the
.Rbuildignore
file.
There seems to be a lot of repeated information in my documentation.
I guess it's best to be very clear about what your package does, but
I found myself writing similar material in my vignette
, README
,
?naim
documentation, and DESCRIPTION
file. It would be useful to
have some more concrete guidance on any subtle differences,
otherwise I would simply just copy over explanations, examples, etc.
Testing is very useful and forces me to look for loop holes in my function definition. May seem tedious at first but worth the effort if you want to write a professional package.
The two functions in this package were written by me very recently for another course called STAT 560 Statistical Inference I. I had the opportunity to streamline the functions, polish them up, and write explanations. No better way to learn the material than to incorporate them as functions in my own R package!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.