The usual method of finding maximum likelihood estimates involves deriving the log likelihood function with respect to each parameter we want to estimate, then setting this equation to 0. But there are times when this is not possible. For example, in the logistic regression scenario, closed form solutions cannot be obtained. Thus we need to resort to some iterative method in order to compute these estimates. In this way, we usually say the MLE is found when the improvement after each iteration becomes negligible.
This package currently implements the Newton-Raphson method for simple logistic regression, and the EM algorithm for estimating the frequency of blood alleles.
First load the package via:
library(naim)
The NR_logit
function finds estimated parameters under simple logistic regression, where closed form solutions are not obtainable. Using the Newton-Raphson method, we supply an initial feasible estimate, and iteratively improve the estimate.
There are 3 required arguments: covariates, number of trials, and successes. More details are in the documentation via ?NR_logit
. Let's see an example:
set.seed(547) # Covariates x <- rnorm(100, mean = 3, sd = 0.2) # Number of trials n <- sample(1:100, replace = TRUE) # Successes y <- rbinom(100, size = n, prob = 0.6) # Data set head(data.frame(x, n, y))
The default call to NR_logit
returns a data.frame with the intercept and slope MLE.
NR_logit(x, y, n)
Suppose we wanted to see the iterative process, i.e. the incremental improvements in the estimated parameters. This can be accomplished by setting the argument verbose = TRUE
(which is FALSE
by default).
NR_logit(x, y, n, verbose = TRUE)
It shows the initial value used (which is always (0, 0)), the updated estimates at each iteration, and the final MLE.
The stopping criterion occurs when the absolute error of either parameter with its previous value falls below a certain tolerance level. By default the tolerance is set to 1e-6. If we wanted to increase the accuracy of our estimate, we can manually set a lower tolerance by adding the argument tolerance = 1e-10
, for example.
NR_logit(x, y, n, tol = 1e-15, verbose = TRUE)
Notice that the Newton-Raphson takes an additional iteration to converge because the tolerance has been set to a smaller value. In the context of this example, there is no significant difference from our previous estimate.
The second function of this package uses the EM algorithm to estimate the frequency of blood alleles (A, B, or O) in a population given their phenotypic frequencies.
An iterative method is needed because closed form solutions cannot be obtained. The allele frequencies are intertwined within the six possible blood genotypes, forming the latent variables in our problem. More details are in the documentation via ?EM_blood
.
Using EM_blood
, you can quickly obtain the three MLE by providing the total number of people who have A, B, AB, and O blood types. The arguments tol
and verbose
are also implemented and function the exact same way as in NR_logit
.
Let's work through a simple example. Suppose the phenotypic frequencies are as given below:
# Number of people who have each blood type in population A <- 80; B <- 45; AB <- 13; O <- 100
Then the allele frequencies are:
EM_blood(A, B, AB, O)
The prevalence of the B antigen is the lowest in this population whereas having no antigen has the highest probability. Note that the sum of the probabilities is always 1 (with some rounding error).
Again, we can inspect the iterative process using verbose = TRUE
:
EM_blood(A, B, AB, O, verbose = TRUE)
Note that the initial probabilities correspond to the situation of equally likely allele frequency.
Note: this section would not normally appear in a vignette.
Trying to figure out how to get an overview.md
to be produced from the vignette YAML. Ended up manually calling rmarkdown::render("~/GitHub/naim/vignettes/overview.Rmd", "md_document")
. No conflicts with R CMD Check
and Build & Reload
. Many thanks to the course instructor and TAs for help on this!
Sometimes my Build & Reload
doesn't work unless I delete a lock00
file in my home directory. Not sure why this is, but the package builds successfully thereafter.
Uncertainty about LaTeX
style markup in R documentation. Seems like Greek letters like \epsilon
work, but accents (\hat
), operators (\frac
), etc. do not. Wondering how to make the help files look as professional as possible.
My R CMD Check
passed everything except for one NOTE: * checking top-level files ... NOTE Non-standard file/directory found at top level: 'README.Rmd' 'README.html'
. To "fix" this I added README.Rmd
and README.html
to the .Rbuildignore
file. Now the R CMD Check
succeeds without any notes or warnings. My question is whether this workflow would coincide with CRAN guidelines after my modification to the .Rbuildignore
file.
There seems to be a lot of repeated information in my documentation. I guess it's best to be very clear about what your package does, but I found myself writing similar material in my vignette
, README
, ?naim
documentation, and DESCRIPTION
file. It would be useful to have some more concrete guidance on any subtle differences, otherwise I would simply just copy over explanations, examples, etc.
Testing is very useful and forces me to look for loop holes in my function definition. May seem tedious at first but worth the effort if you want to write a professional package.
The two functions in this package were written by me very recently for another course called STAT 560 Statistical Inference I. I had the opportunity to streamline the functions, polish them up, and write explanations. No better way to learn the material than to incorporate them as functions in my own R package!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.