knitr::opts_chunk$set(echo = TRUE)

This package is part of a project to determine the best way to evaluate data with rounded values (binned data). Our task was to implement an EM algorithm with a penalty term for mixtures of t-distributions.

The goal of this vignette is to explain all important functions contained in this package, demonstrate their usage with examples, and show how the results can be visualized.

library(t.mix)
data("ZD", package = "EUCASTData")

First we establish three example datasets, which will be used throughout this vignette. These datasets are subsets of our original data. All three focus on the bacterium Escherichia coli and differ only in the antimicrobial agent: "Ampicillin", "Piperacillin" and "Mecillinam", respectively.

#extract subsets
ZDs1 <- subset(ZD, Antimicrobial == "Ampicillin" & Bacterium == "Escherichia coli")
ZDs2 <- subset(ZD, Antimicrobial == "Piperacillin" & Bacterium == "Escherichia coli")
ZDs3 <- subset(ZD, Antimicrobial == "Mecillinam" & Bacterium == "Escherichia coli")
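As a quick optional check, we can look at the dimensions of these subsets to confirm that each antimicrobial was found in the data (the number of columns depends on the layout of the ZD data):

#quick check of the extracted subsets (optional)
lapply(list(ZDs1, ZDs2, ZDs3), dim)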

Main functions for analysis

To get an idea of how our functions work, we provide an overview of our algorithm below, which should help in understanding the functions and their purpose.

\begin{algorithm}[H] \caption{EM Algorithm for mixtures of t's with binned data (grouped)}\label{euclid} \begin{algorithmic}[1] \Procedure{}{} \State Calculate starting values $\pi_k^{(0)}, \mu_k^{(0)}, \sigma_k^{(0)}, \nu_k^{(0)}$, \; $k=1, ..., K$ \State Set $\alpha_0$, $\beta_0$

    \Repeat
    \State $\tau_{k, j1}^{(i)} = \frac{\pi_k^{(i)} f_{\tilde Y}(j|\mu_k^{(i)}, \sigma_k^{(i)},\nu_k^{(i)})}{\sum \limits_{c=1}^K \pi_c^{(i)} f_{\tilde Y}(j|\mu_c^{(i)}, \sigma_c^{(i)}, \nu_c^{(i)})}$ for $k=1, ..., K$, $j = 1, ..., J$
    \State $\pi_k^{(i+1)} \gets\frac{1}{\tilde N} \sum\limits_{j=1}^{J}  n_j \tau_{k, j1}^{(i)} \qquad \text{for } k=1, ..., K$

    \State $(\mu_k^{(i+1)}, \sigma_k^{(i+1)}, \nu_k^{(i+1)}) \gets \arg \max \; Q_2(\mu, \sigma, \nu; \Theta^{(i)}) + Q_3(\sigma; \alpha_0, \beta_0)$ for $k=1, ..., K$
    \Until{convergence criterion is fulfilled}
    \State \Return $\Theta^{(i+1)}$

    \EndProcedure
\end{algorithmic}

\end{algorithm}
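To make the E-step more concrete, below is a minimal sketch in plain R of how the responsibilities $\tau_{k, j1}^{(i)}$ could be computed. This is only an illustration, not the internal implementation of the package; it assumes that the binned density $f_{\tilde Y}(j|\mu, \sigma, \nu)$ is the probability mass of a location-scale t-distribution on the interval $(j-1, j]$.

#illustrative sketch of the E-step (not the internal t.mix implementation)
#assumption: f(j | mu, sigma, nu) is the mass of a location-scale t-distribution in (j-1, j]
binned.t.density <- function(j, mu, sigma, nu) {
  pt((j - mu) / sigma, df = nu) - pt((j - 1 - mu) / sigma, df = nu)
}
e.step <- function(j, pi, mu, sigma, nu) {
  dens <- sapply(seq_along(pi),
                 function(k) pi[k] * binned.t.density(j, mu[k], sigma[k], nu[k]))
  dens / rowSums(dens)  #normalise so the responsibilities sum to one for each bin j
}
#example: responsibilities for two components on the observed zone diameters 6 to 50
tau <- e.step(j = 6:50, pi = c(0.5, 0.5), mu = c(15, 25), sigma = c(2, 3), nu = c(5, 5))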

The two main functions for analysing the data are \textbf{total.function} and \textbf{model.select.t.mixture}. Most of the other functions in this package either do preliminary work or are called by one or both of these main functions.

The following diagrams include a histogram of the data (grey bars), a kernel density estimate (red line) and the fitted t-mixture density (blue line). Please note that the fitted t-mixture density only captures the non-resistant observations. Furthermore, the estimated ECOFF is drawn as a vertical black dashed line.
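The histogram and kernel density layers of such a diagram can also be reproduced directly from the raw counts with base R graphics. The following is only a sketch and is independent of the plot produced by total.function with draw=TRUE; the fitted mixture density and the ECOFF would have to be taken from its return value.

#sketch: histogram and kernel density estimate from raw counts (independent of draw=TRUE)
draw.data <- function(n, j, ecoff = NULL) {
  obs <- rep(j, times = n)                        #expand counts to individual observations
  hist(obs, breaks = c(j[1] - 1, j), freq = FALSE,
       col = "grey", border = "white", xlab = "zone diameter", main = "")
  lines(density(obs), col = "red")                #kernel density estimate
  if (!is.null(ecoff)) abline(v = ecoff, lty = 2) #estimated ECOFF, if available
}

For example, draw.data(n1, j) shows these two layers for the first dataset once n1 has been extracted below.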

We will now proceed with evaluating the first example dataset.

####### First data set #########

#extract data necessary for analysis

j = 6:50                    #zone diameters (in mm) at which observations were recorded
n1 = as.integer(ZDs1[4:48]) #observed frequencies for these zone diameters (columns 4 to 48)
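As a small optional check before fitting, we can verify that there is exactly one frequency per observed zone diameter and look at the total number of observations:

#optional sanity check: one frequency per observed zone diameter
stopifnot(length(n1) == length(j))
sum(n1)  #total number of observations in this dataset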

#one component
(result1 = total.function(n1, j, K=1, atoms=6, memb.exp=2, draw=TRUE))

#two components
(result1 = total.function(n1, j, K=2, atoms=6, memb.exp=2, draw=TRUE))

#three components
(result1 = total.function(n1, j, K=3, atoms=6, memb.exp=2, draw=TRUE))

The last step is the model selection process:

#model selection (up to Kmax=5)
(selected1 <- model.select.t.mixture(n1, j, Kmax=5, memb.exp=2, atoms=6))

With an ECOFF estimate of 13.305, our analysis shows that K = 2 provides the best fitting model for the first example.

This process is repeated for the second example dataset, which contains the observations for the antibiotic "Piperacillin".

####### Second data set ##########

n2 = as.integer(ZDs2[4:48])

#one component
(result2 = total.function(n2, j, K=1, atoms=6, memb.exp=2, draw=TRUE))

#two components
(result2 = total.function(n2, j, K=2, atoms=6, memb.exp=2, draw=TRUE))

#three components
(result2 = total.function(n2, j, K=3, atoms=6, memb.exp=2, draw=TRUE))
#model selection (up to Kmax=5)
(selected2 <- model.select.t.mixture(n2, j, Kmax=5, memb.exp=2, atoms=6))

The histograms show a bimodal distribution, apart from the spike at 6 caused by the resistant type. They also show that using only one component is not suitable, which already becomes obvious from the ECOFF estimate of about -6.

After analysing the different models we conclude that K = 2 (the number of non-resistant components) gives the best results. The estimated ECOFF of 19.8 almost matches the official value of 20, whereas with K = 3 the ECOFF estimate is about 21.

On to the third example dataset, which contains the observations for the antibiotic "Mecillinam".

###### Third data set #############


n3 = as.integer(ZDs3[4:48])

#one component
(result3 = total.function(n3, j, K=1, atoms=6, memb.exp=2, draw=TRUE))

#two components
(result3 = total.function(n3, j, K=2, atoms=6, memb.exp=2, draw=TRUE))

#three components
(result3 = total.function(n3, j, K=3, atoms=6, memb.exp=2, draw=TRUE))

#four components
(result3 = total.function(n3, j, K=4, atoms=6, memb.exp=2, draw=TRUE))

To be on the safe side, this time we fit four models, ranging from K = 1 to K = 4. All that is left is to find the model with the best fit. The histogram only hints at a slight elevation right before the main mode, which makes it more difficult to find the best fit.

#model selection (up to Kmax=5)
(selected3 <- model.select.t.mixture(n3, j, Kmax=5, memb.exp=2, atoms=6, optim.method="Nelder-Mead"))

This time the \textbf{model.select.t.mixture} function shows that the model with K = 3 gives the best results, as indicated by both AIC and BIC.


