The UCI data set \texttt{dermatitis} [@altay] consists of clinical attributes, histopathological attributes, the age of the patient,
and a class variable with six different skin diseases.
Many of the classical machine learning algorithms have been applied to the dataset \texttt{dermatitis} [@liu2015fast]. They all achieve a prediction accuracy above $95\%$ and some even above $99\%$. But...:
Given a new patient $y$, we want to test the hypotheses
\begin{align}
H_1: & y \text{ has psoriasis} \\
H_2: & y \text{ has seborrheic dermatitis} \\
H_3: & y \text{ has lichen planus} \\
H_4: & y \text{ has pityriasis rosea} \\
H_5: & y \text{ has chronic dermatitis} \\
H_6: & y \text{ has pityriasis rubra pilaris}
\end{align}
Since the hypotheses are mutually exclusive, we do not correct for multiple hypothesis testing (but the user can do so by setting the significance level accordingly).
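One conventional way to set the significance level "accordingly" is a Bonferroni correction. This is an illustrative choice and not something prescribed here. Assuming $k = 6$ hypotheses and an overall level of $\alpha = 0.05$ (both values are assumptions for the example), each hypothesis would be tested at

$$
\alpha^{\ast} = \frac{\alpha}{k} = \frac{0.05}{6} \approx 0.0083 .
$$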
We first show how to test $H_1$. Start by extracting the psoriasis data:
library(dplyr)
library(molic)
y <- unlist(derma[80, -35]) # a patient with seborrheic dermatitis
psor <- derma %>%
  filter(ES == "psoriasis") %>%
  dplyr::select(-ES)
Next, we fit the interaction graph for the psoriasis patients:
library(ess)
g <- fit_graph(psor, q = 0, trace = FALSE)
We can color the nodes corresponding to clinical attributes (red), histopathological attributes (green) and the age variable (gray):
vs <- names(adj_lst(g))
vcol <- structure(vector("character", length(vs)), names = vs)
vcol[grepl("c", vs)] <- "tomato"  # clinical attributes
vcol[grepl("h", vs)] <- "#98FB98" # histopathological attributes
vcol["age"] <- "gray"             # age variable
plot(g, vcol, vertex.size = 10, vertex.label = NA)
The take-home message is that we cannot assume independence between the attributes for the psoriasis patients: the interaction graph shows many associations.
set.seed(300718)
m <- fit_outlier(psor, g, y)
print(m)
Notice that the number of observations is $112$ even though we have only observed $111$ psoriasis patients. This is because, under the hypothesis $H_1$, the new observation $y$ has psoriasis and is therefore counted among the psoriasis patients. The other summary statistics are self-explanatory.
plot(m)
The red area is the critical region (here 5%) and the dotted line is the observed test statistic (the deviance) of $y$. Since the dotted line is outside the critical region, we cannot reject that $y$ has psoriasis.
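The decision rule read off the plot can be sketched in base R. This is a toy illustration under stated assumptions, not the internals of \texttt{fit\_outlier}: the vector `sim_dev` and the value `obs_dev` are hypothetical stand-ins for the simulated and observed deviances, and the critical region is assumed to be the right tail of the simulated distribution.

```r
# Toy sketch: sim_dev and obs_dev are hypothetical stand-ins, NOT output
# from fit_outlier(); the chi-squared draws only give the toy a plausible shape.
set.seed(1)
sim_dev <- rchisq(10000, df = 30)  # stand-in for simulated deviances under the hypothesis
obs_dev <- 35                      # stand-in for the observed deviance of y
alpha   <- 0.05                    # significance level (5%, as in the plot)

# Start of the assumed right-tail critical region: the empirical (1 - alpha) quantile
crit <- unname(quantile(sim_dev, probs = 1 - alpha))

# Empirical p-value: share of simulated deviances at least as large as obs_dev
p_val <- mean(sim_dev >= obs_dev)

# Reject the hypothesis only if the observed deviance falls in the critical region
reject <- obs_dev > crit
```

Comparing `obs_dev` with `crit` and comparing `p_val` with `alpha` express the same rule up to ties in the simulated values.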
We can use the \texttt{fit\_multiple\_models} function to test all six hypotheses as follows.
set.seed(300718)
mm <- fit_multiple_models(derma, y, "ES", q = 0, trace = FALSE)
plot(mm)
knitr::include_graphics("multiple_models.png")
Thus, we cannot reject that $y$ has psoriasis, seborrheic dermatitis, or pityriasis rosea. This is conservative compared to classification methods and hence a little safer. A medical expert should continue the investigation from here.