ROC-based analyses evaluate the binary classification performance of a classifier. In other words, this type of analysis evaluates how well a classifier differentiates between two outcomes, classes or categories.
In real-world scenarios, classification processes usually present more than two possible outcomes. These scenarios can be dichotomized by selecting one outcome as the condition of interest, that is, the one to be predicted, and treating all the others as not being it.
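This vignette aims to show how ROCnGO selects the condition of interest from the response and how this selection can be changed, including the use of the `.condition` argument.

We'll start by loading ROCnGO and some other libraries which will help in the analysis.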
    library(ROCnGO)
    library(dplyr)
As mentioned before, the outcomes of these analyses can be dichotomized into being the condition of interest $(D=1)$ or not $(D=0)$. To do so, ROCnGO internally transforms the variable holding each case outcome (`response`) into a factor with values 1 and 0, representing presence or absence of the condition.
Take the following example with three different outcomes: if we consider setosa as the condition of interest, the following factor is generated.
| Case | Response   | Factor |
|------|:----------:|-------:|
| 1    | Setosa     | 1      |
| 2    | Versicolor | 0      |
| 3    | Virginica  | 0      |
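The table above can be reproduced with a few lines of base R. This is only a sketch of the idea, not ROCnGO's internal code; the names `response`, `condition` and `dichotomized` are illustrative.

```r
# Hypothetical response with three outcomes, one case each
response <- c("Setosa", "Versicolor", "Virginica")

# Outcome chosen as the condition of interest (D = 1)
condition <- "Setosa"

# Cases matching the condition become 1, all the others become 0
dichotomized <- factor(ifelse(response == condition, 1, 0), levels = c(0, 1))

# Same layout as the table: Case, Response, Factor
data.frame(
  Case = seq_along(response),
  Response = response,
  Factor = dichotomized
)
```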
`response` may be of different types, so in order to select by default which of its values corresponds to the condition of interest, library functions follow criteria based on the variable type:

- For some variable types, the class is selected by applying `sort()` over all possible options.
- For factors, the class is the first element in `levels()`.

All other classes not identified as the class to predict are combined into a common category, labelled as 0.
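The default criteria can be illustrated in base R. The sketch below assumes that the first value returned by `sort()` is the one taken as the condition of interest, which is an assumption of this example and not a confirmed detail of the library; the variable names are illustrative.

```r
# Character response: sort() orders the distinct values alphabetically
chr_response <- c("virginica", "setosa", "versicolor", "setosa")
chr_default <- sort(unique(chr_response))[1]  # assumed rule: first sorted value
chr_default

# Factor response: levels() returns the order defined in the factor,
# which does not have to be alphabetical
fct_response <- factor(
  chr_response,
  levels = c("virginica", "versicolor", "setosa")
)
fct_default <- levels(fct_response)[1]
fct_default
```

Note that the same values lead to a different default class depending on the variable type, which is one reason to check `levels()` before running an analysis on a factor.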
`.condition` argument

Sometimes, the default criteria used by functions may not be desirable. If we want to change the category identified as the condition of interest, we can use the `.condition` argument.
This argument takes as input one of the values of `response`, setting it as the condition of interest of the classifier.
These behaviours can be tested with the following examples. First, we will create a small dataset from a subset of the `iris` dataset.
    # Create a small subset of iris with 5 random flowers of each species
    iris_subset <- as_tibble(iris) %>%
      group_by(Species) %>%
      slice_sample(n = 5) %>%
      ungroup()

    iris_subset
Once we have created our dataset, we can check the performance of the different variables as predictors of the species. For this task we can use the `summarize_dataset()` function.
    # Check levels in Species
    levels(iris_subset$Species)

    # Summarize dataset classifiers
    iris_results <- summarize_dataset(
      iris_subset,
      response = Species,
      ratio = "tpr",
      threshold = 0.9
    )

    iris_results$data
As we can see, Sepal.Width achieves the best performance in the dataset, at least for the setosa species. As mentioned before, this class has been selected as the condition of interest because it is the first element in the levels of Species. Furthermore, the performance of Sepal.Width as a setosa classifier is consistent with the data, since this species presents slightly higher values for this variable.
Now, if we want to repeat the analysis considering virginica as the species of interest, we can use the `.condition` argument.
    # Summarize dataset classifiers with virginica species as D=1
    virginica_results <- summarize_dataset(
      iris_subset,
      response = Species,
      ratio = "tpr",
      threshold = 0.9,
      .condition = "virginica"
    )

    virginica_results$data
As we can see, the new results differ considerably from the previous ones. Now Sepal.Length, Petal.Length and Petal.Width behave as better classifiers than Sepal.Width. These results can be qualitatively matched with the values in the dataset, where these variables take higher values for this species.
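The qualitative match between classifier performance and the data can be inspected with per-species means. The following is a minimal sketch in base R; it uses the full `iris` dataset rather than the random `iris_subset`, so the exact values will differ from those of the subset. The names `species_means` and `top_species` are illustrative.

```r
# Mean of each measurement per species, computed on the full iris dataset
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means

# Species with the highest mean for each measurement
top_species <- sapply(
  species_means[-1],
  function(x) as.character(species_means$Species[which.max(x)])
)
top_species
```

In the full dataset, setosa has the highest mean Sepal.Width, while virginica has the highest mean for the other three measurements, which agrees with the classifiers that stand out in each analysis.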
Sometimes, it may be more useful to select the condition of interest manually. This may be the case, for example, when working with a variable type that cannot be easily handled.
In order to select this condition manually, we can simply transform `response` to another type that is recognized by the library; `.condition` may then still be used to specify which class to use.
Alternatively, we can transform `response` to a factor with values 0 and 1, where the first item in `levels()` is 0. The library recognizes this variable as not needing any treatment, so it can be used to easily define new responses.
We can check this manual selection with the following example. In this scenario, we will suppose that we cannot perform calculations directly on Species, so we need to define new variables to do so.
    # Create new variables to evaluate "virginica" species classifiers
    iris_subset <- iris_subset %>%
      mutate(
        Species_int = ifelse(Species == "virginica", 2L, 1L),
        Species_fct = factor(
          ifelse(Species == "virginica", 1, 0),
          levels = c(0, 1)
        )
      )

    # Check new variables
    iris_subset[, c("Species", "Species_int", "Species_fct")]
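Before running the analysis, it can be worth confirming that both manual encodings mark exactly the same cases. The following base R sketch applies the same transformations to the full `iris` dataset so that it runs on its own; `species`, `species_int` and `species_fct` are illustrative names mirroring the columns created above.

```r
# Same encodings as above, applied to the full iris dataset
species <- iris$Species
species_int <- ifelse(species == "virginica", 2L, 1L)
species_fct <- factor(ifelse(species == "virginica", 1, 0), levels = c(0, 1))

# Cross-tabulate the original classes against each encoding
table(species, species_int)
table(species, species_fct)

# Both encodings should flag the same cases as the condition of interest
same_cases <- all((species_int == 2L) == (species_fct == "1"))
same_cases
```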
Now we can evaluate the classifier performance.
    # Select predictors
    predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

    # Check performance of virginica classifiers with .condition = 2
    int_results <- summarize_dataset(
      iris_subset,
      predictors = predictors,
      response = Species_int,
      ratio = "tpr",
      threshold = 0.9,
      .condition = 2
    )

    int_results$data

    # Check performance of virginica classifiers with factor
    fct_results <- summarize_dataset(
      iris_subset,
      predictors = predictors,
      response = Species_fct,
      ratio = "tpr",
      threshold = 0.9
    )

    fct_results$data
As we can see, the results for each scenario correspond to the ones obtained in the previous section, where we evaluated the Species variable directly using `.condition = "virginica"`.