This is a collection of tutorials that show how to use the R.ROSETTA package.
Installation from GitHub requires the devtools package:
install.packages("devtools")
Installing and loading the R.ROSETTA package from GitHub:
library(devtools)
install_github("komorowskilab/R.ROSETTA")
library(R.ROSETTA)
The input is a decision table in the form of a data.frame, where columns represent features and rows represent objects. The last column of the table must contain the decision classes.
Decision table format:

            feature_1  feature_2  ...  feature_n  outcome
object_1    ...        ...        ...  ...        case
object_2    ...        ...        ...  ...        control
object_3    ...        ...        ...  ...        case
...         ...        ...        ...  ...        ...
object_n    ...        ...        ...  ...        case
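A table of this shape can be built in a few lines of base R. The sketch below uses hypothetical feature names and values; the only requirements from the format above are that features sit in columns and the decision classes sit in the last column.

```r
# A minimal decision table: features in columns, objects in rows,
# decision classes in the last column
dt <- data.frame(
  feature_1 = c(1.2, 0.8, 1.5, 0.9),
  feature_2 = c(3.1, 2.7, 3.3, 2.5),
  outcome   = c("case", "control", "case", "control")
)
rownames(dt) <- paste0("object_", 1:4)
str(dt)
```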
To deal with complex decision tables, we suggest using feature selection methods first.
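The package does not prescribe a particular feature selection method. As an illustration only, one common approach for a two-class table is to rank features by a per-feature two-sample t-test and keep the top-ranked ones; the sketch below does this in base R on random data (all names and cut-offs are hypothetical).

```r
set.seed(1)
# Toy decision table: 50 objects, 200 random features, binary outcome last
dt <- data.frame(matrix(rnorm(50 * 200), nrow = 50))
dt$outcome <- rep(c("case", "control"), each = 25)

# Rank each feature by a two-sample t-test against the decision column
pvals <- sapply(dt[, -ncol(dt)], function(f) {
  t.test(f ~ dt$outcome)$p.value
})

# Keep, e.g., the 20 best-ranked features plus the decision column
keep <- names(sort(pvals))[1:20]
dtSmall <- dt[, c(keep, "outcome")]
dim(dtSmall)
```

The reduced table dtSmall keeps the decision-table format and can be passed to rosetta() directly.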
You can create a synthetic decision table with a predefined correlation structure. Let's assume that we want a decision table with 100 objects and 20 features, where the level of feature-feature correlation is 0.4 and the level of feature-decision correlation is 0.6. In this example the outcome is balanced and has two decision classes. To generate such data, type:
dt <- synData(nFeatures=20, rf=0.4, rd=0.6, nObjects=100, nOutcome=2, unbalanced=F, seed=1)
Now we add two pairs of correlated features with feature-decision correlations of 0.1 and 0.7, and two triples of correlated features with feature-decision correlations of 0.8 and 0.2. We set the feature-feature correlation to a constant 0.5. The parameters are given as numeric vectors; the order of the nFeatures, rf and rd vectors must correspond to each other:
dt <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.5,0.5,0.5,0.5,0.5), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=100, nOutcome=2, unbalanced=F, seed=1)
The package contains a gene expression dataset from a case-control study on autism prediction. The decision table is available as a data.frame named autcon.
dt <- autcon
The main function to run the classifier is rosetta(). Its default parameters are set for processing the sample datasets. To display the parameters, open the help page with ?rosetta.
To create a rough set-based model from continuous data, type:
# synthetic data
dt <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.1,0.1,0.1,0.8,0.8), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=100, nOutcome=2, unbalanced=F, seed=1)
out_s <- rosetta(dt)

# gene expression data
out_ge <- rosetta(autcon)
As you may notice, the default parameters are set for processing continuous data. If your decision table contains only discrete features, please use the option discrete=TRUE.
Here is an example of processing synthetic data with discrete values:
dt <- synData(nFeatures=c(5,5,3,2,2), rf=c(0.2,0.3,0.2,0.4,0.4), rd=c(0.2,0.3,0.4,0.5,0.6), discrete = T, levels = 3, labels = c("low", "medium", "high"))
outd <- rosetta(dt, discrete = T)
For data containing a mixture of continuous and discrete features, please use the option discrete=FALSE and assign one of the following object types to each feature: numeric, character, factor or logical.
dt <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.1,0.1,0.1,0.8,0.8), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=100, nOutcome=2, unbalanced=F, seed=1)

# change two of the features from group 5 to discrete
dt$f5.2_rf0.8_rd0.2 <- as.factor(cut(dt$f5.2_rf0.8_rd0.2, 3, labels = c("low", "medium", "high")))  # or as.character
dt$f5.3_rf0.8_rd0.2 <- as.factor(cut(dt$f5.3_rf0.8_rd0.2, 3, labels = c("low", "medium", "high")))  # or as.character

out <- rosetta(dt, discrete = F)
The rosetta() function generates two main outputs. The rule table is stored as a data.frame under $main, and the model quality is stored as a table under $quality.
The main output of the function is a table containing the collection of rules. The rule components and all statistics are collected in separate columns; individual values within a column are comma-separated.
dt <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.2,0.2,0.3,0.5,0.5), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=100, nOutcome=2, unbalanced=F, seed=1)
out <- rosetta(dt)
out$main
Main output:

          features   levels  decision  support  accuracy  coverage  cuts  statistics
rule_1    F1,F2      1,2     case      43       0.97      0.174     ...   ...
rule_2    F2,F4,F5   2,1,3   control   40       0.95      0.142     ...   ...
rule_3    F2         3       case      36       0.89      0.097     ...   ...
...       ...        ...     ...       ...      ...       ...       ...   ...
rule_n    F7,F1      3,1     control   10       0.64      0.014     ...   ...
Note that the cuts column applies only to continuous data; it stays empty when the model is built with the discrete=T option. Estimated accuracy and AUC values are collected in the $quality table. The values are taken from the n-fold cross-validation process.
dt <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.1,0.1,0.7,0.7,0.7), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=100, nOutcome=2, unbalanced=F, seed=1)
out <- rosetta(dt)
out$quality
If you need information about the AUC, take a look at the code below:
dt <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.1,0.1,0.7,0.7,0.7), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=100, nOutcome=2, unbalanced=F, seed=1)
out <- rosetta(dt, roc = TRUE, clroc = "D1")
out$quality
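The quality estimates come from n-fold cross-validation: the objects are split into folds, each fold is held out in turn, and the per-fold accuracies are averaged. The base-R sketch below illustrates only the fold-splitting and averaging idea, using a trivial majority-class "model" rather than the ROSETTA algorithm.

```r
set.seed(1)
outcome <- sample(c("case", "control"), 100, replace = TRUE)

# Assign each object to one of 10 folds
k <- 10
folds <- sample(rep(1:k, length.out = length(outcome)))

# For each fold: fit on the rest, test on the held-out fold.
# Here the "model" is just the majority class of the training part.
accPerFold <- sapply(1:k, function(i) {
  trainY <- outcome[folds != i]
  testY  <- outcome[folds == i]
  majority <- names(which.max(table(trainY)))
  mean(testY == majority)
})
mean(accPerFold)  # mean accuracy across folds, as reported in $quality
```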
To print rules in the classic IF-THEN form, use viewRules(). The table also displays information about the RHS (Right-Hand Side) support values.
out <- rosetta(autcon)
rules <- out$main
vRules <- viewRules(rules)
head(vRules)
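Conceptually, turning a rule row into an IF-THEN string just pairs the comma-separated features with their levels. The base-R sketch below shows the idea on a mock rule table with the columns from the main output (the rule values are hypothetical, and viewRules() may format its output differently).

```r
# Mock rule table with the columns shown in the main output
rules <- data.frame(
  features = c("F1,F2", "F2,F4,F5"),
  levels   = c("1,2",   "2,1,3"),
  decision = c("case",  "control"),
  stringsAsFactors = FALSE
)

# Build "IF F1=1 AND F2=2 THEN case"-style strings
ifThen <- mapply(function(f, l, d) {
  conds <- paste(strsplit(f, ",")[[1]], strsplit(l, ",")[[1]], sep = "=")
  paste0("IF ", paste(conds, collapse = " AND "), " THEN ", d)
}, rules$features, rules$levels, rules$decision, USE.NAMES = FALSE)

ifThen[1]  # "IF F1=1 AND F2=2 THEN case"
```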
R.ROSETTA allows you to recalculate a model against the input decision table. This step may be used to retrieve statistics when undersampling was performed.
Let's consider a rule with support 24 and accuracy 0.92. These values come from a model that was trained on smaller training sets created while balancing the data. Thanks to model recalculation we obtain support 32 and accuracy 0.95, values that now correspond to the full input decision table. To recalculate a model, run recalculateRules():
dt <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.1,0.1,0.7,0.7,0.7), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=100, nOutcome=2, unbalanced=F, seed=1)
out <- rosetta(dt)
rules <- out$main
newRules <- recalculateRules(dt, rules)
Additionally, model recalculation computes the support sets, which are added as the last columns of the data.frame object.
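Like the other rule statistics, a support-set column holds comma-separated values, so the object identifiers can be recovered with strsplit(). The sketch below uses a mock table; the column name supportSet and the identifiers are hypothetical, not the exact names produced by recalculateRules().

```r
# Mock recalculated rule table; the support-set column name is hypothetical
rules <- data.frame(
  decision   = c("case", "control"),
  supportSet = c("object_1,object_3,object_5", "object_2,object_4"),
  stringsAsFactors = FALSE
)

# Split the comma-separated identifiers into a list of character vectors
suppList <- strsplit(rules$supportSet, ",")
suppList[[1]]      # "object_1" "object_3" "object_5"
lengths(suppList)  # per-rule support-set sizes: 3 2
```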
We can visualize a specific rule as a heatmap or a boxplot. The plot presents the distribution of values for three support groups. This visualization is available only after recalculating the model.
out <- rosetta(autcon)
rules <- out$main
newRules <- recalculateRules(autcon, rules)

# rule heatmap
plotRule(autcon, newRules, type = "heatmap", ind = 1)

# rule boxplot
plotRule(autcon, newRules, type = "boxplot", ind = 1)
To test your model on external data, you may use the predictClass() function. The algorithm validates whether the discretization levels correspond to the external dataset. Make sure that the feature names match the names used in the model.
### continuous data
## 1. create a model
dt1 <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.1,0.1,0.7,0.7,0.7), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=100, nOutcome=2, unbalanced=F, seed=1)

## 2. validate the model (fewer objects, different seed, the same outcome)
dt2 <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.1,0.1,0.7,0.7,0.7), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=50, nOutcome=2, unbalanced=F, seed=2)

# store the decision
dt2_decision <- dt2$decision
# remove the decision from the table
dt2 <- dt2[,-length(dt2)]

out <- rosetta(dt1)
rules <- out$main

# we can predict new classes if we don't have the outcome
predictClass(dt2, rules)

# if the outcome is known, we can obtain the accuracy of the external prediction
predictClass(dt2, rules, validate = TRUE, defClass = dt2_decision)

### discrete data
## 1. create a model
dt1 <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.1,0.1,0.7,0.7,0.7), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=100, nOutcome=2, unbalanced=F, discrete=T, levels=3, labels=c("LOW","MEDIUM","HIGH"), seed=1)

## 2. validate the model (fewer objects, different seed, the same outcome)
dt2 <- synData(nFeatures=c(20,2,2,3,3), rf=c(0.1,0.1,0.7,0.7,0.7), rd=c(0.5,0.1,0.7,0.8,0.2), nObjects=50, nOutcome=2, unbalanced=F, discrete=T, levels=3, labels=c("LOW","MEDIUM","HIGH"), seed=2)

# store the decision
dt2_decision <- dt2$decision
# remove the decision from the table
dt2 <- dt2[,-length(dt2)]

out <- rosetta(dt1, discrete = T)
rules <- out$main

# we can predict new classes if we don't have the outcome
predictClass(dt2, rules, discrete = T)

# if the outcome is known, we can obtain the accuracy of the external prediction
predictClass(dt2, rules, discrete = T, validate = TRUE, defClass = dt2_decision)