README.md

License: GPL v2

The ScottKnott Effect Size Difference (ESD) test (Version 3.0, the development branch)

The Scott-Knott Effect Size Difference (ESD) test is a multiple comparison approach that leverages hierarchical clustering to partition the set of treatment averages (e.g., means) into statistically distinct groups with non-negligible differences [Tantithamthavorn et al., (2018) https://doi.org/10.1109/TSE.2018.2794977]. It is a variant of the Scott-Knott test that considers the magnitude of the difference (i.e., the effect size) of treatment means within a group and between groups. Therefore, the Scott-Knott ESD test (v2+) produces a ranking of treatment means while ensuring that (1) the magnitude of the difference among all of the treatments in each group is negligible; and (2) the magnitude of the difference between treatments in different groups is non-negligible. The Scott-Knott ESD test is recommended over other multiple comparison tests because it does not produce overlapping groups, unlike other post-hoc tests (e.g., Nemenyi’s test).

The Parametric ScottKnott ESD test (available in Version 2.0+)

The Parametric Scott-Knott ESD test is a mean comparison approach that leverages hierarchical clustering to partition the set of treatment means. It is based on the ANOVA assumptions of the original Scott-Knott test (e.g., the assumptions of normal distributions, homogeneous distributions, and the minimum sample size). The mechanism of the Scott-Knott ESD test is made up of 2 steps.

Unlike the earlier version of the Scott-Knott ESD test (v1.x), which post-processes the groups produced by the Scott-Knott test, the Scott-Knott ESD test (v2.x) pre-processes the groups by merging pairs of statistically distinct groups that have a negligible difference.
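
The merging criterion above depends on an effect size between two groups. As a rough illustration only, the following Python sketch computes Cohen's d with a pooled standard deviation and treats |d| < 0.2 as negligible. The function names, the pooled-variance formulation, and the 0.2 threshold (Cohen's conventional cut-off) are assumptions made for this example and are not taken from the package source.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    diff = mean(x) - mean(y)
    if pooled_var == 0:
        # Neither sample has any spread: equal means give 0, otherwise the effect is unbounded.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / sqrt(pooled_var)


def should_merge(x, y, threshold=0.2):
    """Hypothetical merge rule: merge two groups when |d| is below the threshold."""
    return abs(cohens_d(x, y)) < threshold


# Made-up measurements for three treatments.
a = [10.0, 11.0, 12.0, 13.0]
b = [10.1, 11.1, 12.1, 13.1]
c = [20.0, 21.0, 22.0, 23.0]

print(should_merge(a, b))  # a and b differ only slightly relative to their spread
print(should_merge(a, c))  # a and c are far apart
```

Under these assumptions, two groups whose means differ by a small fraction of their pooled standard deviation would be merged, while clearly separated groups would stay distinct.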

The Non-Parametric ScottKnott ESD test (available in Version 3.0, the development branch)

The Non-Parametric Scott-Knott ESD (NPSK) test is a multiple comparison approach that leverages hierarchical clustering to partition the set of median values of techniques (e.g., medians of variable importance scores, medians of model performance) into statistically distinct groups with non-negligible differences. The NPSK test does not require the assumptions of normal distributions, homogeneous distributions, and the minimum sample size. The mechanism of the Non-Parametric Scott-Knott ESD test is made up of 2 steps.
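
As a side illustration of what a "non-negligible difference" can mean without distributional assumptions: the text above does not name the effect size used by the non-parametric variant. One common distribution-free choice is Cliff's delta, and the Python sketch below assumes that measure together with the commonly cited |delta| < 0.147 threshold for "negligible". Both are assumptions for illustration, and the function names are hypothetical.

```python
def cliffs_delta(x, y):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs (a in x, b in y)."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


def is_negligible(x, y, threshold=0.147):
    """Hypothetical rule: the difference is negligible when |delta| is below the threshold."""
    return abs(cliffs_delta(x, y)) < threshold


# Made-up scores for two techniques.
low = [1, 2, 3]
high = [4, 5, 6]

print(cliffs_delta(low, high))  # every value in `low` is below every value in `high`
print(is_negligible(low, low))  # identical samples
```

Because Cliff's delta only counts which value in each pair is larger, it does not rely on normality or equal variances, which is why it is a plausible fit for a non-parametric test.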

Release Notes

V3.0.0 - Supports the Non-Parametric ScottKnott ESD test. (Pending approval for release on CRAN)

install.packages("devtools")
devtools::install_github("klainfo/ScottKnottESD", ref="development")
# Using Non-Parametric ScottKnott ESD test
sk <- sk_esd(example, version="np")

V2.0.3 - The Parametric ScottKnott ESD test (v2.x) produces the ranking of treatment means while ensuring that (1) the magnitude of the difference for all of the treatments in each group is negligible; and (2) the magnitude of the difference of treatments between groups is non-negligible. [Tantithamthavorn et al., (2018) https://doi.org/10.1109/TSE.2018.2794977]

install.packages("devtools")
devtools::install_github("klainfo/ScottKnottESD", ref="development")
# Using the Parametric ScottKnott ESD test
sk <- sk_esd(example, version="p")

Example usage scenarios in the software engineering domain

(1) Ranking and identifying the most influential variables that are produced by random forests models or regression models.

(2) Ranking and identifying the top-performing feature selection, classification, and model validation techniques for defect prediction models.

Installation

Install the current release from CRAN:
install.packages("ScottKnottESD")
Install the development version from GitHub:
install.packages("devtools")
devtools::install_github("klainfo/ScottKnottESD", ref="development")
Install from Python (by calling the R package via rpy2):
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector  # R vector of strings

utils = rpackages.importr("utils")
utils.chooseCRANmirror(ind=1)  # select the first mirror in the list

# R package names
packnames = ["ScottKnottESD"]

# Selectively install what needs to be installed.
names_to_install = [x for x in packnames if not rpackages.isinstalled(x)]
print(f"packages to install: {names_to_install}")

if len(names_to_install) > 0:
    utils.install_packages(StrVector(names_to_install))

Warning: this solution may not work properly on some processors, such as the Apple M1 chip. It was tested on Google Colab running Ubuntu.

Install the development version from GitHub from Python (by calling the R package via rpy2):
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri


pandas2ri.activate()
devtools = importr("devtools")
devtools.install_github("klainfo/ScottKnottESD", ref="development")
sk = importr("ScottKnottESD")

Warning: this solution may not work properly on some processors, such as the Apple M1 chip. It was tested on Google Colab running Ubuntu.

Example R Usage

library(ScottKnottESD)

# An example dataset: The 1,000 variable importance scores of 9 software metrics. 
# The scores are generated by the Random Forests technique using 1,000 out-of-sample bootstrap.
example

# Using Non-Parametric ScottKnott ESD test
sk <- sk_esd(example, version="np")
plot(sk)

sk <- sk_esd(maven)
plot(sk)

Example Python Usage (by calling R package via rpy2)

# For Linux
pip install rpy2

# For Mac OS X
env ARCHFLAGS="-arch i386 -arch x86_64" pip install rpy2

import pandas as pd
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

# Enable conversion between pandas and R data frames
pandas2ri.activate()

sk = importr("ScottKnottESD")

# One column per technique, one row per observation
data = pd.DataFrame(
    {
        "TechniqueA": [5, 1, 4],
        "TechniqueB": [6, 8, 3],
        "TechniqueC": [7, 10, 15],
        "TechniqueD": [7, 10.1, 15],
    }
)
print(data)  # in a notebook, display(data) also works
r_sk = sk.sk_esd(data)

# R indices are 1-based; subtract 1 to index pandas columns
column_order = list(r_sk[3] - 1)

# Long format: one row per technique
ranking = pd.DataFrame(
    {
        "technique": [data.columns[i] for i in column_order],
        "rank": r_sk[1].astype("int"),
    }
)

# Alternatively, wide format: one column per technique
ranking = pd.DataFrame(
    [r_sk[1].astype("int")], columns=[data.columns[i] for i in column_order]
)
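
The index arithmetic in the snippet above can be tried without R. The following sketch uses hypothetical stand-in values for `r_sk[3]` (a 1-based column order) and `r_sk[1]` (the rank of each column in that order). The values are made up for illustration and are not output of `sk_esd`.

```python
columns = ["TechniqueA", "TechniqueB", "TechniqueC", "TechniqueD"]

# Hypothetical stand-ins for the R results (not real sk_esd output).
order_1based = [4, 3, 2, 1]  # stand-in for r_sk[3]: 1-based column order
ranks = [1, 1, 2, 2]         # stand-in for r_sk[1]: rank per ordered column

# Convert R's 1-based indices to Python's 0-based indices.
column_order = [i - 1 for i in order_1based]
ordered_names = [columns[i] for i in column_order]

ranking_long = list(zip(ordered_names, ranks))  # one (technique, rank) pair per row
ranking_wide = dict(zip(ordered_names, ranks))  # one entry per technique

print(ranking_long)
print(ranking_wide)
```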

Referencing ScottKnottESD

ScottKnottESD can be referenced as:

@article{tantithamthavorn2017mvt,
    Author = {Tantithamthavorn, Chakkrit and McIntosh, Shane and Hassan, Ahmed E. and Matsumoto, Kenichi},
    Title = {An Empirical Comparison of Model Validation Techniques for Defect Prediction Models},
    Journal = {IEEE Transactions on Software Engineering (TSE)},
    Volume = {43},
    Number = {1},
    Pages = {1-18},
    Year = {2017}
}
@article{tantithamthavorn2018optimization,
    Author = {Tantithamthavorn, Chakkrit and McIntosh, Shane and Hassan, Ahmed E. and Matsumoto, Kenichi},
    Title = {The Impact of Automated Parameter Optimization for Defect Prediction Models},
    Journal = {IEEE Transactions on Software Engineering (TSE)},
    Pages = {Early Access},
    Year = {2018}
}


klainfo/ScottKnottESD documentation built on Feb. 15, 2023, 5:46 a.m.