ASTRID

The astrid R package (short for Automatic STRucture IDentification) implements the method described in:

Henelius, Andreas, Puolamäki, Kai and Ukkonen, Antti. Finding Statistically Significant Attribute Interactions. 2016, available from arXiv.

The basic idea is to use classifiers to investigate class-dependent attribute interactions in datasets.
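As a rough illustration of this idea, the sketch below shows one way a grouping of attributes can be used to randomise a dataset: each group is permuted within each class using one shared permutation per group, so relations between attributes in the same group survive while relations across groups are broken. It is written in base R and assumes the randomisation works this way. The function name randomize_by_grouping and the toy data are hypothetical and are not part of the astrid API.

```r
## Hypothetical helper (not part of astrid): permute each attribute group
## within each class, using one shared permutation per group.
randomize_by_grouping <- function(data, classname, grouping) {
  out <- data
  for (cls in unique(data[[classname]])) {
    rows <- which(data[[classname]] == cls)
    for (group in grouping) {
      ## One permutation of the rows of this class, shared by the whole group
      perm <- rows[sample.int(length(rows))]
      out[rows, group] <- data[perm, group]
    }
  }
  out
}

## Toy data: a2 is tied to a1, a3 is independent of both
set.seed(1)
toy <- data.frame(a1 = rnorm(6), a3 = rnorm(6),
                  class = rep(c("x", "y"), each = 3))
toy$a2 <- 2 * toy$a1

## Keep a1 and a2 together, shuffle a3 separately
shuffled <- randomize_by_grouping(toy, "class", list(c("a1", "a2"), "a3"))
```

If a classifier trained on data randomised like this stays about as accurate as one trained on the original data, that suggests it does not rely on interactions across the groups.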

Citing

To get a BibTeX entry in R, type citation("astrid") once the package is installed.

Installation from GitHub

The development version of the astrid package can be installed from GitHub as follows.

First, install and load the devtools package:

install.packages("devtools")
library(devtools)

You can now install the astrid package:

install_github("bwrc/astrid-r")

Examples

This short example demonstrates how to use the package. We analyse the following synthetic dataset:

[Figure: the synthetic dataset, shown in three panels]

The dataset has two classes, each with 500 samples. The data is generated so that attributes a1 and a2 must be used jointly to predict the class (leftmost panel), while attribute a3 carries some (weak) class information (middle panel). Attribute a4 (rightmost panel) is just noise. The known class-dependent attribute interaction structure is hence given by ((a1, a2), (a3), (a4)).
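To make this structure concrete, here is a minimal base R sketch of data with the same qualitative properties. It is only a stand-in: the function name make_toy_dataset, the XOR-style construction and the noise levels are assumptions chosen for illustration, and this is not how make_synthetic_dataset() generates its data.

```r
## Hypothetical stand-in (not the astrid generator) for a dataset with the
## interaction structure ((a1, a2), (a3), (a4)).
make_toy_dataset <- function(n_per_class = 500, seed = 42) {
  set.seed(seed)
  n <- 2 * n_per_class
  class <- rep(c(0, 1), each = n_per_class)

  ## a1 and a2: the class is encoded in whether their signs agree,
  ## so neither attribute is informative on its own
  s1 <- sample(c(-1, 1), n, replace = TRUE)
  s2 <- ifelse(class == 1, s1, -s1)
  a1 <- s1 + rnorm(n, sd = 0.3)
  a2 <- s2 + rnorm(n, sd = 0.3)

  ## a3: weak class information through a small shift in the mean
  a3 <- rnorm(n, mean = 0.5 * class)

  ## a4: pure noise
  a4 <- rnorm(n)

  data.frame(a1, a2, a3, a4, class = factor(class))
}

toy <- make_toy_dataset()
```

In this toy version, the sign of a1 * a2 almost always determines the class, while a1 or a2 alone says nothing about it.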

## Load the library
library(astrid)
library(e1071)
library(randomForest)

## Create a synthetic dataset with the known
## attribute interaction structure
## ((a1, a2), (a3), (a4)), where attribute a4 is just noise.
dataset <- make_synthetic_dataset(N = 500, seed = 42, mg2 = 0.6)

## Perform the analysis using the ASTRID algorithm
res <- analyze_dataset(dataset, classname = "class", classifier = "svm", parallel = TRUE, R = 250)

## Print the results as an HTML table
print_result_table_html(res, full_tree = TRUE)

This gives the following results for the analysis of the synthetic dataset using the SVM classifier:

k   acc    p      a3    a4    a2    a1
2   0.89   0.71   (A)   (B    B     B)
3   0.88   0.78   (A)   (B)   (C    C)
4   0.73   0.00   (A)   (B)   (C)   (D)

In this table, k is the size (cardinality) of the grouping, acc is the average accuracy of the classifier when trained using a dataset randomised using this grouping, and p is the p-value of the grouping. The remaining columns each denote one attribute (here a3, a4, a2 and a1). In each row, attributes marked with the same letter belong to the same group.
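The letter notation can be turned back into explicit groups with a few lines of base R. The sketch below takes the k = 3 row of the table, typed in by hand as a named vector (it does not read from the res object, whose internal layout is not described here), and collects the attributes that share a letter.

```r
## The k = 3 row of the table, entered by hand: attribute -> group letter
labels <- c(a3 = "A", a4 = "B", a2 = "C", a1 = "C")

## Collect the attributes that share a letter into one group
groups <- unname(split(names(labels), labels))

## Show the grouping, one group per line
for (g in groups) cat("(", paste(g, collapse = ", "), ")\n")
```

Here groups is a list with three elements: a3 on its own, a4 on its own, and a2 together with a1.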

This shows that the maximum-cardinality grouping with a p-value of at least 0.05 is for k = 3, where the grouping is ((a1, a2), (a3), (a4)). The structure found by the ASTRID algorithm matches the model used to create the data.
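The selection rule described above can be written out in base R. The values below are copied by hand from the results table; the data frame and variable names are only for illustration and are not produced by the package.

```r
## Values copied from the results table
results <- data.frame(k   = c(2, 3, 4),
                      acc = c(0.89, 0.88, 0.73),
                      p   = c(0.71, 0.78, 0.00))

## Keep the groupings with a p-value of at least 0.05 ...
alpha <- 0.05
accepted <- results[results$p >= alpha, ]

## ... and pick the one with the largest cardinality
best <- accepted[which.max(accepted$k), ]
best$k  # 3
```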

License

The astrid R package is licensed under the MIT license.



bwrc/astrid-r documentation built on June 24, 2017, 8:05 p.m.