One particular type of data that is well suited for use by multiDA is microarray data. In this example, we demostrate the power of multiDA in predicting cancer type using the provided SRBCT data.

Small Round Blue Cell Tumors (SRBCT) data

The SRBCT dataset (Khan, 2001) looks at classifying 4 classes of different childhood tumours sharing similar visual features during routine histology. These classes include Ewing's family of tumours (EWS), neuroblastoma (NB), Burkitt's lymphoma (BL), and rhabdomyosarcoma (RMS). Data was collected from 63 cDNA microarrays, with 1586 features present after filtering for genes with zero median absolute deviation. The data can be accessed by typing SRBCT. We assign our feature data to X, and our response data to y.

library(multiDA)
summary(SRBCT)

X <- SRBCT$X
y <- SRBCT$y

Fitting the multiDA model to the SRBCT data

We will fit a multiDA model to describe the group structure for each feature.

res  <- multiDA(X=X, y=y, penalty="EBIC", equal.var=TRUE, set.options="exhaustive")

We can explore the results of the multiDA model using the print() function:

print(res)

The key information returned to us is that 215 of the 1586 are deemed significant, with high gamma.hat values inndicating a higher probablity of membership to a particular partitioning. For example, for the most significant feature V1172, there is a 0.999 chance that it is described by partition 8, which, looking at the Partition Matrix output, consists of classes 1 and 3 described by a Gaussian curve, and classes 2 and 4 explained by a Gaussian curve (2 cueves altogether).

A low top value of gamma.hat indicates high uncertainty in model fit even for the most significant features - and in such a case another ML algorithm should be considered.

Exploring the plot functionality

We can visualise the paritioning of the classes within each feature using the plot() method for multiDA. By default, the plot function plots the top 10 ranked features. If `ranked=FALSE, then the user can specify which features to be plotted (specified by column names).

plot(res, ranks = 1)

An example using specified features:

plot(res, ranked=FALSE, features = c("V22"))

Prediction using the multiDA algorithm

If we want to predict class labels, we can use the predict function in order to do so. In this case, we will find the resubstitution error rate for this dataset using the multiDA algorithm. We can either extract the probability of class membership for each point ($probabilities) or we can directly extract class membership prediction (which class had the max probability) using $y.pred.

prediction <- predict(res, newdata=X)

head(prediction$probabilities)
head(prediction$y.pred)

vals <- prediction$y.pred 
rser <- sum(vals!=y)/length(y)
rser

Using the multiDA "tidy" functions

Tidy functions have been created for ease of use - namely glimpse_multiDA() and tidy_multiDA(), both of which take a trained multiDA object (in this case, res) as function input.

multiDA “glimpse” function

This function returns a one row data frame, with quick summaries from the algorithm. In the spirit of the “glance” function from the broom package.

glimpse_multiDA(res)

multiDA “tidy” function

This returns a tidy data frame, with key results from the trained multiDA object, namely, a data.frame of significant features and their ranks. In the spirit of “tidy” from the broom package.

tidy_res <- tidy_multiDA(res)
head(tidy_res)


sarahromanes/multiDA documentation built on July 14, 2019, 9:41 p.m.