docs/vignettes/07_SPEARordinal.md

'Ordinal/Multinomial Example'

This vignette will walkthrough using SPEAR on data to predict a ordinal (ordered categorical) response. Multinomial and binomial responses are very similar to ordinal, and thus will not have their own respective vignette walkthroughs.

This vignette assumes you have already installed SPEAR (return to main vignette with any questions)

Load required packages

library(SPEAR)

Load/Prepare Data

In this tutorial, we assume the user knows how to make/load data into a SPEARobject (return to main vignette with any questions)

We will be using the simulate_data function:

sim.data <- SPEAR::simulate_data(
  N = 200,        # how many training samples?
  N.test = 1000,  # how many testing samples?
  D = 4,          # how many datasets?
  P = 200,        # how many features per dataset?
  num.factors = 10,# how many true factors in the data?
  factors.influencing.Y = 3, # how many factors influence Y?
  c1 = 3,         # how much signal-to-noise? Lower = more noise
  family = "ordinal",
  ordinal.centers = c(-1.6, -1, 0, 1, 1.6) # means for ordinal classes, for more classes add more means
  #multi.centers.x = ..., # for multinomial, use two centers, one x and one y
  #multi.centers.y = ... # each is a vector of length num_classes
)

The data returned from the simulate_data function are normalized (/mu = 0, /sigma2 = 1), but if using your own data, make sure to preprocess!

We will now estimate the number of factors:

num.factors <- SPEAR::estimate_num_factors(X = sim.data$train$Xlist)

> --------------------
> Running initial SVD for min:
> min: 9 factors [SVD]
> max: 20 factors [floor(200/10)]
> --------------------
> Iter 1:10 Best = 9 factors
> Iter 2:10 Best = 10 factors
> Iter 3:10 Best = 10 factors
> Iter 4:10 Best = 10 factors
> Iter 5:10 Best = 10 factors
> Iter 6:10 Best = 10 factors
> Iter 7:10 Best = 10 factors
> Iter 8:10 Best = 9 factors
> Iter 9:10 Best = 8 factors
> Iter 10:10    Best = 10 factors
> ----------------
> min: 9 factors
> avg: 10 factors
> max: 20 factors
> ----------------
> Suggested: 10 factors (avg)

As a reminder for ordinal responses, Y must start with 0 and go to num.classes - 1

table(sim.data$train$Y)

>  0  1  2  3  4 
> 37 44 43 36 40

For multinomial responses, Y must be one-hot encoded. Use the one_hot_encode function to do this:

Y <- sim.data$train$Y
Y[1:5,,drop=FALSE]
>      [,1]
> [1,]    0
> [2,]    1
> [3,]    0
> [4,]    4
> [5,]    0

Y.onehot <- SPEAR::one_hot_encode(Y)
Y.onehot[1:5,]

>      0 1 2 3 4
> [1,] 1 0 0 0 0
> [2,] 0 1 0 0 0
> [3,] 1 0 0 0 0
> [4,] 0 0 0 0 1
> [5,] 1 0 0 0 0

Make SPEARobject

SPEARobj <- SPEAR::make.SPEARobject(X = sim.data$train$Xlist,
                                    Y = sim.data$train$Y,
                                    num_factors = num.factors,
                                    family = "ordinal"
                                    )


> ----------------------------------------------------------------
> SPEAR version 1.2.0   Please direct all questions to Jeremy Gygi
> (jeremy.gygi@yale.edu) or Leying Guan (leying.guan@yale.edu)
> ----------------------------------------------------------------
> Generating SPEARobject...
> $data...  Done!
> $params...    Done!
> $inits... Done!
> SPEARobject generated!
> 
> Detected 4 datasets:
> dataset1  Subjects:  200  Features:  200 
> dataset2  Subjects:  200  Features:  200 
> dataset3  Subjects:  200  Features:  200 
> dataset4  Subjects:  200  Features:  200 
> Detected 1 response variable:
> Y1    Subjects:  200  Type:  ordinal 
> 
> For assistance, browse the SPEAR vignettes with any of these commands...
> > SPEARobject$SPEAR.help()
> > SPEAR::SPEAR.help()
> > browseVignettes("SPEAR")

Run CV SPEAR and evaluate:

# Run SPEAR:
SPEARobj$run.cv.spear()
# optional: save model:
SPEARobj$save.model(file = "_path_to_save_model_.rds")

Analyze SPEAR results:

First, we need to select a wx to use:

SPEARobj$plot.cv.loss()

We will opt for the "sd" model, which is the default:

SPEARobj$set.weights(method = "sd")
## Setting current weight index to 1
## w.x: 2
## w.y: 1

Predictions:

Next, we will look at the predictions:

SPEARobj$plot.predictions()
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Since the data we simulated has a test dataset, we can add that to our trained SPEARobject:

SPEARobj$add.data(X = sim.data$test$Xlist, Y = sim.data$test$Y, name = "test")
## Warning in SPEARobj$add.data(X = sim.data$test$Xlist, Y = sim.data$test$Y, : WARNING: name provided already exists in names(SPEARobject$data). Overwriting...

## Saved dataset to $data$test

We can plot the predictions using data parameter:

SPEARobj$plot.predictions(data = "test")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Misclassification

SPEAR has a custom get.misclassification function to return:

SPEARobj$get.misclassification(data = "test")
## $confusion_mat
##     Pred
## True   0   1   2   3   4
##    0 147  48   1   0   0
##    1  37 142  21   0   0
##    2   1  22 157  18   0
##    3   0   0  16 148  56
##    4   0   0   0  56 130
## 
## $misclass_error
## [1] 0.28
## 
## $misclass_error_ind
##    0    1    2    3    4 
## 0.25 0.29 0.21 0.33 0.30 
## 
## $balanced_misclass_error
## [1] 0.28

Assessing performance

If you are interested in more than the results of the get.misclassification function you can use the assess function:

SPEAR utilizes the yardstick packages to provide even more information about predictions and probabilities…

Use assess(...) to pull this information:

SPEARobj$assess(data = "test")
## # A tibble: 5 × 4
##   Response .metric     .estimator .estimate
##   <chr>    <chr>       <chr>          <dbl>
## 1 Y1       accuracy    multiclass     0.724
## 2 Y1       precision   macro          0.727
## 3 Y1       recall      macro          0.725
## 4 Y1       sensitivity macro          0.725
## 5 Y1       roc_auc     macro          0.939

SPEAR returns the above metrics by default, but you can use every metric available in the yardstick package (see here for the list).

SPEAR also uses macro averaging by default, but you can specify this as well (see here for more information)

SPEARobj$assess(
  data = "test",
  estimator = "macro_weighted",
  metrics = c("specificity")
)
## # A tibble: 1 × 4
##   Response .metric     .estimator     .estimate
##   <chr>    <chr>       <chr>              <dbl>
## 1 Y1       specificity macro_weighted     0.930

Prediction Probabilities

Because our response is ordinal, we can look at the actual probabilities of each prediction assignment…

probabilities <- SPEARobj$get.predictions(data = "test")
# SPEAR returns a list of matrices instead of just predictions for
# ordinal, multinomial, and binomial responses:
# NOTE: ordinal responses will first be indexed by the response name (Y1 in this case)
names(probabilities$Y1)
## [1] "class.predictions"   "class.separations"   "class.probabilities" "signal"
par(mfrow=c(2,3))
for(i in 1:5){
  boxplot(probabilities$Y1$class.probabilities[,i] ~ SPEARobj$data$test$Y, xlab = "True Class", ylab = paste0("Prob. Class ", i), col = viridis::inferno(n = 5, begin = .2)[i])
}

Notice that the highest probabilities being assigned for each predicted class correspond well to the true class.

SPEAR can plot these probabilities in multiple ways:

plot.class.probabilities(... , plot.type = "heatmap")

SPEARobj$plot.class.probabilities(data = "test", plot.type = "heatmap")

plot.class.probabilities(... , plot.type = "line")

SPEARobj$plot.class.probabilities(data = "test", plot.type = "line")

Factor Scores

We can analyze the factor scores to see which ones correspond to our response:

SPEARobj$plot.factor.scores(factors = 1:5)

SPEARobj$plot.factor.scores(data = "test", factors = 1:5)

We can further explore the factors via the plot.contributions function.

This will tell us how the factors are composed (which datasets are being used, and for which response?)

SPEARobj$plot.contributions()

We are interested in factors 1, 2, and 3 primarily, since they have the most influence on the prediction from their factor score correlations.

Feature Interpretation

To get the features that make up Factors 1, 2, and 3, use the get.features function:

relevant.features <- SPEARobj$get.features(
                                           factors = c(1,2,3),
                                           datasets = NULL, # can specify which datasets here, otherwise all will be returned
                                           coefficient.cutoff = 0, # specify a magnitude cutoff
                                           probability.cutoff = .95 # specify a probability cutoff, .95 is default (recommended)
                                          )

relevant.features[1:7,c(1, 2, 5, 6)]
##    Factor          Feature projection.coefficient joint.probability
## 1 Factor1  dataset3_feat10             0.05352758                 1
## 2 Factor1  dataset3_feat69             0.04992221                 1
## 3 Factor1  dataset2_feat27             0.05890834                 1
## 4 Factor1  dataset2_feat84             0.05123886                 1
## 5 Factor1 dataset2_feat120            -0.05984954                 1
## 6 Factor1 dataset2_feat170             0.06033837                 1
## 7 Factor1  dataset2_feat50            -0.06451814                 1

See here (Getting Top Features) for more information on how to understand this feature table

We can see that simulated datasets 2 + 3 (and one from dataset 4) have the pattern for Factor 1:

table(dplyr::filter(relevant.features, Factor == "Factor1")$Dataset)
## 
## dataset2 dataset3 dataset4 
##       29       18        1

… datasets 1 + 4 have the pattern for Factor 2:

table(dplyr::filter(relevant.features, Factor == "Factor2")$Dataset)
## 
## dataset1 dataset4 
##       22       22

… and simulated datasets 1 + 2 have the pattern for Factor 3:

table(dplyr::filter(relevant.features, Factor == "Factor3")$Dataset)
## 
## dataset1 dataset2 
##       25       25

From this point, you would try to determine what Factor 1, Factor 2, and Factor 3 represent biologically.

Looking at the wx = 0 model

Let’s see what happens if we fix the model selection to the lowest wx possible (wx = 0):

SPEARobj$set.weights(w.x = 0)
## Setting current weight index to 6
## w.x: 0
## w.y: 1

Let’s look at the factor scores:

SPEARobj$plot.factor.scores(data = "test", factors = 1:5)

With wx = 0, SPEAR is not concerned about reconstructing the factors, rather it is interested in solely predicting Y:

SPEARobj$plot.predictions(data = "test")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

While this model does a good job, let’s see where the features for Factor 1 are coming from:

relevant.features <- SPEARobj$get.features(factors = 1)
table(relevant.features$Dataset)
## 
## dataset1 dataset2 dataset3 dataset4 
##       57       60       30       28

Here we see SPEAR with wx = 0 is combining the information from the "sd" model’s Factors 1-3.

Notice how this model achieves good prediction of Y, but potentially discards important features (we get more relevant features from the “sd” model with a higher wx). Keep this in mind as you explore the different wx models of SPEAR.



jgygi/SPEAR documentation built on July 5, 2023, 5:35 p.m.