This vignette walks through using SPEAR to predict an ordinal (ordered categorical) response. Multinomial and binomial responses are handled very similarly, so they do not have their own vignette walkthroughs.
This vignette assumes you have already installed SPEAR (return to the main vignette with any questions).
library(SPEAR)
In this tutorial, we assume you already know how to create or load data into a SPEARobject (return to the main vignette with any questions).
We will be using the simulate_data function:
sim.data <- SPEAR::simulate_data(
  N = 200,          # how many training samples?
  N.test = 1000,    # how many testing samples?
  D = 4,            # how many datasets?
  P = 200,          # how many features per dataset?
  num.factors = 10, # how many true factors in the data?
  factors.influencing.Y = 3, # how many factors influence Y?
  c1 = 3,           # how much signal-to-noise? Lower = more noise
  family = "ordinal",
  ordinal.centers = c(-1.6, -1, 0, 1, 1.6) # means for the ordinal classes; for more classes, add more means
  # multi.centers.x = ..., # for multinomial, use two centers, one x and one y;
  # multi.centers.y = ...  # each is a vector of length num_classes
)
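Before moving on, it can help to glance at the structure of the simulated object. A quick sketch (the train/test layout is inferred from how sim.data is used later in this vignette):
names(sim.data)                           # includes the "train" and "test" splits used below
str(sim.data$train$Xlist, max.level = 1)  # a list of D = 4 feature matrices
length(sim.data$train$Y)                  # 200 training responses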
The data returned from the simulate_data function are normalized (μ = 0, σ² = 1), but if you are using your own data, make sure to preprocess it first!
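If you are bringing your own data, a minimal preprocessing sketch is to center and scale every feature with base R's scale() (assuming each element of your feature list, here called my.Xlist as a placeholder, is a samples-by-features numeric matrix):
# center (mean 0) and scale (variance 1) each feature in each dataset
my.Xlist <- lapply(my.Xlist, function(x) scale(x, center = TRUE, scale = TRUE))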
We will now estimate the number of factors:
num.factors <- SPEAR::estimate_num_factors(X = sim.data$train$Xlist)
> --------------------
> Running initial SVD for min:
> min: 9 factors [SVD]
> max: 20 factors [floor(200/10)]
> --------------------
> Iter 1:10 Best = 9 factors
> Iter 2:10 Best = 10 factors
> Iter 3:10 Best = 10 factors
> Iter 4:10 Best = 10 factors
> Iter 5:10 Best = 10 factors
> Iter 6:10 Best = 10 factors
> Iter 7:10 Best = 10 factors
> Iter 8:10 Best = 9 factors
> Iter 9:10 Best = 8 factors
> Iter 10:10 Best = 10 factors
> ----------------
> min: 9 factors
> avg: 10 factors
> max: 20 factors
> ----------------
> Suggested: 10 factors (avg)
As a reminder, for ordinal responses Y must take integer values from 0 to num.classes - 1:
table(sim.data$train$Y)
> 0 1 2 3 4
> 37 44 43 36 40
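If your own labels are stored differently (e.g. as an ordered factor or as character strings), a minimal re-coding sketch using base R (my.labels and the level names are hypothetical placeholders for your own data):
# map ordered labels to the integers 0 .. num.classes - 1
Y <- as.integer(factor(my.labels, levels = c("low", "medium", "high"))) - 1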
For multinomial responses, Y must be one-hot encoded. Use the one_hot_encode function to do this:
Y <- sim.data$train$Y
Y[1:5,,drop=FALSE]
> [,1]
> [1,] 0
> [2,] 1
> [3,] 0
> [4,] 4
> [5,] 0
Y.onehot <- SPEAR::one_hot_encode(Y)
Y.onehot[1:5,]
> 0 1 2 3 4
> [1,] 1 0 0 0 0
> [2,] 0 1 0 0 0
> [3,] 1 0 0 0 0
> [4,] 0 0 0 0 1
> [5,] 1 0 0 0 0
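As a quick consistency check, a one-hot matrix can be mapped back to 0-based class labels with base R's max.col() (a sketch, assuming the columns are ordered by class as shown above):
# recover the integer labels from the one-hot encoding
all(max.col(Y.onehot) - 1 == as.vector(Y))  # should be TRUE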
SPEARobj <- SPEAR::make.SPEARobject(
  X = sim.data$train$Xlist,
  Y = sim.data$train$Y,
  num_factors = num.factors,
  family = "ordinal"
)
> ----------------------------------------------------------------
> SPEAR version 1.2.0 Please direct all questions to Jeremy Gygi
> (jeremy.gygi@yale.edu) or Leying Guan (leying.guan@yale.edu)
> ----------------------------------------------------------------
> Generating SPEARobject...
> $data... Done!
> $params... Done!
> $inits... Done!
> SPEARobject generated!
>
> Detected 4 datasets:
> dataset1 Subjects: 200 Features: 200
> dataset2 Subjects: 200 Features: 200
> dataset3 Subjects: 200 Features: 200
> dataset4 Subjects: 200 Features: 200
> Detected 1 response variable:
> Y1 Subjects: 200 Type: ordinal
>
> For assistance, browse the SPEAR vignettes with any of these commands...
> > SPEARobject$SPEAR.help()
> > SPEAR::SPEAR.help()
> > browseVignettes("SPEAR")
# Run SPEAR:
SPEARobj$run.cv.spear()
# optional: save model:
SPEARobj$save.model(file = "_path_to_save_model_.rds")
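Since save.model writes an .rds file, one way to restore the model in a later session is standard R deserialization (a sketch; check the SPEAR documentation in case a dedicated loading method is preferred):
SPEARobj <- readRDS("_path_to_save_model_.rds")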
First, we need to select a wx to use:
SPEARobj$plot.cv.loss()
We will opt for the "sd" model, which is the default:
SPEARobj$set.weights(method = "sd")
## Setting current weight index to 1
## w.x: 2
## w.y: 1
Next, we will look at the predictions:
SPEARobj$plot.predictions()
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Since the data we simulated has a test dataset, we can add that to our trained SPEARobject:
SPEARobj$add.data(X = sim.data$test$Xlist, Y = sim.data$test$Y, name = "test")
## Warning in SPEARobj$add.data(X = sim.data$test$Xlist, Y = sim.data$test$Y, : WARNING: name provided already exists in names(SPEARobject$data). Overwriting...
## Saved dataset to $data$test
We can plot the predictions for it using the data parameter:
SPEARobj$plot.predictions(data = "test")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
SPEAR has a custom get.misclassification function that returns:
- confusion_mat - the confusion matrix
- misclass_error - the unbalanced misclassification error
- misclass_error_ind - the misclassification error per class
- balanced_misclass_error - the balanced misclassification error
SPEARobj$get.misclassification(data = "test")
## $confusion_mat
## Pred
## True 0 1 2 3 4
## 0 147 48 1 0 0
## 1 37 142 21 0 0
## 2 1 22 157 18 0
## 3 0 0 16 148 56
## 4 0 0 0 56 130
##
## $misclass_error
## [1] 0.28
##
## $misclass_error_ind
## 0 1 2 3 4
## 0.25 0.29 0.21 0.33 0.30
##
## $balanced_misclass_error
## [1] 0.28
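To see how these quantities relate: the balanced error is the unweighted mean of the per-class errors, while the unbalanced error weights each class by its size (these definitions are an assumption, but they match the output above):
mc <- SPEARobj$get.misclassification(data = "test")
mean(mc$misclass_error_ind)  # ~0.28, matching $balanced_misclass_error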
If you are interested in more than the results of the get.misclassification function, you can use the assess function. SPEAR utilizes the yardstick package to provide even more information about predictions and probabilities. Use assess(...) to pull this information:
SPEARobj$assess(data = "test")
## # A tibble: 5 × 4
## Response .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 Y1 accuracy multiclass 0.724
## 2 Y1 precision macro 0.727
## 3 Y1 recall macro 0.725
## 4 Y1 sensitivity macro 0.725
## 5 Y1 roc_auc macro 0.939
SPEAR returns the above metrics by default, but you can use any metric available in the yardstick package (see here for the list).
SPEAR also uses macro averaging by default, but you can specify the estimator as well (see here for more information):
SPEARobj$assess(
  data = "test",
  estimator = "macro_weighted",
  metrics = c("specificity")
)
## # A tibble: 1 × 4
## Response .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 Y1 specificity macro_weighted 0.930
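For example, you could request several other metrics in one call (a sketch; f_meas, kap, and bal_accuracy are standard yardstick metric names, and we assume assess() accepts them the same way it accepts specificity above):
SPEARobj$assess(
  data = "test",
  metrics = c("f_meas", "kap", "bal_accuracy")
)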
Because our response is ordinal, we can look at the actual probabilities of each prediction assignment…
probabilities <- SPEARobj$get.predictions(data = "test")
# SPEAR returns a list of matrices instead of just predictions for
# ordinal, multinomial, and binomial responses:
# NOTE: ordinal responses will first be indexed by the response name (Y1 in this case)
names(probabilities$Y1)
## [1] "class.predictions" "class.separations" "class.probabilities" "signal"
par(mfrow = c(2, 3))
for (i in 1:5) {
  # column i holds the probabilities for class i - 1 (classes are labeled 0-4)
  boxplot(probabilities$Y1$class.probabilities[, i] ~ SPEARobj$data$test$Y,
          xlab = "True Class",
          ylab = paste0("Prob. Class ", i - 1),
          col = viridis::inferno(n = 5, begin = 0.2)[i])
}
Notice that the highest probabilities for each class are assigned to the samples whose true class matches.
SPEAR can plot these probabilities in multiple ways:
plot.class.probabilities(..., plot.type = "heatmap")
SPEARobj$plot.class.probabilities(data = "test", plot.type = "heatmap")
plot.class.probabilities(..., plot.type = "line")
SPEARobj$plot.class.probabilities(data = "test", plot.type = "line")
We can analyze the factor scores to see which ones correspond to our response:
SPEARobj$plot.factor.scores(factors = 1:5)
SPEARobj$plot.factor.scores(data = "test", factors = 1:5)
We can further explore the factors via the plot.contributions function. This tells us how the factors are composed: which datasets contribute to each factor, and which response each factor helps predict.
SPEARobj$plot.contributions()
We are primarily interested in Factors 1, 2, and 3, since their factor scores correlate most strongly with the response and thus have the most influence on the prediction.
To get the features that make up Factors 1, 2, and 3, use the get.features function:
relevant.features <- SPEARobj$get.features(
  factors = c(1, 2, 3),
  datasets = NULL,          # can specify which datasets here; otherwise all are returned
  coefficient.cutoff = 0,   # specify a magnitude cutoff
  probability.cutoff = .95  # specify a probability cutoff; .95 is the default (recommended)
)
relevant.features[1:7,c(1, 2, 5, 6)]
## Factor Feature projection.coefficient joint.probability
## 1 Factor1 dataset3_feat10 0.05352758 1
## 2 Factor1 dataset3_feat69 0.04992221 1
## 3 Factor1 dataset2_feat27 0.05890834 1
## 4 Factor1 dataset2_feat84 0.05123886 1
## 5 Factor1 dataset2_feat120 -0.05984954 1
## 6 Factor1 dataset2_feat170 0.06033837 1
## 7 Factor1 dataset2_feat50 -0.06451814 1
See here (Getting Top Features) for more information on how to interpret this feature table.
We can see that simulated datasets 2 + 3 (plus one feature from dataset 4) carry the pattern for Factor 1:
table(dplyr::filter(relevant.features, Factor == "Factor1")$Dataset)
##
## dataset2 dataset3 dataset4
## 29 18 1
… datasets 1 + 4 have the pattern for Factor 2:
table(dplyr::filter(relevant.features, Factor == "Factor2")$Dataset)
##
## dataset1 dataset4
## 22 22
… and simulated datasets 1 + 2 have the pattern for Factor 3:
table(dplyr::filter(relevant.features, Factor == "Factor3")$Dataset)
##
## dataset1 dataset2
## 25 25
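The three per-factor tables above can also be produced in a single call by cross-tabulating the Factor and Dataset columns of the feature table:
table(relevant.features$Factor, relevant.features$Dataset)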
From this point, you would try to determine what Factor 1, Factor 2, and Factor 3 represent biologically.
Let’s see what happens if we fix the model selection to the lowest wx possible (wx = 0):
SPEARobj$set.weights(w.x = 0)
## Setting current weight index to 6
## w.x: 0
## w.y: 1
Let’s look at the factor scores:
SPEARobj$plot.factor.scores(data = "test", factors = 1:5)
With wx = 0, SPEAR is not concerned with reconstructing the factors; rather, it focuses solely on predicting Y:
SPEARobj$plot.predictions(data = "test")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
While this model does a good job, let’s see where the features for Factor 1 are coming from:
relevant.features <- SPEARobj$get.features(factors = 1)
table(relevant.features$Dataset)
##
## dataset1 dataset2 dataset3 dataset4
## 57 60 30 28
Here we see that SPEAR with wx = 0 is combining the information from the "sd" model's Factors 1-3. Notice how this model achieves good prediction of Y, but potentially discards important features (we get more relevant features from the "sd" model with a higher wx). Keep this in mind as you explore the different wx models of SPEAR.