This vignette will walkthrough using SPEAR on data to predict a gaussian (continuous) response.
This vignette assumes you have already installed SPEAR (return to main vignette with any questions)
library(SPEAR)
In this tutorial, we assume the user knows how to make/load data into a SPEARobject (return to main vignette with any questions)
We will be using the simulate_data
function:
sim.data <- SPEAR::simulate_data(
N = 200, # how many training samples?
N.test = 1000, # how many testing samples?
D = 4, # how many datasets?
P = 200, # how many features per dataset?
num.factors = 10,# how many true factors in the data?
factors.influencing.Y = 3, # how many factors influence Y?
c1 = 3, # how much signal-to-noise? Lower = more noise
family = "gaussian"
)
The data returned from the simulate_data
function are normalized
(/mu = 0, /sigma2 = 1), but if using your own
data, make sure to preprocess!
We will now estimate the number of factors:
num.factors <- SPEAR::estimate_num_factors(X = sim.data$train$Xlist)
> --------------------
> Running initial SVD for min:
> min: 10 factors [SVD]
> max: 20 factors [floor(200/10)]
> --------------------
> Iter 1:10 Best = 10 factors
> Iter 2:10 Best = 10 factors
> Iter 3:10 Best = 10 factors
> Iter 4:10 Best = 10 factors
> Iter 5:10 Best = 10 factors
> Iter 6:10 Best = 10 factors
> Iter 7:10 Best = 10 factors
> Iter 8:10 Best = 10 factors
> Iter 9:10 Best = 10 factors
> Iter 10:10 Best = 10 factors
> ----------------
> min: 10 factors
> avg: 10 factors
> max: 20 factors
> ----------------
> Suggested: 10 factors (avg)
SPEARobj <- SPEAR::make.SPEARobject(X = sim.data$train$Xlist,
Y = sim.data$train$Y,
num_factors = num.factors,
family = "gaussian"
)
> ----------------------------------------------------------------
> SPEAR version 1.2.0 Please direct all questions to Jeremy Gygi
> (jeremy.gygi@yale.edu) or Leying Guan (leying.guan@yale.edu)
> ----------------------------------------------------------------
> Generating SPEARobject...
> $data... Done!
> $params... Done!
> $inits... Done!
> SPEARobject generated!
>
> Detected 4 datasets:
> dataset1 Subjects: 200 Features: 200
> dataset2 Subjects: 200 Features: 200
> dataset3 Subjects: 200 Features: 200
> dataset4 Subjects: 200 Features: 200
> Detected 1 response variable:
> Y1 Subjects: 200 Type: gaussian
>
> For assistance, browse the SPEAR vignettes with any of these commands...
> > SPEARobject$SPEAR.help()
> > SPEAR::SPEAR.help()
> > browseVignettes("SPEAR")
# Run SPEAR:
SPEARobj$run.cv.spear()
# optional: save model:
SPEARobj$save.model(file = "_path_to_save_model_.rds")
First, we need to select a wx to use:
SPEARobj$plot.cv.loss()
We will opt for the "sd"
model, which is the default:
SPEARobj$set.weights(method = "sd")
## Setting current weight index to 1
## w.x: 2
## w.y: 1
Next, we will look at the predictions:
SPEARobj$plot.predictions()
Since the data we simulated has a test dataset, we can add that to our trained SPEARobject:
SPEARobj$add.data(X = sim.data$test$Xlist, Y = sim.data$test$Y, name = "test")
## Saved dataset to $data$test
We can plot the predictions using data
parameter:
SPEARobj$plot.predictions(data = "test")
SPEARobj$plot.factor.scores()
SPEARobj$plot.factor.scores(data = "test")
We can further explore the factors via the plot.contributions
function.
This will tell us how the factors are composed (which datasets are being used, and for which response?)
SPEARobj$plot.contributions()
We are interested in factors 1 and 2 primarily, since they have the most influence on the prediction from their factor score correlations.
To get the features that make up Factors 1 and 2, use the get.features
function:
relevant.features <- SPEARobj$get.features(
factors = c(1,2),
datasets = NULL, # can specify which datasets here, otherwise all will be returned
coefficient.cutoff = 0, # specify a magnitude cutoff
probability.cutoff = .95 # specify a probability cutoff, .95 is default (recommended)
)
relevant.features[1:7,c(1, 2, 5, 6)]
## Factor Feature projection.coefficient joint.probability
## 1 Factor1 dataset4_feat92 0.06605894 1
## 2 Factor1 dataset4_feat182 0.06043808 1
## 3 Factor1 dataset4_feat90 -0.06911251 1
## 4 Factor1 dataset1_feat46 0.06453957 1
## 5 Factor1 dataset4_feat85 -0.06482454 1
## 6 Factor1 dataset4_feat199 -0.05732077 1
## 7 Factor1 dataset1_feat78 0.06898450 1
See here (Getting Top Features) for more information on how to understand this feature table
We can see that simulated datasets 1 + 4 have the pattern for Factor 1:
table(dplyr::filter(relevant.features, Factor == "Factor1")$Dataset)
##
## dataset1 dataset4
## 21 20
… and simulated datasets 1 + 2 have the pattern for Factor 2:
table(dplyr::filter(relevant.features, Factor == "Factor2")$Dataset)
##
## dataset1 dataset2
## 23 25
From this point, you would try to determine what Factor 1 and Factor 2 represent biologically.
Let’s see what happens if we fix the model selection to the lowest wx possible (wx = 0):
SPEARobj$set.weights(w.x = 0)
## Setting current weight index to 6
## w.x: 0
## w.y: 1
Let’s look at the factor scores:
SPEARobj$plot.factor.scores(data = "test")
With wx = 0, SPEAR is not concerned about reconstructing the factors, rather it is interested in solely predicting Y:
SPEARobj$plot.predictions(data = "test")
While this model does a good job, let’s see where the features for Factor 1 are coming from:
relevant.features <- SPEARobj$get.features(factors = 1)
table(relevant.features$Dataset)
##
## dataset1 dataset2 dataset4
## 17 8 3
Notice how this model achieves good prediction of Y, but potentially
discards important features (we get more relevant features from the
“sd
” model with a higher wx). Keep this in mind as you
explore the different wx models of SPEAR.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.