docs/vignettes/06_SPEARgaussian.md

'Gaussian Response Example'

This vignette will walkthrough using SPEAR on data to predict a gaussian (continuous) response.

This vignette assumes you have already installed SPEAR (return to main vignette with any questions)

Load required packages

library(SPEAR)

Load/Prepare Data

In this tutorial, we assume the user knows how to make/load data into a SPEARobject (return to main vignette with any questions)

We will be using the simulate_data function:

sim.data <- SPEAR::simulate_data(
  N = 200,        # how many training samples?
  N.test = 1000,  # how many testing samples?
  D = 4,          # how many datasets?
  P = 200,        # how many features per dataset?
  num.factors = 10,# how many true factors in the data?
  factors.influencing.Y = 3, # how many factors influence Y?
  c1 = 3,         # how much signal-to-noise? Lower = more noise
  family = "gaussian"
)

The data returned from the simulate_data function are normalized (/mu = 0, /sigma2 = 1), but if using your own data, make sure to preprocess!

We will now estimate the number of factors:

num.factors <- SPEAR::estimate_num_factors(X = sim.data$train$Xlist)

> --------------------
> Running initial SVD for min:
> min: 10 factors [SVD]
> max: 20 factors [floor(200/10)]
> --------------------
> Iter 1:10 Best = 10 factors
> Iter 2:10 Best = 10 factors
> Iter 3:10 Best = 10 factors
> Iter 4:10 Best = 10 factors
> Iter 5:10 Best = 10 factors
> Iter 6:10 Best = 10 factors
> Iter 7:10 Best = 10 factors
> Iter 8:10 Best = 10 factors
> Iter 9:10 Best = 10 factors
> Iter 10:10    Best = 10 factors
> ----------------
> min: 10 factors
> avg: 10 factors
> max: 20 factors
> ----------------
> Suggested: 10 factors (avg)

Make SPEARobject

SPEARobj <- SPEAR::make.SPEARobject(X = sim.data$train$Xlist,
                                    Y = sim.data$train$Y,
                                    num_factors = num.factors,
                                    family = "gaussian"
                                    )


> ----------------------------------------------------------------
> SPEAR version 1.2.0   Please direct all questions to Jeremy Gygi
> (jeremy.gygi@yale.edu) or Leying Guan (leying.guan@yale.edu)
> ----------------------------------------------------------------
> Generating SPEARobject...
> $data...  Done!
> $params...    Done!
> $inits... Done!
> SPEARobject generated!
> 
> Detected 4 datasets:
> dataset1  Subjects:  200  Features:  200 
> dataset2  Subjects:  200  Features:  200 
> dataset3  Subjects:  200  Features:  200 
> dataset4  Subjects:  200  Features:  200 
> Detected 1 response variable:
> Y1    Subjects:  200  Type:  gaussian 
> 
> For assistance, browse the SPEAR vignettes with any of these commands...
> > SPEARobject$SPEAR.help()
> > SPEAR::SPEAR.help()
> > browseVignettes("SPEAR")

Run CV SPEAR and evaluate:

# Run SPEAR:
SPEARobj$run.cv.spear()
# optional: save model:
SPEARobj$save.model(file = "_path_to_save_model_.rds")

Analyze SPEAR results:

First, we need to select a wx to use:

SPEARobj$plot.cv.loss()

We will opt for the "sd" model, which is the default:

SPEARobj$set.weights(method = "sd")
## Setting current weight index to 1
## w.x: 2
## w.y: 1

Predictions:

Next, we will look at the predictions:

SPEARobj$plot.predictions()

Since the data we simulated has a test dataset, we can add that to our trained SPEARobject:

SPEARobj$add.data(X = sim.data$test$Xlist, Y = sim.data$test$Y, name = "test")
## Saved dataset to $data$test

We can plot the predictions using data parameter:

SPEARobj$plot.predictions(data = "test")

Factor Scores

SPEARobj$plot.factor.scores()

SPEARobj$plot.factor.scores(data = "test")

We can further explore the factors via the plot.contributions function.

This will tell us how the factors are composed (which datasets are being used, and for which response?)

SPEARobj$plot.contributions()

We are interested in factors 1 and 2 primarily, since they have the most influence on the prediction from their factor score correlations.

Feature Interpretation

To get the features that make up Factors 1 and 2, use the get.features function:

relevant.features <- SPEARobj$get.features(
                                           factors = c(1,2),
                                           datasets = NULL, # can specify which datasets here, otherwise all will be returned
                                           coefficient.cutoff = 0, # specify a magnitude cutoff
                                           probability.cutoff = .95 # specify a probability cutoff, .95 is default (recommended)
                                          )

relevant.features[1:7,c(1, 2, 5, 6)]
##    Factor          Feature projection.coefficient joint.probability
## 1 Factor1  dataset4_feat92             0.06605894                 1
## 2 Factor1 dataset4_feat182             0.06043808                 1
## 3 Factor1  dataset4_feat90            -0.06911251                 1
## 4 Factor1  dataset1_feat46             0.06453957                 1
## 5 Factor1  dataset4_feat85            -0.06482454                 1
## 6 Factor1 dataset4_feat199            -0.05732077                 1
## 7 Factor1  dataset1_feat78             0.06898450                 1

See here (Getting Top Features) for more information on how to understand this feature table

We can see that simulated datasets 1 + 4 have the pattern for Factor 1:

table(dplyr::filter(relevant.features, Factor == "Factor1")$Dataset)
## 
## dataset1 dataset4 
##       21       20

… and simulated datasets 1 + 2 have the pattern for Factor 2:

table(dplyr::filter(relevant.features, Factor == "Factor2")$Dataset)
## 
## dataset1 dataset2 
##       23       25

From this point, you would try to determine what Factor 1 and Factor 2 represent biologically.

Looking at the wx = 0 model

Let’s see what happens if we fix the model selection to the lowest wx possible (wx = 0):

SPEARobj$set.weights(w.x = 0)
## Setting current weight index to 6
## w.x: 0
## w.y: 1

Let’s look at the factor scores:

SPEARobj$plot.factor.scores(data = "test")

With wx = 0, SPEAR is not concerned about reconstructing the factors, rather it is interested in solely predicting Y:

SPEARobj$plot.predictions(data = "test")

While this model does a good job, let’s see where the features for Factor 1 are coming from:

relevant.features <- SPEARobj$get.features(factors = 1)
table(relevant.features$Dataset)
## 
## dataset1 dataset2 dataset4 
##       17        8        3

Notice how this model achieves good prediction of Y, but potentially discards important features (we get more relevant features from the “sd” model with a higher wx). Keep this in mind as you explore the different wx models of SPEAR.



jgygi/SPEAR documentation built on July 5, 2023, 5:35 p.m.