SPEAR is a supervised latent factor model for multi-omic data. This vignette will walk through the steps necessary to properly preprocess multi-omic data to be used in SPEAR.
Read in the data you wish to use with SPEAR. The data must have the following structure:
SPEAR requires the same N samples (rows) in every dataset of X and in Y, with unique features in each dataset.
X needs to be a list of matrices, where the rows of each matrix correspond to the same N samples…
Here is an example of how the explanatory data (X) should be loaded. For this vignette, we will generate two datasets, X1 and X2.
X1 will be normally distributed with mean = 0 and variance of 1.
X2 will be normally distributed, but with a different mean and variance…
Below is a basic example of preparing X for SPEAR…
# This code chunk is not evaluated...
X1 <- read.csv(...)
X2 <- read.table(...)
# Store as a list where 'dataset1' is the name given to dataset X1
X <- list(dataset1 = X1,
dataset2 = X2)
# If you want to change the names, use the names function...
names(X) <- c("dataset1", "dataset2")
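The chunk above reads X1 and X2 from files, while the text describes them as generated. A minimal sketch of such a simulation follows. The sample and feature counts are assumptions chosen for illustration, and the mean of 10 and standard deviation of 5 for X2 are taken from the histogram description later in this vignette.

```r
set.seed(42)       # for reproducibility only
n_samples  <- 20   # assumed number of samples (N)
n_features <- 5    # assumed number of features per dataset

# X1: standard normal (mean = 0, var = 1)
X1 <- matrix(rnorm(n_samples * n_features, mean = 0, sd = 1),
             nrow = n_samples, ncol = n_features)

# X2: normal with a different mean and variance (mean = 10, sd = 5)
X2 <- matrix(rnorm(n_samples * n_features, mean = 10, sd = 5),
             nrow = n_samples, ncol = n_features)

# Store as a named list, one matrix per dataset
X <- list(dataset1 = X1, dataset2 = X2)
sapply(X, dim)     # rows and columns of each dataset
```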
Each dataset in X must be a matrix (not a data.frame). You can convert every dataset with the code below:
X <- lapply(X, as.matrix)
Each dataset in X must have the same row names (sample names). If none are provided, SPEAR will automatically assume row 1 of X1 corresponds to the same sample as row 1 of X2.
You can manually add row (sample) names like this…
# replace ... with vector of row names
sample_names <- ...
X <- lapply(X, function(dataset){
  rownames(dataset) <- sample_names
  return(dataset)
})
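After assigning row names, it can be worth confirming that every dataset carries the same sample names in the same order. The sketch below uses small made-up matrices and hypothetical sample names; only base R functions are involved.

```r
# Toy datasets with 3 samples each (illustrative values only)
X <- list(dataset1 = matrix(rnorm(6), nrow = 3),
          dataset2 = matrix(rnorm(9), nrow = 3))

# Hypothetical sample names
sample_names <- paste0("Sample", 1:3)
X <- lapply(X, function(dataset) {
  rownames(dataset) <- sample_names
  dataset
})

# TRUE only if every dataset has identical row names, in identical order
same_rows <- all(sapply(X, function(dataset) {
  identical(rownames(dataset), rownames(X[[1]]))
}))
same_rows
```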
Each dataset in X must have unique column names. If none are provided, SPEAR will automatically name features as datasetname_featN, where N is the feature number (e.g. dataset1_feat1).
You can manually add column (feature) names like this…
# replace ... with vector of column names (feature names) for X1
colnames(X[[1]]) <- ...
# repeat for all datasets...
colnames(X[[2]]) <- ...
# use this to confirm that all column names are unique
# should return FALSE if all are unique
any(duplicated(colnames(do.call("cbind", X))))
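If the check reports duplicates, one possible remedy (an assumption here, not a SPEAR requirement) is to prefix each feature with its dataset name. The sketch below also collects the column names with unlist() rather than cbind(), which avoids building one large combined matrix. The gene names are made up.

```r
# Two toy datasets that share the feature name "geneA"
X <- list(
  dataset1 = matrix(0, nrow = 2, ncol = 2,
                    dimnames = list(NULL, c("geneA", "geneB"))),
  dataset2 = matrix(0, nrow = 2, ncol = 2,
                    dimnames = list(NULL, c("geneA", "geneC")))
)

# Collect all feature names without combining the matrices
all_features <- unlist(lapply(X, colnames), use.names = FALSE)
any(duplicated(all_features))   # TRUE here: "geneA" appears twice

# Prefix each feature with its dataset name to make names unique
X_unique <- lapply(names(X), function(nm) {
  dataset <- X[[nm]]
  colnames(dataset) <- paste(nm, colnames(dataset), sep = "_")
  dataset
})
names(X_unique) <- names(X)

any(duplicated(unlist(lapply(X_unique, colnames), use.names = FALSE)))
```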
SPEAR performs dimensionality reduction on the datasets X in order to predict response Y.
Y also needs to be a matrix, where each row corresponds to the same sample as the matching row in every dataset of X.
Y <- read.csv(...)
# Make sure to coerce Y as a matrix...
Y <- as.matrix(Y)
# Set the rownames to be the same as X1 and X2
# NOTE: Make sure the ordering is correct before copying the rownames!
rownames(Y) <- rownames(X[[1]])
# Set the column names to be the names of the response:
# See the section on 'family' below for instructions
# on how many columns are accepted for each 'family'
colnames(Y) <- c("Response")
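When Y already has its own sample names, copying row names from X would silently mislabel any rows that are in a different order. A safer pattern, sketched here with made-up values, is to reorder Y by matching names against X using base R's match().

```r
# Toy explanatory data with named samples
X1 <- matrix(rnorm(8), nrow = 4,
             dimnames = list(paste0("Sample", 1:4), c("f1", "f2")))

# Toy response whose rows are in a different order
Y <- matrix(c(40, 10, 30, 20), ncol = 1,
            dimnames = list(c("Sample4", "Sample1", "Sample3", "Sample2"),
                            "Response"))

# Position of each X sample within Y
idx <- match(rownames(X1), rownames(Y))
stopifnot(!anyNA(idx))          # every sample in X must exist in Y

# Reorder Y so that row i of Y is the same sample as row i of X1
Y <- Y[idx, , drop = FALSE]
identical(rownames(Y), rownames(X1))
```

Using drop = FALSE keeps Y as a one-column matrix instead of collapsing it to a vector.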
SPEAR works best when each dataset is approximately normal, centered at mean μ = 0 with variance σ² = 1.
Every dataset is different and requires its own preprocessing.
Good preprocessing begins with visualizing each dataset and applying the appropriate transformations until the data are approximately normal.
This may include log or sqrt transformations, outlier removal, quantile normalization, etc.
# Look at the distributions for random features...
par(mfrow = c(2, 2))
hist(X1[,1])
hist(X1[,2])
hist(X1[,3])
hist(X1[,4])
These histograms look good (the data are approximately Normal, with mean = 0 and var = 1).
# Look at the distributions for X2's random features...
par(mfrow = c(2, 2))
hist(X2[,1])
hist(X2[,2])
hist(X2[,3])
hist(X2[,4])
These histograms do not look good (the data are approximately Normal, but centered at mean = 10 with var = 25, i.e. sd = 5).
# Scale X2 (set mean = 0, var = 1)
X2 <- scale(X2)
# Look at the distributions for X2's random features...
par(mfrow = c(2, 2))
hist(X2[,1])
hist(X2[,2])
hist(X2[,3])
hist(X2[,4])
Now, all the datasets in X (X1 and X2) are approximately Normal, centered at mean = 0 with var = 1.
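Histograms give a visual impression; a numeric check of the column means and standard deviations can complement them. The sketch below simulates a matrix like X2 (mean = 10, sd = 5; the dimensions are an assumption) and confirms what scale() does to each column.

```r
set.seed(1)
# Simulated stand-in for X2: 200 samples, 4 features
X2 <- matrix(rnorm(200 * 4, mean = 10, sd = 5), nrow = 200)

# scale() centers and scales each column separately
X2_scaled <- scale(X2)

# Column means should be numerically 0, column sds should be 1
col_means <- colMeans(X2_scaled)
col_sds   <- apply(X2_scaled, 2, sd)
round(col_means, 10)
round(col_sds, 10)
```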
A less rigorous approach is to always apply the scale function in R by default. It only centers and rescales each column, so it assumes the data are already approximately normally distributed.
However, for many datasets, this will be sufficient…
# Make sure each column (feature) in every dataset of X is scaled (mean = 0, var = 1)
D <- length(X)
X[[1]] <- scale(X[[1]])
# ...
X[[D]] <- scale(X[[D]])
# To scale all datasets, use...
X <- lapply(X, scale)
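Because scale() is a linear transformation, it does not change the shape of a distribution: a skewed feature remains skewed after scaling. The sketch below illustrates one of the transformations mentioned earlier (a log transform, here log1p) on simulated right-skewed data, followed by scaling. The dataset X3, its dimensions, and the small skewness helper are all hypothetical and only serve the illustration.

```r
set.seed(7)
# Simulated right-skewed dataset (log-normal values), 200 samples x 3 features
X3 <- matrix(rlnorm(200 * 3, meanlog = 2, sdlog = 1), nrow = 200)

# Hypothetical helper: sample skewness (0 for a symmetric distribution)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Log-transform first, then center and scale each column
X3_log   <- log1p(X3)
X3_ready <- scale(X3_log)

skew_before <- apply(X3, 2, skewness)
skew_after  <- apply(X3_ready, 2, skewness)
rbind(before = skew_before, after = skew_after)
```

Which transformation is appropriate depends on the data; the histograms remain the first thing to inspect.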
Only scale Y when appropriate (when family = "gaussian"). If Y is binomial, ordinal, or multinomial, ignore this step.
If Y is gaussian…
# Make sure each column (response) in Y is scaled (mean = 0, var = 1)
Y <- scale(Y)
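In base R, scale() records the centering and scaling values as the attributes "scaled:center" and "scaled:scale". Keeping them makes it possible to map scaled values back to the original units later. The sketch below uses made-up response values; whether SPEAR itself uses these attributes is not stated in this vignette.

```r
# Toy continuous response
Y <- matrix(c(2, 4, 6, 8), ncol = 1, dimnames = list(NULL, "Response"))

Y_scaled <- scale(Y)

# Values used by scale(), one per column of Y
y_center <- attr(Y_scaled, "scaled:center")
y_scale  <- attr(Y_scaled, "scaled:scale")

# Undo the scaling: multiply by the sd, then add the mean back
Y_back <- sweep(sweep(Y_scaled, 2, y_scale, "*"), 2, y_center, "+")
all.equal(as.numeric(Y_back), as.numeric(Y))
```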
If Y is ordinal…
# Make sure the lowest class is marked as a 0, and that each class increments by 1
table(Y)
# Should look something like:
# 0 1 2 3 4 <- class label
# 17 25 17 16 15 <- num. samples with each class
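If the ordinal labels are not already coded as 0, 1, 2, and so on, they can be recoded through an ordered set of factor levels. The labels and their ordering below are made up for illustration.

```r
# Raw ordinal labels and their intended order (lowest first)
raw_labels  <- c("low", "high", "medium", "low", "high")
class_order <- c("low", "medium", "high")

# factor() maps labels to 1..K in the given order; subtract 1 to start at 0
Y_ordinal <- matrix(as.integer(factor(raw_labels, levels = class_order)) - 1L,
                    ncol = 1, dimnames = list(NULL, "OrdinalResponse"))

table(Y_ordinal)   # class labels should run 0, 1, 2 with no gaps
```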
SPEAR supports the following options for the family parameter:
| family      | Values in Y (Response)                | Should Y be scaled? | Can Y have >1 cols?   |
|:------------|:--------------------------------------|:--------------------|:----------------------|
| gaussian    | Continuous values (scaled)            | Yes                 | Yes                   |
| binomial    | 0’s or 1’s                            | No                  | Yes (use multinomial) |
| ordinal     | Ranked scale of integers (start at 0) | No                  | No                    |
| multinomial | Multiple binomial (0/1) columns       | No                  | Yes                   |
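The rules in the table can be turned into a quick sanity check before fitting. The function check_Y below is a hypothetical helper, not part of SPEAR. It also assumes that each multinomial sample belongs to exactly one class, which matches the example in this vignette but is not stated as a rule.

```r
# Hypothetical helper: TRUE if Y looks valid for the given family
check_Y <- function(Y, family) {
  stopifnot(is.matrix(Y), is.numeric(Y))
  switch(family,
    gaussian    = TRUE,
    binomial    = all(Y %in% c(0, 1)),
    # one column, integers starting at 0, classes increasing by 1
    ordinal     = ncol(Y) == 1 && all(Y == round(Y)) && min(Y) == 0 &&
                  all(diff(sort(unique(as.vector(Y)))) == 1),
    # 0/1 columns; exactly one class per sample (assumption)
    multinomial = all(Y %in% c(0, 1)) && all(rowSums(Y) == 1),
    stop("unknown family: ", family)
  )
}

check_Y(matrix(c(0, 1, 1, 0), ncol = 1), "binomial")
check_Y(matrix(c(1, 2, 3), ncol = 1), "ordinal")   # lowest class is not 0
```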
Below are examples of what Y should look like for each family:
# Gaussian, A continuous response
family = "gaussian"
Y_gaussian <- data.frame(GausResponse = rnorm(10))
## GausResponse
## Sample1 1.8068085
## Sample2 0.1311561
## Sample3 0.3877388
## Sample4 1.5455637
## Sample5 2.1654431
## Sample6 0.0589606
## Sample7 0.2489577
## Sample8 -0.7025783
## Sample9 0.9805171
## Sample10 -2.5733975
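The example builds Y as a data.frame, while the printed output shows sample row names and the earlier section asks for a matrix. A sketch of bringing the example into that form, assuming sample names Sample1 to Sample10 as in the output above:

```r
set.seed(123)   # for reproducibility only
Y_gaussian <- data.frame(GausResponse = rnorm(10))

# Coerce to a matrix, then attach the sample names
Y_gaussian <- as.matrix(Y_gaussian)
rownames(Y_gaussian) <- paste0("Sample", 1:10)

head(Y_gaussian, 3)
```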
# Gaussian can have more than one column
# One column in Y per response
family = "gaussian"
Y_gaussian.multi <- data.frame(GausResponse1 = rnorm(10), GausResponse2 = rnorm(10))
## GausResponse1 GausResponse2
## Sample1 1.32643797 -0.16297404
## Sample2 -0.77987761 0.42814169
## Sample3 1.35314275 -0.07528643
## Sample4 0.30570052 0.29166226
## Sample5 -0.03825471 0.61914996
## Sample6 -0.29487117 0.66904934
## Sample7 -0.36933323 1.16125567
## Sample8 0.19070604 -1.45955074
## Sample9 -0.09989580 0.32223038
## Sample10 -0.31118379 0.68025118
# Binomial, A response that is either 1 (TRUE) or 0 (FALSE)
# One column in Y per response
family = "binomial"
Y_binomial <- data.frame(BinomResponse = c(0,1,0,1,1,0,1,0,0,1))
## BinomResponse
## Sample1 0
## Sample2 1
## Sample3 0
## Sample4 1
## Sample5 1
## Sample6 0
## Sample7 1
## Sample8 0
## Sample9 0
## Sample10 1
# Multinomial (Categorical), multiple groups without an ordinal scale
# Same as binomial, but with multiple columns
# One column in Y per response
# Value = 1 if belongs to class, 0 otherwise
family = "multinomial"
Y_multinomial <- data.frame(Category1 = c(1,0,0,0,0,0,1,0,0,1),
Category2 = c(0,0,1,0,0,0,0,0,1,0),
Category3 = c(0,1,0,1,0,1,0,0,0,0),
Category4 = c(0,0,0,0,1,0,0,1,0,0))
## Category1 Category2 Category3 Category4
## Sample1 1 0 0 0
## Sample2 0 0 1 0
## Sample3 0 1 0 0
## Sample4 0 0 1 0
## Sample5 0 0 0 1
## Sample6 0 0 1 0
## Sample7 1 0 0 0
## Sample8 0 0 0 1
## Sample9 0 1 0 0
## Sample10 1 0 0 0
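Class labels are often stored as a single vector rather than as 0/1 columns. The sketch below builds the one-column-per-class layout shown above from such a vector. The labels and sample names are made up.

```r
# One class label per sample (illustrative values)
labels  <- c("A", "C", "B", "C", "A")
classes <- sort(unique(labels))

# One 0/1 column per class: 1 if the sample belongs to it, 0 otherwise
Y_multinomial <- sapply(classes, function(cl) as.numeric(labels == cl))
rownames(Y_multinomial) <- paste0("Sample", seq_along(labels))

Y_multinomial
rowSums(Y_multinomial)   # each sample belongs to exactly one class
```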
# Ordinal, multiple groups that can be ranked
# Only supports one column
# Value is an integer between 0 and the number of classes minus 1
# Lowest class starts at 0
family = "ordinal"
Y_ordinal <- data.frame(OrdinalResponse = sample(c(0:4), 10, replace = TRUE))
## OrdinalResponse
## Sample1 4
## Sample2 1
## Sample3 1
## Sample4 2
## Sample5 2
## Sample6 2
## Sample7 4
## Sample8 1
## Sample9 1
## Sample10 2
To return to the main SPEAR vignette, click here