knitr::opts_chunk$set( collapse = TRUE, comment = "#>", cache = FALSE )
The goal of inteRmodel is to provide the tools required to search for models in the generalized canonical correlations as provided by the RGCCA package.
You can read more about canonical correlation at [@tenenhaus2011].
library(inteRmodel)
In this use case we want to find the best model to relate the gene expression on a glioma with some repetitions at the DNA level, taking into account that these tumors are at three different locations. Our idea is that the location might influence the tumor.
We use the data used on the vignette of the RGCCA package. We need the samples in the rows and the variables in the columns:
data("ge_cgh_locIGR") A <- ge_cgh_locIGR$multiblocks Loc <- factor(ge_cgh_locIGR$y) levels(Loc) <- colnames(ge_cgh_locIGR$multiblocks$y) A[[3]] = A[[3]][, -3] names(A)[3] <- "Loc"
With these steps we read and prepare the data for the sgcca
and rgcca
function.
The workflow is as follows:
These steps are explained in detail in the next sections.
Then we need to encode our hypothesis in a model, which takes the shape of a matrix.
To ease the design of this model there are several functions available.
The most important one is subSymm
, which makes substitutions on symmetric matrices.
C <- matrix(0, nrow = 3, ncol = 3, dimnames = list(names(A), names(A))) model0 <- subSymm(C, "GE", "CGH", 1) model0 <- subSymm(model0, "GE", "Loc", 1) model0
Here we hypothesize that the Agriculture and the industry are linked and that the agriculture is linked with the political block.
Now we look for the model that best relates these data.
out_model <- search_model(A = A, c1 = c(.071, .2, 1), scheme = "horst", scale = FALSE, verbose = FALSE, ncomp = rep(1, length(A)), bias = TRUE)
This quickly explores over the 20 different models that could be the best ones. We can see that our hypothesis is in the upper middle:
hist(out_model$AVE_inner, xlab = "inner AVE", main = "Histogram of AVE_inner in the models", breaks = 20) abline(v = out_model$AVE_inner[out_model$var12 == 0 & out_model$var13 == 1 & out_model$var23 == 1])
According to this, a better model would then be:
columns <- grep("var", colnames(out_model)) model <- symm(C, out_model[which.max(out_model$AVE_inner), columns]) model
A model that instead of the expected relation between Agriculture and Industry block, they are both related to the Political block.
We further explore the relationships in this model:
out_best <- iterate_model(C = model, A = A, c1 = c(.071,.2, 1), scheme = "horst", scale = FALSE, verbose = FALSE, ncomp = rep(1, length(A)), bias = TRUE)
We can see that usually the model is around 0.4 inner AVE.
hist(out_best$AVE_inner, xlab = "inner AVE", main = "Histogram of AVE_inner in the models")
The best model would then be:
model2 <- symm(C, out_best[which.max(out_best$AVE_inner), columns]) model2
A next step would be to see if some models are better than the others, something along these lines but with more iterations.
index <- boot_index(nrow(A[[1]]), 10) bs0 <- boot_index_sgcca(index, A = A, C = model0, c1 = c(.071,.2, 1)) bs1 <- boot_index_sgcca(index, A = A, C = model, c1 = c(.071,.2, 1)) bs2 <- boot_index_sgcca(index, A = A, C = model2, c1 = c(.071,.2, 1))
We can see that they have different distributions:
plot(c(0.6, 0.9), c(0.05, 0.15), type = "n", xlab = "inner AVE", ylab = "outer AVE", main = "Bootstrap of models") points(bs0$AVE, pch = 16, col = "red") points(bs1$AVE, pch = 16, col = "green") points(bs2$AVE, pch = 16) legend("topright", legend = c("model2", "model", "model0"), fill = c("black", "green", "red")) df <- rbind( cbind.data.frame(bs0$AVE, Model = "model 0"), cbind.data.frame(bs1$AVE, Model = "model 1"), cbind.data.frame(bs2$AVE, Model = "model 2")) agg <- aggregate(df[, 1:2], list(Model = df$Model), mean) points(agg[, 2:3], pch = 17, col = c("red", "green", "black"), cex = 1.5)
This suggests that the model2 has a higher inner AVE than the model0. With more iterations we could see if there is a tendency of a model to be more centered than the others. So, indicating that the model can better find the relationship between the blocks regardless of the input data.
We can now analyze the canonical correlation for the weight of each variable.
m2 <- RGCCA::sgcca(A, model2, c1 = c(.071,.2, 1), scheme = "horst", scale = TRUE, verbose = FALSE, ncomp = rep(1, length(A)), bias = TRUE) m0 <- RGCCA::sgcca(A, model0, c1 = c(.071,.2, 1), scheme = "horst", scale = TRUE, verbose = FALSE, ncomp = rep(1, length(A)), bias = TRUE)
We can now explore the position of each sample:
pars <- par(mfrow = c(1, 2)) plot(m0$Y[[1]][, 1], m0$Y[[2]][, 1], type = "n", xlab = "Block1 Comp1", ylab = "Block2 Comp1", main = "Initial model") abline(h = 0, v = 0) text(m0$Y[[1]][, 1], m0$Y[[2]][, 1], labels = Loc, col = as.numeric(Loc) + 1) plot(m2$Y[[1]][, 1], m2$Y[[2]][, 1], type = "n", xlab = "Block1 Comp1", ylab = "Block2 Comp1", main = "Final model") abline(h = 0, v = 0) text(m2$Y[[1]][, 1], m2$Y[[2]][, 1], labels = Loc, col = as.numeric(Loc) + 1)
Although there is a relative difference in the general outcome, what has changed more is the importance of each variable. As we can see here:
par(pars) plot(m0$a$CGH[, 1], m2$a$CGH[, 1], main = "CGH variables importance", xlab = "model0", ylab = "model2", pch = 16)
Where some variables that had a weight of 0 in model 0 become important and some variables that were important that no longer have any weight on the model 2.
Finding a different model that explains your data better challenges the assumptions about the data. By comparing models, we can learn which blocks of variables are connected thus showing new relationships. These new relationships consider the same variables but with different importance, which, with the right model, should be more accurate.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.