knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dpi=300, out.width="50%" ) library(ggplot2)
We first generate a mixture of bivariate normal distributions. The distributions differ only by their x- and y-displacement, that is, by their mean values. There are two predictors
grp2 which predict the differences in means.
grp1 predicts differences in the first dimension and
grp2 predicts differences in the second dimension. Without focus parameter, both predictors are needed to distinguish all four groups. If one of the two means is chosen as a focus parameter, only one of the two predictors is important.
library(semtree) set.seed(123) N <- 1000 grp1 <- factor(sample(x = c(0,1), size=N, replace=TRUE)) grp2 <- factor(sample(x = c(0,1), size=N, replace=TRUE)) noise <- factor(sample(x = c(0,1),size=N, replace=TRUE)) Sigma <- matrix(byrow=TRUE, nrow=2,c(2,0.2, 0.2,1)) obs <- MASS::mvrnorm(N,mu=c(0,0), Sigma=Sigma) obs[,1] <- obs[,1] + ifelse(grp1==1,3,0) obs[,2] <- obs[,2] + ifelse(grp2==1,3,0) df.biv <- data.frame(obs, grp1, grp2, noise) names(df.biv)[1:2] <- paste0("x",1:2) manifests<-c("x1","x2") model.biv <- mxModel("Bivariate_Model", type="RAM", manifestVars = manifests, latentVars = c(), mxPath(from="x1",to=c("x1","x2"), free=c(TRUE,TRUE), value=c(1.0,.2) , arrows=2, label=c("VAR_x1","COV_x1_x2") ), mxPath(from="x2",to=c("x2"), free=c(TRUE), value=c(1.0) , arrows=2, label=c("VAR_x2") ), mxPath(from="one",to=c("x1","x2"), label=c("mu1","mu2"), free=TRUE, value=0, arrows=1), mxData(df.biv, type = "raw") ); result <- mxRun(model.biv) summary(result)
This is how the data look in a 2D space:
df.biv.pred <- data.frame(df.biv, leaf=factor(as.numeric(df.biv$grp2)*2+as.numeric(df.biv$grp1))) ggplot(data = df.biv.pred, aes(x=x1, y=x2, group=leaf))+ geom_density_2d(aes(colour=leaf))+ viridis::scale_color_viridis(discrete=TRUE)+ theme_classic()
Now, we choose the mean of the second dimension
mu2 as focus parameter. We expect that only predictor
grp2. This is what we see in a single tree.
fp <- "mu2" # predicted by grp2 #fp <- "mu1" # predicted by grp1 tree.biv <- semtree(model.biv, data=df.biv, constraints = list(focus.parameters=fp)) plot(tree.biv)
fp <- "mu2" # predicted by grp2 forest <- semforest(model.biv, data=df.biv, constraints = list(focus.parameters=fp), control=semforest.control(num.trees=10, control=semtree.control(method="score",alpha=1)))
Now, we are repeating the same analysis in a forest.
forest <- semforest(model.biv, data=df.biv, constraints = list(focus.parameters=fp), control=semforest.control(num.trees=5, control=semtree.control(method="score")))
By default, we see that individual trees are fully grown (without a p-value threshold). The first split is according to
grp2 because it best explains the group differences. Subsequent splits are according to
grp1 even though the chi2 values are close to zero. They only appear because there is no p-value-based stopping criterion.
Now, let us investigate the permutation-based variable importance:
vim <- varimp(forest, method="permutationFocus") plot(vim, main="Variable Importance")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.