library(bninfo, quietly = T, warn.conflicts = F) library(tidyverse, quietly = T, warn.conflicts = F)
My goal is to create a Bayes net and an equivalent structural causal model in Pyro and compare their performance on intervention/counterfactual prediction using the Sachs network
My goal with this vignette was to quickly fit some parameters on the network, as I want to focus on the prediction task, not parameter inference.
I want to fit these parameters on the Sachs network using all the intervention data. I did this first by converting the processed discrete data into 2 levels ("on" and "off") instead of 3, since that is more interpretable and has less parameters to deal with. I also tried working with the continuous unprocessed data, applying an inverse hyperbolic transform, and fitting each node with a linear model that includes its parents, and a bunch of interaction effects that should account both for interventions and batch effects. My conclusion is that this is less preferable than working with the discrete processed data because (1) explaining the transform to a non-bio audience would be challenging, (2) it seems to call for latent variables on top of noise variables in a way that makes me want to use Stan, (3) it discrete model seems to capture strange interactions between parents like AND-gates that the linear model can't, and (4) the discrete model lets me use if-then control flow, which I think is nice in the probabilistic programming setting. The estimates from the discrete model were used to build a Python notebook that I plan to submit to the Pyro repo.
Visualizing the network.
val_str <- paste0( "[int_Mek]", "[int_PKA]", "[int_Akt]", "[int_PKC]", "[int_PIP2]", "[Plcg]", "[PIP3|Plcg]", "[PKC|int_PKC]", "[PKA|PKC:int_PKA]", "[Raf|PKC:PKA]", "[Mek|PKC:PKA:Raf:int_Mek]", "[Erk|Mek:PKA]", "[Akt|Erk:PKA:int_Akt]", "[P38|PKC:PKA]", "[Jnk|PKC:PKA]", "[PIP2|Plcg:PIP3:int_PIP2]" ) int_net <- model2network(val_str) graphviz.plot(int_net)
Let's first fit a binary network, and get an idea of what the prediction task will be.
data(tcells) .data <- tcells$processed$.data interventions <- tcells$processed$interventions
.data <- map_df(.data, ~ as.factor(ifelse(.x == 1, 'off', 'on'))) training_selection <- sample(c(T, F), prob = c(.9, .1), size = nrow(.data), replace=T) training <- .data[training_selection, ] test <- .data[!training_selection, ] INT <- interventions[training_selection] INT_test <- interventions[!training_selection] training <- training %>% mutate( int_PKA = as.factor(as.integer(INT == 'PKA')), int_Akt = as.factor(as.integer(INT == 'Akt')), int_PKC = as.factor(as.integer(INT == 'PKC')), int_PIP2 = as.factor(as.integer(INT == 'PIP2')), int_Mek = as.factor(as.integer(INT == 'Mek')) ) %>% as.data.frame
Create a network without interventions, and fit it using simulated data from the intervention network.
val_str <- paste0( "[Plcg]", "[PIP3|Plcg]", "[PKC]", "[PKA|PKC]", "[Raf|PKC:PKA]", "[Mek|PKC:PKA:Raf]", "[Erk|Mek:PKA]", "[Akt|Erk:PKA]", "[P38|PKC:PKA]", "[Jnk|PKC:PKA]", "[PIP2|Plcg:PIP3]" ) net <- model2network(val_str) fit1 <- int_net %>% bn.fit(training, method = "bayes") %>% rbn(100000) %>% select(-int_PKA, -int_Akt, -int_PKC, -int_PIP2, -int_Mek) %>% as.data.frame %>% {bn.fit(net, ., method = "bayes")}
Some observations on the distributions.
fit1$Akt
When PKA is off, if Erk is off, Akt is likely off, and when Erk is on, Akt is likely on. When PKA is on, Erk seems to have little effect on Akt. In Python could code it up as follows:
if PKA: Akt = Flip(.32) else: if Erk: Akt = Flip(.8077) else: Akt = Flip(.2793)
The other proteins are similar. Mek is a bit more complicated.
fit1$Mek
PIP3's impact on PIP2 seems negligible, as is Plcg's affect on PIP3
print(fit1$Plcg) print(fit1$PIP2) print(fit1$PIP3)
I am going to try recoding, so that levels 1 and 2 become 'off' and 3 becomes 'on'.
.data <- training %>% mutate( Raf = as.factor(ifelse(Raf == 1, 'off', 'on')), Mek = as.factor(ifelse(Mek == 1, 'off', 'on')), Plcg = as.factor(ifelse(Plcg == 3, 'on', 'off')), PIP2 = as.factor(ifelse(PIP2 == 3, 'on', 'off')), PIP3 = as.factor(ifelse(PIP3 == 3, 'on', 'off')), Erk = as.factor(ifelse(Erk == 1, 'off', 'on')), Akt = as.factor(ifelse(Akt == 1, 'off', 'on')), PKA = as.factor(ifelse(PKA == 1, 'off', 'on')), PKC = as.factor(ifelse(PKC == 1, 'off', 'on')), P38 = as.factor(ifelse(P38 == 1, 'off', 'on')), Jnk = as.factor(ifelse(Jnk == 1, 'off', 'on')) ) %>% mutate( int_PKA = as.factor(as.integer(INT == 'PKA')), int_Akt = as.factor(as.integer(INT == 'Akt')), int_PKC = as.factor(as.integer(INT == 'PKC')), int_PIP2 = as.factor(as.integer(INT == 'PIP2')), int_Mek = as.factor(as.integer(INT == 'Mek')) ) %>% as.data.frame val_str <- paste0( "[Plcg]", "[PIP3|Plcg]", "[PKC]", "[PKA|PKC]", "[Raf|PKC:PKA]", "[Mek|PKC:PKA:Raf]", "[Erk|Mek:PKA]", "[Akt|Erk:PKA]", "[P38|PKC:PKA]", "[Jnk|PKC:PKA]", "[PIP2|Plcg:PIP3]" ) net <- model2network(val_str) fit2 <- int_net %>% bn.fit(training, method = "bayes") %>% rbn(100000) %>% select(-int_PKA, -int_Akt, -int_PKC, -int_PIP2, -int_Mek) %>% as.data.frame %>% {bn.fit(net, ., method = "bayes")}
Not much difference between Plcg:
print(fit1$Plcg) print(fit2$Plcg)
print(fit1$PIP2) print(fit2$PIP2)
Not much difference, except on the direction of the relationship between PIP3 and PIP2.
This is an attempt to do evaluation by calculating log-likelihood of test data. I wrote one using cpquery
, and one that directly calculates the logmass using the CPDs.
results <- list() INT_test <- interventions[!training_selection] for(i in 1:nrow(test)){ if(INT_test[i] != "observational"){ int_target <- INT_test[i] evidence <- structure(list(test[[prot]][i]), names=prot) int_fit <- mutilated(fit1, evidence) ordered_nodes <- node.ordering(fit1) factors <- NULL for(prot in ordered_nodes){ parents <- bnlearn::parents(int_fit, prot) prot_cpd <- as_tibble(int_fit[[prot]]$prob) if(nrow(prot_cpd) == 2){ names(prot_cpd)[1] <- prot } obs <- test[i, c(prot, parents)] str <- paste0(names(obs), " == '", map(obs, as.character), "'", sep = "", collapse = " , ") cmd <- paste0('filter(prot_cpd, ', str, ")$n") factor <- eval(parse(text = cmd)) factors <- c(factors, factor) } names(factors) <- ordered_nodes results[[int_target]] <- c(results[[int_target]], factors) } } for(i in 1:nrow(test)){ if(INT_test[i] != "observational"){ prot <- INT_test[i] evidence <- structure(list(test[[prot]][i]), names=prot) int_fit <- mutilated(fit1, evidence) not_prot <- setdiff(names(test), prot) obs <- paste("(", not_prot, " == '", sapply(test[1, not_prot], as.character), "')", sep = "", collapse = " & ") ev <- paste0("(", names(evidence), " == '", evidence[[prot]], "')") cmd <- paste0("log(cpquery(int_fit, ", obs, ", ", ev, "))", sep = "") results[[prot]] <- c(results[[prot]], eval(parse(text = cmd))) } }
Thinking further, I think it is preferable to focus on a specific prediction task. I think Erk is a good candidate because it is downstream of many of the interventions.
Let's see if there is an intervention dataset for Erk where we have a great difference from the others.
plot_comparison <- function(int){ int_erk <- filter(test, INT_test == int)$Erk not_int_erk <- filter(test, INT_test != int)$Erk title <- paste0('Erk in ', int, ' and non-', int, ' sets.') barplot(prop.table(table(not_int_erk)), ylim=c(0,1), col=alpha("darkgrey", .5), main=title) barplot(prop.table(table(int_erk)), ylim=c(0,1), add=T, col=alpha("darkred", .5)) } plot_comparison('PKA') plot_comparison('Akt') plot_comparison('PKC') plot_comparison('PIP2') plot_comparison('Mek')
So I am going to go with this intervention on Mek as the evaluation set.
We will fit a linear model to the protein values transformed by the inverse hyperbolic transform, which is like the log transform but allowing for 0's.
.data <- tcells$raw_data asinh <- function(x) { log(x + sqrt(x*x + 1)) } prot_names <- c("Raf", "Mek", "Plcg", "PIP2", "PIP3", "Erk", "Akt", "PKA", "PKC", "P38", "Jnk") for (prot in prot_names){ .data[prot] <- asinh(.data[prot]) }
Create a variable for the batch of the data. Each protein variable will have this as a predictor in order to estimate batch effects.
for(i in 1:length(.raw_data)){ .data[[i]]$batch <- names(.data)[i] }
Next, we create the model for each protein. We want a linear model that is additive in the parents, so we exclude multiplicative effects between parents. However, to account for pertubations and batch effects, we include a multiplicative effect between batch and each parent.
formulas <- list( PKC = PKC ~ batch, PKA = PKA ~ PKC*batch, Jnk = Jnk ~ PKC*batch + PKA*batch, Raf = Raf ~ PKC*batch + PKA*batch, Mek = Mek ~ PKC*batch + PKA*batch + Raf*batch, Erk = Erk ~ Mek*batch + PKA*batch, Akt = Akt ~ Erk*batch + PKA*batch, P38 = P38 ~ PKA*batch + PKC*batch, Plcg = Plcg ~ batch, PIP3 = PIP3 ~ Plcg*batch, PIP2 = PIP2 ~ PIP3*batch + Plcg*batch )
Seperate the data into training and test sets.
training_selection <- sample(c(T, F), prob = c(.9, .1), size = nrow(.fit_data), replace=T) training <- .fit_data[training_selection, ] test <- .fit_data[!training_selection, ]
Fit a model for each protein.
fit_protein <- function(prot){ fit <- lm(formulas[[prot]], training) res <- predict.glm(fit, newdata = test, type = "response") - test[[prot]] mse <- mean(res**2) return(list(fit=tidy(fit), sqrd_errs = summary(res))) } model <- map(prot_names, fit_protein) names(model) <- prot_names
There might be something that can be done here to improve residuals. IRLS?
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.