knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The single-atom catalysis data is stored in data/single_atom_catalysis.RData
, and the raw data is available at this Github repo. Here we model the metal/oxide binding energy (the response variable $y$) using $p=59$ physical properties of the transition metals and the oxide supports (the primary features $X$). The response variable $y$ and the primary features $X$ are treated as continuous variables, and we aim to use iBART to find an interpretable model with high predictive performance for the metal/oxide binding energy.
A total of 13 transition metals ($\text{Cu, Ag, Au, Ni, Pd, Pt,}$ $\text{Co, Rh, Ir, Fe, Ru, Mn, V}$) and 7 oxide supports ($\text{CeO}2(111)$, $\text{MgO}(100)$, $\text{CeO}_2(110)$, $\text{TbO}_2(111)$, $\text{ZnO}(100)$, $\text{TiO}_2(011)$, $\alpha\text{-Al}_2\text{O}_3(0001)$) were studied in the dataset, making a total of $n = 13 \times 7 = 91$ metal/oxide pairs. The primary feature matrix $X$ contains various physical properties of the transition metals and the oxide supports including Pauling Electronegativity ($\chi_P$), $(n-1)^{\text{th}}$ and $n^{\text{th}}$ Ionization Energies ($\text{IE}{n-1}$, $\text{IE}n$), Electron Affinity ($\text{EA}$), $\text{HOMO}$ Energy, $\text{LUMO}$ Energy, Heat of Sublimation ($\Delta H{\text{sub}}$), Oxidation Energy of oxide support ($\Delta H_{\text{f,ox,bulk}}$), Oxide Formation Enthalpy ($\Delta H_{\text{f,ox}}$), Zunger Orbital Radius ($r$), Atomic Number ($Z$), Meidema Parameters of metal atoms $(\eta^{1/3}, \varphi)$, Valance Electron ($\text{N}{\text{val}}$), Oxygen Vacancy Energy of oxide support ($\Delta\text{E}{\text{vac}}$), Workfunction of oxide support $(\text{WF})$, Surface Energy ($\gamma$), Coordination Number ($\text{CN}$), and Bond Valence of surface metal atom ($\text{BV}$). Most of these physical properties are defined for both the transition metals and the oxide supports while a few of them are only defined for either the transition metals or the oxide supports. A detailed description of the 59 primary features $X$ can be find in pages 11--14 of the data supplementary materials published by O'Connor et al.
Before loading the iBART package, we must allocate enough memory for Java to avoid out of memory errors.
# Allocate 10GB of memory for Java. Must be called before library(iBART) options(java.parameters = "-Xmx10g") library(iBART)
Next, we load the real data set and examine what data are needed to run iBART.
load("../data/catalysis.rda") # load data summary(catalysis)
The data set consists of 4 objects:
X
: a matrix
of physical properties of the transition metals and the oxide supports described in Data Description. These are our primary features (predictors).y
: a numeric
vector of metal/oxide binding energy described in Data Description. This is our response variable.head
: a character
vector storing the column names of X
.unit
: a (optional) list
of named numeric vectors. This stores the unit information of the primary features X
. This can be generated using the helper function generate_unit(unit, dimension)
. See ?iBART::generate_unit
for more detail.Now let's apply iBART to this data set. Besides the usual regression data (X,y)
, we need to specify the descriptor generating strategy through opt
. Here we specify opt = c("binary", "unary", "binary")
, meaning there will be 3 iterations and we want to alternate between binary and unary operators, starting with binary operators $\mathcal{O}_b$. We can also use all operators $O$ in an iteration. For example, opt = c("all", "all")
will apply all operators $O$ for 2 iterations.
iBART_real_data <- iBART(X = catalysis$X, y = catalysis$y, head = catalysis$head, # colnames of X unit = catalysis$unit, # units of X opt = c("binary", "unary", "binary"), # binary operator first out_sample = FALSE, Lzero = TRUE, K = 5, # maximum descriptors in l-zero model standardize = FALSE, seed = 888)
load("../data/iBART_real_data.rda") # load full result
To save time, we cached the result of a complete run in data("iBART_real_data", package = "iBART")
, which can be replicated by using the above code. iBART()
returns many interesting outputs. For example, iBART_real_data$iBART_gen_size
and iBART_real_data$iBART_sel_size
store dimension of the newly generated / selected descriptor space for each iteration. Let's plot them and see how iBART use nonparametric variable selection for dimension reduction.
library(ggplot2) df_dim <- data.frame(dim = c(iBART_real_data$iBART_sel_size, iBART_real_data$iBART_gen_size), iter = rep(0:3, 2), type = rep(c("Selected", "Generated"), each = 4)) ggplot(df_dim, aes(x = iter, y = dim, colour = type, group = type)) + theme(text = element_text(size = 15), legend.title = element_blank()) + geom_line(size = 1) + geom_point(size = 3, shape = 21, fill = "white") + geom_text(data = df_dim, aes(label = dim, y = dim + 40, group = type), position = position_dodge(0), size = 5, colour = "blue") + labs(x = "Iteration", y = "Number of descriptors") + scale_x_continuous(breaks = c(0, 1, 2, 3))
We can access the selected $k$-descriptor via iBART_real_data$Lzero_names
and the corresponding regression model in iBART_real_data$Lzero_models
. For instance, the selected 3-descriptor model is
iBART_real_data$Lzero_names[[3]] summary(iBART_real_data$Lzero_models[[3]])
The OIS model differs from the non-OIS model in that the former builds on nonlinear descriptors (composition of $\mathcal{O}$ on $X$) while the latter builds on the primary features $X$. The OIS model has many advantages. In particular, it reveals interpretable nonlinear relationship between $y$ and $X$, and improves prediction accuracy over a simple linear regression model (or non-OIS model). We showcase the improved accuracy over non-OIS model using Figure 7 in the paper.
# Train a non-OIS model with 3 predictors set.seed(123) model_no_OIS <- k_var_model(X_train = catalysis$X, y_train = catalysis$y, k = 3, parallel = FALSE) #### Figure 7 #### library(ggpubr) model_OIS <- iBART_real_data$Lzero_model[[3]] # Prepare data for plotting data_OIS <- data.frame(y = catalysis$y, y_hat = model_OIS$fitted.values) data_no_OIS <- data.frame(y = catalysis$y, y_hat = model_no_OIS$models$fitted.values) p1 <- ggplot(data_OIS, aes(x = y_hat, y = catalysis$y)) + geom_point() + geom_abline() + xlim(c(min(data_OIS$y_hat, data_OIS$y) - 0.2, max(data_OIS$y_hat, data_OIS$y) + 0.2)) + ylim(c(min(data_OIS$y_hat, data_OIS$y) - 0.2, max(data_OIS$y_hat, data_OIS$y) + 0.2)) + xlab("") + ylab("") + annotate("text", x = -12, y = -3, parse = TRUE, label = paste("R^{2} ==", round(summary(model_OIS)$r.squared, 4))) p2 <- ggplot(data_no_OIS, aes(x = y_hat, y = catalysis$y)) + geom_point() + geom_abline() + xlim(c(min(data_no_OIS$y_hat, data_no_OIS$y) - 0.2, max(data_no_OIS$y_hat, data_no_OIS$y) + 0.2)) + ylim(c(min(data_no_OIS$y_hat, data_no_OIS$y) - 0.2, max(data_no_OIS$y_hat, data_no_OIS$y) + 0.2)) + xlab("") + ylab("") + annotate("text", x = -12, y = -3, parse = TRUE, label = paste("R^{2} ==", round(summary(model_no_OIS$models)$r.squared, 4))) fig <- ggarrange(p1, p2, labels = c("OIS", "non-OIS"), ncol = 2, nrow = 1) annotate_figure(fig, bottom = text_grob("Predicted binding energy from descriptors (eV)"), left = text_grob("DFT binding energy (eV)", rot = 90))
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.