Deploy XGBoost Model as SQL Query"
In xgb2sql: Convert Trained 'XGBoost' Model to SQL Query

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

You fit a boosting tree model in R with your favourite package xgboost, the validation results looks great, so next question comes up as how to deploy this model into production so that others could utilize it to help with the business. Incorporating the model into a shiny app would certainly be a good idea, but sometimes the model needs to be integrated into some other systems that the company is heavily relied on. Plus moving large amount of data between database and R could be time and memory consuming. So we propose R package xgb2sql enabling in-database scoring of XGBoost models built in R by translating trained model objects into SQL query.

CRAN Task View: Model Deployment with R categorizes the process of deploying models to various environments for scoring or inferencing on new data into two categories. The first category is Deployment through Different Types of Artifacts, which basically means exporting the model as an object, then using supported software/platform to consume this object scoring out the model predictions. The other category is Deployment through Cloud/Server, which includes a). providing an R interface to third-party managed services such as Google Cloud Machine Learning Engine; b). turning R code into web API and opening service on the server. Our approach provides SQL query producing model predictions, which can be taken as a combination of the model itself plus the scoring process. The output SQL query can be treated as an artifact, but we can easily set up service for it on the database server.

The SQL query generated by this tool is basic enough to be compatible with all SQL-based database products and services. Other than this tool, there are two R packages providing modeling and predicting capability inside database:

Package ibmdbR offers modeling and predicting capability for naive-Bayes, linear regression, decision tree, association rules, and K-means clustering. But the only supported database product and service is IBM DB2.
Package tidypredict leverages dplyr and dbplyr for SQL translation, supporting linear regression, generalized linear model, and random forest.

Here is the outline for the rest of this vignette:

Prepare Data in Both R and Database
Transform XGBoost Model into SQL Query

Prepare Data in Both R and Database {#data}

As we know, xgboost only consumes numeric input for its model fitting function [^1]. So after transferring raw table in database to R as a data.frame/data.table, same one-hot encoding needs to be performed on both the table and the data.frame/data.table. Here we have function onehot2sql() to perform one-hot encoding on the training data in R, producing at the same the SQL query performing the exact transformation for the raw table in database. Let's start with loading the sample dataset from ggplot2:

library(data.table)
library(xgboost)
library(xgb2sql)
df <- data.frame(ggplot2::diamonds)
head(df)

Funtion onehot2sql() is built upon base R functions model.frame() and model.matrix(). Other than consuming a data.frame as input, this function has been optimized to work with a data.table with greater efficiency. It outputs a matrix ready for model fitting following rules listed below:

The function treats any non-numeric columns, i.e., columns with class not being numeric and integer, as categorical and performs one-hot encoding for them.
The one-hot encoding doesn't remove one feature-level combination for each categorical feature, as model.matrix does in order to avoid issues caused by multicollinearity. Although the output matrix conveys same amount of information with and without one feature-level combination removed, thus producing similar model performance, system knowledge gained along this modeling practice is very different. If let's say the "cut" of diamonds being "Ideal" has a huge impact on its price as the target/response, removing binary column "cut.Ideal" in the output matrix would result in the predictive power of "cut.Ideal" being scattered among other "cut" columns [^2]. The model performance would be comparative, but we would miss the information that "cut" being "Ideal" is the dominate factor of price, by studying the variable importance. So as multicollinearity wouldn't be a problem for tree-based model [^3], we believe performing full one-hot encoding is more appropriate for XGBoost modeling.
The function keeps NAs inside both categorical and numeric features preserved. As pointed by the author of xgboost, the algorithm will automatically learn what is the best direction to go when a value is missing, which can be viewed as automatically "learn" what is the best imputation value for missing values based on reduction on training loss [^4]. This is one of the reasons of XGBoost being so powerful, so we are keeping all NAs in the output matrix.
The function outputs meta information tracking all the levels for each categorical feature. If it is given to the function as an input, the exact feature-level combinations will be populated, even if the new data is missing one level for a particular categorical feature, or having a new level never seen before.

Available arguments of this function are, which will be explained with examples:

onehot2sql(data, meta=NULL, sep="_", ws_replace=TRUE, ws_replace_with="",
           unique_id=NULL, output_file_name=NULL, input_table_name=NULL)

Output of this function is a list containing:

meta data tracking the transformation.
matrix model.matrix being the data after processing which is ready for XGBoost fitting.
SQL query sql performing the exact one-hot encoding in the database.

So let' take a look of its basic usage:

out <- onehot2sql(df)
print(out$meta)
head(out$model.matrix)

It should be noted that level "Very Good" for feature "cut" has been replaced with "VeryGood", with the white-space removed. This behaviour is controlled by function arguments ws_replace=TRUE and ws_replace_with="", where other symbol can be specified to replace the white-space inside levels of categorical features. Such processing is very necessary as SQL database usually doesn't allow white-space inside its table column names. And symbol separating the feature and its levels is controlled by sep="_". The output model.matrix would have all its columns reordered alphabetically.

The SQL query performing one-hot encoding for the raw table is:

cat(out$sql)

We want to emphasise here that an unique row identifier inside the raw table is crucial for in-database scoring of XGBoost model. Column name of the identifier can be specified by the function argument unique_id, which will be passed along to the table after one-hot encoding. If it is not given, SQL query will be populated with column name "ROW_KEY" for the identifier. Similarly, "INPUT_TABLE" is used in the query if name of the raw table input_table_name is NULL. Given a valid value, the SQL query will be written to the file specified by output_file_name.

Let's have another example with NAs and a date column:

d2 <- data.table(ggplot2::diamonds)
# change column class
d2[, cut:=factor(cut, ordered=FALSE)]
d2[, clarity:=as.character(clarity)]
# create IDate column
d2[, tsdt:=as.IDate('2017-01-05')]
d2[1:3, tsdt:=tsdt-1]
# add NAs
d2[1, clarity:=NA]
d2[2, depth:=NA]
head(d2)
out2 <- onehot2sql(d2)
head(out2$model.matrix)

Then let's look at when meta is given to data with new elements, whether onehot2sql() will output model.matrix with identical columns as the training data, in order to apply predict() to the trained model on the new data:

newdata <- d2[1:5,]
# newdata has columns with new elements
newdata[5, clarity:='NEW']; newdata[1,tsdt:=as.IDate('2018-05-01')]
# newdata has a new column
newdata[, new_col:=1]
# newdata is lacking a column
newdata[, cut:=NULL]
head(newdata)
onehot2sql(newdata, meta=out2$meta)$model.matrix

We can see from this example that

any new levels will have value of 0s on all the columns related to that feature.
any new features will not be in the output model.matrix.
the entire feature will be imputed with NAs if it is missing in the new data, and warnings will be given.

We recommend any feature engineering and/or missing imputation work to be done before applying function onehot2sql() to the training data in R. It should be the last step before kicking off the model fitting. And SQL query for feature engineering and/or missing imputation can be placed as a sub-query inside the one-hot query. For example, replacing "INPUT_TABLE" inside out$sql with following sub-query will do one-hot encoding together with missing imputation for feature "clarity":

(SELECT ROW_KEY, [cut], [color], 
  (case when [clarity] IS NULL then 'MISS' else [clarity] end) as [clarity],
  [carat], [depth], [table], [price], [x], [y], [z]
FROM INPUT_TABLE) AS IMPUTED_TABLE

Transform XGBoost Model into SQL Query {#xgb}

Before taking a close look at function booster2sql() translating XGBoost model into SQL query, we want to illustrate the suggested work-flow for the whole process of model fitting and scoring with package xgb2sql:

We start with transferring raw table from database to R as a data.frame/data.table. There are many packages supporting database connection, we recommend dplyr and DBI here.
After all feature engineering and missing imputation is done, apply function onehot2sql() to the data.frame/data.table, obtaining the model.matrix and storing the one-hot query.
Conduct all modeling practices until reaching a final model, then apply function booster2sql() to the final model, producing the XGBoost query for its in-database scoring.
Modeling in R is done, let's move to in-database scoring:
- Execute the one-hot query on the raw table, creating the model-ready table.
- Execute the XGBoost query on the model-ready table, obtaining the model predictions.
- Compare the model prediction in R with the values given by the XGBoost query is always recommended.

Now let's move back to function booster2sql(). Available arguments are:

booster2sql(xgbModel, print_progress=FALSE, unique_id=NULL,
            output_file_name=NULL, input_table_name=NULL, input_onehot_query=NULL)

The model input xgbModel to this function should have a class of xgb.Booster. And print_progress=FALSE controls whether the translating progress should be printed to console. Similarly, unique_id and input_table_name should be given to generate the SQL query. It should be noted that there must be a valid file path for output_file_name to write the query, otherwise the function will not run.

Let's try to predict the "price" of diamonds using the other features. In order to demonstrate the generated XGBoost query, we will train the model with max.depth=2 and nround=2.

x <- out$model.matrix[,colnames(out$model.matrix)!='price']
y <- out$model.matrix[,colnames(out$model.matrix)=='price']
bst <- xgboost(data = x,
               label = y,
               max.depth = 2,
               eta = .3,
               nround = 2,
               objective = 'reg:linear')
booster2sql(bst, output_file_name='xgb.txt')

cat(readChar('xgb.txt', file.info('xgb.txt')$size))

We can see that each SELECT ... AS ONETREE section inside the XGBoost query is composed of nested case when statement, providing scores along a tree structure. And each of these sections represents one round/iteration of the XGBoost model. Values for the splits and scores within the query are from the xgb.dump() of the model without any rounding:

xgb.dump(bst)

It should be noted that model prediction calculated by adding up the scores provided by xgb.dump(), is different from that by applying predict() to the model directly. It is a rounding difference thus extremely insignificant. But since the XGBoost query is generated with scores from xgb.dump, this difference will still be there between the in-database scoring and the R's predict() of the model.

There is one last argument of booster2sql() we haven't talked about, i.e., input_onehot_query. Here we can input the one-hot query generated by onehot2sql(), which will be used as sub-query replacing "MODREADY_TABLE" within the XGBoost query. In this way, the XGBoost query can be executed on the raw table, producing the model predictions directly.

booster2sql(bst, output_file_name='onehot-xgb.txt', input_onehot_query=out$sql)

cat(readChar('onehot-xgb.txt', file.info('onehot-xgb.txt')$size))

As processing time and query size grow exponentially with max.depth, linearly with nround, this approach of combining the one-hot query and the XGBoost query together should be used for only simple models.

[^1]: Understand your dataset with XGBoost.\

[^2]: Ensembles of tree-based models: why correlated features do not trip them and why NA matters.\

[^3]: StackExchange: Does XGBoost handle multicollinearity by itself?\

[^4]: GitHub Issue: What are the ways of treatng missing values in XGboost?\