isotree.to.sql | R Documentation |
Generate SQL statements - either separately per tree (the default), for a single tree if needed (if passing 'tree'), or for all trees concatenated together (if passing 'table_from'). Can also be made to output terminal node numbers (numeration starting at one).
Some important considerations:
Making predictions through SQL is much less efficient than from the model itself, as each terminal node will have to check all of the conditions that lead to it instead of passing observations down a tree.
If constructed with the default arguments, the model will not perform any sub-sampling, which can lead to very big trees. If it was fit to a large dataset, the generated SQL might consist of gigabytes of text, and might lay well beyond the character limit of commands accepted by SQL vendors.
The generated SQL statements will not include range penalizations, thus predictions might differ from calls to 'predict' when using 'penalize_range=TRUE'.
The generated SQL statements will only include handling of missing values when using 'missing_action="impute"'. When using the single-variable model with categorical variables + subset splits, the rule buckets might be incomplete due to not including categories that were not present in a given node - this last point can be avoided by using 'new_categ_action="smallest"', 'new_categ_action="random"', or 'missing_action="impute"' (in the latter case will treat them as missing, but the 'predict' function might treat them differently).
The resulting statements will include all the tree conditions as-is, with no simplification. Thus, there might be lots of redundant conditions in a given terminal node (e.g. "X > 2" and "X > 1", the second of which is redundant).
If using 'scoring_metric="density"' or 'scoring_metric="boxed_ratio"' plus 'output_tree_num=FALSE', the outputs will correspond to the logarithm of the density rather than the density.
isotree.to.sql(
model,
enclose = "doublequotes",
output_tree_num = FALSE,
tree = NULL,
table_from = NULL,
select_as = "outlier_score",
column_names = NULL,
column_names_categ = NULL,
nthreads = model$nthreads
)
model |
An Isolation Forest object as returned by isolation.forest. |
enclose |
With which symbols to enclose the column names in the select statement so as to make them SQL compatible in case they include characters like dots. Options are:
|
output_tree_num |
Whether to make the statements / outputs return the terminal node number instead of the isolation depth. The numeration will start at one. |
tree |
Tree for which to generate SQL statements or other outputs. If passed, will generate the statements only for that single tree. If passing 'NULL', will generate statements for all trees in the model. |
table_from |
If passing this, will generate a single select statement for the outlier score from all trees, selecting the data from the table name passed here. In this case, will always output the outlier score, regardless of what is passed under 'output_tree_num'. |
select_as |
Alias to give to the generated outlier score in the select statement. Ignored when not passing 'table_from'. |
column_names |
Column names to use for the numeric columns. If not passed and the model was fit to a 'data.frame', will use the column names from that 'data.frame', which can be found under 'model$metadata$cols_num'. If not passing it and the model was fit to data in a format other than 'data.frame', the columns will be named 'column_N' in the resulting SQL statement. Note that the names will be taken verbatim - this function will not do any checks for e.g. whether they constitute valid SQL or not when exporting to SQL, and will not escape characters such as double quotation marks when exporting to SQL. |
column_names_categ |
Column names to use for the categorical columns. If not passed, will use the column names from the 'data.frame' to which the model was fit. These can be found under 'model$metadata$cols_cat'. |
nthreads |
Number of parallel threads to use. |
If passing neither 'tree' nor 'table_from', will return a list of 'character' objects, containing at each entry the SQL statement for the corresponding tree.
If passing 'tree', will return a single 'character' object with the SQL statement representing that tree.
If passing 'table_from', will return a single 'character' object with the full SQL select statement for the outlier score, selecting the columns from the table name passed under 'table_from'.
library(isotree)
data(iris)
set.seed(1)
iso <- isolation.forest(iris, ntrees=2, sample_size=16, ndim=1, nthreads=1)
sql_forest <- isotree.to.sql(iso, table_from="my_iris_table")
cat(sql_forest)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.