Introduction to xform_function

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

Introduction

This vignette provides examples of how to use the xform_function transformation to create new data features for PMML models.

Given a xform_wrap object and a transformation expression, xform_function calculates data for a new feature and creates a new xform_wrap object. When PMML is produced with pmml::pmml(), the transformation is inserted into the LocalTransformations node as a DerivedField.

Multiple data fields and functions can be combined to produce a new feature.

The code below uses knitr::kable() to make tables more readable.

library(pmml)
library(knitr)

Single numeric input

Using the iris dataset as an example, let's construct a new feature by transforming one variable. Load the dataset and show the first few lines:

data(iris)
kable(head(iris,3))

Create the iris_box object with xform_wrap:

iris_box <- xform_wrap(iris)

iris_box contains the data and transform information that will be used to produce PMML later. The original data is in iris_box$data. Any new features created with a transformation are added as columns to this data frame.

kable(head(iris_box$data,3))

Transform and field information is in iris_box$field_data. The field_data data frame contains information on every field in the dataset, as well as every transform used. The xform_function column contains expressions used in the xform_function transform.

kable(iris_box$field_data)

Now add a new feature, Sepal.Length.Sqrt, using xform_function:

iris_box <- xform_function(iris_box,orig_field_name="Sepal.Length",
                         new_field_name="Sepal.Length.Sqrt",
                         expression="sqrt(Sepal.Length)")

The new feature is calculated and added as a column to the iris_box$data data frame:

kable(head(iris_box$data,3))

iris_box$field_data now contains a new row with the transformation expression:

kable(iris_box$field_data[6,c(1:3,14)])

Construct a linear model for Petal.Width using this new feature, and convert it to PMML:

fit <- lm(Petal.Width ~ Sepal.Length.Sqrt, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)

Since the model predicts Petal.Width using a variable based on Sepal.Length, the PMML will contain these two fields in the DataDictionary and MiningSchema:

fit_pmml[[2]] #Data Dictionary node
fit_pmml[[3]][[1]] #Mining Schema node

The LocalTransformations node contains Sepal.Length.Sqrt as a derived field:

fit_pmml[[3]][[3]]

Single categorical input

xform_function can also operate on categorical data. In this example, let's create a numeric feature that equals 1 when Species is setosa, and 0 otherwise:

iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box,orig_field_name="Species",
                         new_field_name="Species.Setosa",
                         expression="if (Species == 'setosa') {1} else {0}")
kable(head(iris_box$data,3))

Create a linear model and check the LocalTransformations node:

fit <- lm(Petal.Width ~ Species.Setosa, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)
fit_pmml[[3]][[3]]

Multiple input fields

Several fields can be combined to create new features. Let's make a new field from the ratio of sepal and petal lengths:

iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box,orig_field_name="Sepal.Length,Petal.Length",
                         new_field_name="Length.Ratio",
                         expression="Sepal.Length / Petal.Length")

As before, the new field is added as a column to the iris_box$data data frame:

kable(head(iris_box$data,3))

Fit a linear model using this new feature, and convert it to pmml:

fit <- lm(Petal.Width ~ Length.Ratio, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)

The pmml will contain Sepal.Length and Petal.Length in the DataDictionary and MiningSchema:

fit_pmml[[2]] #Data Dictionary node
fit_pmml[[3]][[1]] #Mining Schema node

The Local.Transformations node contains Length.Ratio as a derived field:

fit_pmml[[3]][[3]]

Using a previously derived feature

It is possible to pass a feature derived with xform_function to another xform_function call. To do this, the second call to xform_function must use the original data field names (instead of the derived field) in the orig_field_name argument.

iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box,orig_field_name="Sepal.Length,Petal.Length",
                         new_field_name="Length.Ratio",
                         expression="Sepal.Length / Petal.Length")

iris_box <- xform_function(iris_box,orig_field_name="Sepal.Length,Petal.Length,Sepal.Width",
                         new_field_name="Length.R.Times.S.Width",
                         expression="Length.Ratio * Sepal.Width")
kable(iris_box$field_data[6:7,c(1:3,14)])
fit <- lm(Petal.Width ~ Length.R.Times.S.Width, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)

The pmml will contain Sepal.Length, Petal.Length, and Sepal.Width in the DataDictionary and MiningSchema:

fit_pmml[[2]] #Data Dictionary node
fit_pmml[[3]][[1]] #Mining Schema node

The Local.Transformations node contains Length.Ratio and Length.R.Times.S.Width as derived fields:

fit_pmml[[3]][[3]]

Factor output

The resulting field can be numeric or factor. Note that factors are exported with dataType = "string" and optype = "categorical" in PMML. The following code creates a factor with 3 levels from Sepal.Length:

iris_box <- xform_wrap(iris)

iris_box <- xform_function(wrap_object = iris_box,
                              orig_field_name = "Sepal.Length",
                              new_field_name = "SL_factor",
                              new_field_data_type = "factor",
                              expression = "if(Sepal.Length<5.1) {'level_A'} else if (Sepal.Length>6.6) {'level_B'} else {'level_C'}")

kable(head(iris_box$data, 3))

The feature can then be used to create a model as usual:

fit <- lm(Petal.Width ~ SL_factor, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)

PMML functions supported by xform_function

The following R functions and operators are directly supported by xform_function. Their PMML equivalents are listed in the second column:

R <- c("+","-","/","*","^","<","<=",">",">=","&&","&","|","||","==","!=","!","ceiling","prod","log")
PMML <- c("+","-","/","*","pow","lessThan","lessOrEqual","greaterThan","greaterOrEqual","and","and","or","or","equal","notEqual","not","ceil","product","ln")

funcs_df <- data.frame(R, PMML)
knitr::kable(funcs_df)

For these functions, no extra code is required for translation.

The R function prod can be used as long as only numeric arguments are specified. That is, prod can take an na.rm argument, but specifying this in xform_function directly will not produce PMML equivalent to the R expression.

Similarly, the R function log can be used directly as long as the second argument (the base) is not specified.

PMML functions not supported by xform_function

There are built-in functions defined in PMML that cannot be directly translated to PMML using xform_function as described above.

In this case, an error will be thrown when R tries to calculate a new feature using the function passed to xform_function, but does not see that function in the environment.

It is still possible to make xform_function work, but the PMML function must be defined in the R environment first.

Let's use isIn, a PMML function, as an example. The function returns a boolean indicating whether the first argument is contained in a list of values. Detailed specification for this function is available on this DMG page.

One way to implement this in R is by using %in%, with the list of values being represented by ...:

isIn <- function(x, ...) {
  dots <- c(...)
  if (x %in% dots) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

isIn(1,2,1,4)

This function can now be passed to xform_function. The following code creates a feature that indicates whether Species is either setosa or versicolor:

iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box,orig_field_name="Species",
                         new_field_name="Species.Setosa.or.Versicolor",
                         expression="isIn(Species,'setosa','versicolor')")

The data data frame now contains the new feature:

kable(head(iris_box$data,3))

Create a linear model and view the corresponding PMML for the function:

fit <- lm(Petal.Width ~ Species.Setosa.or.Versicolor, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)
fit_pmml[[3]][[3]]

PMML function not supported by xform_function - another example

As another example, let's use R's mean function to create a new feature. PMML has a built-in avg, so we will define an R function with this name.

avg <- function(...) {
  dots <- c(...)
  return(mean(dots))
}

Now use this function to take an average of several other features and combine with another field:

iris_box <- xform_wrap(iris)
iris_box <- xform_function(iris_box,orig_field_name="Sepal.Length,Petal.Length,Sepal.Width",
                         new_field_name="Length.Average.Ratio",
                         expression="avg(Sepal.Length,Petal.Length)/Sepal.Width")

The data data frame now contains the new feature:

kable(head(iris_box$data,3))

Create a simple linear model and view the corresponding PMML for the function:

fit <- lm(Petal.Width ~ Length.Average.Ratio, data=iris_box$data)
fit_pmml <- pmml(fit, transform=iris_box)
fit_pmml[[3]][[3]]

In the PMML, avg will be recognized as a valid function.

PMML for arbitrary functions

The function function_to_pmml (part of the pmml package) makes it possible to convert an R expression into PMML directly, without creating a model or calculating values.

As long as the expression passed to the function is a valid R expression (e.g., no unbalanced parentheses), it can contain arbitrary function names not defined in R. Variables in the expression passed to xform_function are always assumed to be field names, and not substituted. That is, even if x has a value in the R environment, the resulting expression will still use x.

function_to_pmml("1 + 2")

x <- 3
function_to_pmml("foo(bar(x * y))")

More notes on functions

There are several limitations to parsing expressions in xform_function.

Each transformation operates on one data row at a time. For example, it is not possible to compute the mean of an entire feature column in xform_function.

An expression such as foo(x) is treated as a function foo with argument x. Consequently, passing in an R vector c(1,2,3) will produce PMML where c is a function and 1,2,3 are the arguments:

function_to_pmml("c(1,2,3)")

We can also see what happens when passing an na.rm argument to prod, as mentioned in an above example:

function_to_pmml("prod(1,2,na.rm=FALSE)") #produces incorrect PMML
function_to_pmml("prod(1,2)") #produces correct PMML

Additionally, passing in a vector to prod produces incorrect PMML:

prod(c(1,2,3))
function_to_pmml("prod(c(1,2,3))")

More examples of functions

The following are additional examples of pmml produced from R expressions.

Extra parentheses:

function_to_pmml("pmmlT(((1+2))*(x))")

If-else expressions:

function_to_pmml("if(a<2) {x+3} else if (a>4) {4} else {5}")

References



Try the pmml package in your browser

Any scripts or data that you put into this service are public.

pmml documentation built on March 18, 2022, 5:49 p.m.