Executing business rules at scale using RDrools - an interface to Drools
In Rdrools: A Rules Engine Based on 'Drools'

#Package installation if required for handbook

if (!requireNamespace("DT", quietly = TRUE)) {
     install.packages("DT", repos = "http://cloud.r-project.org/")
}

if (!requireNamespace("lubridate", quietly = TRUE)) {
     install.packages("lubridate", repos = "http://cloud.r-project.org/")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
     install.packages("ggplot2", repos = "http://cloud.r-project.org/")
}
library("magrittr")
library("Rdrools")
library("dplyr")
library("purrr")
library("tibble")

options(stringsAsFactors = F)

Introduction

Objectives of Rdrools

The Rdrools package aims to accomplish two main objectives:

Allow data scientists an intuitive interface to execute business rules on datasets for the purpose of analysis or designing intelligent systems, while leveraging the Drools rule engine
Provide a direct interface to Drools for executing all types of rules defined in the Drools .drl format

The advantages of a rule engine

Rule engines allow for optimal checking of rules against data for large rule sets [of the order of hundreds or even thousands of rules]. Drools [and other rule engines] implement an enhanced version of the Rete algorithm, which efficiently match facts [data tuples] against conditions [rules]. This allows for codifying intuition/ business context which can be used to power intelligent systems.

Why Rdrools

RDrools brings the efficiencies of large-scale production rule systems to data science users. Rule sets can be used alone, or in conjunction with machine learning models, to develop and operationalize intelligent systems. RDrools allows for deployment of rules defined through an R interface into a production system. As data comes in [periodic or real-time], a pre-defined set of rules can be checked on the data, and actions can be triggered based on the result

Running rules on Rdrools

Executing rules on a dataset

In order to achieve the objective of providing data scientists an intuitive interface to execute rules on datasets, the Rdrools package exposes the executeRulesOnDataset function, which is explicitly designed for data scientists. As input to this function rules are defined using the typical language of data science with verbs such as

filter
group by
aggregate

For ease of use, the rules can be defined in a csv format and imported into the R session through the usual read functions. The require format follows a familiar structure using the verbs discussed earlier. We take the example of the iris dataset and define rules on it. The sample rules for the iris dataset are defined in the irisRules data object [for the purpose of the example]

data("iris")
data("irisRules")
sampleRules <- irisRules
rownames(sampleRules) <- seq(1:nrow(sampleRules))
sampleRules[is.na(sampleRules)]    <-""
sampleRules

Through this function, various typical types of rules can be executed with a combination of the verbs described above.

Note - In order to plot graphs to show counts of number of facts passing/ failing rules, we have defined a function internal to the vignette to plot graphs called 'plotgraphs'

#' Vignette helper functions
#' @description: Function plot graphs in the vignette
#' -----------------------------------------------------------------------------
#' @param result result of rule check
#' @param plotName Plot to be generated
#' @param rules the rules defined in csv format
#' -----------------------------------------------------------------------------
#' @return a plotly plot
#' @keywords internal

plotgraphs <- function(result,plotName){

  if(plotName == "Plot of points distribution"){
    anomaliesCountPlot <-list()
    purrr::map (1:length(result),function(i){
      outputDataframe <- result[[i]][["output"]]
      noOfTrueFalse <-  outputDataframe %>% dplyr::group_by(IsTrue) %>%
        dplyr::summarise(Frequency = n())
      if(nrow(noOfTrueFalse)==2){

        noOfTrueFalse <- noOfTrueFalse %>% as.data.frame %>% `rownames<-`(c("Anomalies","Non-Anomalies"))  
        anomaliesCountPlot[[i]] <- ggplot2::ggplot(noOfTrueFalse, ggplot2::aes(x=IsTrue, y=Frequency)) +
          ggplot2::geom_bar(stat = "identity", fill="steelblue")+
          ggplot2::labs(title="Distribution of points \n for the rule", 
              y = "Count") +
          ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))

      }else{
        anomaliesCountPlot[[i]] <- NULL
      }

      return(anomaliesCountPlot)     
    })
  }else if(plotName == "Plot of groups"){
    plotAnomalies <-list()
    purrr::map (1:length(result),function(ruleNum){
      ruleName <- paste0("Rule",ruleNum)
      ruleValue <- paste0("Rule",ruleNum,"Value")
      intermediateOutput<- result[[ruleNum]][["intermediateOutput"]]

      if(class(intermediateOutput)=="list"){
        plotAnomalies[[ruleNum]] <- NULL

      }else {
        intermediateOutput<- dplyr::filter_(intermediateOutput,paste(ruleName,"==","'true'"))

        GroupedCols <- paste(colnames(intermediateOutput[,
                                                         !names(intermediateOutput) %in% c(ruleName,ruleValue)]),collapse = ":")
        intermediateOutput$Group <-  apply( intermediateOutput[ , !names(intermediateOutput) %in% c(ruleName,ruleValue) ] , 1 , paste , collapse = ":" )
        colnames(intermediateOutput)[ncol(intermediateOutput)-1] <- "values"

        plotAnomalies[[ruleNum]] <- ggplot2::ggplot(intermediateOutput, ggplot2::aes(x=Group, y=values))+
          ggplot2::geom_bar(stat = "identity",fill="steelblue")+
          ggplot2::labs(title="Groups satisfying the rule", 
               x=list(title = paste0("Grouped By - ",GroupedCols), tickangle = -45), y = "Aggregated Value") +
          ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))

        return(plotAnomalies)

      }
    })
  }
}

Applying a simple filter

The first type of rule is applying a simple filter based on the condition on a particular column. This is done by specifying the full condition under the filter column.

In the case of the iris dataset, we filter out a specific type of Species. To illustrate this case, we apply only rule 1.

filterRule <- sampleRules[1,]
filterRule
filterRuleOutput <- executeRulesOnDataset(iris, filterRule)
str(filterRuleOutput)

The output has three objects:

input: has the rule defined by the user in a data frame
intermediateOutput: is an empty list as there is no grouped aggregation
output: has the output data frame with 3 columns:
- Group: the above rule has no group by and hence the rule is applied row-wise. Group, in this case, represents the row number
- Indices: the row numbers of the data frame
- IsTrue: flag to say if the data point is satisfying the rule or not. In this case, Flag is true if the Species is setosa and false if not

Plotting graphs of the result obtained

The output obtained can be visualized by plotting the graphs of the distribution of true and false in the output. true here represents the points which satisfy the rule i.e Species = setosa and false represents the points which do not.

anomaliesCountGraph <- plotgraphs(result=filterRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]

Applying a condition on aggregated grouped data

The second type of rule is to apply a condition to the aggregated value of metrics for different groups. In the case of the iris dataset, we aggregate the Sepal.Length variable across different Species, and identify the Species which have an average Sepal.Length greater than a threshold value.

To illustrate this case, we apply only rule 2 from the set of sample rules.

groupedAggregationRule <- sampleRules[2,]
groupedAggregationRule
groupedAggregationRuleOutput <- executeRulesOnDataset(iris, groupedAggregationRule)
str(groupedAggregationRuleOutput)

The output has three objects:

input: has the rule defined by the user in a data frame
intermediateOutput: has group (group by column) and it's corresponding flag (true/false)
output: has the output data frame with 3 columns:
- Group: the above rule has a group by condition and hence the rule is applied to each group. Group, in this case, represents the values of the column on which the group by condition was applied i.e. Species
- Indices: the row numbers form the dataset present in each group
- IsTrue: flag to say if the data point is satisfying the rule or not. In this case, Flag is true if the aggregated value of the group is greater than or equal to the threshold value and false if it's not

Plotting graphs of the result obtained

anomalousSetGraph<-plotgraphs(result=groupedAggregationRuleOutput, plotName="Plot of groups")
anomalousSetGraph[[1]][[1]]

The above graph shows the groups i.e, the Species for which the average of Sepal.Length is greater than or equal to 5.9. The Y-axis shows the average Sepal.Length for each Species.

The plot below shows the number of groups which satisfied the rule. As we can see from above, 2 of the 3 groups satisfy the rule, and hence true has a count of 2.

anomaliesCountGraph<-plotgraphs(result=groupedAggregationRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]

Applying an aggregation on a column

This type of rule allows the data scientist to aggregate an entire column and compare that with a threshold value. In the case of the iris dataset, we aggregate the Sepal.Length variable across all cases, and check if it is less than a threshold value

To illustrate this case, we apply only rule 3 from the set of sample rules.

columnAggregationRule <- sampleRules[3,]
columnAggregationRule
columnAggregationRuleOutput <- executeRulesOnDataset(iris, columnAggregationRule)
str(columnAggregationRuleOutput)

The output has three objects:

input: has the rule defined by the user in a data frame
intermediateOutput: is an empty list as there is no grouped aggregation
output: has the *output* data frame with 3 columns:
- Group: the above rule has no group by and no filter. The rule is applied to the whole column. Group, in this case, represents the whole column
- Indices: the row numbers of the whole data frame
- IsTrue: flag to say if the data point is satisfying the rule or not. IN this case, Flag is true if the aggregated value is greater than the threshold value and false if not

Applying a filter with aggregation

In this case, we apply a filter, and then on the filtered data, aggregate a column and compare it to a threshold value. In the case of the iris dataset, we check if for cases with Sepal.Width > 3, if the average Sepal.Length is greater than 5

To illustrate this case, we apply only rule 4 from the set of sample rules.

filterColAggregationRule <- sampleRules[4,]
filterColAggregationRule
filterColAggregationRuleOutput <- executeRulesOnDataset(iris, filterColAggregationRule)
str(filterColAggregationRuleOutput)

The output has three objects:

input: has the rule defined by the user in a data frame
intermediateOutput: is an empty list as there is no grouped aggregation
output: has the output data frame with 3 columns:
- Group: the above rule has no group by and hence the rule is applied to the whole column after filtering the data. Group, in this case, represents the whole column
- Indices: the row numbers of the filtered data frame on applying the condition
- IsTrue: flag to say if the data point is satisfying the rule or not. In this case, Flag is true if the aggregated value of the filtered data is greater than or equal to 5 and false if it's not

Applying a filter with grouped aggregation

We now combine all types if verbs into one rule. In the iris dataset, we check if for all cases with Petal.Width greater than a threshold value, if each type of Species [which is a group] has an average Petal.Length greater than another threshold.

To illustrate this case, we apply only rule 5 from the set of sample rules.

filterGroupByAggrRule <- sampleRules[5,]
filterGroupByAggrRule
filterGroupByAggrRuleOutput <- executeRulesOnDataset(iris, filterGroupByAggrRule)
str(filterGroupByAggrRuleOutput)

The output has three objects:

input: has the rule defined by the user in a data frame
intermediateOutput: has the groups (group by column) and their corresponding flag (true/false)
output: has the output data frame with 3 columns:
- Group: the above rule has a group by and filter. Hence the rule is applied to each group after filtering the data. Group, in this case, represents the grouped by column i.e. Species
- Indices: the row numbers of the data frame present in each group after filtering
- IsTrue: flag to say if the data point is satisfying the rule or not. In this case, Flag is true if the aggregated value for the group after filtering is less than or equal to the threshold and false if not

anomalousSetGraph<-plotgraphs(result=filterGroupByAggrRuleOutput, plotName="Plot of groups")
anomalousSetGraph[[1]][[1]]

The above graph shows the groups i.e, the Species for which the average of Petal.Length is less than 5. The Y-axis shows the average Petal.Length for each Species.

Applying a condition to compare columns

Here we compare values of two columns. In the case of the iris dataset, we compare the Petal.Length with Sepal.Width, and identify the rows which have a Petal.Length greater than Sepal.Width.

To illustrate this case, we apply only rule 6 from the set of sample rules.

compareColumnsRule <- sampleRules[6,]
compareColumnsRule
compareColumnsRuleOutput <- executeRulesOnDataset(iris, compareColumnsRule)
str(compareColumnsRuleOutput)

The output has three objects:

input: has the rule defined by the user in a data frame
intermediateOutput: is an empty list as there is no aggregation
output: has the output data frame with 3 columns:
- Group: the above rule has no group by and filter. Hence the rule is applied to each row. Group, in this case, represents the row
- Indices: each row number
- IsTrue: flag to say if the data point is satisfying the rule or not. In this case, Flag is true if the Petal.Length greater than Sepal.Width and false if not

anomaliesCountGraph<-plotgraphs(result=compareColumnsRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]

Applying a filter and comparing columns

Here we compare values of two columns after filtering the dataset. In the case of the iris dataset, we compare the Petal.Length with Sepal.Width, and identify the rows which have a Petal.Length greater than Sepal.Width.

To illustrate this case, we apply only rule 7 from the set of sample rules.

compareFilterRule <- sampleRules[7,]
compareFilterRule
compareFilterRuleOutput <- executeRulesOnDataset(iris, compareFilterRule)
str(compareFilterRuleOutput)

The output has three objects:

input: has the rule defined by the user in a data frame
intermediateOutput: is an empty list as there is no aggregation
output: has the output data frame with 3 columns:
- Group: the above rule has no group by condition. Hence the rule is applied to each row. Group, in this case, represents the row
- Indices: each row number
- IsTrue: flag to say if the data point is satisfying the rule or not. In this case, Flag is true if the Petal.Length of Versicolor species is greater than Sepal.Width and false if not

anomaliesCountGraph<-plotgraphs(result=compareColumnsRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]

Use case

We now consider a more business-specific problem, where such a rule system might be deployed.

Problem statement

Consider the customers of a retail bank, who make transactions against their bank account for different purposes such as shopping, money transfers, etc. In the banking system, there is a huge potential for fraud. Typically, abnormal transaction behavior is a strong indicator of fraud.

We explore how such transactions can be monitored intelligently to detect fraud using Rdrools by applying business rules.

Details of the dataset

The following dataset provides transaction data for multiple customers of the retail bank (identified by their Account IDs) is used. Every transaction that a user (account) does is recorded with the following details:

Transaction ID: unique code for each transaction
Transaction Amount: the amount debited or credited in each transaction
trans_tender_type: tells if the transaction is a deposit or loan repayment or if it is an overseas transaction
Credit card details: the monthly credit card expenditure
Transaction_Channel: the mode of transaction like ATM or card swipe etc
Balance: the amount in the account after each transaction
Total transactions: number of transactions done by the customer in each month and the cumulative number of transactions

data("transactionData")
transactionData$Date <- lubridate::ymd(transactionData$Date)
transactionData <- transactionData[1:500,]

Displaying a sample (top 10 rows) of the uploaded dataset

DT::datatable(
  head(transactionData, 20), extensions = 'FixedColumns',
  options = list(
  dom = 't',
  scrollX = TRUE,
  scrollCollapse = TRUE
))

str(transactionData)

Defining the rules file

There might be certain cases where we simply want to check the behavior of customers based on a constant benchmark value. These might be cases such as compliance and policy violations, etc.

In our case we check rules like:

If the number of transactions done by the customer in a month exceeds more than 20 then the customer can be fraudulent
If the transaction amount exceeds the more than a threshold then it might be a fraudulent transaction

data("transactionRules")
rownames(transactionRules) <- seq(1:nrow(transactionRules))
transactionRules[is.na(transactionRules)]    <-""
transactionRules

One example of the rules to mark anomalous transactions from the above list is

$$\textsf{For an account, the total Transaction_Amount } \ \textsf{should be greater than or equal to USD 40,000}$$

Executing rules on the dataset

We now take the entire set of rules and execute it on the transaction data as follows:

transactionDataOutput  <- executeRulesOnDataset(transactionData, transactionRules)

Viewing results

length(transactionDataOutput)
str(transactionDataOutput[[5]]) #Rule 5 output

Let us take the results obtained for Rule5 to understand the applications of Rdrools. Rule 5 was

$$\textsf{For a fraudulent/ anomalous account, the maximum of Transaction_Amount } \ \textsf{should be greater than or equal to USD 40,000 for all the debit transactions done after 2017-05-01}$$

The output has three objects:

input: has the rule defined by the user in a data frame
intermediateOutput: has the groups (group by column) and their corresponding flag (true/false)
output: has the output data frame with 3 columns:
- Group: In this rule, we want to aggregate transaction behaviour at an account level. Hence group here represents the Account_ID.
- Indices: the row numbers of the data frame under each Account_ID i.e. the row numbers of the transactions against each account
- IsTrue: flag to say if the Account_ID is fraudulent or not.

Plotting graphs of the result obtained

The distribution of points i.e, the Account_ID that are true or false is shown in the graph below. In this case, the true values can be called as Anomalous Account_IDs and the points that are false are Non-Anomalous Account_IDs.

anomaliesCountGraph<-plotgraphs(result=transactionDataOutput, plotName="Plot of points distribution")

anomaliesCountGraph[[5]][[5]]

The above graph shows that there are 4 anomalous Account_IDs which satisfy the rule given and 7 Account_IDs that are non-anomalous.

anomalousSetGraph<-plotgraphs(result=transactionDataOutput, plotName="Plot of groups")
anomalousSetGraph[[5]][[5]]

The above graph gives more information about the anomalous Account_IDs. The graph shows the sum of Transaction_Amount for each anomalous Account_ID

References

Drools Documentation

Rdrools Documentation

Any scripts or data that you put into this service are public.

Rdrools documentation built on May 2, 2019, 8:23 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Rdrools
A Rules Engine Based on 'Drools'

Executing business rules at scale using RDrools - an interface to Drools
In Rdrools: A Rules Engine Based on 'Drools'

Introduction

Objectives of Rdrools

The advantages of a rule engine

Why Rdrools

Running rules on Rdrools

Executing rules on a dataset

Applying a simple filter

Applying a condition on aggregated grouped data

Applying an aggregation on a column

Applying a filter with aggregation

Applying a filter with grouped aggregation

Applying a condition to compare columns

Applying a filter and comparing columns

Use case

Problem statement

Details of the dataset

Displaying a sample (top 10 rows) of the uploaded dataset

Defining the rules file

Executing rules on the dataset

Viewing results

Plotting graphs of the result obtained

References

Try the Rdrools package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

Rdrools A Rules Engine Based on 'Drools'

Executing business rules at scale using RDrools - an interface to Drools In Rdrools: A Rules Engine Based on 'Drools'

Introduction

Objectives of Rdrools

The advantages of a rule engine

Why Rdrools

Running rules on Rdrools

Executing rules on a dataset

Applying a simple filter

Applying a condition on aggregated grouped data

Applying an aggregation on a column

Applying a filter with aggregation

Applying a filter with grouped aggregation

Applying a condition to compare columns

Applying a filter and comparing columns

Use case

Problem statement

Details of the dataset

Displaying a sample (top 10 rows) of the uploaded dataset

Defining the rules file

Executing rules on the dataset

Viewing results

Plotting graphs of the result obtained

References

Try the Rdrools package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

Rdrools
A Rules Engine Based on 'Drools'

Executing business rules at scale using RDrools - an interface to Drools
In Rdrools: A Rules Engine Based on 'Drools'