Executing business rules at scale using Rdrools - an interface to Drools

#Package installation if required for handbook

if (!requireNamespace("DT", quietly = TRUE)) {
     install.packages("DT", repos = "http://cloud.r-project.org/")
}

if (!requireNamespace("lubridate", quietly = TRUE)) {
     install.packages("lubridate", repos = "http://cloud.r-project.org/")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
     install.packages("ggplot2", repos = "http://cloud.r-project.org/")
}
library("magrittr")
library("Rdrools")
library("dplyr")
library("purrr")
library("tibble")
options(stringsAsFactors = FALSE)

Introduction

Objectives of Rdrools

The Rdrools package aims to accomplish two main objectives:

The advantages of a rule engine

Rule engines allow rules to be checked efficiently against data, even for large rule sets [of the order of hundreds or even thousands of rules]. Drools [and other rule engines] implement an enhanced version of the Rete algorithm, which efficiently matches facts [data tuples] against conditions [rules]. This allows intuition and business context to be codified and used to power intelligent systems.

Why Rdrools

Rdrools brings the efficiencies of large-scale production rule systems to data science users. Rule sets can be used alone, or in conjunction with machine learning models, to develop and operationalize intelligent systems. Rdrools allows rules defined through an R interface to be deployed into a production system. As data comes in [periodically or in real time], a pre-defined set of rules can be checked against the data, and actions can be triggered based on the result.

Running rules on Rdrools

Executing rules on a dataset

To give data scientists an intuitive interface for executing rules on datasets, the Rdrools package exposes the executeRulesOnDataset function. As input to this function, rules are defined using the typical language of data science, with verbs such as

For ease of use, the rules can be defined in a csv format and imported into the R session through the usual read functions. The required format follows a familiar structure using the verbs discussed earlier. We take the example of the iris dataset and define rules on it. The sample rules for the iris dataset are defined in the irisRules data object [for the purpose of the example].

data("iris")
data("irisRules")
sampleRules <- irisRules
rownames(sampleRules) <- seq_len(nrow(sampleRules))
sampleRules[is.na(sampleRules)] <- ""
sampleRules
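
As a rough illustration of the csv route, the sketch below reads a small rules table from inline text using base R. The column names (Filters, GroupBy, Column, Function, Operation, Argument) and both rows are assumptions made for this example, not a specification of the package format; compare them with names(irisRules) before relying on them.

```r
# Hypothetical rule file contents. The column names and the two rules are
# assumptions for illustration only; check names(irisRules) for the real layout.
csv_text <- "Filters,GroupBy,Column,Function,Operation,Argument
Species == 'setosa',,,,,
,Species,Sepal.Length,mean,>=,5.9"

# Read every column as character so that empty cells stay as empty strings
rules <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  colClasses = "character")

# Same clean-up steps as applied to the sample rules above
rownames(rules) <- seq_len(nrow(rules))
rules[is.na(rules)] <- ""
print(rules)
```

A data frame prepared this way would be passed as the second argument of executeRulesOnDataset, provided its columns match the layout of irisRules.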

Through this function, various typical types of rules can be executed with a combination of the verbs described above.

Note - To plot the number of facts passing or failing each rule, we define a helper function, internal to the vignette, called 'plotgraphs'.

#' Vignette helper functions
#' @description Function to plot graphs in the vignette
#' -----------------------------------------------------------------------------
#' @param result result of the rule check, as returned by executeRulesOnDataset
#' @param plotName name of the plot to be generated, either "Plot of points distribution" or "Plot of groups"
#' -----------------------------------------------------------------------------
#' @return a list with one element per rule; element i is itself a list whose i-th entry is the ggplot2 plot for rule i, when one can be drawn
#' @keywords internal

plotgraphs <- function(result,plotName){

  if(plotName == "Plot of points distribution"){
    anomaliesCountPlot <-list()
    purrr::map(seq_along(result), function(i){
      outputDataframe <- result[[i]][["output"]]
      noOfTrueFalse <-  outputDataframe %>% dplyr::group_by(IsTrue) %>%
        dplyr::summarise(Frequency = dplyr::n())
      if(nrow(noOfTrueFalse)==2){

        noOfTrueFalse <- noOfTrueFalse %>% as.data.frame %>% `rownames<-`(c("Anomalies","Non-Anomalies"))  
        anomaliesCountPlot[[i]] <- ggplot2::ggplot(noOfTrueFalse, ggplot2::aes(x=IsTrue, y=Frequency)) +
          ggplot2::geom_bar(stat = "identity", fill="steelblue")+
          ggplot2::labs(title="Distribution of points \n for the rule", 
              y = "Count") +
          ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))

      }else{
        anomaliesCountPlot[[i]] <- NULL
      }

      return(anomaliesCountPlot)     
    })
  }else if(plotName == "Plot of groups"){
    plotAnomalies <-list()
    purrr::map(seq_along(result), function(ruleNum){
      ruleName <- paste0("Rule",ruleNum)
      ruleValue <- paste0("Rule",ruleNum,"Value")
      intermediateOutput<- result[[ruleNum]][["intermediateOutput"]]

      if(!is.data.frame(intermediateOutput)){
        plotAnomalies[[ruleNum]] <- NULL

      }else {
        # Keep only the groups that satisfy the rule
        intermediateOutput <- intermediateOutput[which(intermediateOutput[[ruleName]] == "true"), , drop = FALSE]

        GroupedCols <- paste(colnames(intermediateOutput[,
                                                         !names(intermediateOutput) %in% c(ruleName,ruleValue), drop = FALSE]),collapse = ":")
        intermediateOutput$Group <-  apply( intermediateOutput[ , !names(intermediateOutput) %in% c(ruleName,ruleValue), drop = FALSE ] , 1 , paste , collapse = ":" )
        colnames(intermediateOutput)[ncol(intermediateOutput)-1] <- "values"

        plotAnomalies[[ruleNum]] <- ggplot2::ggplot(intermediateOutput, ggplot2::aes(x=Group, y=values))+
          ggplot2::geom_bar(stat = "identity",fill="steelblue")+
          ggplot2::labs(title="Groups satisfying the rule", 
               x = paste0("Grouped By - ",GroupedCols), y = "Aggregated Value") +
          ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))

        return(plotAnomalies)

      }
    })
  }
}

Applying a simple filter

The first type of rule is applying a simple filter based on the condition on a particular column. This is done by specifying the full condition under the filter column.

In the case of the iris dataset, we filter out a specific type of Species. To illustrate this case, we apply only rule 1.

filterRule <- sampleRules[1,]
filterRule
filterRuleOutput <- executeRulesOnDataset(iris, filterRule)
str(filterRuleOutput)

The output has three objects:

Plotting graphs of the result obtained

The output can be visualized by plotting the distribution of true and false values. Here, true represents the points which satisfy the rule, i.e., Species = setosa, and false represents the points which do not.

anomaliesCountGraph <- plotgraphs(result=filterRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]
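
As a cross-check on the filter rule above, the same condition can be evaluated directly in base R. This is only a sketch of what the rule computes, assuming the condition is Species equal to setosa as stated above; it does not call Rdrools, and the variable names are made up for the example.

```r
data("iris")

# TRUE where the row satisfies the condition Species == "setosa"
satisfies_rule <- iris$Species == "setosa"

# Label each row "true" or "false" and count the rows in each class
rule_counts <- table(ifelse(satisfies_rule, "true", "false"))
print(rule_counts)
```

The two counts correspond to the two bars one would expect in the distribution plot.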

Applying a condition on aggregated grouped data

The second type of rule is to apply a condition to the aggregated value of metrics for different groups. In the case of the iris dataset, we aggregate the Sepal.Length variable across different Species, and identify the Species which have an average Sepal.Length greater than a threshold value.
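
The grouped check described above can be sketched in base R as follows. The mean as the aggregate and 5.9 as the threshold follow the description of this rule later in the section; the variable names are illustrative, and the snippet does not call Rdrools.

```r
data("iris")

# Average Sepal.Length for each Species
group_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# Flag the groups whose average meets the threshold
threshold <- 5.9
group_means$satisfies_rule <- group_means$Sepal.Length >= threshold
print(group_means)
```

Each row of group_means is one group, so the rule is evaluated once per Species rather than once per flower.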

To illustrate this case, we apply only rule 2 from the set of sample rules.

groupedAggregationRule <- sampleRules[2,]
groupedAggregationRule
groupedAggregationRuleOutput <- executeRulesOnDataset(iris, groupedAggregationRule)
str(groupedAggregationRuleOutput)

The output has three objects:

Plotting graphs of the result obtained

anomalousSetGraph<-plotgraphs(result=groupedAggregationRuleOutput, plotName="Plot of groups")
anomalousSetGraph[[1]][[1]]

The above graph shows the groups, i.e., the Species, for which the average Sepal.Length is greater than or equal to 5.9. The Y-axis shows the average Sepal.Length for each Species.

The plot below shows the number of groups which satisfied the rule. As we can see from above, 2 of the 3 groups satisfy the rule, and hence true has a count of 2.

anomaliesCountGraph<-plotgraphs(result=groupedAggregationRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]

Applying an aggregation on a column

This type of rule allows the data scientist to aggregate an entire column and compare the result with a threshold value. In the case of the iris dataset, we aggregate the Sepal.Length variable across all cases and check whether it is less than a threshold value.
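
A minimal base-R sketch of this kind of check is shown below. Both the aggregate (the mean) and the threshold (6) are assumptions chosen for illustration, since the actual values live in the rule definition; the snippet does not call Rdrools.

```r
data("iris")

# Hypothetical threshold, chosen only for this example
threshold <- 6

# Aggregate the whole column into a single value (mean assumed here)
column_mean <- mean(iris$Sepal.Length)

# The rule yields one TRUE/FALSE answer for the entire dataset
rule_satisfied <- column_mean < threshold
print(column_mean)
print(rule_satisfied)
```

Unlike a row-level filter, this rule produces a single result, because the whole column collapses to one aggregated value.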

To illustrate this case, we apply only rule 3 from the set of sample rules.

columnAggregationRule <- sampleRules[3,]
columnAggregationRule
columnAggregationRuleOutput <- executeRulesOnDataset(iris, columnAggregationRule)
str(columnAggregationRuleOutput)

The output has three objects:

Applying a filter with aggregation

In this case, we apply a filter and then, on the filtered data, aggregate a column and compare it to a threshold value. In the case of the iris dataset, we check whether, for cases with Sepal.Width > 3, the average Sepal.Length is greater than 5.
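
In base-R terms, the check described above amounts to the following sketch. The filter and threshold are the ones quoted in the text; the variable names are illustrative, and the snippet does not call Rdrools.

```r
data("iris")

# Step 1: keep only the rows that pass the filter
filtered <- iris[iris$Sepal.Width > 3, ]

# Step 2: aggregate the column of interest on the filtered rows
filtered_mean <- mean(filtered$Sepal.Length)

# Step 3: compare the aggregate with the threshold
rule_satisfied <- filtered_mean > 5
print(filtered_mean)
print(rule_satisfied)
```

The order of the steps matters: aggregating before filtering would compare the mean of all 150 rows instead.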

To illustrate this case, we apply only rule 4 from the set of sample rules.

filterColAggregationRule <- sampleRules[4,]
filterColAggregationRule
filterColAggregationRuleOutput <- executeRulesOnDataset(iris, filterColAggregationRule)
str(filterColAggregationRuleOutput)

The output has three objects:

Applying a filter with grouped aggregation

We now combine all types of verbs into one rule. In the iris dataset, we check whether, for all cases with Petal.Width greater than a threshold value, each type of Species [which is a group] has an average Petal.Length greater than another threshold.
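
The combination of filter, grouping and aggregation can be sketched in base R as below. The two thresholds (Petal.Width above 1 and an average Petal.Length of at least 5) are hypothetical values picked for the example, and the snippet does not call Rdrools.

```r
data("iris")

# Hypothetical thresholds, for illustration only
width_threshold  <- 1
length_threshold <- 5

# Filter first, then aggregate within each remaining group
filtered    <- iris[iris$Petal.Width > width_threshold, ]
group_means <- aggregate(Petal.Length ~ Species, data = filtered, FUN = mean)

# Evaluate the condition once per group
group_means$satisfies_rule <- group_means$Petal.Length >= length_threshold
print(group_means)
```

With these thresholds no setosa row survives the filter, so that Species does not appear among the groups at all.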

To illustrate this case, we apply only rule 5 from the set of sample rules.

filterGroupByAggrRule <- sampleRules[5,]
filterGroupByAggrRule
filterGroupByAggrRuleOutput <- executeRulesOnDataset(iris, filterGroupByAggrRule)
str(filterGroupByAggrRuleOutput)

The output has three objects:

anomalousSetGraph<-plotgraphs(result=filterGroupByAggrRuleOutput, plotName="Plot of groups")
anomalousSetGraph[[1]][[1]]

The above graph shows the groups, i.e., the Species, for which the average of Petal.Length is less than 5. The Y-axis shows the average Petal.Length for each Species.

Applying a condition to compare columns

Here we compare values of two columns. In the case of the iris dataset, we compare the Petal.Length with Sepal.Width, and identify the rows which have a Petal.Length greater than Sepal.Width.
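
The row-wise comparison described above can be sketched in base R as follows; the variable names are illustrative and the snippet does not call Rdrools.

```r
data("iris")

# Compare the two columns row by row
satisfies_rule <- iris$Petal.Length > iris$Sepal.Width

# Cross-tabulate the result against Species to see where the rule holds
comparison_table <- table(Species = iris$Species, SatisfiesRule = satisfies_rule)
print(comparison_table)
```

Because the comparison is made per row, the result has one true or false value for each of the 150 observations.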

To illustrate this case, we apply only rule 6 from the set of sample rules.

compareColumnsRule <- sampleRules[6,]
compareColumnsRule
compareColumnsRuleOutput <- executeRulesOnDataset(iris, compareColumnsRule)
str(compareColumnsRuleOutput)

The output has three objects:

anomaliesCountGraph<-plotgraphs(result=compareColumnsRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]

Applying a filter and comparing columns

Here we compare values of two columns after filtering the dataset. In the case of the iris dataset, we first apply the filter and then, among the remaining rows, compare Petal.Length with Sepal.Width and identify the rows which have a Petal.Length greater than Sepal.Width.

To illustrate this case, we apply only rule 7 from the set of sample rules.

compareFilterRule <- sampleRules[7,]
compareFilterRule
compareFilterRuleOutput <- executeRulesOnDataset(iris, compareFilterRule)
str(compareFilterRuleOutput)

The output has three objects:

anomaliesCountGraph<-plotgraphs(result=compareFilterRuleOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[1]][[1]]

Use case

We now consider a more business-specific problem, where such a rule system might be deployed.

Problem statement

Consider the customers of a retail bank, who make transactions against their bank account for different purposes such as shopping, money transfers, etc. In the banking system, there is a huge potential for fraud. Typically, abnormal transaction behavior is a strong indicator of fraud.

We explore how such transactions can be monitored intelligently to detect fraud using Rdrools by applying business rules.

Details of the dataset

The dataset used here contains transaction data for multiple customers of the retail bank (identified by their Account IDs). Every transaction that a user (account) makes is recorded with the following details:

data("transactionData")
transactionData$Date <- lubridate::ymd(transactionData$Date)
transactionData <- transactionData[1:500,]

Displaying a sample (first 20 rows) of the loaded dataset

DT::datatable(
  head(transactionData, 20), extensions = 'FixedColumns',
  options = list(
  dom = 't',
  scrollX = TRUE,
  scrollCollapse = TRUE
))
str(transactionData)

Defining the rules file

There might be certain cases where we simply want to check the behavior of customers based on a constant benchmark value. These might be cases such as compliance and policy violations, etc.

In our case we check rules like:

data("transactionRules")
rownames(transactionRules) <- seq_len(nrow(transactionRules))
transactionRules[is.na(transactionRules)] <- ""
transactionRules

One example from the above list of a rule that marks anomalous transactions is:

$$\textsf{For an account, the total Transaction_Amount } \ \textsf{should be greater than or equal to USD 40,000}$$
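
To make the example rule concrete, the sketch below evaluates it in base R on a small synthetic table. The rows are invented for illustration and are not taken from transactionData; only the column names Account_ID and Transaction_Amount and the USD 40,000 threshold come from the text. The snippet does not call Rdrools.

```r
# Synthetic transactions, made up for this example
toy_transactions <- data.frame(
  Account_ID         = c("A1", "A1", "A2", "A2", "A3"),
  Transaction_Amount = c(25000, 20000, 5000, 7000, 40000),
  stringsAsFactors   = FALSE
)

# Total Transaction_Amount per account
account_totals <- aggregate(Transaction_Amount ~ Account_ID,
                            data = toy_transactions, FUN = sum)

# Mark accounts whose total is at least USD 40,000
account_totals$IsTrue <- ifelse(account_totals$Transaction_Amount >= 40000,
                                "true", "false")
print(account_totals)
```

Here the rule is evaluated per account rather than per transaction, so an account made up of several mid-sized transactions can still be flagged.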

Executing rules on the dataset

We now take the entire set of rules and execute it on the transaction data as follows:

transactionDataOutput  <- executeRulesOnDataset(transactionData, transactionRules)

Viewing results

length(transactionDataOutput)
str(transactionDataOutput[[5]]) #Rule 5 output
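
Since the result is a list with one element per rule, a per-rule summary can be built with lapply. The sketch below is an assumption-laden illustration: it presumes each element holds an `output` data frame with an `IsTrue` column containing "true" or "false", as the plotting helper above relies on. The helper name count_rule_results and the mock data are hypothetical, so that the snippet runs without Rdrools.

```r
# Hypothetical helper: count "true" and "false" results for every rule.
# Assumes each element of `result` has an `output` data frame with `IsTrue`.
count_rule_results <- function(result) {
  lapply(result, function(ruleResult) {
    table(factor(ruleResult[["output"]][["IsTrue"]],
                 levels = c("false", "true")))
  })
}

# Mock result for two rules, standing in for executeRulesOnDataset output
mock_result <- list(
  list(output = data.frame(IsTrue = c("true", "false", "false"),
                           stringsAsFactors = FALSE)),
  list(output = data.frame(IsTrue = c("true", "true"),
                           stringsAsFactors = FALSE))
)

counts <- count_rule_results(mock_result)
print(counts[[1]])
print(counts[[2]])
```

Fixing the factor levels keeps both classes in every table, even when a rule has no false (or no true) results.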

Let us take the results obtained for Rule 5 to understand the applications of Rdrools. Rule 5 is:

$$\textsf{For a fraudulent/ anomalous account, the maximum of Transaction_Amount } \ \textsf{should be greater than or equal to USD 40,000 for all the debit transactions done after 2017-05-01}$$

The output has three objects:

Plotting graphs of the result obtained

The distribution of points, i.e., the Account_IDs for which the rule is true or false, is shown in the graph below. In this case, the true values can be called anomalous Account_IDs, and the points that are false are non-anomalous Account_IDs.

anomaliesCountGraph<-plotgraphs(result=transactionDataOutput, plotName="Plot of points distribution")
anomaliesCountGraph[[5]][[5]]

The above graph shows that there are 4 anomalous Account_IDs which satisfy the rule given and 7 Account_IDs that are non-anomalous.

anomalousSetGraph<-plotgraphs(result=transactionDataOutput, plotName="Plot of groups")
anomalousSetGraph[[5]][[5]]

The above graph gives more information about the anomalous Account_IDs. The graph shows the aggregated Transaction_Amount (the maximum, in the case of Rule 5) for each anomalous Account_ID.

Rdrools documentation built on May 2, 2019, 8:23 a.m.