Use Case

Data scientists often need to experiment with a data set to find the model that gives the best results. This usually involves many rounds of repetitive tasks at each stage of the model search: feature engineering, algorithm selection, and parameter tuning. Without loss of generality, each analytics job is formalized to follow a typical data science workflow:

  1. Feature Engineering. As no prior domain knowledge is assumed, the data used for analytics are assumed to have been aggregated and cleansed properly. General feature engineering techniques are applied to the data.
  2. Algorithm Selection. Select an appropriate algorithm to create a model for the data science task.
  3. Parameter Tuning. Tune the machine learning algorithm so that the trained model meets the optimization goal.

In a real-world scenario, it may be necessary to iterate over all three steps (and the problem can be even more complicated because the workflow is not unidirectional). For simplicity, this demonstration shows only an experimental analytics job on model selection, while the variables related to the other two steps are kept fixed.

To automate the whole process of model creation, it helps to have multiple servers available to run the analytics in parallel. The demo in this tutorial shows how to perform this kind of model creation on a remote DSVM cluster. For simplicity, the problem solved here is binary classification, and feature engineering is not considered because it depends on domain knowledge, which is not the focus here.

Set up

As in the previous tutorials, we first deploy DSVMs for the case studies. For comparison, two scenarios, a single DSVM and a cluster of DSVMs, are deployed under two resource groups.

library(AzureDSVM)
library(AzureSMR)
library(dplyr)
library(stringr)
library(stringi)
library(magrittr)
library(readr)
library(rattle)
# Load the required subscription resources: TID, CID, and KEY.
# Also includes the ssh PUBKEY for the user.

USER <- Sys.info()[['user']]

source(file.path("..", paste0(USER, "_credentials.R")))
BASE <- 
  runif(4, 1, 26) %>%
  round() %>%
  letters[.] %>%
  paste(collapse="") %T>%
  {sprintf("Base name:\t\t%s", .) %>% cat("\n")}

RG <-
  paste0("my_dsvm_", BASE,"_rg_sea") %T>%
  {sprintf("Resource group:\t\t%s", .) %>% cat("\n")}

# Choose a data centre location.

LOC <-
  "southeastasia"  %T>%
  {sprintf("Data centre location:\t%s", .) %>% cat("\n")}

# Include the random BASE in the hostname to reduce the likelihood of
# conflict.

HOST <-
  paste0("my", BASE) %T>%
  {sprintf("Hostname:\t\t%s", .) %>% cat("\n")}

cat("\n")
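The random base name above is built by rounding uniform draws, which picks "a" and "z" about half as often as the other letters. A minimal alternative sketch, assuming only base R, draws letters directly with sample(). The helper name random_base is hypothetical and not part of AzureDSVM.

```r
# Hypothetical helper: draw n lowercase letters uniformly at random
# and join them into a single string.
random_base <- function(n = 4) {
  paste(sample(letters, n, replace = TRUE), collapse = "")
}

set.seed(42)  # only to make this example reproducible
base <- random_base()
cat(sprintf("Base name:\t\t%s\n", base))
```

Either approach works for reducing name conflicts; sample() simply makes every letter equally likely.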

Check whether the resource group already exists so that one can be created if it does not.

context <- createAzureContext(tenantID=TID, clientID=CID, authKey=KEY)

rg_pre_exists <- existsRG(context, RG, LOC)

Create a Resource Group

Create the resource group within which all resources we create will be grouped.

if (! rg_pre_exists) azureCreateResourceGroup(context, RG, LOC)

existsRG(context, RG, LOC)

Deploy an Ubuntu Data Science Virtual Machine

Create the Ubuntu DSVM with public-key authentication. The name, username, and size can also be configured.

# Create the required Ubuntu DSVM - generally 4 minutes.

deployDSVM(context, 
           resource.group=RG,
           location=LOC,
           hostname=HOST,
           username=USER,
           size="Standard_D12_v2",
           os="Ubuntu",
           authen="Key",
           pubkey=PUBKEY)

operateDSVM(context, RG, HOST, operation="Check")

azureListVM(context, RG)

Deploy a cluster of Ubuntu Data Science Virtual Machines.

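# Create a set of Ubuntu DSVMs that together form a cluster.

# COUNT (the number of DSVMs in the cluster) is not defined elsewhere in
# this tutorial; the value 4 is an assumption, so adjust it as needed.

COUNT <- 4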

deployDSVMCluster(context, 
                  resource.group=RG, 
                  location=LOC, 
                  count=COUNT, 
                  hostname=HOST, 
                  username=USER, 
                  authen="Key",
                  pubkey=rep(PUBKEY, COUNT)) 

Analytics

To start with, the candidate model types and their parameters are pre-configured as follows. In this case, three different algorithms from the Microsoft RevoScaleR package are used.

# make a model config to temporarily preserve model parameters. Parameters are kept fixed.

model_config <- list(name=c("rxLogit", "rxBTrees", "rxDForest"), 
                     para=list(list(list(maxIterations=10,
                                         coeffTolerance=1e-6),
                                    list(maxIterations=15,
                                         coeffTolerance=2e-6),
                                    list(maxIterations=20,
                                         coeffTolerance=3e-6)),
                               list(list(nTree=10, 
                                         learningRate=0.05),
                                    list(nTree=15,
                                         learningRate=0.1),
                                    list(nTree=20,
                                         learningRate=0.15)),
                               list(list(cp=0.01,
                                         nTree=10,
                                         mTry=3),
                                    list(cp=0.01,
                                         nTree=15,
                                         mTry=3),
                                    list(cp=0.01,
                                         nTree=20,
                                         mTry=3))))
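The nested configuration pairs each algorithm name with a list of parameter sets. As a rough sketch of how such a structure could be walked, the following base R code flattens a reduced copy of it into one entry per (algorithm, parameter set) candidate. The names config and candidates are illustrative assumptions, and the reduced copy is used only so the example runs on its own.

```r
# A reduced copy of the configuration, so the sketch is self-contained.
config <- list(
  name = c("rxLogit", "rxBTrees"),
  para = list(
    list(list(maxIterations = 10, coeffTolerance = 1e-6),
         list(maxIterations = 15, coeffTolerance = 2e-6)),
    list(list(nTree = 10, learningRate = 0.05),
         list(nTree = 15, learningRate = 0.1))
  )
)

# Flatten the nested structure: one entry per (algorithm, parameter set).
candidates <- list()
for (i in seq_along(config$name)) {
  for (para in config$para[[i]]) {
    candidates[[length(candidates) + 1]] <-
      list(name = config$name[i], para = para)
  }
}

length(candidates)        # number of candidate models
candidates[[3]]$name      # algorithm of the third candidate
str(candidates[[3]]$para) # its parameter set
```

Each element of candidates carries a function name and a parameter list, which is the shape a training function needs when it dispatches with do.call.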

Sample analysis

The data used in this demonstration records a number of credit card transactions, some of which are fraudulent. The original data is available on the Kaggle website or directly from [togaware](https://access.togaware.com/creditcard.xdf) in XDF format. The data consists of both normal and fraudulent transactions, indicated by the label "Class", and the problem is to detect a potentially fraudulent transaction based on patterns learnt by the trained model.

The code for solving this machine learning problem can be obtained from [workerClassification.R](...test/workerClassification.R). The function mlProcess takes a formula, data, and the model specification (name and parameters) as inputs. For scalability and performance, data in XDF format is used, which allows parallel, out-of-memory computation. Area under the curve (AUC) is used as the performance metric to evaluate model quality. The function returns a model object (based on the training results) and the evaluation result of the model.
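To make the AUC metric concrete, here is a minimal base R sketch that computes it with the rank-sum (Mann-Whitney) identity. This is not how rxRoc and rxAuc compute it; it is a stand-in that runs without RevoScaleR. The function name auc_rank and the toy labels and scores are assumptions for illustration.

```r
# AUC via the rank-sum identity: the probability that a randomly chosen
# positive case receives a higher score than a randomly chosen negative one.
auc_rank <- function(label, score) {
  pos   <- label == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r     <- rank(score)  # average ranks, so ties count as one half
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data: two negatives and two positives.
label <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)

auc_rank(label, score)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, which is why it suits imbalanced problems such as fraud detection better than plain accuracy.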

The following snippet shows the machine learning process.

# functions used for model building and evaluating.

mlProcess <- function(formula, data, modelName, modelPara) {

  xdf <- RxXdfData(file=data)

  # split data into training set (70%) and testing set (30%).

  data_part <- c(train=0.7, test=0.3)

  data_split <-
    rxSplit(xdf, 
            outFilesBase=tempfile(),
            splitByFactor="splitVar",
            transforms=list(splitVar=
                              sample(data_factor,
                                     size=.rxNumRows,
                                     replace=TRUE,
                                     prob=data_part)),
            transformObjects=
              list(data_part=data_part,
                   data_factor=factor(names(data_part), levels=names(data_part)))) 

  data_train <- data_split[[1]]
  data_test  <- data_split[[2]]

  # train model.

  if(missing(modelPara) ||
     is.null(modelPara) || 
     length(modelPara) == 0) {
    model <- do.call(modelName, list(data=data_train, formula=formula))
  } else {
    model <- do.call(modelName, c(list(data=data_train,
                                       formula=formula),
                                  modelPara))
  }

  # validate model

  scores <- rxPredict(model, 
                      data_test,
                      extraVarsToWrite=names(data_test),
                      predVarNames="Pred",
                      outData=tempfile(fileext=".xdf"), 
                      overwrite=TRUE)

  label <- as.character(formula[[2]])

  roc <- rxRoc(actualVarName=label, 
               predVarNames=c("Pred"), 
               data=scores) 

  auc <- rxAuc(roc)

  # clean up.

  file.remove(c(data_train@file, data_test@file))

  return(list(model=model, metric=auc))
}
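The split inside mlProcess assigns every row to "train" or "test" at random with probabilities 0.7 and 0.3, so the resulting sizes are only approximately 70/30. The same idea can be sketched on an in-memory data frame with base R alone. The data frame df and its columns are made up for illustration.

```r
set.seed(123)  # only to make this example reproducible

data_part <- c(train = 0.7, test = 0.3)

# Illustrative in-memory data standing in for the XDF file.
df <- data.frame(id = 1:1000, x = rnorm(1000))

# Assign each row to a partition with the given probabilities,
# mirroring the splitVar transform used with rxSplit.
split_var <- sample(factor(names(data_part), levels = names(data_part)),
                    size = nrow(df),
                    replace = TRUE,
                    prob = data_part)

parts <- split(df, split_var)

sapply(parts, nrow)  # sizes are close to, but not exactly, 700 and 300
```

Because the assignment is random per row, repeated runs give slightly different partition sizes unless a seed is fixed.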

The worker script can be executed on a remote Ubuntu DSVM or DSVM cluster with the AzureDSVM function executeScript, as in the previous tutorials.

The worker script for binary classification is located in the "vignettes/test" directory and is named "workerClassification.R".

VM_URL <- paste(HOST, LOC, "cloudapp.azure.com", sep=".")
# remote execution on a single DSVM.

VMS <- azureListVM(context, RG, LOC)

HOST <- VMS$name
FQDN <- paste(HOST, LOC, "cloudapp.azure.com", sep=".")

time1 <- Sys.time()

executeScript(context,
              resource.group=RG,
              hostname=HOST[1],
              remote=FQDN[1],
              username=USER,
              script="test/workerClassification.R",
              compute.context="localParallel")

# remote execution on a cluster of DSVMs.

time2 <- Sys.time()

executeScript(context,
              resource.group=RG,
              hostname=HOST,
              remote=FQDN[1],
              username=USER,
              script="test/workerClassification.R",
              master=FQDN[1],
              slaves=FQDN[-1],
              compute.context="clusterParallel")

time3 <- Sys.time()

Save the time variables into a data file for later reference.

save(time1, time2, time3, file="./elapsed.RData")
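The three timestamps bracket the two runs, so the elapsed time of each run is the difference between consecutive timestamps. A small sketch with difftime follows. The timestamps t1, t2, and t3 below are made-up placeholders, not measurements from this tutorial.

```r
# Placeholder timestamps standing in for the recorded Sys.time() values.
t1 <- as.POSIXct("2019-05-20 10:00:00", tz = "UTC")
t2 <- as.POSIXct("2019-05-20 10:12:30", tz = "UTC")
t3 <- as.POSIXct("2019-05-20 10:17:00", tz = "UTC")

# Elapsed minutes for the single-DSVM run and for the cluster run.
elapsed_single  <- as.numeric(difftime(t2, t1, units = "mins"))
elapsed_cluster <- as.numeric(difftime(t3, t2, units = "mins"))

# Ratio of the two run times; values above 1 mean the cluster was faster.
speedup <- elapsed_single / elapsed_cluster

cat(sprintf("Single: %.1f min, cluster: %.1f min, speed-up: %.2f\n",
            elapsed_single, elapsed_cluster, speedup))
```

Fixing the units explicitly avoids the default behaviour of difftime, which picks seconds, minutes, or hours depending on the size of the difference.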

Calculating expense

After the analytic job has finished, the expense of running it on Azure resources can be obtained.

# calculate expense on computations. 

load("./elapsed.RData")

# VMS is the list of virtual machines returned by azureListVM() above.
# time1 to time2 spans the single-DSVM run; use time2 and time3 instead
# to cost the cluster run.

cost <- 0

if (length(VMS$name) == 1) {
  cost <- costDSVM(context=context,
                   hostname=as.character(VMS$name[1]),
                   time.start=time1,
                   time.end=time2,
                   granularity="Hourly",
                   currency="currency",
                   locale="your_locale",
                   offerId="your_offer_id",
                   region="your_location")
} else {
  # Sum the cost over every DSVM in the cluster.
  for (name in as.character(VMS$name)) {
    cost <- cost + costDSVM(context=context,
                            hostname=name,
                            time.start=time1,
                            time.end=time2,
                            granularity="Hourly",
                            currency="currency",
                            locale="your_locale",
                            offerId="your_offer_id",
                            region="your_location")
  }
}

Clean-up

Stop or delete computing resources if they are no longer needed to avoid unnecessary cost.

if (! rg_pre_exists)
  azureDeleteResourceGroup(context, RG)


Azure/AzureDSVM documentation built on May 20, 2019, 2:03 p.m.