DRAFT UNDER DEVELOPMENT

Use Case

This tutorial demonstrates the effectiveness of parallelizing a Hot Spots analysis across Azure DSVMs. The Hot Spots method was proposed by Graham Williams for discovering interesting knowledge in large data sets. The general steps of a Hot Spots analysis are as follows:

  1. Cluster the data set into several complete and disjoint clusters. This is often achieved with a clustering algorithm such as k-means.
  2. Build rule-based inductions that describe each cluster in a way that discriminates it from the others.
  3. Within each of the clusters specified by the induced rules, create models that evaluate the data attributes to find patterns.

The greatest benefit of using the Hot Spots method for data mining is that it describes the discovered knowledge as a set of rules, which makes the mining results particularly easy for a data miner to understand. This is helpful in various scenarios such as insurance premium setting, fraud detection in health care, etc.

In this demonstration, Hot Spots analysis is used for supervised binary classification. The workflow is as follows:

  1. Given a labelled data set, split the data into training and testing sets.
  2. Cluster the training set into different segments using the k-means algorithm.
    1. The number of clusters is swept from 2 up to a sufficiently large value (20 in this demo), and the optimal number is selected with the elbow method.
    2. Since k-means clustering is sensitive to its initial conditions, each clustering is run 10 times to average out the instability.
  3. Assume the number of clusters is determined to be N. Build N descriptive models that differentiate each cluster from the rest. Within each cluster, build a predictive model for binary classification of the data label.
  4. In the testing phase, use the descriptive models to determine which cluster each observation belongs to, then apply the corresponding predictive model to produce the predicted label (see the sketch below).
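
The following is a minimal sketch of this cluster-then-classify workflow on a small synthetic data frame. It is for illustration only and is not the code used by the worker scripts introduced later; the data, column names, and chosen number of clusters are all placeholders.

library(rpart)

# Synthetic illustration data: two numeric features and a binary label.
set.seed(42)
df <- data.frame(x1=rnorm(300), x2=rnorm(300), label=rbinom(300, 1, 0.3))
features <- df[, c("x1", "x2")]

# Step 2.1: sweep the number of clusters and record the total within-cluster
# sum of squares; the elbow is then chosen by inspecting the curve.
# Step 2.2: nstart=10 reruns each clustering 10 times to smooth out the
# sensitivity to initial centroids.
wss <- sapply(2:20, function(k)
  kmeans(features, centers=k, nstart=10)$tot.withinss)
plot(2:20, wss, type="b", xlab="k", ylab="Total within-cluster SS")

n_clusters <- 4   # illustrative value read off the elbow plot
km <- kmeans(features, centers=n_clusters, nstart=10)

# Step 3: for each cluster, a descriptive model (a tree discriminating the
# cluster from the rest) and a predictive model for the label within it.
models <- lapply(seq_len(n_clusters), function(i) {
  in_cluster <- km$cluster == i
  list(descriptive=rpart(factor(in_cluster) ~ ., data=features, method="class"),
       predictive=glm(label ~ ., data=df[in_cluster, ], family=binomial))
})

# Step 4: the descriptive models assign a new observation to a cluster, and
# that cluster's predictive model produces the predicted label.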

Setup

Similar to the previous sections, credentials for authentication are required to fire up the DSVMs.

library(AzureDSVM)
library(AzureSMR)
library(magrittr)
# Load the required subscription resources: TID, CID, and KEY.
# Also includes the ssh PUBKEY for the user.

USER <- Sys.info()[['user']]

source(file.path(paste0(USER, "_credentials.R")))

Specifications of computing resources.

runif(4, 1, 26) %>%
  round() %>%
  letters[.] %>%
  paste(collapse="") %T>%
  {sprintf("Base name:\t\t%s", .) %>% cat("\n")} ->
BASE

BASE %>%
  paste0("my_dsvm_", .,"_rg_sea") %T>%
  {sprintf("Resource group:\t\t%s", .) %>% cat("\n")} ->
RG

SIZE <- "Standard_D12_v2" 
sprintf("DSVM size:\t\t%s", SIZE) %>% cat("\n")

# Choose a data centre location.

"southeastasia"  %T>%
  {sprintf("Data centre location:\t%s", .) %>% cat("\n")} ->
LOC

# Include the random BASE in the hostname to reduce the likelihood of
# conflict.

BASE %>%
  paste0("my", .) %T>%
  {sprintf("Hostname:\t\t%s", .) %>% cat("\n")} ->
HOST

cat("\n")

Deploy a cluster of DSVMs if none exists; otherwise start the machines.

# Connect to the Azure subscription and use this as the context for
# all of our activities.

context <- createAzureContext(tenantID=TID, clientID=CID, authKey=KEY)

# Check if the resource group already exists. Take note this script
# will not remove the resource group if it pre-existed.

rg_pre_exists <- existsRG(context, RG, LOC) %T>% print()

# Create Resource Group

if (! rg_pre_exists)
{
  # Create a new resource group into which we create the VMs and
  # related resources. Resource group name is RG. 

  # Note that to create a new resource group one needs to add access
  # control of Active Directory application at subscription level.

  azureCreateResourceGroup(context, RG, LOC)

}

Create one Linux DSVM for running the Hot Spots analytics.

vm <- AzureSMR::azureListVM(context, RG)

if (!is.null(vm))
{

  AzureDSVM::operateDSVM(context, RG, vm$name, operation="Check")

  # start machines if they exist in the resource group.

  AzureDSVM::operateDSVM(context, RG, vm$name, operation="Start")

} else
{
  ldsvm <- deployDSVM(context, 
                      resource.group=RG,
                      location=LOC,
                      hostname=HOST,
                      username=USER,
                      size=SIZE,
                      pubkey=PUBKEY)
  ldsvm

  operateDSVM(context, RG, HOST, operation="Check")

}

This will now take around 4 minutes. Whilst that is deploying, let's discuss the Hot Spots methodology a little. Fang's slide set gives a great overview of the approach.

Setting up a DSVM by hand is not a trivial task, as it is a complete solution for the data scientist and includes the whole stack of open source data science technology (R, Python, Hadoop, Spark, MRS) as well as tools such as SQL Server and RTVS.

azureListVM(context, RG)

Remote execution of analysis

The following demo shows how to perform a Hot Spots analysis in R and parallelize the analysis on the Azure cloud.

The data used in this demonstration is again the credit card data, which is available on the Kaggle website or directly from Togaware in XDF format.

The top-level worker script for the Hot Spots analysis is available as workerHotspots.R. Besides the main worker script, there are several other scripts containing the functions used for the analysis.
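
For orientation, a top-level worker script of this shape typically sources the helper scripts, sets the compute context, runs the analysis, and saves its results to an .RData file that is later copied back to the desktop. The skeleton below is a rough, hypothetical sketch rather than the actual contents of workerHotspots.R; run_hotspots_analysis is an illustrative placeholder.

# Hypothetical skeleton of a top-level worker script.

library(RevoScaleR)

# Helper functions shipped alongside the worker script.
source("workerHotspotsFuncs.R")
source("workerHotspotsSetup.R")
source("workerHotspotsTrain.R")
source("workerHotspotsTest.R")
source("workerHotspotsProcess.R")

# Use all available cores on the DSVM.
rxSetComputeContext(RxLocalParallel())

# Run the end-to-end analysis (placeholder function name) and save the
# evaluation results, which are later retrieved with fileTransfer().
eval <- run_hotspots_analysis()
save(eval, file="results.RData")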

Specify the master node that will run the analytic script. In this case, it is the DSVM created just now.

# specify machine names, master, and slaves.

vm <- AzureSMR::azureListVM(context, RG) %T>% print()

machines <- unlist(vm$name)
dns_list <- paste0(machines, ".", LOC, ".cloudapp.azure.com")
master <- dns_list[1] %T>% print()
slaves <- dns_list[-1]

The whole end-to-end Hot Spots analysis is run on the remote machine in parallel. To accelerate the analysis, the parameter sweeping inside model training and testing is executed with the help of the rxExec function from Microsoft R Server. The local parallel backend makes use of the available cores of the machine to run those functions in parallel.
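
As a rough illustration of that mechanism (not the code inside the worker scripts), a k-means parameter sweep with rxExec under a local parallel compute context could look like this; features is assumed to be a data frame of numeric predictors.

library(RevoScaleR)

# Use the cores of the local machine as the parallel backend.
rxSetComputeContext(RxLocalParallel())

# Sweep the number of clusters from 2 to 20 in parallel; each value supplied
# through rxElemArg() goes to a separate invocation of kmeans().
sweep <- rxExec(kmeans,
                x=features,
                centers=rxElemArg(2:20),
                nstart=10)

# Total within-cluster sum of squares for each candidate k (for the elbow plot).
wss <- sapply(sweep, function(fit) fit$tot.withinss)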

The functions used for the analysis are defined in separate scripts, which are uploaded onto the remote DSVM with AzureDSVM::fileTransfer.

worker_scripts <- c("workerHotspotsFuncs.R", 
                    "workerHotspotsSetup.R", 
                    "workerHotspotsTrain.R", 
                    "workerHotspotsTest.R",
                    "workerHotspotsProcess.R")

sapply(worker_scripts, 
       fileTransfer,
       from="test",
       to=paste0(master, ":~"), 
       user=USER)

Remote execution of the worker script (the analysis takes approximately 10-15 minutes on a D12 v2 DSVM, which has 4 cores and 28 GB of RAM).

Whilst this is running, we continue talking about the Hot Spots algorithm.

# Parallelize the analytics with a local parallel computing context.

time_1 <- Sys.time()

AzureDSVM::executeScript(context=context, 
                         resource.group=RG, 
                         hostname=machines, 
                         remote=master, 
                         username=USER, 
                         script="test/workerHotspots.R", 
                         compute.context="localParallel")

time_2 <- Sys.time()

Get the results from the remote DSVM.

# get results from remote

AzureDSVM::fileTransfer(from=paste0(master, ":~"), 
                        to=".", 
                        user=USER, 
                        file="results.RData")

load("./results.RData") 
results_local <- 
  eval %T>%
  print()

Save time points for later reference.

elapsed <- list(instances=machines,
                time_start=time_1, 
                time_end=time_2)

save(elapsed, file="./elapsed.RData")

The cost of running the above analytics can be obtained with the AzureDSVM cost calculation functions (costDSVM is used below). Note that there is usually a delay before data consumption is recorded in the system, and this delay depends on the location of the data centre.

# calculate expense on computations.

load("./elapsed.RData")

cost <- 0

# Note: the currency, locale, offerId, and region arguments below are
# placeholders to be replaced with values for your own subscription.

if (length(vm$name) == 1) {
  cost <- costDSVM(context=context,
                   hostname=as.character(vm$name[1]), 
                   time.start=time_1,
                   time.end=time_2,
                   granularity="Hourly",
                   currency="currency",
                   locale="your_locale",
                   offerId="your_offer_id",
                   region="your_location")
} else {
  for (name in as.character(vm$name)) {
    cost <- cost + costDSVM(context=context,
                            hostname=name, 
                            time.start=time_1,
                            time.end=time_2,
                            granularity="Hourly",
                            currency="currency",
                            locale="your_locale",
                            offerId="your_offer_id",
                            region="your_location")
  }
}

So that has cost me something less than $1! The data is sent fully encrypted to and from the server on Azure. The server itself is newly created (hence no one will have previously discovered it and implanted spyware or viruses) and is only accessible via the secure shell, so all communications are fully encrypted from your desktop to the server. We then remove the server and ensure no remnants of our anonymous data remain on the cloud!

Clean-up

Once the analysis is finished, switch off the DSVMs.

# stop machines after the analysis.

for (vm in machines) {
  AzureDSVM::operateDSVM(context, RG, vm, operation="Stop")
}

Or delete the resource group to avoid unnecessary cost.

if (! rg_pre_exists)
  azureDeleteResourceGroup(context, RG)

