Data scientists often need to experiment with data sets to find the model that gives the best results. This usually involves many rounds of repetitive tasks at each stage of the model search, i.e., feature engineering, algorithm selection, and parameter tuning. Without loss of generality, each analytical job here is formalized to follow a typical data science workflow made up of these three steps.
In a real-world scenario it may be necessary to iterate on all three steps, and the problem is sometimes more complicated still because the workflow is not unidirectional. For simplicity, the demonstration shows only an experimental analytical job on model selection, while the variables related to the other two steps are kept fixed.
Feature engineering - Feature engineering is often considered an art rather than a science because it depends not only on machine learning techniques but also on domain knowledge, and domain knowledge frequently plays the more vital role. For illustration purposes, the data sets used in the experiments are pre-processed, so tasks such as aggregation and normalization are not needed. The data sets are assumed to be ready for training a binary classification model.
Algorithm selection - Choosing the right algorithm is critical to the success of a machine learning problem. Many models are available off the shelf, but none suits every kind of problem. Some handle linear problems well but fail on data sets with non-linear relationships between variables.
Parameter tuning - After algorithm selection, the next step is to fine-tune the model hyperparameters so as to finalize the model for the problem. This is the same as the previous step except that the algorithm is fixed and the parameters are swept within an acceptable range. The output of this step is the optimal model.
To automate the whole process of model creation, it helps to have multiple servers available to run the analytics in parallel. The demo in this tutorial shows how to perform this kind of model creation on a remote DSVM cluster. For simplicity, the problem to be solved is binary classification, and feature engineering is not considered because it depends on domain knowledge, which is not the focus here.
As in the previous tutorials, we start by deploying DSVMs for the case studies. For comparison, two scenarios, a single DSVM and a cluster of DSVMs, are deployed under two resource groups.
library(AzureDSVM)
library(AzureSMR)
library(dplyr)
library(stringr)
library(stringi)
library(magrittr)
library(readr)
library(rattle)

# Load the required subscription resources: TID, CID, and KEY.
# Also includes the ssh PUBKEY for the user.

USER <- Sys.info()[['user']]

source(file.path("..", paste0(USER, "_credentials.R")))
BASE <-
  runif(4, 1, 26) %>%
  round() %>%
  letters[.] %>%
  paste(collapse="") %T>%
  {sprintf("Base name:\t\t%s", .) %>% cat("\n")}

RG <-
  paste0("my_dsvm_", BASE,"_rg_sea") %T>%
  {sprintf("Resource group:\t\t%s", .) %>% cat("\n")}

# Choose a data centre location.

LOC <-
  "southeastasia" %T>%
  {sprintf("Data centre location:\t%s", .) %>% cat("\n")}

# Include the random BASE in the hostname to reduce the likelihood of
# conflict.

HOST <-
  paste0("my", BASE) %T>%
  {sprintf("Hostname:\t\t%s", .) %>% cat("\n")}

cat("\n")
Check whether the resource group already exists.
context <- createAzureContext(tenantID=TID, clientID=CID, authKey=KEY)

rg_pre_exists <- existsRG(context, RG, LOC)
Create the resource group within which all resources we create will be grouped.
if (! rg_pre_exists) azureCreateResourceGroup(context, RG, LOC)

existsRG(context, RG, LOC)
Create the actual Ubuntu DSVM with public-key based authentication. The name, username, and size can also be configured.
# Create the required Ubuntu DSVM - generally 4 minutes.

deployDSVM(context,
           resource.group=RG,
           location=LOC,
           hostname=HOST,
           username=USER,
           size="Standard_D12_v2",
           os="Ubuntu",
           authen="Key",
           pubkey=PUBKEY)

operateDSVM(context, RG, HOST, operation="Check")

azureListVM(context, RG)
# Create a set of Ubuntu DSVMs that together form a cluster.
# COUNT (the number of DSVMs in the cluster) must be defined beforehand.

deployDSVMCluster(context,
                  resource.group=RG,
                  location=LOC,
                  count=COUNT,
                  hostname=HOST,
                  username=USER,
                  authen="Key",
                  pubkey=rep(PUBKEY, COUNT))
To start with, the candidate model types and their parameters are pre-configured as follows. In this case, three different algorithms available in the Microsoft RevoScaleR package are used.
# make a model config to temporarily preserve model parameters. Parameters are kept fixed.

model_config <- list(name=c("rxLogit", "rxBTrees", "rxDForest"),
                     para=list(list(list(maxIterations=10,
                                         coeffTolerance=1e-6),
                                    list(maxIterations=15,
                                         coeffTolerance=2e-6),
                                    list(maxIterations=20,
                                         coeffTolerance=3e-6)),
                               list(list(nTree=10,
                                         learningRate=0.05),
                                    list(nTree=15,
                                         learningRate=0.1),
                                    list(nTree=20,
                                         learningRate=0.15)),
                               list(list(cp=0.01,
                                         nTree=10,
                                         mTry=3),
                                    list(cp=0.01,
                                         nTree=15,
                                         mTry=3),
                                    list(cp=0.01,
                                         nTree=20,
                                         mTry=3))))
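As a minimal sketch of how a configuration of this shape might be consumed, the nested list can be flattened into one entry per (algorithm, parameter set) pair. The `candidates` list and its field names `modelName` and `modelPara` are assumptions chosen for illustration, and the configuration is shortened to two algorithms so that the snippet runs in base R without RevoScaleR.

```r
# Candidate configuration in the same shape as model_config: one list of
# parameter sets per algorithm name (shortened here for illustration).
model_config <- list(
  name=c("rxLogit", "rxBTrees"),
  para=list(list(list(maxIterations=10, coeffTolerance=1e-6),
                 list(maxIterations=15, coeffTolerance=2e-6)),
            list(list(nTree=10, learningRate=0.05),
                 list(nTree=15, learningRate=0.1))))

# Flatten into one entry per (algorithm, parameter set) pair.
# 'candidates', 'modelName' and 'modelPara' are hypothetical names.
candidates <- list()
for (i in seq_along(model_config$name)) {
  for (para in model_config$para[[i]]) {
    candidates[[length(candidates) + 1]] <-
      list(modelName=model_config$name[i], modelPara=para)
  }
}

# Print a one-line summary of each candidate.
for (cand in candidates) {
  cat(cand$modelName, ":",
      paste(names(cand$modelPara), unlist(cand$modelPara),
            sep="=", collapse=", "), "\n")
}
```

Each entry then carries exactly the two pieces a training function needs: the name of the algorithm and the list of parameters to pass to it.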
The data used in this demonstration records a number of credit card transactions, some of which are fraudulent. The original data is available on the Kaggle website or directly from [togaware](https://access.togaware.com/creditcard.xdf) in XDF format. The data contains both normal and fraudulent transactions, which are indicated by the label "Class", and the problem is to detect a potentially fraudulent transaction based on patterns "learnt" by the trained model.
Code for solving this machine learning problem can be obtained from [workerClassification.R](...test/workerClassification.R). The function mlProcess takes data, a formula, and model specs as inputs. For scalability and efficiency, data in XDF format is used, which allows parallel, out-of-memory computation. Area under the curve (AUC) is used as the performance metric to evaluate model quality. The function returns a model object (based on the training results) and the evaluation result of the model.
The following snippet shows the machine learning process.
# functions used for model building and evaluating.

mlProcess <- function(formula, data, modelName, modelPara) {

  xdf <- RxXdfData(file=data)

  # split data into training set (70%) and testing set (30%).

  data_part <- c(train=0.7, test=0.3)

  data_split <-
    rxSplit(xdf,
            outFilesBase=tempfile(),
            splitByFactor="splitVar",
            transforms=list(splitVar=
                              sample(data_factor,
                                     size=.rxNumRows,
                                     replace=TRUE,
                                     prob=data_part)),
            transformObjects=
              list(data_part=data_part,
                   data_factor=factor(names(data_part),
                                      levels=names(data_part))))

  data_train <- data_split[[1]]
  data_test  <- data_split[[2]]

  # train model.

  if(missing(modelPara) ||
     is.null(modelPara) ||
     length(modelPara) == 0) {
    model <- do.call(modelName, list(data=data_train, formula=formula))
  } else {
    model <- do.call(modelName, c(list(data=data_train,
                                       formula=formula),
                                  modelPara))
  }

  # validate model

  scores <- rxPredict(model,
                      data_test,
                      extraVarsToWrite=names(data_test),
                      predVarNames="Pred",
                      outData=tempfile(fileext=".xdf"),
                      overwrite=TRUE)

  label <- as.character(formula[[2]])

  roc <- rxRoc(actualVarName=label,
               predVarNames=c("Pred"),
               data=scores)

  auc <- rxAuc(roc)

  # clean up.

  file.remove(c(data_train@file, data_test@file))

  return(list(model=model, metric=auc))
}
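For readers without RevoScaleR at hand, the same split-train-evaluate pattern can be sketched in base R. This is not the tutorial's implementation: the data is synthetic, `glm` stands in for `rxLogit`, and AUC is computed with the rank-sum (Mann-Whitney) identity in a hypothetical helper `aucRank` rather than with `rxRoc` and `rxAuc`.

```r
set.seed(42)

# Synthetic binary classification data (an assumption for illustration).
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
Class <- rbinom(n, 1, plogis(1.5 * x1 - x2))
df <- data.frame(Class, x1, x2)

# Assign each row at random to train (70%) or test (30%), mirroring the
# sampling-based split used in mlProcess.
data_part <- c(train=0.7, test=0.3)
split_var <- sample(names(data_part), size=nrow(df),
                    replace=TRUE, prob=data_part)
data_train <- df[split_var == "train", ]
data_test  <- df[split_var == "test", ]

# Train a logistic regression and score the held-out rows.
model <- glm(Class ~ x1 + x2, data=data_train, family=binomial())
pred <- predict(model, newdata=data_test, type="response")

# AUC via the rank-sum identity: the probability that a randomly chosen
# positive case is scored higher than a randomly chosen negative case.
# 'aucRank' is a hypothetical helper, not part of any package.
aucRank <- function(label, score) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(rank(score)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- aucRank(data_test$Class, pred)
result <- list(model=model, metric=auc)
```

Because the split is drawn by sampling, the 70/30 proportions hold only approximately, which is also true of the sampling transform passed to `rxSplit`.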
The worker script can be executed on a remote Ubuntu DSVM or DSVM cluster with the AzureDSVM function executeScript, as in the previous tutorials.
The worker script for binary classification is located in the "vignettes/test" directory, with the name "workerClassification.R".
VM_URL <- paste(HOST, LOC, "cloudapp.azure.com", sep=".")
# remote execution on a single DSVM.

VMS <- azureListVM(context, RG, LOC)
HOST <- VMS$name
FQDN <- paste(HOST, LOC, "cloudapp.azure.com", sep=".")

time_1 <- Sys.time()

executeScript(context,
              resource.group=RG,
              hostname=HOST[1],
              remote=FQDN[1],
              username=USER,
              script="test/workerClassification.R",
              compute.context="localParallel")

# remote execution on a cluster of DSVMs.

time_2 <- Sys.time()

executeScript(context,
              resource.group=RG,
              hostname=HOST,
              remote=FQDN[1],
              username=USER,
              script="test/workerClassification.R",
              master=FQDN[1],
              slaves=FQDN[-1],
              compute.context="clusterParallel")

time_3 <- Sys.time()
Save the time variables into a data file for later reference.
save(time_1, time_2, time_3, file="./elapsed.RData")
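Once the timestamps are saved, the run times of the two scenarios can be compared. The following is a minimal sketch with stand-in timestamps (the values are assumptions, not measured results) and a temporary file in place of "./elapsed.RData".

```r
# Stand-in timestamps (assumed values; in the tutorial they come from
# Sys.time() before and after each executeScript call).
time_1 <- as.POSIXct("2017-01-01 10:00:00", tz="UTC")
time_2 <- as.POSIXct("2017-01-01 10:12:30", tz="UTC")
time_3 <- as.POSIXct("2017-01-01 10:17:30", tz="UTC")

# Round-trip the timestamps through a data file.
elapsed_file <- tempfile(fileext=".RData")
save(time_1, time_2, time_3, file=elapsed_file)
rm(time_1, time_2, time_3)
load(elapsed_file)

# Elapsed minutes for the single DSVM run and for the cluster run.
elapsed_single  <- as.numeric(difftime(time_2, time_1, units="mins"))
elapsed_cluster <- as.numeric(difftime(time_3, time_2, units="mins"))

# Ratio of single-machine time to cluster time.
speedup <- elapsed_single / elapsed_cluster

cat(sprintf("Single: %.1f min, cluster: %.1f min, speedup: %.2f\n",
            elapsed_single, elapsed_cluster, speedup))
```

Fixing `units` in `difftime` avoids the automatic unit selection, which could otherwise report one interval in minutes and the other in seconds.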
After the analytic job has finished, the expense of running it on Azure resources can be obtained.
# calculate expense on computations.

load("./elapsed.RData")

cost <- 0

# VMS is the data frame of deployed machines returned by azureListVM().

if (length(VMS$name) == 1) {
  cost <- costDSVM(context=context,
                   hostname=as.character(VMS$name[1]),
                   time.start=time_1,
                   time.end=time_2,
                   granularity="Hourly",
                   currency="currency",
                   locale="your_locale",
                   offerId="your_offer_id",
                   region="your_location")
} else {
  for (name in as.character(VMS$name)) {
    cost <- cost + costDSVM(context=context,
                            hostname=name,
                            time.start=time_1,
                            time.end=time_2,
                            granularity="Hourly",
                            currency="currency",
                            locale="your_locale",
                            offerId="your_offer_id",
                            region="your_location")
  }
}
Stop or delete computing resources if they are no longer needed to avoid unnecessary cost.
if (! rg_pre_exists) azureDeleteResourceGroup(context, RG)