Use Case

Sometimes more than one DSVM is needed.

This tutorial deploys a collection/cluster of Linux Data Science Virtual Machines (DSVMs) for the above two scenarios. In the latter scenarios the user distributes a compute task over those servers, collects the results and generates a report. Code is included but not run to then delete the resource group if the resources are no longer required. Once deleted consumption will cease.

This script is best run interactively to review its operation and to ensure that the interaction with Azure completes.

Setup

Refer to Deploy for an explanation of the set up of the virtual machines.

# Load the required subscription resources: TID, CID, and KEY.
# Also includes the ssh PUBKEY for the user.

USER <- Sys.info()[['user']]

source(paste0(USER, "_credentials.R"))

# Load the required packages.

library(AzureSMR)    # Support for managing Azure resources.
library(AzureDSVM)   # Further support for the Data Scientist.
library(magrittr)    # Pipeline computation.
library(dplyr)       # Data wrangling.

# Parameters for this script: the name for the new resource group and
# its location across the Azure cloud. The resource name is used to
# name the resource group that we will create transiently for the
# purposes of this script.

# Create a random resource group to reduce likelihood of conflict with
# other users.

BASE <- 
  runif(4, 1, 26) %>%
  round() %>%
  letters[.] %>%
  paste(collapse="") %T>%
  {sprintf("Base name:\t\t%s", .) %>% cat("\n")}

RG <-
  paste0("my_dsvm_", BASE,"_rg_sea") %T>%
  {sprintf("Resource group:\t\t%s", .) %>% cat("\n")}

# Choose a data centre location.

LOC <-
  "southeastasia"  %T>%
  {sprintf("Data centre location:\t%s", .) %>% cat("\n")}

cat("\n")

# Number of VMs to deploy.

COUNT <- 4

# Connect to the Azure subscription and use this as the context for
# all of our activities.

context <- createAzureContext(tenantID=TID, clientID=CID, authKey=KEY)

# Check if the resource group already exists. Take note this script
# will not remove the resource group if it pre-existed.

rg_pre_exists <- existsRG(context, RG, LOC)

if (! rg_pre_exists)
{
  # Create a new resource group into which we create the VMs and
  # related resources. Resource group name is RG. 

  # Note that to create a new resource group one needs to add access
  # control of Active Directory application at subscription level.

  azureCreateResourceGroup(context, RG, LOC) %>% cat("\n\n")

}

# Check that it now exists.

cat("Resource group", RG, "at", LOC,
    ifelse(!existsRG(context, RG, LOC), "does not exist.\n", "exists.\n"), "\n")

Create a Cluster

Multi-deployment of DSVMs can be achieved by calling deployDSVMCluster(). This function is designed to implicitly switch between cluster and collection of DSVMs, according to the given inputs. If the hostname (i.e., names of the DSVMs) consists of only one character string then a cluster of homogeneous DSVMs (i.e., same size) will be deployed. Using the unique machine name as the base a sequential numbering is used to form a full hostname. On the other hand if the hostname is a vector of character strings the function will create a collection of machines with names so specified.

It is worth mentioning that a cluster of DSVMs is useful when a batch-based analytical job needs to be performed in a desired computing context, especially in a distributed manner across nodes of the cluster. The distributed computing functionality is supported by the Microsoft RevoScaleR parallel computing backend. The distributed and parallel computing backend is socket-based and relies on SSH for secure communication. To allow this communication across nodes, deployDSVMCluster adds inbound security rules into the security group of each DSVM in the cluster, and establishes public key pairs for the machines.

We can now deploy a cluster of homogeneous DSVMs. Each DSVM will be named based on the hostname provided and sequentially numbered.

# Deploy a cluster of DSVMs.

deployDSVMCluster(context, 
                  resource.group=RG, 
                  location=LOC, 
                  hostname=BASE,
                  username=USER, 
                  authen="Key",
                  pubkey=PUBKEY,
                  count=COUNT)

cluster <- azureListVM(context, RG, LOC)

# To validate the existence of deployed DSVMs.

for (i in 1:COUNT)
{
  vm <- cluster[i, "name"]
  fqdn <- paste(cluster[i, "name"],
                cluster[i, "location"],
                "cloudapp.azure.com",
                sep=".")

  cat(vm, "\n")

  operateDSVM(context, RG, vm, operation="Check")

  # Send a simple system() command across to the new server to test
  # its existence. Expect a single line with an indication of how long
  # the server has been up and running.

  cmd <- paste("ssh -q",
               "-o StrictHostKeyChecking=no",
               "-o UserKnownHostsFile=/dev/null\\\n   ",
               fqdn,
               "uptime") %T>%
    {cat(., "\n")}
  cmd
  system(cmd)
  cat("\n")
}

Create a Collection

We can also create a collection of Linux DSVMs each with a different user and with public-key based authentication method. Name, username, and size can also be configured.

DSVM_NAMES <- paste0(BASE, c(1, 2, 3)) %T>% print()
DSVM_USERS <- paste0("user", c("a", "b", "c")) %T>% print()

# Deploy multiple DSVMs using deployDSVMCluster.

deployDSVMCluster(context, 
                  resource.group=RG, 
                  location=LOC, 
                  hostname=DSVM_NAMES, 
                  username=DSVM_USERS, 
                  authen="Key",
                  pubkey=rep(PUBKEY, COUNT)) 

# Check the deployed machines?

for (vm in DSVM_NAMES)
{
  cat(vm, "\n")

  operateDSVM(context, RG, vm, operation="Check")

  # Send a simple system() command across to the new server to test
  # its existence. Expect a single line with an indication of how long
  # the server has been up and running.

  cmd <- paste("ssh -q",
               "-o StrictHostKeyChecking=no",
               "-o UserKnownHostsFile=/dev/null\\\n   ",
               paste0(vm, ".", LOC, ".cloudapp.azure.com"),
               "uptime") %T>%
    {cat(., "\n")}
  cmd
  system(cmd)
  cat("\n")
}

Delete the Resource Group

# Delete the resource group now that we have proved existence. There
# is probably no need to wait. Only delete if it did not pre-exist
# this script. Deletion seems to take 10 minutes or more.

if (! rg_pre_exists)
  azureDeleteResourceGroup(context, RG)


Azure/AzureDSVM documentation built on May 20, 2019, 2:03 p.m.