In Azure/AzureDSVM: Create and Manage Azure Data Science Virtual Machine(s) for Convenient Execution of Scalable and Elastic Analytics.

Use Case

Sometimes more than one DSVM is needed.

Multi-deployment of heterogeneous DSVMs may be required for a collaborative project where each group member works on a machine with a specific configuration targeting their requirements. For instance, a powerful (expensive) machine is assigned to computationally intensive tasks while a smaller (inexpensive) machine is used for exploratory or interactive tasks.

Another common use case is for a data scientist to develop R programs that analyse a dataset on their local compute platform (e.g., a laptop with 6GB RAM running Ubuntu with R installed). Development is performed with a subset of the full dataset (a random sample) that will not exceed the available memory and will return results quickly. When the experimental setup is complete, the script can be sent across to a considerably more capable compute engine on Azure, possibly a cluster of servers that builds models in parallel. Such a deploy/compute/destroy cycle can be a significantly more cost-effective alternative to an on-premises purchase of hardware.
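
As a minimal sketch of the local development step, the following draws a reproducible random sample from a larger data frame. The data frame full, the 1% fraction, and the seed are stand-ins chosen for illustration and do not come from the tutorial.

```r
# Stand-in for the full dataset; in practice this would be read from disk.
set.seed(42)
full <- data.frame(id = seq_len(100000), x = rnorm(100000))

# Keep a 1% random sample of rows for quick local experiments.
frac <- 0.01
idx  <- sample(nrow(full), size = round(frac * nrow(full)))
dev  <- full[idx, , drop = FALSE]

cat(nrow(dev), "of", nrow(full), "rows sampled for development\n")
```

The same script can later be pointed at the full dataset on the larger machine by skipping the sampling step.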

This tutorial deploys a collection or cluster of Linux Data Science Virtual Machines (DSVMs) for the two scenarios above. In the latter scenario the user distributes a compute task over those servers, collects the results, and generates a report. Code to delete the resource group once the resources are no longer required is included but not run. Once the resource group is deleted, consumption ceases.

This script is best run interactively to review its operation and to ensure that the interaction with Azure completes.

Setup

Refer to Deploy for an explanation of the setup of the virtual machines.
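
The script below sources a per-user credentials file that defines TID, CID, KEY, and PUBKEY. A sketch of what such a file might contain follows. Every value is a placeholder, and the public key path is an assumption rather than something the tutorial specifies.

```r
# Hypothetical contents of <user>_credentials.R. All values are placeholders.

TID <- "00000000-0000-0000-0000-000000000000"  # Tenant ID.
CID <- "00000000-0000-0000-0000-000000000000"  # Client (application) ID.
KEY <- "replace-with-your-authentication-key"  # Authentication key.

# Read the ssh public key if one exists at the assumed default path.
pubkey_file <- path.expand("~/.ssh/id_rsa.pub")

PUBKEY <- if (file.exists(pubkey_file)) {
  readLines(pubkey_file, warn = FALSE)[1]
} else {
  NA_character_
}
```

Keeping these values in a separate file that is not shared or committed avoids exposing secrets in the script itself.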

# Load the required subscription resources: TID, CID, and KEY.
# Also includes the ssh PUBKEY for the user.

USER <- Sys.info()[['user']]

source(paste0(USER, "_credentials.R"))

# Load the required packages.

library(AzureSMR)    # Support for managing Azure resources.
library(AzureDSVM)   # Further support for the Data Scientist.
library(magrittr)    # Pipeline computation.
library(dplyr)       # Data wrangling.

# Parameters for this script: the name for the new resource group and
# its location across the Azure cloud. The resource name is used to
# name the resource group that we will create transiently for the
# purposes of this script.

# Create a random resource group to reduce likelihood of conflict with
# other users.

BASE <- 
  runif(4, 1, 26) %>%
  round() %>%
  letters[.] %>%
  paste(collapse="") %T>%
  {sprintf("Base name:\t\t%s", .) %>% cat("\n")}

RG <-
  paste0("my_dsvm_", BASE,"_rg_sea") %T>%
  {sprintf("Resource group:\t\t%s", .) %>% cat("\n")}

# Choose a data centre location.

LOC <-
  "southeastasia"  %T>%
  {sprintf("Data centre location:\t%s", .) %>% cat("\n")}

cat("\n")

# Number of VMs to deploy.

COUNT <- 4

# Connect to the Azure subscription and use this as the context for
# all of our activities.

context <- createAzureContext(tenantID=TID, clientID=CID, authKey=KEY)

# Check if the resource group already exists. Take note this script
# will not remove the resource group if it pre-existed.

rg_pre_exists <- existsRG(context, RG, LOC)

if (! rg_pre_exists)
{
  # Create a new resource group into which we create the VMs and
  # related resources. Resource group name is RG. 

  # Note that to create a new resource group one needs to add access
  # control of Active Directory application at subscription level.

  azureCreateResourceGroup(context, RG, LOC) %>% cat("\n\n")

}

# Check that it now exists.

cat("Resource group", RG, "at", LOC,
    ifelse(!existsRG(context, RG, LOC), "does not exist.\n", "exists.\n"), "\n")
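
The base name above is built by rounding uniform draws from the interval 1 to 26, which makes the letters a and z about half as likely as the others. That is harmless for avoiding name clashes. Should uniform selection be preferred, one possible alternative is sketched below using sample(). The helper random_base() is hypothetical and not part of AzureDSVM.

```r
# Hypothetical helper: pick n lowercase letters uniformly at random.
random_base <- function(n = 4) {
  paste(sample(letters, n, replace = TRUE), collapse = "")
}

base <- random_base()
cat("Base name:", base, "\n")
```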

Create a Cluster

Multi-deployment of DSVMs is achieved by calling deployDSVMCluster(). The function switches implicitly between deploying a cluster and deploying a collection of DSVMs, according to its inputs. If hostname (the name given to the DSVMs) is a single character string, a cluster of homogeneous DSVMs (i.e., all the same size) is deployed, and each full hostname is formed by appending a sequential number to that base name. If hostname is instead a vector of character strings, the function creates a collection of machines with exactly the names specified.
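
The naming rule can be illustrated with a small stand-alone function. This is only a sketch of the behaviour described above: expand_hostnames() is a hypothetical helper, not the package's implementation, and the three-digit zero-padded suffix is an assumption, since the exact numbering format is not stated here.

```r
# Hypothetical helper, not part of AzureDSVM.
expand_hostnames <- function(hostname, count = 1) {
  if (length(hostname) == 1 && count > 1) {
    # Cluster: one base name plus a sequential number (padding is assumed).
    paste0(hostname, sprintf("%03d", seq_len(count)))
  } else {
    # Collection: the names are used exactly as given.
    hostname
  }
}

print(expand_hostnames("abcd", 4))
print(expand_hostnames(c("alpha", "beta", "gamma")))
```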

A cluster of DSVMs is useful when a batch-based analytical job needs to be performed in a desired computing context, especially in a distributed manner across the nodes of the cluster. The distributed computing functionality is supported by the Microsoft RevoScaleR parallel computing backend. This backend is socket-based and relies on SSH for secure communication. To allow this communication across nodes, deployDSVMCluster() adds inbound security rules to the security group of each DSVM in the cluster and establishes public key pairs for the machines.

We can now deploy a cluster of homogeneous DSVMs. Each DSVM will be named based on the hostname provided and sequentially numbered.

# Deploy a cluster of DSVMs.

deployDSVMCluster(context, 
                  resource.group=RG, 
                  location=LOC, 
                  hostname=BASE,
                  username=USER, 
                  authen="Key",
                  pubkey=PUBKEY,
                  count=COUNT)

cluster <- azureListVM(context, RG, LOC)

# Validate the existence of the deployed DSVMs.

for (i in 1:COUNT)
{
  vm <- cluster[i, "name"]
  fqdn <- paste(cluster[i, "name"],
                cluster[i, "location"],
                "cloudapp.azure.com",
                sep=".")

  cat(vm, "\n")

  operateDSVM(context, RG, vm, operation="Check")

  # Send a simple system() command across to the new server to test
  # its existence. Expect a single line with an indication of how long
  # the server has been up and running.

  cmd <- paste("ssh -q",
               "-o StrictHostKeyChecking=no",
               "-o UserKnownHostsFile=/dev/null\\\n   ",
               fqdn,
               "uptime") %T>%
    {cat(., "\n")}
  system(cmd)
  cat("\n")
}

Create a Collection

We can also create a collection of Linux DSVMs, each with a different user and public-key based authentication. The name, username, and size of each machine can be configured.

DSVM_NAMES <- paste0(BASE, c(1, 2, 3)) %T>% print()
DSVM_USERS <- paste0("user", c("a", "b", "c")) %T>% print()

# Deploy multiple DSVMs using deployDSVMCluster.

deployDSVMCluster(context, 
                  resource.group=RG, 
                  location=LOC, 
                  hostname=DSVM_NAMES, 
                  username=DSVM_USERS, 
                  authen="Key",
                  pubkey=rep(PUBKEY, length(DSVM_NAMES)))

# Check the deployed machines.

for (i in seq_along(DSVM_NAMES))
{
  vm <- DSVM_NAMES[i]

  cat(vm, "\n")

  operateDSVM(context, RG, vm, operation="Check")

  # Send a simple system() command across to the new server to test
  # its existence. Expect a single line with an indication of how long
  # the server has been up and running. Each machine in the collection
  # has its own user, so name that user explicitly.

  cmd <- paste("ssh -q",
               "-o StrictHostKeyChecking=no",
               "-o UserKnownHostsFile=/dev/null\\\n   ",
               paste0(DSVM_USERS[i], "@", vm, ".", LOC, ".cloudapp.azure.com"),
               "uptime") %T>%
    {cat(., "\n")}
  system(cmd)
  cat("\n")
}
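
The cluster and collection checks build almost the same ssh command, so the construction could be factored into one function. The following is a sketch only: uptime_cmd() is a hypothetical helper, and the host name and user in the usage line are made-up values.

```r
# Hypothetical helper: build the ssh command that runs uptime on a DSVM.
uptime_cmd <- function(host, location, user = NULL) {
  fqdn <- paste(host, location, "cloudapp.azure.com", sep = ".")

  # Prefix the remote user only when one is given; otherwise ssh
  # falls back to the local user name.
  target <- if (is.null(user)) fqdn else paste0(user, "@", fqdn)

  paste("ssh -q",
        "-o StrictHostKeyChecking=no",
        "-o UserKnownHostsFile=/dev/null",
        target,
        "uptime")
}

cmd <- uptime_cmd("abcd1", "southeastasia", user = "usera")
cat(cmd, "\n")
```

The resulting string can be passed to system() exactly as in the loops above.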

Delete the Resource Group

# Delete the resource group now that we have proved existence. There
# is probably no need to wait. Only delete if it did not pre-exist
# this script. Deletion seems to take 10 minutes or more.

if (! rg_pre_exists)
  azureDeleteResourceGroup(context, RG)
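
Because deletion can take ten minutes or more, a script that needs to know when the resource group is gone might poll for it. The sketch below shows a generic polling loop. wait_until() is a hypothetical helper, and the commented call assumes that existsRG() reports FALSE once deletion has completed, which has not been verified here. The demonstration uses a stand-in predicate so that it runs without an Azure connection.

```r
# Hypothetical helper: call done() until it returns TRUE or until
# timeout seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(done, timeout = 900, interval = 30) {
  start <- Sys.time()
  repeat {
    if (isTRUE(done())) return(TRUE)
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    if (elapsed >= timeout) return(FALSE)
    Sys.sleep(interval)
  }
}

# With a live context this might be called as (not run):
#   wait_until(function() !existsRG(context, RG, LOC))

# Offline demonstration: the stand-in predicate succeeds on the third poll.
polls <- 0
ok <- wait_until(function() { polls <<- polls + 1; polls >= 3 },
                 timeout = 5, interval = 0)
cat("Finished:", ok, "after", polls, "polls\n")
```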

Azure/AzureDSVM documentation built on May 20, 2019, 2:03 p.m.
