Sometimes more than one DSVM is needed.
Multi-deployment of heterogeneous DSVMs may be required for a collaborative project where each of group member works on a machine with a specific configuration targetting their requirements. For instance, a powerful (expensive) machine is assigned to perform computationally intensive tasks while a smaller (inexpensive) machine can be used for explorative or interactive tasks.
Another common use case is for a Data Scientist to create their R programs to analyse a dataset on their local compute platform (e.g., a laptop with 6GB RAM running Ubuntu with R installed). Development is performed with a subset of the full dataset (a random sample) that will not exceed the available memory and will return results quickly. When the experimental setup is complete the script can be sent across to a considerably more capable compute engine on Azure, possibly a cluster of servers to build models in parallel in a deploy/compute/destroy cycle to be a significantly more cost effective alternateive to an on-premise purchase of hardware.
This tutorial deploys a collection/cluster of Linux Data Science Virtual Machines (DSVMs) for the above two scenarios. In the latter scenarios the user distributes a compute task over those servers, collects the results and generates a report. Code is included but not run to then delete the resource group if the resources are no longer required. Once deleted consumption will cease.
This script is best run interactively to review its operation and to ensure that the interaction with Azure completes.
Refer to Deploy for an explanation of the set up of the virtual machines.
# Load the required subscription resources: TID, CID, and KEY. # Also includes the ssh PUBKEY for the user. USER <- Sys.info()[['user']] source(paste0(USER, "_credentials.R")) # Load the required packages. library(AzureSMR) # Support for managing Azure resources. library(AzureDSVM) # Further support for the Data Scientist. library(magrittr) # Pipeline computation. library(dplyr) # Data wrangling. # Parameters for this script: the name for the new resource group and # its location across the Azure cloud. The resource name is used to # name the resource group that we will create transiently for the # purposes of this script. # Create a random resource group to reduce likelihood of conflict with # other users. BASE <- runif(4, 1, 26) %>% round() %>% letters[.] %>% paste(collapse="") %T>% {sprintf("Base name:\t\t%s", .) %>% cat("\n")} RG <- paste0("my_dsvm_", BASE,"_rg_sea") %T>% {sprintf("Resource group:\t\t%s", .) %>% cat("\n")} # Choose a data centre location. LOC <- "southeastasia" %T>% {sprintf("Data centre location:\t%s", .) %>% cat("\n")} cat("\n") # Number of VMs to deploy. COUNT <- 4 # Connect to the Azure subscription and use this as the context for # all of our activities. context <- createAzureContext(tenantID=TID, clientID=CID, authKey=KEY) # Check if the resource group already exists. Take note this script # will not remove the resource group if it pre-existed. rg_pre_exists <- existsRG(context, RG, LOC) if (! rg_pre_exists) { # Create a new resource group into which we create the VMs and # related resources. Resource group name is RG. # Note that to create a new resource group one needs to add access # control of Active Directory application at subscription level. azureCreateResourceGroup(context, RG, LOC) %>% cat("\n\n") } # Check that it now exists. cat("Resource group", RG, "at", LOC, ifelse(!existsRG(context, RG, LOC), "does not exist.\n", "exists.\n"), "\n")
Multi-deployment of DSVMs can be achieved by calling
deployDSVMCluster()
. This function is designed to implicitly switch
between cluster and collection of DSVMs, according to the given
inputs. If the hostname
(i.e., names of the DSVMs) consists of only
one character string then a cluster of homogeneous DSVMs (i.e., same
size) will be deployed. Using the unique machine name as the base a
sequential numbering is used to form a full hostname. On the other
hand if the hostname
is a vector of character strings the function
will create a collection of machines with names so specified.
It is worth mentioning that a cluster of DSVMs is useful when a
batch-based analytical job needs to be performed in a desired
computing context, especially in a distributed manner across nodes of
the cluster. The distributed computing functionality is supported by
the Microsoft RevoScaleR parallel computing backend. The distributed
and parallel computing backend is socket-based and relies on SSH for
secure communication. To allow this communication across nodes,
deployDSVMCluster
adds inbound security rules into the security
group of each DSVM in the cluster, and establishes public key pairs for
the machines.
We can now deploy a cluster of homogeneous DSVMs. Each DSVM will be named based on the hostname provided and sequentially numbered.
# Deploy a cluster of DSVMs. deployDSVMCluster(context, resource.group=RG, location=LOC, hostname=BASE, username=USER, authen="Key", pubkey=PUBKEY, count=COUNT) cluster <- azureListVM(context, RG, LOC) # To validate the existence of deployed DSVMs. for (i in 1:COUNT) { vm <- cluster[i, "name"] fqdn <- paste(cluster[i, "name"], cluster[i, "location"], "cloudapp.azure.com", sep=".") cat(vm, "\n") operateDSVM(context, RG, vm, operation="Check") # Send a simple system() command across to the new server to test # its existence. Expect a single line with an indication of how long # the server has been up and running. cmd <- paste("ssh -q", "-o StrictHostKeyChecking=no", "-o UserKnownHostsFile=/dev/null\\\n ", fqdn, "uptime") %T>% {cat(., "\n")} cmd system(cmd) cat("\n") }
We can also create a collection of Linux DSVMs each with a different user and with public-key based authentication method. Name, username, and size can also be configured.
DSVM_NAMES <- paste0(BASE, c(1, 2, 3)) %T>% print() DSVM_USERS <- paste0("user", c("a", "b", "c")) %T>% print() # Deploy multiple DSVMs using deployDSVMCluster. deployDSVMCluster(context, resource.group=RG, location=LOC, hostname=DSVM_NAMES, username=DSVM_USERS, authen="Key", pubkey=rep(PUBKEY, COUNT)) # Check the deployed machines? for (vm in DSVM_NAMES) { cat(vm, "\n") operateDSVM(context, RG, vm, operation="Check") # Send a simple system() command across to the new server to test # its existence. Expect a single line with an indication of how long # the server has been up and running. cmd <- paste("ssh -q", "-o StrictHostKeyChecking=no", "-o UserKnownHostsFile=/dev/null\\\n ", paste0(vm, ".", LOC, ".cloudapp.azure.com"), "uptime") %T>% {cat(., "\n")} cmd system(cmd) cat("\n") }
# Delete the resource group now that we have proved existence. There # is probably no need to wait. Only delete if it did not pre-exist # this script. Deletion seems to take 10 minutes or more. if (! rg_pre_exists) azureDeleteResourceGroup(context, RG)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.