Automate Provisioning of HDInsight Clusters with PowerShell and Azure Automation

HDInsight is a Platform-as-a-Service Hadoop distribution that is part of the Cortana Intelligence Suite hosted in Azure.  With HDInsight, you can stand up a Hadoop cluster in about half an hour with Hive, Pig, Sqoop, and the rest of the Apache Hadoop  arsenal at your fingertips.

As with many PaaS offerings in Azure, HDInsight has 2 major value props:

  1.  Powered by the cloud, you can spin up or scale clusters with any number of nodes quickly and easily.
  2. You only pay for the compute and storage that you use.

HDInsight uses either Blob storage or Azure Data Lake Store for storing files.  These services, and the files stored on them, are managed separately from the HDInsight service.  This separation of compute and storage means that you can scale or even shut down the compute service without impacting storage.  Put another way, you can delete an HDInsight cluster, and the files that the cluster consumed or created will remain after the cluster is deleted.  When you stand a cluster back up on top of the same storage, the files will be available as if the cluster had never been deleted.  This means you can stand up a cluster when you need it for processing or querying data, and tear it down when it is not in use.

* 4 Node A3 Cluster

Because you only pay for HDInsight while a cluster is up and running, deleting a cluster when it is not in use provides a big opportunity for cost savings.  To quickly illustrate the potential impact, consider a simple 4 node cluster on standard A3 virtual machines (cheapest VMs at time of post).  At the time of this post, running this cluster 24 hours/day, 7 days/week costs just over $1,400/month or just over $17,000/year.  If you only run the cluster 5 days/week, that cost drops to about $1,000/month or $12,000/year.  If you run the cluster only 12 hours/day, 5 days/week, that cost drops to about $500/month or only $6,000/year.  If you only run the cluster on demand, as needed, the costs could potentially drop even lower.
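The arithmetic behind these figures can be sketched in a few lines of PowerShell.  This is a rough estimate only: it takes the roughly $1,400/month always-on figure quoted above as its baseline and scales it by the fraction of the week the cluster is running.  The function name Get-EstimatedMonthlyCost and its parameters are made up for illustration, and actual Azure pricing will differ.

```powershell
# Hypothetical helper: scale an always-on monthly cost by the share of the
# week the cluster is actually running. The default baseline is the
# approximate figure quoted in this post, not a current Azure price.
function Get-EstimatedMonthlyCost {
    param(
        [double]$AlwaysOnMonthlyCost = 1400,  # approx. cost of running 24x7
        [int]$HoursPerDay = 24,
        [int]$DaysPerWeek = 7
    )
    $hoursInWeek = 24 * 7
    $fractionRunning = ($HoursPerDay * $DaysPerWeek) / $hoursInWeek
    [math]::Round($AlwaysOnMonthlyCost * $fractionRunning, 2)
}

Get-EstimatedMonthlyCost                                  # 24 hours/day, 7 days/week
Get-EstimatedMonthlyCost -DaysPerWeek 5                   # 24 hours/day, 5 days/week
Get-EstimatedMonthlyCost -HoursPerDay 12 -DaysPerWeek 5   # 12 hours/day, 5 days/week
```

With the default baseline, the three calls return 1400, 1000, and 500, which line up with the monthly figures above.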

PowerShell Scripts for Provisioning HDInsight

Besides using the Azure Portal, HDInsight clusters can also be created and deleted using PowerShell scripts. 

The relatively straightforward PowerShell script below can be used to provision an HDInsight cluster on top of an existing Blob storage account.  

###########################################
# Azure Sign-In
###########################################
# Sign in
$azureUser = "" #Provide Microsoft account username
$azurePSWD = ConvertTo-SecureString "" -AsPlainText -Force # Provide Microsoft Account Password
$azureCredential = New-Object System.Management.Automation.PSCredential ($azureUser, $azurePSWD)
Login-AzureRmAccount -Credential $azureCredential

# Select the subscription to use
$subscriptionID = "" # Provide Azure SubscriptionID
Select-AzureRmSubscription -SubscriptionId $subscriptionID

###########################################
# Create an HDInsight Cluster
###########################################
# Cluster Variables
$resourceGroupName = "" # Provide Resource Group Name
$storageAccountName = "" # Provide Storage Account Name
$containerName = "" # Provide Blob Container 
$storageAccountKey = Get-AzureRmStorageAccountKey -Name $storageAccountName -ResourceGroupName $resourceGroupName | %{ $_.Key1 }
$clusterName = $containerName 
$clusterNodes = 1
$clusterUser = "" # Provide Cluster Username
$clusterSSHUser = "" # Provide SSH Username
$clusterPSWD = ConvertTo-SecureString "" -AsPlainText -Force # Provide password to use for cluster and ssh if the same
$clusterCredential = New-Object System.Management.Automation.PSCredential ($clusterUser, $clusterPSWD)
$sshCredential = New-Object System.Management.Automation.PSCredential ($clusterSSHUser, $clusterPSWD)
$clusterType = "Hadoop"
$clusterOS = "Linux" 
$clusterNodeSize = "Standard_A3"
$location = Get-AzureRmStorageAccount -ResourceGroupName $resourceGroupName -StorageAccountName $storageAccountName | %{$_.Location}

# Create a new HDInsight cluster
New-AzureRmHDInsightCluster -ClusterName $clusterName -ResourceGroupName $resourceGroupName -HttpCredential $clusterCredential -Location $location -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainer $containerName -ClusterSizeInNodes $clusterNodes -ClusterType $clusterType -OSType $clusterOS -Version "3.2" -SshCredential $sshCredential -HeadNodeSize $clusterNodeSize -WorkerNodeSize $clusterNodeSize

The script is a derivative of a script found here.  It assumes a storage account already exists (separation of compute and storage), and it automates the Azure login and credentialing parts of the script.
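One caveat when adapting the script: the Key1 property reflects the version of the storage cmdlets in use when this was written.  To my knowledge, later releases of Get-AzureRmStorageAccountKey return a list of key objects with KeyName and Value properties instead, so treat that second shape as an assumption to verify against your installed module.  A small, hypothetical helper (Get-PrimaryStorageKey is not part of any module) that accepts either shape might look like this:

```powershell
# Hypothetical helper: return the first storage key from either result shape.
function Get-PrimaryStorageKey {
    param([Parameter(Mandatory = $true)] $KeyResult)

    # Older shape: a single object exposing Key1 / Key2
    if ($KeyResult.PSObject.Properties.Name -contains 'Key1') {
        return $KeyResult.Key1
    }

    # Newer shape (assumed): a collection of objects with KeyName / Value
    return @($KeyResult)[0].Value
}
```

In the provisioning script, the result of Get-AzureRmStorageAccountKey would be passed to this helper instead of being piped to %{ $_.Key1 }.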

The corresponding script for tearing down a cluster (but leaving the data in place) is below.

###########################################
# Azure Sign-In
###########################################
# Sign in
$azureUser = "" # Provide Microsoft account username
$azurePSWD = ConvertTo-SecureString "" -AsPlainText -Force # Provide Microsoft Account Password
$azureCredential = New-Object System.Management.Automation.PSCredential ($azureUser, $azurePSWD)
Login-AzureRmAccount -Credential $azureCredential

# Select the subscription to use
$subscriptionID = "" # Provide Azure SubscriptionID
Select-AzureRmSubscription -SubscriptionId $subscriptionID

###########################################
# Delete HDInsight Cluster
###########################################
# Delete Cluster
$clusterName = "" # Provide Cluster Name
Remove-AzureRmHDInsightCluster -ClusterName $clusterName

Together, these 2 PowerShell scripts can be used to automate the provisioning of a cluster when needed for processing and querying, and to automate deleting the cluster when querying or processing is complete.
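One way to combine the two is to wrap the processing step in a try/finally so the cluster is removed even if a job fails.  The sketch below is illustrative only: Invoke-WithTemporaryCluster and its three script-block parameters are hypothetical names, and the script blocks are expected to contain the create, processing, and delete commands shown above.

```powershell
# Hypothetical wrapper: create a cluster, run some work, always tear it down.
function Invoke-WithTemporaryCluster {
    param(
        [Parameter(Mandatory = $true)][scriptblock]$Create,  # e.g. the New-AzureRmHDInsightCluster call
        [Parameter(Mandatory = $true)][scriptblock]$Work,    # processing or querying to run on the cluster
        [Parameter(Mandatory = $true)][scriptblock]$Remove   # e.g. the Remove-AzureRmHDInsightCluster call
    )

    & $Create
    try {
        & $Work
    }
    finally {
        # Runs whether or not $Work throws, so the cluster does not keep billing
        & $Remove
    }
}
```

Note that only terminating errors trigger the early exit from the work step; the cleanup in the finally block runs in either case.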

Automate PowerShell with Azure Automation

The scripts above can be executed as-is from a local computer or Azure VM.  This may be acceptable in many scenarios, but Azure also offers a PaaS service called Azure Automation that makes it easy to manage and schedule such scripts.  The service provides other useful capabilities like:

  • securely storing parameters such as credentials, connections, and other variables that can be passed to your PowerShell scripts
  • stringing together PowerShell scripts to create a workflow
  • visual logging and metrics for scheduled and executed scripts

To get started with Azure Automation, you will need to create an Automation account.  This can be done in the Azure Portal.  The Automation service can be found by searching the Azure Marketplace for "Automation".

Before you can automate the PowerShell scripts from above with the Automation account, you will need to add the "AzureRM.HDInsight" module to the account.  To do this, you will need to open the "ASSETS" tile under resources and then open the "Modules" tile.

This displays a list of the modules currently installed in the Automation account.  You will need to click the Browse Gallery button to find and import the AzureRM.HDInsight module.

While not required, to avoid needing to upgrade/install additional dependent modules, I installed version 1.03 of the AzureRM.HDInsight module.  This can be done by clicking "View in PowerShell Gallery", scrolling down to and selecting the appropriate version of the module, and clicking the "Deploy to Azure Automation" button.

After importing the HDInsight module, the next thing you might want to do is create and store some Credentials and Variables in the Automation service.  Storing credentials and variables provides an easy and secure way to manage them without hard-coding them into your scripts.  Once created, these objects can be referenced directly in 1 or many PowerShell scripts.  Creating and managing variables and credentials is very straightforward and can easily be done from the "Assets" blade.
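The same assets can also be created from PowerShell rather than the portal.  The sketch below assumes the AzureRM.Automation module is installed and that you are already signed in.  The function name New-ClusterAutomationAssets is hypothetical, and the asset names simply mirror the ones used in the Automation scripts in this post.

```powershell
# Hypothetical helper: create the credential and variable assets used by the runbooks.
function New-ClusterAutomationAssets {
    param(
        [Parameter(Mandatory = $true)][string]$ResourceGroupName,
        [Parameter(Mandatory = $true)][string]$AutomationAccountName,
        [Parameter(Mandatory = $true)][string]$SubscriptionId
    )

    # Shared parameters, splatted into each call below
    $account = @{
        ResourceGroupName     = $ResourceGroupName
        AutomationAccountName = $AutomationAccountName
    }

    # Prompt for each credential instead of hard-coding passwords
    foreach ($name in 'cred-admin', 'cred-clusteruser', 'cred-sshuser') {
        $credential = Get-Credential -Message "Enter the credential for $name"
        New-AzureRmAutomationCredential @account -Name $name -Value $credential
    }

    # Store the subscription ID as an unencrypted variable
    New-AzureRmAutomationVariable @account -Name 'var-subscriptionid' -Value $SubscriptionId -Encrypted $false
}
```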

After you have created your Variables and Credentials, you are ready to create your script.  The script below is a modified version of the provisioning script from earlier in this post.  The main difference between the 2 scripts is that the script below references stored Automation credentials for the Azure, cluster, and SSH credentials, and a stored Automation variable for the Azure subscription ID.  This is done using the "Get-AutomationPSCredential" and "Get-AutomationVariable" commands.

###########################################
# Azure Sign-In
###########################################
# Sign in
$azureCredential = Get-AutomationPSCredential -Name 'cred-admin' #Automation credential for Azure login
Login-AzureRmAccount -Credential $azureCredential

# Select the subscription to use
$subscriptionID = Get-AutomationVariable -Name 'var-subscriptionid' #Automation variable for subscription id
Select-AzureRmSubscription -SubscriptionId $subscriptionID

###########################################
# Create HDInsight Cluster
###########################################
# Cluster Variables
$resourceGroupName = "" #Provide Resource Group Name
$storageAccountName = "" #Provide Storage Account Name
$containerName = "" #Provide Blob Container
$storageAccountKey = Get-AzureRmStorageAccountKey -Name $storageAccountName -ResourceGroupName $resourceGroupName | %{ $_.Key1 }
$clusterName = $containerName 
$clusterCredential = Get-AutomationPSCredential -Name 'cred-clusteruser' #Automation credential for cluster user
$sshCredential = Get-AutomationPSCredential -Name 'cred-sshuser' #Automation credential for ssh user
$clusterType = "Hadoop"
$clusterOS = "Linux"
$clusterNodes = 1  
$clusterNodeSize = "Standard_A3"
$location = Get-AzureRmStorageAccount -ResourceGroupName $resourceGroupName -StorageAccountName $storageAccountName | %{$_.Location}

# Create a new HDInsight cluster
New-AzureRmHDInsightCluster -ClusterName $clusterName -ResourceGroupName $resourceGroupName -HttpCredential $clusterCredential -Location $location -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainer $containerName  -ClusterSizeInNodes $clusterNodes -ClusterType $clusterType -OSType $clusterOS -Version "3.2" -SshCredential $sshCredential -HeadNodeSize $clusterNodeSize -WorkerNodeSize $clusterNodeSize

Here is the corresponding script for tearing down a cluster (but leaving the data in place).

###########################################
# Azure Sign-In
###########################################
# Sign in
$azureCredential = Get-AutomationPSCredential -Name 'cred-admin' #Automation credential for Azure login
Login-AzureRmAccount -Credential $azureCredential

# Select the subscription to use
$subscriptionID = Get-AutomationVariable -Name 'var-subscriptionid' #Automation variable for subscription id
Select-AzureRmSubscription -SubscriptionId $subscriptionID

###########################################
# Delete HDInsight Cluster
###########################################
# Delete Cluster
Remove-AzureRmHDInsightCluster -ClusterName "" # Provide Cluster Name

To use either of these scripts in the Automation service, you will need to create a runbook.  The "Runbooks" tile can be found in the "Resources" section of the "Automation account" blade.

When you create a runbook for a script like those above, make sure to choose "PowerShell" for the "Runbook type".  Once the runbook is created, you can simply copy and paste your PowerShell script into the editor in the portal and choose to either test it (which will execute the script in your Azure subscription) or publish it.
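If you prefer to script this step, a runbook can also be imported from a local .ps1 file and published in one go.  This is a minimal sketch, assuming the AzureRM.Automation module and an existing sign-in; the function name Import-ClusterRunbook and its parameter values are placeholders.

```powershell
# Hypothetical helper: import a local script as a published PowerShell runbook.
function Import-ClusterRunbook {
    param(
        [Parameter(Mandatory = $true)][string]$ResourceGroupName,
        [Parameter(Mandatory = $true)][string]$AutomationAccountName,
        [Parameter(Mandatory = $true)][string]$ScriptPath,   # local path to the .ps1 file
        [Parameter(Mandatory = $true)][string]$RunbookName
    )

    Import-AzureRmAutomationRunbook -ResourceGroupName $ResourceGroupName `
        -AutomationAccountName $AutomationAccountName `
        -Path $ScriptPath -Name $RunbookName `
        -Type PowerShell -Published
}
```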

Finally, once you have a published runbook, you can either run it on-demand or schedule it to run (once or recurring) by clicking the appropriate Start and Schedule buttons on the Runbook blade.
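Scheduling can be scripted as well.  The sketch below creates a daily schedule and links it to a published runbook, again assuming the AzureRM.Automation module and an existing sign-in.  Register-DailyRunbookSchedule is a hypothetical name, and a daily interval is only one of the recurrence options.

```powershell
# Hypothetical helper: run a published runbook once a day from a given start time.
function Register-DailyRunbookSchedule {
    param(
        [Parameter(Mandatory = $true)][string]$ResourceGroupName,
        [Parameter(Mandatory = $true)][string]$AutomationAccountName,
        [Parameter(Mandatory = $true)][string]$RunbookName,
        [Parameter(Mandatory = $true)][string]$ScheduleName,
        [Parameter(Mandatory = $true)][datetime]$StartTime
    )

    $account = @{
        ResourceGroupName     = $ResourceGroupName
        AutomationAccountName = $AutomationAccountName
    }

    # Create the recurring schedule, then attach the runbook to it
    New-AzureRmAutomationSchedule @account -Name $ScheduleName -StartTime $StartTime -DayInterval 1
    Register-AzureRmAutomationScheduledRunbook @account -RunbookName $RunbookName -ScheduleName $ScheduleName
}

# For an on-demand run instead of a schedule (same module assumed):
# Start-AzureRmAutomationRunbook -ResourceGroupName "<group>" -AutomationAccountName "<account>" -Name "<runbook>"
```

Pairing one schedule for the provisioning runbook with a later one for the teardown runbook gives the "up only during working hours" pattern described earlier.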

Drops Mic

Using the concepts and tools presented here to help control costs and make things work more efficiently for your big data cloud solution can be just as important as the insights gained from the solution itself.  Remembering to architect these details into your solution can save you and your company a lot of time and money, and maybe even make you a hero along the way!