Automating HDInsight cluster creation with PowerShell

If I have to do a task more than twice, I try to automate as much of it as I can. Recently I wanted to spin up my own customized HDInsight cluster in Windows Azure. Since this was for work and I had to create clusters more than once, I automated the process with PowerShell. In the following code examples I explain what I did and why.

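Prerequisites:
1. Windows Azure subscription
2. Windows PowerShell V3 or later (the script uses the simplified where -Property syntax and Get-Credential -Message, both introduced in V3)
3. Windows Azure PowerShell and Microsoft .NET SDK for Hadoop
4. Follow the link from #3 to download and import your publishsettings file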

Now let’s start with the parameters needed. In this first section, I am gathering the parameters from the user of the script. 

If you are creating your cluster for the first time and you don’t have any data yet, here are the options to use:
1. $clusterName = Name that you would like to give your cluster
2. $clusterNodes = Number of data nodes you want
3. $location = Location of the cluster, in double quotes, e.g. "West US"
4. $storageAccountName = Name for the new storage account
5. $newContainerName = Name for the new container

If you have existing data in WASB (Windows Azure Storage Blob), the parameters are as follows:
1. $clusterName = Name that you would like to give your cluster
2. $clusterNodes = Number of data nodes you want
3. $existingStorage = Switch to use an existing storage account. No need to remember its name; you will be prompted to choose it
4. $existingContainer = Switch to use an existing container that houses your data; you will be prompted to choose it
5. $existingMetastoreDatabase = Switch to use an existing SQL Azure database as the Hive/Oozie metadata repository

param
    (
     [Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$false, HelpMessage='Provide a new cluster name for your HDInsight services')]
     [string] $clusterName,
     [Parameter(Position=1, Mandatory=$true, ValueFromPipeline=$false, HelpMessage='Provide the # of nodes you would like in the cluster' )]
     [int] $clusterNodes,
     [Parameter(Position=2, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='-existingStorage' )]
     [switch] $existingStorage,
     [Parameter(Position=3, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='-existingContainer' )]
     [switch] $existingContainer,
     [Parameter(Position=4, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='SQL Azure Server and Database must exist')]
     [switch] $existingMetastoreDatabase,
     [Parameter(Position=5, Mandatory=$false, ParameterSetName='new', ValueFromPipeline=$false, HelpMessage='Provide location if creating new storage account' )]
     [string] $location,
     [Parameter(Position=6, Mandatory=$false, ParameterSetName='new', ValueFromPipeline=$false, HelpMessage='Provide a name if creating a new storage account')]
     [string] $storageAccountName,
     [Parameter(Position=7, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='Provide a new container name')]
     [string] $newContainerName,
     [Parameter(Position=8, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='Provide a subscription name if more than one subscription exists')]
     [string] $subscriptionName
     )
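
The two parameter combinations listed above can also be checked up front, before any Azure call is made. The following is only a minimal sketch: Test-StorageArguments is a hypothetical helper name that is not part of the original script, and rejecting the case where both -existingStorage and -storageAccountName are given is my assumption (the script itself simply prefers the new-account path).

```powershell
# Hypothetical helper (not in the original script): returns an error message
# for an unusable combination of storage parameters, or $null when it is usable.
function Test-StorageArguments {
    param(
        [string] $StorageAccountName,
        [string] $Location,
        [switch] $ExistingStorage
    )
    # Assumption: treat "new" and "existing" as mutually exclusive
    if ($StorageAccountName -and $ExistingStorage) {
        return 'Use either -existingStorage or -storageAccountName, not both.'
    }
    # A new storage account needs a data center location
    if ($StorageAccountName -and -not $Location) {
        return 'Provide -location when creating a new storage account.'
    }
    # Neither option was supplied
    if (-not $StorageAccountName -and -not $ExistingStorage) {
        return 'Provide -existingStorage, or -storageAccountName with -location.'
    }
    return $null
}

# Example call with literal values: the location is missing, so a message comes back
$problem = Test-StorageArguments -StorageAccountName 'newstorage'
if ($problem) { Write-Host $problem -ForegroundColor Red }
```

Returning a message instead of calling Exit inside the helper keeps it easy to try out in an interactive session.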

Now the fun begins. This part of the code checks whether your Windows Azure PowerShell (or Windows PowerShell) session is connected to your Azure account. It does this with a try-catch block that tries to create a new storage account or list the existing ones. If it cannot, the script fails with an error asking you to run Add-AzureAccount for your Windows Azure subscription.

This part of the code also prompts you to choose an existing storage account if you used the second set of parameters above, or creates a new storage account if you used the first set.

try{
    If($storageAccountName)
        {
            if($location)
            {
                $storageAccountName = $storageAccountName.ToLower()
                Write-host "Using storage account:"$storageAccountName "at" $location "data center" -foreground Green
                New-AzureStorageAccount -StorageAccountName $storageAccountName -Location $location
            }
            Else
            {
                Write-Host ""
                Write-Host "Please provide a location when creating a new storage account. See example." -foreground RED
                Exit
            }   
        }
    Elseif($existingStorage)
        {
            $allAzureStorageAccounts = Get-AzureStorageAccount
            $azureStorageAccounts = $allAzureStorageAccounts.label
  
            if($azureStorageAccounts)
            {
                [int]$y=1
                $choice =@{}
                Write-Host ""
                ForEach ($storageAccount in $azureStorageAccounts)
                {
                    Write-Host $y : $storageAccount
                    $choice=$choice+@{$y=$storageAccount}
                    $y++
                }
            }
            Write-Host ""
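            # $input is a PowerShell automatic variable, so a separate name is used here
            [int]$storageInput = Read-Host "Enter the corresponding number to the existing storage"
            # $y is one past the last menu number, so valid choices are 1..($y - 1)
            if(($storageInput -ge $y) -or ($storageInput -lt 1))
                {
                    Write-Host "No associated Storage Account for that number" -foreground RED
                    Exit
                }
                else
                {
                    $storageAccountName = $choice.($storageInput)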
                    $location = ($allAzureStorageAccounts | where -Property Label -eq $storageAccountName).GeoPrimaryLocation
                    Write-host ""
                    Write-host "Using existing storage account:"$storageAccountName "at" $location "data center" -foreground Green
                          
                }
        }
    Else
        {
            Write-host ""
            Write-Host "Either Provide the switch -existingStorage or provide new Storage Account Name and Location. See example" -foreground RED
            Exit
        }
}
catch {
    "$_"
    exit
}
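
The numbered-menu pattern used above (print the choices, read a number, validate it) comes up again later for containers and databases, so it could be factored into a small helper. This is a sketch under my own assumptions, not the author's code: Resolve-MenuChoice and Read-MenuChoice are hypothetical names, and the list of items is assumed to be a plain array of strings.

```powershell
# Hypothetical helper: returns the item at a 1-based menu position,
# or $null when the number is outside 1..(number of items).
function Resolve-MenuChoice {
    param(
        [Parameter(Mandatory=$true)] [string[]] $Items,
        [Parameter(Mandatory=$true)] [int] $Selection
    )
    if ($Selection -lt 1 -or $Selection -gt $Items.Count) {
        return $null
    }
    return $Items[$Selection - 1]
}

# Hypothetical interactive wrapper: prints the numbered menu and reads a choice.
function Read-MenuChoice {
    param(
        [Parameter(Mandatory=$true)] [string[]] $Items,
        [Parameter(Mandatory=$true)] [string] $Prompt
    )
    for ($n = 1; $n -le $Items.Count; $n++) {
        Write-Host "$n : $($Items[$n - 1])"
    }
    $answer = Read-Host $Prompt
    $number = 0
    # TryParse avoids a conversion error when the user types non-numeric text
    if (-not [int]::TryParse($answer, [ref]$number)) { return $null }
    return Resolve-MenuChoice -Items $Items -Selection $number
}

# Non-interactive example with literal values
$picked = Resolve-MenuChoice -Items @('storagea', 'storageb') -Selection 2
Write-Host "Picked: $picked"
```

In the script, a call such as Read-MenuChoice -Items $azureStorageAccounts -Prompt "Enter the corresponding number to the existing storage" would stand in for the loop and the range check; a $null result means the number was not on the menu.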

Currently, HDInsight cluster creation errors out if it is attempted on a storage account that was created in an affinity group. The cryptic message shown on the Windows Azure Portal is “The request has failed. Please contact support for more information”. To guard against this, I added a simple check. The same block also retrieves the storage account key and creates a destination storage context.

#Currently it is not possible to create a HDInsight cluster on an affinitygroup. May remove when it's available.
$affinitysetting = get-azurestorageaccount -storageaccountname $storageAccountName | %{$_.affinitygroup}
if ($affinitysetting)
    {
        Write-Host "Cannot create HDInsight cluster on affinity group. Choose a storage account that is not part of an affinity group" -foreground RED
        Exit
    }   
      
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{$_.Primary}
# Create a storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
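
An alternative to rejecting the account after it has been chosen is to leave affinity-group accounts out of the menu in the first place. The sketch below assumes only what the script already relies on, namely that the objects returned by Get-AzureStorageAccount expose Label and AffinityGroup properties. Select-NonAffinityAccount is a hypothetical name, and the sample objects are stand-ins so the snippet runs without an Azure connection.

```powershell
# Hypothetical helper: keeps only accounts that are not part of an affinity group.
function Select-NonAffinityAccount {
    param([Parameter(Mandatory=$true)] [object[]] $Accounts)
    $Accounts | Where-Object { -not $_.AffinityGroup }
}

# Stand-in objects for illustration; in the script these would come from Get-AzureStorageAccount
$sample = @(
    New-Object PSObject -Property @{ Label = 'storagea'; AffinityGroup = 'myGroup' }
    New-Object PSObject -Property @{ Label = 'storageb'; AffinityGroup = $null }
)

# @() keeps the result an array even when only one account qualifies
$eligible = @(Select-NonAffinityAccount -Accounts $sample)
$eligible | ForEach-Object { Write-Host $_.Label }
```

With this filter in place the later affinity check becomes a safety net rather than the main guard.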

The following code block checks whether the user passed -existingContainer or is creating a new container. If an existing container is intended, you are again prompted with a choice.

if($newContainerName)
    {
    # Create a Blob storage container
    New-AzureStorageContainer -Name $newContainerName -Context $destContext
    $containerName = $newContainerName
    }
Elseif($existingContainer)
    {
        [int]$i=1
        $containerChoices = ($destContext|Get-AzureStorageContainer ).Name
        $containerChoice =@{}
        Write-Host ""
        ForEach ($cont in $containerChoices)
            {
                Write-Host $i : $cont
                $containerChoice=$containerChoice+@{$i=$cont}
                $i++
            }
        Write-Host ""
        [int]($contInput)=Read-host "Enter the corresponding number to the container" 
        # $i is one past the last menu number, so valid choices are 1..($i - 1)
        if(($contInput -ge $i) -or ($contInput -lt 1))
            {
                Write-Host "No associated container for that number" -foreground RED
                Exit
            }
        Else
            {   
                $containerName = $containerChoice.($contInput)
                Write-host ""
                Write-host "Using existing container:"$containerName -foreground Green      
            }           
    }
Else
    {
        Write-host ""
        Write-Host "Either provide the switch -existingContainer or provide a new container name. See example" -foreground RED
        Exit
    }

This last part of the code will either
1. Create a new cluster with SQL Azure as the Hive and Oozie metastore (recommended option)
2. Create a new cluster without SQL Azure as the metadata store, which internally uses a Derby database

I highly recommend creating an empty SQL Azure database (let’s call it metastore) and using it as the Hive/Oozie metadata store (1st option). This helps when the name node gets corrupted or becomes unavailable, or when the Derby database becomes unavailable for whatever reason. This way, the only thing you lose is the cluster: if you spin up another cluster, the data and the metadata are still there. All you need is an empty SQL Azure database, because the schema tables are created automatically by the HDInsight cluster creation process. If you already have SQL Azure databases, the -existingMetastoreDatabase option prompts you to choose one. You must know the correct username and password for the SQL Azure database. A wrong username or password results in the same cryptic message I mentioned earlier.

if(!$existingMetastoreDatabase)
    {
        # Create a new HDInsight cluster
        $hdCred = Get-Credential -Message "Provide Username and Password for your HDInsight cluster. Password requires at least 10 characters with 1 Uppercase, 1 lowercase, 1 number and 1 special character"
        New-AzureHDInsightCluster -Subscription $subscriptionName -Name $clusterName -Location $location -DefaultStorageAccountName "${storageAccountName}.blob.core.windows.net" `
        -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainerName $containerName -ClusterSizeInNodes $clusterNodes -Credential $hdCred
    }
Elseif($existingMetastoreDatabase)
    {
        $sqlServers=(Get-AzureSqlDatabaseServer).ServerName
        [int]$j=1
        $serverChoice = @{}
        $databaseChoice = @{}
        Write-Host ""
        forEach ($sqlServer in $sqlServers)
        {
            $sqlDatabaseName = (Get-AzureSqlDatabase -ServerName $sqlServer).Name | where {$_ -ne "master"}
            ForEach ($sqlDatabase in $sqlDatabaseName)
            {
                Write-Host $j : $sqlDatabase "on Server" $sqlServer
                $serverChoice = $serverChoice+@{$j=$sqlServer}
                $databaseChoice = $databaseChoice+@{$j=$sqlDatabase}
                $j++
            }
        }
        Write-Host ""
        [int]($serverDatabaseInput) = Read-host "Enter the corresponding number to the Database" 
        # $j is one past the last menu number, so valid choices are 1..($j - 1)
        if(($serverDatabaseInput -ge $j) -or ($serverDatabaseInput -lt 1))
            {
                Write-Host "No associated database for that number" -foreground RED
                Exit
            }
        Else
            {   
                $sqlServerName = $serverChoice.($serverDatabaseInput)
                $metastoreDatabase = $databaseChoice.($serverDatabaseInput)
            }
        Write-Host ""
        $metastoreUsername = Read-Host "Enter SqlServer Username"
        $metastorePassword = Read-Host "Enter SqlServer Password" -AsSecureString
        $cred = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $metastoreUsername, $metastorePassword
        $hdCred = Get-Credential -Message "Provide Username and Password for your HDInsight cluster. Password requires at least 10 characters with 1 Uppercase, 1 lowercase, 1 number and 1 special character"
        # Create a new HDInsight cluster with meta-store
        New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes `
        | Set-AzureHDInsightDefaultStorage -StorageAccountName "${storageAccountName}.blob.core.windows.net" -StorageAccountKey $storageAccountKey -StorageContainerName $containerName `
        | Add-AzureHDInsightMetastore -SqlAzureServerName "${sqlServerName}.database.windows.net" -DatabaseName $metastoreDatabase -Credential $cred -MetastoreType HiveMetaStore `
        | Add-AzureHDInsightMetastore -SqlAzureServerName "${sqlServerName}.database.windows.net" -DatabaseName $metastoreDatabase -Credential $cred -MetastoreType OozieMetaStore `
        | New-AzureHDInsightCluster -Subscription $subscriptionName -Name $clusterName -Location $location -Credential $hdCred
    }
Else
    {
        Write-Host "Provide any missing parameter(s). See example"
    }

You will notice that I am using the same SQL Azure database for both the Hive and Oozie metastores. I do this to keep the script simple. You can easily modify the code to give Hive and Oozie separate metastores if you run into any concurrency issues.
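
Because a bad credential only surfaces later as a cryptic portal error, it can help to check the cluster password locally first. The sketch below encodes only the rule quoted in the Get-Credential prompt (at least 10 characters with an uppercase letter, a lowercase letter, a number and a special character). Test-ClusterPassword is a hypothetical name, and treating any non-alphanumeric character as "special" is my assumption.

```powershell
# Hypothetical helper: checks a password against the rule quoted in the prompt.
function Test-ClusterPassword {
    param([Parameter(Mandatory=$true)] [string] $Password)
    # -cmatch is case-sensitive, which matters for the upper/lowercase checks
    $ok = ($Password.Length -ge 10) -and
          ($Password -cmatch '[A-Z]') -and
          ($Password -cmatch '[a-z]') -and
          ($Password -match '[0-9]') -and
          ($Password -match '[^a-zA-Z0-9]')   # assumption: "special" means non-alphanumeric
    return $ok
}

# Example with literal values
Write-Host (Test-ClusterPassword -Password 'Abcdefg1!x')
Write-Host (Test-ClusterPassword -Password 'tooShort1!')
```

With a credential object such as $hdCred, the plain-text value to check can be read with $hdCred.GetNetworkCredential().Password.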

Here are some examples that you can use to create your HDInsight cluster.

This first example creates a 2-node HDInsight cluster called myhdinsight on an existing storage account and container, and uses an existing metastore in a SQL Azure database.

.\HDInsightClusterDeploy.ps1 -clustername myhdinsight -clusternodes 2 -existingstorage -existingcontainer -existingmetastoredatabase

The second example creates a 2-node HDInsight cluster called myhdinsight in the West US data center with a new storage account named newstorage, and also creates a new container called newcontainer.

.\HDInsightClusterDeploy.ps1 -clustername myhdinsight -clusternodes 2  -location "west us" -storageaccountname newstorage -newcontainername newcontainer

This third example creates a 2-node HDInsight cluster called myhdinsight on existing storage and an existing container, and uses Derby as the metastore. Again, I don’t recommend this option, but it can be used for test purposes to quickly spin up a cluster.

.\HDInsightClusterDeploy.ps1 -clustername myhdinsight -clusternodes 2 -existingstorage -existingcontainer

 

Source:
Provision HDInsight Clusters
Microsoft .NET SDK for Hadoop

Special thanks to Cindy Gross, who had me put on my thinking cap and suggested many improvements to my code. You can follow her blogs on MSDN.

Appendix:
Putting it all together for the pleasure of copy and paste. Save it as HDInsightClusterDeploy.ps1. voilà!

##############################################################################################################################
# .SYNOPSIS
#    This script, HDInsightClusterDeploy.ps1 is used for creating new HDInsight Services Cluster
#
# .NOTES
#    Filename: HDInsightClusterDeploy.ps1
#    Author: Murshed Zaman & Cindy Gross
#    Date: 11/13/2013
#    Requires: PowerShell V3 or later (uses the simplified where -Property syntax and Get-Credential -Message)
#              Windows Azure PowerShell (http://www.windowsazure.com/en-us/manage/services/hdinsight/install-and-configure-powershell-for-hdinsight/)
#               Microsoft .NET SDK for Hadoop (same URL as above)
#               Follow: Connect to your subscription from the above mentioned URL to Get/Import-AzurePublishSettingsFile once
#               Run Add-AzureAccount    on Windows Azure PowerShell for connection to your default subscription
#    Version: 1.0
#    Revision: none
#
# .DESCRIPTION
#    This script will create HDInsight Cluster for a given Azure subscription. There are some mandatory
#     and some optional parameters. See examples for clarification.
#
# .EXAMPLE (Using existing Azure Storage. User is prompted for selecting from their existing storage.)
#    .\HDInsightClusterDeploy.ps1 -clusterName myhdinsight -clusterNodes 2 -existingStorage -existingContainer
#
# .EXAMPLE (Supply the "location" in double quotes, a new storage account name and new container name)
#    .\HDInsightClusterDeploy.ps1 -clusterName myhdinsight -clusterNodes 2  -location "West US" -storageAccountName newstorage -newContainerName newcontainer
#
# .EXAMPLE (Use existing SQL Azure as Oozie and Hive MetaStore on existing storage/container. SQL Azure Server and database must exist, firewall configured, windows azure services on)
#    .\HDInsightClusterDeploy.ps1 -clusterName myhdinsight -clusterNodes 2 -existingStorage -existingContainer -existingMetastoreDatabase
#
##############################################################################################################################

param
    (
     [Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$false, HelpMessage='Provide a new cluster name for your HDInsight services')]
     [string] $clusterName,
     [Parameter(Position=1, Mandatory=$true, ValueFromPipeline=$false, HelpMessage='Provide the # of nodes you would like in the cluster' )]
     [int] $clusterNodes,
     [Parameter(Position=2, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='-existingStorage' )]
     [switch] $existingStorage,
     [Parameter(Position=3, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='-existingContainer' )]
     [switch] $existingContainer,
     [Parameter(Position=4, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='SQL Azure Server and Database must exist')]
     [switch] $existingMetastoreDatabase,
     [Parameter(Position=5, Mandatory=$false, ParameterSetName='new', ValueFromPipeline=$false, HelpMessage='Provide location if creating new storage account' )]
     [string] $location,
     [Parameter(Position=6, Mandatory=$false, ParameterSetName='new', ValueFromPipeline=$false, HelpMessage='Provide a name if creating a new storage account')]
     [string] $storageAccountName,
     [Parameter(Position=7, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='Provide a new container name')]
     [string] $newContainerName
     )
     
            
$multipleSubscription = Get-AzureSubscription|%{$_.SubscriptionName}
if($multipleSubscription.Count -gt 1)
    {
        [int]$s=1
        $subChoice = @{}
        Write-Host ""
        forEach($subName in $multipleSubscription)
        {
            Write-Host $s : $subName
            $subChoice = $subChoice+@{$s=$subName}
            $s++
        }
        Write-Host ""
        [int]($subNumber) = Read-Host "Enter the corresponding number to the subscription you would like to use"
        # $s is one past the last menu number, so valid choices are 1..($s - 1)
        if(($subNumber -ge $s) -or ($subNumber -lt 1))
        {
            Write-Host "No associated subscription for that number" -foreground RED
            Exit
        }
        else
        {
            $selectSubscription = $subChoice.($subNumber)
            Select-AzureSubscription -SubscriptionName ${selectSubscription} -Current 
            $subscriptionName = $selectSubscription
        }
    }
else
    {
        $subscriptionName = $multipleSubscription
        Select-AzureSubscription -SubscriptionName ${subscriptionName} -Current
    }


try{
    If($storageAccountName)
        {
            if($location)
            {
                $storageAccountName = $storageAccountName.ToLower()
                Write-host "Using storage account:"$storageAccountName "at" $location "data center" -foreground Green
                New-AzureStorageAccount -StorageAccountName $storageAccountName -Location $location
            }
            Else
            {
                Write-Host ""
                Write-Host "Please provide a location when creating a new storage account. See example." -foreground RED
                Exit
            }    
        }
    Elseif($existingStorage)
        {
            $allAzureStorageAccounts = Get-AzureStorageAccount
            $azureStorageAccounts = $allAzureStorageAccounts.label

            if($azureStorageAccounts)
            {
                [int]$y=1
                $choice =@{}
                Write-Host ""
                ForEach ($storageAccount in $azureStorageAccounts)
                {
                    Write-Host $y : $storageAccount
                    $choice=$choice+@{$y=$storageAccount}
                    $y++
                }
            }
            Write-Host ""
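            # $input is a PowerShell automatic variable, so a separate name is used here
            [int]$storageInput = Read-Host "Enter the corresponding number to the existing storage"
            # $y is one past the last menu number, so valid choices are 1..($y - 1)
            if(($storageInput -ge $y) -or ($storageInput -lt 1))
                {
                    Write-Host "No associated Storage Account for that number" -foreground RED
                    Exit
                }
                else
                {
                    $storageAccountName = $choice.($storageInput)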
                    $location = ($allAzureStorageAccounts | where -Property Label -eq $storageAccountName).GeoPrimaryLocation
                    Write-host ""
                    Write-host "Using existing storage account:"$storageAccountName "at" $location "data center" -foreground Green
                        
                }
        }
    Else
        {
            Write-host ""
            Write-Host "Either Provide the switch -existingStorage or provide new Storage Account Name and Location. See example" -foreground RED
            Exit
        }
}
catch {
    "$_"
    Exit
}
#Currently it is not possible to create a HDInsight cluster on an affinitygroup. May remove when it's available.
$affinitysetting = get-azurestorageaccount -storageaccountname $storageAccountName | %{$_.affinitygroup}
if ($affinitysetting)
    {
        Write-Host "Cannot create HDInsight cluster on affinity group. Choose a storage account that is not part of an affinity group" -foreground RED
        Exit
    }    
    
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{$_.Primary}
# Create a storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

if($newContainerName)
    {
    # Create a Blob storage container
    New-AzureStorageContainer -Name $newContainerName -Context $destContext
    $containerName = $newContainerName
    }
Elseif($existingContainer)
    {
        [int]$i=1
        $containerChoices = ($destContext|Get-AzureStorageContainer ).Name
        $containerChoice =@{}
        Write-Host ""
        ForEach ($cont in $containerChoices)
            {
                Write-Host $i : $cont
                $containerChoice=$containerChoice+@{$i=$cont}
                $i++
            }
        Write-Host ""
        [int]($contInput)=Read-host "Enter the corresponding number to the container" 
        # $i is one past the last menu number, so valid choices are 1..($i - 1)
        if(($contInput -ge $i) -or ($contInput -lt 1))
            {
                Write-Host "No associated container for that number" -foreground RED
                Exit
            }
        Else
            {    
                $containerName = $containerChoice.($contInput)
                Write-host ""
                Write-host "Using existing container:"$containerName -foreground Green        
            }            
    }
Else
    {
        Write-host ""
        Write-Host "Either provide the switch -existingContainer or provide a new container name. See example" -foreground RED
        Exit
    }

if(!$existingMetastoreDatabase)
    {
        # Create a new HDInsight cluster
        $hdCred = Get-Credential -Message "Create admin Username and Password for your HDInsight cluster. Password requires at least 10 characters with 1 Uppercase, 1 lowercase, 1 number and 1 special character"
        New-AzureHDInsightCluster -Name $clusterName -Location $location -DefaultStorageAccountName "${storageAccountName}.blob.core.windows.net" `
        -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainerName $containerName -ClusterSizeInNodes $clusterNodes -Credential $hdCred
    }
Elseif($existingMetastoreDatabase)
    {
        $allSqlServers= Get-AzureSqlDatabaseServer
        $sqlServers = ($allSqlServers |where -Property Location -eq $location).ServerName
        if ($sqlServers)
        {
            [int]$j=1
            $serverChoice = @{}
            $databaseChoice = @{}
            Write-Host ""
            forEach ($sqlServer in $sqlServers)
            {
                $sqlDatabaseName = (Get-AzureSqlDatabase -ServerName $sqlServer).Name | where {$_ -ne "master"}
                ForEach ($sqlDatabase in $sqlDatabaseName)
                {
                    Write-Host $j : $sqlDatabase "on Server" $sqlServer
                    $serverChoice = $serverChoice+@{$j=$sqlServer}
                    $databaseChoice = $databaseChoice+@{$j=$sqlDatabase}
                    $j++
                }
            }
            Write-Host ""
            [int]($serverDatabaseInput) = Read-host "Enter the corresponding number to the Database" 
            # $j is one past the last menu number, so valid choices are 1..($j - 1)
            if(($serverDatabaseInput -ge $j) -or ($serverDatabaseInput -lt 1))
                {
                    Write-Host "No associated database for that number" -foreground RED
                    Exit
                }
            Else
                {    
                    $sqlServerName = $serverChoice.($serverDatabaseInput)
                    $metastoreDatabase = $databaseChoice.($serverDatabaseInput)
                }
            Write-Host ""
            $metastoreUsername = Read-Host "Enter SqlServer Username"
            $metastorePassword = Read-Host "Enter SqlServer Password" -AsSecureString
            $cred = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $metastoreUsername, $metastorePassword
            $hdCred = Get-Credential -Message "Create admin Username and Password for your HDInsight cluster. Password requires at least 10 characters with 1 Uppercase, 1 lowercase, 1 number and 1 special character"
            # Create a new HDInsight cluster with meta-store
            New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes `
            | Set-AzureHDInsightDefaultStorage -StorageAccountName "${storageAccountName}.blob.core.windows.net" -StorageAccountKey $storageAccountKey -StorageContainerName $containerName `
            | Add-AzureHDInsightMetastore -SqlAzureServerName "${sqlServerName}.database.windows.net" -DatabaseName $metastoreDatabase -Credential $cred -MetastoreType HiveMetaStore `
            | Add-AzureHDInsightMetastore -SqlAzureServerName "${sqlServerName}.database.windows.net" -DatabaseName $metastoreDatabase -Credential $cred -MetastoreType OozieMetaStore `
            | New-AzureHDInsightCluster -Name $clusterName -Location $location -Credential $hdCred
        }
        else
        {
            Write-Host "You need to create a SQL Server Database in "$location "to use this option" -foreground RED
            Exit
        }
    }
else
    {
        Write-Host "Provide any missing parameter(s). See example"
    }

About MurshedAzureCAT
http://about.me/murshedazurecat
