Flattening JSON docs in Snowflake Data Warehouse

In this blog, I will go over some tips and tricks that I used to flatten JSON data into a relational form in the Snowflake data warehouse.

There are many examples of how to load JSON data into the Snowflake data warehouse, so I am going to skip that part. It is worth mentioning, though, that we do a full load of our MongoDB data over the weekend, and the rest of the time we do delta loads based on the record updated date in MongoDB. When loading the delta portion of the JSON docs, we rely heavily on what I call the delete-insert process, using joins to insert the data into JSON tables. We use Snowpipe to load the data from MongoDB into Snowflake.
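As a hedged sketch of that delete-insert pattern (the table names json_table and stage_table, and the join on the document id, are illustrative assumptions, not our actual objects):

```sql
-- Sketch only: remove any documents whose ids appear in the freshly
-- staged delta, then re-insert the latest version of those documents.
BEGIN;

DELETE FROM json_table t
USING stage_table s
WHERE t.v:_id:"$oid"::STRING = s.v:_id:"$oid"::STRING;

INSERT INTO json_table
SELECT v FROM stage_table;

COMMIT;
```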

My JSON examples are from MongoDB, but any valid JSON will work. My screenshots of JSON documents are from MongoDB Compass, a UI tool that allows you to view/update/delete data in MongoDB. Since this is a public blog, I will blur and obfuscate any identifying information, but the code examples will still show you how to write SQL that flattens JSON data in Snowflake.

First, here is a picture of one of the JSON documents:

I usually put a ‘create or replace table/view <name> as’ statement at the beginning of the following code to create a table or a view. I am describing the patterns that I follow; obviously, you decide what works for your data.
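For example, the wrapper looks like this (account_v is a made-up view name; the single column shown stands in for the full flattening SELECT):

```sql
CREATE OR REPLACE VIEW <database>.<schema>.account_v AS
SELECT
    LEFT(v:_id:"$oid", 24)::VARCHAR(24) AS id
FROM
    <database>.<schema>.<json_table>;
```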

Some explanations:

  1. While flattening string values, I always use the LEFT function to limit the number of characters, so my code does not fail if a field grows longer in the future.
  2. GUID ids sometimes need the “$oid” label to flatten, and date values need “$date”.
  3. I always cast my columns to a data type.
  4. The billing address is a higher-level object and address1 is a child object thus a dot notation is needed to bring out address1 and others.
  5. I am only interested in data that is not soft deleted in MongoDB, so I restrict to deleted = false.
  6. Why is everything v:? When we load JSON data into the table, we name the column v, so the v: notation refers to the data elements inside it.
SELECT 
    LEFT(v:_id:"$oid", 24)::VARCHAR(24)  AS id
,   LEFT(v:name, 100)::NVARCHAR(100) AS name
,   v:createdDate:"$date"::TIMESTAMP AS createdDate
,   v:updatedDate:"$date"::TIMESTAMP AS  updatedDate
,   LEFT(v:accountStatus, 20)::VARCHAR(20) AS accountStatus
,   LEFT(v:billingId, 30)::VARCHAR(30) AS billingId
,   LEFT(v:accountLevel, 20)::VARCHAR(20) AS accountLevel
,   LEFT(v:phone, 30)::NVARCHAR(30) AS phone
,   LEFT(v:fax, 30)::NVARCHAR(30) AS fax
,   LEFT(v:taxId, 50)::NVARCHAR(50) AS taxId
,   LEFT(v:description, 500)::NVARCHAR(500) AS description
,   LEFT(v:billingAddress.address1, 100)::NVARCHAR(100) AS billingAddress1
,   LEFT(v:billingAddress.address2, 100)::NVARCHAR(100) AS billingAddress2
,   LEFT(v:billingAddress.unitNumber, 30)::NVARCHAR(30) AS billingUnitNumber
,   LEFT(v:billingAddress.city, 100)::NVARCHAR(100) AS billingCity
,   LEFT(v:billingAddress.state, 100)::NVARCHAR(100) AS billingState
,   LEFT(v:billingAddress.stateIso, 2)::NCHAR(2) AS billingStateIso
,   LEFT(v:billingAddress.postalCode, 20)::NVARCHAR(20) AS billingPostalCode
,   LEFT(v:billingAddress.country, 100)::NVARCHAR(100) AS billingCountry
,   LEFT(v:billingAddress.countryIso, 2)::CHAR(2) AS billingCountryIso
,   LEFT(v:website, 50)::NVARCHAR(50) AS website
,   CASE WHEN v:deleted = 'true' THEN 1 ELSE 0 END AS deleted
FROM 
    <database>.<schema>.<json_table>
WHERE
    v:deleted = 'false'
;

A way to think about arrays: JSON documents are made of key/value pairs. Think of the key as the column name; the value holds the measure for that key. My definition of an array is a way to store different values under the same key (a list of values). In a JSON document, arrays appear inside square brackets [ ], and if you are looking at the document in MongoDB Compass, it will say the word ‘array’ next to the key (column name).
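For example, a trimmed, made-up document with an array might look like this (all values are invented for illustration):

```json
{
  "_id": { "$oid": "5f1a2b3c4d5e6f7a8b9c0d1e" },
  "name": "Example Account",
  "tags": ["wholesale", "priority", "west-region"]
}
```

Here tags is an array: one key holding a list of values.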

Two ways to flatten an array:

  1. When the number of values is known, I can flatten the array by referencing it with dot notation and the array element index. Example: coordinates on a map.
  2. When an array contains an unknown number of values, I use the lateral flatten function.

Method 1

In the following example, locations is a higher-level array and coordinates is a child array within it. The coordinates array has only two values, and I need both of them to find the point on a map.

SELECT 
   LEFT(v:_id:"$oid", 24)::VARCHAR(24) AS id
   ,v:locations[0].coordinates[0]
   ,v:locations[0].coordinates[1] 
FROM 
    <database>.<schema>.<json_table>
;

Method 2

The other way to flatten an array is to use the lateral flatten function provided by Snowflake, which makes arrays much easier to deal with. I would compare the lateral flatten function to an outer join in SQL: for each id, the function will flatten everything that was included in the locations array, and if an array value is missing from a record, it will be NULL.

SELECT 
    LEFT(v:_id:"$oid", 24)::VARCHAR(24) AS id
,   LEFT(flat2.value:name, 100)::NVARCHAR(100) AS name
,   LEFT(flat2.value:address1, 100)::NVARCHAR(100) AS address1
,   LEFT(flat2.value:city, 100)::NVARCHAR(100) AS city
,   LEFT(flat2.value:state, 100)::NVARCHAR(100) AS state
,   LEFT(flat2.value:postalCode, 20)::NVARCHAR(20) AS postalCode
,   LEFT(flat2.value:country, 100)::NVARCHAR(100) AS country
,   LEFT(flat2.value:googlePlacesId, 450)::VARCHAR(450) AS googlePlacesId
,   (flat2.value:coordinates[0] || ', ' || flat2.value:coordinates[1])::VARCHAR(100) AS coordinates
FROM
    <database>.<schema>.<json_table>
    , LATERAL FLATTEN (INPUT => v:locations, OUTER => TRUE) flat2
;

When working with arrays in a JSON document, I sometimes need to keep the sort or index order of certain array elements. To do that, I use the zero-based index of the array. In the following example, I have multiple tags for each id. For business reasons, I need to combine the tags into one field, and if I want to keep the order of the tags as they appear in the document, I can use the following code.

SELECT 
    id
,   LISTAGG(tags, ',') WITHIN GROUP (ORDER BY sortId) AS tags
FROM
(
SELECT
    LEFT(v:_id:"$oid", 24)::VARCHAR(24) AS id
,   LEFT(flat2.value, 1000)::VARCHAR(1000) AS tags
,   flat2.index AS sortId
FROM 
    <database>.<schema>.<json_table>
    , LATERAL FLATTEN (input => v:tags, outer => true) flat2
)a
GROUP BY
    id
;

Finally, flattening more than one array with multiple lateral flatten functions in the same query may produce two or more rows for the same id (a cross product), and that may not be the intended behavior. I usually flatten only one array at a time and then create another view where I join all the views together.
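A hedged sketch of that view-on-views approach (account_v, location_v, and tag_v are made-up view names, each assumed to flatten one array and expose the document id):

```sql
CREATE OR REPLACE VIEW <database>.<schema>.account_full_v AS
SELECT
    a.id
,   a.name
,   l.address1
,   t.tags
FROM <database>.<schema>.account_v a
LEFT JOIN <database>.<schema>.location_v l ON l.id = a.id
LEFT JOIN <database>.<schema>.tag_v t ON t.id = a.id;
```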

Useful Hive settings

I have found that it is hard to find information on all the SET commands for Hive. Some are listed here. I am going to use this blog to save some Hive parameters that I was not able to find easily but that are very useful.

DIRECTORY RECURSIVENESS 

These two parameters work in conjunction. They can be set at the session level or at the beginning of a Hive query, and they can also be set permanently in hive-site.xml.

SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

When both are set to true, Hive tables can be pointed at a higher-level directory. Think of a scenario where the directory structure is as follows:

/iislogs/02b016a5d19748e5a7e711cc0ef94395/WebRole.Web/WebRole.Web_IN_0/Momentum/
W3SVC472036332/u_ex12041323.log

In the above example, the GUID-like directory name is auto-generated and changes frequently. In this case, it helps to point your table at the first level of the directory, “iislogs”, which is constant, and let Hive do the recursive work for you. See the table DDL in the next example.

If you are using partitioning on a table, I do not suggest using this option, as Hive may just look into the top-level directory and scan all the data instead of performing partition elimination.

SKIP HEADER ROWS

Sometimes when handling log files you will see header rows at the beginning of the data file. This is what an IIS log file looks like (real IPs are obfuscated):

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2012-04-13 23:00:02
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2012-04-13 23:00:02 W3SVC472036332 RD00155D3AA932 10.200.255.18 GET /blko/get/stats – 20000 – 10.200.216.12 HTTP/1.1 – – – 10.195.255.2:20000 200 0 0 336 82 296
2012-04-13 23:02:22 W3SVC472036332 RD00155D3AA932 10.200.255.18 GET /opt/get/stats – 20000 – 10.200.216.12 HTTP/1.1 – – – 10.195.255.2:20000 200 0 0 336 82 265

It has 4 header lines that you do not want to include in your Hive query. You have several choices for skipping header lines; two of them are Pig and Hive. In Hive, you can set a table property that will skip the header lines.

CREATE EXTERNAL TABLE iislogs (
  date STRING,
  time  STRING,
  s_sitename  STRING,
  s_computername  STRING,
  s_ip  STRING,
  cs_method  STRING,
  cs_uri_stem  STRING,
  cs_uri_query  STRING,
  s_port  STRING,
  cs_username  STRING,
  c_ip  STRING,
  cs_version  STRING,
  cs_user_agent  STRING,
  cs_cookie  STRING,
  cs_referer  STRING,
  cs_host  STRING,
  sc_status  INT,
  sc_substatus  STRING,
  sc_win32_status  STRING,
  sc_bytes  INT,
  cs_bytes  INT,
  time_taken INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE
LOCATION './iislogs'
TBLPROPERTIES ("skip.header.line.count"="4");

PRINT HEADER

Sometimes you want to see the column headers for a Hive query. To accomplish this, use the following SET command before your query to show column headers in STDOUT.

SET hive.cli.print.header=true;
select l_returnflag, count(*) as count from lineitem group by l_returnflag;

Result without header:

A       1478493
N       3043852
R       1478870

Result with header:

l_returnflag    count
A       1478493
N       3043852
R       1478870

Windows Azure PowerShell Copy Blob Script

Here is a PowerShell script that can be used to move data in Windows Azure from one subscription to another subscription’s blob storage. The script uses the Start-AzureStorageBlobCopy cmdlet, which performs an asynchronous copy operation, and copies all the blobs from one container to another. If you only intend to move selected files, you can pass in a file-matching pattern.

Prerequisite:
Download Windows Azure PowerShell
Install Windows Azure Certificate for your subscription

Copy and paste the code below.

#Change these to match the source account
$sourceStorageAccountName = "<AzureStorageAccountName>"
$sourceContainerName = "<AzureContainerName>"
$sourceStorageKey = "<AzureStorageSubscriptionKey>" #Need this if moving data from a different subscription 
  
#Destination Account Information  
$destStorageAccountName = Read-Host "Enter Destination Storage Account Name"
$destStorageAccountKey = Get-AzureStorageKey $destStorageAccountName | %{$_.Primary}
$destContainerName = Read-Host "Enter Destination Container Name"
  
$sourceContext = New-AzureStorageContext -StorageAccountName $sourceStorageAccountName -StorageAccountKey $sourceStorageKey -Protocol Http
$destContext = New-AzureStorageContext -StorageAccountName $destStorageAccountName -StorageAccountKey $destStorageAccountKey
  
$uri = $sourceContext.BlobEndPoint + $sourceContainerName +"/"
# Copy Operation
Get-AzureStorageBlob `
    -Context $sourceContext `
    -Container $sourceContainerName | 
    ForEach-Object `
    { Start-AzureStorageBlobCopy `
        -SrcUri "$uri$($_.Name)" `
        -DestContext $destContext `
        -DestContainer $destContainerName `
        -DestBlob "hackathon/$($_.Name)" `
    }
#Checking Status of Blob Copy -- This can be commented out if no confirmation is needed 
Get-AzureStorageBlob `
    -Context $sourceContext `
    -Container $sourceContainerName |
    ForEach-Object `
    { Get-AzureStorageBlobCopyState `
        -Blob $_.Name `
        -Context $destContext `
        -Container $destContainerName `
        -WaitForComplete `
    }


If the data you are copying is on a public blob, you can use anonymous access instead of providing $sourceStorageKey. In that case, change $sourceContext to something like the following:

$sourceContext = New-AzureStorageContext -Anonymous -StorageAccountName $sourceStorageAccountName -Protocol Http

Automating HDInsight cluster creation with PowerShell

If I have to do some work more than twice, I like to automate the process (or at least try to). Recently, I became interested in spinning up my own HDInsight cluster in Windows Azure with some customization. Since it was work I had to do more than once, I automated the process with PowerShell. In the following code examples, I will explain what I did and why.

Prerequisite:
1. Windows Azure subscription
2. Windows PowerShell V2
3. Windows Azure PowerShell and Microsoft .net SDK for Hadoop
4. Follow the link from #3 to download and import publishsettings

Now let’s start with the parameters. In this first section, I gather the parameters from the user of the script.

If you are creating your cluster for the first time and you don’t have any data yet here are the options to use:
1. $clusterName = Name that you would like to give your cluster
2. $clusterNodes = Number of data nodes you want
3. $location = Location of the cluster, provided in double quotes, e.g. “West US”
4. $storageAccountName = New storage account
5. $newContainerName = New container

If you have existing data in WASB (Windows Azure Storage Blob) the parameters are as following: 
1. $clusterName = Name that you would like to give your cluster
2. $clusterNodes = Number of data nodes you want
3. $existingStorage = An existing storage account. No need to worry, you will be prompted to choose it
4. $existingContainer = If there is an existing container that houses your data
5. $existingMetastoreDatabase = An existing Hive/Oozie database on SQL Azure that can serve as the metadata repository

param
    (
     [Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$false, HelpMessage='Provide a new cluster name for your HDInsight services')]
     [string] $clusterName,
     [Parameter(Position=1, Mandatory=$true, ValueFromPipeline=$false, HelpMessage='Provide the # of nodes you would like in the cluster' )]
     [int] $clusterNodes,
     [Parameter(Position=2, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='-existingStorage' )]
     [switch] $existingStorage,
     [Parameter(Position=3, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='-existingContainer' )]
     [switch] $existingContainer,
     [Parameter(Position=4, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='SQL Azure Server and Database must exist')]
     [switch] $existingMetastoreDatabase,
     [Parameter(Position=5, Mandatory=$false, ParameterSetName='new', ValueFromPipeline=$false, HelpMessage='Provide location if creating new storage account' )]
     [string] $location,
     [Parameter(Position=6, Mandatory=$false, ParameterSetName='new', ValueFromPipeline=$false, HelpMessage='Provide a name if creating a new storage account')]
     [string] $storageAccountName,
     [Parameter(Position=7, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='Provide a new container name')]
     [string] $newContainerName,
     [Parameter(Position=8, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='Provide a subscription name if more than one subscription exists')]
     [string] $subscriptionName
     )

Now the fun begins. This part of the code evaluates whether your Windows Azure PowerShell (or Windows PowerShell) session is connected to your Azure account. It does this with a try-catch block that attempts to create a new storage account or access an existing one. If it cannot, it will fail and ask you to run Add-AzureAccount for your Windows Azure subscription.

This part of the code will also prompt you to choose an existing storage account if you used the second set of parameters above, or create a new storage account if you used the first set.

try{
    If($storageAccountName)
        {
            if($location)
            {
                $storageAccountName = $storageAccountName.ToLower()
                Write-host "Using storage account:"$storageAccountName "at" $location "data center" -foreground Green
                New-AzureStorageAccount -StorageAccountName $storageAccountName -Location $location
            }
            Else
            {
                Write-Host""
                Write-Host "Please provide a location when creating a new storage account. See example." -foreground RED
                Exit
            }   
        }
    Elseif($existingStorage)
        {
            $allAzureStorageAccounts = Get-AzureStorageAccount
            $azureStorageAccounts = $allAzureStorageAccounts.label
  
            if($azureStorageAccounts)
            {
                [int]$y=1
                $choice =@{}
                Write-Host ""
                ForEach ($storageAccount in $azureStorageAccounts)
                {
                    Write-Host $y : $storageAccount
                    $choice=$choice+@{$y=$storageAccount}
                    $y++
                }
            }
            Write-Host ""
            [int]($input)=Read-host "Enter the corresponding number to the existing storage" 
            if(($input -gt $y) -or ($input -eq 0))
                {
                    Write-Host "No associated Storage Account for that number" -foreground RED
                    Exit
                }
                else
                {
                    $storageAccountName = $choice.($input)  
                    $location = ($allAzureStorageAccounts | where -Property Label -eq $storageAccountName).GeoPrimaryLocation
                    Write-host ""
                    Write-host "Using existing storage account:"$storageAccountName "at" $location "data center" -foreground Green
                          
                }
        }
    Else
        {
            Write-host ""
            Write-Host "Either Provide the switch -existingStorage or provide new Storage Account Name and Location. See example" -foreground RED
            Exit
        }
}
catch {
    "$_"
    exit
}

Currently, HDInsight cluster creation will error out if it is attempted on a storage account that was created in an affinity group. The cryptic message shown on the Windows Azure Portal is “The request has failed. Please contact support for more information”. To remedy this, I added a simple check; I also resolve the storage account key and create a destination context.

#Currently it is not possible to create a HDInsight cluster on an affinitygroup. May remove when it's available.
$affinitysetting = get-azurestorageaccount -storageaccountname $storageAccountName | %{$_.affinitygroup}
if ($affinitysetting)
    {
        Write-Host "Cannot create HDInsight cluster on affinity group. Choose a storage account that is not part of an affinity group" -foreground RED
        Exit
    }   
      
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{$_.Primary}
# Create a storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

The following code block checks whether the user passed -existingContainer or is creating a new container. If an existing container is the intention, you will again be prompted with a choice.

if($newContainerName)
    {
    # Create a Blob storage container
    New-AzureStorageContainer -Name $newContainerName -Context $destContext
    $containerName = $newContainerName
    }
Elseif($existingContainer)
    {
        [int]$i=1
        $containerChoices = ($destContext|Get-AzureStorageContainer ).Name
        $containerChoice =@{}
        Write-Host ""
        ForEach ($cont in $containerChoices)
            {
                Write-Host $i : $cont
                $containerChoice=$containerChoice+@{$i=$cont}
                $i++
            }
        Write-Host ""
        [int]($contInput)=Read-host "Enter the corresponding number to the container" 
        if(($contInput -gt $i) -or ($contInput -eq 0))
            {
                Write-Host "No associated container for that number" -foreground RED
            }
        Else
            {   
                $containerName = $containerChoice.($contInput)
                Write-host ""
                Write-host "Using existing container:"$containerName -foreground Green      
            }           
    }
Else
    {
        Write-host ""
        Write-Host "Either Provide the switch -existingContainer container or provide new Container Name and Location. See example" -foreground RED
        Exit
    }

This last part of the code will do one of two things:
1. Create a new cluster with SQL Azure as the Hive and Oozie metastore (the recommended option)
2. Create a new cluster without SQL Azure as the metadata store, in which case Hive uses the internal Derby database

I highly recommend creating an empty SQL Azure database (let’s call it metastore) and using it as the Hive/Oozie metadata store (the first option). This helps when the name node gets corrupted or becomes unavailable, or the Derby database becomes unavailable for whatever reason. This way, the only thing you lose is the cluster; if you spin up another cluster, the data and the metadata are preserved. All you need is an empty SQL Azure database, because the schema tables will be created automatically by the HDInsight cluster creation process. If you have SQL Azure databases created, you will be prompted to choose one when you use the option –existingMetastoreDatabase. You must know the correct username and password for the SQL Azure database; a wrong username or password will result in the same cryptic message I mentioned earlier.

if(!$existingMetastoreDatabase)
    {
        # Create a new HDInsight cluster
        $hdCred = Get-Credential -Message "Provide Username and Password for your HDInsight cluster. Password requires at least 10 characters with 1 Uppercase, 1 lowercase, 1 number and 1 special character"
        New-AzureHDInsightCluster -Subscription $subscriptionName -Name $clusterName -Location $location -DefaultStorageAccountName "${storageAccountName}.blob.core.windows.net" `
        -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainerName $containerName -ClusterSizeInNodes $clusterNodes -Credential $hdCred
    }
Elseif($existingMetastoreDatabase)
    {
        $sqlServers=(Get-AzureSqlDatabaseServer).ServerName
        [int]$j=1
        $serverChoice = @{}
        $databaseChoice = @{}
        Write-Host ""
        forEach ($sqlServer in $sqlServers)
        {
            $sqlDatabaseName = (Get-AzureSqlDatabase -ServerName $sqlServer).Name | where {$_ -ne "master"}
            ForEach ($sqlDatabase in $sqlDatabaseName)
            {
                Write-Host $j : $sqlDatabase "on Server" $sqlServer
                $serverChoice = $serverChoice+@{$j=$sqlServer}
                $databaseChoice = $databaseChoice+@{$j=$sqlDatabase}
                $j++
            }
        }
        Write-Host ""
        [int]($serverDatabaseInput) = Read-host "Enter the corresponding number to the Database" 
        if(($serverDatabaseInput -gt $j) -or ($serverDatabaseInput -eq 0))
            {
                Write-Host "No associated database for that number" -foreground RED
            }
        Else
            {   
                $sqlServerName = $serverChoice.($serverDatabaseInput)
                $metastoreDatabase = $databaseChoice.($serverDatabaseInput)
            }
        Write-Host ""
        $metastoreUsername = Read-Host "Enter SqlServer Username"
        $metastorePassword = Read-Host "Enter SqlServer Password" -AsSecureString
        $cred = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $metastoreUsername, $metastorePassword
        $hdCred = Get-Credential -Message "Provide Username and Password for your HDInsight cluster. Password requires at least 10 characters with 1 Uppercase, 1 lowercase, 1 number and 1 special character"
        # Create a new HDInsight cluster with meta-store
        New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes `
        | Set-AzureHDInsightDefaultStorage -StorageAccountName "${storageAccountName}.blob.core.windows.net" -StorageAccountKey $storageAccountKey -StorageContainerName $containerName `
        | Add-AzureHDInsightMetastore -SqlAzureServerName "${sqlServerName}.database.windows.net" -DatabaseName $metastoreDatabase -Credential $cred -MetastoreType HiveMetaStore `
        | Add-AzureHDInsightMetastore -SqlAzureServerName "${sqlServerName}.database.windows.net" -DatabaseName $metastoreDatabase -Credential $cred -MetastoreType OozieMetaStore `
        | New-AzureHDInsightCluster -Subscription $subscriptionName -Name $clusterName -Location $location -Credential $hdCred
    }
Else
    {
        Write-Host "Provide any missing parameter(s). See example"
    }

You will notice that I am using the same SQL Azure database for both the Hive and Oozie metastores. I am doing this for the simplicity of the script. You can easily modify the code to separate the Hive and Oozie metastores if you run into concurrency issues.
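A hedged sketch of that modification, reusing the same cmdlets as above (hivemetastore and ooziemetastore are made-up names for two pre-created, empty SQL Azure databases assumed to live on the same server):

```powershell
# Illustrative only: assumes two empty SQL Azure databases named
# hivemetastore and ooziemetastore already exist on $sqlServerName.
New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes `
| Set-AzureHDInsightDefaultStorage -StorageAccountName "${storageAccountName}.blob.core.windows.net" -StorageAccountKey $storageAccountKey -StorageContainerName $containerName `
| Add-AzureHDInsightMetastore -SqlAzureServerName "${sqlServerName}.database.windows.net" -DatabaseName "hivemetastore" -Credential $cred -MetastoreType HiveMetaStore `
| Add-AzureHDInsightMetastore -SqlAzureServerName "${sqlServerName}.database.windows.net" -DatabaseName "ooziemetastore" -Credential $cred -MetastoreType OozieMetaStore `
| New-AzureHDInsightCluster -Subscription $subscriptionName -Name $clusterName -Location $location -Credential $hdCred
```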

Here are some examples that you can use to create your HDInsight cluster.

This first example creates a 2-node HDInsight cluster called myhdinsight on an existing storage account and container, and uses an existing metastore on a SQL Azure database.

.\HDInsightClusterDeploy.ps1 -clustername myhdinsight -clusternodes 2 -existingstorage -existingcontainer –existingmetastoredatabase

The second example creates a 2-node HDInsight cluster called myhdinsight in the West US data center with a new storage account named newstorage, and also creates a new container called newcontainer.

.\HDInsightClusterDeploy.ps1 -clustername myhdinsight -clusternodes 2  -location "west us" -storageaccountname newstorage -newcontainername newcontainer

This third example creates a 2-node HDInsight cluster called myhdinsight on an existing storage account and container, and uses Derby as the metastore. Again, I do not recommend this option, but it can be used for test purposes to quickly spin up a cluster.

.\HDInsightClusterDeploy.ps1 -clustername myhdinsight -clusternodes 2 -existingstorage –existingcontainer


Source:
Provision HDInsight Clusters
Microsoft .NET SDK for Hadoop

Special thanks to Cindy Gross, who had me put on my thinking cap and suggested many improvements to my code. You can follow her blogs on MSDN.

Appendix:
Putting it all together for the pleasure of copy and paste. Save it as HDInsightClusterDeploy.ps1. Voilà!

##############################################################################################################################
# .SYNOPSIS
#    This script, HDInsightClusterDeploy.ps1 is used for creating new HDInsight Services Cluster
#
# .NOTES
#    Filename: HDInsightClusterDeploy.ps1
#    Author: Murshed Zaman & Cindy Gross
#    Date: 11/13/2013
#    Requires: PowerShell V2
#              Windows Azure PowerShell (http://www.windowsazure.com/en-us/manage/services/hdinsight/install-and-configure-powershell-for-hdinsight/)
#               Microsoft .NET SDK for Hadoop (same URL as above)
#               Follow: Connect to your subscription from the above mentioned URL to Get/Import-AzurePublishSettingsFile once
#               Run Add-AzureAccount    on Windows Azure PowerShell for connection to your default subscription
#    Version: 1.0
#    Revision: none
#
# .DESCRIPTION
#    This script will create HDInsight Cluster for a given Azure subscription. There are some mandatory
#     and some optional parameters. See examples for clarification.
#
# .EXAMPLE (Using existing Azure Storage. User is prompted for selecting from their existing storage.)
#    .\HDInsightClusterDeploy.ps1 -clusterName myhdinsight -clusterNodes 2 -existingStorage -existingContainer
#
# .EXAMPLE (Supply the "location" in double quotes, a new storage account name and new container name)
#    .\HDInsightClusterDeploy.ps1 -clusterName myhdinsight -clusterNodes 2  -location "West US" -storageAccountName newstorage -newContainerName newcontainer
#
# .EXAMPLE (Use existing SQL Azure as Oozie and Hive MetaStore on existing storage/container. SQL Azure Server and database must exist, firewall configured, windows azure services on)
#    .\HDInsightClusterDeploy.ps1 -clusterName myhdinsight -clusterNodes 2 -existingStorage -existingContainer -existingMetastoreDatabase
#
##############################################################################################################################

param
    (
     [Parameter(Position=0, Mandatory=$true, ValueFromPipeline=$false, HelpMessage='Provide a new cluster name for your HDInsight services')]
     [string] $clusterName,
     [Parameter(Position=1, Mandatory=$true, ValueFromPipeline=$false, HelpMessage='Provide the # of nodes you would like in the cluster' )]
     [int] $clusterNodes,
     [Parameter(Position=2, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='-existingStorage' )]
     [switch] $existingStorage,
     [Parameter(Position=3, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='-existingContainer' )]
     [switch] $existingContainer,
     [Parameter(Position=4, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='SQL Azure Server and Database must exist')]
     [switch] $existingMetastoreDatabase,
     [Parameter(Position=5, Mandatory=$false, ParameterSetName='new', ValueFromPipeline=$false, HelpMessage='Provide location if creating new storage account' )]
     [string] $location,
     [Parameter(Position=6, Mandatory=$false, ParameterSetName='new', ValueFromPipeline=$false, HelpMessage='Provide a name if creating a new storage account')]
     [string] $storageAccountName,
     [Parameter(Position=7, Mandatory=$false, ValueFromPipeline=$false, HelpMessage='Provide a new container name')]
     [string] $newContainerName
     )
     
            
$multipleSubscription = Get-AzureSubscription|%{$_.SubscriptionName}
if($multipleSubscription.Count -gt 1)
    {
        [int]$s=1
        $subChoice = @{}
        Write-Host ""
        forEach($subName in $multipleSubscription)
        {
            Write-Host $s : $subName
            $subChoice = $subChoice+@{$s=$subName}
            $s++
        }
        Write-Host ""
        [int]($subNumber) = Read-Host "Enter the corresponding number to the subscription you would like to use"
        if(($subNumber -gt $s) -or ($subNumber -eq 0))
        {
            Write-Host "No associated subscription for that number" -foreground RED
            Exit
        }
        else
        {
            $selectSubscription = $subChoice.($subNumber)
            Select-AzureSubscription -SubscriptionName ${selectSubscription} -Current 
            $subscriptionName = $selectSubscription
        }
    }
else
    {
        $subscriptionName = $multipleSubscription
        Select-AzureSubscription -SubscriptionName ${subscriptionName} -Current
    }


try{
    If($storageAccountName)
        {
            if($location)
            {
                $storageAccountName = $storageAccountName.ToLower()
                Write-host "Using storage account:"$storageAccountName "at" $location "data center" -foreground Green
                New-AzureStorageAccount -StorageAccountName $storageAccountName -Location $location
            }
            Else
            {
                Write-Host""
                Write-Host "Please provide a location when creating a new storage account. See example." -foreground RED
                Exit
            }    
        }
    Elseif($existingStorage)
        {
            $allAzureStorageAccounts = Get-AzureStorageAccount
            $azureStorageAccounts = $allAzureStorageAccounts.label

            if($azureStorageAccounts)
            {
                [int]$y=1
                $choice =@{}
                Write-Host ""
                ForEach ($storageAccount in $azureStorageAccounts)
                {
                    Write-Host $y : $storageAccount
                    $choice=$choice+@{$y=$storageAccount}
                    $y++
                }
            }
            Write-Host ""
            [int]($storageInput) = Read-Host "Enter the corresponding number to the existing storage"
            if(($storageInput -ge $y) -or ($storageInput -lt 1))
                {
                    Write-Host "No associated Storage Account for that number" -foreground RED
                    Exit
                }
                else
                {
                    $storageAccountName = $choice.($storageInput)
                    $location = ($allAzureStorageAccounts | where -Property Label -eq $storageAccountName).GeoPrimaryLocation
                    Write-host ""
                    Write-host "Using existing storage account:"$storageAccountName "at" $location "data center" -foreground Green
                        
                }
        }
    Else
        {
            Write-host ""
            Write-Host "Either provide the switch -existingStorage or provide a new Storage Account Name and Location. See example" -foreground RED
            Exit
        }
}
catch {
    "$_"
    Exit
}
#Currently it is not possible to create a HDInsight cluster on an affinitygroup. May remove when it's available.
$affinitysetting = Get-AzureStorageAccount -StorageAccountName $storageAccountName | %{$_.AffinityGroup}
if ($affinitysetting)
    {
        Write-Host "Cannot create HDInsight cluster on affinity group. Choose a storage account that is not part of an affinity group" -foreground RED
        Exit
    }    
    
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{$_.Primary}
# Create a storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

if($newContainerName)
    {
    # Create a Blob storage container
    New-AzureStorageContainer -Name $newContainerName -Context $destContext
    $containerName = $newContainerName
    }
Elseif($existingContainer)
    {
        [int]$i=1
        $containerChoices = ($destContext|Get-AzureStorageContainer ).Name
        $containerChoice =@{}
        Write-Host ""
        ForEach ($cont in $containerChoices)
            {
                Write-Host $i : $cont
                $containerChoice=$containerChoice+@{$i=$cont}
                $i++
            }
        Write-Host ""
        [int]($contInput)=Read-host "Enter the corresponding number to the container" 
        if(($contInput -ge $i) -or ($contInput -lt 1))
            {
                Write-Host "No associated container for that number" -foreground RED
                Exit
            }
        Else
            {    
                $containerName = $containerChoice.($contInput)
                Write-host ""
                Write-host "Using existing container:"$containerName -foreground Green        
            }            
    }
Else
    {
        Write-host ""
        Write-Host "Either provide the switch -existingContainer or provide a new Container Name. See example" -foreground RED
        Exit
    }

if(!$existingMetastoreDatabase)
    {
        # Create a new HDInsight cluster
        $hdCred = Get-Credential -Message "Create admin Username and Password for your HDInsight cluster. Password requires at least 10 characters with 1 Uppercase, 1 lowercase, 1 number and 1 special character"
        New-AzureHDInsightCluster -Name $clusterName -Location $location -DefaultStorageAccountName "${storageAccountName}.blob.core.windows.net" `
        -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainerName $containerName -ClusterSizeInNodes $clusterNodes -Credential $hdCred
    }
Elseif($existingMetastoreDatabase)
    {
        $allSqlServers= Get-AzureSqlDatabaseServer
        $sqlServers = ($allSqlServers |where -Property Location -eq $location).ServerName
        if ($sqlServers)
        {
            [int]$j=1
            $serverChoice = @{}
            $databaseChoice = @{}
            Write-Host ""
            forEach ($sqlServer in $sqlServers)
            {
                $sqlDatabaseName = (Get-AzureSqlDatabase -ServerName $sqlServer).Name | where {$_ -ne "master"}
                ForEach ($sqlDatabase in $sqlDatabaseName)
                {
                    Write-Host $j : $sqlDatabase "on Server" $sqlServer
                    $serverChoice = $serverChoice+@{$j=$sqlServer}
                    $databaseChoice = $databaseChoice+@{$j=$sqlDatabase}
                    $j++
                }
            }
            Write-Host ""
            [int]($serverDatabaseInput) = Read-host "Enter the corresponding number to the Database" 
            if(($serverDatabaseInput -ge $j) -or ($serverDatabaseInput -lt 1))
                {
                    Write-Host "No associated database for that number" -foreground RED
                    Exit
                }
            Else
                {    
                    $sqlServerName = $serverChoice.($serverDatabaseInput)
                    $metastoreDatabase = $databaseChoice.($serverDatabaseInput)
                }
            Write-Host ""
            $metastoreUsername = Read-Host "Enter SqlServer Username"
            $metastorePassword = Read-Host "Enter SqlServer Password" -AsSecureString
            $cred = New-Object -TypeName System.Management.Automation.PSCredential -ArgumentList $metastoreUsername, $metastorePassword
            $hdCred = Get-Credential -Message "Create admin Username and Password for your HDInsight cluster. Password requires at least 10 characters with 1 Uppercase, 1 lowercase, 1 number and 1 special character"
            # Create a new HDInsight cluster with meta-store
            New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes `
            | Set-AzureHDInsightDefaultStorage -StorageAccountName "${storageAccountName}.blob.core.windows.net" -StorageAccountKey $storageAccountKey -StorageContainerName $containerName `
            | Add-AzureHDInsightMetastore -SqlAzureServerName "${sqlServerName}.database.windows.net" -DatabaseName $metastoreDatabase -Credential $cred -MetastoreType HiveMetaStore `
            | Add-AzureHDInsightMetastore -SqlAzureServerName "${sqlServerName}.database.windows.net" -DatabaseName $metastoreDatabase -Credential $cred -MetastoreType OozieMetaStore `
            | New-AzureHDInsightCluster -Name $clusterName -Location $location -Credential $hdCred
        }
        else
        {
            Write-Host "You need to create a SQL Server Database in" $location "to use this option" -foreground RED
            Exit
        }
    }

TPCH schema for Hadoop – Hive

Recently I had some fun installing Hadoop on my virtual cluster environment. After the installation I started thinking about moving some data over to Hadoop and creating some Hive tables – after all, I play the part of a data architect in real life :). The following is what I did to load TPCH data residing on my local D: drive:

hadoop fs -put D:\tpch_1gb\data\nation.tbl /tpch_1gb/nation/nation.tbl
hadoop fs -put D:\tpch_1gb\data\orders.tbl /tpch_1gb/orders/orders.tbl
hadoop fs -put D:\tpch_1gb\data\part.tbl /tpch_1gb/part/part.tbl
hadoop fs -put D:\tpch_1gb\data\partsupp.tbl /tpch_1gb/partsupp/partsupp.tbl
hadoop fs -put D:\tpch_1gb\data\region.tbl /tpch_1gb/region/region.tbl
hadoop fs -put D:\tpch_1gb\data\supplier.tbl /tpch_1gb/supplier/supplier.tbl
hadoop fs -put D:\tpch_1gb\data\lineitem.tbl /tpch_1gb/lineitem/lineitem.tbl
hadoop fs -put D:\tpch_1gb\data\customer.tbl /tpch_1gb/customer/customer.tbl
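The eight put commands follow a single pattern, so they can also be generated in a loop. Here is a dry-run sketch in a POSIX shell that only prints each command for review (it assumes the same local and HDFS paths as above; drop it into whatever shell wraps your hadoop client to actually run them):

```shell
# Print one "hadoop fs -put" command per TPC-H table file.
# Dry-run only: nothing is uploaded; paths match the listing above.
for t in nation orders part partsupp region supplier lineitem customer; do
  printf 'hadoop fs -put D:\\tpch_1gb\\data\\%s.tbl /tpch_1gb/%s/%s.tbl\n' "$t" "$t" "$t"
done
```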


Note that I am creating these tables as “external tables” to avoid keeping another copy of the data in my HDFS space. When you create external tables, Hive acts as a metadata repository only, forgoing the need to copy the data into its default warehouse directory. DDL for TPCH:

CREATE EXTERNAL TABLE IF NOT EXISTS lineitem ( l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber BIGINT, l_quantity FLOAT, l_extendedprice FLOAT, l_discount FLOAT, l_tax FLOAT, l_returnflag STRING, l_linestatus STRING, l_shipdate STRING, l_commitdate STRING, l_receiptdate STRING, l_shipinstruct STRING, l_shipmode STRING, l_comment STRING ) COMMENT 'CREATED BY MURSHEDZ' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch_1gb/lineitem';

CREATE EXTERNAL TABLE IF NOT EXISTS customer ( c_custkey BIGINT, c_name STRING, c_address STRING, c_nationkey INT, c_phone STRING, c_acctbal FLOAT, c_mktsegment STRING, c_comment STRING ) COMMENT 'CREATED BY MURSHEDZ' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch_1gb/customer';

CREATE EXTERNAL TABLE IF NOT EXISTS nation ( n_nationkey INT, n_name STRING, n_regionkey INT, n_comment STRING ) COMMENT 'CREATED BY MURSHEDZ' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch_1gb/nation';

CREATE EXTERNAL TABLE IF NOT EXISTS orders ( o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus STRING, o_totalprice FLOAT, o_orderdate STRING, o_orderpriority STRING, o_clerk STRING, o_shippriority INT, o_comment STRING ) COMMENT 'CREATED BY MURSHEDZ' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch_1gb/orders';

CREATE EXTERNAL TABLE IF NOT EXISTS part ( p_partkey BIGINT, p_name STRING, p_mfgr STRING, p_brand STRING, p_type STRING, p_size INT, p_container STRING, p_retailprice FLOAT, p_comment STRING ) COMMENT 'CREATED BY MURSHEDZ' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch_1gb/part';

CREATE EXTERNAL TABLE IF NOT EXISTS partsupp ( ps_partkey BIGINT, ps_suppkey BIGINT, ps_availqty INT, ps_supplycost FLOAT, ps_comment STRING ) COMMENT 'CREATED BY MURSHEDZ' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch_1gb/partsupp';

CREATE EXTERNAL TABLE IF NOT EXISTS region ( r_regionkey INT, r_name STRING, r_comment STRING ) COMMENT 'CREATED BY MURSHEDZ' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch_1gb/region';

CREATE EXTERNAL TABLE IF NOT EXISTS supplier ( s_suppkey BIGINT, s_name STRING, s_address STRING, s_nationkey INT, s_phone STRING, s_acctbal FLOAT, s_comment STRING ) COMMENT 'CREATED BY MURSHEDZ' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch_1gb/supplier';

Now you can run TPCH queries against it (with modifications). Queries to come later.

Parallelize and load – Basics of splitting a table

If you are loading data from one SQL Server to another, you have options:

    1. Bcp out to a file and then bcp in.
    2. Use SSIS to load the data directly.

If the table is large, you can split up the file creation, or increase the number of data flow tasks in SSIS, to load the table faster. This is a very basic way to parallelize a load. Here is a simple trick to split up the table; it works well if the table in question has at least one reasonably distinct integer-type column.

Example:

Table DDL –

CREATE TABLE [dbo].[Tablename](
    [DateKey] [int] NOT NULL,
    [AccountId] [int] NULL,
    [OrderItemId] [bigint] NOT NULL,
    [AdId] [int] NOT NULL,
    [More columns]….
)
GO

Using a mod function on the distinct integer-type column to split the table into 5 pieces:

Select * from [DatabaseName]..[Tablename] where OrderItemId%5 = 0

Select * from [DatabaseName]..[Tablename] where OrderItemId%5 = 1

Select * from [DatabaseName]..[Tablename] where OrderItemId%5 = 2

Select * from [DatabaseName]..[Tablename] where OrderItemId%5 = 3

Select * from [DatabaseName]..[Tablename] where OrderItemId%5 = 4

This way of splitting up a table can be used in a bcp-out call or in SSIS tasks. You can use any divisor number for the mod function. Just remember to use a 0-based number on the other side of the equal sign.

As an example, if you want to divide your table into 10 pieces –

Select * from [DatabaseName]..[Tablename] where OrderItemId%10 = 0

…

Select * from [DatabaseName]..[Tablename] where OrderItemId%10 = 9
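Typing these statements out by hand gets tedious for larger divisors, so they can be generated; here is a small POSIX-shell sketch (the table and column names are the same placeholders used in the queries above):

```shell
# Print one modulo-split SELECT per bucket; remainders run 0 .. pieces-1.
# Table and column names below are placeholders, as in the queries above.
table='[DatabaseName]..[Tablename]'
column='OrderItemId'
pieces=10
i=0
while [ "$i" -lt "$pieces" ]; do
  echo "Select * from $table where $column%$pieces = $i"
  i=$((i + 1))
done
```

Change `pieces` to match however many parallel streams you want; each printed SELECT becomes one bcp query or one SSIS data flow source.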

The following script can help in creating the bcp-out commands.

DECLARE @servername sysname, @dbname sysname, @tablename sysname, @outputdir sysname
        ,@splitbyCol varchar(300), @numsplits int, @totalsplits int
SELECT  @servername = @@SERVERNAME
       ,@dbname = DB_NAME()
       ,@outputdir = 'c:\temp\'
       ,@tablename = 'customer'
       ,@splitbyCol = 'customerID'
       ,@numsplits = 0
       ,@totalsplits = 5

WHILE (@numsplits < @totalsplits)
BEGIN
SELECT 'bcp "Select * from ' +@dbname+'.'+OBJECT_SCHEMA_NAME(object_id)+'.'+ name + ' Where '+@splitbyCol+ '%'+Cast(@totalsplits as varchar(5))+ ' = '+Cast(@numsplits as varchar(5)) +'" queryout '
       + @outputdir + name + '.dat.'+Cast(@numsplits as varchar(5))+' -b 10000 -T -c -S ' + @servername
       FROM sys.objects
       WHERE type_desc = 'USER_TABLE'
       AND name = @tablename

       SET @numsplits = @numsplits+1
END

Some blogsites that I follow