November | 2022 | Murshed Zaman

Flattening JSON docs in Snowflake Data Warehouse

November 5, 2022 by Murshed Leave a comment

In this blog, I will go over some tips and tricks that I used to flatten JSON data into a relational form in the Snowflake data warehouse.

There are many examples of how to load JSON data into the Snowflake data warehouse, thus I am going to skip that part. But it is worth mentioning that we do a full load of our MongoDB over the weekend and the rest of the time we do delta loads based on the record updated date in MongoDB. When loading the delta portion of the JSON docs we heavily use what I call the delete-insert process with joins to insert the data into JSON tables. We use Snowpipe to load the data from MongoDB to Snowflake.

My JSON examples are from MongoDB, but any valid JSON will work. My screenshots of JSON documents are from MongoDB Compass, a UI tool that allows you to view/update/delete data in MongoDB. For purposes of this being a public blog, I will blur and obfuscate any identifying information. But you will learn how to write SQL code to flatten JSON data on Snowflake with the code examples.

First, here is a picture of one of the JSON documents:

I usually use a ‘create or replace table/view <name> as’ statement to create a table or a view at the beginning of the following code. I am describing the patterns that I follow, obviously, you decide what works for your data.

Some explanations:

While flattening string characters, I always use the LEFT function to limit the number of characters as I do not want my code to fail if there are more characters one day in the fields that I am working with in the future.
GUID ids sometimes will need “$oid” label to be able to flatten and “$date” for date values.
I always cast my columns to a data type.
The billing address is a higher-level object and address1 is a child object thus a dot notation is needed to bring out address1 and others.
I am just interested in data that are not soft deleted on MongoDB by restricting to deleted = false.
Why is everything v:? When we load JSON data in the table, the column name we use is v, thus v: notation to refer to the data elements.

SELECT 
    LEFT(v:_id:"$oid", 24)::VARCHAR(24)  AS id
,   LEFT(v:name, 100)::NVARCHAR(100) AS name
,   v:createdDate:"$date"::TIMESTAMP AS createdDate
,   v:updatedDate:"$date"::TIMESTAMP AS  updatedDate
,   LEFT(v:accountStatus, 20)::VARCHAR(20) AS accountStatus
,   LEFT(v:billingId, 20)::VARCHAR(30) AS billingId
,   LEFT(v:accountLevel, 20)::VARCHAR(20) AS accountLevel
,   LEFT(v:phone, 30)::NVARCHAR(30) AS phone
,   LEFT(v:fax, 30)::NVARCHAR(30) AS fax
,   LEFT(v:taxId, 50)::NVARCHAR(50) AS taxId
,   LEFT(v:description, 500)::NVARCHAR(500) as description
,   LEFT(v:billingAddress.address1, 100)::NVARCHAR(100) as billingaddress1
,   LEFT(v:billingAddress.address2, 100)::NVARCHAR(100) as billingaddress2
,   LEFT(v:billingAddress.unitNumber, 30)::NVARCHAR(30) as billingUnitNumber
,   LEFT(v:billingAddress.city, 100)::NVARCHAR(100) as billingCity
,   LEFT(v:billingAddress.state, 100)::NVARCHAR(100) as billingState
,   LEFT(v:billingAddress.stateIso, 2)::NCHAR(2) as billingStateIso
,   LEFT(v:billingAddress.postalCode, 20)::NVARCHAR(20) as billingPostalCode
,   LEFT(v:billingAddress.country, 100)::NVARCHAR(100) as billingCountry
,   LEFT(v:billingAddress.countryIso, 2)::CHAR(2) as billingcountryIso
,   LEFT(v:website, 50)::NVARCHAR(50) as website
,   CASE WHEN v:deleted='true' THEN 1 else 0 end as deleted
FROM 
    <database>.<schema>.<json_table>
WHERE
    v:deleted = 'false'
;

Way to think about arrays: JSON documents are made of key/value pairs. Think of the Key as the column name, and the value holds the measure for that key. My definition of an array is a way to store different values using the same key (a list of values). In a JSON document, arrays usually will be listed in a square bracket [ ] and if you are looking at the document using MongoDB Compass, it will say the word ‘array’ next to the key (column name).

Two ways to flatten an array:

When there are known quantities of values, I can flatten it by just calling the array with dot notation and array element number. Example: coordinates on a map.
When I do not know the quantity or if an array contains an unknown number of values, I will use the lateral flatten function.

Method 1

In my following example, location is a higher-level array and coordinate is a child array in it. The array coordinates only have two values, but I also need both values to find the point on a map.

SELECT 
   LEFT(v:_id:"$oid", 24)::VARCHAR(24) AS id
   ,v:locations[0].coordinates[0]
   ,v:locations[0].coordinates[1] 
FROM 
    <database>.<schema>.<json_table>
;

Method 2

The other way to flatten an array is to use the lateral flatten function. It is much easier to deal with an array if you use the lateral flatten function provided by Snowflake. I would compare the lateral flatten function as a full outer join in SQL. For each id, the function will flatten everything that was included in the locations array. If an array value is missing from a record, it will be NULL.

SELECT 
    LEFT(v:_id:"$oid", 24)::VARCHAR(24) AS id
,   LEFT(flat2.value:name, 100)::NVARCHAR(100) AS name
,   LEFT(flat2.value:address1, 100)::NVARCHAR(100) AS address1
,   LEFT(flat2.value:city, 100)::NVARCHAR(100) AS city
,   LEFT(flat2.value:state, 100)::NVARCHAR(100) AS state
,   LEFT(flat2.value:postalCode, 20)::NVARCHAR(20) AS postalCode
,   LEFT(flat2.value:country, 100)::NVARCHAR(100) AS country
,   LEFT(flat2.value:googlePlacesId, 450)::VARCHAR(450) AS googlePlacesId
,   CONCAT(flat2.value:coordinates[0]||', '||flat2.value:coordinates[1])::VARCHAR(100) AS coordinates
FROM
    <database>.<schema>.<json_table>
    , LATERAL FLATTEN (INPUT => v:locations, OUTER => TRUE) flat2
;

When working with arrays in a JSON document, sometimes I will need to keep the sort or index order of certain array elements. To help with that I use the zero-based index of the array. In my following example, I have multiple tags for each Id. For business reasons, I need to combine the tags into one field. If I want to keep the order of the tags as they show up on the document, I can use the following code.

SELECT 
    id
,   LISTAGG(tags, ',') WITHIN GROUP (ORDER BY sortId) AS tags
FROM
(
SELECT
    LEFT(v:_id:"$oid", 24)::VARCHAR(24) AS id
,   LEFT(flat2.value, 1000)::VARCHAR(1000) AS tags
,   flat2.index AS sortId
FROM 
    <database>.<schema>.<json_table>
    , LATERAL FLATTEN (input => v:tags, outer => true) flat2
)a
GROUP BY
    id
;

Finally, flattening more than one array using multiple lateral flatten functions in the same query may produce two or more rows for the same id and that may not be the intended behavior. I usually flatten only one array at a time and then create another view where I join all the views together.

Filed under Data Warehouse Tagged with AWS, Data Warehouse, Flatten JSON, flatten json techniques, JSON, JSON Docs, Snowflake, Snowflake data warehouse, Snowpipe

Murshed Zaman

Flattening JSON docs in Snowflake Data Warehouse

Follow me on Twitter

Recent Posts

Archives

Categories