A JSON file that contains lots of dates and arrays needs to be processed in Snowflake. The user wants to ensure optimal performance while querying the data.
How can this be achieved?
A.
Flatten the data and store it in structured data types in a flattened table. Query the table.
B.
Store the data in a table with a VARIANT data type. Query the table.
C.
Store the data in a table with a VARIANT data type and include STRIP_NULL_VALUES while loading the table. Query the table.
D.
Store the data in an external stage and create views on top of it. Query the views.
From Snowflake documentation
If you are not sure yet what types of operations you want to perform on your semi-structured data, Snowflake recommends storing the data in a VARIANT column for now.
For data that is mostly regular and uses only data types that are native to the semi-structured format you are using (e.g. strings and integers for JSON format), the storage requirements and query performance for operations on relational data and data in a VARIANT column is very similar.
For better pruning and less storage consumption, we recommend flattening your OBJECT and key data into separate relational columns if your semi-structured data includes:
Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
Numbers within strings
Arrays
Non-native values (such as dates and timestamps in JSON) are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.
I'm undecided between A vs. B. In a real-world task, I would do B and do some ELT if needed to prep the data for analysis.
Based on the docs below, it appears that A is recommended by Snowflake as more performant.
https://docs.snowflake.com/en/user-guide/semistructured-considerations
Storing Semi-structured Data in a VARIANT Column vs. Flattening the Nested Structure¶
For better pruning and less storage consumption, we recommend flattening your OBJECT and key data into separate relational columns if your semi-structured data includes:
Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
Numbers within strings
Arrays
I hesitate between A&B And will be happy to provoke a discussion
If you know your use cases for the data, perform tests on a typical data set. Load the data set into a VARIANT column in a table. Use the FLATTEN function to extract the OBJECTs and keys you plan to query into a separate table. Run a typical set of queries against both tables to see which structure provides the best performance.
https://docs.snowflake.com/en/user-guide/semistructured-considerations
I have changed my opinion to B
For better pruning and less storage consumption, we recommend flattening your OBJECT and key data into separate relational columns if your semi-structured data includes:
Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
Numbers within strings
Arrays
A voting comment increases the vote count for the chosen answer by one.
Upvoting a comment with a selected answer will also increase the vote count towards that answer by one.
So if you see a comment that you already agree with, you can upvote it instead of posting a new comment.
user_1011
1 month agoicegrandpa
9 months, 1 week ago0e504b5
10 months, 1 week agopvskbrod
1 year, 3 months agopvskbrod
1 year, 3 months agoRajivnb
1 year, 2 months agoMultiCloudIronMan
1 year, 5 months ago