Process Nested Jsons Using AWS Athena
Overview This Article shows how to import a nested json like order and order details in to a flat table using AWS Athena . Athena…

Overview

This Article shows how to import a nested json like order and order details in to a flat table using AWS Athena . Athena is the most powerful tool that can scan millions of nested documents on S3 and transform it to flat structure if needed. Athena will automatically scale up the required CPU to process it without any human intervention.

In this article we will first take some sample nested JSON data structure which we will transform to flat structure.

Example Json Data Structure 

The following is the nested json structure which can exhibit data of order and orderliness . 
Note :  Athena reads the file in the following format only. 

Its converted to this Flat structure

Step 1: Upload File To S3

Once data file is ready you can import to any S3 bucket using s3 upload file option from AWS Console . Following is the screenshot of file uploaded to s3.


Step 2: Access Orders Data Using Athena

This step needs to be careful while creating Athena structure for the provided data file . Athena looks like a relational table structure but it will not store any data. It will directory query the file at run time and provide the result.

  • Create database in athena with following query like traditional sql query.

“Create database testme”

  • Once database got created , create a table which is going to read our json file in s3.

Step 3: Create Athena Table Structure for nested json along with the location of data stored in S3

Create table and access the file

Athena has good inbuilt support to read these kind of nested jsons.  Following is the schema to read orders data file.  

CREATE EXTERNAL TABLE `test_orders1`(
  `details` array<struct<orderno:int,detailno:int,cost:double,partnumber:string,linecode:string,qty:int>> COMMENT ‘from deserializer’, 
  `header` struct<orderno:int,orderby:string,amount:double,orderdate:string> COMMENT ‘from deserializer’)
ROW FORMAT SERDE 
  ‘org.openx.data.jsonserde.JsonSerDe’ 
STORED AS INPUTFORMAT 
  ‘org.apache.hadoop.mapred.TextInputFormat’ 
OUTPUTFORMAT 
  ‘org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat’
LOCATION
  ‘s3://test-bucket/load’
TBLPROPERTIES (
  ‘transient_lastDdlTime’=’1558949443’)

Explanation

Details:

is an array object contains 2 jsons in it. So represented same in the structure. 
`details` 

array<struct<orderno:int,detailno:int,cost:double,partnumber:string,linecode:string,qty:int>>

Header :

Schema for header info loading

  `header` struct<orderno:int,orderby:string,amount:double,orderdate:string>

Location:

 ‘s3://test-bucket/load’

  Which is pointing to s3 bucket to access the file. We can not specify file directly here

  • If schema defined well table will create successfully. 
  • Based on the schema the following is the table structure in Athena

Step 4: Preview Table Data

If You are table is created successfully in AWS Athena now we should be able to see the data from S3 data file which we provided to do that click on the menu (three dotted icon) click on preview table 

  • You can see the data is displaying on results console. Now we have to view the results in a flat view so follow the below steps to access inner objects/fields.

Step 5: Now Let’s show our data in two columns header and detail

Here “item” means nested array contains order details . In order to split arrays in to rows we need to use CROSS JOIN in conjunction with unnest operator.

Data Format after first level transformation:

Use the following query to access 

SELECT
  header,
  item
FROM
  (test_orders1
CROSS JOIN UNNEST("details") t (item))

Explanation:

Header  — header json object from Json File.
Item       —  Item object is refers to details array in json record.

5b. Finally transform header and details objects as flat columns.

SELECT
  header.orderno,
  header.orderby,
  header.amount,
  header.orderdate,
  item.partnumber,
  item.qty,
  item.cost
FROM
  (test_orders1
CROSS JOIN UNNEST("details") t (item))

Output:

Now each orderline represented in flat view. 

5c. If you want to get the total cost of the order by partnumber we can use the below query. 

SELECT
  header.orderno,
  header.orderby,
  header.orderdate,
  item.partnumber,
  (item.qty*item.cost) as totalcost
FROM
  (test_orders1
CROSS JOIN UNNEST("details") t (item))

Export to CSV

Once result set is ready we can export to a csv file.

Csv file;

Athena has flexibility to export big result sets to a csv file.