Updated 21 day ago
This Article shows how to import a nested json like order and order details in to a flat table using AWS Athena . Athena is the most powerful tool that can scan millions of nested documents on S3 and transform it to flat structure if needed. Athena will automatically scale up the required CPU to process it without any human intervention.
In this article we will first take some sample nested JSON data structure which we will transform to flat structure.
The following is the nested json structure which can exhibit data of order and orderliness .
Note : Athena reads the file in the following format only.
Its converted to this Flat structure
Once data file is ready you can import to any S3 bucket using s3 upload file option from AWS Console . Following is the screenshot of file uploaded to s3.
This step needs to be careful while creating Athena structure for the provided data file . Athena looks like a relational table structure but it will not store any data. It will directory query the file at run time and provide the result.
“Create database testme”
Create table and access the file
Athena has good inbuilt support to read these kind of nested jsons. Following is the schema to read orders data file.
CREATE EXTERNAL TABLE `test_orders1`( `details` array<struct<orderno:int,detailno:int,cost:double,partnumber:string,linecode:string,qty:int>> COMMENT 'from deserializer', `header` struct<orderno:int,orderby:string,amount:double,orderdate:string> COMMENT 'from deserializer') ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' LOCATION 's3://test-bucket/load' TBLPROPERTIES ( 'transient_lastDdlTime'='1558949443')
is an array object contains 2 jsons in it. So represented same in the structure.
Schema for header info loading
Which is pointing to s3 bucket to access the file. We can not specify file directly here
If You are table is created successfully in AWS Athena now we should be able to see the data from S3 data file which we provided to do that click on the menu (three dotted icon) click on preview table
Here “item” means nested array contains order details . In order to split arrays in to rows we need to use CROSS JOIN in conjunction with unnest operator.
Data Format after first level transformation:
Use the following query to access
SELECT header, item FROM (test_orders1 CROSS JOIN UNNEST("details") t (item))
Header -- header json object from Json File.
Item -- Item object is refers to details array in json record.
SELECT header.orderno, header.orderby, header.amount, header.orderdate, item.partnumber, item.qty, item.cost FROM (test_orders1 CROSS JOIN UNNEST("details") t (item))
Now each orderline represented in flat view.
SELECT header.orderno, header.orderby, header.orderdate, item.partnumber, (item.qty*item.cost) as totalcost FROM (test_orders1 CROSS JOIN UNNEST("details") t (item))
Once result set is ready we can export to a csv file.
Athena has flexibility to export big result sets to a csv file.