Overview
This Article shows how to import a nested json like order and order details in to a flat table using AWS Athena . Athena is the most powerful tool that can scan millions of nested documents on S3 and transform it to flat structure if needed. Athena will automatically scale up the required CPU to process it without any human intervention.
In this article we will first take some sample nested JSON data structure which we will transform to flat structure.
Example Json Data Structure
The following is the nested json structure which can exhibit data of order and orderliness .
Note : Athena reads the file in the following format only.
Its converted to this Flat structure
Step 1: Upload File To S3
Once data file is ready you can import to any S3 bucket using s3 upload file option from AWS Console . Following is the screenshot of file uploaded to s3.
Step 2: Access Orders Data Using Athena
This step needs to be careful while creating Athena structure for the provided data file . Athena looks like a relational table structure but it will not store any data. It will directory query the file at run time and provide the result.
- Create database in athena with following query like traditional sql query.
“Create database testme”
- Once database got created , create a table which is going to read our json file in s3.
Step 3: Create Athena Table Structure for nested json along with the location of data stored in S3
Create table and access the file
Athena has good inbuilt support to read these kind of nested jsons. Following is the schema to read orders data file.
CREATE EXTERNAL TABLE `test_orders1`(
`details` array<struct<orderno:int,detailno:int,cost:double,partnumber:string,linecode:string,qty:int>> COMMENT ‘from deserializer’,
`header` struct<orderno:int,orderby:string,amount:double,orderdate:string> COMMENT ‘from deserializer’)
ROW FORMAT SERDE
‘org.openx.data.jsonserde.JsonSerDe’
STORED AS INPUTFORMAT
‘org.apache.hadoop.mapred.TextInputFormat’
OUTPUTFORMAT
‘org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat’
LOCATION
‘s3://test-bucket/load’
TBLPROPERTIES (
‘transient_lastDdlTime’=’1558949443’)
Explanation
Details:
is an array object contains 2 jsons in it. So represented same in the structure.
`details`
array<struct<orderno:int,detailno:int,cost:double,partnumber:string,linecode:string,qty:int>>
Header :
Schema for header info loading
`header` struct<orderno:int,orderby:string,amount:double,orderdate:string>
Location:
‘s3://test-bucket/load’
Which is pointing to s3 bucket to access the file. We can not specify file directly here
- If schema defined well table will create successfully.
- Based on the schema the following is the table structure in Athena
Step 4: Preview Table Data
If You are table is created successfully in AWS Athena now we should be able to see the data from S3 data file which we provided to do that click on the menu (three dotted icon) click on preview table
- You can see the data is displaying on results console. Now we have to view the results in a flat view so follow the below steps to access inner objects/fields.
Step 5: Now Let’s show our data in two columns header and detail
Here “item” means nested array contains order details . In order to split arrays in to rows we need to use CROSS JOIN in conjunction with unnest operator.
Data Format after first level transformation:
Use the following query to access
SELECT
header,
item
FROM
(test_orders1
CROSS JOIN UNNEST("details") t (item))
Explanation:
Header — header json object from Json File.
Item — Item object is refers to details array in json record.
5b. Finally transform header and details objects as flat columns.
SELECT
header.orderno,
header.orderby,
header.amount,
header.orderdate,
item.partnumber,
item.qty,
item.cost
FROM
(test_orders1
CROSS JOIN UNNEST("details") t (item))
Output:
Now each orderline represented in flat view.
5c. If you want to get the total cost of the order by partnumber we can use the below query.
SELECT
header.orderno,
header.orderby,
header.orderdate,
item.partnumber,
(item.qty*item.cost) as totalcost
FROM
(test_orders1
CROSS JOIN UNNEST("details") t (item))
Export to CSV
Once result set is ready we can export to a csv file.
Csv file;
Athena has flexibility to export big result sets to a csv file.