I am new to Hadoop and AWS. I have set up a multi-node AWS EC2 cluster (4 t2.large instances) with the Cloudera Hadoop distribution, and I have tested the environment with basic examples on CSV files, such as word count.
Now, my main project is to analyze data in JSON files. I have around 4 million JSON files, approximately 60 GB of data in total. Each file contains one large JSON entry, essentially all the information about a single record.
I am a bit confused about how to approach this. My idea is to copy the files to HDFS and build MapReduce jobs (in Java, since that is what I am comfortable with) to convert them into large CSV files, and then create Hive tables from those CSVs for analysis. Converting the files to CSV locally would take a long time; copying them to AWS will also be slow, but once they are there I can use the computing power of the instances. Something like the mapper sketch below is roughly what I have in mind. I am not sure whether this is the right approach. How should I start?
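For reference, here is the kind of map-only job I was picturing, just a rough sketch: it assumes each JSON document arrives as a single line of input (otherwise I would need a custom whole-file InputFormat) and uses Jackson for parsing. The field names ("id", "name", "amount") are placeholders, not my real schema.

```java
import java.io.IOException;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JsonToCsvMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final ObjectMapper jsonMapper = new ObjectMapper();
    private final Text csvLine = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse one JSON record (assumes the whole document is on this line)
        JsonNode record = jsonMapper.readTree(value.toString());

        // Pull out the fields I would want in the Hive table (placeholder names)
        String id = record.path("id").asText();
        String name = record.path("name").asText();
        String amount = record.path("amount").asText();

        // Emit a CSV row; no reducer needed for a plain format conversion
        csvLine.set(id + "," + name + "," + amount);
        context.write(NullWritable.get(), csvLine);
    }
}
```

Would a plain map-only job like this be a sensible way to do the conversion at that scale?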
Is there a way to process the JSON directly, or any other approach that would make this more efficient? I have about one month to get this data into a form that can be queried, and I will build on it from there.
Any help would be really appreciated.