My task is:

  1. Import the data from MS SQL Server into HDFS using Sqoop.
  2. Process the data in Hive and write the result to a single table.
  3. Export that result table from Hive back to MS SQL Server.
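
To make steps 1 and 3 concrete, here is a minimal sketch of how the Sqoop command lines could be assembled. The helper names, host, database, table names, paths, and user are all hypothetical placeholders, and `--password-file` assumes a Sqoop version that supports it; the Hive processing in step 2 is not shown.

```python
import shlex


def sqoop_import_cmd(jdbc_url, table, target_dir, username, password_file,
                     num_mappers=4):
    """Build the argument list for importing one table into HDFS (step 1)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        # Number of parallel map tasks Sqoop uses for the transfer.
        "--num-mappers", str(num_mappers),
    ]


def sqoop_export_cmd(jdbc_url, table, export_dir, username, password_file):
    """Build the argument list for exporting a result directory (step 3)."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--export-dir", export_dir,
    ]


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    url = "jdbc:sqlserver://dbhost.example.com:1433;databaseName=SalesDB"
    imp = sqoop_import_cmd(url, "Orders", "/user/hadoop/orders",
                           "etl_user", "/user/hadoop/.sqlserver.pw")
    exp = sqoop_export_cmd(url, "OrderSummary",
                           "/user/hive/warehouse/order_summary",
                           "etl_user", "/user/hadoop/.sqlserver.pw")
    # Print shell-quoted commands so they can be inspected before running.
    print(shlex.join(imp))
    print(shlex.join(exp))
```

Building the commands as lists keeps quoting of the JDBC URL (which contains a semicolon) out of the way; the printed form is only for inspection.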

I have to implement all of this using Amazon services. (In my case I am using Amazon S3 for storing the data and Amazon Elastic MapReduce.)

The data I am importing from MS SQL Server is very large: about 5,000,000 rows in one table, and I have 30 such tables. To process it I have written a Hive script that consists only of queries, each of which uses many joins. As a result, performance is very poor on my single local machine: the script takes about 3 hours to run to completion.

I want to reduce that time as much as possible, so we decided to use Amazon Elastic MapReduce. Currently I am using ten m1.large instances, and I still get the same performance as on my single local machine.

Is there any other way to improve the performance, or is increasing the number of instances the only option?

How many instances would I need to use to improve the performance?

wfaulk
Bhavesh Shah