I’m playing around with Spark 3.0.1 and I’m really impressed by the performance of Spark SQL on gigabytes of data.
I’m trying to understand the best way to import multiple JSON files into a Spark DataFrame before running the analysis queries.
Right now I’m importing ~1,500 .gz files, each containing a single JSON-structured file. These files are stored in an S3 bucket, and I have a data pipeline that fills the bucket at regular intervals. The compressed size is about 5 GB for the 1,500 .gz files; uncompressed, the complete dataset is around 60–70 GB.
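Roughly, the import looks like the sketch below (the bucket name and prefix are placeholders, and the multiLine option is only relevant if each file holds a single multi-line JSON document rather than newline-delimited JSON):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-import").getOrCreate()

# Read every .gz file under the prefix in one pass.
# "my-bucket" and "events" are placeholders for the real bucket and path.
df = (
    spark.read
    # Only needed if each file contains one multi-line JSON document;
    # drop this option for newline-delimited JSON.
    .option("multiLine", "true")
    .json("s3a://my-bucket/events/*.gz")
)

df.createOrReplaceTempView("events")

# The SQL queries themselves finish in seconds once the data is loaded.
spark.sql("SELECT count(*) FROM events").show()
```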
Importing these files from S3 takes 3 to 4 minutes, while the SQL queries take only a few seconds.
The bottleneck is clearly S3 here.
What would be the ideal approach to speed up the import of these .gz files?
Any suggestion would be greatly appreciated.
Thank you!