
I have 60 TB of data that resides in 12 CSV files.

The data will be loaded into a clustered database where the loading process is single-threaded. In order to improve my load performance, I need to initiate a load process from each node.

So far so good from that point of view. My biggest problem is how to split this data. It is zipped, and each CSV file holds around 5 TB of data! I tried split, but it takes too long.

Up_One
    You should have realistic expectations. I would expect _any_ method to split these files to take several days to run to completion, depending on the speed of your storage. For example, if you can read and write at an average of 100MByte/sec, I would expect this job to take about a week. – Michael Hampton Jul 03 '14 at 17:23
  • Yeah, it seems so! This was an architectural problem from the beginning; the CSV files should have been generated as smaller files. :( – Up_One Jul 03 '14 at 17:29
  • What was used to compress the files, e.g. zip or gzip? – John Auld Jul 03 '14 at 17:34
  • zip was used, but I don't think I will be able to split it; too much time. – Up_One Jul 03 '14 at 17:35

3 Answers


The easiest, though most likely not the fastest, way is:

unzip -p <zipfile> | split -C <size>
Nik
  • I am not going to go for that, since it causes a lot of overhead. Another reason is that my database loader supports loading zipped data. – Up_One Jul 03 '14 at 17:50
  • So you can use the zipsplit utility; it can split one zip archive into several zip archives by file and size. – Nik Jul 03 '14 at 17:56
  • Yeah, that might be a choice, but it takes more time to split the file than to load it! – Up_One Jul 03 '14 at 18:00
  • You should benchmark unzip/zip operations on your server; on most modern systems they are faster than reading from disk. In that case you can unzip a given file from the archive, zip it again and immediately send it to another node with nc, and you can run this for different files from the archive in parallel if you have a multicore processor. On the server: `unzip -p <archive> <file> | zip -r - - | nc <node> 12345`; on the nodes: `nc -l 12345 > <file>.zip`. That way your server's disk is loaded only by reads, and all write operations happen on the nodes. – Nik Jul 03 '14 at 18:27

Assuming the order of the data is unimportant, one way to do this (not so much a faster way, but at least somewhat parallel) would be to write a script that does the following.

  1. Open the zip file.
  2. Get the first file.
  3. Read the data out of that file, line by line.
  4. For each CSV line read, write it out to one of the output zip files.
  5. Rotate which output zip file you write to (say, across five zip files), one line at a time.
  6. Once an output zip reaches a certain size (say 50 GB), close it and start a brand new zip file.

This isn't any faster than a sequential read of the big file, but it allows you to split the file into smaller chunks that can be loaded in parallel while the remaining data is still being processed.

Like most compressed output, it's not seekable (you cannot jump X bytes ahead), so the biggest downside is that if the process aborts for some reason, you'd be forced to restart the whole thing from scratch.

Python provides support for doing something like this via the zipfile module.
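
For illustration, here is a minimal sketch of that approach using only the standard-library zipfile module (Python 3.6+, which can write archive members as streams). The archive name, the number of output streams and the 50 GB threshold are hypothetical placeholders rather than values from the question, so adjust them to your environment.

    import zipfile

    SOURCE_ZIP = "input_part01.zip"   # hypothetical source archive name
    NUM_STREAMS = 5                   # rotate lines across this many output archives
    CHUNK_BYTES = 50 * 1024**3        # start a new archive after ~50 GB of uncompressed data

    class RollingZipWriter:
        """Streams lines into a zip member, rolling over to a new archive at CHUNK_BYTES."""
        def __init__(self, stream_id):
            self.stream_id = stream_id
            self.part = 0
            self._open_new()

        def _open_new(self):
            name = f"out_stream{self.stream_id}_part{self.part:04d}"
            self.zf = zipfile.ZipFile(name + ".zip", "w", zipfile.ZIP_DEFLATED)
            # force_zip64=True because each member can exceed the 2 GiB zip limit
            self.member = self.zf.open(name + ".csv", "w", force_zip64=True)
            self.written = 0

        def write_line(self, line):
            self.member.write(line)
            self.written += len(line)
            if self.written >= CHUNK_BYTES:   # step 6: close and start a brand new archive
                self.close()
                self.part += 1
                self._open_new()

        def close(self):
            self.member.close()
            self.zf.close()

    writers = [RollingZipWriter(i) for i in range(NUM_STREAMS)]

    with zipfile.ZipFile(SOURCE_ZIP) as src:               # step 1: open the zip
        member_name = src.namelist()[0]                    # step 2: take the first file in it
        with src.open(member_name) as fh:                  # step 3: read it line by line
            for i, line in enumerate(fh):
                writers[i % NUM_STREAMS].write_line(line)  # steps 4-5: round-robin the lines

    for w in writers:
        w.close()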

Matthew Ife

Do you have to load the 12 files in order or can they be imported in parallel?

I ask because if they have to be loaded in order, then splitting them further won't enable you to run anything in parallel anyway; and if they don't, then you can import the 12 files you already have in parallel.

If the files aren't already available on the nodes, transferring them there may take as long as the import anyway.

Bottlenecks can show up in surprising places. Have you started the single-thread import process and verified that the nodes are underutilised? You may be solving the wrong problem if you haven't checked.

Ladadadada