Multi-core and copy speed

4

What I want to do is copy about 500K (500,000) files.

I want to copy them within the same server, from one location to another. The data is mostly emails, so there are many small files.

It's only about 23 GB, but it is taking very long (over 30 minutes and still not done), and the Linux cp command only uses 1 CPU.

So if I script it to run multiple cp processes in parallel, would that make it faster?

The system has 16 cores, 16 GB RAM, and 15K drives (15,000 RPM SATA).

What other options are there?

I believe tarring and untarring would take even longer, and it won't use multiple cores either.

Phyo Arkar Lwin

Posted 2011-10-21T22:39:34.713

Reputation: 401

1

See my answer to this question for why copying a lot of files requires a lot of disk I/O: http://superuser.com/questions/344534/why-does-copying-the-same-amount-of-data-take-longer-if-spread-across-many-separ/344860#344860

– sawdust – 2011-10-22T01:08:57.290

Answers

6

Your bottleneck is hard-drive speed. Multi-core can't speed this up.
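To verify that the disk, not the CPU, is the limit, one option (not part of the original answer; it assumes the sysstat package is installed) is to watch iostat while the copy runs:

    # Extended per-device stats, refreshed every second; a %util near
    # 100% combined with low throughput indicates a seek-bound disk.
    iostat -x 1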

Pubby

Posted 2011-10-21T22:39:34.713

Reputation: 334

Hard drive? When tested with hdparm it returns 278 MB/s; are you sure about this? At that rate it should only take about 100 seconds to copy a 23 GB file. So using multiple cp processes in a multi-process program won't improve this either? – Phyo Arkar Lwin – 2011-10-21T22:48:53.873
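For context: hdparm -t measures buffered sequential reads from the raw device, a best-case figure that a workload of many small files will rarely reach, since each file adds seeks and metadata lookups. A typical invocation looks like this (the device name is an assumption):

    # Time buffered sequential reads; run as root, and substitute
    # the actual device for /dev/sda.
    sudo hdparm -t /dev/sda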

1 No, no it won't. The bottleneck is almost certainly the read/write speed of the physical media itself unless you're using enterprise-level gear. – Shinrai – 2011-10-21T22:51:34.667

@V3ss0n I do know that hard drives are not random access, which prevents them from being accessed in parallel. – Pubby – 2011-10-21T22:51:57.693

2 @Pubby8 - Umm, HDDs are random access devices (at the block/sector level). They are often compared to tape (i.e. magnetic tape), which is a sequential block device. I suspect you're trying to state that the typical device can only perform one I/O operation at a time. There is an animal called a dual-port disk drive that can do two operations at once, but there are filesystem issues that make this rather complicated. – sawdust – 2011-10-22T01:14:31.443

What I want to make sure of is this: I once made a program in Python which extracts text from multiple file formats using different kinds of parsers (doc, pdf, eml, etc.) into a database for later indexing and search. At first the script was single-process, and after making it multi-process using the multiprocessing module (a high-level fork, so the same as forking) its speed increased significantly. But it only works well up to 4 processes; at 6 processes the I/O stalls and everything slows down dramatically, sometimes even freezing the whole program. – Phyo Arkar Lwin – 2011-10-22T12:16:28.900

So the sweet spot there was 4 processes. Should I test it that way? – Phyo Arkar Lwin – 2011-10-22T12:16:57.307

3

Copying a single large file is faster than copying lots of small files, as there is latency in the setup and teardown of each operation; the disk and OS can also do lots of read-ahead with a single large file. So tarring first would make the copy itself quicker, though once you factor in the time taken to tar, it may not speed things up much overall.
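As a sketch of the tar idea that avoids ever writing an intermediate archive, the two tars can be connected with a pipe (source and destination paths are placeholders):

    # Stream the source tree straight into an extracting tar at the
    # destination; no temporary .tar file is written to disk.
    tar -C /path/to/source -cf - . | tar -C /path/to/dest -xf -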

Note that you are only reading from a single disk, so parallelising your calls to the disk may actually slow things down, as the head has to seek back and forth to serve multiple files at the same time.

Paul

Posted 2011-10-21T22:39:34.713

Reputation: 52 173

1 Wouldn't tarring require reading all the files, creating the tar, and then creating the copy from it? Seems like it would definitely take longer. – Pubby – 2011-10-21T23:01:50.777

Yes, for sure - I agreed with your answer; mine was just to provide some additional information. Given that the copy seemed to be underway at the time the OP wrote the question, this appeared to be an information-gathering exercise. There will be circumstances where tarring first may provide better overall performance. – Paul – 2011-10-22T03:23:16.470

0

Although this question is quite old, I think the best way is to compress using a multi-core tool such as lbzip2 or pbzip2, transfer the compressed file, and then decompress it using multiple cores. You can find details about these commands on the Internet.
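A rough sketch of that workflow, with placeholder paths and archive name (pbzip2 uses all available cores by default):

    # Pack and compress the tree on all cores, then decompress
    # and unpack it at the destination.
    tar -cf - -C /path/to/source . | pbzip2 -c > /tmp/mail.tar.bz2
    pbzip2 -dc /tmp/mail.tar.bz2 | tar -xf - -C /path/to/dest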

Dharma

Posted 2011-10-21T22:39:34.713

Reputation: 103

Can you explain why this uses fewer disk resources (which are likely the bottleneck)? – Hennes – 2018-06-06T10:49:31.187

0

Is it all in the same directory? There is a script that starts multiple cp processes: http://www.unix.com/unix-dummies-questions-answers/128363-copy-files-parallel.html

For a directory tree you would need to adjust it.
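In case the link goes stale, here is a minimal sketch of the same idea for a flat directory, using GNU find and xargs (paths and the process count are assumptions, not from the linked script):

    # Copy files four at a time; -print0/-0 keeps unusual
    # filenames safe. Adjust -P to taste.
    find /path/to/source -maxdepth 1 -type f -print0 \
      | xargs -0 -P 4 -I {} cp {} /path/to/dest/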

ott--

Posted 2011-10-21T22:39:34.713

Reputation: 1 911