There are many limiting factors when transferring many small files. Some have already been mentioned: network latency, disk write speed, and so on. However, most of those are best addressed by using "rsync". If the files don't exist on the destination, and you are reasonably sure the process won't be interrupted, piping tar to tar will be very efficient:
cd /SOURCE/DIR && tar cf - . | ssh DESTINATIONHOST "cd /DESTINATION/DIR && tar xpvf -"
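If the transfer might be interrupted, or the destination already holds partial or older copies, rsync is the safer batching tool since it can resume and skip files that already match. A minimal sketch using the same placeholder paths (-a and --partial are standard rsync options):

rsync -a --partial /SOURCE/DIR/ DESTINATIONHOST:/DESTINATION/DIR/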
Fundamentally, you need to batch all the files together so that the startup/shutdown overhead of SCP (the SSH connection setup and teardown) only happens once. If you pay that startup/shutdown cost for each file, the transfer will be very inefficient. The "tar" pipe above does exactly that batching, and for 90% of all use cases it will be good enough.
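To make that concrete, the naive per-file approach looks something like the loop below (same placeholder paths, top-level files only, for illustration); every iteration pays the full SSH connection setup and teardown cost, which dominates when the files are small:

# Inefficient: one SSH connection per file.
for f in /SOURCE/DIR/*; do
    scp "$f" DESTINATIONHOST:/DESTINATION/DIR/
done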
This "tar pipe" has the benefit of parallel processing (reading in one process while writing in another). However it is limited by a few things:
1. TCP/IP will never utilize 100% of the pipe it has.
2. Each process is limited by a disk that can only service one read or one write at a time. With spinning disks that's fine; with SSDs, or the kinds of RAID that permit multiple parallel reads, this technique will under-perform.
You can work around #2 through various hacks, such as running two or more processes, each on a subset of the files, as sketched below. However, those workarounds are imperfect and a bit sloppy.
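As an illustration of that hack, assuming the source tree happens to split into two subdirectories (subdirA and subdirB are hypothetical names), you can run two tar pipes in parallel and wait for both:

( cd /SOURCE/DIR && tar cf - ./subdirA | ssh DESTINATIONHOST "cd /DESTINATION/DIR && tar xpf -" ) &
( cd /SOURCE/DIR && tar cf - ./subdirB | ssh DESTINATIONHOST "cd /DESTINATION/DIR && tar xpf -" ) &
wait   # block until both background transfers finish

The split is rarely even, so one pipe usually finishes long before the other, which is part of why this is sloppy.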
TCP/IP is more difficult to work around and will continue to be your limit. Even if you tune everything else so that it is optimal, TCP/IP won't use the full pipe. Every time TCP/IP thinks it has found the optimal send rate, it tries to send a little more to test whether there is "more room" available. That probe eventually fails and TCP/IP backs off a bit. This constant increase/fail/back-off loop means that a TCP/IP stream alternates between roughly 100% utilization and 50% utilization, so on average the pipe ends up around 75-80% utilized. (NOTE: these are rough estimates; look up your stack's congestion-control behavior for exact numbers. The point is that the average of 100% and something less than 100% can never be 100%.)
If you run multiple TCP/IP streams, they will all be constantly cycling through this increase/fail/back-off loop. If you are unlucky, they'll all collide at the same time and all back off very far, leaving the pipe even more underutilized. If you are lucky, they'll collide less often and you'll get a utilization graph that looks like many bouncing balls, but the pipe will still be underutilized in aggregate.
Oh, and if a single machine in the mix has a TCP/IP implementation that lacks the latest optimizations, or isn't tuned properly, it can throw the whole system out of whack.
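If you suspect an untuned stack, on a Linux box you can at least check which congestion-control algorithm is in use and whether the socket buffer ceilings are large enough for your bandwidth-delay product. This is just a quick inspection sketch; the buffer values shown are illustrative, not recommendations:

sysctl net.ipv4.tcp_congestion_control            # algorithm currently in use (e.g. cubic)
sysctl net.ipv4.tcp_available_congestion_control  # algorithms the kernel offers
sysctl net.core.rmem_max net.core.wmem_max        # per-socket buffer ceilings
sudo sysctl -w net.core.rmem_max=16777216         # example: raise the receive ceiling to 16 MB
sudo sysctl -w net.core.wmem_max=16777216         # example: raise the send ceiling to 16 MB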
So if TCP/IP is so terrible, why do we continue to use it? It isn't so bad in the typical case of many different types of traffic sharing a pipe. The problem here is that you have a very specific application with a very specific requirement, so you need a very specific solution. Luckily, a lot of people are in the same position, so these solutions are becoming easier to find.
Systems like http://asperasoft.com/ use a custom protocol over UDP/IP so they can control the back-off/retry algorithm. They use forward error correction (FEC) so that small errors don't require retransmission (with TCP/IP a small error is a signal to back off), custom compression schemes, delta copying, and their own back-off and rate-limiting algorithms to achieve full (or close-to-full) utilization of the pipe. These are all proprietary, so it isn't clear exactly which techniques Aspera and their competitors use or exactly how they work.
Many companies have invented such systems and either built them into their own products or sell them as standalone commercial products.
I don't know of any open source implementations at this time. (I'd like to be corrected!)
If this is a very pressing problem and worth spending money to fix, try one of the commercial products. Or, if you cannot change your software, you'll need to buy a larger pipe. Luckily, 10G and 40G network interfaces are coming down in price.