Some unrelated points:
80K is a lot of files.
80,000 files in one directory? No operating system or app handles that situation very well by default. You just happen to notice this problem with rsync.
Check your rsync version
Modern rsync handles large directories a lot better than in the past. Be sure you are using the latest version.
Even old rsync handles large directories fairly well over high-latency links... but 80k files isn't large... it is huge!
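For example, a quick way to confirm what is actually running on both ends (the hostname below is a placeholder):

```
# Local rsync version
rsync --version | head -n 1

# Remote rsync version -- the remote side matters just as much
ssh backup-host 'rsync --version | head -n 1'
```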
That said, rsync's memory usage is directly proportional to the number of files in a tree. Large directories take a large amount of RAM. The slowness may be due to a lack of RAM on either side. Do a test run while watching memory usage. Linux uses any left-over RAM as a disk cache, so if you are running low on RAM, there is less disk caching. If you run out of RAM and the system starts using swap, performance will be really bad.
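One way to do that test run, assuming Linux and placeholder paths: kick off a dry run (it walks all the metadata but transfers nothing) and watch memory and swap in another terminal on each side.

```
# Terminal 1: metadata-only dry run
rsync -a --dry-run /data/bigdir/ backup-host:/backup/bigdir/

# Terminal 2: memory and swap activity, one line per second
vmstat 1
# or a quick snapshot:
free -m
```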
Make sure --checksum is not being used
`--checksum` (or `-c`) requires reading each and every block of every file. You probably can get by with the default behavior of just reading the modification times (stored in the inode).
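As a sanity check (paths and hostname are placeholders), you can time the default quick check against a forced checksum run; the second one has to read every byte of every file on both sides:

```
# Default: compares size + modification time only
time rsync -a --dry-run /data/bigdir/ backup-host:/backup/bigdir/

# --checksum: hashes every file -- expect this to be far slower
time rsync -a --checksum --dry-run /data/bigdir/ backup-host:/backup/bigdir/
```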
Split the job into small batches.
There are some projects like Gigasync which will "Chop up the workload by using perl to recurse the directory tree, building smallish lists of files to transfer with rsync."
The extra directory scan is going to be a large amount of overhead, but maybe it will be a net win.
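A minimal sketch of the same idea without Gigasync, assuming GNU userland and placeholder paths; `--files-from` is a standard rsync option:

```
cd /data/bigdir

# Build the full file list, then chop it into batches of 10,000 names
find . -type f > /tmp/all-files.txt
split -l 10000 /tmp/all-files.txt /tmp/batch.

# Transfer one batch at a time
for batch in /tmp/batch.*; do
    rsync -a --files-from="$batch" . backup-host:/backup/bigdir/
done
```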
OS defaults aren't made for this situation.
If you are using Linux/FreeBSD/etc with all the defaults, performance will be terrible for all your applications. The defaults assume smaller directories so as not to waste RAM on oversized caches.
Tune your filesystem to better handle large directories: Do large folder sizes slow down IO performance?
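One common Linux-side tweak (an assumption about your setup; test it before making it permanent) is mounting with `noatime`, so that reading file data doesn't also write an updated access time back to every inode:

```
# Remount an existing filesystem with noatime (device and mountpoint are placeholders)
mount -o remount,noatime /data

# To make it permanent, add noatime to the options column in /etc/fstab, e.g.:
# /dev/sdb1  /data  ext4  defaults,noatime  0  2
```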
Look at the "namei cache"
BSD-like operating systems have a cache that accelerates looking up a name to the inode (the "namei" cache). There is a namei cache for each directory. If it is too small, it is more of a hindrance than an optimization. Since rsync is doing a lstat() on each file, the inode is being accessed for every one of the 80k files. That might be blowing your cache. Research how to tune file directory performance on your system.
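If your servers run Linux (an assumption on my part), the analogous structure is the dentry/inode cache. A rough way to see whether it is holding your 80k entries, and to make the kernel less eager to evict them (the value below is illustrative, not a recommendation):

```
# How many dentry and inode objects are currently cached
slabtop -o | grep -E 'dentry|inode_cache'

# Lower values make the kernel hold on to dentries/inodes longer (default 100)
sysctl vm.vfs_cache_pressure
sysctl -w vm.vfs_cache_pressure=50
```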
Consider a different file system
XFS was designed to handle larger directories. See Filesystem large number of files in a single directory
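If you have a scratch partition to experiment with, a rough comparison might look like this (the device name is a placeholder, and mkfs destroys whatever is on it):

```
# DANGER: wipes /dev/sdc1 -- use a spare disk only
mkfs.xfs /dev/sdc1
mount /dev/sdc1 /mnt/xfstest
cp -a /data/bigdir /mnt/xfstest/

# Drop caches between timings so the comparison is fair
sync; echo 3 > /proc/sys/vm/drop_caches
time ls -lR /data/bigdir > /dev/null
sync; echo 3 > /proc/sys/vm/drop_caches
time ls -lR /mnt/xfstest/bigdir > /dev/null
```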
Maybe 5 minutes is the best you can do.
Consider calculating how many disk blocks are being read, and calculate how fast you should expect the hardware to be able to read that many blocks.
Maybe your expectations are too high. Consider how many disk blocks must be read to do an rsync with no changed files: each server will need to read the directory and read one inode per file. Let's assume nothing is cached because, well, 80k files has probably blown your cache. Let's say that it is 80k blocks to keep the math simple; at 512 bytes per block, that's about 40M of data, which should be readable in a few seconds. However, if there needs to be a disk seek between each block, that could take much longer.
So you are going to need to read about 80,000 disk blocks. How fast can your hard drive do that? Considering that this is random I/O, not a long linear read, 5 minutes might be pretty excellent. That's 600 seconds / 80,000 reads, or one disk read every 7.5 ms. Is that fast or slow for your hard drive? It depends on the model.
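A back-of-the-envelope version you can adapt (the IOPS number is an assumption; a 7200 rpm disk does very roughly 100 random reads per second, an SSD far more):

```
# Estimated seconds for a no-change scan: one random read per file
files=80000
iops=100          # assumed random reads/second for a spinning disk
echo $(( files / iops ))   # => 800 seconds if nothing is cached
```

If the estimate comes out slower than your real 5 minutes, your caches (or your drives) are already doing better than the worst case.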
Benchmark against something similar
Another way to think about it is this. If no files have changed, `ls -Llr` does the same amount of disk activity but never reads any file data (just metadata). The time `ls -Llr` takes to run is your baseline: a no-change rsync can't be much faster than that.
Is rsync (with no files changed) significantly slower than `ls -Llr`? Then the options you are using for rsync can be improved. Maybe `-c` is enabled, or some other flag that reads more than just directories and metadata (inode data).
Is rsync (with no files changed) nearly as fast as `ls -Llr`? Then you've tuned rsync as best as you can. You have to tune the OS, add RAM, get faster drives, change filesystems, etc.
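A concrete way to run that comparison (paths and hostname are placeholders; repeat each a couple of times, or drop caches in between, so caching doesn't favor whichever ran second):

```
# Metadata-only baseline
time ls -Llr /data/bigdir > /dev/null

# rsync doing nothing but comparing metadata
time rsync -a --dry-run /data/bigdir/ backup-host:/backup/bigdir/
```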
Talk to your devs
80k files is just bad design. Very few file systems and system tools handle such large directories very well. If the filenames are abcdefg.txt, consider storing them in abcd/abcdefg.txt (note the repetition). This breaks the directories up into smaller ones, but doesn't require a huge change to the code.
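A sketch of migrating to such a layout with plain bash (paths and the .txt pattern are assumptions; try it on a copy first):

```
cd /data/bigdir
for f in *.txt; do
    prefix=${f:0:4}          # first four characters, e.g. "abcd"
    mkdir -p "$prefix"
    mv -- "$f" "$prefix/"
done
```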
Also.... consider using a database. If you have 80k files in a directory, maybe your developers are working around the fact that what they really want is a database. MariaDB or MySQL or PostgreSQL would be a much better option for storing large amounts of data.
Hey, what's wrong with 5 minutes?
Lastly, is 5 minutes really so bad? If you run this backup once a day, 5 minutes is not a lot of time. Yes, I love speed. However, if 5 minutes is "good enough" for your customers, then it is good enough for you. If you don't have a written SLA, how about an informal discussion with your users to find out how long they expect the backups to take?
I assume you didn't ask this question if there wasn't a need to improve the performance. However, if your customers are happy with 5 minutes, declare victory and move on to other projects that need your efforts.
Update: After some discussion we determined that the bottleneck is the network. I'm going to recommend 2 things before I give up :-).
- Try to squeeze more bandwidth out of the pipe with compression. However, compression requires more CPU, so if your CPU is overloaded, it might make performance worse. Try rsync with and without `-z`, and configure your ssh with and without compression. Time all 4 combinations to see if any of them perform significantly better than the others (a timing sketch follows after this list).
- Watch network traffic to see if there are any pauses. If there are pauses, you can find what is causing them and optimize there. If rsync is always sending, then you really are at your limit. Your choices are:
- a faster network
- something other than rsync
- move the source and destination closer together. If you can't do that, can you rsync to a local machine then rsync to the real destination? There may be benefits to doing this if the system has to be down during the initial rsync.
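A sketch of timing those four combinations (hostname and paths are placeholders); while each run goes, a tool such as `iftop` or `ifstat` on the sending side will show whether the link stays busy or stalls. Since nothing changes between runs, what you are timing is mostly the file-list exchange, which is exactly the part the network is limiting.

```
# 1. no rsync compression, no ssh compression
time rsync -a  -e 'ssh -o Compression=no'  /data/bigdir/ backup-host:/backup/bigdir/
# 2. rsync compression only
time rsync -az -e 'ssh -o Compression=no'  /data/bigdir/ backup-host:/backup/bigdir/
# 3. ssh compression only
time rsync -a  -e 'ssh -o Compression=yes' /data/bigdir/ backup-host:/backup/bigdir/
# 4. both
time rsync -az -e 'ssh -o Compression=yes' /data/bigdir/ backup-host:/backup/bigdir/
```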