
I've just started using GCS as a backup for my web servers. One server has 1.2 million JPEGs (3.5 TB), and that all rsynced over flawlessly in 10 hours or so.

The other has 2.5 million JPEGs (just thumbnails/previews though - 300 GB total). The first time I ran it, the "Building synchronization state..." phase went through all 2.5 million quite quickly - a few minutes. My session got interrupted though (wifi dropped), and when I SSHed back in to run it again, the "At source listing" counter quickly nips through 10,000, 20,000, 30,000, then grinds to a near halt. Half an hour later it's only up to 300,000. I know it has to work out what files the destination already has too, but I don't feel that should significantly slow down the "At source listing..." echoes?

Does it suggest a problem with my filesystem, and if so what should I check?

Or is it expected behaviour, for any reason?

Is trying to use gsutil rsync with 2 million files to one bucket a bad idea? I could find no guidelines from Google on how many objects can sit in a bucket, so I'm assuming it's billions/unlimited?

FWIW the files are all in nested subdirectories, with no more than 2000 files in any one directory.

Thanks

Edit: the exact command I'm using is:

gsutil -m rsync -r /var/www/ gs://mybucketname/var/www
Codemonkey
    Are there symbolic links under /var/www? If so, are there circular links? One thing you might try (if you're up for it) is adding a log statement in the _BuildTmpOutputLine function in gsutil/gslib/commands/rsync.py, so it prints out the current file being processed, so you can see where it hangs. If you do this please report back your findings. – Mike Schwartz Oct 23 '15 at 15:15
  • No links. I'll do that now though, thanks! – Codemonkey Oct 23 '15 at 15:53
  • Well I now know that it's each 32,000th file that creates a large pause. Which is the size of "buffer_size" in that file. – Codemonkey Oct 23 '15 at 16:11
  • So at 32,000 per read we're looking at approx 80 ~4MB temp files each containing 32,000 URLs that are then combined to one 320MB file. It doesn't feel that writing a 4MB temp file should take 10+ seconds, so I wonder if something can be improved – Codemonkey Oct 23 '15 at 16:21
  • "output_chunk.writelines(unicode(''.join(current_chunk)))" is the line that's taking all the time. – Codemonkey Oct 23 '15 at 16:42
  • Thanks for pointing me down this path Mike. I've ended up asking a new question, if you could have a look that'd be great. Thanks! – Codemonkey Oct 23 '15 at 17:57
  • Hmm, I just tried creating a directory with 32400 files in it and running gsutil rsync -r dir gs://my-bucket, and that writelines line ran fast. The filenames were short (names like 0/0/0 ... 9/5/102), so the amount of data to sort is smaller than your case, but I'd be surprised if that's the problem. Do you have any non-ASCII chars in the file names? Another thing you could try is splitting that line into three: ''.join(...), unicode(), and writelines(), and adding logging to see which is slow. Let me know. – Mike Schwartz Oct 23 '15 at 21:53
  • p.s. maybe we should switch to email - you can reach me at gs-team@google.com – Mike Schwartz Oct 23 '15 at 21:55
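
Here is a rough sketch of the three-way split Mike suggests above, with timing added for each step. The function name _TimedChunkWrite is hypothetical, and current_chunk / output_chunk are assumed to be the locals in gsutil's rsync.py that the comments refer to - this is not gsutil's actual code, just one way to see which of the three steps is slow:

    import time

    def _TimedChunkWrite(output_chunk, current_chunk):
        # Hypothetical instrumentation of the slow line in rsync.py, split into
        # its three steps so each can be timed separately.
        t0 = time.time()
        joined = ''.join(current_chunk)   # step 1: concatenate the ~32,000 buffered lines
        t1 = time.time()
        decoded = unicode(joined)         # step 2: coerce the chunk to unicode
        t2 = time.time()
        output_chunk.writelines(decoded)  # step 3: the original writelines() call
        t3 = time.time()
        print 'join %.2fs, unicode %.2fs, writelines %.2fs' % (t1 - t0, t2 - t1, t3 - t2)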

1 Answer


I have discovered that changing

output_chunk.writelines(unicode(''.join(current_chunk)))

to

output_chunk.write(unicode(''.join(current_chunk)))

in /gsutil/gslib/commands/rsync.py makes a big difference. Thanks to Mike from the GS Team for his help - this simple change has already been rolled out on GitHub:

https://github.com/GoogleCloudPlatform/gsutil/commit/a6dcc7aa7706bf9deea3b1d243ecf048a06a64f2
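
For background on why such a small change matters: in Python 2, writelines() expects an iterable of strings, so handing it one big joined unicode string makes it iterate the string character by character and write (and encode) each character separately, while write() pushes the whole chunk through in a single call. Here is a minimal, self-contained Python 2 sketch that reproduces the difference with io.open - it is not gsutil's code, and the bucket path and line count are made up for illustration:

    # writelines() vs write() on one big joined unicode string, mimicking the
    # ~32,000-line chunks gsutil rsync buffers before flushing to a temp file.
    import io
    import os
    import time

    chunk = [u'gs://mybucketname/var/www/img/%d.jpg\n' % i for i in range(32000)]
    joined = unicode(''.join(chunk))

    def timed(label, fn):
        start = time.time()
        fn()
        print '%s: %.2fs' % (label, time.time() - start)

    with io.open('chunk_writelines.tmp', 'w', encoding='utf-8') as f:
        # writelines() treats the string as an iterable of 1-char strings,
        # so this becomes over a million tiny encoded writes.
        timed('writelines(joined)', lambda: f.writelines(joined))

    with io.open('chunk_write.tmp', 'w', encoding='utf-8') as f:
        # write() encodes and writes the whole chunk in one call.
        timed('write(joined)', lambda: f.write(joined))

    os.remove('chunk_writelines.tmp')
    os.remove('chunk_write.tmp')

With the fix, flushing each 32,000-line chunk becomes one encode-and-write call instead of one per character, which is why the counter stops stalling at every 32,000th file.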

Codemonkey