
I run a large server providing open source software (https://ftp.halifax.rwth-aachen.de), currently serving more than 30 TByte of data with multi-gigabit throughput. The data is kept up to date using rsync, i.e. by synchronizing from an upstream rsync server to my local copy.

Currently the storage back-end is disk-based with a regular filesystem (ZFS). There are ideas to move this project to a virtualized environment, where the bulk of the storage would be provided via S3 (not hosted at Amazon, but by a local and presumably costly data-center appliance that speaks S3).

Based on my experience with rsync, I believe synchronizing lots of data via S3 is not a good idea, but I lack actual experience with S3.

How bad is it? Is S3 (the protocol) suitable for this kind of operation? In addition to serving lots of read requests (200/sec on average), would the S3 server be able to tell rsync whatever rsync needs to know to synchronize the data?
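For reference, here is a minimal sketch (assuming Python with boto3; the endpoint, bucket and prefix are hypothetical placeholders) of the per-object metadata an S3 listing exposes — key, size, last-modified timestamp and ETag — which is what any sync tool would have to work with on the S3 side:

```python
# Minimal sketch: list the per-object metadata an S3-compatible endpoint reports.
# The endpoint URL, bucket name and prefix below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.internal")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="mirror-bucket", Prefix="some-distro/"):
    for obj in page.get("Contents", []):
        # Key, Size, LastModified and ETag are what the listing provides;
        # an rsync-style rolling-checksum delta transfer is not part of the S3 API.
        print(obj["Key"], obj["Size"], obj["LastModified"], obj["ETag"])
```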

Bonus question: would S3 be suitable to serve data via rsync, i.e. keep rsync://ftp.halifax.rwth-aachen.de/ running?

Live statistics of the current (ZFS/disk based) system: https://ftp.halifax.rwth-aachen.de/~cotto/

C-Otto
    [`rclone`](https://github.com/rclone/rclone) might be a better tool than rsync when using an S3-compatible back-end. For our use cases and with our ([Ceph](https://docs.ceph.com/en/latest/radosgw/s3/)) S3 back-end, the horizontal scaling it provides was actually an improvement for most downloaders. But your S3 APIs will need some tuning once you start hammering them. And you can't really do partial uploads with an S3 back-end and/or rclone when large (binary) files change only slightly. – HBruijn Sep 05 '22 at 10:28
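To illustrate the partial-upload limitation mentioned in the comment above: with the S3 API, changing even a few bytes of an existing object means replacing the whole object, e.g. with a full PutObject upload. A minimal sketch, again using boto3 with hypothetical endpoint, bucket and path:

```python
# Sketch of the "no partial update" limitation: a changed file has to be
# re-uploaded in full with PutObject; there is no rsync-style delta transfer.
# Endpoint, bucket and local path are hypothetical placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.internal")

with open("/srv/mirror/example/ls-lR.gz", "rb") as f:
    s3.put_object(Bucket="mirror-bucket", Key="example/ls-lR.gz", Body=f)
```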

0 Answers