I've been looking into using the AWS CLI for data integrity checks, to verify that a backup has been transferred correctly from a Linux file server to AWS S3. Likewise, when restoring a file from backup to the Linux file server, I would like to verify that it was also transferred correctly.

I examined the ETag stored with the object on S3, because it appears to be an MD5 sum. However, if a large file is transferred as a multipart upload, the ETag is no longer a plain MD5 sum of the file.
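To illustrate the ETag behaviour described above, here is a minimal Python sketch. It assumes the commonly observed (not contractually guaranteed) format for multipart ETags: the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. The function names are hypothetical, the 8 MB default mirrors the AWS CLI's default chunk size, and the sketch does not apply to objects encrypted with SSE-KMS or SSE-C.

```python
import hashlib


def single_put_etag(data: bytes) -> str:
    """ETag typically seen for a non-multipart upload: hex MD5 of the body."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """ETag typically seen for a multipart upload of non-empty data.

    Assumption: S3 hashes each part, then hashes the concatenation of the
    raw part digests, and appends "-<number of parts>".
    """
    part_digests = [
        hashlib.md5(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"
```

Because the result depends on the part size used at upload time, the same file can end up with different ETags, which is why comparing the ETag against a local md5sum fails for large files.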

But before I go further and compute an MD5 sum of what has just been synced to S3 each time: is this really necessary? When using rsync between Linux file systems over the internet, it isn't common practice to run md5sum on the transferred files to verify them, because, I think, it is assumed that rsync has already taken care of this.

So I'm wondering: does the AWS CLI sync already take care of the data integrity check for us?

Edward_178118

1 Answer

The short answer is yes: aws s3 sync and aws s3 cp calculate an MD5 checksum, and if it doesn't match when the upload is complete, they retry up to five times.

The longer answer:

The AWS CLI will calculate and auto-populate the Content-MD5 header for both standard and multipart uploads. If the checksum that S3 calculates does not match the Content-MD5 provided, S3 will not store the object and will instead return an error message to the AWS CLI. The AWS CLI will retry this error up to 5 times before giving up.
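As a rough illustration of that check, here is a minimal Python sketch of how a Content-MD5 value is derived (the base64 encoding of the raw 16-byte MD5 digest of the body) and how a receiver can compare it. This is not the AWS CLI's or S3's actual code; the function names `content_md5` and `server_accepts` are hypothetical.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return the Content-MD5 header value: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def server_accepts(body_received: bytes, header_value: str) -> bool:
    """Mimic the receiver's check: recompute the digest and compare."""
    return content_md5(body_received) == header_value


# The sender computes the header before uploading.
header = content_md5(b"backup payload")

# An intact transfer passes; a corrupted one is rejected, and the client
# would then retry the upload.
print(server_accepts(b"backup payload", header))
print(server_accepts(b"backup pAyload", header))
```

The same per-request check applies to each part of a multipart upload, since every part is sent with its own header.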

If the request is signed with Signature Version 4, the MD5 checksum is not calculated.

Note that the AWS CLI adds a Content-MD5 header for both the high-level aws s3 commands that perform uploads (aws s3 cp, aws s3 sync) and the low-level s3api commands, including aws s3api put-object and aws s3api upload-part.

Reference

AWS CLI S3 FAQ

kenlukas