5

Issue: I have a presigned URL that is valid for 15 minutes. An upload can be initiated any number of times if the presigned URL is captured within this time frame.

I want to make an S3 presigned URL for upload as secure as possible, so that the uploaded file is not modified. I always need the first version.
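
For context, the URL is generated along these lines (a minimal boto3 sketch; the bucket and key names are placeholders):

```python
# Sketch: generate a presigned PUT URL that expires in 15 minutes.
# Anyone holding this URL can upload to the key until it expires.
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "my-upload-bucket", "Key": "uploads/report.csv"},
    ExpiresIn=900,  # 15 minutes, in seconds
)
```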

Solutions I researched from various sources:

  1. Upload into one folder of the bucket and then use a Lambda (only once, for the first upload) to move the file to another folder, which is what eventually gets consumed. Cons: introduces a Lambda and its cold start even for a small file.
  2. Have a version check in all the places where the S3 file is consumed. Cons: too much code change, and the version needs to be maintained somewhere alongside the filename.

Any other ideas?

schroeder

3 Answers

2

Solution #1 is viable but only if the source bucket is versioned and the Lambda function writes to a database that supports strongly consistent reads, because that's the only way you can tell authoritatively whether you're doing it "only once for the first time."

When you ask S3 about the existence of an object for the very first time ever using a GetObjectMetadata/HeadObject request, you are guaranteed to get an accurate answer -- the object exists, or it doesn't. Subsequent requests within a "short" (but undocumented) time may not reflect the latest truth from the bucket's master index.

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all Regions with one caveat. The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.

https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html

"a subsequent GET might not return the object" means by inference that the Lambda invocation for upload #2 has no guaranteed way to authoritatively determine that it is indeed not handling the first upload of the object, because the target object may appear to be absent when it is in fact already present.

If it is of significant importance that only one upload -- the first upload -- actually be used, then an external database is required.
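
A minimal sketch of that approach, assuming an S3-triggered Lambda and a DynamoDB table keyed on the object key (the table, bucket, and folder names here are assumptions): the conditional write is atomic and strongly consistent, so it -- not a HeadObject call -- is what answers "is this the first upload?" authoritatively.

```python
# Sketch only: S3-triggered Lambda that "claims" the first upload of a key
# with a DynamoDB conditional write, then copies the object to the prefix
# the application actually consumes. Table/bucket names are hypothetical.
import boto3
from urllib.parse import unquote_plus
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # event keys are URL-encoded
        try:
            # Succeeds exactly once per key: DynamoDB conditional writes are
            # atomic, so only the first invocation for this key wins.
            dynamodb.put_item(
                TableName="upload-claims",
                Item={"object_key": {"S": key}},
                ConditionExpression="attribute_not_exists(object_key)",
            )
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                continue  # not the first upload of this key -- ignore it
            raise
        # First upload: copy it to the location your application reads from.
        s3.copy_object(
            Bucket="my-app-bucket",
            Key=f"accepted/{key}",
            CopySource={"Bucket": bucket, "Key": key},
        )
```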

The Lambda cold start issue is pretty insignificant in languages with a lightweight environment like JavaScript, which coincidentally has an SDK that is much better written than most other languages'. Python is fast, too, but the SDK is inferior. Java should be avoided in Lambda functions unless you are repurposing existing Java code.

This would be the solution I would adopt, but I would note that the 15-minute expiration should be unnecessary. The expiration time of a pre-signed URL is as precise as the clock on the system running the code that generated the URL, so if your server clock is accurate and you generate a URL that expires in 10 seconds, then that URL expires exactly 10 seconds from the time it was created.

Also potentially noteworthy, the expiration time on a pre-signed URL is checked when the request arrives at S3. You shouldn't need to finish the upload in 10 seconds, you only need to start the upload within 10 seconds. This is at least true for download pre-signed URLs... but should be true for uploads as well.

Modifying your code to look at a specific version is a non-starter, because querying S3 to learn that version ID each time is slow and costly.

1

I think the best solution is a Lambda proxy that returns the pre-signed URL: it checks/increments a version with a DynamoDB write and only hands out the real URL if it is a first-time request.
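
Something along these lines, perhaps (a sketch only; the table layout, event shape, and names are assumptions): the conditional write ensures the URL is handed out once per key, and a short expiry limits how long a captured URL can be replayed.

```python
# Sketch: API-facing Lambda that returns a presigned PUT URL only the first
# time it is asked for a given key, using a DynamoDB conditional write.
# Table/bucket names and the event shape are assumptions.
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

def handler(event, context):
    key = event["object_key"]  # however your API passes the target key
    try:
        dynamodb.put_item(
            TableName="issued-upload-urls",
            Item={"object_key": {"S": key}},
            ConditionExpression="attribute_not_exists(object_key)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return {"statusCode": 409, "body": "URL already issued for this key"}
        raise
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-upload-bucket", "Key": key},
        ExpiresIn=60,  # short expiry; it only needs to cover starting the upload
    )
    return {"statusCode": 200, "body": url}
```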

rudyhadoux
0

So actually both options are OK.

  1. Cold start is only really an issue for languages like Java; for JavaScript and Python, cold-start times are pretty minimal, and not really an issue if there is continuous volume on the bucket. The Lambda would trigger for each new object in the bucket, but would only copy the object across if there are no previous versions of the file.

  2. Object versioning within S3 is an amazing tool. With versioning enabled on the bucket, you can modify your code to always take the first uploaded file -- i.e. only take the version with the earliest last-modified date (see the sketch below). This information will already be in S3 and doesn't need to be stored anywhere else. You can run a cleanup job to completely delete all later versions to reduce clutter and space.
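
For what it's worth, a rough boto3 sketch of option 2 (bucket and key names are placeholders; pagination is not handled, and see the comment thread below for the cost/latency caveat of calling ListObjectVersions on every read):

```python
# Sketch: read the first-uploaded (earliest) version of a key in a
# versioned bucket. Bucket/key names are placeholders; pagination of
# list_object_versions is not handled here.
import boto3

s3 = boto3.client("s3")

def earliest_version_id(bucket, key):
    resp = s3.list_object_versions(Bucket=bucket, Prefix=key)
    versions = [v for v in resp.get("Versions", []) if v["Key"] == key]
    first = min(versions, key=lambda v: v["LastModified"])
    return first["VersionId"]

obj = s3.get_object(
    Bucket="my-upload-bucket",
    Key="uploads/report.csv",
    VersionId=earliest_version_id("my-upload-bucket", "uploads/report.csv"),
)
```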

keithRozario
  • There's a problem with #2 as described, *"Only take the version with the earliest last modified date. This information will already be in S3 and doesn't need to be stored anywhere."* You absolutely do *not* want to be doing a slow and expensive ListObjectVersions call each time you need to do something with the object. The correct version ID needs to be stored somewhere other than S3 (database) so you can readily find it. – Michael - sqlbot Mar 26 '20 at 00:34
  • And there's a problem with #1, as well. *"The lambda would trigger for each new object in the bucket, but would only copy across if there are no previous versions of the file."* That's a problem because HEAD Object/GetObjectMetadata (same thing) is only immediately consistent *once* for a given object key. You get one chance for that request to tell you the truth about the presence/absence of an object and subsequent responses are eventually consistent (possibly cached by an S3 index replica and non-authoritative). ListObjects/ListObjectVersions is always only eventually consistent. – Michael - sqlbot Mar 26 '20 at 00:42
  • @Michael-sqlbot I don't understand why ListObjectVersions is slow? It's an API call (similar to calling something like Dynamo), and can be set to a specific prefix (i.e. the entire filename) -- is there something you know that suggests it's slower than a DB read? – keithRozario Mar 26 '20 at 07:11
  • You should find it much slower than a DB read could be, although it's been a while since I've worked intensely with the low-level APIs. It's also $0.005 per 1000 requests. – Michael - sqlbot Mar 26 '20 at 12:28
  • Ah, I take your point about the pricing, it is much more expensive than DynamoDB reads/writes. But considering you don't have to write anything, or maintain a separate store, for lower-volume use cases I'd prefer this over a separate store. – keithRozario Mar 27 '20 at 01:03
  • There are low volume cases where you're right, that approach could be fine. – Michael - sqlbot Mar 27 '20 at 03:42