While reading the official documentation for the Amazon S3 Java SDK, I found an interesting note:

Your network connection remains open until you read all of the data or close the input stream. We recommend that you read the content of the stream as quickly as possible.

My question is: why does Amazon recommend reading the data as soon as possible, as opposed to, say, streaming it into a data pipeline where we can process the data line by line? I couldn't find the answer in Amazon's documentation or on their pricing pages. Nowhere is it mentioned that a long-lived HTTP connection costs more, so I'm looking for some input from the community.

Thanks

Juzer Ali

1 Answer

The reason is that you're essentially¹ reading bytes directly from a network socket. The SDK is not buffering the entire object in memory or on disk for you.

The S3 service -- like any web service -- will not tolerate excessive stalls/blocking on the socket by the client. The specific timeouts imposed by the service aren't documented, but the idea behind this warning is that you shouldn't leave an open stream lying around and expect it to remain available indefinitely, as it would if everything were fetched and stashed somewhere locally.
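For the "fetched and stashed somewhere locally" case, here is a minimal sketch that drains a stream into a temporary file first and processes it afterwards, so a slow consumer never holds the connection open. It uses only the standard library. The class and method names are made up for illustration, and a `ByteArrayInputStream` stands in for the stream the SDK would return.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class SpoolThenProcess {

    // Hypothetical helper: copies the whole stream to a temp file as fast as
    // the network allows, then closes the stream (and with it the connection).
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("s3-object-", ".tmp");
        try (InputStream src = in) {
            // createTempFile already created the file, so overwrite it.
            Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the object content stream; an assumption for this sketch.
        InputStream fake = new ByteArrayInputStream(
                "first\nsecond\n".getBytes(StandardCharsets.UTF_8));

        Path local = spoolToTempFile(fake);
        try {
            // The network stream is already closed; processing can be slow now.
            List<String> lines = Files.readAllLines(local, StandardCharsets.UTF_8);
            for (String line : lines) {
                System.out.println(line);
            }
        } finally {
            Files.deleteIfExists(local);
        }
    }
}
```

The trade-off is local disk usage and an extra copy in exchange for not depending on the connection staying open while you work.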

How quickly or slowly you read has no financial impact. The concern is reliability, since a TCP connection left idle/stalled will eventually be closed. S3 doesn't multiplex multiple simultaneous operations on the same socket, so no other interactions with the service would be affected if the connection is closed unexpectedly.

This recommendation doesn't necessarily exclude line-by-line stream processing, if done efficiently.
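As a rough sketch of what efficient line-by-line processing could look like: wrap the stream in a `BufferedReader`, keep the per-line work cheap so the socket never sits idle for long, and use try-with-resources so the stream is closed even if processing fails. The helper name is hypothetical, and a `ByteArrayInputStream` stands in for the SDK's stream so the example runs without AWS dependencies.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LineByLineStream {

    // Hypothetical helper: counts non-empty lines while reading directly
    // from the stream. The per-line work is deliberately cheap.
    static long countNonEmptyLines(InputStream in) throws IOException {
        long count = 0;
        // try-with-resources closes the reader and the underlying stream,
        // which releases the connection even when an exception is thrown.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the object content stream; an assumption for this sketch.
        InputStream fake = new ByteArrayInputStream(
                "a\n\nb\nc\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countNonEmptyLines(fake)); // prints 3
    }
}
```

If the work per line is slow (for example, a blocking call to another service), the reads stall and the concern described above applies again; in that case handing lines off to a queue or spooling to disk first is probably the safer choice.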


¹ Essentially, but not quite, because TLS sits between your code and the raw socket.

Michael - sqlbot