
I have 200 jsonl (json-lines) files in an S3 bucket. Each file contains 100,000 JSON objects to be written into a DynamoDB table.

I want to use Lambda to download the file from S3, and batch-write it into the DynamoDB (the files already perfectly match the table schema).
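For the batch-write step, something like this is what I have in mind for chunking each file's lines into BatchWriteItem-sized groups (a sketch; `batches` is my own helper name, and DynamoDB's BatchWriteItem accepts at most 25 items per call):

```python
import json
from itertools import islice

def batches(lines, size=25):
    """Yield lists of parsed JSON objects from an iterable of jsonl lines,
    at most `size` objects per batch (25 is the BatchWriteItem limit)."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield [json.loads(line) for line in chunk]
```

Inside the Lambda handler each batch would then go to the table (e.g. via boto3's `table.batch_writer()`, which also handles retrying unprocessed items).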

I have 200 files, but I can't run 200 Lambdas concurrently -- since DynamoDB is limited to just 10,000 WCUs per second, I can only write 10,000 rows per second. And a Lambda can only run for 300 seconds before it times out.
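Running the numbers, 5 concurrent workers fit comfortably inside those limits (a rough sketch, assuming each item is at most 1 KB so one row costs one WCU):

```python
files = 200
rows_per_file = 100_000
wcu_per_second = 10_000   # table-wide write limit
concurrency = 5           # Lambdas running at once

# Each Lambda's fair share of the table's write throughput.
rows_per_lambda_per_s = wcu_per_second // concurrency   # 2,000 rows/s

# Time for one Lambda to write one file -- well under the 300 s timeout.
seconds_per_file = rows_per_file / rows_per_lambda_per_s   # 50 s

# Total wall-clock time for all 20M rows at the table's limit.
total_seconds = (files * rows_per_file) / wcu_per_second   # 2,000 s (~34 min)
```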

What's the best way to do this?

My current thinking was to asynchronously invoke 5 Lambdas at once and monitor the logs to see how many are done, invoking the next one only after one completes.

OR...

Can I set the concurrent execution limit to 5 for the Lambda function, and then asynchronously invoke the function 200 times (once for each file)? Will AWS automatically trigger the next Lambda when one completes?

keithRozario
  • Just know that 10,000 WCU limit on DynamoDB is a default. You can request that it be raised. From the documentation: "AWS places some default limits on the throughput you can provision. These are the limits unless you request a higher amount. To request a service limit increase see https://aws.amazon.com/support." – Kirk Jul 16 '18 at 15:46

1 Answer


From the AWS docs:

https://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html

By setting a concurrency limit on a function, Lambda guarantees that allocation will be applied specifically to that function, regardless of the amount of traffic processing remaining functions. If that limit is exceeded, the function will be throttled. How that function behaves when throttled will depend on the event source. For more information, see Throttling Behavior

Then from the AWS documentation on throttling behavior: https://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html#throttling-behavior

On reaching the concurrency limit associated with a function, any further invocation requests to that function are throttled, i.e. the invocation doesn't execute your function. Each throttled invocation increases the Amazon CloudWatch Throttles metric for the function. AWS Lambda handles throttled invocation requests differently, depending on their source:

Synchronous invocation: If the function is invoked synchronously and is throttled, Lambda returns a 429 error and the invoking service is responsible for retries. The ThrottledReason error code explains whether you ran into a function level throttle (if specified) or an account level throttle (see note below). Each service may have its own retry policy. For example, CloudWatch Logs retries the failed batch up to five times with delays between retries. For a list of event sources and their invocation type, see Supported Event Sources.

Asynchronous invocation: If your Lambda function is invoked asynchronously and is throttled, AWS Lambda automatically retries the throttled event for up to six hours, with delays between retries. Remember, asynchronous events are queued before they are used to invoke the Lambda function.

So it seems that if you set a concurrency limit (the default is 1000, shared across all your functions), then AWS will either give you a 429 status code (for synchronous, request-response invocations) or will automatically queue and retry your function for up to 6 hours (for asynchronous invocations).

The documentation doesn't specify how the delay between retries works, though.
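So the second approach in the question should work as-is. A minimal sketch of the launcher, assuming a loader function named `jsonl-loader` and a bucket named `my-bucket` (both hypothetical names): cap the function's reserved concurrency at 5, then fire one asynchronous invoke per file and let Lambda's own queue-and-retry handle the throttled ones.

```python
import json

try:
    import boto3  # available in the Lambda runtime, or via `pip install boto3`
except ImportError:
    boto3 = None

def make_payload(bucket, key):
    """Event payload telling one invocation which jsonl file to load."""
    return json.dumps({"bucket": bucket, "key": key})

def launch_all(bucket="my-bucket", function_name="jsonl-loader"):
    """Cap the loader at 5 concurrent executions, then invoke it
    asynchronously once per S3 object. Invokes beyond the limit are
    queued and retried by Lambda itself for up to 6 hours."""
    lam = boto3.client("lambda")
    s3 = boto3.client("s3")
    lam.put_function_concurrency(
        FunctionName=function_name, ReservedConcurrentExecutions=5
    )
    for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
        lam.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous invocation
            Payload=make_payload(bucket, obj["Key"]),
        )
```

Note `list_objects_v2` returns at most 1,000 keys per call, which covers the 200 files here; a larger bucket would need the paginator.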

keithRozario