31

I'd like to be able to batch delete thousands or tens of thousands of files at a time on S3. Each file would be anywhere from 1MB to 50MB. Naturally, I don't want the user (or my server) to be waiting while the files are in the process of being deleted. Hence, the questions:

  1. How does S3 handle file deletion, especially when deleting large numbers of files?
  2. Is there an efficient way to do this and make AWS do most of the work? By efficient, I mean making the fewest requests to S3 and taking the least time while using the fewest resources on my servers.
tpml7
SudoKill

9 Answers

35

The excruciatingly slow option is s3 rm --recursive if you actually like waiting.

Running parallel s3 rm --recursive with differing --include patterns is slightly faster but a lot of time is still spent waiting, as each process individually fetches the entire key list in order to locally perform the --include pattern matching.

Enter bulk deletion.

I found I was able to get the most speed by deleting 1000 keys at a time using aws s3api delete-objects.

Here's an example:

cat file-of-keys | xargs -P8 -n1000 bash -c 'aws s3api delete-objects --bucket MY_BUCKET_NAME --delete "Objects=[$(printf "{Key=%s}," "$@")],Quiet=true"' _
  • The -P8 option on xargs controls the parallelism. It's eight in this case, meaning 8 instances of 1000 deletions at a time.
  • The -n1000 option tells xargs to bundle 1000 keys for each aws s3api delete-objects call.
  • Removing ,Quiet=true or changing it to false will spew out server responses.
  • Note: There's an easily missed _ at the end of that command line. @VladNikiforov posted an excellent explanation of what it's for in the comments below.
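For anyone who would rather drive this from Python than from xargs, the same batching idea can be sketched with boto3. This is only a minimal sketch, not the method used above: `chunked`, `delete_keys` and the default of 8 workers are hypothetical names and values chosen to mirror the -n1000 and -P8 options, and `s3_client` is assumed to be a `boto3.client("s3")`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(keys, size=1000):
    """Yield lists of at most `size` keys (1000 is the per-request limit)."""
    it = iter(keys)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def delete_keys(s3_client, bucket, keys, workers=8):
    """Delete `keys` in batches of up to 1000, `workers` batches at a time.

    `s3_client` is assumed to be boto3.client("s3"). Returns the per-key
    errors reported by S3; an empty list means no errors were reported.
    """
    def delete_batch(batch):
        response = s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in batch], "Quiet": True},
        )
        return response.get("Errors", [])

    errors = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch_errors in pool.map(delete_batch, chunked(keys)):
            errors.extend(batch_errors)
    return errors
```

With a key file like the one used above, a call might look like `delete_keys(boto3.client("s3"), "MY_BUCKET_NAME", open("file-of-keys").read().splitlines())`. Lower `workers` if S3 starts returning throttling errors.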

But how do you get file-of-keys?

If you already have your list of keys, good for you. Job complete.

If not, here's one way I guess:

aws s3 ls "s3://MY_BUCKET_NAME/SOME_SUB_DIR" | sed -nre "s|[0-9-]+ [0-9:]+ +[0-9]+ |SOME_SUB_DIR|p" >file-of-keys
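If the key list is being built from Python instead, a paginated listing is one option. The sketch below is an assumption-laden illustration rather than part of the original answer: `iter_keys` and `write_key_file` are hypothetical helper names, `s3_client` is assumed to be a `boto3.client("s3")`, and keys are assumed not to contain newlines.

```python
def iter_keys(s3_client, bucket, prefix=""):
    """Yield every object key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def write_key_file(s3_client, bucket, prefix, path="file-of-keys"):
    """Write one key per line to `path` and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for key in iter_keys(s3_client, bucket, prefix):
            fh.write(key + "\n")
            count += 1
    return count
```

Unlike the `aws s3 ls` output, the keys come back in full, so no sed step is needed to put the prefix back.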
antak
  • 12
    Great approach, but I found that listing the keys was the bottleneck. This is much faster: `aws s3api list-objects --output text --bucket BUCKET --query 'Contents[].[Key]' | pv -l > BUCKET.keys` And then removing objects (this was sufficient that going over 1 parallel process reaches the rate limits for object deletion): `tail -n+0 BUCKET.keys | pv -l | grep -v -e "'" | tr '\n' '\0' | xargs -0 -P1 -n1000 bash -c 'aws s3api delete-objects --bucket BUCKET --delete "Objects=[$(printf "{Key=%q}," "$@")],Quiet=true"' _` – SEK Aug 13 '18 at 18:09
  • 5
    You probably should also have stressed the importance on `_` in the end :) I missed it and then it took me quite a while to understand why the first element gets skipped. The point is that `bash -c` passes all arguments as positional parameters, starting with `$0`, while "$@" only processes parameters starting with `$1`. So the underscore dummy is needed to fill the position of `$0`. – Vlad Nikiforov Oct 01 '18 at 12:42
  • @VladNikiforov Cheers, edited. – antak Oct 02 '18 at 01:30
  • 5
    One problem I've found with this approach (either from antak or Vlad) is that it's not easily resumable if there's an error. If you are deleting a lot of keys (10M in my case) you may have a network error, or throttling error, that breaks this. So to improve this, I've used `split -l 1000` to split my keys file into 1000 key batches. Now for each file I can issue the delete command then delete the file. If anything goes wrong, I can continue. – joelittlejohn Apr 03 '19 at 12:32
  • 2
    If you just want a list of the keys, I would think `aws s3 ls "s3://MY_BUCKET_NAME/SOME_SUB_DIR" | awk '{print $4}'` would be simpler and you can add a `| grep` to filter that down from there. – Hayden Nov 27 '19 at 22:37
  • Man, doing this with millions of objects really sucks, but thanks to all of you for the pointers. I used the split command to split into 10k key files, then used SEK's command to run them with some parallelism. Then also deleting the split files when completed to offer some checkpointing. I found 10 threads to give me warnings to slow down from AWS. I'm going with 4 at the moment and it's going well. – Nathan Loyer Dec 30 '20 at 05:04
  • For aws cli v2, disabling the pager helps when running the s3api delete-objects command: export AWS_PAGER="" – imdibiji May 04 '21 at 02:32
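The split-and-checkpoint idea from the comments above could be sketched in Python roughly as follows. Everything here is illustrative: `split_key_file`, `delete_pending_batches` and the `batch-NNNNNN.keys` naming are hypothetical, and `s3_client` is assumed to be a `boto3.client("s3")`.

```python
from pathlib import Path


def split_key_file(key_file, batch_dir, size=1000):
    """Split `key_file` (one key per line) into numbered batch files of `size` keys."""
    batch_dir = Path(batch_dir)
    batch_dir.mkdir(parents=True, exist_ok=True)
    keys = Path(key_file).read_text(encoding="utf-8").splitlines()
    for n, start in enumerate(range(0, len(keys), size)):
        batch = keys[start:start + size]
        out = batch_dir / f"batch-{n:06d}.keys"
        out.write_text("\n".join(batch) + "\n", encoding="utf-8")


def delete_pending_batches(s3_client, bucket, batch_dir):
    """Delete each remaining batch, removing its file only after S3 reports no errors.

    Re-running after a failure resumes from the batch files still on disk.
    """
    for path in sorted(Path(batch_dir).glob("batch-*.keys")):
        keys = path.read_text(encoding="utf-8").splitlines()
        response = s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in keys], "Quiet": True},
        )
        if response.get("Errors"):
            raise RuntimeError(f"{path.name}: {response['Errors']}")
        path.unlink()  # checkpoint: this batch is done
```

Because a batch file is removed only after its request comes back clean, a network or throttling error leaves the unfinished batches in place for the next run.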
16

AWS supports bulk deletion of up to 1000 objects per request using the S3 REST API and its various wrappers. This method assumes you know the S3 object keys you want to remove (that is, it's not designed to handle something like a retention policy, files that are over a certain size, etc).

The S3 REST API can specify up to 1000 files to be deleted in a single request, which is much quicker than making individual requests. Remember, each request is an HTTP (thus TCP) request, so each request carries overhead. You just need to know the objects' keys and create an HTTP request (or use a wrapper in your language of choice). AWS provides great information on this feature and its usage. Just choose the method you're most comfortable with!

I'm assuming your use case involves end users specifying a number of specific files to delete at once, rather than initiating a task such as "purge all objects that refer to picture files" or "purge all files older than a certain date" (which I believe is easy to configure separately in S3).

If so, you'll know the keys that you need to delete. It also means the user will likely want more real-time feedback about whether their files were deleted successfully or not. References to exact keys are supposed to be very quick, since S3 was designed to scale efficiently despite handling an extremely large amount of data.
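To make the per-file feedback concrete, here is a hedged sketch of one bulk request that reports the outcome for each key. `delete_with_feedback` is a hypothetical helper name and `s3_client` is assumed to be a `boto3.client("s3")`; the `Deleted` and `Errors` lists are the fields the bulk delete response carries when quiet mode is off.

```python
def delete_with_feedback(s3_client, bucket, keys):
    """Delete up to 1000 keys in one request and report the outcome per key.

    Returns (deleted, failed): `deleted` is a list of keys, `failed` maps
    each key that could not be deleted to the error code S3 returned.
    """
    if len(keys) > 1000:
        raise ValueError("a bulk delete request accepts at most 1000 keys")
    response = s3_client.delete_objects(
        Bucket=bucket,
        # Quiet=False asks S3 to list successful deletions as well as errors.
        Delete={"Objects": [{"Key": k} for k in keys], "Quiet": False},
    )
    deleted = [item["Key"] for item in response.get("Deleted", [])]
    failed = {item["Key"]: item["Code"] for item in response.get("Errors", [])}
    return deleted, failed
```

The two return values can be passed straight back to the user interface, so each selected file can be marked as removed or as failed with a reason.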

If not, you can look into asynchronous API calls. You can read a bit about how they'd work in general from this blog post or search for how to do it in the language of your choice. This would allow the deletion request to take up its own thread, and the rest of the code can execute without making a user wait. Or, you could offload the request to a queue... But both of these options needlessly complicate either your code (asynchronous code can be annoying) or your environment (you'd need a service/daemon/container/server to handle the queue). So I'd avoid this scenario if possible.
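For a sense of what the "own thread" option might look like, here is a small sketch using a thread pool from the Python standard library. It is one possible shape under stated assumptions, not a recommendation: `delete_in_background` and the pool size of 4 are hypothetical, and `s3_client` is assumed to be a `boto3.client("s3")`.

```python
from concurrent.futures import ThreadPoolExecutor

# One shared pool for background deletions; the size is an arbitrary choice.
_pool = ThreadPoolExecutor(max_workers=4)


def delete_in_background(s3_client, bucket, keys):
    """Start deleting `keys` on a worker thread and return a Future immediately."""
    keys = list(keys)

    def work():
        errors = []
        # Send the keys in batches of up to 1000, the per-request limit.
        for start in range(0, len(keys), 1000):
            batch = keys[start:start + 1000]
            response = s3_client.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": k} for k in batch], "Quiet": True},
            )
            errors.extend(response.get("Errors", []))
        return errors

    return _pool.submit(work)
```

The caller gets the Future back at once and can respond to the user, then check `future.result()` later or attach a callback with `future.add_done_callback(...)` to record any errors.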

Edit: I don't have the reputation to post more than 2 links, but you can see Amazon's comments on request rate and performance here: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html The S3 FAQ also says that bulk deletion is the way to go if possible.

Ed D'Azzo
13

A neat trick is using lifecycle rules to handle the delete for you. You can queue a rule to delete the prefix or objects that you want and Amazon will just take care of the deletion.

https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
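A lifecycle rule can also be set from code. The following is a minimal sketch, assuming `s3_client` is a `boto3.client("s3")`; `expire_prefix_rule`, `apply_expiry`, the rule ID and the one-day expiry are all hypothetical choices for illustration.

```python
def expire_prefix_rule(prefix, days=1, rule_id="bulk-delete"):
    """Build a lifecycle configuration that expires objects under `prefix` after `days` days."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiry(s3_client, bucket, prefix, days=1):
    """Install the rule on `bucket`.

    Note: this call replaces any lifecycle configuration already on the bucket.
    """
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=expire_prefix_rule(prefix, days),
    )
```

Expiration runs asynchronously on Amazon's side, so objects disappear some time after they become eligible rather than the moment the rule is saved. Remember to remove the rule afterwards if new objects under that prefix should be kept.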

cam8001
  • 2
    Be careful, though, as this can be very expensive if you have a lot of objects, https://stackoverflow.com/questions/54255990/cheapest-way-to-delete-2-billion-objects-from-s3-ia – Will Feb 13 '20 at 09:07
5

I was frustrated by the performance of the web console for this task. I found that the AWS CLI command does this well. For example:

aws s3 rm --recursive s3://my-bucket-name/huge-directory-full-of-files

For a large file hierarchy, this may take some considerable amount of time. You can set this running in a tmux or screen session and check back later.

dannyman
  • 2
    It looks like the `aws s3 rm --recursive` command deletes files individually. Although faster than the web console, when deleting lots of files, it could be much faster if it deleted in bulk – Brandon Feb 22 '18 at 04:35
3

I know this post is really old at this point but if you're having to do this today, the AWS dashboard now has an "Empty" feature on the bucket search page which will perform a bulk delete (1000 at a time) for you:

Image of AWS Dashboard - S3 Empty Button Highlighted

  • 1
    I've tried different ways and looks like this option works best for me! Emptied my bucket with ~100k files and ~50GB in size in minutes. – sonlexqt Nov 24 '21 at 05:32
3

The s3 sync command has already been mentioned, but without an example or any word about the --delete option.

I found this to be the fastest way to delete the contents of a folder in the S3 bucket my_bucket:

aws s3 sync --delete "local-empty-dir/" "s3://my_bucket/path-to-clear"

Hubbitus
0

I found rclone to be pretty fast as it uses the S3 API.

https://rclone.org/

rclone delete --progress --transfers=1000 <rclone_confg>:<s3_bucket_and_prefix>

Shawnzam
0

I made a Python script for this.

Warning: it wipes S3 for your whole account. It deletes every object in every bucket, then deletes the buckets themselves.

import concurrent.futures
import boto3

def purge_bucket(Bucket, S3Client):
    """Delete every object in Bucket, one listing page (up to 1000 keys) at a time."""
    response = S3Client.list_objects_v2(Bucket=Bucket)
    while 'Contents' in response and response['KeyCount'] > 0:
        # delete_objects only wants the Key of each object, so drop the other listing fields.
        objects = [{'Key': obj['Key']} for obj in response['Contents']]
        print(f'Deleting {len(objects)} keys at {Bucket}')
        out = S3Client.delete_objects(
            Bucket=Bucket,
            Delete={'Objects': objects}
        )
        if 'Errors' in out:
            print(f'Errors at {Bucket}: {out["Errors"]}')
        # List again from the start; keys that failed to delete will be retried.
        response = S3Client.list_objects_v2(Bucket=Bucket)
    return Bucket

s3 = boto3.client('s3')
response = s3.list_buckets()
if len(response['Buckets']) > 0:
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Map each future to its bucket name so failures can be reported by name.
        runs = {}
        for bucket in response['Buckets']:
            name = bucket['Name']
            runs[executor.submit(purge_bucket, Bucket=name, S3Client=s3)] = name
        for run in concurrent.futures.as_completed(runs):
            name = runs[run]
            try:
                run.result()  # re-raises any exception from purge_bucket
                end = s3.delete_bucket(Bucket=name)
                print(end)
            except Exception as e:
                print(f'{name}: {e}')
0

Without knowing how you're managing the s3 buckets, this may or may not be particularly useful.

The AWS CLI has a command called "sync" which can be particularly effective for ensuring S3 has the correct objects. If you, or your users, are managing S3 from a local filesystem, you may be able to save a ton of work determining which objects need to be deleted by using the CLI tools.

http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

Bill B