How can I download a copy of an S3 public data set?



I was naively assuming I could do something like

s3cmd sync s3://snap-d203feb5 /var/tmp/copy

but I seem to have the wrong idea of how to go about this. I cannot even get a simple listing to work:

vnix$ s3cmd ls s3://snap-d203feb5
Bucket 'snap-d203feb5':
ERROR: Bucket 'snap-d203feb5' does not exist

I guess the identifier I have is not for a "bucket" but for a "public data set". How do I go from one to the other? Do I have to start up an EC2 instance and create a bucket for this? How? The instructions I found seem to assume I want to use the data in an EC2 instance, but in this case, I'd just like to browse a bit, at least for a start.

By the way, copy/pasting the "US Snapshot ID" causes a nasty traceback from Python; they publish the ID with a weird (I presume Unicode) dash which cannot be copy/pasted directly. Is there a mistake when I copy it? And what's the significance of "US" in there? Can't I use the data outside North America?


Posted 2012-09-03T21:16:40.843


Have you tried using a normal ASCII dash (or two dashes) instead? Various blogging platforms tend to "prettify" -- into Unicode dashes. – user1686 – 2012-09-03T22:48:07.793

Yes, of course I tried that. Result above. With the weird dashes, like I said, there's a hideous backtrace. – tripleee – 2012-09-04T04:03:21.193



The public data sets are not hosted on Amazon S3 as such; they are provided as Amazon Elastic Block Store (EBS) snapshots. Although these snapshots are in fact stored on S3, you cannot access one directly. Instead, you need to create a new EBS volume from it and attach that volume to an Amazon EC2 instance for further processing at your discretion.

Just browsing the data set is a reasonable use case, of course. Unfortunately, you currently cannot avoid using an EC2 instance and an EBS volume, though; see the section How It Works for details:

Select public data sets are hosted on Amazon EC2 for free as Amazon Elastic Block Store (Amazon EBS) snapshots. Amazon EC2 customers can access this data by creating their own personal Amazon EBS volumes, using the public data set snapshots as a starting point. They can then access, modify and perform computation on these volumes directly using their Amazon EC2 instances [...]

To get started using the Public Data Sets on AWS, simply perform these three easy steps:

  1. Sign up for an Amazon EC2 account.
  2. Launch an Amazon EC2 instance.
  3. Create an Amazon EBS volume using the Snapshot ID listed in the catalog above for your chosen snapshot.

How to perform these steps in detail is explained in the documentation you already linked, i.e. Launching an Instance and Creating a Public Data Set Volume.
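As a rough illustration of step 3, here is a minimal Python sketch that builds the command lines for creating a volume from the snapshot and attaching it to a running instance. It assumes the present-day AWS CLI (`aws ec2 create-volume` and `aws ec2 attach-volume`) is installed and configured, which is not something the linked documentation prescribes. The helper names, the Availability Zone, and the `vol-EXAMPLE` / `i-EXAMPLE` IDs are hypothetical placeholders. Because the published ID may contain a typographic dash, the sketch replaces such dashes with a plain ASCII hyphen before using the ID.

```python
import re
import shlex

# Dash-like Unicode characters that web pages often substitute for "-".
_DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"


def normalize_snapshot_id(raw):
    """Replace typographic dashes with ASCII "-" and check the basic shape."""
    cleaned = raw.strip().translate({ord(ch): "-" for ch in _DASHES})
    # Lenient check: "snap-" followed by lowercase hex digits.
    if not re.fullmatch(r"snap-[0-9a-f]+", cleaned):
        raise ValueError("not a snapshot ID: %r" % cleaned)
    return cleaned


def create_volume_cmd(snapshot_id, zone):
    """argv for creating an EBS volume from a snapshot with the AWS CLI."""
    return ["aws", "ec2", "create-volume",
            "--snapshot-id", normalize_snapshot_id(snapshot_id),
            "--availability-zone", zone]


def attach_volume_cmd(volume_id, instance_id, device="/dev/sdf"):
    """argv for attaching an existing volume to a running instance."""
    return ["aws", "ec2", "attach-volume",
            "--volume-id", volume_id,
            "--instance-id", instance_id,
            "--device", device]


if __name__ == "__main__":
    # Hypothetical values: substitute your own zone, volume and instance IDs.
    print(shlex.join(create_volume_cmd("snap\u2013d203feb5", "us-east-1a")))
    print(shlex.join(attach_volume_cmd("vol-EXAMPLE", "i-EXAMPLE")))
```

The sketch only prints the commands, so nothing is created until you run them yourself. Note that the volume has to be created in the same Availability Zone as the instance you attach it to; after attaching, you still need to mount the device inside the instance before you can browse the files.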

Once the data set is available this way, you can of course store it in an S3 bucket of your own.
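For that last step, a minimal sketch of copying the mounted volume into your own bucket with `s3cmd sync`, the tool used in the question, might look as follows. The mount point `/mnt/dataset` and the bucket name `my-dataset-copy` are assumptions for illustration only, and the script is meant to run on the EC2 instance where the volume is mounted.

```python
import shlex
import subprocess


def s3_sync_cmd(mount_point, bucket, prefix=""):
    """argv for `s3cmd sync LOCAL_DIR s3://BUCKET[/PREFIX]`."""
    # Trailing slash: upload the directory's contents, not the directory itself.
    source = mount_point.rstrip("/") + "/"
    target = "s3://" + bucket + "/"
    cleaned = prefix.strip("/")
    if cleaned:
        target += cleaned + "/"
    return ["s3cmd", "sync", source, target]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical mount point and bucket name; this is a dry run by default.
    run(s3_sync_cmd("/mnt/dataset", "my-dataset-copy"))
```

The dry-run default is deliberate: a public data set can be large, so it seems prudent to inspect the command (and the expected S3 storage and request costs) before letting it run.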

Steffen Opel

Posted 2012-09-03T21:16:40.843
