8

I need to set up an SFTP server that, essentially, has very large capacity. I need to give one of our partners SFTP login details to a server where they will upload millions of files, totalling a few hundred terabytes. I will then read some of these files, selectively and quite rarely. This is the only actual requirement; any technology choice is up for grabs.

What comes to mind as the easiest way is to have some sort of EC2 instance running the SFTP server, set up so that anything uploaded is either sent directly to S3, or some process discovers new files as they get uploaded, copies them to S3, and deletes them from disk.
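
To illustrate, the kind of "discover and forward" process I have in mind would be something like this (a rough sketch only; the paths, bucket name, and the reliance on inotify's close_write event are my own assumptions, and it ignores error handling and partial uploads):

#!/bin/bash
# Sketch: watch the SFTP upload directory and move each finished upload to S3.
# Paths and bucket name are placeholders; a real setup needs error handling.
UPLOAD_DIR=/srv/sftp/uploads
BUCKET=s3://example-archive-bucket

inotifywait -m -r -e close_write --format '%w%f' "$UPLOAD_DIR" |
while read -r file; do
    # 'aws s3 mv' uploads the object and removes the local copy on success
    aws s3 mv "$file" "$BUCKET/${file#$UPLOAD_DIR/}"
done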

Is this the best way? Is there any other way of getting a server that essentially has "infinite and magically growing disk space"?

Thanks for your help! Daniel

Daniel Magliola
  • 100 TB of data transfer on S3 will run you almost $5,000. The value of your "partner's" porn is probably far less than that. – HopelessN00b Mar 20 '15 at 20:08
  • https://code.google.com/p/s3fs/ is probably your best bet. On top of the transfer costs @HopelessN00b points out, that same 100 TB will cost you $3k/month to store. – ceejayoz Mar 20 '15 at 20:22
  • Thanks for your constructive comment @HopelessN00b. For anyone else that may've been dissuaded from answering by that idea... We've run the math, yes, transferring and holding that amount of information will cost us a lot of money. Having the data (definitely not porn) is worth that cost for our business. – Daniel Magliola Mar 20 '15 at 20:50
  • Alright, so your question is...? How to install SFTP on an AWS instance? How to write a script to delete files? What? Not to be indelicate, but you'd think a company wanting to spend tens of thousands of dollars a month for this "few hundred" TB of data would be willing to hire a consultant for a few grand to set up this system for them. – HopelessN00b Mar 20 '15 at 21:19
  • Can you explain the context as to why it's impractical to have them either install something on their end to upload it directly to S3 or set up something like AWS Storage Gateway? If you're loading _hundreds_ of terabytes then surely they can afford to spend a little bit of time installing an S3 client on a server with direct access to their storage. – thexacre Mar 20 '15 at 23:01
  • @HopelessN00b Yes, we are absolutely trying to get someone with the right experience, but it's taking a while, and we'll probably need this set up before we find someone, unfortunately. My question is what's the best combination of solutions to achieve this, and how to set up that combination. For example, I didn't know about AWS Storage Gateway until today; I'm going to research that. There may be other, better ways to do this than AWS+S3, and I'm trying to find out, from the ServerFault community, what they recommend. – Daniel Magliola Mar 22 '15 at 12:19
  • @thexacre This is one of those "big company has a way they work and it won't be changed just for you" situations. They want an SFTP server where they will upload their data; we need to give them that. I'm trying to avoid what is probably the standard solution they normally deal with, of just having a NAS with a bunch of hard drives in our office, for a large number of reasons. Can you explain to me how AWS Storage Gateway would work? Do I just mount that as a local drive on my EC2 server where the SFTP server is running? – Daniel Magliola Mar 22 '15 at 12:23

4 Answers

10

I answered this same question on Stack Overflow.

s3fs is indeed a reasonable solution, and in my case, I've coupled it with proftpd with excellent results, in spite of the theoretical/potential problems.

At the time I wrote the answer, I had only set this up for one of my consulting clients... but since then, I've also started drinking my own kool-aid and am using it in production at my day job. Companies we exchange data with upload and download files all day long on my SFTP server, which is storing everything directly on S3. As a bonus, my report exporting system -- which writes Excel spreadsheets directly to S3 -- can export reports "to the FTP server" by simply putting them directly into the FTP server's bucket, with appropriate metadata to show the uid, gid, and mode of each file. (s3fs uses x-amz-meta-uid, -gid, and -mode headers to emulate filesystem permissions). When the client logs on to the server, the report files are just... there.
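
For illustration, an upload with that metadata might look roughly like this (the bucket name, numeric IDs, and mode value are placeholders, not taken from my actual setup; the mode assumes s3fs's decimal st_mode encoding, where 33188 is a regular file with 0644 permissions):

# Hypothetical example: drop a report into the SFTP server's bucket, adding the
# metadata headers s3fs uses to emulate ownership and permissions.
aws s3 cp report.xlsx s3://example-ftp-bucket/reports/report.xlsx \
    --metadata uid=1000,gid=1000,mode=33188   # sent as x-amz-meta-uid/-gid/-mode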

I do think the ideal solution would probably be an SFTP-to-S3 gateway service, but I still haven't gotten around to designing one, since this solution works really well... with some caveats, of course:

Not all of the default values for s3fs are sane. You will probably want to specify these options:

-o enable_noobj_cache   # s3fs has a huge performance hit for large directories without this enabled
-o stat_cache_expire=30 # the ideal time will vary according to your usage
-o enable_content_md5   # it's beyond me why this safety check is disabled by default
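
Put together, a mount command using those options might look something like this (the bucket name and mount point are placeholders; allow_other is an extra assumption so that non-root SFTP users can see the files):

# Illustrative manual mount with the options above; names are placeholders
s3fs example-bucket /srv/s3fs/example-bucket \
    -o enable_noobj_cache \
    -o stat_cache_expire=30 \
    -o enable_content_md5 \
    -o allow_other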

It's probably best to use a region other than US-Standard, because that's the only region that doesn't offer read-after-write consistency on new objects. (Or, if you need to use US-Standard, you can use the almost undocumented hostname your-bucket.s3-external-1.amazonaws.com from the us-east-1 region to prevent your requests from being geo-routed, which may improve consistency.)
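
If you do go the us-east-1 route, s3fs can be pointed at that hostname through its url option; roughly (bucket and mount point are again placeholders):

# Sketch: pin s3fs to the non-geo-routed us-east-1 endpoint mentioned above
s3fs example-bucket /srv/s3fs/example-bucket \
    -o url=https://s3-external-1.amazonaws.com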

I have object versioning enabled on the bucket, which s3fs is completely unaware of. The benefit of this is that even if a file should get "stomped," I can always go to bucket versioning to recover the "overwritten" file. Object versioning in S3 was brilliantly designed in such a way that S3 clients that are unaware of versioning are in no way disabled or confused, because if you don't make versioning-aware REST calls, the responses S3 returns are compatible with clients that have no concept of versioning.
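
Enabling versioning is a one-time, bucket-level setting; with the AWS CLI it looks roughly like this (bucket name is a placeholder):

# Enable versioning on the bucket; s3fs itself needs no changes
aws s3api put-bucket-versioning \
    --bucket example-bucket \
    --versioning-configuration Status=Enabled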

Note also that transferring data into S3 is free of data transfer charges. You pay only the per-request pricing. Transferring data out of S3 into EC2 within a region is also free of data transfer charges. It's only when you transfer out of S3 to the Internet, to CloudFront, or to another AWS region that you pay transfer charges. If you want to use the lower-priced reduced-redundancy storage, s3fs supports that with -o use_rrs.

As an amusing aside, you'll always get a warm fuzzy feeling when you see the 256 terabytes of free space (and 0 used, since a real calculation of sizes is impractical because S3 is an object store, not a filesystem).

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  1.4G  6.2G  18% /
s3fs            256T     0  256T   0% /srv/s3fs/example-bucket

Of course, you can mount the bucket anywhere. I just happen to have it in /srv/s3fs.

Michael - sqlbot
  • Although it does not answer the question that was actually asked, if I had multiple terabytes of data that I wanted to load into S3, the recently announced [Amazon Import/Export Snowball](https://aws.amazon.com/importexport/) would be what I'd pitch the client *hard* for the initial data load. A 50 TB SAN shipped to their door, load it up, ship it back, Amazon loads the data, at a price far lower than the bandwidth to transfer the data. – Michael - sqlbot Oct 27 '15 at 21:41
  • Have you any experience running a web based GUI on top of this FTP setup? If yes, what has worked or been problematic? – T. Brian Jones Apr 13 '16 at 16:31
  • @T.BrianJones my inclination is usually to avoid GUIs as I prefer the clear and obvious behavior that manually-editable configuration files usually offer. For my setups, I have a custom script called `setupftpuser` that calls ProFTPd's `ftpasswd` utility to create users, create home directories, and set permissions. It also backs up the password file before making changes. If called on an existing user it tells you the user is already provisioned, and asks if you'd like to change the password. A GUI that manages essentially the same process should be fine if it's well-written. – Michael - sqlbot Apr 13 '16 at 16:38
5

Check out the SFTP Gateway on the AWS Marketplace.

We experienced reliability issues with s3fs, so we developed a custom solution specifically for this purpose. We've been using it in production for several years without issue and have recently released it to the AWS Marketplace.

Jeff
  • do note that this is one-way (uploading to sftp stores the file to s3, but the file can no longer be downloaded from the sftp). Also, putting files in s3 does not make them available through sftp. – Vincent De Smet Aug 23 '17 at 19:19
  • Just to clarify... SFTP Gateway does have a "download" directory as well that syncs from S3 back down to the SFTP server. By keeping uploads and downloads separate, you, as the admin, have complete control over what people can upload and download. – Jeff Mar 23 '18 at 13:24
  • is this a newly added feature? Certainly didn't exist when this comment was posted pretty much a year ago – Vincent De Smet Mar 23 '18 at 22:23
  • Yes, it was a feature added after this original post. We are actively maintaining it and continue to add new features like server side encryption support and shared downloads. – Jeff Mar 24 '18 at 23:38
1

There are two options. You can use a native managed SFTP service recently added by Amazon (which is easier to set up). Or you can mount the bucket to a file system on a Linux server and access the files over SFTP like any other files on the server (which gives you greater control).

Managed SFTP Service

  • In your Amazon AWS Console, go to AWS Transfer for SFTP and create a new server.

  • On the SFTP server page, add a new SFTP user (or users).

    • The users' permissions are governed by an associated AWS IAM role (for a quick start, you can use the AmazonS3FullAccess policy).

    • The role must have a trust relationship to transfer.amazonaws.com.

For details, see my guide Setting up an SFTP access to Amazon S3.
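
Roughly the same setup can be sketched with the AWS CLI (the server ID, role ARN, bucket, user name, and key below are placeholders, not values from the guide):

# Sketch: create a managed SFTP endpoint and one user with the AWS CLI
aws transfer create-server --identity-provider-type SERVICE_MANAGED

aws transfer create-user \
    --server-id s-1234567890abcdef0 \
    --user-name partner-upload \
    --role arn:aws:iam::123456789012:role/sftp-s3-access \
    --home-directory /example-bucket/uploads \
    --ssh-public-key-body "ssh-rsa AAAA... partner@example.com"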

Mounting Bucket to Linux Server

As @Michael already answered, just mount the bucket using the s3fs file system (or similar) on a Linux server (Amazon EC2) and use the server's built-in SFTP server to access the bucket.

Here are basic instructions:

  • Install s3fs

  • Add your security credentials in the form access-key-id:secret-access-key to /etc/passwd-s3fs

  • Add a bucket mounting entry to /etc/fstab:

      <bucket> /mnt/<bucket> fuse.s3fs rw,nosuid,nodev,allow_other 0 0
    

For details, see my guide Setting up an SFTP access to Amazon S3.
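
As a rough sketch of those steps on a Linux server (the package name varies by distribution; the bucket name and credentials are placeholders):

# Sketch of the steps above; bucket name and credentials are placeholders
sudo apt-get install -y s3fs                      # or: sudo yum install s3fs-fuse

echo 'AKIAXXXXXXXXXXXXXXXX:your-secret-access-key' | sudo tee /etc/passwd-s3fs
sudo chmod 600 /etc/passwd-s3fs                   # s3fs rejects world-readable credentials

sudo mkdir -p /mnt/example-bucket
echo 'example-bucket /mnt/example-bucket fuse.s3fs rw,nosuid,nodev,allow_other 0 0' \
    | sudo tee -a /etc/fstab
sudo mount /mnt/example-bucket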

Use S3 Client

Or use any free "FTP/SFTP client" that is also an "S3 client", and you do not have to set up anything on the server side. For example, my WinSCP or Cyberduck.

Martin Prikryl
0

AWS now provides an SFTP over S3 service called AWS Transfer for SFTP. It has the benefits of S3 (highly durable, available, distributed storage) combined with the well-known and established SFTP protocol.

By default, users authenticate using private/public key pairs, and using IAM policies you can set up permissions for SFTP users on S3 buckets. You can add custom authentication schemes by implementing your own functionality with AWS API Gateway and AWS Lambda.
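
For example, pointing a server at a custom identity provider looks roughly like this (the API Gateway URL and invocation role are placeholders):

# Sketch: create a Transfer for SFTP server backed by a custom identity provider
aws transfer create-server \
    --identity-provider-type API_GATEWAY \
    --identity-provider-details Url=https://abc123.execute-api.us-east-1.amazonaws.com/prod,InvocationRole=arn:aws:iam::123456789012:role/transfer-auth-invocation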

We've wrapped AWS Transfer for SFTP in a Heroku add-on called SFTP To Go to both provide flexible authentication schemes and lower TCO (a service endpoint has a fixed cost on AWS, but can be shared by many users without any security or performance compromise).

SNeumann