3

I've consulted this question, but I'd like to hear some more input.

I'm building a scraping app that will act as an aggregator of sorts for a large number of businesses that use a few tracking/staffing/inventory management systems popular in that industry. To access those systems and scrape the info, we need the businesses' usernames and passwords, and those credentials will be accessed a lot.

What I've surmised from my research is that the optimal route would be to set up a separate node on AWS whose sole purpose is to house the passwords, and make it accessible only internally and only to the node that will be doing the work.

I will be providing an API: ask for a scrape and ye shall receive the results later (results not actually being sensitive).

Now, is there a way to add a password on top of all that, or is that more or less enough? This will not be an application that's known far and wide, but I'm concerned as losing the auth info would be a catastrophe for these businesses.

dsp_099
  • Just checking: 1) Do you need their usernames and passwords at any time, not only when they are logged in/using the system? 2) Do these applications not have (and can't offer) an authorization API you could use to get a token or something like that to get the info you need? (That would be the best way to do it.) – CristianTM Jan 08 '16 at 13:08
  • @CristianTM - 1) Yes, I need them all the time because 2) access to these sites will be continuous and automated (by another app; I wish it weren't so, but it is). It seems like the authorization would work, but it'd be the same every single time, so I don't really see the point – dsp_099 Jan 08 '16 at 13:17
  • The point of 2 is that tokens a) can be revoked if there is a compromise, or at any time the user wants to deny access, without losing the password, and b) can be renewed/replaced on each use or at an interval, making the leak of an old token harmless. But the sites must provide such an API; if they don't and are not willing to provide one, it's not an option. – CristianTM Jan 08 '16 at 13:21
  • Well, I'm building it, so I can make that a requirement. I think a convincing case can be made for having it. Could you explain more about how to organize something like that? – dsp_099 Jan 08 '16 at 13:27
  • The token API must be provided by the sites you are gathering info from. If I understand, you do not control those, right? – CristianTM Jan 08 '16 at 13:30
  • @CristianTM Ah I misread. No, those sites are out of reach. We only have credentials of their users – dsp_099 Jan 08 '16 at 13:34

3 Answers

1

There are off-the-shelf secure solutions for this.

You could store the passwords in a secrets manager like HashiCorp Vault or Thycotic Secret Server and use their APIs to retrieve and use them as needed for scraping.
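As a rough illustration of the retrieval step, here is a minimal Python sketch that reads one site's credentials from Vault over its HTTP API. It assumes a KV version 2 secrets engine mounted at `secret` and plain token authentication; the address, token, and the `scraper/acme` path are hypothetical placeholders, and the function names are mine rather than part of any Vault client.

```python
import json
import urllib.request


def build_vault_request(vault_addr, token, secret_path, mount="secret"):
    """Build a GET request for a KV version 2 secret (nothing is sent yet)."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/data/{secret_path.strip('/')}"
    # Vault reads the client token from the X-Vault-Token header.
    return urllib.request.Request(url, headers={"X-Vault-Token": token})


def parse_kv2_response(body):
    """Extract the stored key/value pairs from a KV v2 read response."""
    payload = json.loads(body)
    # KV v2 nests the secret under data.data; data.metadata holds version info.
    return payload["data"]["data"]


def read_credentials(vault_addr, token, secret_path):
    """Fetch one site's credentials from Vault (assumed reachable over HTTPS)."""
    request = build_vault_request(vault_addr, token, secret_path)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_kv2_response(response.read())
```

The scraper would call something like `read_credentials("https://vault.internal:8200", token, "scraper/acme")` right before logging in to a site, so the password never has to sit in the scraper's own storage.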

Rolling your own is also an option. If you are using AWS and have Java programmers on staff, adapting the example from this blog about using Amazon KMS with DynamoDB is one straightforward way to do that.

Alain O'Dea
0

Based on the discussion and your requirements, and given that the sites you will gather info from do not and cannot provide an adequate token API for accessing the data you need, your best option is to really harden that server and to have a good contingency plan. You should also let your users know you are going to store their passwords.

You could, depending on how much you can spend to protect this data, use some crypto hardware (an HSM, a TPM chip, or even a smartcard/token if usage will not be too high) to encrypt and decrypt this data on the server. What you gain is that a simple dump of the data will not leak anything: the attacker must both dump the data and be able to control the server in order to use the HSM/TPM/token/smartcard.

Using a key stored in software (on disk) is better than nothing, but it only makes things a little harder for an attacker, who will look for the key as soon as he sees that there is encrypted data.

Generating a key derived with PBKDF2 from some data in your code, some data on disk, and data from any other place you can store it on the server will also make things a little more difficult for the attacker, but it may only give you some time to act and warn your users if something happens.
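To make the PBKDF2 idea concrete, here is a small sketch using Python's standard `hashlib.pbkdf2_hmac`. The function name, the two-part split, and the iteration count are my assumptions for illustration; the derived key would then be handed to whatever cipher encrypts the stored passwords, which is not shown here.

```python
import hashlib


def derive_key(code_secret, disk_secret, salt, iterations=600_000):
    """Derive a 32-byte key from material kept in separate places.

    code_secret: bytes embedded in the application code (assumption)
    disk_secret: bytes read from a file on the server (assumption)
    salt:        random bytes stored alongside the encrypted data
    """
    # Length-prefix each part so (b"ab", b"c") and (b"a", b"bc") differ.
    parts = (code_secret, disk_secret)
    material = b"".join(len(p).to_bytes(4, "big") + p for p in parts)
    return hashlib.pbkdf2_hmac("sha256", material, salt, iterations, dklen=32)
```

An attacker who controls the server can still gather every input and repeat the derivation, so this only buys time, as noted above.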

CristianTM
0

What I've surmised from my research is that the optimal route would be to set up a separate node on AWS whose sole purpose is to house the passwords.

Assuming you don't want or need some SQL or other data management to manage auth credentials, I don't think you need a full AWS node (and attendant management and security concerns) to manage that data securely.

I would consider a basic properties file hosted in a secured, encrypted S3 bucket with a very limited access policy, available only to your app instance(s), which have been configured with the appropriate IAM role.

Using this policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject"
      ],
      "Sid": "Stmt0123456789",
      "Resource": [
        "arn:aws:s3:::<your S3 bucket name>/credentials.properties"
      ],
      "Effect": "Allow"
    }
  ]
}

Only instances whose IAM role has that policy attached can read that data. It's simple, low cost, and, with S3 server-side encryption, quite secure.
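On the application side, reading that file could look roughly like the Python sketch below. It assumes the properties file uses plain `key=value` lines and that the S3 client is a boto3 client created on the instance (so it picks up the IAM role automatically); the helper names and the `acme.username` style of keys are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def load_credentials(s3_client, bucket, key="credentials.properties"):
    """Fetch and parse the properties object via the given S3 client."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return parse_properties(response["Body"].read().decode("utf-8"))
```

With boto3 installed, the call would be along the lines of `load_credentials(boto3.client("s3"), "<your S3 bucket name>")`. Passing the client in keeps the parsing logic testable without AWS access.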

For more information on S3 server-side encryption options see Protecting Data Using Server-Side Encryption with Customer-Provided Encryption Keys

Rodrigo Murillo