
This is a multi-part question, but all the parts really come back to the main question:

How best to protect data in a Hadoop (wikipedia) cluster.

(Current version, vanilla main-branch distro - though I'm open to hearing about others.) After much searching and research, I have not come up with an awful lot...

My scenario is a service provider that hosts and processes large amounts of data from several big corporate customers (multi-tenancy). These customers do not access the Hadoop cluster directly, only through the SaaS application. However, these customers are often direct competitors, and often quite paranoid (justifiably, since each would probably be happy to do a bit of corporate espionage against the others...).

My old-school, knee-jerk reaction is to deploy individual, isolated instances for each customer. However, this is not practical, nor does it allow us to take advantage of Hadoop's benefits and capabilities.
Also, I find it hard to believe that, with all the big users of Hadoop, there are no good solutions to these issues...

Particularly, I'm looking at these issues:

  • Limiting access to the specific users used by the application on behalf of each customer (one application user per customer)
  • Encryption
  • Isolation between customers, i.e. not allowing one customer to view another's data.
  • General hardening recommendations

I've managed to come up with a few directions, but haven't been able to verify that these are good directions, or if there are better solutions.

  • Service level authorization
  • Network/system isolation, to prevent anyone but the application from direct access
  • File / folder permissions, per application user (i.e. customer) - a rough sketch of this follows the list.
    Problems I've found with this approach:
    • Permissions are only applied at the NameNode; direct access to DataNodes would still provide access.
    • Authentication is a bit "iffy", at least until they add in Kerberos support (after that, we'll have to see regarding the implementation...)
    • It seems to me that this doesn't provide enough isolation between customers.
  • HDFS Federation / Namespaces
    This might be able to provide better isolation of privileges, not to mention separate servers and allocated bandwidth per customer (to prevent one customer from trying to DoS another via the NameNode single point of failure). But I haven't found any real information on real-world usage, or how it stands up to misuse.
    Also, this doesn't solve the issue of soft authentication (does it?), or of direct DataNode block access (does it?)
  • For data encryption, I'm torn between HDFS encryption (a single, symmetric key shared between ALL nodes) and application-level encryption (where the key - or keys, say one per customer - would still need to be distributed to each Task Node for MapReduce jobs). The second sketch below illustrates the application-level option.
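
For the per-customer file/folder permissions direction, here is a minimal sketch of what provisioning an owner-only home directory per application user could look like, using the Hadoop FileSystem Java API. The paths and user names (/data/cust_acme etc.) are purely hypothetical, and these permissions are only as strong as the authentication behind them (see the Kerberos caveat above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class CustomerDirSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical per-customer application users under a common data root
        String[] customers = { "cust_acme", "cust_globex" };
        for (String user : customers) {
            Path home = new Path("/data/" + user);
            fs.mkdirs(home);
            // Owner-only access: no other application user can list or read this subtree
            fs.setOwner(home, user, user);
            fs.setPermission(home, new FsPermission((short) 0700));
        }
        fs.close();
    }
}
```

The same could of course be done with hadoop fs -chown / -chmod from a provisioning script; the point is simply that each customer's application user owns its own subtree with 700 permissions.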
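And for the application-level encryption option, a rough sketch of what per-customer encryption before writing to HDFS might look like, using plain javax.crypto (AES/CBC chosen here only as an example cipher). The key handling is deliberately naive - in reality the per-customer key would come from some key store, and the distribution problem mentioned above remains: whatever MapReduce task needs to read the data has to get hold of the key on the Task Nodes somehow:

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EncryptedWrite {
    public static void main(String[] args) throws Exception {
        // One AES key per customer; generated ad hoc here only for illustration
        SecretKey customerKey = KeyGenerator.getInstance("AES").generateKey();

        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, customerKey, new IvParameterSpec(iv));

        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/data/cust_acme/report.enc"));
        out.write(iv);                                  // store the IV alongside the ciphertext
        CipherOutputStream cout = new CipherOutputStream(out, cipher);
        cout.write("sensitive customer data".getBytes("UTF-8"));
        cout.close();                                   // also closes the underlying HDFS stream
        fs.close();
    }
}
```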
AviD

1 Answer


It really depends on who you're trying to protect your data from.

I have set up Hadoop clusters that use eCryptfs on each node, so that data can be transparently shared between nodes while still being encrypted before it is ever written to disk. This provides a meaningful level of privacy and protection if you're trying to secure data against physical theft of disks, or against exposure of the remote network storage underlying virtual machines in a cloud environment.

Full disclosure: I'm one of the authors, and the current maintainer of eCryptfs userspace utilities.

Dustin Kirkland
  • Thanks! Of course, this only deals with the encryption part, do you have any info on the other issues? – AviD Feb 20 '12 at 09:38
  • Using eCryptfs (this is not the default?), symmetric keys would still have to be distributed to all the nodes, as I noted in the question - is this correct? Is there any way around that, or at least to mitigate this? – AviD Feb 20 '12 at 09:39
  • @AviD, I only have authoritative experience and information about the encryption part of this question, sorry. – Dustin Kirkland Feb 20 '12 at 15:19
  • @AviD, actually, the keys do *not* necessarily need to be distributed among nodes, if each individual node *already* has eCryptfs mounted. Point Hadoop at the mounted, cleartext data in the upper eCryptfs mount on each node, and Hadoop will proceed with no special knowledge of eCryptfs. The lower part of the eCryptfs mount will ensure that only encrypted data gets written to disk. – Dustin Kirkland Feb 20 '12 at 15:22
  • Thanks @Dustin, so as I understand it, eCryptfs would be completely transparent to HDFS. All communication with the DataNodes would be unencrypted (or at least, independent of the eCryptfs encryption), same with replication, and each DataNode would encrypt independently of other DataNodes (and with different keys). In effect, as far as Hadoop is concerned, there *is no encryption*, and the actual storage device just happens to encrypt all data. Is this correct? What kind of overhead is there for this setup - performance, storage, M/R jobs, etc. – AviD Feb 20 '12 at 17:03
  • @AviD, your synopsis is basically correct. For information on overhead/performance, see: http://askubuntu.com/questions/100752/how-does-ecryptfs-impact-harddisk-performance – Dustin Kirkland Feb 20 '12 at 19:11