This is a multi-part question, but all the parts really come back to the main question:
(Current version, vanilla main-branch distro - though I'm open to hearing about others). After much searching and researching, I have not come up with an awful lot...
My scenario is a service provider that hosts and processes large amounts of data from several big corporate customers (multi-tenancy). These customers do not access Hadoop directly, but only through the SaaS application. However, these customers are often direct competitors, and often quite paranoid (justifiably, since each would probably be happy to engage in some corporate espionage against the others...).
My old-school, knee-jerk reaction is to deploy individual, isolated instances for each customer. However, this is not practical, nor does it let me take advantage of Hadoop's benefits and capabilities.
Also, I find it hard to believe that with all the big users of Hadoop, there are no good solutions for these issues...
Particularly, I'm looking at these issues:
- Limiting access to the specific users used by each application (one application user per customer)
- Encryption
- Isolation between customers, i.e. not allowing one customer to view another's data.
- General hardening recommendations
I've managed to come up with a few directions, but haven't been able to verify that these are good directions, or if there are better solutions.
- Service level authorization
- Network/system isolation, to prevent anyone but the application from direct access
- File / folder permissions, per application user (i.e. customer).
Problems I've found with this approach:
- Permissions are only applied at the NameNode; direct access to DataNodes would still provide access.
- Authentication is a bit "iffy", at least until they add in Kerberos support (after that, we'll have to see how the implementation turns out...)
- It seems to me that this doesn't provide enough isolation between customers.
- HDFS Federation / Namespaces
This might be able to provide better isolation of privileges, not to mention separate servers and allocated bandwidth per customer (to prevent one trying to DoS another via the NameNode single point of failure). But I haven't found any real information on real-world usage, or how it stands up to misuse.
Also, this doesn't solve the issue of soft authentication (does it?), or of direct DataNode block access (does it?).
- For data encryption, I'm torn between HDFS encryption (a single symmetric key shared between ALL nodes) and application-level encryption (where the key, or keys, say one per customer, would still need to be distributed to each task node for MapReduce jobs).
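To make the application-level option concrete, here is a rough sketch of what I mean by one key per customer. It uses only the JDK's `javax.crypto` (AES-GCM) and nothing from Hadoop. The class name `TenantCrypto`, the tenant IDs and the record layout (IV followed by ciphertext) are all made up for illustration, and it deliberately ignores the hard part, which is how the keys reach the task nodes.

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;

// Hypothetical helper: encrypts each record with the owning customer's key
// before it is written to HDFS. Not part of Hadoop.
public class TenantCrypto {
    private static final int IV_BYTES = 12;   // usual IV size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RNG = new SecureRandom();

    // One AES key per customer; in practice this would come from a key store.
    public static SecretKey newTenantKey() throws GeneralSecurityException {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(128);
        return gen.generateKey();
    }

    // Output layout (an assumption of this sketch): IV || ciphertext+tag.
    // The tenant ID is bound to the record as additional authenticated data.
    public static byte[] encrypt(SecretKey key, String tenantId, byte[] plain)
            throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        RNG.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        c.updateAAD(tenantId.getBytes(StandardCharsets.UTF_8));
        byte[] ct = c.doFinal(plain);
        byte[] out = new byte[IV_BYTES + ct.length];
        System.arraycopy(iv, 0, out, 0, IV_BYTES);
        System.arraycopy(ct, 0, out, IV_BYTES, ct.length);
        return out;
    }

    // Fails with AEADBadTagException if the key or the tenant ID is wrong.
    public static byte[] decrypt(SecretKey key, String tenantId, byte[] record)
            throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key,
               new GCMParameterSpec(TAG_BITS, record, 0, IV_BYTES));
        c.updateAAD(tenantId.getBytes(StandardCharsets.UTF_8));
        return c.doFinal(record, IV_BYTES, record.length - IV_BYTES);
    }

    public static void main(String[] args) throws Exception {
        SecretKey keyA = newTenantKey();
        SecretKey keyB = newTenantKey();
        byte[] record = encrypt(keyA, "customer-a",
                "secret figures".getBytes(StandardCharsets.UTF_8));

        // Customer A's key recovers the record.
        System.out.println(new String(
                decrypt(keyA, "customer-a", record), StandardCharsets.UTF_8));

        // Customer B's key cannot, even with direct access to the stored bytes.
        try {
            decrypt(keyB, "customer-a", record);
        } catch (AEADBadTagException e) {
            System.out.println("customer-b key rejected");
        }
    }
}
```

With something like this, a block read straight off a DataNode is only ciphertext, but any task node running a job for a customer still needs that customer's key, which is exactly the distribution problem I'm unsure about.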