Access control for big data analytics needs policy-based security that includes context as well as users and roles
Security for big data analytics is challenging. Here’s why: When you can’t analyze data in place, you need to copy it, at which point all the stipulations about who can see or change which data, and under what circumstances, should be replicated, too. Today, that’s nearly impossible to do.
On the Hadoop/Spark side, we have only role-based, limited access control lists (ACLs) or the Wild West. But I believe there’s a way forward: Adopt the policy-based approach that has arisen in the broader security market. To explore how that could work, we need to revisit the history of access control and how it evolved to produce a policy-based model.
In the beginning, there were usernames and passwords to keep out everyone who might want in, despite what Richard Stallman said.
There was an inherent problem with this system. The number of user/password combinations tended to explode as new applications were written, so we ended up with a different user/password for each application. Worse, some applications asked for different passwords to reach different levels of security.
We became smarter and divided up “roles” from usernames. We’d have one “user/password,” but to access the administrative functions, that user/password would also need an “admin” role, for example. However, each application tended to implement this on its own, so you still had a growing list of passwords to remember.
We became even smarter and created central systems that eventually became LDAP, Active Directory, and the like. These united the user/password in a core repository and established one place to look up the roles for a given user. But this replaced one problem with another.
In an ideal world, each new application looks at the list of roles in Active Directory and maps them to application roles, so there’s a clean, one-to-one relationship. In reality, most applications think of roles differently, and besides, simply because you’re an admin for one application doesn’t mean you should be an admin for another. In the end, you’ve replaced an explosion of user/password combinations with an explosion in the number of roles.
Which raises the question: Who ends up in charge of adding new roles? It tends to be either some IT-administrative or shared-HR function. Since there’s a good chance none of those people with the menial task of adding roles will actually understand the application very well, this usually ends up being a “manager approval” or “rubber stamp,” and that isn’t, as they say, good.
Many applications still punt on the question of roles by using AD for authentication and having the application handle its own local role implementation. There’s a lot to be said for this approach, because it’s clearly the application administrator who knows who should have what level of access.
Meanwhile, there are clear rules that do not cleanly fit into a user/role system. At its simplest, being a banking customer doesn’t mean I can withdraw money from any account, even if I have the “canWithdraw” role. Roles often need to be associated with data, which is why we have ACLs that map to entries in our data store. That is, account 1234 has an association that identifies me as its owner and my spouse as an authorized account administrator.
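To make the role-plus-ACL idea concrete, here is a minimal sketch in Python. The `Account` class, the `can_withdraw` function, and the permission names are hypothetical, invented for illustration; only the “canWithdraw” role and account 1234 come from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Account:
    number: str
    # ACL: maps a user name to the permissions that user holds on THIS account.
    # (Hypothetical structure; real systems store this alongside the record.)
    acl: Dict[str, Set[str]] = field(default_factory=dict)


def can_withdraw(user: str, roles: Set[str], account: Account) -> bool:
    """Permit a withdrawal only if the role AND a per-account ACL entry agree."""
    has_role = "canWithdraw" in roles
    has_acl_entry = "withdraw" in account.acl.get(user, set())
    return has_role and has_acl_entry


# Account 1234: I am the owner, my spouse is an authorized administrator.
account_1234 = Account(
    number="1234",
    acl={
        "me": {"owner", "withdraw"},
        "spouse": {"admin", "withdraw"},
    },
)
```

The role alone answers “may this user withdraw at all?” while the ACL answers “from this account?” Both checks have to pass, which is why another customer holding the same role still gets refused.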
However, some businesses have rules that are more complicated than “is this yours?” or “what permissions do you have on this record?” Instead, they use what you might call “contextual” or “policy-based” security rules. In other words, I might have permission to withdraw money only while I’m within the continental United States. There’s no way to express this in an ACL or role-based model. Instead, we’ve crossed over into policy-based security.
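A contextual rule like the withdrawal example might be sketched as follows. This is an illustration under stated assumptions, not how any particular product works: the `Request` class, the `region` context key, and the `"continental-us"` value are all hypothetical, and a real system would get the location from some trusted source rather than from the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Request:
    user: str
    action: str
    resource: str
    # Context attributes gathered at request time (hypothetical keys).
    context: Dict[str, str] = field(default_factory=dict)


# A rule inspects the whole request, including its context.
Rule = Callable[[Request], bool]


def withdrawals_only_in_continental_us(req: Request) -> bool:
    if req.action != "withdraw":
        return True  # the rule does not constrain other actions
    # A missing or unknown region denies the withdrawal.
    return req.context.get("region") == "continental-us"


POLICY: List[Rule] = [withdrawals_only_in_continental_us]


def is_permitted(req: Request, policy: List[Rule] = POLICY) -> bool:
    """Permit the request only if every rule in the policy allows it."""
    return all(rule(req) for rule in policy)
```

The decision here depends on an attribute of the request itself, not on who the user is or what the record says, which is exactly what a role or an ACL entry cannot capture.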
Why you should care about advanced security
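Clearly, using policies in the style of attribute-based access control (ABAC), expressed in XACML (eXtensible Access Control Markup Language), is a hefty step up from role-based access control (RBAC). You should have the motivation to do this, if only to avoid a big, fat $100 million fine. I mean, $100 million here and $100 million there, and before long it adds up to real money.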
Also, some organizations have complex rules and ownership of data. As these companies increasingly move to become data-driven and can’t analyze everything in place, but instead require centralization, they’ll need a system that goes beyond the common RBAC models of today. Moreover, to make that feasible, they’ll need tagging and libraries that allow them to apply policies expressed in something like XACML as well as the tools to manage the policy centrally while applying it locally where meaningful.
When we look at today’s big data offerings, such as Ranger and Sentry, nothing comes close to answering this call. Even solutions for RDBMS-based systems tend to be proprietary, expensive, and often incomplete. Organizations doing high security with complex security rules are forced to implement this on their own. Heck, data tagging tools are still in their infancy for big data systems like Hadoop.
In other words, there's a big opportunity here for the vendors who can figure it out. Clearly, the defense industry is the first customer, because it's already doing it out of necessity. As more companies create central data repositories for big data analysis, the need for policy-based security is only going to grow.