When Hadoop started, it had a security problem. The spin from the various Hadoop vendors and proponents tended to be something like, "We see security as a front-end application issue." This is what you say when you don't have a good answer.

Since then, solutions like Apache Knox and Cloudera Manager have provided answers for authentication and authorization of basic database management functions. The underlying Hadoop Distributed File System (HDFS) now incorporates Unix-like permissions.

This hasn't completely quashed the issue, largely because of the way entrepreneurs think: If you can't come up with a new idea, then plunk the S-word after the name of a new technology and you have a "BOLD IDEA FOR A NEW STARTUP!!!!" Rummage through the dustbin of recent history and you'll find startups devoted to SOA security, AJAX security, open source security, and so on. Now we have big data security startups -- and the money will roll right in! How do you launch a security startup? Scare people, of course.

The real security problem with Hadoop in particular and big data in general isn't with everyday access rights -- that took all of 10 minutes for the vendors and open source community to solve. The big problem is that when you aggregate a lot of data, you lose context. While I doubt many people are aggregating a lot of data without any context, aggregating any data means losing some context. A highly scalable architecture like Hadoop makes it feasible to store context, too, but checking all that context with each piece of data is an expensive proposition.

Here's what you need to know about context: Though you learn all about authentication and authorization in any basic computer science course, the most important details are often skirted. Yes, you can get access to the database as a certain user, and yes, you can get access to the BankAccounts table, but which rows can you access? The more data you aggregate, the harder it becomes to preserve those granular rights and permissions.
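To make the row-level question concrete, here's a minimal sketch of the idea -- table-level access has already been granted, and the remaining check is per row. All of the names here (the table, the `auditor` role, the `owner` field) are hypothetical, not from any particular product:

```python
# Row-level filtering sketch: the user has already passed database- and
# table-level authorization; the question left is which rows they may see.
# All names here are hypothetical illustrations.

BANK_ACCOUNTS = [
    {"account_id": 1, "owner": "alice", "balance": 1200},
    {"account_id": 2, "owner": "bob",   "balance": 450},
    {"account_id": 3, "owner": "alice", "balance": 90},
]

def visible_rows(user, roles, table):
    # Per-row rule: a user sees a row if they own it, or if they
    # hold the 'auditor' role (which may see everything).
    return [row for row in table
            if row["owner"] == user or "auditor" in roles]

print(len(visible_rows("alice", set(), BANK_ACCOUNTS)))       # 2
print(len(visible_rows("carol", {"auditor"}, BANK_ACCOUNTS))) # 3
```

The expense the article describes is exactly this: once such a rule exists, every scan of aggregated data has to evaluate it for every row, not just check a table-level grant once.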

How do you keep all of those data ownership and data context rules in place without killing the performance that caused you to choose a big data solution in the first place? Well, there are emerging technology solutions, such as Accumulo, created by the big data community -- including everyone's favorite member, the NSA.
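Accumulo's approach is cell-level security: each key/value pair carries a visibility expression (such as `finance&admin`) that is checked against the reader's authorizations on every scan. Here is a toy Python sketch of that model -- a simplified illustration only, not Accumulo's actual Java API (real visibility expressions also support parentheses and are enforced server-side):

```python
# Toy illustration of Accumulo-style cell-level visibility.
# Simplified: a visibility string is a '|' of '&'-clauses; the reader
# needs every label in at least one clause. Empty visibility = public.

def can_see(visibility, authorizations):
    if not visibility:
        return True
    return any(all(label in authorizations for label in clause.split("&"))
               for clause in visibility.split("|"))

# Hypothetical cells: (row, column, visibility, value)
cells = [
    ("acct:1", "balance", "finance",        1200),
    ("acct:2", "balance", "finance&admin",  450),
    ("acct:3", "note",    "",               "public note"),
]

def scan(authorizations):
    # Filter every cell through the reader's authorizations -- this is
    # the per-datum context check the article calls expensive.
    return [c for c in cells if can_see(c[2], authorizations)]

print(len(scan({"finance"})))           # 2: acct:1 plus the public note
print(len(scan({"finance", "admin"})))  # 3: all cells visible
```

The design point is that the access rule travels with each cell rather than living only at the table level, which is how the context survives aggregation -- at the cost of evaluating it on every read.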

Luckily, all of this has been thought through before, and in great detail. In fact, almost exactly a decade ago it was a hot topic. When you're building a big data project that aggregates gobs of data from various places in the company and wondering about security, I suggest simply searching on "data warehouse security." Though 70 percent of the results will be vendor pitches or complaints about RBAC, you'll find plenty that explain exactly how this was done before. Much of that previously published material describes neither technologies nor tools but methodologies -- and those translate more or less directly to big data.

Andrew C. Oliver is a professional cat herder who moonlights as a software consultant. He is president and founder of Mammoth Data (formerly Open Software Integrators), a big data consulting firm based in Durham, N.C.