An attractive application of Hadoop and other Big Data technologies is to analyze users' social activities, sometimes without their express knowledge

The now-trendy concept of Big Data usually implies ever-growing hordes of data, including unstructured info posted on Facebook and Twitter, and ways of gleaning intelligence from all of it to create business opportunities. The concept, however, also carries with it risks for anyone opening up about themselves on the Internet and raises questions about who exactly owns all this data.

Big Data is associated with technologies such as the Apache Hadoop distributed computing platform and is prompting some technology companies, including IBM, to make major acquisitions. But the term "Big Data," claims GigaOm analyst Derrick Harris, is a bit of a misnomer; it's really about data from different sources, including social networks and even cell phones. "It's coming from sensors, it's coming from computers, it's coming from the Web," he says.

The strong interest by both IT and business units in Big Data is "about being able to harness it, and it's about being able to do something with it" -- in essence, analyzing it, says Harris. "The great thing about Big Data is we accumulate this amount of information and we have systems in place where we can use that for good," such as analyzing human genome information or making government data available, says Mozilla developer evangelist Christian Heilmann. Business analysts can study large data sets by renting servers for an hour, using technology such as Hadoop, he says.

Mining the social networks' Big Data

Companies such as Echo and Cloudera are seeking their niche in the Big Data and social network data spaces. "The Big Data play right now for these big multi-million-dollar companies is around activity data," says Chris Saad, vice president of strategy at Echo. Both enterprise IT and individual users are sure to see a growing menu of Big Data services available as data gathering grows in prominence.

For example, serving ventures such as media companies and ad agencies, Echo StreamServer pulls in social media data relevant to a client into a single stream. Echo, which cites companies such as Reuters as customers, captures data about clients on sites such as Twitter and Facebook, as well as from the client's own sites. The clients can then create real-time experiences out of the data, Saad says. Clients get a "big unified data set" to develop applications such as forums and live blogging.

Cloudera offers its own distribution of Hadoop that serves as a platform for data management, and its Cloudera Enterprise provides large-scale data storage and analysis. Amr Awadallah, Cloudera's CTO, says the Hadoop distribution enables organizations to collect and combine social data and store it in a centralized data store. Users can then run MapReduce jobs to analyze this data for insight and factors such as new relationships.

But who owns the data?

Mozilla's Heilmann views Big Data as any information accumulated on the Web -- any real-time data. But who specifically owns this data? "That's a very loaded conversation," he says.

"I think it's dangerous right now that the speed and beauty of these interfaces [on sites such as Facebook] make people give information away without realizing that they have done it," Heilmann says. For example, people can upload photos of themselves intoxicated and a potential employer can view them for at least some time afterward.

"You have a real problem deleting anything from the Internet," Heilmann stresses. "As soon as you put it up there, it will be cached, it will be copied somewhere else. You should be very mature about what you put online."

GigaOm's Harris says ownership of the data depends on circumstances. "Certainly, the companies generating it own the data," he says.

Although there is publicly owned data on the Web, Facebook and Twitter, for their part, own the data their users generate, Harris notes. And Big Data concepts such as data marketplaces have resulted, for example, in firms analyzing Twitter streams for a month at a time, Harris says. "There's a lot [of data] that's just available out there if you could harness it" and analyze it.

Cloudera's Awadallah says the question of who owns unstructured data is a hard one to answer. Data such as customer purchasing information in Apple's App Store belongs to Apple, he says. And although Google gives users the right to delete data, it still owns the data itself, he adds.

Thus, the Data Portability Project for porting of social network data promotes the notion that users own their own data and social networks should make it easier for users to move it around. The effort has produced an initiative that aims to get sites to disclose what users can do with their data once it has been uploaded, says Saad, who in addition to his Echo job is co-founder of the Data Portability Project.

Still, Saad notes that in some cases users share ownership and custody of their data with the online services they use. "It's kind of like money in a bank. You own the money but you are basically giving it to the bank to safeguard for you and potentially use on your behalf," he says.

The issue is not just about privacy. One of the tenets of Big Data is to analyze data from multiple sources to identify trends, business opportunities, market shifts, potential customers, customer sentiment, and a lot more. When Big Data tools analyze information available on the Web, do they really have the right to do so without permission of the owner?

"It really depends," Saad argues. "If you're publishing on the public Internet, I think the social contract is such that people expect their data to be polled and crunched and indexed and used." On the other hand, "it's a little difference when Facebook, for example [is] expected to be a private network and it continues to push the boundary of what part of your information is made public. That's when it's violating the social contract."