Archive

How do machines understand what place you’re talking about when you say the name of a city, a street or a neighborhood? With geocoding technology, that’s how. Every location-based service uses a geocoder to translate the name of a place into a location on a map. But there isn’t a really good, big, stable, public-domain geocoder on the market.

Steve Coast, the man who led the creation of OpenStreetMap, has launched a new project to create what he believes is just what the world of location-based services needs in order to grow to meet its potential. It’s called OpenGeocoder and it’s not like other systems that translate and normalize data.

Google Maps says you can only use its geocoder to display data on maps, but sometimes developers want to use geo data for other purposes, like content filtering. Yahoo has great geocoding technology but no one trusts that it will be around for long. OpenStreetMap (OSM) is under a particular Creative Commons license and “exists for the ideological minority,” says Coast himself in a tweet this week. And so Coast, who now works at Microsoft, has decided to solve the problem himself.

This has been tried before, see for example GeoCommons, but the OpenGeocoder approach is different. It is, as one geo hacker put it, “either madness or genius.”

The way OpenGeocoder works is that users can search for any place they like, by any name they like. If the site knows where that place is, it will be shown on a big Bing map. If it doesn’t, then the user is encouraged to draw that place on the map themselves and save it to the global database being built by OpenGeocoder.

Above: The river of my childhood, which I just added to the map.

Every single way a place can be described must be drawn on the map or added as a synonym before OpenGeocoder will understand what that string of letters and numbers means with reference to place. Anyone can redraw a place on the map, too.

Then developers of location-based services can hit a JSON API or download a dump of all the place names and locations for use in understanding place searches in their own apps. It appears that just under 1,000 places have been added so far. It will take a serious barn-raising to build out a map of the world this way. It wouldn’t be the first time something like this has been done, though.
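To make that concrete, here is a minimal sketch, in Python, of the kind of exact-string-match lookup OpenGeocoder appears to do. There is no parsing or normalization, just a table of user-drawn places keyed by the literal query string, plus a synonym table. The place names, coordinates and function names here are all invented for illustration, not taken from the actual service.

```python
# Hypothetical sketch of an exact-string-match geocoder: a dictionary
# of user-drawn places plus a synonym table. No normalization happens,
# so any unseen spelling is simply a miss.

places = {
    # query string -> bounding box (min_lon, min_lat, max_lon, max_lat)
    "Cully, Portland, Oregon": (-122.62, 45.54, -122.56, 45.58),
}
synonyms = {
    # alternate spellings map to a canonical string
    "Cully,Portland,Oregon": "Cully, Portland, Oregon",
}

def geocode(query):
    """Return a bounding box for an exact string, or None.
    In OpenGeocoder, a miss is saved so anybody can draw and fix it."""
    key = synonyms.get(query, query)
    return places.get(key)

print(geocode("Cully, Portland, Oregon"))  # hit
print(geocode("Cully,Portland,Oregon"))    # hit, via the synonym table
print(geocode("cully portland"))           # miss: no normalization at all
```

The miss on the last query is exactly the behavior described in the hands-on account below: until someone draws that string or registers it as a synonym, the system has no idea what it means.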

“If only it was that simple :(” said map-loving investor Steven Feldman on Twitter. “Maybe it is?”

The approach is focused largely on simplicity. Coast said in his blog post announcing the project:

“OpenGeocoder starts with a blank database. Any geocodes that fail are saved so that anybody can fix them. Dumps of the data are available.

“There is much to add. Behind the scenes any data changes are wikified but not all of that functionality is exposed. It lacks the ability to point out which strings are not geocodable (things like “a”) and much more. But it’s a decent start at what a modern, crowd-sourced, geocoder might look like.”

Testing the site, I grew frustrated quickly. I searched for the neighborhood I live in: Cully in Portland, Oregon. There was no entry for it, so I added one. But there are no street names on the map so I got lost. I had to open a Google Map in the next tab and switch back and forth between them in order to find my neighborhood on the OpenGeocoder map. Then, the neighborhood isn’t a perfect rectangle, so drawing the bounding box felt frustratingly inexact. I did it anyway, saved, then tried recalling my search. I found that Cully,Portland,Oregon (without spaces) was undefined, even though I’d just defined Cully, Portland, Oregon with spaces. I pulled up the defined area, then searched for the undefined string, then hit the save button, and the bounding box snapped back to the default size, requiring me to redraw it again, on a map with no street names. Later, I learned how to find the synonym adding tool to solve that problem.

In other words, the user experience is a challenge. That’s the case with Wikipedia too, and OpenGeocoder just launched, but I expect it will need some meaningful UX tweaks before it can get a lot of traction.

“I’m obsessed with the need for an open-source geocoder, and this is a fascinating take on the problem,” says data hacker Pete Warden about OpenGeocoder. “By doing a simple string match, rather than trying to decompose and normalize the words, a lot of the complexity is removed. This is either madness or genius, but I’m hoping the latter. The tradeoff will be completely worthwhile if it makes it more likely that people will contribute.”

Coast is a giant figure in the mapping world. In 2009, readers of leading geo publication Directions Magazine voted him the 2nd most influential person in the geospatial world, ahead of the Google Maps leadership and behind only Jack Dangermond, the dynamic founder of the 41-year-old, $2 billion GIS company ESRI. Coast will turn 30 years old next month.

The more I play with OpenGeocoder, the more it grows on me. I hope Coast and others are able to put in the time it will take to make it as great as it could be.

Good software developers are hard to find. Startups are all about finding creative solutions to common problems – so why not this one too?

Two startups that have found creative and interesting ways to solve their developer shortage problems are travel photo network Jetpac and mobile app search startup Quixey. Both used contests and games to overcome their challenges and get access to the high-level coding talent they needed. Their efforts may illustrate a part of what people call the gamification of work that’s expected to be a big part of the future.

How Jetpac Built a Photo Quality Algorithm for $5k in 3 Weeks

Jetpac is a young San Francisco startup that asks you to log in with your Facebook account, then it searches through all the photos your friends have uploaded. It looks for photos with the names of places in their captions, then builds a personalized travel photo magazine out of your friends’ pictures.

One member of the founding team is leading data hacker Pete Warden. (Disclosure, Warden told me this story while I was staying at his house on a trip to SF, but it’s such a cool story I’ve been telling it ever since – and it works well with the Quixey story too.)

Warden says that when the team was first showing off its service in a demo, far too many of the photos that came up were terrible. They were blurry, boring, bad photos. It was easy for a human being to look at these photos and know they should be excluded from the collections displayed.

Could a machine be taught to look at new photos and determine whether they were high or low quality? Warden suspected that it was possible, but recognized the limitations of his own knowledge. He didn’t have the machine learning skills to build something himself, much less at the pace the company needed a solution.

Here’s what they did: They looked at 30,000 photos with their human brains and quickly judged whether each was a good or bad photo for a travel magazine experience.

Then they visited the website Kaggle, where data science challenges get turned into contests with prizes that anyone in the world can win. The Jetpac team took all the metadata they had about these 30,000 photos, including the dimensions, and they substituted standardized numbers for words that appeared more than once. They uploaded all that data to Kaggle, but they included the corresponding human judgment of whether a photo was good or bad for only 10,000 of the photos.

The challenge they set up was this: could Kaggle participants, given the human judgments for those 10k photos, write code that analyzed the metadata patterns well enough to accurately guess whether humans would call the other 20k photos good or bad, based only on the metadata available about them?
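The word-substitution step described above can be sketched simply: replace every caption word that appears more than once across the dataset with a standardized integer ID, so contestants can find patterns without seeing the raw text. This is an illustrative reconstruction under stated assumptions, not Jetpac’s actual preprocessing code; the function name and the choice to drop one-off words to 0 are inventions.

```python
# A minimal sketch of anonymizing captions for a contest: words that
# appear more than once get a stable integer ID; singletons, which
# carry no reusable pattern, are collapsed to 0.

from collections import Counter

def encode_captions(captions):
    counts = Counter(w for c in captions for w in c.split())
    ids = {}
    encoded = []
    for c in captions:
        row = []
        for w in c.split():
            if counts[w] > 1:
                row.append(ids.setdefault(w, len(ids) + 1))
            else:
                row.append(0)
        encoded.append(row)
    return encoded

captions = ["sunset in Peru", "trails in Peru", "graduation day"]
print(encode_captions(captions))  # [[0, 1, 2], [0, 1, 2], [0, 0]]
```

Contestants then see that two rows share the pattern `[_, 1, 2]` without ever learning that word 2 is “Peru.”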

The startup put up a tiny $5k bounty, one of the smallest Kaggle had ever hosted, and set a deadline of three weeks.

People loved it. All kinds of computer scientists moonlighting as Kaggle competitors jumped into the fray and wrote algorithms they thought could predict photo quality. They drafted something up, then uploaded their “guesses” for the other 20k photos to Kaggle’s server, then were told what percentage they got right – how often they accurately predicted a person would deem a photo good. Then they changed their code and tried to improve their results.

212 teams, consisting of 418 people, competed for 3 weeks. The contest leaderboard showed the top ten teams all had more than 85% accuracy.

All the algorithms found that there were some words in photo captions that make them far more likely to be connected to a good travel photo than a bad one. Among the best words: Peru, Cambodia, Michigan, tombs, trails and boats. What photo captions are the most likely to signify a bad photo for a travel magazine? San Jose, mommy, graduation and CEO, Warden says.
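The kind of word-level signal the algorithms found can be illustrated with a toy scorer: for each caption word, compute what fraction of its appearances were in photos humans judged good. This is a hedged sketch with an invented labeled set, far simpler than any of the winning entries, but it shows why words like “Peru” and “graduation” end up at opposite ends of the scale.

```python
# Score each caption word by the fraction of its appearances in
# human-judged "good" photos. Words near 1.0 predict good travel
# photos; words near 0.0 predict bad ones.

from collections import defaultdict

def word_scores(labeled):
    good = defaultdict(int)
    total = defaultdict(int)
    for caption, is_good in labeled:
        for w in caption.lower().split():
            total[w] += 1
            if is_good:
                good[w] += 1
    return {w: good[w] / total[w] for w in total}

labeled = [
    ("tombs in Peru", True),
    ("boats and trails", True),
    ("graduation in San Jose", False),
]
scores = word_scores(labeled)
print(scores["peru"], scores["graduation"])  # 1.0 0.0
```

A real entry would combine many such features (words, dimensions, and the rest of the metadata) in a trained model rather than reading off single-word ratios.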

Bo Yang, a USC PhD whose team had just narrowly lost out on winning the Netflix Prize, squeaked out a small improvement in his photo quality algorithm on the very last day to take the top prize. Yang was interviewed by the Kaggle team here.

Part of the Kaggle terms of service is that contest sponsors receive non-exclusive IP rights to the work, so the Jetpac team was able to put code from the contestants directly into their app.

Jetpac’s Warden says of the experience as a startup,

“The two biggest enemies of a startup are lack of money and lack of time. Packaging the data didn’t take as long as we thought and after we uploaded it to the site, all of the details of dealing with the contestants were automated. So it saved us a massive amount of time compared to finding, hiring and explaining our problem to an outside contractor.

“And we would never have gotten anywhere near the quality from the circle of people we know. The short term nature of the project wouldn’t have made it attractive as a project for most – just the overhead of setting up a contract and that sort of stuff. The caliber of people participating in these contests is amazing. They aren’t starving college students, many are highly skilled professionals who make a lot more money than I do, in their day jobs. They do this for fun.”

Jetpac had to think through how to set up the contest, but the Kaggle team helped them, too. It’s hard to imagine a way that such a complex problem could get so much brain power thrown at it so fast and so inexpensively. Warden says the end results have been great.

How Quixey Finds Great Developers with $100, 60 Second Challenges

Quixey, a Silicon Valley app search engine (it’s cool, try it – I found this on it), faces the same struggle to find developers that so many startups do. They have high-profile VC backing (Eric Schmidt of Google, among others) and had been paying $20k per developer hire to traditional recruiters.

Liron Shapira, co-founder and CTO of Quixey, says the company came up with a very elegant solution. Called the Quixey Challenge, it’s a simple contest. If you can find and fix a bug in the code for an algorithm you’re given, in under 60 seconds, the company PayPals you $100.

In order to qualify for the monthly contest, you’ve got to succeed at least 3 times in challenge rounds over the weeks prior to the big event. If you qualify, then the company calls you on Skype and administers the challenge face-to-face. It only lasts 60 seconds. If, in preparation, you succeed 5 times, then the system automatically contacts you to see if you might be interested in working for Quixey.

Shapira says that 38 prizes were awarded in the December challenge, and it resulted in 3 full time hires and 2 intern hires. Winners also receive Quixey Challenge hoodies, which Shapira says can be seen floating around the elite student body of Carnegie Mellon University.

“We’ve had about 5k users sign up and practice and we’ve reached out to 500 or something,” Shapira told me. “Those are incredibly valuable leads to have.”

“We just hired a guy named Marshall who doesn’t have a college degree and lives in Grand Rapids, Michigan. He wouldn’t come in from a Silicon Valley recruiter, but he reads Hacker News and he nailed the interview.

“You can’t judge if someone is one of the best programmers in the country in 1 minute, but it turns out you can in 5 minutes. You only need 3 practices to qualify for the challenge but people take 10. A low percentage like 1 in 15 or 20 users will be good enough to get contacted, so we are able to filter people out with high accuracy.

“We wasted so much time figuring out peoples’ skills before. Many times we’ll do the challenge or interviews and it will take 15 minutes. The fact that some people can do it under 1 minute and others, also working in Silicon Valley, take 15 minutes, is evidence of the 10x engineer idea. Debugging is something you do every day at work, if you can get more than half of the bug fixes that we put in front of you, fast, then you are probably very good and we want to talk to you.”

Facebook has cut a deal with political website Politico that allows the independent site machine-access to Facebook users’ messages, both public and private, when a Republican Presidential candidate is mentioned by name. The data is being collected and analyzed for sentiment by Facebook’s data team, then delivered to Politico to serve as the basis of data-driven political analysis and journalism.

The move is being widely condemned in the press as a violation of privacy but if Facebook would do this right, it could be a huge win for everyone. Facebook could be the biggest, most dynamic census of human opinion and interaction in history. Unfortunately, failure to talk prominently about privacy protections, failure to make this opt-in (or even opt out!) and the inclusion of private messages are all things that put at risk any remaining shreds of trust in Facebook that could have served as the foundation of a new era of social self-awareness.

We, ok I, have long argued here at ReadWriteWeb that aggregate analysis of Facebook data is an idea with world-changing potential. The analogy from history that I think of is real estate redlining. Back in the middle of the last century, when US Census data and housing mortgage loan data were both made available for computer analysis and cross-referencing for the first time, early data scientists were able to prove a pattern of racial discrimination by banks against people of color who wanted to buy houses in certain neighborhoods. The data illuminated the problem and made it undeniable, thus leading to legislation to prohibit such discrimination.

I believe that there are probably patterns of interaction and communication of comparable historic importance that could be illuminated by effective analysis of Facebook user data. Good news and bad news could no doubt be found there, if critical thinking eyes could take a look.

“Assuming you had permission, you could use a semantic tool to investigate what issues the users are discussing, what weight those issues have in relation to everything else they are saying and get some insights into the relationships between those issues,” writes systemic innovation researcher Haydn Shaughnessy in a comment on Forbes privacy writer Kashmir Hill’s coverage of the Politico deal. “As far as I can see people use sentiment analysis because it is low overhead; the quickest, cheapest way to reflect something of the viewpoints, however fallible the technique. Properly mined though you could really understand what those demographics care about.”
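Shaughnessy’s point that sentiment analysis is “low overhead” can be seen in how little code the naive version takes. Below is a minimal lexicon-based sketch of counting positive and negative words in messages that mention a candidate by name. The word lists, messages and function name are all invented for illustration; Facebook’s actual pipeline is not public and is surely far more sophisticated.

```python
# A toy lexicon-based sentiment aggregator: for every message that
# mentions the candidate, add 1 per positive word and subtract 1 per
# negative word, then return the aggregate score.

POSITIVE = {"great", "love", "win"}
NEGATIVE = {"hate", "lose", "awful"}

def aggregate_sentiment(messages, candidate):
    score = 0
    for m in messages:
        words = m.lower().split()
        if candidate.lower() in words:
            score += sum(w in POSITIVE for w in words)
            score -= sum(w in NEGATIVE for w in words)
    return score

msgs = ["I love Romney", "Romney will lose", "nice weather today"]
print(aggregate_sentiment(msgs, "Romney"))  # 0 (+1 for love, -1 for lose)
```

This is exactly the fallibility Shaughnessy flags: the technique is cheap to run at scale, but it captures none of the relationships between issues that deeper mining could.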

Several years ago I had the privilege to sit with Mark Zuckerberg and make this argument to him, but it doesn’t feel like the company has seized the world-changing opportunity in front of it.

Facebook does regularly analyze its own data, of course. And sometimes it publishes what it finds. For example, two years ago the company cross-referenced the body of its users’ names with US Census data that ties last names to ethnicity. Facebook’s conclusion was that the site used to be disproportionately made up of White people – but now it’s as ethnically diverse as the rest of America. Good news!

But why do we only hear the good news? That millions of people are talking about Republican Presidential candidates might be considered bad news, but the new deal remains a very limited instance of Facebook treating its user data like the platform that it could be.

It could be just a sign of what’s to come, though. “This is especially interesting in terms of the business relationships–who’s allowed to analyze Facebook data across all users?” asks Nathan Gilliatt, principal at research firm Social Target and co-founder of AnalyticsCamp. “To my knowledge, they haven’t let other companies analyze user data beyond publicly shared stuff and what people can access with their own accounts’ authorization. This says to me that Facebook understands the value of that data. It will be interesting to see what else they do with it.”

I’ve been told that Facebook used to let tech giant HP informally hack at their data years ago, back when the site was small and the world’s tech privacy lawyers were as yet unaroused. That kind of arrangement would have been unheard of for the past several years, though. Two years ago, social graph hacker Pete Warden pulled down Facebook data from hundreds of millions of users, analyzing it for interesting connections before planning on releasing it to the academic research community. Facebook’s response was assertive and came from the legal department. Warden decided not to give the data to researchers after all. (Disclosure: I am writing this post from Warden’s couch.)

“Like a lot of Facebook’s studies, this collaboration with Politico is fascinating research, it’s just a real shame they can’t make the data publicly available, largely due to privacy concerns,” bemoans Warden. “Without reproducibility, it loses a lot of its scientific impact. With a traditional opinion poll, anyone with enough money can call up a similar number of people and test a survey’s conclusions. That’s not the case with Facebook data.”

“Everyone is going ‘gaga’ over the potential for Facebook,” says Kaliya Hamlin, Executive Director of a trade and advocacy group called the Personal Data Ecosystem Consortium.

“The potential exists only because they have this massive lead (monopoly) so it seems like they should be the ones to do this.

“Yes we should be doing deeper sentiment analysis of peoples’ real opinions. But in a way that they are choosing to participate – so that the entities that aggregate such information are trusted and accountable.

“If I had my own personal data store/service and I chose to share say my music listening habits with a ratings service like Nielsen – voluntarily join a panel. I have full trust and confidence that they are not going to turn on me and do something else with my data – it will just go in a pool.

“Next thing you know Facebook is going to be selling to the candidate the ability to access people who make positive or negative comments in private messages. Where does it end? How are they accountable and how do we have choice?”

Not everyone is as concerned about this from a privacy perspective. “There are many things in the online world that give me willies for Fourth-Amendment-like reasons,” says Curt Monash of data analyst firm Monash Research. “This isn’t one of them, because the data collectors and users aren’t proposing to even come close to singling out individual people for surveillance.”

Monash’s primary concern is in the quality of the data. “There’s a limit as to how useful this can be,” he says. “Online polls and similar popularity contests are rife with what amounts to ballot box stuffing. This will be just another example. It is regrettable that you can now stuff an online ballot box by spamming your friends in private conversation.”

It doesn’t just have to be about messages, though. Social connections, Likes and more all offer a lot of potential for analysis, if it’s done appropriately.

“We need trust and accountability frameworks that work for people to allow analysis AND not allow creepiness,” says Hamlin.

Two years ago social news site Reddit began giving its users an option to “donate your data to science” by opting in to have activity data made available for download. Massive programming Question and Answer site StackOverflow has long made available periodic dumps of its users’ data for analysis. “You never know what’s going to come out of it,” StackOverflow co-founder Joel Spolsky says about analysis of aggregate user data.

The unknown potential is indicative not just of how valuable Facebook data is, but potentially of the relationship between data and knowledge generally in the emerging data-rich world.

That’s the thesis of author David Weinberger’s new book, Too Big to Know. “It’s not simply that there are too many brickfacts [datapoints] and not enough edifice-theories,” he writes. “Rather, the creation of data galaxies has led us to science that sometimes is too rich and complex for reduction into theories. As science has gotten too big to know, we’ve adopted different ideas about what it means to know at all.”

The world’s largest social network, rich with far more signal than any of us could wrap our heads around, could help illuminate emergent qualities of the human experience that are only visible on the network level.

When researchers Alasdair Allan and Pete Warden announced at the Where 2.0 Conference in Santa Clara a few weeks ago that iPhones and 3G iPads are storing records of where their users are and where they’ve been, the news created quite a stir. Google also stores a similar list on Android devices, so naturally questions have swirled in the last few weeks around how both Apple and Google are collecting and using this location data and to what extent it encroaches on user privacy.

Yesterday, representatives from both companies were called before a senatorial subcommittee to answer questions from the likes of Senators Al Franken (Minn.) and Patrick Leahy (Vt.) on whether or not our mobile devices are becoming Big Brother 2.0.

During the testimony, the senators were careful to say that the government is well aware of the many benefits of the technology created by both companies and is in no way eager to stifle innovation or create knee-jerk legislation. That being said, in the words of Senator Leahy, while the “digital age can do some wonderful, wonderful things for all of us … American consumers and businesses face threats to privacy like no time before.”

Naturally, even without the information that has recently come to light, there has been a growing concern among lawmakers and consumers alike that both Google and Apple are not doing enough to become guardians of the user’s personal data rather than wholesalers. Leahy told the representatives that he was “deeply concerned” about the reports that iPhones and Android devices were “collecting, storing, and tracking user location data without the user’s consent”.

“I am also concerned about reports that this sensitive location information may be maintained in an unencrypted format, making the information vulnerable to cyber thieves and other criminals”, the Senator said.

As to the basic allegations that lay before the two giants of the mobile space, Apple has previously stated that, though it is partly at fault for not educating its users to fully understand the technical issues with providing fast and accurate location information, the company does not (nor has it ever) tracked the location of a user’s iPhone.

At the time, Apple explained that, while it did find a few bugs in the architecture, it was adamant that it is using the location data stored on its devices to maintain and improve upon a crowdsourced database of WiFi hotspots and cell towers — not to keep a log of a user’s prior location. The geo-tagged data from iPhones, for example, is used to help build data about WiFi networks and cell tower locations, which let location-based services work even when GPS/satellite positioning isn’t available or functioning seamlessly.

Be that as it may, Senator Franken noted that consumers remain confused, so he posed the question directly to Apple’s VP of Software Technology Bud Tribble: “does this data indicate anything about the user’s location, or doesn’t it?”

Tribble’s response was to reiterate the main message to the average consumer: the data is a record of the location of cell towers and WiFi hotspots; it does not contain any customer information. It is anonymous. However, that comes with a nuance. When a portion of that database is downloaded onto your phone, your phone knows which hotspots and towers it can transmit through, and the combination of those tower locations and the phone’s knowledge of which ones it can reach allows the phone to give you a basic location without GPS.
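A much-simplified sketch of the positioning Tribble describes: given a local database of tower/hotspot coordinates and the set the phone can currently hear, estimate position from those known locations. Real systems weight by signal strength and use far better models; the plain centroid, the coordinates and the function name below are all illustrative assumptions.

```python
# Estimate a rough position as the centroid of the known locations of
# the towers/hotspots the phone can currently hear. This is why the
# downloaded database reveals your location without containing any
# customer information itself.

def estimate_location(tower_db, heard_ids):
    pts = [tower_db[t] for t in heard_ids if t in tower_db]
    if not pts:
        return None  # nothing heard: fall back to GPS or another method
    lat = sum(p[0] for p in pts) / len(pts)
    lon = sum(p[1] for p in pts) / len(pts)
    return (lat, lon)

tower_db = {
    "tower_a": (45.52, -122.68),
    "tower_b": (45.54, -122.66),
}
print(estimate_location(tower_db, ["tower_a", "tower_b"]))
```

Note that the database itself is anonymous, exactly as Tribble says; it is only the combination with what this particular phone can hear, computed on the device, that yields a position.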

So, he is essentially saying, yes of course Apple tracks your location. That’s what GPS and WiFi and cell tower positioning are designed to do, and yes it does store location-based information on its devices in order to do that, but no it isn’t keeping a full history of your locations, and while it does know where you are, it doesn’t necessarily know who you are.

Though Apple doesn’t seem to be doing anything intentionally nefarious with this information, the point remains that the laws of this country have not yet come anywhere near to adequately addressing the capabilities of modern technologies. In an earlier panel, Jason Weinstein, deputy assistant attorney general of the Criminal Division of the U.S. Justice Dept, told the subcommittee that once companies have access to consumer info (if you give Apple or Google permission to use your location or something similar), they can legally share that data with third-party businesses.

Only when companies have previously promised not to share something, like your location, can they be held accountable in court. As Justin Brookman, the Director for the Center of Democracy & Technology’s Project on Consumer Privacy, said, “the default law in this country for the sharing of data is that you can do anything you want”, with the exception being any prior promise the company has made not to share specific data.

Franken also pointed out that the two companies run the biggest app stores in the world yet require no privacy policy for their apps, and asked whether they would consider adding such a requirement.

Alan Davidson, Google’s director of public policy, said that Google has relied on a permission-based model which requires users to give permission before any sharing takes place, but that the next step is important for Google to consider and is “a very good suggestion”. He said that he would “take that issue back to leadership”. And for Apple, Tribble said that Apple contractually requires third-party developers to disclose if they’re going to do anything with user information, but does not integrate an over-arching privacy policy. He then continued on to say that a general privacy policy would not be enough, that true transparency goes beyond what’s in the privacy policy and needs to be integrated into the user interface of an app, with feedback to the user about what’s happening to the information designed into the app itself.

Franken then asked Tribble about why Apple only asks users if they want to share location with an app, while Google asks the user if they want to share location, address book information, contacts, and so on. Tribble responded by saying that a long checklist of opt-in sharing options would only confuse the user and be unwieldy both to present and to read on a mobile device.

There’s no doubt that Tribble makes two valid points here, but Ashkan Soltani, an independent researcher who has worked with the Wall Street Journal on mobile-privacy investigations, shortly thereafter cut to the heart of the matter. He told the senators that the biggest privacy threat to mobile users today is the simple fact that “consumers are repeatedly surprised by the information that apps and app platforms are accessing”. Users are entrusting their phones and computers with a great deal of personal information, he said, and these platforms are not taking adequate steps to make clear to the consumer that third-parties have access to this information.

Not only that, but the other issue is that platform providers, too, are often caught off guard as to the types (or amount) of information they’re gathering. Soltani cited Google Street View cars collecting WiFi information during their surveys and the recent case of Apple’s location storage cache.

So, it seems that not only are lawmakers and legislation slow to catch up to the uses and capabilities of modern technology, so too are the providers themselves. Going forward, Soltani suggested, we will need to begin to formulate solid definitions to questions as fundamental as “What does ‘opt-in’ mean?” and further define oft-used concepts like location. Is a user’s location defined within 4 feet or 100 miles? What is “anonymous” going to mean in a location-crazy world, and how are we going to define “third-party” and what those “third-parties” rightly have access to?

The legislative process is just beginning here, and may well be glacial in its progress. Though there is certainly some questionable thinking to be found coming from these two companies in how they’re thinking about privacy, it’s great to see evidence of their willingness to work with the government to find the best solution for enterprise — and more importantly, the consumer — going forward.

Kudos to the senators and the subcommittee for asking the right questions.

Senators John Kerry and John McCain introduced a bill to the Senate floor last week entitled “The Commercial Privacy Bill of Rights” that would reform and codify how Internet user data can be used online.

On the surface, this seems like the type of altruistic bill that falls into the no-brainer area of Congressional legislation. Privacy, protection, trust, accountability. All the good political buzzwords apply. Yet it is not that simple. Data is the lifeblood of the Web, and the bill would give the Federal Trade Commission and the Department of Commerce a significant hand in regulating how consumer data is collected and used by companies. Advertisers, innovators and consumer groups are concerned with the bill, not so much because of the wording of the legislation, but because of the amount of control it places in the hands of the FTC and whether or not that is necessary.

The right to security and accountability:
Collectors of information must implement security measures to protect the information they collect and maintain.

The right to notice, consent, access and correct information:
The collector must provide the ability for an individual to opt-out of any information collection that is unauthorized by the Act and provide affirmative consent (opt-in) for the collection of sensitive personally identifiable information.

The right to data minimization, distribution constraints and data integrity:
Holds companies to using the data they collect only for specific purposes of conducting business within a set timeline, and holds any third-party accessors of that data to the same standards as the collector.

Voluntary Safe Harbor Programs:
Companies can opt out of portions of the bill if they set policies that are equally as stringent as the bill.

How will it affect the advertising ecosystem?

The advertising community feels that this law is unnecessary because the industry has been crafting its own privacy policies for some time, and thinks that the market can regulate itself.

“We’ve set up a system; now they are going to replace it with the FTC,” Dan Jaffe, executive vice president of the Association of National Advertisers, said in an AdWeek interview. “It basically undermines the momentum from last December, when both the FTC and Commerce Department scolded the industry for not moving fast enough.”

Kaliya Hamlin, the executive director of the Personal Data Ecosystem Consortium, believes the bill is absolutely necessary. Foremost, she says, it would bring American privacy laws into alignment with those of the European Union, which would allow for greater international commerce; currently, U.S. sites that do not comply with European privacy laws either cannot operate there or have to change their data handling processes to accommodate European law.

“Our consortium has a point of view about how the future could be and it doesn’t have to be the way it is,” Hamlin said. “It is totally possible to advertise (and potentially even more effective than today) in an ecosystem that gets re-wired to respect people and their information – that seeks to build with them and connect them to offers and opportunities that are relevant to them.”

Hamlin noted that on a recent trip to Europe she encountered many pop-ups on sites asking whether her information could be used. If this bill passes, similar pop-ups could be coming to the United States.

Will it stifle innovation?

Hamlin does not think so.

“So there are currently 20+ startups innovating around developing personal data banks and services around them,” Hamlin said. “So, where is the stifling?”

Yet social network data hacker and ReadWriteWeb contributor Pete Warden told our editor Marshall Kirkpatrick last week that “these regulations will deter startups from building new tools like Mint.com or Rapportive, while the big corporations can devote whole departments to working around any new rules.”

There is clearly a split in the tech community over whether and how the bill would affect innovation. It is notable that the United States leads Europe in both the volume and quality of innovation. Is that because of the culture on the other side of the pond? Or the amount of regulation by the European Union?

A group of the largest tech companies, including Microsoft, Intel and eBay, signed an agreement last week supporting the bill. Notably, Facebook and Google have not weighed in on the bill.

CNET pointed out last week that the bill will not apply to the government or law enforcement, which brings up an interesting double standard in how the government views itself in relation to business. Internet consumers are as wary of the government and how it uses their information as they are of businesses. When it comes to the day-to-day processes of government function, it does not operate all that differently from a large enterprise corporation, yet it wields more power over the lives of Americans than almost all corporations put together.

In the long run, start-ups and innovators will learn to deal with the new regulations if the bill passes. It comes down to a matter of trust. How much do consumers trust the businesses that use and control their data? This bill would help companies gain and keep the trust of consumers.

But, the question becomes: does the federal government need to legislate that trust or is it something that companies have to craft and earn on their own?

A bipartisan bill limiting what companies can do with online user activity and profile data may be introduced by Senators John McCain and John Kerry one week from Wednesday, according to a report today in the Wall St. Journal. The Journal’s Julia Angwin, citing anonymous sources, reports that the bill would require users to opt in before their data is shared between companies, and that users be able to see what data about them is being shared.

That might not sound so bad on the surface, but in a world of fast-developing technology it’s good to think hard before making laws based on what might seem like common sense. The internet is a young thing and legislation like this could cut deep. Leadership on the issue from John McCain, who less than three years ago thought it appropriate to run for the Presidency without ever having used the internet, seems particularly inappropriate. This is an issue that needs to be looked at from a pro-technology perspective, at least in part.

Data flying from point to point, out of your sight, only somewhat under your control, until magic happens – isn’t that the nature of the Internet? And isn’t using it at all opting-in to redistribution of data? That might be too philosophical. There’s a practical story here of old-fashioned invention, too.

Requiring opt-in, instead of opt-out, for data sharing would likely greatly reduce the amount of user data available for sharing, analysis and use in creating new software and services. While unpoliced data sharing clearly frightens many people (where are the victims of these crimes?) – the consequences of stifling data sharing by industry are more tangible.

“These regulations,” says leading social network data hacker and ReadWriteWeb contributor Pete Warden, “will deter startups from building new tools like Mint.com or Rapportive, while the big corporations can devote whole departments to working around any new rules.”

Handling a Tidal Wave

The consequences of such legal action are hard to foresee. “The tricky question is, what is Personally Identifiable Information?” says Warden. “Everyone wants to just regulate names, addresses, etc. but since you can deanonymize almost any user-generated data set, and derive that information…any regulations will end up affecting far more applications than you might expect.”
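Warden’s point about deanonymization is easy to demonstrate. Here is a minimal Python sketch using entirely synthetic records: strip the names from a data set, and the combination of a few remaining “harmless” fields often still pins down individual rows.

```python
# A toy illustration of how "anonymized" data can be re-identified.
# All records here are synthetic; the fields chosen are illustrative.
from collections import Counter

anonymized = [
    {"zip": "97201", "birth_year": 1975, "gender": "F", "purchases": ["movie"]},
    {"zip": "97201", "birth_year": 1982, "gender": "M", "purchases": ["book"]},
    {"zip": "10001", "birth_year": 1975, "gender": "F", "purchases": ["music"]},
    {"zip": "97201", "birth_year": 1975, "gender": "M", "purchases": ["game"]},
]

def quasi_id(record):
    """The 'harmless' fields that together act like a fingerprint."""
    return (record["zip"], record["birth_year"], record["gender"])

# Count how many records share each fingerprint.
counts = Counter(quasi_id(r) for r in anonymized)
unique = [r for r in anonymized if counts[quasi_id(r)] == 1]

# In this toy set, every record is pinned down by zip + birth year + gender,
# so anyone who knows those three facts about a person can read their row.
print(f"{len(unique)} of {len(anonymized)} records are uniquely identifiable")
```

The same effect shows up in real data sets with far more rows, which is why regulations limited to names and addresses miss most of the picture.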

Indeed, data is widely expected to become one of the key factors in the future of economic and social development. And so much data is personal. (There’s a whole lot of personal data about all of us collected and shared off-line too, at the grocery store for example, or in direct marketing databases – but that’s not the subject of so much ire.)

This is a common theme here on this blog. The example I’ve offered most commonly in calling for data to flow as freely as possible is the history of what’s called real estate redlining. In the 1960s, when both U.S. Census information and real estate mortgage loan information were made available for bulk analysis, it was proven that banks around the U.S. were discriminating against home loan applicants in traditionally African American neighborhoods.

That was a big deal and I suspect that there are patterns of comparable importance, both positive and negative, hiding in the huge flowing river of online user data.

Dr. Dirk Helbing, of the Swiss Federal Institute of Technology, chairs the team building a project called the Living Earth Simulator (LES), a massive data project aiming to simulate as much natural and social activity on earth as possible. Those simulations, to be carried out on a scale inspired by the Large Hadron Collider, would aim to discover all kinds of patterns hidden in the mass of human and ecological data, including social network data.

Here’s how he explains the importance of data analysis for pattern recognition. “Many problems we have today – including social and economic instabilities, wars, disease spreading – are related to human behaviour, but there is apparently a serious lack of understanding regarding how society and the economy work,” he says. “Revealing the hidden laws and processes underlying societies constitutes the most pressing scientific grand challenge of our century.“

Data analysis uncovered systematic racial discrimination in housing loans in the 1960s. In the future, analysis of the incredible living census that is our internet data could be used to discover patterns and opportunities relevant to global warming, overpopulation, the spread of disease and the fact that the world today is an awful, unfair mess.

The Wall St. Journal’s extensive reports on these matters make no such effort. It’s remarkable that the paper of record for capitalism makes no serious gesture in its reporting on data privacy to recognize the incredible economic engine that is online data. Instead, the publication’s tone is fear-mongering and self-congratulatory. (A code on Ashley’s computer knows that she likes the movie The Princess Bride and that information is sold to other companies for 1/10 of 1 cent… “Well, I like to think I have some mystery left to me,” Ashley says, “but apparently not!” Poor woman! No mystery left!)

I hope that McCain and Kerry don’t introduce a bill requiring online user data sharing to be opt-in only. If they do, I hope there’s a lot more conversation and learning than there is legislating.

No one puts it better than the US Department of Commerce. That body said the following in its announcement of the new Federal Privacy Policy Office in December:

Strong commercial data privacy protections are critical to ensuring that the Internet fulfills its social and economic potential. Our increasing use of the Internet generates voluminous and detailed flows of personal information from an expanding array of devices.

Some uses of personal information are essential to delivering services and applications over the Internet. Others support the digital economy, as is the case with personalized advertising.

Some commercial data practices, however, may fail to meet consumers’ expectations of privacy; and there is evidence that consumers may lack adequate information about these practices to make informed choices. This misalignment can undermine consumer trust and inhibit the adoption of new services. It can also create legal and practical uncertainty for companies. Strengthening the commercial data privacy framework is thus a widely shared interest.

However, it is important that we examine whether the existing policy framework has resulted in rules that are clear and sufficient to protect personal data in the commercial context.

The government can coordinate this process, not necessarily by acting as a regulator, but rather as a convener of the many stakeholders–industry, civil society, academia–that share our interest in strengthening commercial data privacy protections. The Department of Commerce has successfully convened multi-stakeholder groups to develop and implement other aspects of Internet policy. Domain Name System (DNS) governance provides a prominent example of the Department’s ability to implement policy using this model.

Convening multi-stakeholder conversations between diverse industry and other experts. That sounds like a much better idea than passing laws that cut so deep, here so close to the dawn of the internet.

Like a prism to a ray of sunlight, Nick Halstead, CEO of stream-hacking startup Mediasift, took the stage today with Twitter’s Ryan Sarver at the Data 2.0 conference to announce Twitter’s second data resale channel partnership. Halstead’s service will allow customers to parse the full Twitter firehose along any of the 40 fields of data hidden inside every Tweet, with the addition of augmented data layers from services including Klout (influence metrics), PeerIndex (influence), Qwerly (linked social media accounts) and Lexalytics (text and sentiment analysis). Storage, post-processing and historical snapshots will also be available.

The price? Dirt cheap. Halstead told me after the announcement that customers would be able to apply as many as 10,000 keyword filters to the firehose for as little as 30 cents an hour. The most computationally expensive filtering Mediasift will offer won’t be priced above $8k per year. (Pricing is approximate but indicative, Halstead says.) What does this mean? It means that far more developers than ever before will now have a stable, officially approved and very affordable way to access highly targeted slices of data. Twitter just found a way to hand developers an Amazon River’s worth of golden tinker-toys, each with more than 40 points of contact, at commodity prices.

While Twitter’s partnership with bulk data reseller Gnip (announced in November) offered half the firehose in bulk for a whopping $360k, or 5% of the firehose for $60k per year, Mediasift’s prices and use cases will be very different. Pricing will be modeled like Amazon’s cloud computing: each function’s cost will be spelled out as the user requests it. Geo-filtering is expensive, keyword filtering is cheap, for example. Keyword filtering is done, stable and available now. Storage of the data for post-processing and snapshots of historical data are described as in alpha stages.
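Using the rough figures Halstead quoted, here is a back-of-the-envelope Python sketch of how such metered, capped pricing composes. The geo rate and the function names are invented for illustration; only the 30-cents-an-hour keyword figure and the $8k annual ceiling come from the announcement.

```python
# Back-of-the-envelope cost model for metered stream filtering.
# The keyword rate and annual cap come from Halstead's quoted figures;
# the geo rate is an assumed placeholder, not Mediasift's actual price list.
HOURS_PER_YEAR = 24 * 365

RATES_PER_HOUR = {
    "keyword": 0.30,   # cheap: simple string matching
    "geo": 0.90,       # assumed: spatial lookups cost more compute
}

ANNUAL_CAP = 8000.0    # "won't be priced above $8k per year"

def annual_cost(filter_types):
    """Estimate a year of 24/7 streaming with the given filter types applied."""
    hourly = sum(RATES_PER_HOUR[f] for f in filter_types)
    return min(hourly * HOURS_PER_YEAR, ANNUAL_CAP)

# Keyword-only filtering, running around the clock, stays far under the cap:
print(f"keyword only: ${annual_cost(['keyword']):,.2f}/year")
# Stacking on geo-filtering would exceed the cap, so the ceiling kicks in:
print(f"keyword + geo: ${annual_cost(['keyword', 'geo']):,.2f}/year")
```

The striking part is the first number: continuous keyword filtering of the full firehose for a few thousand dollars a year, versus the six-figure bulk deals Gnip was selling.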

Want a feed of negative Tweets written by C-level execs about any of 10,000 keywords? Trivial! Basic level service, Halstead says! Want just the Tweets that fit those criteria and are from the North Eastern United States? That you’ll have to pay a little extra for. The possibilities are staggering.

All the message contents, all the bio contents, bio contents from other social services (like LinkedIn) associated with people’s Twitter accounts via Qwerly, sentiment analysis – all of these will be points of contact where filtering can occur.

Below: This is what a Tweet looks like. Every little message has more than 40 different fields. Datasift customers will be able to filter the full Twitter firehose by any of those fields, or by data from additional third-party services.
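As a rough sketch of that structure, here is a heavily simplified tweet payload and one way a per-field filter might work. The field names follow Twitter’s public API of the era; the helper functions and filtering logic are purely illustrative, not Mediasift’s actual implementation.

```python
# A heavily simplified tweet payload (real tweets carry 40+ fields) and an
# illustrative filter that matches keywords against any dotted field path.
tweet = {
    "id_str": "123456789",
    "text": "Just tried the new coffee place downtown",
    "created_at": "Wed Apr 06 15:24:15 +0000 2011",
    "coordinates": None,
    "user": {
        "screen_name": "example_user",
        "description": "CTO at an example startup",
        "followers_count": 1500,
        "location": "Portland, OR",
    },
    "entities": {"hashtags": [], "urls": []},
}

def get_field(obj, dotted_path):
    """Look up a nested field by a dotted path like 'user.screen_name'."""
    for key in dotted_path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

def matches(tweet, dotted_path, keywords):
    """True if any keyword appears in the named string field (case-insensitive)."""
    value = get_field(tweet, dotted_path)
    if not isinstance(value, str):
        return False
    return any(kw.lower() in value.lower() for kw in keywords)

print(matches(tweet, "text", ["coffee", "tea"]))           # filter on the message body
print(matches(tweet, "user.description", ["CTO", "CEO"]))  # filter on the author's bio
```

Because every one of those fields is a potential filter target, combining a bio filter with a text filter is how you would get something like “negative Tweets written by C-level execs” out of the raw stream.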

“Twitter are moving up the value chain by offering the high-level information that developers want,” said ReadWriteWeb contributor and leading social data hacker Pete Warden of the announcement. “Rather than selling commodity information for further processing, this partnership offers a narrow but deep slice.”

This is a bet on a future wherein greater value is built by widespread, low-cost access to social data on the part of many and diverse developers. It’s not just raw data either, it’s really rich. It’s the opposite of what the Gnip announcement signaled and what many people have feared – that Twitter would hoard its river of data and sell it only to high bidders.

More on this in the coming days. I’m very excited to start hacking on it all with ReadWriteWeb’s team of developers.

Should We Replace the Term Big Data with “Unbounded Data”?

This is actually from a couple weeks ago, but I think it’s worthy of inclusion. Clive Longbottom of Quocirca makes the case that “Big Data” is the wrong way to talk about the changes in the ways we store, manage and process data. The term certainly gets thrown around a lot, and in many cases to describe data sets much smaller than the petabytes that arguably define big data. Longbottom suggests the term “unbounded data”:

Indeed, in some cases, this is far more of a “little data” issue than a “big data” one. For example, some information may be so esoteric that there are only a hundred or so references that can be trawled. Once these instances have been found, analysing them and reporting on them does not require much in the way of computer power; creating the right terms of reference to find them may well be the biggest issue.

Hadapt and MapR Take on Cloudera

MapR is a new Hadoop vendor and Cloudera competitor co-founded by ex-Googler M.C. Srivas. MapR announced that it is releasing its own enterprise Hadoop distribution that uses a proprietary replacement for the HDFS file system. In addition to Cloudera, MapR will compete with DataStax and Appistry.

Hadapt is a new company attempting to bring SQL-like relational database capabilities to Hadoop. It leaves the HDFS file system intact and uses HBase.

Tokutek Updates Its MySQL-based Big Database

Don’t count MySQL out of the big data game quite yet. Tools like HandlerSocket (coverage) and Memcached help the venerable DB scale. So does TokuDB from Tokutek, a storage engine used by companies like Kayak to scale up MySQL and MariaDB while maintaining ACID compliance.

The new version adds hot indexing, for building indexes on the fly, and hot column addition and deletion, for managing columns without restarting the database.

The Dark Side of Big Data

Computerworld covers the relationship between surveillance and big data at the conference. “It will change our existing notions of privacy. A surveillance society is not only inevitable, it’s worse. It’s irresistible,” Jeff Jonas, chief scientist of Entity Analytic Solutions at IBM, told Computerworld.

We covered this issue last year and asked what developers would do with access to the massive data sets that location-aware services enable. It’s still an open question.

Yesterday at the GigaOM Structure Big Data conference in New York City, Apache Cassandra parent company DataStax announced a new product that integrates Cassandra, Hive and Apache Hadoop. The new product, called Brisk, essentially uses Cassandra to replace HBase and the Hadoop storage layer. “There’s a lot of nice properties about using the Hadoop programming model on top of a Cassandra layer, especially if you’re already using the database and want to do more large-scale batch processing,” says ReadWriteWeb resident big data expert Pete Warden.

This new permutation of Hadoop represents the ongoing evolution of the database.

What we’re seeing now is not necessarily a consolidation in the market, though that’s happening in the columnar database space and elsewhere. More importantly we’re seeing technologies converge, as projects split apart and combine to form new tools. The process is already underway, as Cassandra itself combines concepts from Amazon.com’s Dynamo and Google’s BigTable.

The merger between CouchOne and Membase is another example. This was both a market and technical consolidation: the forthcoming Couchbase Elastic Server will combine Membase and Apache CouchDB.

Hadoop and HBase have been around for a few years now, and Google has been using MapReduce and BigTable for even longer. But the relational database paradigm has been around for about four decades. These new databases, and the algorithms for manipulating them, are still remarkably new.

Projects like Brisk and Couchbase are likely only the beginning of the forks, mergers and strange offshoots we’ll see in coming years.

Functions and variables added to the shared now object are automatically synced, in real time

Call client functions from the server, and server functions from the client

DNode is another RPC system for Node.js. “NowJS is a higher-level interface than DNode,” our own Pete Warden explains. “It offers an abstraction layer to make calling remote functions very simple without worrying about ports or sockets.”