To get a flavor of the exchange, we’ll start with this from Andreessen:

What never gets discussed in all of this robot fear-mongering is that the current technology revolution has put the means of production within everyone’s grasp. It comes in the form of the smartphone (and tablet and PC) with a mobile broadband connection to the Internet. Practically everyone on the planet will be equipped with that minimum spec by 2020.

versus this from Payne:

If we’re gonna throw around Marxist terminology, though, can we at least keep Karl’s ideas intact? Workers prosper when they own the means of production. The factory owner gets rich. The line worker, not so much.

Owning a smartphone is not the equivalent of owning a factory. I paid for my iPhone in full, but Apple owns the software that runs on it, the patents on the hardware inside it, and the exclusive right to the marketplace of applications for it.

…

You spent a lot of paragraphs on back-of-the-napkin economics describing the coming Awesome Robot Future, addressing the hypotheticals. What you left out was the essential question: who owns the robots?

Namely, at some point we’ll have all these robots doing stuff for us, but how are we going to spread that wealth around? Who owns the robots and when are they going to learn to share? In this vision of the distant future, that critical “singularity of moral enlightenment” (SME) is never explained. I wish I could ask Captain Picard how it all went down.

It’s one thing to lack an explanation for the SME, and to consider it an aspirational quasi-religious utopian goal, but it’s another thing entirely to fail to acknowledge it.

That someone as powerful and famous as Marc Andreessen, who is personally involved in the development and nurturing of so many technology platforms, has trouble seeing the logical inconsistency of his own rhetoric can only be explained by the fact that, as the controller of such platforms, it is he who reaps their benefits. It’s yet another case of someone thinking “this system works for me therefore it is super awesome for everyone and everything, amen.”

I’m hoping Al3x’s fine response will get Marc to consider how SME is gonna happen, and when.

At first glance, data miners inside governments, start-ups, corporations, and political campaigns are all doing basically the same thing. They’ll all need great engineering infrastructure, good clean data, a working knowledge of statistical techniques and enough domain knowledge to get things done.

I do think there are differences, though, and here I’m not talking about ethics or trust issues, I’m talking about pure politics[1].

Namely, the world of data mining is divided into two broad categories: people who want to cause things to happen and people who want to prevent things from happening.

I know that sounds incredibly vague, so let me give some examples.

In start-ups, irrespective of what you’re actually doing (what you’re actually doing is probably incredibly banal, like getting people to click on ads), you feel like you’re the first person ever to do it, at least on this scale, or at least with this dataset, and that makes it technically challenging and exciting.

Or, even if you’re not the first, at least what you’re creating or building is state-of-the-art and is going to be used to “disrupt” or destroy lagging competition. You feel like a motherfucker, and it feels great[2]!

The same thing can be said for Obama’s political data miners: if you read this article, you’ll know they felt like they’d invented a new field of data mining, and a cult along with it, and it felt great! And although it’s probably not true that they did something all that impressive technically, in any case they did a great job of applying known techniques to a different data set, and they got lots of people to allow access to their private information based on their trust of Obama, and they mined the fuck out of it to persuade people to go out and vote and to go out and vote for Obama.

Now let’s talk about corporations. I’ve worked in enough companies to know that “covering your ass” is a real thing, and can overwhelm a given company’s other goals. And the larger the company, the more the fear sets in and the more time is spent covering one’s ass and less time is spent inventing and staying state-of-the-art. If you’ve ever worked in a place where it takes months just to integrate two different versions of SalesForce you know what I mean.

Those corporate people have data miners too, and in the best case they are somewhat protected from the conservative, risk averse, cover-your-ass atmosphere, but mostly they’re not. So if you work for a pharmaceutical company, you might spend your time figuring out how to draw up the numbers to make them look good for the CEO so he doesn’t get axed.

In other words, you spend your time preventing something from happening rather than causing something to happen.

Finally, let’s talk about government data miners. If there’s one thing I learned when I went to the State Department Tech@State “Moneyball Diplomacy” conference a few weeks back, it’s that they are the most conservative of all. They spend their time worrying about a terrorist attack and how to prevent it. It’s all about preventing bad things from happening, and that makes for an atmosphere where causing good things to happen takes a back seat.

I’m not saying anything really new here; I think this stuff is pretty uncontroversial. Maybe people would quibble over when a start-up becomes a corporation (my answer: mostly they never do, but certainly by the time of an IPO they’ve already done it). Also, of course, there are ass-coverers in start-ups and there are risk-takers in corporations and maybe even in government, but they don’t dominate.

If you think through things in this light, it makes sense that Obama’s data miners didn’t want to stay in government and decided to go work on advertising stuff. And although they might have enough clout and buzz to get hired by a big corporation, I think they’ll find it pretty frustrating to be dealing with the cover-my-ass types that will hire them. It also makes sense that Facebook, which spends its time making sure no other social network grows enough to compete with it, works so well with the NSA.

When you find a website that claims to be free for users, you should know to be automatically suspicious. What is sustaining this service? How could you possibly have 35 people working at the underlying company without a revenue source?

We’ve been trained to not think about this, as web surfers, because everything seems, on its face, to be free, until it isn’t, which seems outright objectionable (as I wrote about here). Or is it? Maybe it’s just more honest.

When I go to the newest free online learning site, I’d like to know how they plan to eventually make money. If I’m registering on the site, do I need to worry that they will turn around and sell my data? Is it just advertising? Are they going to keep the good stuff away from me unless I pay?

And it’s not enough to tell me it’s making no revenue yet, that it’s being funded somehow for now without revenue. Because wherever there is funding, there are strings attached.

If the NSF has given a grant for this project, then you can bet the project never involves attacking the NSF for incompetence and politics. If it’s a VC firm, then you’d better believe they are actively figuring out how to make a major return on their investment. So even if they’re not selling your registration and click data now, they have plans for it.

So in other words, I want to know how you’re being funded, who’s giving you the money, and what your revenue model is. Unless you are independently wealthy and want to give back to the community by slaving away on a project, or you’re doing it in your spare time, then I know I’m somehow paying for this.

Just in the spirit of disclosure and transparency, I have no income and I pay a bit for my WordPress site.

Recently I’ve been seeing various articles and opinion pieces that say that Facebook should pay its users to use it, or give a cut of the proceeds when they sell personal data, or something along those lines.

This strikes me as naive to a surprising degree; it means people really don’t understand how web businesses work. How can people simultaneously complain that Facebook isn’t a viable business and that it doesn’t pay its users for their data?

People have gotten used to getting free services, and they assume that infrastructure somehow just exists, and they want to have that infrastructure, and use it, and never see ads and never have their data used, or get paid whenever someone uses their data.

But you can’t have all of that at the same time!

These companies need to monetize somehow, and instead of asking users for money directly, which isn’t the current culture, they get creepy with data. The fact that there are basically no rules about personal information (aside from some medical information) means that the creepiness limit is extreme, and possibly hasn’t been reached yet.

What are the alternatives? I can think of a few, none of them particularly wonderful:

Legislate privacy laws to make personal data sharing or storing illegal without explicit consent for each use (right now you just sign away all your rights at once when you sign up for the service, but that could and probably should change). This would kill the internet as we know it. In the short term the consequences would be extreme. Besides the fact that some people would save and use data illegally, which would be very hard to track and to stop, places like Twitter, Facebook, and Google would have no revenue model. It’s an interesting thought experiment to imagine what would happen after this.

Make people pay for services, either through micro-payments or subscription services like Netflix. This would maybe work, but only for people with credit cards and money to spare. So it would also change access to the internet, and not in a good way.

Wikipedia-style donation-based services. This is clearly a tough model, and they always seem to be on the edge of solvency.

Get the government to provide these services as meaningful infrastructure for society, like highways. Imagine what Google Government would be like.

I’m enjoying reading and learning about agile software development, which is a method of creating software in teams where people focus on short and medium term “iterations”, with the end goal in sight but without attempting to map out the entire path to that end goal. It’s an excellent idea considering how much time can be wasted by businesses in long-term planning that never gets done. And the movement has its own manifesto, which is cool.

The post I read this morning is by Mike Cohn, who seems heavily involved in the agile movement. It’s a good post, with a good idea, and I have just one nerdy pet peeve concerning it.

I’m a huge fan of stealing good ideas from financial modeling and importing them into other realms. For example, I stole the idea of stress testing portfolios and use it to stress test the business itself where I work, replacing scenarios like “the Dow drops 9% in a day” with things like, “one of our clients drops out of the auction.”

I’ve also stolen the idea of “resampling” in order to forecast possible future events based on past data. This is particularly useful when the data you’re handling is not normally distributed, and when you have quite a few data points.

To be more precise, say you want to anticipate what will happen over the next week (5 days) with something. You have 100 days of daily results in the past, and you think the daily results are more or less independent of each other. Then you can take 5 random days in the past and see how that “artificial week” would look if it happened again. Of course, that’s only one artificial week, and you should do that a bunch of times to get an idea of the kind of weeks you may have coming up.

If you do this 10,000 times and then draw a histogram, you have a pretty good sense of what might happen, assuming of course that the 100 days of historical data is a good representation of what can happen on a daily basis.
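Here is a minimal sketch of that procedure in Python. It assumes the daily results are independent and that past days are drawn with replacement (I didn’t specify that above, but it’s the usual choice). The `history` data is made up purely for illustration, and the names `resample_weeks` and `bin_counts` are hypothetical helpers, not anything from Mike Cohn’s spreadsheet.

```python
import random
from collections import Counter


def resample_weeks(history, days_per_week=5, n_weeks=10_000, seed=0):
    """Build artificial weeks by drawing past days at random, with replacement.

    Returns a list with one total per artificial week.
    """
    rng = random.Random(seed)
    return [sum(rng.choices(history, k=days_per_week)) for _ in range(n_weeks)]


def bin_counts(values, n_bins=10):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    return [(lo + i * width, counts.get(i, 0)) for i in range(n_bins)]


# Made-up history: 100 days of skewed daily results (illustration only).
data_rng = random.Random(42)
history = [data_rng.lognormvariate(0.0, 1.0) for _ in range(100)]

weekly_totals = resample_weeks(history)

# Crude text histogram: one '#' per 100 artificial weeks in the bin.
for left_edge, count in bin_counts(weekly_totals):
    print(f"{left_edge:8.2f} | {'#' * (count // 100)}")
```

The list `weekly_totals` is the artificial distribution itself, so anything you want to know about the coming week can be read off it directly.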

Here comes my pet peeve. In Mike Cohn’s blog post, he goes to the trouble of resampling to get a histogram, so a distribution of fake scenarios, but instead of really using that as a distribution, for the sake of computing a confidence interval, he only computes the average and standard deviation and then replaces the artificial distribution with a normal distribution with those parameters. From his blog:

Armed with 200 simulations of the ten sprints of the project (or ideally even more), we can now answer the question we started with, which is, How much can this team finish in ten sprints? Cells E17 and E18 of the spreadsheet show the average total work finished from the 200 simulations and the standard deviation around that work.

In this case the resampled average is 240 points (in ten sprints) with a standard deviation of 12. This means our single best guess (50/50) of how much the team can complete is 240 points. Knowing that 95% of the time the value will be within two standard deviations we know that there is a 95% chance of finishing between 240 +/- (2*12), which is 216 to 264 points.

What? This is kind of the whole point of resampling, that you could actually get a handle on non-normal distributions!

For example, let’s say in the above example, your daily numbers are skewed and fat-tailed, like a lognormal distribution or something, and say the weekly numbers are just the sum of 5 daily numbers. Then the weekly numbers will also be skewed and fat-tailed, although less so, and the best estimate of a 95% confidence interval would be to sort the scenarios and look at the 2.5th percentile scenario, the 97.5th percentile scenario and use those as endpoints of your interval.
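To make the difference concrete, here is a rough sketch in Python that computes both intervals from the same set of resampled scenarios. The data is made up (lognormal daily numbers, chosen only because they are skewed), the function names `percentile_interval` and `normal_interval` are hypothetical, and the percentile lookup is a simple sort-and-index approximation rather than any particular textbook definition.

```python
import random
import statistics


def percentile_interval(scenarios, coverage=0.95):
    """Sort the scenarios and read off the lower and upper tail cutoffs."""
    ordered = sorted(scenarios)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0
    lo_idx = int(tail * n)
    hi_idx = min(n - 1, int((1.0 - tail) * n))
    return ordered[lo_idx], ordered[hi_idx]


def normal_interval(scenarios, n_sd=2.0):
    """The shortcut: mean plus or minus two standard deviations."""
    mean = statistics.fmean(scenarios)
    sd = statistics.stdev(scenarios)
    return mean - n_sd * sd, mean + n_sd * sd


# Made-up skewed daily history, then 10,000 resampled 5-day weeks.
rng = random.Random(7)
history = [rng.lognormvariate(0.0, 1.0) for _ in range(100)]
scenarios = [sum(rng.choices(history, k=5)) for _ in range(10_000)]

pct_lo, pct_hi = percentile_interval(scenarios)
norm_lo, norm_hi = normal_interval(scenarios)

print(f"percentile interval: {pct_lo:.2f} to {pct_hi:.2f}")
print(f"mean +/- 2 sd:       {norm_lo:.2f} to {norm_hi:.2f}")
```

With right-skewed data like this, the symmetric interval will typically reach too far on the low side and not far enough on the high side, whereas the percentile interval follows the shape of the scenarios you actually generated.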

The weakness of resampling is the possibility that the data you have isn’t representative of the future. But the strength is that you get to work with an honest-to-goodness distribution and don’t need to revert to assuming things are normally distributed.

I read this article yesterday about racism in Silicon Valley. It’s interesting, written by an interesting guy named Eric Ries, and it touches on stuff I’ve thought about like stereotype threat and the idea that diverse teams perform better than homogeneous ones.

In spite of liking the article pretty well, I take issue with two points.

In the beginning of the article Ries lays down some ground rules, and one of them is that “meritocracy is good.” Is it really good? Always? And to what limit? People are born with talent just as they’re born rich or poor, and what makes talent a better or more fair way of sorting people? Or are we just claiming it’s more efficient?

Actually I could go on but this blog post kind of says everything I wanted to say on the matter. As an aside, I’m kind of sick of the way people use the idea of “meritocracy” to overpay people who they justify as having superhuman qualifications (I’m looking at you, CEOs) or a ridiculous, massively scalable amount of luck (most super rich entrepreneurs).

Second, I’m going to coin a term here, but I’m sure someone else has already done so. Namely, I consider it horizon bias to think that wherever you are, whatever you do, is the coolest place in the world and that everyone else is just super jealous of you and wishes they had that job. So you don’t look beyond your horizon to see that there are other jobs that may be more attractive to people. The reason this comes up is the following paragraph:

What accounts for the decidedly non-diverse results in places like Silicon Valley? We have two competing theories. One is that deliberate racisms keeps people out. Another is that white men are simply the ones that show up, because of some combination of aptitude and effort (which it is depends on who you ask), and that admissions to, say Y Combinator, simply reflect the lack of diversity of the applicant pool, nothing more.

I’d like to offer a third option, namely that only white guys show up because that’s who thinks working in Silicon Valley is an attractive idea. I know it’s kind of like the second option above, but it’s not exactly. The qualification “because of some combination of aptitude and effort” is the difference.

Let’s say I’m considering moving to Silicon Valley to work. But all of my images of that place come from movies and my experiences with my actual friends in the dotcom bubble era who slept under their desks at night. Plus I know that the housing market out there is crazy and that the commute sucks. Finally, I’d picture myself working with lots of single, ambitious, and arrogant young men who believe in meritocracy (code for: use vaguely libertarian philosophical arguments to act ruthlessly). I can imagine that these facts keep plenty of non-white non-men away.

Next, going on to the point about horizon bias. People who already work in Silicon Valley already selected themselves as people who think it’s a great deal. And then they sit around wondering why it’s not a more diverse place, in spite of having everything awesomely meritocratic.

Going back to the article, Ries mentions this idea that diverse teams outperform homogeneous ones. I’d like to look at that in light of horizon bias and ask whether that’s the wrong way to look at it. In other words maybe it’s more a function of what the common goal is, which leads to a diverse team if the common goal is broadly attractive, than how the exact team was created. If goals are super attractive, attractive enough to draw diverse people, then maybe those goals deserve success more.

For example, one of the strengths of Occupy Wall Street has been the diversity of its membership. People of all ages, all backgrounds, and all races have been coming together to speak for the 99%. It’s of course fitting, since 99% does represent lots of people, but I’d like to point out that it is diverse because the cause resonates with so many people, which makes it successful.

Another example. I worked at the math department at M.I.T., which is famously not diverse. And I saw the “Truth Values” play recently which made me think about that experience some more. There’s lots of horizon bias in math, because there’s this assumption that everyone who was ever a math major should want to someday become a math professor (at M.I.T. no less). So it’s easy enough to wring your hands when you see that, although 45% of the undergrad math majors are women, and 40% of the grad students in math are women (I’m making these numbers up by the way), only 1% of the tenured faculty at the top places are women (again totally made up).

And of course there’s real discrimination involved (trust me), but there’s also the possibility that a bunch of women just never wanted to be professors; they just wanted to get a Ph.D. for whatever reason. But the horizon bias at the top places assumes that everyone would want to become a professor.

On the one hand I’m just making things worse, because I’m pointing out that in addition to the real discrimination that takes place for those women who actually do want to become professors, there’s also this natural but invisible self-selection thing going on where women leave the professorship train at some point. Seems like I’ve made one problem into two.

On the other hand, we can address this horizon bias, if it exists. But instead of addressing it by blotting out the names of candidates on applications (a good idea by the way, and one I think I’ll start using), we would need to address it by looking at the actual company or department or culture and see why it’s less than attractive to people who aren’t already there. It’s a bigger and harder kind of change.

Yesterday Columbia announced a proposal to build an Institute for Data Sciences and Engineering a few blocks north of where I live. It’s part of the Bloomberg Administration’s call for proposals to add more engineering and entrepreneurship in New York City, and Bloomberg has said the city is willing to chip in up to 100 million dollars for a good plan. Columbia’s plan calls for having five centers within the institute:

A few comments. Currently the data involved in media 1) and finance 5) costs real money, although I guess Bloomberg can help Columbia get a good deal on Bloomberg data. On the other hand, urban traffic data 2) and health data 3) should be pretty accessible to academic researchers in New York.

There’s a reason that 1) and 5) cost money: they make money. The security center is kind of in the middle, since you can try to make any data secure, you don’t need to particularly pay for it, but on the other hand if you can find a good security system then people will pay for it.

On the other hand, even though it’s a great idea to understand urban infrastructure and health data, it’s not particularly profitable (not to say it doesn’t save a lot of money potentially, but it’s hard to monetize the concept of saving money, especially if it’s the government’s or the city’s money).

So the overall cost structure of the proposed Institute would probably work like this: incubator companies from 1) and 5) and maybe 4) fund the research going on in (themselves and) 2) and 3). This is actually a pretty good system, because we really do need some serious health analytics research on an enormous scale, and it needs to be done ethically.

Speaking of ethics, I hope they formalize and follow The Modeler’s Hippocratic Oath. In fact, if they end up building this institute, I hope they have a required ethics course for all incoming students (and maybe professors).