I don’t remember when I first came across topic models, but I do remember being an early proponent of them in industry. I came to appreciate how useful they were for exploring and navigating large amounts of unstructured text, and was able to use them, with some success, in consulting projects. When an MCMC algorithm came out, I even cooked up a Java program that I came to rely on (up until Mallet came along).

Generating features for other machine learning tasks

Blei frequently interacts with companies that use ideas from his group’s research projects. He noted that people in industry frequently use topic models for “feature generation.” The added bonus is that topic models produce features that are easy to explain and interpret:

“You might analyze a bunch of New York Times articles for example, and there’ll be an article about sports and business, and you get a representation of that article that says this is an article and it’s about sports and business. Of course, the ideas of sports and business were also discovered by the algorithm, but that representation, it turns out, is also useful for prediction. My understanding when I speak to people at different startup companies and other more established companies is that a lot of technology companies are using topic modeling to generate this representation of documents in terms of the discovered topics, and then using that representation in other algorithms for things like classification or other things.”
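The workflow Blei describes maps neatly onto off-the-shelf tools. Below is a minimal sketch using scikit-learn (a toy corpus and labels of my own, not the systems Blei mentions): fit LDA on the documents, use each document’s topic proportions as features, and hand those features to a classifier.

```python
# Hedged sketch: topic proportions as features for a downstream classifier.
# The documents, labels, and topic count here are toy values for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "the team won the championship game last night",
    "stocks fell as markets reacted to the earnings report",
    "the startup raised a new round of funding from investors",
    "the coach praised the players after the playoff win",
]
labels = [0, 1, 1, 0]  # 0 = sports, 1 = business (toy labels)

model = make_pipeline(
    CountVectorizer(),                                           # term counts
    LatentDirichletAllocation(n_components=2, random_state=0),   # per-document topic proportions
    LogisticRegression(),                                        # classifier on the topic features
)
model.fit(docs, labels)
print(model.predict(["investors cheered the quarterly earnings"]))
```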

Scaling to large corpuses

The early algorithms that came out of the academic research community couldn’t handle large numbers of unstructured documents. I remember having problems fitting topic models against 100,000 documents (nowadays, this is a relatively modest corpus). But things changed around 2010, Blei explained:

“At some point, Google showed me that they fit a topic model to a billion documents and a million topics, something I couldn’t have done, and were using it as features for various things … Before 2010, it was hard for us, with our modest academic clusters … to analyze 100,000 documents. That was a big deal for us — if we analyzed 100,000 documents, that was a big piece of analysis. After that work, we now regularly can analyze half a million documents, three million documents, five million documents — and with a computer cluster, we can analyze billions of documents. This was a big change. This changed the scale at which we could do these kinds of analyses and, further, our algorithm generalized to many different kinds of statistical models.

“That’s just a little story, part of a bigger story in scaling up machine learning. I think a lot of the interest in statistical machine learning and data science right now is thanks both to it being a rich field that provides tools that help us understand and exploit patterns in data, but also that we’ve all been working hard to bring it up to date to modern data set sizes. This idea, which is called stochastic optimization, is one of the cornerstones of scaling up machine learning. What’s amazing about this is, that idea is from 1951. It’s Robbins and Monro, 1951, this little eight page mathematics paper.”
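To make the Robbins and Monro idea concrete, here is a tiny stochastic-approximation sketch of my own: each update sees a single noisy observation, and step sizes that shrink like 1/t (so they sum to infinity while their squares do not) still pull the estimate toward the right answer.

```python
# Toy Robbins-Monro / stochastic approximation: estimate a mean from a stream.
import numpy as np

rng = np.random.default_rng(0)
stream = rng.normal(loc=3.0, scale=1.0, size=10_000)  # noisy observations, true mean 3.0

theta = 0.0
for t, x in enumerate(stream, start=1):
    step = 1.0 / t        # satisfies sum(step) = inf, sum(step**2) < inf
    grad = theta - x      # noisy gradient of the squared-error objective
    theta -= step * grad  # one cheap update per observation

print(theta)  # close to 3.0 without ever touching the full data set at once
```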

Explosion of work in industry and academia

I think it’s fair to say that topic models are now being used by data analysts in all disciplines and companies. In recent years, it’s become a technique that researchers from the humanities and social sciences have come to rely on. Blei marvels at the number of people using topic models in their work:

“It’s funny, I remember at first I was aware of all the different extensions, and at some point that changed. I no longer can keep my hand on everything. I can’t know about all the different topic modeling papers that are out there, and so I get asked a lot of questions, ‘Hey, has anybody written a paper about this or this?’ I usually have to answer, ‘I’m not sure.’ I know what my students are up to, and when something has a lot of traction, of course, I know about it, but at some point I lost my own handle of the topic modeling literature because it was growing so fast. … I think that the success stories for topic modeling that I find most compelling are the ones where people in the social sciences and in political science and in the digital humanities are using these tools to help them with their close reading of large archives of documents.”

You can listen to our entire interview in the SoundCloud player above, or subscribe through TuneIn, iTunes, SoundCloud, or RSS.

When I first took over organizing Hardcore Data Science at Strata + Hadoop World, one of the first speakers I invited was Kira Radinsky. Radinsky had already garnered international recognition for her work forecasting real-world events (disease outbreak, riots, etc.). She’s currently the CTO and co-founder of SalesPredict, a start-up using predictive analytics to “understand who’s ready to buy, who may buy more, and who is likely to churn.”

I recently had a conversation with Radinsky, and she took me through the many techniques and subject domains from her past and present research projects. In grad school, she helped build a predictive system that combined newspaper articles, Wikipedia, and other open data sets. Through fine-tuned semantic analysis and NLP, Radinsky and her collaborators devised new metrics of similarity between events. The techniques she developed for that predictive software system are now the foundation of applications across many areas.

The challenges of prediction: from news headlines to cholera

Early versions of the predictive system did not yield interesting results until Radinsky and her collaborators discovered the additional insights they could derive from correlations:

“The problem was, when we were looking at only patterns of causality, we used to have only trivial things. I’ll give an example. An Iranian professor was killed, and the system would output that the funeral would be held, which is correct. You would even find it in the news … The problem is that when you build a system based on only those causality patterns, you only train it on what people already know … The next step that we did was add correlations in addition to what people already know, as cause and effect. We had this graph of causality, and we added additional correlations. Again, we don’t know their cause and effect, but we are going to use them when trying to predict future news events.

…

“Cholera is a waterborne disease … The system knew that storms can cause cholera. Again, not all storms cause cholera. So, we did storyline detection. We took all the news, and were trying to align the stories in a way that the same storyline or the same articles in the same topic would be aligned. This is a very well-studied academic topic, and we applied it in a way that would actually work for finding correlations of predictions and would look for correlations in similar storylines. What we found is that in all the storylines that discussed storms that eventually caused cholera, you would find that two years before that, you would have a drought in those areas. This is very surprising. The thing is, it was based on around six examples from Angola since 2006, which is not a lot of examples.

…

“In Bangladesh, since I think 1964, there were 90 significant cases of cholera. In 84% of them, before that, you had a drought. The thing is, what’s in common between Bangladesh and Angola? What we found out is that in countries with low GDP, not surprising, poor countries have high chances of cholera. Countries with low concentrations of water, they have this pattern of drought and then two years later storms, and then cholera. This is very surprising because, again, cholera is a waterborne disease. I would expect it to happen in places that have a lot of water.”

Predictive analytics for sales: no black boxes

In some domains, you need models that are easy to explain and interpret. Radinsky explained:

“The way the sales process usually works between two businesses, is they get a big list of potential leads, potential people that can buy from them — either people registering on their website, people giving them their business card, random names sometimes. They get, let’s say, a list of 20,000 people, but there are only five sales reps. They need to start calling them and generating opportunities to actually start closing deals with them. This is how this world works. … [Our system] tells them which lead is going to close and the size of the deal they can expect from them so they can actually manage the pricing. It’s similar with customers you already have: what’s the probability of churn. The issue with that is that when you’re building a prediction system for somebody to use … it has to be [explainable] in natural language … Nobody likes black boxes. Even when you try to predict future news events and you don’t explain why or what’s the pattern behind that, there’s no action item that they act on.”
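The explainability requirement tends to push toward models whose internals can be read back in plain language. The sketch below is a toy illustration of that idea (made-up features and labels, not SalesPredict’s system): a logistic regression lead-scoring model whose signed weights double as the explanation a sales rep would see.

```python
# Hedged sketch: an explainable lead-scoring model. Feature names, data, and
# labels are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["employee_count", "website_visits", "emails_opened", "days_since_signup"]
X = np.array([[200, 15, 4,  10],
              [ 10,  2, 0,  90],
              [500, 40, 9,   5],
              [ 50,  1, 1, 120]], dtype=float)
y = np.array([1, 0, 1, 0])  # 1 = deal closed (toy labels)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

weights = model.named_steps["logisticregression"].coef_[0]
for name, w in sorted(zip(features, weights), key=lambda p: -abs(p[1])):
    print(f"{name}: {w:+.2f}")  # signed weights read as a plain-language explanation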

Cancer research: same algorithms, new predictions

We closed by discussing recent applications of predictive analytics to medicine. Radinsky described how she recently teamed up with medical researchers to see if her techniques and tools can be used in the fight against cancer:

“Today, we’re working with doctors to try to predict different types of cancers using exactly the same algorithms. They’re providing us data about patients since 1975, like blood samples that were taken for those patients every year — similar to a sales process where you get some kind of input from your customers on a yearly basis, if you have a long period of interaction with them. Based on those, we’re trying to predict who’s going to have cancer or not in the next 20 or 30 years, based on this historical data.”

You can listen to our entire interview in the SoundCloud player above, or subscribe through SoundCloud, TuneIn, or iTunes.

I rarely work with social network data, but I’m familiar with the standard problems confronting data scientists who work in this area. These include questions pertaining to network structure, viral content, and the dynamics of information cascades.

Predicting whether an information cascade will double in size

Can you predict if a piece of information (say a photo) will be shared only a few times or hundreds (if not thousands) of times? Large cascades are very rare, making the task of predicting eventual size difficult. You either default to a pathological answer (after all most pieces of information are shared only once), or you create a balanced data set (comprised of an equal number of small and large cascades) and end up solving an artificial task.

Thinking of a social network as an information transport layer, Kleinberg and his colleagues instead set out to track the evolution of cascades. In the process, they framed an interesting balanced algorithmic prediction problem: given a cascade of size k, predict whether it will reach size 2k (it turns out 2k is roughly the median eventual size of a cascade, given that it reaches size k).

Their resulting predictive model used many features: content (whether text was overlaid on a photo), properties of the root node (its degree), temporal factors (time to reach size k, acceleration), and many others. Not surprisingly, they found that the temporal features are the most predictive — cascades that reach a certain size very quickly are likely to keep growing. But they also found feature redundancy: removing all the temporal features still led to models that produced decent results.
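As a rough sketch of how such a model might be assembled (toy features and labels of my own, not the study’s data or feature set), see below; the point is only that the framing reduces to an ordinary balanced classification problem.

```python
# Hedged sketch of the "will this cascade double?" framing with toy data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per cascade that reached size k:
# [minutes_to_reach_k, root_node_degree, text_overlaid_on_photo]
X = np.array([[ 30, 5000, 1],
              [600,  120, 0],
              [ 45,  800, 1],
              [900,   60, 0]], dtype=float)
y = np.array([1, 0, 1, 0])  # 1 = cascade reached size 2k; classes balanced by construction

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.feature_importances_)  # in the actual study, temporal features carried most of the signal
```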

The role of content in cascades: characterizing “memorable” quotes

What role does content play in the formation of these cascades? Social networks are difficult environments in which to study such questions, as they make it hard to assess what caused a piece of content to go viral (was it the content itself, or another factor, like the person who shared it?). What would be nice is a laboratory where viral content is being generated and is paired adjacently with less viral content.

Is there something in the text itself that one can use to predict whether a quote will become memorable? They measured the memorability of a quote using search engines (Google/Bing) and IMDB. As a baseline, one can build a simple bag-of-words classifier (memorable/non-memorable) using the resulting term vectors. The team also considered features like the distinctiveness of the text (unigram, bigram, and trigram frequency relative to the AP newswire) and “part-of-speech” composition.
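A minimal version of that bag-of-words baseline might look like the following (toy quotes and labels, purely illustrative):

```python
# Hedged sketch: a bag-of-words memorable/non-memorable baseline on toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

quotes = [
    "frankly my dear I don't give a damn",
    "I'll have what she's having",
    "we should probably leave before it gets dark",
    "let me check the schedule again tomorrow",
]
memorable = [1, 1, 0, 0]  # toy labels

baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(quotes, memorable)
print(baseline.predict(["I don't give a damn about the schedule"]))
```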

These latter factors (distinctiveness and part of speech) turn out to be important in predicting whether a piece of text becomes memorable. Kleinberg and his collaborators found that memorable quotes tended to be comprised of “… a sequence of unusual words built on a scaffolding of common parts-of-speech patterns.” They subsequently found that some of their findings extended to comment threads in social networks: threads are longer when text is more distinctive.

Given a person’s network neighborhood, can we identify their most significant social ties? A natural starting point is to use an embeddedness metric for an edge e (defined to be the number of mutual friends shared by endpoints of e). It’s natural to think that if an edge is highly embedded, it’s likely to be a stronger tie.

Edge between v and w has embeddedness 4, because they have 4 mutual friends. Source: Jon Kleinberg.

Given a user v, can we rely only on network structure to detect his/her significant other (“relationship partner”)? It turns out, using embeddedness alone to rank a person’s friends fares poorly at this task. Embeddedness finds nodes from the largest clusters in a person’s network graph, and in practice, this is frequently a person’s co-workers.
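Embeddedness itself is straightforward to compute. The sketch below (a toy graph of my own, not Facebook data) ranks one user’s friends by mutual-friend count and shows why the top of that ranking tends to be the dense cluster rather than the partner.

```python
# Hedged sketch: rank a user's friends by edge embeddedness (mutual-friend count).
import networkx as nx

G = nx.Graph()
# A dense cluster {w, a, b} around v, plus a partner connected through one mutual friend.
G.add_edges_from([("v", "w"), ("v", "a"), ("v", "b"), ("v", "c"), ("v", "partner"),
                  ("w", "a"), ("w", "b"), ("a", "b"), ("partner", "c")])

def embeddedness(G, u, w):
    """Number of mutual friends shared by the endpoints of edge (u, w)."""
    return len(set(G[u]) & set(G[w]))

ranked = sorted(G["v"], key=lambda n: embeddedness(G, "v", n), reverse=True)
print(ranked)  # the dense cluster outranks "partner" on this metric
```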

Backstrom and Kleinberg devised the following algorithmic problem: (1) for each user v, rank all friends w by competing metrics (embeddedness, dispersion, and activity-based metrics), and (2) determine for what fraction of users is the top-ranked friend the “relationship partner.” Dispersion performed much better than embeddedness across all categories. And for married couples, dispersion outperformed the activity-based metrics described above:

Fraction of users for which the top-ranked friend is the true “relationship partner.” (“photo” = number of photos in which v and w are both tagged; “profile view” = # of times v viewed w’s profile in last 90 days). Source: Jon Kleinberg.

The first row in the table above illustrates the power of having the correct algorithm. With the proper metric (dispersion), structural considerations outperform activity-based measures in detecting significant relationships.

These three examples provide a glimpse into the many interesting studies being conducted to understand how information, human audiences, and data come to interact. We’ll be covering related developments in future posts.

I only really started playing around with GraphLab when the companion project GraphChi came onto the scene. By then I’d heard from many avid users and admired how their user conference instantly became a popular San Francisco Bay Area data science event. For this podcast episode, I sat down with Carlos Guestrin, co-founder/CEO of Dato, a start-up launched by the creators of GraphLab. We talked about the early days of GraphLab, the evolution of GraphLab Create, and what he’s learned from starting a company.

MATLAB for graphs

Guestrin remains a professor of computer science at the University of Washington, and GraphLab originated when he was still a faculty member at Carnegie Mellon. GraphLab was built by avid MATLAB users who needed to do large scale graphical computations to demonstrate their research results. Guestrin shared some of the backstory:

“I was a professor at Carnegie Mellon for about eight years before I moved to Seattle. A couple of my students, Joey Gonzalez and Yucheng Low, were working on large scale distributed machine learning algorithms, especially with things called graphical models. We tried to implement them to show off the theorems that we had proven. We tried to run those things on top of Hadoop and it was really slow. We ended up writing those algorithms on top of MPI, which is a high performance computing library, and it was just a pain. It took a long time and it was hard to reproduce the results, and the impact it had on us is that writing papers became a pain. We wanted a system for my lab that allowed us to write more papers more quickly. That was the goal. In other words, so they could implement these machine learning algorithms more easily and more quickly, specifically on graph data, which is what we focused on.”

The original killer app = recommenders

Many of the machine learning projects and start-ups I interact with find initial traction in automatic recommender systems. GraphLab is no exception. (In fact, I first heard about GraphLab from users of its Collaborative Filtering library.) Recommenders are an easy entry point because product recommendations are so common on web sites and they are conceptually easy to explain to non-experts. Guestrin explained how a recommender library that started as an afterthought became a project in its own right:

“We put out this software in the open source community, and it was not something that we decided to do with a lot of ambition. We just put it out there. My postdoc at the time, Danny Bickson, came and said, ‘I’m going to write a recommender library on top of the system.’ We didn’t work on recommender systems in my lab, so it wasn’t really a high priority for me as a professor, but he really wanted to do it. He started implementing something called matrix factorization on top of it. The performance was incredibly good … He also put his recommender library out in the open source. Somehow, we started getting emails from folks saying, ‘we tried that and this doesn’t work, or this was fast, or we want these other things.’ We started getting feature requests for something that was an afterthought for us. … Danny started engaging with that community and getting lots of positive feedback, and what started as an afterthought — let’s put something on the open source — became a project of its own with growing adoption and really nice feedback from folks like Pandora and others.”
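Matrix factorization, the model Bickson started with, is compact enough to sketch in a few lines. The toy example below (plain NumPy and stochastic gradient descent, not GraphLab’s implementation) learns low-rank user and item factors from a small ratings matrix and fills in the unobserved cells.

```python
# Hedged sketch of matrix factorization for recommendations, on a toy ratings matrix.
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)  # users x items; 0 = unobserved
observed = R > 0

rng = np.random.default_rng(0)
k, lr, reg = 2, 0.01, 0.05
U = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors

for _ in range(2000):
    for u, i in zip(*np.nonzero(observed)):
        err = R[u, i] - U[u] @ V[i]
        u_row = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_row - reg * V[i])

print(np.round(U @ V.T, 1))  # predicted ratings, including the unobserved cells
```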

GraphLab’s tabular data structure (SFrames) was unveiled at last February’s Strata Conference in California. It definitely caught the attention of many attendees I spoke with, particularly Pydata users. With an API similar to Pandas, and a growing library of algorithms that are as easy to use as scikit-learn, more Python users will start gravitating toward GraphLab Create. Among other things, it scales to much larger data sets (even on a single machine), it is much faster than comparable Pydata tools (Python API calls a C++ backend), its library of algorithms is expanding, and tools for model management are on the way. (It’s a great time to be a Python data enthusiast, as there are other emerging frameworks — like Apache Spark — that are targeting Pydata users.) Guestrin noted:

“I started talking to some customers and they said, ‘yeah, we have graph data from social networks, but we also have this data with user profile information,’ which turns out to be tables. We have these images of pictures people take and we have this text information from product reviews and then I realized … Let’s design something that is highly scalable both for tabular and graph data and text and images.

… “For example, boosted decision trees is a well-known model for machine learning that can do well with data that requires non-linear features. We incorporated a very efficient implementation of boosted decision trees. … Similarly, deep learning has been getting really amazing performance, especially on things like image data and audio data, so we wanted to incorporate the library where you could do things like deep learning networks easily.

… “One of the things that’s been a big focus for us is the deployment piece of machine learning. If you think about machine learning, there is the training, there’s the data exploration, the data engineering, the training of the models, the intelligence — but eventually, your goal should be to deploy your solution as a system that runs on tons and tons and tons of data, maybe even on a cluster, or deploy that as a service that can be created in real time.”
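For the boosted decision trees Guestrin mentions above, a scikit-learn equivalent conveys the idea (this is a generic sketch, not Dato’s implementation): an ensemble of shallow trees picks up non-linear structure without manual feature engineering.

```python
# Hedged sketch: gradient-boosted trees on synthetic data with non-linear structure.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy
```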

Open Core

Many of the tools that Pydata users have come to depend upon are open source. I asked Guestrin which components of GraphLab Create will be open source. The answer will be revealed at Strata+Hadoop World next month, but I think it’s safe to guess that the components for data transformation (SFrames, SGraphs) and many basic machine learning algorithms will be open source. Guestrin stressed their commitment to open source:

“We benefitted from the open source community giving us feedback, contributing to our code, and we continue to be committed to that community. We’re inspired by companies like MongoDB and Elasticsearch that have an open source core and add-on tools. That’s how I view the company. However, when we started the company, we wrote GraphLab Create from scratch. It wasn’t a next version of GraphLab or PowerGraph or GraphChi. … We wanted to make sure the code was in good shape before we put it out there for people to contribute and participate in that way. We’ve only made GraphLab Create available as a free binary thus far. … [At Large-scale Machine Learning Day at Strata+Hadoop World] you’ll also be able to use the open source version of the core components of GraphLab Create.”

There are many algorithms with implementations that scale to large data sets (this list includes matrix factorization, SVM, logistic regression, LASSO, and many others). In fact, machine learning experts are fond of pointing out: if you can pose your problem as a simple optimization problem, then you’re almost done.

Of course, in practice, most machine learning projects can’t be reduced to simple optimization problems. Data scientists have to manage and maintain complex data projects, and the analytic problems they need to tackle usually involve specialized machine learning pipelines. Decisions at one stage affect things that happen downstream, so interactions between parts of a pipeline are an area of active research.

Some common machine learning pipelines. Source: Ben Recht, used with permission.

Identify and build primitives

The first step is to create building blocks. A pipeline is typically represented by a graph, and AMPLab researchers have been focused on building and optimizing nodes (primitives) that can scale to very large data sets. Some of these primitives might be specific to particular domains and data types (text, images, video, audio, spatiotemporal) or more general purpose (statistics, machine learning). A recent example is ml-matrix — a distributed matrix library that runs on top of Apache Spark.

Casting machine learning models in terms of primitives makes these systems more accessible. To the extent that the nodes of your pipeline are “interpretable,” resulting machine learning systems are more transparent and explainable than methods relying on black boxes.

Make machine learning modular: simplifying pipeline synthesis

While primitives can serve as building blocks, one still needs tools that enable users to build pipelines. Workflow tools have become more common, and these days, such tools exist for data engineers, data scientists, and even business analysts (Alteryx, RapidMiner, Alpine Data, Dataiku).
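The pattern of composing primitives into a pipeline is already familiar from scikit-learn; the small sketch below (a toy text pipeline of my own) shows the shape of it, and the AMPLab work applies the same idea to distributed primitives running on Spark.

```python
# Hedged sketch: a pipeline assembled from reusable primitives.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([
    ("featurize", TfidfVectorizer()),             # domain-specific primitive (text)
    ("reduce", TruncatedSVD(n_components=2)),     # general-purpose primitive (linear algebra)
    ("classify", SGDClassifier(random_state=0)),  # general-purpose primitive (learning)
])

docs = ["spark streaming job failed", "great movie, loved the plot",
        "cluster ran out of memory", "the acting was wonderful"]
labels = [0, 1, 0, 1]  # toy labels: 0 = ops log, 1 = review

pipeline.fit(docs, labels)
print(pipeline.predict(["job crashed on the cluster"]))
```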

Do some error analysis

“We’re trying to put (machine learning systems) in self-driving cars, power networks … If we want machine learning models to actually have an impact in everyday experience, we’d better come out with the same guarantees as one of these complicated airplane designs.” — Ben Recht

ML pipelines are beginning to resemble the intricacy of block diagrams from airplanes. Source: Ben Recht, used with permission.

Can we bound approximation errors and convergence rates for layered pipelines? Assuming we can compute error bars for individual nodes, the next step would be to have a mechanism for extracting error bars for entire pipelines. In practical terms, what we need are tools to certify that a pipeline will work (when deployed in production) and that can provide some measure of the size of errors to expect.

To that end, Laurent Lessard, Ben Recht, and Andrew Packard have been using techniques from control theory to automatically generate verification certificates for machine learning pipelines. Their methods can analyze many of the most popular techniques for machine learning on large data sets. And their longer term goal is to be able to derive performance characteristics and analyze the robustness of complex, distributed software systems directly from pseudocode. (A related AMPLab project, Velox, provides a framework for managing models in production.)

As algorithms become even more pervasive, we need better tools for building complex yet robust and stable machine learning systems. While other systems like scikit-learn and GraphLab support pipelines, a popular distributed framework like Apache Spark takes these ideas to extremely large data sets and a wider audience. Early results look promising: AMPLab researchers have built large-scale pipelines that match some state-of-the-art results in vision, speech, and text processing.

Back in 2008, when we were working on what became one of the first papers on big data technologies, one of our first visits was to LinkedIn’s new “data” team. Many of the members of that team went on to build interesting tools and products, and team manager DJ Patil emerged as one of the best-known data scientists. I recently sat down with Patil to talk about his new ebook (written with Hilary Mason) and other topics in data science and big data.

Proliferation of programs for training and certifying data scientists

Patil and I are both ex-academics who learned “data science” in industry. In fact, up until a few years ago, one acquired data science skills via “on-the-job training.” But a new job title that catches on usually leads to an explosion of programs (I was around when master’s programs in financial engineering took off). Are these programs the right way to acquire the necessary skills? Patil isn’t sure:

“We should call a spade a spade, which is [how] you and I both saw that master’s of financial engineering. The MIS degree, the information sciences degree. Many of these became, effectively, in the perception of people’s minds at this stage, second-rate degrees to computer science, or math, or physics. My fear is that the data science degree will become that. That would suck. That would be terrible. I think it’s very reasonable to say, “Hey, that data science can bloom into something much more organic.” Informatics and biophysics are good examples of areas that have done that. What is the right curriculum and the right things that are in there? My fear right now is that it’s overly geared toward consumer Internet products versus all the great things that can be done in social sciences and government, enterprise technology, medicine, health, hospital — all these areas I think are wide open, and it’s unclear in early stages on how to do that.”

Data products

Patil’s previous ebook covered some of his experiences building data products at LinkedIn. We talked about how the ideas he laid out are playing out beyond Silicon Valley:

“Yeah, I think it’s starting to emerge a little bit more, but I think it’s still very Silicon Valley centric. I think the thing we’re starting to see is when people say “data product,” they’re no longer restricted in how they think of it; it can be a whole company.

…

“I think it absolutely can be the government, and I think we’re going to see a lot more of that, the president signing an executive order that says, “Hey, everything has to be machine readable.” One of the big things we’re going to see over the next decade is how do we start really unlocking the value proposition from the genome, the genome to the phenome, the phenotyping to the medical records, and the outcomes of all these things. How does that all start to come together; that’s the data problem at the end of the day. One of the things that we’ll start realizing is that part of this is a numbers game, the more people who have access to their genome — what are the great things we might be able to unlock in terms of new pharmaceuticals and new treatments and understanding who we are?”

Ethics

One of the highlights of my conversation with Patil was our discussion on ethics — a subject that we’ve both been thinking about a lot. In particular, one of the things we’re following closely is the growing number of data scientists willing to take into account the (cultural) impact of models and data collection:

“Yes, and I think the thing that I’m happy about is many of those in the data science communities are the first to raise their hands about calling this an important item. I think what we’re going to start seeing as a critical component for the chief data officer is data ethics — just because you can, it doesn’t mean you should. There have been a number of times when I’ve worked with data and people, where people asked, “What is the implication of us doing this?” A lot of times, the implication is this perception: how’s this going to make someone feel? Is it going to be good? Is it going to be bad? What are the long-term aspects that we have to think through in putting this out there?

…

“Another issue I think that will be public debate is, “Should we be allowing these things to happen?” I think a lot of times people are most often concerned about the consumer Internet companies; I think people often forget about all these data brokers and other people who have been collecting this stuff, and the data is not even transparent to us. I’m not trying to give us a pass — to start to redirect the conversation away from Silicon Valley. It’s more a way of saying that we need to have a conversation where we talk about where our data is, how do we have control of it?

…

“I don’t [think many of the data science training programs extensively cover ethics]. I was very fortunate in my training to be required to go through ethics, an ethics class in very traditional style. I can’t tell you how many times that class has come to aid. Just simple questions — whether at LinkedIn, RelateIQ, the government, whatever — they always come to me because they give me a formal way to think about it and to have a conversation, because you hold incredible power when you have access to the data; you have to be able to ask yourself, “Should we be doing this?” Or, “How should we go about doing it?”

One of the trends we’re following is the rise of applications that combine big data, algorithms, and efficient user interfaces. As I noted in an earlier post, our interest stems from both consumer apps as well as tools that democratize data analysis. It’s no surprise that one of the areas where “cognitive augmentation” is playing out is in data preparation and curation. Data scientists continue to spend a lot of their time on data wrangling, and the increasing number of (public and internal) data sources paves the way for tools that can increase productivity in this critical area.

Scalability ~ data variety and size

Not only are enterprises faced with many data stores and spreadsheets, but data scientists also have many more (public and internal) data sources they want to incorporate. In the absence of a global data model, integrating data silos and data sources requires tools for consolidating schemas.

Random samples are great for working through the initial phases, particularly while you’re still familiarizing yourself with a new data set. Trifacta lets users work with samples while they’re developing data wrangling “scripts” that can be used on full data sets.
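That sample-first workflow is easy to emulate with plain pandas (a rough sketch with made-up data, not Trifacta’s DSL): develop the wrangling script against a small sample, then run the identical script on the full data set.

```python
# Hedged sketch: iterate on a sample, then apply the same wrangling script to all rows.
import pandas as pd

full = pd.DataFrame({"price": ["$1,200", "$85", "$3,400", "$19"],
                     "city":  ["SF ", "nyc", "SF", " NYC"]})

def wrangle(df):
    out = df.copy()
    out["price"] = out["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    out["city"] = out["city"].str.strip().str.upper()
    return out

sample = full.sample(n=2, random_state=0)  # develop and eyeball the script on a sample
print(wrangle(sample))
print(wrangle(full))                       # then run the identical script on everything
```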

Empower domain experts

In many instances, you need subject area experts to explain specific data sets that you’re not familiar with. These experts can place data in context and are usually critical in helping you clean and consolidate variables. Trifacta has tools that enable non-programmers to take on data wrangling tasks that used to require a fair amount of scripting.

Consider DSLs and visual interfaces

“Programs written in a [domain specific language] (DSL) also have one other important characteristic: they can often be written by non-programmers…a user immersed in a domain already knows the domain semantics. All the DSL designer needs to do is provide a notation to express that semantics.”

I’ve often used regular expressions for data wrangling, only to come back later unable to read the code I wrote (Joe Hellerstein describes regex as “meant for writing & never reading again”). Programs written in DSLs are concise, easier to maintain, and can often be written by non-programmers.
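A small illustration of that readability gap: the same extraction written as a terse one-liner and as a commented re.VERBOSE pattern, which is a modest step toward the kind of readable notation a DSL provides.

```python
# The same log-line extraction, terse versus annotated.
import re

line = "2015-01-15 ERROR disk full on node-7"

terse = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(\w+)\s+(.*)$")

readable = re.compile(r"""
    ^(\d{4}-\d{2}-\d{2})   # date
    \s+(\w+)               # log level
    \s+(.*)$               # message
    """, re.VERBOSE)

assert terse.match(line).groups() == readable.match(line).groups()
print(readable.match(line).groups())
```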

Trifacta designed a “readable” DSL for data wrangling but goes one step further: their users “live in visualizations, not code.” Their elegant visual interface is designed to accomplish most data wrangling tasks, but it also lets users access and modify accompanying scripts written in their DSL (power users can also use regular expressions).

These ideas go beyond data wrangling. Combining DSLs with visual interfaces can open up other aspects of data analysis to non-programmers.

Many data analysis tasks involve a handful of data sources that require painstaking data wrangling along the way. Scripts to automate data preparation are needed for replication and maintenance. Trifacta looks at user behavior and context to produce “utterances” of its DSL, which users can then edit or modify.

Don’t forget about replication

If you believe the adage that data wrangling consumes a lot of time and resources, then it goes without saying that tools like Tamr and Trifacta should produce reusable scripts and track lineage. Other aspects of data science — for example, model building, deployment, and maintenance — need tools with similar capabilities.

A few months ago, I spoke with UC Berkeley Professor and Databricks CEO Ion Stoica about the early days of Spark and the Berkeley Data Analytics Stack. Ion noted that by the time his students began work on Spark and Mesos, his experience at his other start-up Conviva had already informed some of the design choices:

“Actually, this story started back in 2009, and it started with a different project, Mesos. So, this was a class project in a class I taught in the spring of 2009. And that was to build a cluster management system, to be able to support multiple cluster computing frameworks like Hadoop, at that time, MPI and others. To share the same cluster and the data in the cluster. Pretty soon after that, we thought about what to build on top of Mesos, and that was Spark. Initially, we wanted to demonstrate that it was actually easier to build a new framework from scratch on top of Mesos, and of course we wanted it to be also special. So, we targeted workloads for which Hadoop at that time was not good enough. Hadoop was targeting batch computation. So, we targeted interactive queries and iterative computation, like machine learning.

…

“I also co-founded a company, Conviva. It was in the area of video management, and one of its products was an analytics tool. And as a part of that product, one feature was ad hoc queries. And, at that time, you know … we were using MySQL. MySQL was not good enough. I saw first hand the limitation of the existing technologies, especially on the open source side. And finally, you’re pretty anchored in seeing the trends and the problems in the industry around us, because we are funded at Berkeley by many companies, like Facebook, Yahoo and so forth…so, after the initial building…batch jobs, on top of Hadoop, they were looking for the next level; you want to have something faster.”

It’s one thing to build something as an academic project, where papers and conference presentations are the standard metrics. Successful open source projects involve great developers coming together to tackle real problems — having great timing is also usually an important and under appreciated factor. Ion explained:

“There are many components. And if you look back, you can always revise history. Especially if you had success. First of all, we had a fantastic group of students. Matei, the creator of Spark, and others who did Mesos. And then another great group of different students who contributed and built different modules on top of Spark, and made Spark what it is today, which is really a platform. So, that’s one: the students.

“The other one was a great collaboration with the industry. We are seeing first hand what the problems are, challenges, so you’re pretty anchored in reality.

“The third thing is, we are early. In some sense, we started very early to look at big data; we started as early as 2006, 2007, looking at big data problems. We had a little bit of a first-mover advantage, at least in the academic space. So, all this together, plus the fact that the first releases of these tools, in particular Spark, was like 2000 lines of code…very small, so tractable.”

UC Berkeley has had many successful open source projects in the past — BSD and Postgres are prominent examples. I asked Ion if early on they thought Spark would be embraced so enthusiastically by industry:

“Absolutely not. We wanted to have some good, interesting research projects; we wanted to make it as real as possible, but in no way could we have anticipated the adoption and the enthusiasm of people and of the community around what we’ve built.”

I recently spoke with Sarah Meiklejohn, a lecturer at UCL, and an expert on computer security and cryptocurrencies. She was part of an academic research team that studied pseudo-anonymity (“pseudonymity”) in bitcoin. In particular, they used transaction data to compare “potential” anonymity to the “actual” anonymity achieved by users. A bitcoin user can use many different public keys, but careful research led to a few heuristics that allowed them to cluster addresses belonging to the same user:

“In theory, a user can go by many different pseudonyms. If that user is careful and keeps the activity of those different pseudonyms separate, completely distinct from one another, then they can really maintain a level of, maybe not anonymity, but again, cryptographically it’s called pseudo-anonymity. So, if they are a legitimate businessman on the one hand, they can use a certain set of pseudonyms for that activity, and then if they are dealing drugs on Silk Road, they might use a completely different set of pseudonyms for that, and you wouldn’t be able to tell that that’s the same user.

“It turns out in reality, though, the way most users and services are using bitcoin was really not following any of the guidelines that you would need to follow in order to achieve this notion of pseudo-anonymity. So, basically, what we were able to do is develop certain heuristics for clustering together different public keys, or different pseudonyms. I’m happy to get into the technical details, but I’m not sure how relevant they are. The point is that, if you think these are good heuristics, then basically they provided evidence that a certain set of pseudonyms belonged to the same owner. And that owner could be a single individual or it could be an entire service, like Bitstamp or another exchange.”
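One heuristic of this kind that is widely cited in this line of work: public keys spent together as inputs to a single transaction likely belong to the same owner. The sketch below (a simplification with made-up transactions, not the paper’s full method) clusters keys on that signal with a union-find structure.

```python
# Hedged sketch: cluster public keys that appear together as inputs to one transaction.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Each transaction is listed as the set of input public keys it spends (toy data).
transactions = [{"A", "B"}, {"B", "C"}, {"D"}, {"E", "F"}]

uf = UnionFind()
for inputs in transactions:
    first, *rest = inputs
    for key in rest:
        uf.union(first, key)

clusters = {}
for key in {k for tx in transactions for k in tx}:
    clusters.setdefault(uf.find(key), set()).add(key)
print(list(clusters.values()))  # e.g. [{"A", "B", "C"}, {"D"}, {"E", "F"}]
```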

In the course of their research, Sarah and her collaborators realized that addresses used to collect excess bitcoins (“change addresses”) provided a good clustering mechanism:

“If you think about making change with physical cash, if I walk into a physical store and I hand the clerk a $20 bill, and my thing only costs $5, then I’m going to get $15 back in change, right? And in bitcoin, that process of making change is actually completely transparent, so you can observe the change public key in the blockchain.

“What we tried to do is distinguish change addresses, as we called them, from the legitimate recipient in the transaction. So, in my example in the store, you’d see two public keys as the outputs in that transaction: one of them would receive $5, and the other would receive $15. What we tried to do is develop a heuristic for distinguishing that $15 part of the transaction from the legitimate $5 recipient. That turned out to be much trickier, but that really was the bulk of the work in the project, just trying to make that heuristic as safe as possible.”
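As a rough illustration of one signal such a heuristic can lean on (a simplification of the paper’s much more careful approach), the change output often goes to a fresh, never-before-seen address, while the payment output frequently reuses one:

```python
# Hedged sketch: flag the single never-before-seen output address as the likely change.
def likely_change_output(outputs, previously_seen):
    """outputs: list of (address, amount); return the one fresh address, if unambiguous."""
    fresh = [addr for addr, _ in outputs if addr not in previously_seen]
    return fresh[0] if len(fresh) == 1 else None  # ambiguous cases stay unlabeled

seen_before = {"merchant_addr"}
tx_outputs = [("merchant_addr", 5.0), ("new_addr_42", 15.0)]
print(likely_change_output(tx_outputs, seen_before))  # new_addr_42 -> probable change
```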

Once they settled on heuristics with which to cluster addresses, the research project still required a data set for testing their theories. This entailed conducting and following transactions through the bitcoin ledger:

Image courtesy of Sarah Meiklejohn.

“The main issue with doing these clusters, with doing any kind of heuristics with bitcoin, is that it’s very hard to test how well you are doing. You know there’s no ground truth data in bitcoin. And since part of the point of it is this anonymous cryptocurrency, people aren’t going to voluntarily reveal information to allow you to test your heuristics or not. So we really had to collect this ground truth data ourselves. What this meant was this very manual process of doing our own transactions with as many users and services as we could find. For example, we opened up accounts with exchanges and deposited bitcoins into our accounts and withdrew them, and repeated this process with the same exchange many times, and with many different exchanges, and as you said, what we ended up with was this very carefully put together ground truth dataset. And then, layering this clustering on top of that dataset let us sort of bootstrap from that very minimal amount of ground truth data to saying the same things about much larger clusters of public keys. And also, the ground truth data allowed us to try to measure our heuristics and try to validate our heuristics.

…

“I would say the main conclusion was that, if you are not carefully considering anonymity, then you are probably not really achieving it. As I said, there are guidelines that you can follow, to keep the activity of your different pseudonyms distinct. And we did see people following those sorts of guidelines. So they were completely evading detection. We sort of developed not just these clustering heuristics, but this kind of mechanism for trying to follow flows of bitcoin throughout the network. And so, if you are careful, then you could go undetected. But if you are not careful, which the majority of users do not seem to be, then you are actually exposing quite a lot of activity, perhaps unknowingly.”

The techniques they used are accessible, but their research relied on some familiarity with the underlying data (and the protocol that produces it). It’s a data set that requires some “unpacking” before one can dive into analytics. One way data scientists can contribute is by creating scalable tools for visual exploration:

“I think data visualization goes a long way. So even mapping out … as I mentioned, bitcoin induces this graph structure, where you can think of the nodes as public keys or, after clustering, as users, and then these edges representing transactions between them. So if you had some great data visualization tool where you could explore those relationships visually, I think that would be quite helpful. So we did try to develop such a tool, using D3, but I think we didn’t push it as far as we could’ve. And so that would already be a way of exploring these relationships: not just typing queries into a database, but actually seeing the information, which could allow you to spot anomalies or to spot really interesting relationships.”
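One lightweight way to get started on that kind of exploration is to build the graph in networkx and export it in the node-link JSON format that D3 force layouts commonly consume (a sketch with made-up clusters and amounts):

```python
# Hedged sketch: a tiny transaction graph exported for a D3 force-directed view.
import json
import networkx as nx
from networkx.readwrite import json_graph

G = nx.DiGraph()
G.add_edge("cluster_1", "cluster_2", amount=12.5)
G.add_edge("cluster_2", "exchange_A", amount=12.0)
G.add_edge("cluster_1", "exchange_A", amount=3.1)

with open("transactions.json", "w") as f:
    json.dump(json_graph.node_link_data(G), f)  # {"nodes": [...], "links": [...]}
```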

As some of these service providers reach a certain scale, they will start coming under the scrutiny of regulators. Certain tenets are likely to remain: currencies require continuous liquidity and large financial institutions need access to the lender of last resort.

There are also cultural norms that take time to change. Take the example of notaries, whose services seem amenable to being replaced by blockchain technologies. Such a wholesale change would entail adjusting rules and norms across localities, which means going up against the lobbying efforts of established incumbents.

“Traditional models for financial payment networks and banking rely on centralized control in order to provide security. The architecture of a traditional financial network is built around a central authority, such as a clearinghouse. As a result, security and authority have to be vested in that central actor. The resulting security model looks like a series of concentric circles with very limited access to the center and increasing access as we move farther away from the center. However, even the outermost circle cannot afford open access.

…

“Centralized financial networks can never be fully open to innovation because their security depends on access control. Incumbents in such networks effectively utilize access control to stifle innovation and competition, presenting it as consumer protection. Centralized financial networks are fragile and require multiple layers of oversight and regulation to ensure that the central actors do not abuse their authority and power for their own profit. Unfortunately, the centralized architecture of traditional financial systems concentrates power, creating cozy relationships between industry insiders and regulators, and often leads to regulatory capture, lax oversight, corruption, and, in the end, financial crises.”


“Bitcoin’s unique architecture and payment mechanism has important implications for network access, innovation, privacy, individual empowerment, consumer protection, and regulation. If a bad actor has access to the bitcoin network, they have no power over the network itself and do not compromise trust in the network. This means that the bitcoin network can be open to any participant without vetting, without authentication or identification, and without prior authorization.

“Not only can the network be open to anyone, but it can also be open to any software application, again, without prior vetting or authorization. The ability to innovate without permission at the edge of the bitcoin network is the same fundamental force that has driven Internet innovation for 20 years at a frenetic pace, creating enormous value for consumers, economic growth opportunities, and jobs.

…

“I urge you to resist the temptation to apply centralized solutions to this decentralized network. Centralizing bitcoin will weaken its security, dull its innovative potential, remove its most disruptive yet also most promising features, and disempower its users while empowering incumbents.”

The regulatory status of cryptocurrencies and blockchains will be among the topics we’ll address at our one-day event — O’Reilly Radar Summit: Bitcoin & the Blockchain. Antonopoulos will be speaking at the event as well, but in the meantime if you want to learn more about bitcoin, join Antonopoulos in an upcoming free O’Reilly webcast.