Open Source, Data Science, Startups, and Life

After spending a few years in the Bay Area working at high-tech “Big Data” startups (Datameer, Alpine Data Labs, and H2O.ai, to name a few), my family and I decided to leave the fast-paced, sun-soaked peninsula to start the next chapter in our lives. In June of 2016, my wife, son, and I packed up our small rental house in San Mateo, California, and moved to the Pacific Northwest (PNW) for the outdoors, the affordability, and what I call the Maker movement culture of Portland, Oregon. Whether it’s biking, cooking, brewing, composing, crafting, or any other activity that involves sustained concentration, Makers are flourishing in this town due to a number of key elements. Over the next few months, I’ll attempt to extract these elements as I interview local makers in many industries and professions. Follow me as I embark on a tour of Portland Makers, delivered to you in dispatches every month.

In the meantime, check out a chapter from a book I contributed to called “The End of Tech Companies” by Rob Thomas.

“Developers make software for the world to use. The job of a developer is to crank out code – fresh code for new products, code fixes for maintenance, code for business logic, and code for supporting libraries.” –Nick Hardiman

When was the last time you built something from nothing? Was it the time you had to make a diorama for a school project? How about a gift for someone else? Perhaps you composed a song for someone you love. Whatever it was, there is nothing quite like the feeling of creating something from nothing. It is a form of expression that invokes creativity, freedom, passion, and deep thinking. For these reasons, a growing number of people are inspired to learn new skills to make new things. Until recently, people who identified themselves as “makers” were considered hobbyists, do-it-yourselfers, craftsmen, or simply tinkerers. Although those makers are continuing to thrive, other makers in the form of Designers, Developers, Marketers, and the emerging Data Science Practitioners are moving out of niche areas into many professions across every industry. Some of these professions used to be considered Ivory Tower disciplines. But now, computer science, for example, has been penetrated by the “maker movement” and its practitioners simply recast as “developers”.

Developers represent the largest maker movement of our time by making software for everything imaginable, from consumer applications to enterprise processes, to entire marketplaces and, most recently, to automated systems that can think. It is important to realize that these developer makers operate differently from others. For one thing, they are highly suspicious of “black box” solutions. Many vendors have tried to reach developer makers with proprietary software solutions, and failed. Developer makers also differ from other professions in how they work. For example, as Paul Graham states, “one reason programmers dislike meetings so much is that they’re on a different type of schedule from other people. Meetings cost them more.” Although this was written about developers, it applies to any profession in which sustained attention is needed to build. He goes on to say that “when you are operating on a maker’s schedule, meetings are a disaster. A single meeting can blow a whole afternoon, by breaking it into two pieces each too small to do anything hard in.” Thinking gets shifted back to the fast-paced world of immediate actions and away from deep concentration. Reduce distractions, and developer makers become far more productive on account of their highly resourceful and self-reliant nature, providing highly detailed information about programs in the form of documentation, example code, and a thriving community. It is no wonder that developer makers are the makers who most significantly disrupt industries and professions. One profession that had been out of reach until recently is Information Management.

In early 2010, a diverse functional team sat down to discuss why user growth and revenue had slowed, even though installations were on the rise. Marketing professionals presented their campaign data that showed a strong conversion rate of web traffic to downloads. Product analysts showed that the free-to-paid conversion was steady. Lastly, financial analysts showed that the daily, weekly, and monthly active user counts were declining, along with subsequent revenue. Business analysts were on the hook to come up with an explanation for this disconnect. Unfortunately, they had neither access to data, nor a flexible data environment, nor the sophisticated analytical tools needed to connect the dots.

Meanwhile, the engineering team was collecting application log files, network delivery log files, installer log files, and license types as part of the quality control effort. These individuals were not considered part of the “information” or “business intelligence” group and therefore did not make this data available for others – until a group of data scientists and engineers, or data makers, convinced the engineering team to open its data assets to the organization through a distributed data environment built on Hadoop. As soon as these data makers were able to work with the data in an unrestricted way, they quickly developed data products, including curated data sets and business metrics, which could be validated by analysts before being rolled out to the rest of the organization. It took a team of data makers, who could facilitate a conversation across the two organizations, to expand the corpus of information that was available to the business. Data was programmatically used to solve the riddle of the user problem. As it turned out, the answer could be found in the combination of clickstream data, installer log files, and transaction records that showed a channel-specific relationship to specific product offerings that was not otherwise apparent.

As they disrupt the information consumption status quo, data makers are emerging as organizational change agents. Prior to 2011, Information Management consisted of a linear series of steps to produce a dashboard or report that could be distributed, perhaps quarterly, as part of a business review. Data makers apply creativity to attack business outcomes. They wrangle, munge, extract, and analyze data to transform it into a product that incites others to act. To foster a data maker culture, it is critical to make data available, provide an open forum for results to be discussed, and provide a collaborative environment for data artifacts to be shared across organizations.

Today these professions share information freely and promote education through workshops and online courses. Individuals and organizations are starting to realize that to do their best work or attract top talent, the walls between professionals must come down. Makers, by their very nature, are collaborative and open to all comers. Makers have driven the rise of open source software, meetups, hackathons, Massive Open Online Courses (MOOCs), and various programs that promote inclusivity in technology. This is the Maker Era, a key cultural condition for prospering in the post-tech world.

Google or the Yellow Pages, Uber or Yellow Cab, Netflix or Comcast, Nest or Honeywell, Stitch Fix or Nordstrom, Etsy or Bernhardt Furniture: every industry, profession, startup, and enterprise is making a hard shift to digital at a rapid pace. Data exhaust is growing exponentially as a by-product of this digital evolution, generated by every interaction across mobile, web, social, commerce, and the many new touch points of the digital landscape. According to Oxford Dictionaries, data is “facts and statistics collected together for reference or analysis.” It is important to note that data is different from information. It is an abstraction from information that lends itself to code and math to make data products. An ecosystem of data suppliers, producers, services, and consumers is emerging to support a dataFirst development practice.

Many factors are accelerating the transition from the offline world to the always-on generation: a cultural shift in connectedness; a technology shift from centralized to decentralized computing infrastructure; and an economic shift from cost-prohibitive resources to accessible cloud computing, memory, storage, and software, driven by the rise of Open Source Software (OSS) and indirect monetization business models. Making this transition is not easy, and it requires data literacy to stay competitive and take full advantage of the shift.

At IBM, we have recognized this shift by declaring our strategic initiatives as cloud computing and cognitive solutions. At the center of this shift is data. Said another way, our clients are moving their business online and in doing so creating data exhaust that can be leveraged for machine learning to build data products like customer service chat bots and teaching assistants. Unfortunately, there isn’t a data platform for working with exhaust data and transactional data to build data products for cognitive solutions.

We have built a platform for application development, Bluemix, and one for cognitive solutions, Watson, but have yet to build a platform for data that connects the two with a robust ecosystem of data producer and consumer partners. Instead, as an industry, we have continued to drive product-centric data ecosystems that succeeded in the past but are now faltering as the data consumer transforms. For example, the NoSQL data ecosystem (Hadoop, Cassandra, MongoDB), the MPP data ecosystem (Vertica, Netezza, Greenplum), and the RDBMS data ecosystem (MySQL, PostgreSQL, DB2, Oracle, etc.) have all depended on a Business Intelligence consumption model to drive a business process. Dashboards are dead.

On September 26th, IBM will launch the first data platform that is built on open source software and cloud computing and that includes key Watson services to deliver cognitive solutions. Additionally, we’ll introduce dataFirst methods to help clients and partners bridge the gap between digital and cognitive solutions. We’ll introduce dataFirst certifications to extend the data platform by supporting a broad ecosystem of partners building on a single data platform that is open for all. We’ll bring together leading data programs across IBM, including consulting services, skills and training, independent software vendors, technology leaders, and many others who have an interest in data, into a seamless experience that maximizes interactions between data producers and consumers.

In the past year, we invested in open source technology, most notably Apache Spark as the Analytics OS, and introduced industry-leading user experiences for both data consumers and data producers. Watson Analytics makes analytics consumable, and the Data Science Experience makes data producible. Together, these two offerings represent the two ends of the data and analytics spectrum. After the data platform launches, these two disparate experiences will be connected through a fabric of open services to a growing ecosystem of suppliers for data ingestion, persistence, machine learning, orchestration, discovery, and access. In addition to the Data Science Experience and Watson Analytics, other producer and consumer endpoints will emerge to address every industry and profession. For example, in IoT, we’ll introduce experiences for device makers and application developers. By connecting data producers to data consumers, a data marketplace is born in which dataFirst practitioners collaborate and learn from each other instead of remaining niche providers in disconnected ecosystems.

Few inventions in American history have had the massive impact of the IBM® System/360: on technology, on the way the world works, and on the organization that created it. Jim Collins, author of Good to Great, ranks the S/360 as one of the all-time top three business accomplishments, along with Ford’s Model T and Boeing’s first jetliner, the 707.

Most significantly, the S/360 ushered in an era of computer compatibility—for the first time, allowing machines across a product line to work with each other. In fact, it marked a turning point in the emerging field of information science and the understanding of complex systems. After the S/360, we no longer talked about automating particular tasks with “computers.” Now, we talked about managing complex processes through “computer systems.”

Before the System/360 and its operating system were announced in 1964, there were individual peripherals with different user interfaces, programming models, connection ports, and storage media. This meant that for any new business solution, a programmer was effectively starting from scratch. It’s like Ford having to invent a new engine every time it releases a new car, and then retrain all of the mechanics who support it. Ten million lines of code later, more than 20 peripheral solutions emerged as the information platform that helped put a man on the moon, manage millions of flight reservations, and introduce the era of information science. A standard operating system for information science created economies of scale, which in turn lowered the barrier to entry. Now, anyone wanting to develop an information solution could take advantage of System/360 and take information science even further. One problem: the operating system was attached to a massive mainframe computer.

Fast forward 30 years, a portable operating system emerged with no strings attached.

Over the next 30 years, the System/360 operating system was adopted by practically every Fortune 100 company as its system of record. IBM established the role of Chief Information Officer, and many companies followed suit by appointing CIOs to organize their information assets. Information management was established, and the information age began.

But data, development, and access were all confined to a select few information managers due to the high costs associated with the systems and the fear of spreading misinformation. It wasn’t until Linus Torvalds, a computer scientist (a new profession at the time), invented a new operating system in 1991 that was “portable” to any system, big or small, that data began to be democratized. Not only did he make the operating system portable, he also licensed it as an open source technology, completely removing all barriers to adoption: all you needed was hardware, skills, and creativity. This portable application operating system was Linux. At 13 million lines of code, Linux established itself as the application development operating system that launched the Web, Social, Mobile, and numerous applications that created new systems of engagement. A massive audience could now interact with ubiquitous information.

In 2000, Linux received an important boost when IBM announced it would embrace Linux as strategic to its systems strategy. A year later, IBM invested US$1 billion to back the Linux movement, embracing it as an operating system for IBM servers and software. Over the next 15 years, IBM introduced 500 solutions built on Linux and contributed millions of lines of code from over 600 open source contributors.

Millions of applications built on Linux opened the flood gates to rich data with value trapped just below the surface. To fish value out, mathematicians fell madly in love with system engineers.

Almost as quickly as Linux was introduced, the amount of data exhaust created by applications across mobile, social, and web in new systems of engagement introduced a data problem never before seen. Simply finding information became a monumental challenge: the World Wide Web needed an open source search engine. Doug Cutting, then an engineer at Yahoo, and University of Washington grad student Mike Cafarella built what became Apache Hadoop, a marvel of systems engineering designed to distribute data and processing across many commodity servers. Apache Hadoop turned working with data on its head. All of a sudden you could leap over the information managers who controlled the so-called “extract transform load” (ETL) process that bottlenecked new data ingestion. Apache Hadoop introduced “extract load transform” (ELT), making it possible for anyone to work with any data type, no matter the source. Apache Hadoop’s success as an unstructured data management environment set the bar for storage; the introduction of Apache Spark a few years later did the same for compute and took distributed systems even further. If Apache Hadoop was the hard drive, Apache Spark is the processing chip for complex math. In a short time, we went from algebra to calculus, making machine learning possible at a much larger scale.

Now, mathematicians can use any data type to build algorithms that learn, and the challenge is no longer a data problem but a systems engineering one. Apache Spark changes the way we work with data with an elegant API that means you don’t have to think in terms of distributed programming. Spark does this by keeping the processing logic in memory as a directed acyclic graph (DAG), which lets it process data interactively while carrying forward the advantages that Hadoop introduced: there’s no need to format, cleanse, or manipulate the data before storage and processing. Hadoop and Spark have set the stage for a new way to manage and compute data, ushering in the Cognitive Era. Spark and Hadoop alone are not enough, however, to build a robust platform that is portable, scalable, usable, flexible, and able to meet the demands of industry. For this reason, we launched the Spark Technology Center (STC) in San Francisco, and an additional STC in India last week. The Spark Technology Center is growing the ecosystem around Apache Spark to help meet the real-world demand for Spark-based applications. Already we have introduced Apache SystemML, Toree, and most recently Quarks to expand the industry use cases for distributed analytics.

A quark (/ˈkwɔːrk/ or /ˈkwɑːrk/) is an elementary particle and a fundamental constituent of matter. Quarks combine to form composite particles called hadrons, the most stable of which are protons and neutrons, the components of atomic nuclei.

Last week IBM introduced a new open source project called Quarks. The name represents the smallest analytics operating system that can run on any device imaginable. It was created from years of research and development on System S (Streams), the innovation that supplies the foundation for continuous computing for the most advanced organizations in the world, including the city of Stockholm, Wimbledon, telcos, financial institutions, governments, automakers, and many others.

Now every device can be more intelligent at the edge without having to be always connected to the internet. Quarks allows complex models to be run at amazing speeds and allows analytics to be run against data streams, not only data at rest. Quarks works with Spark to unify access to data across the organization through support for multiple programming languages and a multitude of data sources, and it reduces development time with high-level tools for machine learning and streaming data.

Quarks with Spark opens data science to many users, such as designers, mathematicians, data scientists, and developers. It’s an agile way to build applications powered by any kind of data and push them to the absolute edge of the web.

So what’s next? My prediction is that we are on the verge of data products becoming mainstream.

Data products are not quite applications nor are they simply mathematical equations. To me, they are a combination of data pipelines that feed machine learning algorithms that are embedded into the very fabric of our decision making experiences. These data products are the gateway to the cognitive era. Data products will be built to augment human thought processes in a computerized model. Cognitive computing involves self-learning systems that use data mining, pattern recognition and natural language processing to mimic the way the human brain works.

Want to build data products and learn more about open source analytics? Join me at our next Datapalooza event in Austin next month by going to http://www.spark.tc/datapalooza

Note: this is a repost of the article I wrote for KDnuggets last November.

In statistics, bootstrapping can refer to any test or metric that relies on random sampling with replacement. In simple terms, it provides a way to estimate the accuracy of a sample statistic by approximating its sampling distribution, which is often used in constructing a hypothesis test. In business, bootstrapping refers to starting a business without external help or capital. Bootstrapping in general parlance refers to an absurdly impossible action, “to pull oneself over a fence by one’s bootstraps.” R and Hadoop are very much bootstrapped technologies, having received zero direct investment capital and relying on what might appear to be a random group of contributors over the past 20 years in practically every industry and use case imaginable.
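To make the statistical sense of the term concrete, here is a minimal sketch in R of a bootstrap estimate of the standard error of a sample mean. The data vector, the seed, and the number of resamples are invented for illustration and do not come from any analysis in this post.

```r
# Hypothetical sample: 200 draws from a normal distribution (illustrative only)
set.seed(42)
x <- rnorm(200, mean = 10, sd = 2)

# Resample x with replacement and return the mean of each resample
bootstrap_means <- function(x, n_resamples = 2000) {
  replicate(n_resamples, mean(sample(x, size = length(x), replace = TRUE)))
}

boot <- bootstrap_means(x)

# The spread of the resampled means approximates the standard error of the mean
boot_se <- sd(boot)

# A simple 95% percentile interval for the mean
boot_ci <- quantile(boot, probs = c(0.025, 0.975))

print(boot_se)
print(boot_ci)
```

The bootstrap standard error should land close to the textbook value `sd(x) / sqrt(length(x))`, which is a quick sanity check that the resampling is doing what is claimed.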

R Pirates Pillage Businesses Worldwide

R first appeared in 1993 when Ross Ihaka and Robert Gentleman at the University of Auckland released a free version as a software package. Since then, R has grown to over 3 million users in the US alone according to the download site log files released last year.

In addition, R surpassed SAS with over 7,000 unique packages, which you can view on the Crantastic website. It is no wonder it has found wide use in many industries and academia. In fact, during the summer of 2014, R surpassed IBM SPSS as the most widely used analytics software for scholarly articles, according to Robert Muenchen. For this reason, R is now the “Gold Standard” for doing all sorts of statistics, economics, and even machine learning. Furthermore, in my experience many if not most people use R as a complementary tool for spot-checking their work, even when using other far more expensive or popular enterprise software. It is no wonder R is quickly taking over as the go-to tool for Data Scientists in the 21st century.

What is fueling R’s growth is predominantly the community, which keeps the core software useful and relevant by answering common questions via many blogs and user groups. In addition, it is clear there is an underserved job market according to data from LinkedIn (see image below). Due to this demand, R is now offered at practically all major universities as the de facto language for statistical programming, and many new online courses are starting each day. DataCamp is one such example, having built an interactive web environment with rich lessons that let non-programmers get started easily without ever touching a command line.

Businesses too are flocking to statistics and embracing the probabilistic, rather than deterministic, nature of the problems that arise when data is expanding in size and rate faster than traditional Business Intelligence can keep pace. For this reason, many turned to Hadoop to open up the data platform and unlock the world of enterprise data management that had been kept away from business analysts for many years. Gone are the days of pre-filtered, pre-aggregated dashboards and Excel workbooks emailed around haphazardly to executives and decision makers, with little interpretation and no “storytelling” to guide the business toward informed decisions.

Hadoop Growth

Apache Hadoop arrived in 2005, more than 10 years after R first hit the scene, and wasn’t widely adopted until as late as 2013, when more than half of the Fortune 50 got around to building their own clusters. The name “Hadoop” comes from a toy elephant that belonged to the son of famed Yahoo! engineer Doug Cutting, who along with Mike Cafarella originally developed the technology to create a better search engine, of course. Along with its ability to process enormous amounts of data on relatively inexpensive hardware, it also made it possible to store data on a distributed file system (HDFS) without having to transform it ahead of time. As with R, many open source projects were created to re-imagine the data platform. They range from getting data into HDFS (Sqoop, Flume, Kafka, etc.), to compute and streaming (Spark, YARN, MapReduce, Storm, etc.), to querying data (Hive, Pig, Stinger / Tez, Drill, Presto, etc.), to datastores (HBase, Cassandra, Redis, Voldemort, etc.), to schedulers (Oozie, Cascading, Scalding, etc.), and finally to Machine Learning (Mahout, MLlib, H2O, etc.), among many other applications.

Unfortunately, there is no simple way to see all of these technologies and install them with one line of code, as you can with R packages. Nor is MapReduce a simple programming model for the average developer. In fact, as with R, LinkedIn data clearly shows a shortage of workers skilled in Hadoop and MapReduce relative to the number of jobs available. It is for this reason that Hadoop has not fully caught fire in the same way R has, and there was talk of its demise at the Strata + Hadoop World conference in NYC this past fall.

What’s the Real Problem Here? In One Answer, Data

Over the past few years, data itself has emerged as the number one problem in the field of data science, given the vast variety and volume of data practitioners must work with. I’d be remiss not to mention the velocity of the unrelenting data waves crashing against our fragile analysis environments. In fact, according to IDC, the volume of data is projected to exceed the number of stars in the universe by 2020. Fortunately, there is an entirely new approach to this problem, one that has until now escaped us because of our persistent habit of wanting to constrain data to our querying tools.

Machine Learning is the new SQL

Put simply, “Machine Learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data.” It is a quantum shift from the standard way of simply counting things; instead, it’s the start of a fantastic journey into the deeper pools of the unknown.

So here comes the really interesting part of the story. According to my LinkedIn analysis, Machine Learning and Data Science are actually very well matched to the overall demand in the job market to the people available (unlike R and Hadoop).

To really understand what is going on here, we’ll need another plot, this time of the actual number of jobs that exist for Data Science versus Machine Learning. My interpretation of this graph is that the job of a data scientist today is synonymous with that of an analytics professional or analyst, and the real opportunity is in the growing area of Machine Learning.

Machine Learning is the New Kid on the Block

Data Science was first described on Drew Conway’s blog as the intersection of programming or “hacking skills,” math and statistics, and business expertise. As it turns out, programming is too generic a term; what is really meant is applying math to large-scale data through new algorithms that can crawl through this tangled mess. To search for answers in this jungle, simply flying over the canopy will not reveal the treasure boxes hidden just beneath it. It is evident to me that the number of machine learning projects that have cropped up, together with a market mature enough to accept probabilistic as well as deterministic information, marks a new era in the race to find value in our data assets.

“The machine does not isolate man from the great problems of nature but plunges him more deeply into them.” – Antoine de Saint-Exupéry

Hadoop 2.0 is Here, Sort Of

Many people have tried to claim that Hadoop 2.0 arrived with MR2, YARN, or high-availability HDFS capabilities, but this is a misnomer when you consider the similarly named Web 2.0, which brought us into the age of web applications like Facebook, Twitter, LinkedIn, Amazon, and the vast majority of the internet. John Battelle and Tim O’Reilly, now of Strata fame, defined that shift simply as “Web as a Platform,” meaning software applications are built upon the Web as opposed to the desktop. Hints of a similar change are coming from Apache Spark, specifically new capabilities like SparkR, KeystoneML, and extensions that make it possible to develop intelligent applications on large-scale data. As Matei Zaharia, the godfather of Spark, put it in his keynote address earlier this year, “it’s all about data science and interfaces.” It is now finally possible for Data Scientists and Developers to work together in the same framework. Having worked in the “Big Data” industry for some time, I am convinced that software developers and statisticians want to program in their own language, not MapReduce. It’s an exciting time to move off the desktop and onto the cluster, where the constraints are lifted and the opportunities are endless.

Jobs and Skills Analysis Explained

Many people have lately researched the growing popularity of statistical software using indirect measures like academic research citations, job posts, books, website traffic, blogs, surveys like the KDnuggets annual poll, GitHub activity, and many more. However, all of these methods have generally focused on the technical crowd. Where the rubber meets the road is in the business context, which in my mind LinkedIn captures well, as it is highly representative of the business world. Further, if you want to reach an even more general audience, you can perform the same trick with Google AdWords. Reverse engineering ad platforms is a good way to get back-of-the-envelope market sizing information. I wrote a complete post on this subject on my personal blog. In the following instructions, I’ll walk through how I used LinkedIn as my sample and R to analyze the business market for the analysis above.

1. Data Gathering

There are two ways to gather data from LinkedIn. One is to use the ad tool shown in the left image below; the other is to use the direct search functionality shown on the right.

In this case, I went the manual route and used the search function. Each product category I search for returns a count of results, and I use that count as a proxy for demand. See below:

From this example, we can see the phrase “R Programming” has 1720 results. I’ve also included “R statistics” and other “R” relevant terms.

2. Data Analysis

As a new R user myself, I manually created each data frame to hold the data by first creating the individual vectors for the people and jobs:
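The original snippet is not reproduced here, so the following is a minimal reconstruction sketch rather than the author’s script. The skill names and every count are hypothetical placeholders; only the object names `skills` and `name` and the `ratio` column are taken from the `barplot` call in this post.

```r
# Hypothetical placeholder values; substitute the counts from your own LinkedIn searches
name   <- c("R", "Hadoop", "Machine Learning", "Data Science")
people <- c(1000, 2000, 3000, 4000)   # profiles listing the skill (made up)
jobs   <- c(100, 400, 300, 500)       # job posts mentioning the skill (made up)

# Combine the individual vectors into one data frame, one row per skill
skills <- data.frame(name = name, people = people, jobs = jobs,
                     stringsAsFactors = FALSE)

# People-to-jobs ratio: a lower value means fewer people per open job
skills$ratio <- skills$people / skills$jobs

print(skills)
```

With `skills` and `name` defined this way, the `barplot` call in this post has the inputs it refers to.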

R comes with many visualization packages, the most notable one being ggplot2. For this situation, I used the built-in barplot, as it was much easier out of the box. Frankly, the visualizations produced in R may not seem the most compelling to a general audience, but R does force you to consider what you’re plotting, which makes for more informed visuals.

To get the bar graph (in H2O colors):

barplot(skills$ratio, names.arg=name, col="#fbe920", main="Ratio of People to Jobs", xlab="Skill")

That’s it! Pretty simple, and I am sure there are ways of doing this analysis more elegantly, but for me this was the way I could be sure the analysis makes sense.

You can download the full script to try it yourself or to add to my analysis.

Well, it’s official: the future is here. This week we were introduced to how we’ll power our homes, businesses, transportation, and industries for years to come. In Elon Musk’s keynote, he took us on a journey that started with a very simple problem statement: we are producing energy poorly and polluting the planet in doing so. As with any good engineering problem, the solution lies in breaking the problem down into its individual parts. The first part is where to get the energy. Well, it turns out “we have this handy fusion reactor in the sky called the sun,” explains Musk. Unfortunately, the sun doesn’t shine at night, so the second part of the problem is how to store the energy for use at night and during off-peak hours (as our energy needs fluctuate). It turns out the answer is a battery pack. So there you have it: we now have a way to store the sun in our homes and power our lives, not only in developed countries but technically anywhere. How to make this economically feasible is a question left unanswered.

Copyright of Tesla Motors.

Curious how the battery pack works? You can actually view the patent here. In fact, Tesla went even further to remove its patents and open source the technology for anyone to use. With Tesla’s announcement this week, it is clear to me that Tesla is not a product company trying to sell a few cars to the wealthy, but a technology company looking to start a societal revolution by allowing others to improve on these technologies without fear of litigation.

Why, you might ask, am I talking about energy technology on my analytics blog? To me, there are striking similarities to how data as a resource is consumed by the few who have the skills, technology, and resources to apply it in their everyday lives. The problem statement in this case is that we continue to manage data poorly and are polluting the decision-making process with bad insights that impact every aspect of our lives. Similar to the energy problem, there is no single solution, but multiple parts that we need to address. The first part of this problem is the pervasiveness of data exhaust from every digital thing, which is poorly instrumented and difficult to work with. For this, we have a pretty good solution called “Hadoop” that we’ll need to continue to make easier for people to use. Hadoop clusters in your home, anyone? If you don’t believe me, check out the company BigBoards. Next, we’ll need to find a way to store and process data efficiently. For this, the front runner to me is Spark, with its ability to crunch through data fast, apply sophisticated operations to find patterns in data, and then stream the results into applications through easy-to-use APIs. Last, we’ll need to learn how to engineer a universal way of consuming the “energy” or, as it is commonly referred to in the analytics category, the “Insight” that comes out. For this we have not yet identified a universal solution, and this gap represents the Data Science Last Mile I have written about before. Looking into the future, there is a lot to be optimistic about as I look at additional parts of society we’ll disrupt.

Editor’s Note: This is my first blog post since returning from paternity leave. I am happy to announce the birth of my son Maxwell Horwitz. I’m already applying data science to his routine. His health records are stored online, we monitor his intake (and outtake) of food, and we monitor his sleep via motion cameras. It’s an exciting time to bring a new life into the world!

Data Science is often referred to as a combination of developer, statistician, and business analyst. In more casual terminology, it can be more aptly described as hacking, domain knowledge, and advanced math. Drew Conway does a good job of describing the competencies in his blog post. Much of the recent attention is focused on the early stages of the process: establishing an analytics sandbox to extract data, format it, analyze it, and finally create insight (see figure 1 below). Many of the advanced analytics vendors are focused on this workflow due to the historical context of how business intelligence has been conducted over the past 30 years. For example, a newcomer to the space, Trifacta, recently announced a $25 million venture round and is applying predictive analytics to the data to help improve the feature creation step. It’s a very good area to focus on, considering some 80% of the work is spent here working to un-bias data, find the variables that really matter (signal / noise ratio), and identify the best model (linear regression, decision trees, naive Bayes, etc.) to apply to the data. Unfortunately, most of the insights created never make it past what I am calling the “Data Science Last Mile.”

Figure 1. Simple data science workflow.

What is the Data Science Last Mile? It’s the final work that is done to take a found insight and deliver it in a highly usable format or integrate it into a specific application. There are many examples of this last mile, and here are what I consider to be the top ones.

Example 1. Reports, Dashboards, and Presentations

Thanks to the business intelligence community, we are now accustomed to expecting our insights in a dashboard format with charts and graphs piled on top of each other. Newer visual analytics tools like Tableau and Platfora add to the graphing melange by making it even easier to plot seemingly unrelated metrics against each other. Don’t get me wrong, there will always be a place for dashboards. As a rule of thumb, metrics should only be reported as frequently as you are able to take action on them. At Intel, we had daily standup meetings at 7am where we reviewed key metrics and used them to drive the team’s priorities each day. We had separate meetings scheduled on a project basis for analysis that was more complex, like bringing up a new process or production tool. Here the visualization format is very well defined, and there is even an industry standard called SPC, or Statistical Process Control. For every business, there are standard charts for reporting metrics, and outside of that there is a well-defined methodology for plotting data.
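
To make the SPC idea concrete, here is a rough sketch in R of the kind of check a control chart encodes. It is a simplified version under my own assumptions: limits are set at the baseline mean plus or minus three standard deviations, whereas production SPC charts often estimate sigma from moving ranges and add run rules. The function names and the readings are hypothetical.

```r
# Control limits from a stable baseline period: mean +/- 3 standard deviations.
# Simplified Shewhart-style check (assumption), not a full SPC implementation.
control_limits <- function(baseline) {
  center <- mean(baseline)
  sigma  <- sd(baseline)
  list(center = center,
       lower  = center - 3 * sigma,
       upper  = center + 3 * sigma)
}

# TRUE for every reading that falls outside the control limits
out_of_control <- function(x, limits) {
  x < limits$lower | x > limits$upper
}

# Hypothetical daily metric: a baseline period, then new readings to check
baseline     <- c(10, 11, 9, 10, 12, 8, 10, 11, 9, 10)
new_readings <- c(10, 11, 15, 9)

limits <- control_limits(baseline)
flags  <- out_of_control(new_readings, limits)

print(limits)
print(new_readings[flags])   # readings that would warrant action
```

The appeal for a daily standup is that only flagged readings need discussion; everything inside the limits is treated as normal variation.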

One of my favorite books of all time on the best design practice for displaying data is by Edward Tufte. My former boss and mentor recommended the book The Visual Display of Quantitative Information and it changed my life. One of my favorite visuals from this book is how the French visualized their train timetables.

Presentations, reports, and dashboards are where data goes to die. It was common practice to review these charts on a regular basis and apply the recommendations to the business, product, or operations on a quarterly or even annual basis.

Example 2. Models

Another way data science output is ingested by an organization is as inputs to a model. From my experience, this is predominantly done using Excel. It’s quite surprising to me that there aren’t many other applications that have been built to make this process easier. Perhaps it’s due to this knowledge being locked away in highly specialized analysts’ heads? Whatever the reason, this seems like a prime area for disruption: there is a significant need to standardize this process and de-silo this exercise. One of my favorite quotes is from a former colleague. We were working over a long weekend analyzing our new product strategy business models when he stated, “When they told me I’d be working on models, this is NOT what I had in mind.” Whether you’re in Operations, Finance, Sales, Marketing, Product, Customer Relations, or Human Resources, the ability to accurately model your business means you’re likely able to predict its success.

Example 3. Applications

Finally, and quite possibly my favorite, are the products built on data science that are all around us without us even knowing it. One of the oldest examples I can think of is the weather report. We are given a 7-day forecast in the form of sun, clouds, rain drops, and green screens managed by verbose interpretive dancers, as opposed to the raw probability of rain, barometric pressure, wind speeds, temperatures, and the many other factors that go into the prediction. Other examples are a derived index of creditworthiness (FICO), a stock market index, Google’s PageRank, a likelihood-to-buy value, or any number of other singular values that are used to great effect. These indices are not reported wholesale, although you can find them if you try (for example, go to http://pagerank.chromefans.org/ to see any website’s Google PageRank yourself). Instead, they are packaged into a usable format like Search, Product Recommendations, and many other productized formats that bridge the gap between habit and raw data. For me, this is the area I am most focused on: the last mile of work that needs to be done to push data into every aspect of our decision-making process.

How does Big Data fit into this scenario? Big data is about improving accuracy with more data. It is well known that the best algorithm loses out to having more data inputs. However, conducting sophisticated statistics and analysis on large datasets is not a trivial task. A number of startups have sprung up in the last couple of years to build frameworks around this, but they require significant coding skills. Only a few have provided a much more approachable way of applying data science to big data. One such company that has built a visual and highly robust way of conducting analytics at scale is my very own Alpine Data Labs. It has a bevy of native statistical models that you can mix and match to produce highly sophisticated algorithms that rival the best in class in a matter of minutes, not months. Pretty wild to think that only a few years ago we were still hand-tooling algorithms on a quarterly basis.

It is evident to me that the focus needs to shift back towards the application of data science before we find ourselves disillusioned. I, for one, am already thinking about how to build new products that start with data as their core value rather than as an add-on to be determined later. There is much more to write about on this subject, but now I hear Maxwell calling out for my attention, just as I would have predicted 🙂

I am pleased to report Big Data is here to stay, and we are now moving into the application age, with many moving beyond a descriptive (BI) focus to a prescriptive or machine learning focus. After attending STRATA NYC last month and Databeat this past week, I am seeing firsthand how this trend is rapidly evolving. First, let’s take an example of another major technological shift that happened a little over 10 years ago, when the internet and web applications came of age.

At first there were only a handful of ways to access the web, via “internet portals.” Not many people could access the open web to truly leverage its amazing potential to communicate, access information quickly, and create content. Next, we saw the dot-com boom create a huge demand for web developers, with little emphasis on design. I fondly remember many of my engineering colleagues jumping into the fray, learning PHP, HTML, TCP/IP, and other web technologies to take advantage of the demand. It wasn’t until the bubble burst and the next era of Web 2.0 arrived that frameworks became standard and the focus shifted to design. These days, do people call themselves Web Developers? Not really. I’d say you see more Web Designers, who can use established web frameworks to design the best customer experience, attracting the high salaries.

It often reminds me of the situation of the Data Scientist today, where many believe the best are great programmers who can leverage R, Python, and MapReduce to create one-off analyses. Scott Yara from Pivotal went so far as to say this last week: “It only takes minutes for a Programmer to become a Data Scientist.” Do we truly believe that? When we heard from Allen Day, Data Scientist at MapR, he did not talk in terms of data frames or Hadoop jobs. Instead, he focused on the design component of engineering a big data application. No question he has a strong ability to program and work with big data technology, but what truly sets him apart is his ability to design solutions. You can hear more snippets from his talk, “What Shape is Your Data,” by liking us on Facebook.

Today the majority of Data Science applications are code- and script-heavy frameworks (Python, R, Scala, Java, and MapReduce). However, at Alpine we are thinking differently about how to design and replicate analysis without having to start from scratch each time. We go further and abstract the code into representations of operations to make it less programming intensive. I agree with Trifacta’s CEO Joe Hellerstein when he states, “Let’s take the programming requirement out of Data Science.”

According to Wikipedia, a hackathon “…(also known as a hack day, hackfest or codefest) is an event in which computer programmers and others involved in software development, including graphic designers, interface designers and project managers, collaborate intensively on software projects.” For me, the first thing that pops into my mind is Facebook, but you may be surprised to learn the hackathon was originally devised by the smart marketing folks at Sun during the height of the internet boom!

Since then, hackathons have become a mainstay for socializing, innovating, and friendly competition between like-minded individuals with a common theme or goal in mind.

Which raises the question: why limit hackathons to programmers?

Beyond Programmers

Already we can see some areas where hackathons have been applied not only to programming but also to the life sciences (e.g. Open Bioinformatics Foundation). Personally, I would love to see more “hackathons” in other areas. Perhaps government (ahem, shutdown) or cooking, to think of a couple of random examples. For me, it feels like hackathons are simply a way to rapidly prototype ideas into reality.

If you want to start your own hackathon, what is the structure?

Place

Hackathons can take place practically anywhere: from Facebook HQ and its notoriously grueling hackathon to the “mile high club” British Airways hackathon, or at no place at all, purely online, as in the case of Kaggle, which I’d consider a form of hackathon.

Now that we have a place, what is the structure of a hackathon once you’re there?

Structure

Overall, I’d say there is usually a short presentation by the organizers about the goal and guidelines of the hackathon. Once these are announced, people generally break up into teams of 2-4 and go off to generate ideas, build mockups, and get to work. Hackathons can last a single evening, as was the case with Airbnb, or more than a week in some organizations.

Here are some good tips from our pals at Quora if you are so inclined to create your very own hackathon.

In the meantime, if you want to see what all the fuss is about, I invite you to join one of the hackathons we’re hosting next month for the National Association of Realtors.

Customer Lifetime Value (CLV) is an often overused and oversimplified term people use to describe the amount of value a customer generates for your business. Take this example from Kissmetrics, where they average together “expenditure” and “visits” into their variables and then take a random retention rate (r) to calculate the LTV. From this simple example, they come up with a range of $5k to $25k. That’s a 5X difference, and then they average it together?! Yikes!

When I first joined AVG Technologies back in 2010, a customer lifetime value estimate was generated in a very similar way, by simply dividing the total annual revenue by the monthly active users. Unfortunately, this was a bogus metric as it really oversimplified customer lifetime, monetization, and the true cost of acquisition, especially considering the fact that we had over 110 million users worldwide. At its best, the CLV answers in one simple number all of the most important questions about your customers:

Where do my customers come from? How many are making their way through the acquisition and on-boarding funnel to become active users? And at what cost?

What is the average lifetime of my customers? What impacts their churn behavior, and what dimensions are important to segment by?

How many transactions are conducted in the lifetime of my customer? What is the average order value?
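
One way to see how those three questions combine into a single number is a small sketch in R. This is not the methodology we used at AVG; it is a textbook-style simplification that assumes a constant monthly churn rate (so expected lifetime is 1 / churn) and no discounting. The function name, channels, and all figures are hypothetical.

```r
# Simple CLV: revenue over the expected lifetime minus acquisition cost.
# Assumes constant monthly churn and no discounting (both simplifications).
customer_lifetime_value <- function(avg_order_value, orders_per_month,
                                    monthly_churn, acquisition_cost) {
  stopifnot(all(monthly_churn > 0), all(monthly_churn <= 1))
  expected_lifetime_months <- 1 / monthly_churn
  revenue <- avg_order_value * orders_per_month * expected_lifetime_months
  revenue - acquisition_cost
}

# Hypothetical segments by acquisition channel
segments <- data.frame(
  channel          = c("organic", "paid_search"),
  avg_order_value  = c(40, 50),
  orders_per_month = c(0.5, 0.25),
  monthly_churn    = c(0.05, 0.10),
  acquisition_cost = c(0, 60)
)

# Compute CLV per segment rather than one blended average
segments$clv <- with(segments, customer_lifetime_value(
  avg_order_value, orders_per_month, monthly_churn, acquisition_cost))

print(segments[, c("channel", "clv")])
```

Keeping the calculation per segment is the point: a single blended average hides exactly the channel and churn differences the questions above are asking about.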

Over time, we developed a methodology to extract clickstream cohorts with the channel attribution information and join them to the customer record data. We then conducted linear regression over a multitude of dimensions to identify the key variables that impact the churn or monetization of a customer. For more information, please contact me directly at @whatisanalytics.
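
As an illustration of that regression step only, here is a minimal sketch in R on synthetic data. The column names, the data, and the choice to rank variables by absolute t-statistic are all my assumptions for the example, not the actual AVG analysis; it models monetization (revenue), and a churn model would more typically use logistic regression.

```r
set.seed(1)  # reproducible synthetic data
n <- 500

# Synthetic customers with hypothetical dimensions
customers <- data.frame(
  tenure_months = runif(n, 1, 36),
  sessions_week = runif(n, 0, 10),
  paid_channel  = rbinom(n, 1, 0.4)
)

# Revenue is constructed so tenure matters most and channel not at all
customers$revenue <- 5 + 2 * customers$tenure_months +
  0.5 * customers$sessions_week + rnorm(n, sd = 3)

# Ordinary linear regression over the candidate dimensions
fit <- lm(revenue ~ tenure_months + sessions_week + paid_channel,
          data = customers)

# Rank predictors by absolute t-statistic, dropping the intercept row
coefs  <- summary(fit)$coefficients
ranked <- sort(abs(coefs[-1, "t value"]), decreasing = TRUE)

print(round(ranked, 1))
```

On real clickstream data the dimensions would be correlated, so this ranking is a starting point for segmentation rather than a verdict.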

At its worst, a bad CLV can lead to overspending on acquiring users (a death wish in the startup space) or to underestimating the total value a marketing campaign or new product introduction could be generating.

With new technologies coming out so fast for analytics, it’s hard to keep up with the best tool for the job. Take Berkeley’s Data Analytics Stack (BDAS), featuring Spark, Shark, and Mesos for advanced analytics and mining. Should I use this or stick with Apache Hadoop, Hive, and Mahout? How do you decide? From my experience, I’ve found this to be the most common stack:

Configuration:

Hadoop: distributed file system for data collection

Database: HBase or Cassandra to enable random reads

Analysis: Hive, Pig, Impala for advanced analysis

Real-Time: Storm or Spark

Visualization: Tableau Software or, if you have programmers, D3.js

Applications: Datameer, Alpine Data Labs, WibiData, Wise.io, others?

Infrastructure: On-premise or Hosted?

Add-ons: Hue, Sqoop, and Flume.

Example of a possible configuration

Is this generally what you see? Are there additional configurations I am missing? Feel free to leave a comment or contact me directly.