Open Source, Data Science, Startups, and Life

After years of trying to avoid coding, I’ve recently picked the habit back up, and I’m actually enjoying it this time around.

My first formal programming experience was part of my core engineering curriculum at the University of Washington. I quickly realized how fun and addictive programming can be. We learned the basics by building a simple search algorithm, simulated environments, and the simple tip calculators that intro courses can never get enough of. After an all-night coding session on my final project, fueled by many cups of coffee, I won the freshman competition for best game of the quarter. However, I was so immersed in programming that the rest of my coursework suffered. Perhaps this is why so many computer programmers drop out. From that moment, I realized programming was not my full passion but more of a fun hobby, so instead I went into Materials Science and Nanotechnology (which was just emerging). For me, the real draw of engineering and technology has always been what you can do with it rather than making something pretty.

Fast forward to today, and I’ve found a new interest in coding as the languages have become far more streamlined. Python, for example, makes it much easier to code by removing explicit variable declarations, verbose syntax, and other formalities that used to consume my time debugging. Furthermore, Python (among other languages) simplifies more complex tasks such as connecting to APIs, writing MapReduce jobs, and creating simple user interfaces. In addition, there are so many ways to learn that are FREE! (thank you, Freemium Model). Coursera, Codecademy, and Code School, to name a few, are some of the more popular options. Coursera is a great option for getting a more high-level or theoretical understanding of the material. Paired with Codecademy or Code School, it really solidifies the material by doing. I’ve found the Python course on Codecademy to be not only fun (*cough* gamified), but really great at building on each lesson. I could write a whole review of these schools, but I’ll save that for another post.

You may ask, “Why refresh or learn to code in the age of drag-and-drop analysis tools (MS Excel, Tableau, SAS, Datameer, and others)?” It’s a simple answer: while these tools are great for fast ad hoc analysis, you sometimes have to roll up your sleeves, really feel the data, and go deep. Furthermore, as the big data space continues to heat up, I’ve already found many vendors try to trap you in a box that makes it difficult to integrate or build on unless you have their blessing. In addition, shuttling data around to a sandbox creates more costs, privacy concerns, and other issues that defeat the purpose of using a giant database in the first place. This, I believe, is what is driving the growth of open source projects in the space.

So far I have finished most of the Intro to Data Science course on Coursera, I’m midway through Codecademy’s Python course, and I’m just about to start digging into SQL. In fact, we are currently planning a meetup for Bay Area Analytics to focus specifically on coding and querying data.

That’s all for now. I’ll update this as I get further along and even share some of my code on GitHub as I progress.


Over the past few months, I’ve attended many big data, analytics, data science, social media, Hadoop, Vertica, d3.js, you-name-it meetups in SF and the Bay Area. I have yet to find one that is truly focused on leveraging data analytics to power your business. For this reason, I am starting a completely new and open group to discuss how we can solve real-world business problems with data by sharing experience and transferring knowledge that can make us all better at data decision making.

We’ll meet once a month at a location on the peninsula and have an open forum with breakout sessions tailored to your interests. We already have over 120 members. I’ve analyzed your feedback, and you can see what our members are interested in in the infographic below.

Feel free to check out the group here: http://www.meetup.com/BayAreaAnalytics and of course let me know if you have other ideas or topics you’d like to discuss at future meetups.

There are many analytics models to choose from, developed over time by financial analysts, marketers, and product managers. Here is the first of five core analytic models that are essential to making data-informed decisions.

At the top of the list is Cohort Analysis. It has been around a long time and is prevalent in the medical, political, social, and other sciences. Lately, there has been a resurgence of this form of analysis as it relates to web and product analytics (Jonathan Balogh, Jake Stein, and others). There are many excellent explanations of cohort analysis, so I won’t spend too much time explaining the concept. In short, Cohort Analysis is the practice of segmenting a group of people by a dimension. Whether time, geography, demographics, product, or otherwise, the ultimate goal is to see how one group compares to another.

With this simple model, we can measure how a marketing campaign, a new feature introduction, or an unknown variable changes customer behaviors such as churn, retention, conversion, or customer referrals, all of which are critical for driving growth.

You can focus all of your attention on driving downloads through Search Engine Optimization (SEO), Search Engine Marketing (SEM), and Call to Action (CTA) improvements, but if you cannot retain these customers, you’re pouring money down the drain.

Let’s take a typical online acquisition funnel as an example:

Typical Online Acquisition Trend

In this example, we can see our website traffic and downloads are steadily increasing, but our active users are staying flat. Is there an issue with our download-to-activation process? Do new customers try our product and then leave? Or are older customers now churning away? There is no way to tell without doing a cohort analysis.
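To see why the flat trend alone is ambiguous, here is a toy flow model (a sketch of my own, with made-up numbers, not data from the chart): each period’s active users are last period’s survivors plus new activations. If actives hold flat while activations grow, the implied churn rate must be rising to offset them.

```python
def next_active(active, new_activations, churn_rate):
    """One period of a simple user-flow model:
    survivors from last period plus newly activated users."""
    return active * (1 - churn_rate) + new_activations

# Suppose actives hold flat at 10,000 while activations grow.
# The churn rate needed to keep actives exactly flat keeps rising:
active = 10_000
for new in (1_000, 1_500, 2_000):
    implied_churn = new / active  # rate that exactly offsets the new users
    print(f"{new} activations -> implied churn {implied_churn:.0%}")
```

The same flat line is consistent with healthy retention and slow growth, or with fast growth masking heavy churn; only segmenting users into cohorts tells them apart.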

How to set up a cohort, in simple-to-understand terms:

Filter by a date range (depending on your volume, you can choose anywhere from one day to one week, or the first 100k activations).

Collect customer behavior, demographics, or any other important dimension over the next 30 to 60 days or until you reach a steady state (churn rate stabilizes).

In this example, we can clearly see India has a steep initial fall-off but then levels out, Germany has a more steady decay, and the US has the least churn of the three. From here, you can see the issue is early in the customer on-boarding, as well as a significant behavioral difference between countries. We can focus on better product and marketing design for those first few days and even narrow our attention to the first few hours. Additionally, we may want to limit our marketing spend in India until we resolve the high churn rate.

Cohort Analysis is a simple and powerful model for digging deep into your data to find the root cause of an issue and make data-informed recommendations. Next week, we’ll take this concept further and see how we can find indicators of churn with linear regression and correlation.


For my first post, I find the best place to start is by defining the subject matter we wish to talk about. So let’s get started: what is analytics anyway? How is it different from traditional Business Intelligence? And why has it come back into focus after being dormant for so many years?

A definition: Analytics is the application of computer technology, operational research, and statistics to solve problems in business and industry.

Historically, Analytics was heavily used in banking for portfolio assessment using social status, geographical location, net value, and many other factors. Today, Analytics is applied to a vast number of industries and is re-emerging due to the phenomenal explosion of data from our connected world.

With this explosion of data, we now see analytics re-emerging as a topic, this time called “Data Science.” In reality, analytics has been around for a long while, and this new breed of analysts is re-branding to garner higher salaries. With the advent of low-cost and open source databases, we’ll see analytics penetrate deeper into traditionally less analysis-focused industries. Leading the pack is Apache Hadoop, primarily due to its low cost and ease of scalability. Big Data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools. The McKinsey Global Institute estimates that big data analysis could save the American health care system $300 billion per year and the European public sector €250 billion.

Datameer circa 2012

Many industries have already adopted or are in the process of adopting a Big Data platform in their organization and the time has come to start discussing some simple analysis to leverage this vast amount of information.