Decisions & Discovery

Ever wonder what your own personal network looks like? You are likely connected to many different groups (family, friends, community, work), but do you know how they are connected? Or are they connected at all? Are you the glue that connects these various groups?

This is a great age we’re living in, and I’m glad to be involved with developing lots of really advanced technologies. One of the technology areas that I’m really fascinated with has been pushed forward by Stephen Wolfram. He created the industry standard computing environment Mathematica, which now serves as the engine behind his company’s newest creation, Wolfram|Alpha. (I’ve written a few posts on Wolfram|Alpha in the past, and you can read them here and here).

This is a technical post about what I’ve discovered in creating my own custom URL shortener. Hopefully, you can learn to do the same things I did, and my experience will save you some headaches if it’s something you’re interested in trying.

On my website, I focus a lot on decisions and discovery. I love finding out how the world works and then applying what I’ve learned to make better decisions, and I also try to share what I can along the way. I hope that it helps others.

It’s a complex world, and we are constantly making decisions. Just imagine the number of decisions we make about breakfast: How big a breakfast should I have? Should I have coffee? If so, how much? Should I have toast? Should I use butter? Should I have one piece or two? Should I cut the toast? If so, should they be cut into rectangles or triangles? Should I keep the crust? Should I have juice? Should it be apple juice or orange juice? How about milk? I haven’t even gotten to the pancakes, waffles, syrup, sausage, cereal, bacon… (mmm, bacon…)

And these aren’t the really important ones! How do we know we’re making good decisions, and can we make better ones?

You might think that it’s a bit odd, treating yourself like a science experiment. However, the best way to achieve your goals may be to do just that – be committed to collecting data on yourself.

In science, we’re always collecting data and analyzing it to find out more about the world. However, collecting data isn’t only for people with pocket protectors (although we don’t all wear those!). It is something that any of us can use to help us achieve any goal we set for ourselves.

Imagine a guy with glasses who used to model baseball stats and play online poker nailing the outcome of the 2012 elections. And when I say “nailing”, I mean that he correctly predicted the U.S. Presidential contest in every one of the 50 states (and nearly every U.S. Senate race, too). He even performed better than some of the most widely-used polling firms. Now imagine that he gives his thoughts on making these types of predictions. That’s exactly what Nate Silver does in his new book The Signal and the Noise.

I’ve worked in what’s now being called “data science” for nearly twenty years. The title of Silver’s book – The Signal and the Noise – presents an important and sometimes overlooked part of this science. The “signal” is what we’re looking for in the data, and the “noise” is all the stuff in the data that gets in the way of what we’re looking for.

Stephen Wolfram is doing it again. I’m a big fan of Wolfram (you can read some of my other posts here, here, and here…), and am always intrigued by what he comes up with. A couple of days ago, Wolfram launched his latest contribution to data science and computational understanding – Wolfram|Alpha Pro.

Here’s an overview of what the new Pro version of Wolfram|Alpha can provide:

With Wolfram|Alpha Pro, you can compute with your own data. Just input numeric or tabular data right in your browser, and Pro will automatically analyze it—effortlessly handling not just pure numbers, but also dates, places, strings, and more.

Zoom in to see the details of any output—rendering it at a larger size and higher resolution.

Perform longer computations as a Wolfram|Alpha Pro subscriber by requesting extra time on the Wolfram|Alpha compute servers when you need it.

Licenses for prototyping and analysis software go for several thousand dollars (Matlab, IDL, even Mathematica) – student versions can be had for a few hundred dollars, but you can’t leverage data science for business purposes on student licenses.

Wolfram|Alpha Pro lets anyone with a computer, an internet connection, and a small budget leverage the power of data science. Right now, you can get a free trial subscription, and from there, the cost is $4.99/month. This price is introductory, but it could be seductive enough to attract a lot of users (I’ve already signed up – all you need for the free trial is an e-mail address…)

One option that I find really interesting is Wolfram’s creation of the Computable Document Format (CDF), whose interactivity lets you get dynamic versions of existing Wolfram|Alpha output as well as access to new content using interactive controls, 3D rotation, and animation. It’s like having Wolfram|Alpha embedded in the document.

I had attended a Wolfram Science Conference back in 2006 and saw the potential for such a document format back then. There were a number of presenters who later wrote up their work into a paper, published by the journal Complex Systems. Since many of the presentations utilized a real interactivity with the data, I could see where much of the insight would be lost when people tried to write things down and limit their visualizations to simple, static graphs and figures.

I remember contacting Jean Buck at Wolfram Research, and recommending such a format. Who knows whether that had any impact, but I’m certainly glad to see that this is finally becoming a reality. I actually got the opportunity to meet Wolfram at the conference (he even signed a copy of his Cellular Automata and Complexity for me… – Jean was kind enough to arrange that for me – thanks, Jean!)

Among biomedical researcher trainees at UC-San Diego, 81% said they would modify or fabricate results to win a grant or publish a paper

This is obviously disturbing, and worth highlighting to try to root these things out. Science is about finding the truth – no matter what it is – and as more businesses start using data science to drive business outcomes, we need to make sure that science is about being honest – with the truth and with ourselves.

The scientific method was developed to provide the best way to figure out what the truth is, given the data we’ve got. It doesn’t make perfect decisions (no method can), but it’s the best method available.

Real scientists (the ones not highlighted in Jen’s research) care about what the data is actually saying and about discovering the truth. When someone cares about something other than the truth (money, celebrity, fame, etc.), then bad science is what you get. Of course, when there are people involved, sometimes the truth isn’t the top priority.

Here are some data science nuggets that I thought were interesting for a mid-January day…

The first comes from TechMASH about data science being the next big thing. The primary nugget of note is that the supply of employees with the needed skills as data scientists – those people who really understand how to pull relevant information out of data reliably – is going to have a tough time meeting demand. Here’s an interesting infographic on the current disconnects – for example, while 37% of “business intelligence” professionals studied business in school, 42% of today’s “data scientists” studied computer science, engineering, and natural sciences. This highlights the increasing demand for students who have solid mathematics backgrounds – it’s becoming more about knowing how to pull information from data, regardless of application.

Don’t get me wrong – to be effective applying data science, you need two things: a subject matter expert that understands what makes sense and what doesn’t, and someone who really understands data to pull out the information. Sometimes that can reside within one person, but it’s rare and takes many years of training to acquire the necessary excellence in both fields. And as the demands for data analysis grow, these two areas will likely form into distinct disciplines with interesting partnership opportunities being created.

Data science is still being defined, but I’m convinced it will have a huge impact in the next five years. And while the science aspects of data are starting to take shape, the engineering aspects of data and analytics are truly in their infancy…

On the same thread, here’s a Forbes article by Tom Groenfeldt on the need for data scientists, or Excel jockeys, or whatever they will be called in the future. For some companies, the move to “data science” is quite apparent, but for others, the current assemblage of business professionals who have figured out the ins-and-outs of Excel spreadsheets works quite well. This is likely a snapshot of where things are today, but I do believe that as the questions we ask of the data get more complicated, we will clearly see the need for a more rigorous, science-based approach to data wrangling…

The last tidbit is from the Wall Street Journal about the healthcare field being the next big area for Big Data. I do think that healthcare is ripe for leveraging data, and I’ve written other posts on the subject. One former Chief Medical Officer that I spoke with mentioned that one of the big problems is just getting the data usable in the first place. He said that, as of today, 85% of all medical records are still in paper form. The figure seems a bit high to me, but I don’t really know how many patient records in various individual doctors’ offices are still sitting in folders on shelves.

There has been a big push lately, spurred by financial support from the U.S. government, for upgrading to electronic health records (EHR). This will help to solve the data collection problem – if you can’t get data into an electronic format, you can’t utilize information technologies to pull information out of the data.

I ran across this article from the Independent today about the impacts of data algorithms, the ethics of data mining, and the future of our lives in an automated, data-crunching world. Below is a quote from the article by Jaron Lanier, musician, computer scientist and author of the bestseller You Are Not a Gadget.

Algorithms themselves are a form of creativity. The problem is the illusion that they’re free-standing. If you start to think that information isn’t just a mask behind which people are hiding, if you forget that, you’ll pay a price for that way of thinking. It will cause you to be less creative.

“If you show me an algorithm that dehumanises, impoverishes, manipulates or spies upon people,” he continues, “that same core maths can be applied differently. In every case. Take Facebook’s new Timeline feature [a diary-style way of displaying personal information]. It’s an idea that has been proposed since the 1980s [by Lanier himself]. But there are two problems with it. One, it’s owned by Facebook; what happens if Facebook goes bankrupt? Your life disappears – that’s weird. And two, it becomes fodder for advertisers to manipulate you. That’s creepy. But its underlying algorithms, if packaged in a different way, could be wonderful because they address a human cognitive need.”

I think this is a really great read for anyone who’s interested in data, algorithms, and their impact on society – there’s a lot of really good stuff to take in. You can read the entire article here…