Category Archives: Systems Engineering

I guess I should probably know Nicholas Felton, from his high-profile “about me”:

He is the co-founder of Daytum.com, and currently a member of the product design team at Facebook. His work has been profiled in publications including the Wall Street Journal, Wired and Good Magazine and has been recognized as one of the 50 most influential designers in America by Fast Company.

Anyway, he has been producing ‘Personal Annual Reports’ which reflect each year’s activities. I haven’t dug too deeply, but they struck me as quite interesting and worth a deeper dive.

Systems thinking is difficult for those that have been educated to always apply reductionist thinking to problem solving. The idea in systems thinking is not to drill down to a root cause or a fundamental principle, but instead to continuously expand your knowledge about the system as a whole.

I would disagree that systems thinking doesn’t look for root causes. The point is that by expanding the problem, you have a higher chance of identifying the root causes. If you narrowly focus on the problem as defined in front of you, you will rarely reach the true root causes – thus, true systems thinkers ask ‘what really is the system?’ The author does wind around and hit this point, but the original statement is a bit askew.

In John D. Cook’s blog post (http://www.johndcook.com/blog/2010/12/15/big-data-is-not-enough/) he quotes Bradley Efron from an article in Significance. It runs somewhat counter to (or is at least thought-provoking against) the mainstream ‘Big Data’ mantra: given enough data, you can figure it out. Here is the quote, with John D. Cook’s emphasis added:

“In some ways I think that scientists have misled themselves into thinking that if you collect enormous amounts of data you are bound to get the right answer. You are not bound to get the right answer unless you are enormously smart. You can narrow down your questions; but enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.”

What struck a chord with me (a data guy) was the statement ‘and they fit together in some complicated way’. Every time we examine a data set, there are all kinds of hidden nuances that are embedded in the content, or (more often) in the metadata. Things like:

‘Is this everything, or just a sample?’ – If it is a sample, then how was the sample created? Does it represent a random sample, or a time-series sample?

‘Are there cases missing from this data set?’ – Oh, the website only logs successful transactions; if a transaction wasn’t successful, it was discarded.

‘Are there any procedural biases?’ – When the customer didn’t give us their loyalty card, all of the clerks just swiped their own to give them the discount.

‘Is there some data that was not provided due to privacy issues?’ – Oh, that extract has their birthday blanked out.

‘How do you know that the data you received is what was sent to you?’ – We figured out the issue – when Jimmy saved the file, he opened it up and browsed through the data before loading. It turns out his cat walked on the keyboard and changed some of the data.

‘How do you know that you are interpreting the content properly?’ – Hmm.. this column has a bunch of ‘M’s and ‘F’s.. That must mean Male and Female. (Or have you just changed the gender of all the data because you mistakenly translated ‘M-Mother and F-Father’?)
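Questions like these can be partially automated with quick profiling checks before any real analysis begins. Here is a minimal sketch in pure Python; the rows, field names, and expected codes are invented for illustration:

```python
from collections import Counter

# Hypothetical extract: each row is a dict parsed from a CSV or log file.
rows = [
    {"txn_id": "1001", "status": "success", "gender": "M", "birthday": "1980-04-02"},
    {"txn_id": "1002", "status": "success", "gender": "F", "birthday": ""},
    {"txn_id": "1003", "status": "success", "gender": "X", "birthday": "1975-11-30"},
]

# 'Is there some data that was not provided?' -- count blank fields per column.
blanks = Counter(k for row in rows for k, v in row.items() if not v)

# 'How do you know you are interpreting the content properly?' -- surface the
# actual value distribution instead of assuming what the codes mean.
gender_codes = Counter(row["gender"] for row in rows)

# 'Are there cases missing?' -- if every row is a success, the failures were
# probably discarded upstream, not absent in reality.
statuses = Counter(row["status"] for row in rows)

print(blanks)        # which fields are blanked out, and how often
print(gender_codes)  # an unexpected code ('X') suggests a misread code set
print(statuses)      # a single status value is a red flag, not good news
```

None of this answers the questions by itself, but it forces the nuances to the surface before they become silent assumptions.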

All of this becomes even more complicated once you start integrating data sets, and this is what Bradley Efron was getting at. All of these nuances are exacerbated when you start trying to marry data sets from different places. How do you reconcile two different sets of product codes which have their own procedural biases, but essentially report on the same things?

However, reading through the list, I started to see that this list was a bit incomplete or incorrect (at least in the explanations of the entries). I could even use entries in the list to refute other entries. For example:

“Gambler’s Fallacy: Assuming the history of outcomes will affect future outcomes”

Now, I know what the author was getting at, but the way this is stated is incorrect. The example given was:

“I’ve flipped this coin 10 times in a row and it’s been heads therefore the next coin flip is more likely to come up tails”

So in the example, the author is correct — those events are independent (the probability of a subsequent flip does not depend on the outcome of previous flips, i.e., a Bernoulli trial). However, in the ‘definition’ of the Gambler’s Fallacy, the author left out the critical word ‘independent’. If the events are not independent (e.g. the weather conditions observed at the start of each hour), then future outcomes do differ depending on the outcomes observed in the past. For example, we are more likely to observe rain at 2pm if we have observed rain at 1pm, with some measurable increase in probability.
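The distinction is easy to see in a quick simulation. Fair coin flips are unaffected by history, while a two-state Markov ‘weather’ chain (with transition probabilities I made up for illustration) makes rain at 2pm measurably more likely given rain at 1pm:

```python
import random

random.seed(42)
N = 100_000

# Independent Bernoulli trials: P(heads | previous heads) is still 0.5.
flips = [random.random() < 0.5 for _ in range(N)]
after_heads = [flips[i + 1] for i in range(N - 1) if flips[i]]
p_heads_after_heads = sum(after_heads) / len(after_heads)

# Dependent process: hourly weather as a Markov chain (invented probabilities).
# P(rain next hour | rain now) = 0.7, P(rain next hour | dry now) = 0.2
weather = [False]
for _ in range(N - 1):
    p_rain = 0.7 if weather[-1] else 0.2
    weather.append(random.random() < p_rain)

after_rain = [weather[i + 1] for i in range(N - 1) if weather[i]]
after_dry = [weather[i + 1] for i in range(N - 1) if not weather[i]]

print(round(p_heads_after_heads, 3))                # ~0.5: history does not matter
print(round(sum(after_rain) / len(after_rain), 3))  # ~0.7: history matters
print(round(sum(after_dry) / len(after_dry), 3))    # ~0.2
```

Reasoning from history is a fallacy only in the first case; in the second, ignoring history is the mistake.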

Using the author’s own list, they fell prey to ‘Composition Fallacy: Assuming that a characteristic or beliefs of some or all of a group applies to the entire group’. (That sentence needs some work.) Because some events are independent, the author wrote as if all events are; stating the Gambler’s Fallacy without the word ‘independent’ would lead someone astray for events that do depend on their history.

There are a few others in this list which annoy me (such as ‘Appeal to Probability’), not because of the ‘idea’ behind it, but because of how it is vaguely expressed.

A pretty good summary of use cases for ‘big data’. This always ends up being the first set of questions when people are exposed to the idea of ‘big data’: “What the heck do _we_ do which is considered Big Data?” A lot of times this is because organizations don’t currently deal with these use cases BUT SHOULD in order to remain competitive. Things are a-changing.

The entry talks about some automatic data analysis done by zip code, projecting the deviation in average lifespan for individuals in a zip code, broken out by first name. He goes on to show how and why this type of analysis is incomplete. Without a complete view of the data (i.e. what the population’s lifespan variability is overall), it is easy to find patterns in the noise of the data. He theorizes that this type of incomplete analysis might yield headlines such as:

“Your first name reduces your life expectancy!!”, or “Margaret, it’s time to become Elizabeth!”. And why not “James, if you want to live longer, become Elizabeth now!”

The analyst needs to ensure that they are not identifying patterns in noise, due to an artifact of their methodology or incomplete analysis.
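This effect is easy to reproduce. In the sketch below (names and group sizes invented), every group is drawn from the exact same lifespan distribution, yet the smallest group will typically show the most extreme ‘deviation’ purely from sampling noise:

```python
import random
import statistics

random.seed(1)

# Every group is sampled from the SAME distribution: mean 78, sd 12 years.
group_sizes = {"Margaret": 5000, "Elizabeth": 5000, "James": 5000, "Zelda": 12}

deviations = {}
for name, n in group_sizes.items():
    lifespans = [random.gauss(78, 12) for _ in range(n)]
    deviations[name] = statistics.mean(lifespans) - 78

# The rare name will typically show the biggest 'effect', even though no real
# effect exists: its standard error (12/sqrt(12) ~ 3.5 years) dwarfs that of
# the large groups (12/sqrt(5000) ~ 0.17 years).
for name, dev in sorted(deviations.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:10s} {dev:+.2f} years (n={group_sizes[name]})")
```

Without reporting the group sizes (or the overall variability), a headline writer could honestly claim that one name ‘reduces life expectancy’ when nothing is going on at all.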

The author describes “technical debt” as the short-sighted decisions made in the past on a project which ultimately “come due” at some point in the future. The post examines the paradigms of ‘whiz kids’ and ‘greybeards’ and the continual tug-of-war between the two camps. While I am not sure I agree with such broad categorization, it is true that there are people who care about and plan for the future, and there are people who are focused only on the tactical problem at hand.

Essentially, the author is saying “The ghosts of past decisions will bring ruin on the project’s future”. That has to be inspired by some Shakespearean quote..

A colleague pointed me to this Huffington Post article (which has embedded in it the TED talk “Deb Roy: The Birth Of A Word“) in a discussion about Sentiment Analysis. I guess I have been dismissing all of the sentiment analysis discussions in the past (without really examining the ideas behind them). I just couldn’t fathom how effective it could be — is it really better than just what a few people with some time on their hands could generate?

The TED talk started with a strange ‘experiment’: Deb Roy wired his house with overhead video and audio in each room, capturing some 200 terabytes of recordings (he guesses it is probably the largest catalog of home movies). After three years of recording, starting at the birth of his son, he brought the video to MIT and they started the analysis. The types of analysis they are performing are extremely interesting — using multi-modal analysis to show correlation. Proximity (spatial), social interaction, audio, and video all play into the analysis they performed. Watch the video and I think you will be interested in what they are producing.

At ~12:30 min in the video, he describes how one of the MIT researchers on his team made the leap from a closed-space, controlled environment to the public space. Using public media (e.g. TV) as the video, and social media (e.g. Twitter) as the ‘audio’, they can start showing how the two relate, much like they did with the home movies. Social interconnectedness was also factored into the analysis.

Maybe I need to think a bit more about whether I should be dismissing these analysis ideas..

The post gives a summary of a discussion by JD Long about ‘Advice I wish someone gave me early in my career’, with the 20/20 hindsight that one might expect. The points made by Long, and the summary viewpoints by the poster, are pretty well written. Here are a few that struck me as relevant to recent discussions we have been having in my department. I am rephrasing them based on my opinion and in light of our discussions.

Don’t discount Open Source – it is often the toolset which is ultimately the most transportable job to job (and project to project).

“Dependence on tools that are closed license and un-scriptable will limit the scope of problems you can solve. (i.e. Excel) Use them, but build your core skills on more portable & scalable technologies.”

The follow up point made about closed tools, I had to ponder a little bit to see if I agreed:

“Closed source software is often not scriptable, not because it’s closed source, but because it is often written for consumers who value usability over composability.”

The author makes a point about portability, not just OS to OS, but from scale to scale (scaling up to clusters and down to mobile). This ties into career portability (the longevity and demand of a given toolset), which is always an important consideration for a career. I think it often comes down to whether you consider yourself:

A “tool jockey” – intense and deep understanding in a given toolset or technology

A problem solver with (less-intense) understanding of a variety of tools

I think ultimately, irrespective of which camp you might self-identify with, effectiveness really boils down to good Systems Engineering process:

“Get really good at asking questions so you understand problems before you start solving them.”

Someone recently threw out a challenge to identify the artist of a particular painting /photograph which was hanging on the wall in the movie Iron Man 2. What she provided was this image:

‘unknown’ art from Iron Man 2

This appears to be two frames from the movie. I can’t recall seeing it in the movie, but ultimately that didn’t matter, since I wasn’t going to go down that path. (I assume she had already tried that avenue, googling ‘iron man 2 office art’ or something similar.)

With machine learning on the brain, I realized this was clearly a machine learning problem. Thankfully, I remembered I had already found an existing tool to do just this — take an input image and find other images which are the same. Here is where I first talked about TinEye.

All I did was pop the image into Photoshop and chop off the right side, since I expected to match a head-on image of the art rather than a combined graphic. That done, I uploaded the graphic to TinEye and voilà, it returned: