For Conway, the center of the diagram is Data Science. There's some controversy over what the bottom circle means (I'll address it further down); all I can say is, if Conway meant something other than what I would call domain knowledge (e.g. physics), he chose the name Substantive Expertise very poorly. So assuming domain knowledge is at least part of what he meant, the idea is that a physicist, say, would have expertise in physics and math/stats knowledge, but lack hacking knowledge (I've met many physicists, and I think that's less true than it used to be). Machine Learning experts tend to apply algorithms without an understanding of the domain they're analyzing (that sure as heck was my case when I first started building models in an industry that was totally new to me; I had to play a lot of catchup). And then people who can program and know their field, but have no way to tell a statistically significant result from one arising from sheer coincidence, are dangerous; they can arrive at some drastically wrong conclusions and, for example, lose their companies lots of money.

Note that this isn't how a Venn diagram works. Hacking Skills, for example, should apply to that entire circle, and the part that doesn't intersect with anything should be labeled, e.g., "hackers". But that's a fairly minor point; it's obvious what he's getting across.

It... sure is busy. KDD stands for Knowledge Discovery and Data Mining, by the way. Despite that, Data Mining also has its own circle. I do appreciate what he did here, though, implying what makes data science worthy of its own field is the breadth of its required skills. Apparently one of those skills is Neurocomputing, which seems a little... specific.

He's flipped it on the diagonal, specified the substantive expertise as Social Sciences (his field), changed hacking to computer science (you can see why someone would object to being characterized as a hacker, although I for one embrace it), and for some reason changed Math & Stats to Quantitative Methods. More importantly, he's moved Data Science where Machine Learning was in Conway's -- that's an interesting distinction, and one I've seen in the field. There are data scientists who specialize in one domain, and then there are generalists (who usually started out in one field but branched out, like me: I started in chemistry and now I'm in insurance). Also, he's apparently not comfortable with Danger Zone, changing it to... a question mark. But apparently what matters to Matter (so to speak) is in the center of the diagram: Data-driven Computational [Social] Science.

A... bit wordy, shall we say? He also made sure to insert Empirical into Traditional Research.

4. After the Edward Snowden news broke, Joel Grus supplied this tongue-in-cheek (or is it?) version. Now we're getting into more rarefied Venn territory, with four circles, the fourth being "evil".

The slices are no longer comparable to Conway's because we've changed from science to products, but the categorizations are noteworthy (and they follow true Venn methodology, not being slices in themselves). Domain Knowledge remains, Computer Science/Hacking remains as Software Engineering, and crucially, Harris has added Predictive Analytics and Visualization to the Statistics circle. But not the actual tools they use; those are in the intersection with Software Engineering. Okay.

6. In January 2014, Steven Geringer provided a tweak that, instead of putting Data Science in the middle three-way intersection like Conway, calls all of it data science and calls the intersection Unicorn (i.e. a mythical beast with magical powers who's rumored to exist but is never actually seen in the wild.)

This is... a little weird, Venn-diagrammatically speaking. I think I know what he's getting at. When I first heard people referred to as data scientists, I often heard the riposte, "Aren't all scientists, by definition, data scientists?" True, there are no sciences that do not deal in data (insert psychiatry joke here), but still, data science, while quite nebulous, isn't just an umbrella term.

Plus, I'm sorry, but you can see the screengrab of his mouse arrow in his diagram.

Edit: An earlier version of this post omitted to give Geringer credit where credit is definitely due: he was the first to remove the Danger Zone! (Great, now that song is going to be in my head all day.) Now people with subject matter expertise and computer skills can make Traditional Software without blowing the world up, or whatever. (My apologies to Mr. Geringer, and my thanks for his correction.)

According to Malak, he's Inigo Montoya and we're all Vizzini when it comes to Substantive Expertise: "You keep using that word. I do not think it means what you think it means." Malak split it into Domain Expertise, and... er, knowledge of a domain, like Social Sciences. Maybe I'm dense, but I don't get the distinction. I'm also not sure what he's getting at with Holistic Traditional Research, which, according to its placement, unlike Traditional Research doesn't include knowledge of the science you're researching. Am I reading that wrong? Holistic science is a thing, but it's not that thing. Anyways, Data Science is once again back in the unicorn position, and there are three danger zones (one of them double danger). Everyone be hatin' on the hackers.

8. My next example comes via Vincent Granville in April 2014, but he's reposting something by Gartner; I don't know the date of the original.

This is a Venn Diagram of Data Science Solutions, not data science itself; as such, Data Science is one of the circles, with the other areas of expertise (often not residing in the same person, but hopefully on the same team) being IT Skills and Business Skills. It kinda bothers me that the text labels are pointing to very specific positions in each slice, but the actual positions are arbitrary. That's business infographics for you.

Pretty standard computer-math-domain triad straight from Conway, but there's one revolutionary element: no danger zone. Now computer-and-domain geeks without stats can do Data Processing without everything going all to hell. Seems reasonable. EDIT: Sorry Shelly, Geringer beat you to it, you're just not very noteworthy anymore.

10. In November 2015, StackExchange Data Science user Stephan Kolassa came up with my personal favorite, adding Communications to Conway and changing his Substantive Expertise to Business:

For all his effort, he was rewarded with only 21 upvotes (I'm one of them) in this beta-release forum. His categories are pretty good, too. I think I fall under The Good Consultant. Or possibly The Mediocre Consultant. The Consultant Who Tries Really Hard? And yes, that's what a four-set Venn diagram looks like, not four circles like Malak's above, which does not contain all the combinations of intersections.

Okay, this owes a debt to Tierney from four years prior, and although it purports to be a Venn diagram of data science, (a) it's not a Venn diagram, and (b) Data Science is inside one of the circles. It's good to see Big Data acknowledged, though. But... Calibri? Really? You went with the default font?

12. Finally (and I'm sure I don't have them all; if you know of any Venn diagrams I missed, please let me know!), later in 2016 Gartner redid their busy Data Solutions diagram, and made it prettier and confined to data science, as blogged by Christi Eubanks:

We've come full circle, back to Conway, except again Danger Zone is replaced, this time by Data Engineer. I like the callouts pointing to the edges better than their previous mess, as well.

13. Data Science Venn diagrams of the future:

Wikipedia's page on data science has the following totally-not-a-Venn-diagram:

Really, in my opinion, this is the way to look at data science. Maybe not these exact skills, but it really is a synergy of different disciplines. Unfortunately, skill in one discipline can sometimes mask serious deficiencies in another and give data science a bad name. (I may or may not have contributed somewhat to this phenomenon in my misspent youth, like, last year.)

Of course, then you'd need a really complicated Venn diagram. They do exist: here's one for seven sets:

Tuesday, April 19, 2016

l33t (leet, 1337) is a simple substitution cipher that started on BBSes in the early 1980s (ah, how I remember my 600 baud modem); it substitutes numbers for a few letters, e.g. '3' for 'E'. My name in l33t (and there were many versions of l33t) might be D4v1d T4yl0r or D4vid T4y10r (depending on whether the 1 substituted for 'i' or 'l'). The code and the various simple dialects of l33t I used can be found in this Github gist.

I used a rather short crossword puzzle word list, and even so got many words I've never heard of ("genips"?). I actually had one of them ("kirtle") in a spelling bee when I was 10 years old (I got it wrong). Still, most of them are somewhat familiar. And yes, there's a postal code beginning with "V" in British Columbia that refers to lady parts.

Edit: a couple of observations from redditors:

It doesn't appear on the list, but the postal code for the small town of Rosslyn Village, Ontario is P0T 2G0.

T4B 0R5 was assigned to Airdrie, Alberta; it would have been better if it had been assigned to Tabor! (Mmm, Tabor corn!)
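The substitution itself is trivial to sketch in code. Here's a minimal illustration using one hypothetical dialect (3/4/1/0 for E/A/I/O), not the full set of dialects from the gist:

```python
# One hypothetical l33t dialect: 3/4/1/0 for E/A/I/O (both cases).
LEET = str.maketrans({"e": "3", "a": "4", "i": "1", "o": "0",
                      "E": "3", "A": "4", "I": "1", "O": "0"})

def to_leet(text):
    """Apply the substitution table to a string."""
    return text.translate(LEET)

print(to_leet("David Taylor"))  # D4v1d T4yl0r
```

Swapping in a dialect that maps 'l' to '1' instead of 'i' gives the other variant from the post.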

In 2011, the White House launched We The People, a platform for citizens to submit and sign petitions. Once petitions reached a certain threshold (25,000 signatures at first, then raised to 100,000 in 2013), the Administration composed an official response. Famously, the White House even responded after a petition from 2012 for the U.S. government to build a Death Star received enough signatures.

I was curious to explore the differences between the successful petitions (i.e. the ones that garnered enough signatures to warrant a response) and the unsuccessful. There are, of course, many factors, including how well the petitioner publicized it, but I was particularly interested in the differences in words used.

Here are the results, followed below by some observations and finally the technobabble for those who want to know how and why I performed the analysis this way:

Observations

First of all, some numbers: there were 294 successful petitions with a total of 21,527 words (only counting each word once per petition, as explained below), and 3,868 unsuccessful petitions with a total of 304,706 words.
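That "once per petition" tally can be sketched like this (a minimal illustration with a hypothetical mini-corpus, not the actual analysis code):

```python
from collections import Counter

def corpus_counts(petitions):
    """Tally words across petition texts, counting each word
    at most once per petition (one set per document)."""
    counts = Counter()
    for text in petitions:
        counts.update(set(text.lower().split()))
    return counts

# "gun" appears in two petitions, so it counts twice --
# even though the first petition uses it twice.
print(corpus_counts(["gun control now gun", "gun rights"]))
```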

The words most characteristic of successful petitions have more extreme log-likelihood values; the top 25 range from 11.94 to 29.72, while the top 10 among unsuccessful petitions range from 5.06 to 12.91. This could be because there are so many more unsuccessful petitions, and probably a greater range of reasons they were unsuccessful.

If you'd like to see a random selection of five petition titles containing each of the words listed in the graphic, head on over to my other, nerdier blog.

"Gun" is the most characteristic word in successful petitions. (Again, "successful" just means there were enough signatures for the White House to write a response; few if any of the petitions' requests were actually enacted.) There were both successful pro- and anti-gun control petitions; this is an issue that people on both sides feel passionate enough about to participate in this project.

There are also four names from Netflix's Making a Murderer series: Avery (the defendant), Halbach (the victim) and both first and last names of Brendan Dassey, the other defendant.

The Westboro Baptist Church is a popular item, as are "tragedy," "Connecticut" and "CT" from the Newtown shootings. (Interestingly, "Newtown" had a somewhat lower log-likelihood keyness of 8.54, indicating it was relatively more common in unsuccessful petitions than the other terms.)

Some emphatic words seem to be common in successful petitions: "imperative" and "definitely".

As for the unsuccessful petitions, the most overrepresented word is "say". A skim through these petitions (and there are a lot of them) reveals that many of them are motivated by some perceived injustice of perception, e.g. "people say X, but I believe Y".

The word "I" is associated with unsuccessful petitions, as well. Those who write (possibly rant) in the first person don't get a lot of support; perhaps the fact that they reference themselves is an indication they don't have enough organizational support to get a lot of signatures.

"Genocide" turns up a lot in unsuccessful petitions. Most of the time, it's in the phrase "white genocide".

"2014" turns up more characteristically than any other year because of the number of unsuccessful petitions calling for a boycott of the Sochi Olympics and/or action regarding Ukraine.

Technobabble

There are plenty of ways one could approach this project of differentiating between the words contained in successful and unsuccessful petitions, like topic modeling or exploratory machine learning, but sometimes simple is best, especially since, as I mentioned, the specific words used in a petition are likely to be a secondary feature or at best a proxy feature (i.e. one that implies something else that is actually causal) among the factors that determine whether a petition is successful.

So, since we have two corpora (words used in successful petitions, words used in unsuccessful petitions), and a frequency metric, why not do a simple keyness analysis? Dunning log-likelihood keyness is one of two approaches (the other being chi-squared) to determining the significance of differences in item frequencies between two datasets; it's used a lot in corpus linguistics, although it's falling out of favor as more advanced techniques become computationally feasible.

The nice thing about keyness is that it is a measure of significance, combining the effect of ratio and absolute difference between frequencies, and the size of the corpora. This makes it unnecessary to eliminate stopwords, which is an arbitrary process that always rubs me the wrong way. If "and" is used very often, it still might be more significantly found in one corpus than another, but the threshold of difference in frequencies is much higher than for a less common word. Conversely, if an extremely rare word is found 10 times in one large corpus and once in another, it's unlikely to pass the threshold of significance even though the ratio between the frequencies is high.
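To make that concrete, here's a minimal sketch of the Dunning log-likelihood (G²) statistic for a single word, using the standard two-corpus formulation. The corpus sizes in the usage lines are the ones quoted above; the word counts are hypothetical:

```python
import math

def log_likelihood(a, b, c, d):
    """Dunning log-likelihood (G2) keyness for a word occurring
    a times in corpus 1 (c total words) and b times in corpus 2 (d total words)."""
    e1 = c * (a + b) / (c + d)  # expected count in corpus 1 under the null
    e2 = d * (a + b) / (c + d)  # expected count in corpus 2 under the null
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# A word at identical relative frequency in both corpora has keyness 0:
print(log_likelihood(10, 10, 100, 100))  # 0.0
# Hypothetical word: 40 hits among 21,527 successful-petition words,
# 100 hits among 304,706 unsuccessful-petition words:
print(log_likelihood(40, 100, 21527, 304706))
```

The zero-count guards matter: a word absent from one corpus contributes nothing from that side (the limit of x·ln x as x → 0), which is exactly why stopword removal isn't needed.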

Wednesday, March 23, 2016

I just signed up for a 30-day trial of IBM Watson's Bluemix, a set of mostly language processing APIs that are, in some cases, quite illuminating, and in other cases, rather entertaining.

One of the tools is Personality Insights, which will take any text and algorithmically predict the personality traits of the author. When I saw there was even a one-click tool to submit the RSS feed of a blog, how could I resist? I submitted prooffreader.com.

The results are... interesting. Like a horoscope, I wonder how much is overgeneralization; I hope there is some overfitting going on, because some of it is not at all flattering!

For one, my emotional range is only 26.8%. My reaction: Meh.

And while I agree I'm not the most outgoing person in the world, a 0.8% score for "extraversion" is a little extreme! Ah well, at least I can comfort myself that unlike IBM's Jeopardy!-winning supercomputer, I know how to spell "extroversion". Well, at least today I learned that the proper (but less common) spelling of what most people call "extroversion" is "extraversion", because Latin.