Posts categorized "Errors"

Really enjoying ProPublica pieces lately. Several recent articles cover topics of great interest to me, and readers of my books will be familiar with the themes.

My favorite is an article that speaks a truth about data projects -- much as we sweat over data collection, data integrity and statistical models, the true challenge is persuading the rest of the world to adopt our end products. The title of this piece says it all: "The FBI built a database that can catch rapists--and almost nobody uses it" (link).

The data project in question is an early effort to link data from multiple sources to leverage correlations to solve the problem of identifying serial offenders. However, less than 10% of local police departments contribute data to the system, rendering it toothless. In my experience, it is common to find data projects stuck in first gear, and failing to make any real-world impact.

Kudos to the authors for asking the dirty question of the return on investment of such a system. It is believed that in 12 years, the system may have helped solve 33 crimes. It costs $800,000 per year to maintain (most likely, contractor expenses). You do the math!

For managers, the key is to diagnose properly the reasons for inaction. Lack of adoption is frequently blamed on technology but the reality is much more complicated.

David Epstein reports on raids on steroid labs (link). Law enforcement is the most effective way to catch cheaters in sports. In Numbers Rule Your World, I explained why anti-doping tests are ineffective, in the sense that false negatives are rampant, letting lots of dopers off the hook. This conclusion comes from a simple statistical calculation. In the chapter on a lottery cheat, I described how statistics can prove beyond reasonable doubt that someone has cheated, but physical evidence is required to nail the perpetrator.

Epstein then expanded the conversation: "World-class athletes are merely the fine layer of frost atop the iceberg’s tip when it comes to the steroid economy." The headline of the piece is "Everyone's Juicing".

I find it interesting that Epstein said "In years of reporting on performance-enhancing drugs, I’ve frequently been asked why athletes in smaller sports or facing lower stakes would dope, given that there’s little money in it for them." This feels odd because when I was researching my book five or six years ago, I heard the opposite claim: that elite athletes couldn't possibly be doping because they don't need steroids. (Think Barry Bonds back in the day.) This tells me that (a) public opinion has shifted due to the Armstrong revelations and (b) the human mind will rationalize any story, even if the story flips.

Epstein has another article in August about false negatives, which should be familiar territory for my readers (link).

***

Joaquin Sapien reports on the case of one Ruddy Quezada, who was released after spending 24 years in prison for murder. This case reminds me of the Innocence Project, whose amazing work I featured in Numbers Rule Your World. In the current case, though, we don't know whether Quezada was innocent, only that the prosecution lied about how it coerced the witness to testify. The witness testimony was the only piece of evidence, which means the prosecution is left with no avenue to re-try the case.

The case I used in my book concerns false confessions so both cases deal with coerced evidence.

The New York Times has been making waves this week featuring management practices at Amazon and workplace tracking practices at various companies (link). These are essential references for how data make us dumber.

I am going to ignore the shocking claim by the journalist who stated that GE is "long a standard-setter in management practices." To give him some credit, he did not say "good" management practices. It is true that business schools like to glorify GE managers. But the most famous GE doctrine is to line up all employees at the end of the year, and hand the bottom 10% pink slips. (See Jack Welch's Wiki page.) This practice is cut from the same cloth as the "purposeful Darwinism" that was vilified in the article about Amazon.

What I want to focus on is the completely bonkers line of argument paraded by software vendors who sell workplace tracking (i.e. surveillance) tools.

1. The performance of your workers is completely measured by our continuous and usually stealthy tracking of data.

2. Because of the continuous and stealthy nature of tracking, the data are objective, unbiased, trustworthy, and accurate.

"I couldn’t imagine living in a world where I’m supposed to guess what’s important, a world filled with meetings, messages, conference rooms, and at the end of the day I don’t know if I delivered anything meaningful."

So what are the data that would allow each worker to know every day whether they "delivered something meaningful"? The article mentioned just two types of data: the usual tracking of how people spent their time at work; and little notes workers are encouraged to send to bosses to "nudge" or "cheer" each other.

Just because you can count "nudges" or "cheers", or you can count the words, or pairs of words, or triplets of words, most frequently associated with someone, doesn't mean you know anything meaningful about their performance.

In fact, many of these data are manipulated, and probably worthless.

Even within the Times articles, there are multiple examples of why employee notes are not to be trusted. "People wouldn't put something negative in a public forum, because it would reflect poorly on them," said one vendor. At Amazon, employees reported that the secret feedback system is "frequently used to sabotage others". I find it hard to believe that we even need proof of such behavior. In fact, that is one of the key points I made in Numbersense.

Counting emails, or minutes spent on the work computer, is similarly pointless. Someone who spent 20 minutes on the computer is not necessarily more productive than someone who spent 10 minutes working and 10 minutes web-surfing random things. The former employee might be slower, or confused, or learning on the job, or day-dreaming. Again, it's hard to believe that we even need proof of this point.

There is a tendency to believe that data have intrinsic value. One of the worrying trends in the age of Big Data is insufficient time spent understanding if the data collected measure the right things, and whether the analyses provide even marginally trustworthy answers to the questions being asked.

In our newest column, we take on the recent media obsession with companies who make robots that hire people. (link)

As with most articles about data science, the journalists failed to dig up any evidence that these robots work, other than glowing quotes from the people selling them. We point out a number of challenges that such algorithms must overcome in order to generate proper predictions. We also discuss why measuring the outcomes of these predictions is so hard: one problem is that we have no objective standard for what counts as the "correct" hire; another is that the action we take based on a prediction affects the very outcome being predicted.

In the last installment, I embarked on a project--perhaps only a task--to assemble a membership list for an organization. It sounded simple: how hard could it be to merge two lists of people? Of course, I couldn’t just stitch one list on top of the other, as some members both subscribed to the newsletter and joined the Facebook group. These duplicate rows must be merged so that each individual occupies one row of data.

With barely a sweat, I blew past my initial budget of two hours.

After a half day, I produced a merged list by matching Facebook usernames to email usernames. It felt like running an obstacle course, with one annoying issue popping up as soon as another was resolved. Stray punctuation, ambiguous names, case sensitivity, and so on. Most of these problems lacked clear-cut solutions. Some periods (full stops) were redundant, but not all; some middle names were part of the last name, but not all. Tick, tick, tick, tick. These data issues demanded consideration, and considerable time.

At the start of Day 2, I executed a planned U-turn. Starting with the two lists of people, I attempted to match first and last names. I tried usernames as the key first because only a small portion of the Email list included names. However, a match of first and last names is a more confident result than a match of usernames.

Immediately, I stepped into text-matching quicksand. I had to process the Facebook names (previously scraped) the same way I had fixed up the names in the Email list.

As before, I tried a “full outer join.” Disaster. The output data had a crazy number of rows. I sensed missing values. Sure enough, there were some Facebook members for whom I did not have names (for example, they provided names in Chinese or Korean characters). Each of these members with missing names matched, erroneously, the whole set of email subscribers who also did not provide names.
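To see the mechanics of the blow-up, here is a minimal pandas sketch (column names invented). Unlike SQL, which treats NULLs as unequal, pandas matches missing keys to each other, so every nameless row on one list pairs with every nameless row on the other:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the two lists; NaN means no usable name.
fb = pd.DataFrame({"first": ["David", np.nan, np.nan],
                   "last": ["Columbus", np.nan, np.nan],
                   "fb_user": ["davidcolumbus", "user2", "user3"]})
em = pd.DataFrame({"first": ["David", np.nan, np.nan],
                   "last": ["Columbus", np.nan, np.nan],
                   "email": ["dc@gmail.com", "a@b.com", "c@d.com"]})

merged = fb.merge(em, on=["first", "last"], how="outer")
print(len(merged))  # 5 rows: 1 real match plus a 2x2 block of bogus "matches"
```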

One way out of this mess was to extract only the people with non-missing names from either list, and then merge those subsets. This path was not easy, though. The split created four types of members: those with matching names on both lists; those with a Facebook name that didn’t match any email name; those with an email name that didn’t match any Facebook name; and those who provided no usable names on either list.

The challenge was to combine those four groups of members in such a way that each unique member occupies just one row of data. For each such member, I also wanted to gather all the other information from both the Facebook and Email lists. This required defining a number of dummy columns, as well as columns recording which list supplied each piece of data.
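A hypothetical sketch of that assembly, with invented column names ("name_key" standing in for the standardized name):

```python
import pandas as pd

# Hypothetical frames; "name_key" is the standardized name string.
named_fb = pd.DataFrame({"name_key": ["DAVID COLUMBUS", "MARY RUTHERFORD"],
                         "fb_user": ["davidcolumbus", "maryr"]})
named_em = pd.DataFrame({"name_key": ["DAVID COLUMBUS", "SCOTT LEWIS"],
                         "email": ["dc@gmail.com", "slewis@yahoo.com"]})

# Group 1: names match on both lists.
both = named_fb.merge(named_em, on="name_key")
both["on_fb"], both["on_email"] = 1, 1

# Groups 2 and 3: a name on one list with no match on the other.
fb_only = named_fb.loc[~named_fb["name_key"].isin(named_em["name_key"])].copy()
fb_only["on_fb"], fb_only["on_email"] = 1, 0
em_only = named_em.loc[~named_em["name_key"].isin(named_fb["name_key"])].copy()
em_only["on_fb"], em_only["on_email"] = 0, 1

# Group 4 (no usable name on either list) would be appended the same way.
members = pd.concat([both, fb_only, em_only], ignore_index=True)
print(members[["name_key", "on_fb", "on_email"]])
```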

I experienced a soothing satisfaction when the output data appeared as expected.

But the job was not yet finished. I ended up with two merged lists, one based on username matching and the other on name matching. It was time to merge the merged. I’ll spare you the details, most of which resembled the above.

Knowing my client’s name was on the list, I looked him up. There he was, again and again, occupying four or five rows. This might make your heart sink since I had tried so hard to maintain one row per member. But don’t worry. I was simplifying things a little bit. If someone provided multiple email addresses, as my client did, I had decided to keep all of them.

At long last, the master list of members was born. This exercise bore instant rewards. It is very useful to know which members are on both lists and which members are on just one. We have a rough measure of how involved a member is. The hard work lies ahead since our goal is to gain a much deeper understanding of the members.

An organization wanted to understand its base of members, so the first order of business was constructing a database of all the people who could be considered members. We decided to define membership broadly: members included those who joined the Facebook group and those who subscribed to the newsletter.

The organization kept two separate lists which I would merge to create a master list. For simplicity, I’ll call them the FB list, and the Email list. In merging, the key is the key. Let me explain. The simplest key is an email address. If someone’s email address shows up on both lists, then I infer that those entries concern the same person, and combine them. My goal is to remove double counting of anyone who appears on both lists.

Sounds simple enough.

But never that simple, right? First, the Facebook group is the graveyard of data. Facebook provides zero statistics on group members and activities. Yes, the company that makes a business out of data does not hear the data-deprived group owners who have been pleading for years.

What is a data scientist to do? Scrape, that’s what. Members can find out who else is in the group by the scroll-wait-reset-scroll routine. You know that feeling. I know you do. You scroll to the bottom of a web page. Your browser gets the hint. It loads a few more items, while the slider floats away, usually to the wrong spot. You re-set the position, and scroll some more. After much scrolling, I scraped that page to compile the Facebook list. It’s got the name of the person, their Facebook username, and their location (when available).

Notice I didn’t say email address. So the FB list did not contain the all-important key. Another possible key is first and last names. Reviewing the Email list, I realized that newsletter subscribers are not required to provide names, so matching names to the FB list would yield few hits. The third candidate is less accurate: I tried matching the Facebook username to the email username.

The client furnished an Excel file, which I’ve been calling the Email list. Upon opening the list, I turned the email address into all uppercase letters. I have matched enough text data to know that people are hardly in control of their fingers when they type text into web forms. “John”, “JOHN”, “joHN”, “JOhn”, and so on typically mean the same thing, regardless of case. (The occasional sadist offers “J0hn,” or “Jhon,” or “Jo hn.”)

Meanwhile, the client wondered if email addresses are really case-insensitive. I suggested asking Google. The search engine gave an ambiguous answer. The part after the @ sign is case-insensitive whereas the part before @ is case-sensitive, but then most email providers treat both parts as case-insensitive.

It’s rare when Google complicates your life. I fished out the UPPERCASE(email_address) formula, deleted it, broke up the email address into the user name and domain name parts, upper-cased the domain name, and reconnected the two parts, re-inserting the @ sign. The machine must follow these steps but a human being instinctively knows where to apply the cut. Some researchers believe the brain executes those steps at warp speed but I don’t buy it.
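For the record, here is the same splice in Python (a sketch; the function name is mine):

```python
def normalize_email(addr: str) -> str:
    # Upper-case only the domain; the part before the @ is,
    # strictly speaking, case-sensitive.
    local, _, domain = addr.strip().rpartition("@")
    return local + "@" + domain.upper()

print(normalize_email("John.Smith@Gmail.com"))  # John.Smith@GMAIL.COM
```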

Next, I dropped the domain names from the split-and-spliced email addresses to get ready to match to Facebook usernames. Sheesh, the client did not ask if Facebook usernames are case sensitive or not. (They aren’t.) I proceeded to merge the two lists.

I executed a “full outer join.” With this procedure, any username that appears in one or both of the lists will find its way to the output dataset. On this first attempt, nothing merged. Even though username “davidcolumbus,” say, lived on both lists, the computer did not combine the data; the two matches sat one on top of the other.

I took a deep breath, for I had reached a point where I must be honest with myself. This project was sure to bust the two hours I originally allotted. The merge could easily take another hour, maybe two, if no new issues emerged.

The matching rows did not combine because the join only links columns that carry the same label. Since the Facebook and email usernames are different entities, those columns carried different labels.

But syncing those labels solves one problem while creating another! Members who appear on only one list have only one of the usernames. Besides, Facebook usernames are unique while email usernames, when detached from their domains, are not. A better solution is to set up a third username column in both lists, whose purpose in life is to be the matching key.
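In pandas terms, the sketch looks something like this (frames and columns invented for illustration):

```python
import pandas as pd

fb = pd.DataFrame({"fb_user": ["davidcolumbus", "junkcharts"]})
em = pd.DataFrame({"email_user": ["DAVIDCOLUMBUS", "kfung"]})

# A third column, present in both frames, exists purely to be the key;
# the original username columns survive untouched.
fb["key"] = fb["fb_user"].str.upper()
em["key"] = em["email_user"].str.upper()

merged = fb.merge(em, on="key", how="outer")
print(merged)  # davidcolumbus combines; everyone else keeps a half-empty row
```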

What about the other columns? Did I want them combined or not? Take, as an example, the first and last names, which show up on both lists. If I standardized the labels of these columns, the computer would attempt to merge them. What if David Columbus appeared as Dave Columbus on the other list with matching usernames? Forcibly combining the name columns would cause one of these variations to be dropped. If I wanted to keep both spellings, I had to retain all the name columns, which happens only if I assign them distinct labels--exactly the opposite of what I did with the username columns.

If that isn’t confusing enough, I stumbled upon another issue. In the Email list, while most names appeared as “First <space> Last,” there were examples of “Last <space> First”, and “Last <comma> First”, and “First Initial <space> Last,” and so on. As an analyst, your first thought is “What’s wrong with our designers? Why didn’t they create separate text boxes for first and last names?” Then, you accept that blame gets you nowhere; you still have to fix what’s broken.

A soft voice enters your head. You wish you hadn’t seen the problem. You hope it was just a bad dream. But you wake up.

In front of me I had two paths. I could follow path A, and that meant developing code to automatically detect the various anomalies and fixing them. This path would take hours. Which is the first name in “Scott Lewis”? How would a computer figure this out? What rule could apply generally?

And then, there was path B, better known as handcrafting. If I had 1,000 rows of data, and if it took two seconds to scan a name and determine the type of anomaly, I would have completed the exercise in 30 minutes or so.

I chose path B. It was ugly and unsexy but more of a sure thing.

I wish I could tell you I stopped looking. But I couldn’t help it. Some cultures embrace double surnames, like “De” something or “Von” something. My code was parsing “Chris De Jong” as first name Chris, and last name Jong. I needed a more complex rule, something like: “If the name has three words, take the first as the first name, and the last two as the surname.” This rule runs afoul of someone like “Mary Anne Rutherford.” At a crossroads again. I could teach the computer how to lump the middle name, or I could exercise my brain some more.
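For the curious, a rule-based splitter might start like this (a hypothetical sketch; every rule below has exceptions, which is exactly why I kept handcrafting):

```python
# Particles that usually signal a double surname; the list is never complete.
PARTICLES = {"de", "van", "von", "da", "del", "di"}

def split_name(full_name: str) -> tuple[str, str]:
    words = full_name.split()
    if len(words) == 2:
        return words[0], words[1]
    if len(words) == 3 and words[1].lower() in PARTICLES:
        return words[0], " ".join(words[1:])   # Chris / De Jong
    if len(words) == 3:
        return " ".join(words[:2]), words[2]   # Mary Anne / Rutherford
    return words[0], " ".join(words[1:])       # punt on everything else

print(split_name("Chris De Jong"))         # ('Chris', 'De Jong')
print(split_name("Mary Anne Rutherford"))  # ('Mary Anne', 'Rutherford')
```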

By this time, I was exhausted. If you have followed me to this point, you have my admiration. In the next installment, I shall finish the assignment.

I also spoke about data science at a Faculty-Alumni panel titled "Science Under Attack!". Here is what I said:

In the past five to 10 years, there has been an explosion of interest in using data in business decision-making. What happens when business executives learn that the data do not support their theories? It turns out that the reaction is similar to what other panelists have described - science under attack! When I bring data into the boardroom, the data are measuring something, which means the data are measuring someone; and you can bet that someone isn't too happy about being measured. My analysts endure endless debates and wild goose chases, and are asked to conduct one analysis after another until the managers find a story they like.

I think the gap between data analysts and business managers, who are often non-technical people, has two causes: (a) a communication gap and (b) the nature of statistics as a discipline.

Imagine you have to sell a product to Koreans in Korea. You don't speak a word of Korean and your counterpart does not speak English. What would you do? You'd probably hire a translator who would deliver your sales pitch in Korean. What you wouldn't do is to stay in Korea for a year, teach the counterpart English, and then give your original pitch in English. But that is exactly what many data analysts are doing today. When challenged about their findings, we try to explain the minute details of how the statistical output is generated, effectively teaching managers math. And we are not succeeding. I have spent much of my career thinking about how to bridge this gap, how to convey technical knowledge to the non-technical audience.

The second reason for the gap is the peculiar nature of statistical science. What we offer are educated guesses based on a pile of assumptions. This is because statistics is a science of incomplete information. We can never produce a definitive answer because we simply do not have all the data we need. But this creates an opening for people who are pre-disposed to oppose our conclusions to nitpick our assumptions.

I also want to bring up a different threat to science, one that comes with the era of Big Data. This is a threat from within, not from without.

The vast quantity of data is fueling lots of analyses by lots of people, most of which are false. A nice illustration of this is the website tylervigen.com. This guy dumped a lot of publicly available data into a database, and asked the computer to select random pairs of variables and compute the correlation between them. For example, one variable might be U.S. spending on science, space and technology, and the other suicides by hanging, strangulation or suffocation. You know what, those two variables are extremely correlated, to the tune of 99.8%.
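You can reproduce the spirit of this exercise in a few lines. The sketch below generates random trending series, so any correlation it finds is spurious by construction:

```python
import numpy as np

# Toy version of the tylervigen.com exercise: correlate many random
# trending series and report the best pair. Nothing here is causal.
rng = np.random.default_rng(0)
walks = np.cumsum(rng.normal(size=(1000, 12)), axis=1)  # 1,000 random walks over 12 "years"

corr = np.corrcoef(walks)      # 1,000 x 1,000 correlation matrix
np.fill_diagonal(corr, 0.0)    # ignore each series' correlation with itself
print(f"highest correlation found: {np.abs(corr).max():.3f}")
# With roughly 500,000 pairs to choose from, a correlation above 0.99 is routine.
```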

Another aspect of Big Data analysis deserves attention: many of these analyses do not have a correct answer. Take Google's Pagerank algorithm, which is behind the famous search engine. Pagerank is supposed to measure the "authority" of a webpage. The model behind the algorithm assumes that the network of hyperlinks between webpages provides all the information needed to measure authority. But no one can verify how accurate the Pagerank metric is, because no one can tell us the true value of authority.
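Here is the textbook version of Pagerank, computed by power iteration (a sketch, not Google's production system). Notice that the only input is the link structure; no independent measure of authority ever enters the calculation:

```python
import numpy as np

# Hypothetical tiny web: page -> pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85                 # d is the usual damping factor

# Column-stochastic link matrix: M[target, source] = 1 / outdegree(source).
M = np.zeros((n, n))
for page, outs in links.items():
    for target in outs:
        M[target, page] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)     # start with uniform "authority"
for _ in range(100):           # power iteration to the stationary vector
    rank = (1 - d) / n + d * M @ rank

print(rank.round(3))           # page 2, with the most inbound links, ranks highest
```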

In the case of Pagerank, we may be willing to look past our inability to scientifically validate the method because the search engine is clearly useful and successful. But I'd submit that many Big Data analyses are similarly impossible to verify; in many cases they may not be useful, and in the worst cases, they may even be harmful.

For those who have found it tough to keep up with Andrew Gelman's prolificacy, here are some brief summaries of several recent posts:

On people obsessed with proving the statistical significance of tiny effects: "they are trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down." (link)

[I left a comment. In Big Data, we have thousands, no millions, of kangaroos jumping out of sync, but still one feather.]

On people testing a zillion things hoping to land on the one that "works": "I suggest you should fit a hierarchical model including all comparisons and then there will be no need for such a corrections." (link)

[This is something Andrew has been advocating for a while. The idea is that such models have in some sense a built-in correction for the multiple comparisons problem. Unfortunately, some researchers are misinterpreting Gelman. I recently read a report that cites Gelman's paper as evidence that "multiple comparisons" is not a real problem, and then proceeds to fit dozens of regressions without any mechanism to control for multiple comparisons!]

On when to throw out all your data, the lot of it: "Sure, he could do all this without ever seeing data at all—indeed, the data are, in reality, so noisy as to have no bearing on his theorizing—but the theories could still be valuable." (link)

In Part 1, I covered the logic behind recent changes to the statistical analysis used in standard reports by Optimizely.

In Part 2, I ponder what this change means for more sophisticated customers--those who are following the proper protocols for classical design of experiments, such as running tests of predetermined sample sizes, adjusting for multiple comparisons, and constructing and analyzing multivariate tests using regression with interactions.

For this segment, the choice of sticking with the existing protocol or not depends on many factors, such as the decision-making culture and corporate priorities. No matter what you do, it is important to realize that improved analysis tools do not obviate careful planning and execution.

***

Let me start with my advice. Initially, keep running your tests to the usual fixed sample sizes; in essence, ignore the stopping rule suggested by the Stats Engine. Over a series of tests, including some A/A tests, you can measure how often those stopping rules would have correctly ended the tests (relative to the fixed-size testing protocol). This allows you to estimate the “time saving” achieved from sequential testing.
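Here is a sketch of the bookkeeping, using a naive z-threshold as a stand-in for the proprietary Stats Engine rule. Each simulated test runs to its planned size anyway; we merely record when the stand-in rule would have stopped it:

```python
import numpy as np

rng = np.random.default_rng(7)
pa, pb = 0.10, 0.12                  # suppose the true lift is 2 points
batches, batch_size = 50, 200        # planned size: 10,000 visitors per arm

stop_batches = []
for _ in range(200):                 # 200 simulated tests
    a = b = na = nb = 0
    stopped_at = None
    for t in range(1, batches + 1):
        a += rng.binomial(batch_size, pa); na += batch_size
        b += rng.binomial(batch_size, pb); nb += batch_size
        phat = (a + b) / (na + nb)
        z = (b / nb - a / na) / np.sqrt(phat * (1 - phat) * (1 / na + 1 / nb))
        if stopped_at is None and abs(z) > 1.96:
            stopped_at = t           # note it, but keep running to the end
    stop_batches.append(stopped_at or batches)

# Comparing stop points to the planned 50 batches estimates the time saving;
# re-running with pa == pb (an A/A test) shows how often the rule would
# stop on a phantom effect.
print("median stop point:", np.median(stop_batches), "of", batches, "batches")
```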

***

As I pointed out in last year’s presentation at the Optimizely Experience, the testing team should be concerned about what proportion of significant findings are correctly called, and what proportion of non-significant findings are incorrectly called. The “false discovery rate” is the flip side of the first quantity.

A testing program using fixed samples may face one of several problems:

a) Too few tests are called significant.

b) Too many tests are called significant.

c) It takes too long to call a test.

You need to figure out what is your biggest problem.

Conceptually, relative to a fixed-size test, a sequential test saves time if the true response rate differs from the design assumption substantially. If you’re testing on a web page for which the response rate is well-known and relatively stable, then there should be hardly any time saving on average. This is why I don’t recommend watching tests like a horse race, minute by minute. (As I said in Part 1, if you are watching a horse race, the Stats Engine will provide some sanity.)

Assume you underestimated the true effect by, say, 20 percent. The following stylized chart shows my expectation of how the new Stats Engine results compare to the classical results.

The horizontal axis shows the sample size (at which Optimizely calls an end to the sequential test) as a ratio of the fixed sample size (by design). When this is 100%, the sequential test has the same length as the fixed-sample test. Because the true effect is substantially larger than expected, for a large proportion of tests, the sequential procedure calls for an “early” exit. However, there will be a small number of tests for which the sequential test will end much later than a fixed-sample test.

On the other hand, if the design assumption is essentially correct, then I expect the behavior of the new Stats Engine will look something like this.

The line is mostly flat, meaning the test is about equally likely to end at any sample size, including sample sizes that are multiples of the fixed-sample requirement. This is the “price to pay” for doing sequential testing, i.e. multiple peeking. At the lower end of sample sizes, I expect a slight positive curve, because the Bayesian prior (assuming it is a skeptical prior) will prevent tests from being stopped “too early”.

[Thanks to Optimizely’s statistics team for entertaining my inquiries about this intuition.]

***

How important is saving time for your testing program? This depends on your readiness to move on. My experience is that unexpected time saving, say calling a winner one week before the test was supposed to end, frequently gets eaten up by the organization’s inability to move schedules around. Your IT or web developers may have other projects on their plates.

Further, if you tend to look at data by segments post-hoc, I don't think the current implementation supports that. If you know what segments you care about beforehand, then you can build those into the design.

Most importantly, please don’t fall into the trap of thinking that design and upfront planning become unimportant because of sequential testing and FDR. The design phase is very important in establishing expectations and facilitating communications within the organization.

I also recommend reading this post by Andrew Gelman on data-dependent stopping rules.

In my HBR article about A/B testing (link), I described one of the key managerial problems related to A/B testing--the surplus of “positive” results that don’t quite seem to add up. In particular, I mentioned this issue:

When managers are reading hour-by-hour results, they will sometimes find large gaps between Groups A and B, and demand prompt reaction. Almost all such fluctuations result from temporary imbalance between the two groups, which gets corrected as new samples arrive.

Over the holidays, I paid a visit to the Optimizely team, and learned that they have been developing a solution to this problem. (Optimizely is one of the leading platforms for online A/B testing. They just made an announcement this week about a new feature they are calling “the New Stats Engine”.)

Optimizely also recognizes that their clients face a credibility crisis when the A/B testing tool returns too many “significant” results. Their new tool promises to reduce this false-positive problem. It specifically tackles two sources of the problem:

a) Many clients monitor A/B tests like horse races, and run tests to significance. This is sometimes known as “sampling to a foregone conclusion”.

b) Many clients run many (dozens to hundreds, I imagine) tests simultaneously; here, a test is any pairwise comparison of variations, comparison of variations within segments, or any comparison using multiple goals. This is the “multiple comparisons” problem.

***

Let me first explain why those are bad practices.

The classical hypothesis test is designed to work with fixed sample sizes, which should be determined prior to the start of the test. The testing protocol then allows up to a 5-percent probability of falsely concluding that there is an effect. (That’s the same value as the significance level. This is not the same as saying 5 percent of the positive results are false, but that’s a different article.) However, if the analyst peeks at the result multiple times during a test, then the analyst incurs a 5-percent false-positive chance, not once, but for every such peek. Thus, at the end of the test (when significance is reached), the probability of a false positive is much, much higher than 5 percent. It can be shown that, in this setting, every A/A test will eventually reach significance.
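A quick simulation makes the point. The sketch below runs A/A tests (no true difference) and peeks after every batch of visitors, using an ordinary two-proportion z-test:

```python
import numpy as np

# 500 A/A tests, each peeked at after every batch of 100 visitors per arm.
rng = np.random.default_rng(1)
p, batches, batch_size = 0.10, 100, 100
false_calls = 0

for _ in range(500):
    a = b = na = nb = 0
    for _ in range(batches):
        a += rng.binomial(batch_size, p); na += batch_size
        b += rng.binomial(batch_size, p); nb += batch_size
        phat = (a + b) / (na + nb)
        z = (a / na - b / nb) / np.sqrt(phat * (1 - phat) * (1 / na + 1 / nb))
        if abs(z) > 1.96:        # "significant" at the 5% level -- stop and declare
            false_calls += 1
            break

print(f"A/A tests called significant: {false_calls / 500:.0%}")
# Far above 5%, and it keeps climbing the longer you let the tests run.
```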

In a “multivariate” test, the analyst makes many pairwise comparisons, and each comparison is analogous to a peek at the data. Each comparison incurs a 5-percent false-positive chance, so across all of the comparisons within one test, the chance of seeing at least one false positive grows rapidly: with k independent comparisons, it is 1 - 0.95^k, which already exceeds 50 percent at around 14 comparisons. There are many, many different ways to suffer a false positive (an error in comparison 1 only, in comparison 2 only, etc., in comparisons 1 and 2, in comparisons 1 and 3, etc.).

Now, if the multivariate test is also being run to significance, you have a many-headed hydra of a problem.

***

The Optimizely solution uses two key results from statistics:

a) A sequential testing framework is adopted, in which the analyst is presumed to be peeking at the results. Because of the skeptical prior, the Bayesian analysis in most cases will not reach significance prematurely even as the sampling continues without end. This line of research started in the 1940s with Wald.

b) All solutions to the multiple comparisons problem involve tightening the threshold of significance for each individual comparison. Optimizely adopts the Benjamini-Hochberg approach to controlling the “false discovery rate” (FDR), defined as the proportion of significant results that are in fact false. This line of research dates from the 90s and is still very active. One advantage is that the FDR is an intuitive concept.
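For the curious, the Benjamini-Hochberg procedure itself fits in a dozen lines. This is the bare idea, not Optimizely's implementation:

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Reject hypotheses while controlling the FDR at level q."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    # Compare the k-th smallest p-value against q * k / m.
    thresholds = q * np.arange(1, m + 1) / m
    passing = p[order] <= thresholds
    k = np.max(np.nonzero(passing)[0]) + 1 if passing.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True          # reject the k smallest p-values
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.2, 0.7]
print(benjamini_hochberg(pvals))        # [ True  True False False False False]
```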

***

What this means for Optimizely clients is that your winning percentage (i.e., the proportion of tests returning significant results) will plunge! And before you despair: this is actually a great thing. Here’s why: in many testing programs, as I pointed out in the HBR article (link), there are too many “positive” findings, which means there are too many false positives. This is fine until management starts asking you why those positive findings don’t show up in the corporate metrics.

If you currently rely on standard Optimizely reports to read test results, and run tests to significance, then the Stats Engine is surely a no-brainer.

In the next post, I have further thoughts for those customers who have more advanced protocols in place.

PS. This is Optimizely's official explanation of their changes on YouTube.

During my vacation, I had a chance to visit Trifacta, the data-wrangling startup I blogged about last year (link). Wei Zheng, Tye Rattenbury, and Will Davis hosted me, and showed some of the new stuff they are working on. Trifacta is tackling a major Big Data problem, and I remain excited about the direction they are heading.

From the beginning, I have been attracted by Trifacta’s user interface. The user in effect assembles the data-cleaning code through visual exploration, aided by suggestions based on past behavior.

Here are some improvements they have made since I last wrote about the tool:

Handling numeric data - Trifacta now generates more advanced statistics, e.g. percentiles, about the columns in the Visual Profiler, whereas in the past every column was summarized as a histogram. I believe there is also some binning functionality.

Moving beyond Top N - I ranted about Top N thinking in the past (link), and I wasn’t happy that the Trifacta demo seemed to encourage this bad practice. I’m happy that the team heard the complaint and now offer a Random N selection. Eventually, I think Random N should be the default; I don’t know why anyone would want to see Top N.

Interactive workflow - Random N is a big step forward, but in the world of data cleaning, it’s not sufficient. The reason is that many data quality problems are rare cases that don’t show up in a random sample. To deal with this, Trifacta has created an interactive workflow. Through the visual exploration paradigm, the software prepares a set of code; when the user applies the code to the entire dataset, the tool automatically checks for further anomalies and reports them to the user. For instance, there may be a handful of email addresses with unusual structures that did not appear in the random sample, and thus fall outside all of the data-wrangling rules. These are flagged for further treatment.

Column metadata - Another exciting development is the expanded use of metadata associated with columns. Such metadata is a major difference between an Excel spreadsheet and any sophisticated data table. For instance, the user can now associate labels with values within a column.

New file formats - Trifacta handles many new data formats like JSON. It can, for example, accept a JSON file and parse the nested structure into columns. Very nice addition!

***

I think Trifacta can gain ground by pushing the envelope on two fronts: more and better visual cues to help users diagnose data-quality problems; and more sophisticated recipes for how to handle such problems, informed by a knowledge base of past user behavior.