We are a commune of inquiring, skeptical, politically centrist, capitalist, anglophile, traditionalist New England Yankee humans, humanoids, and animals with many interests beyond and above politics. Each of us has had a high-school education (or GED), but all had ADD so didn't pay attention very well, especially the dogs. Each one of us does "try my best to be just like I am," and none of us enjoys working for others, including for Maggie, from whom we receive neither a nickel nor a dime. Freedom from nags, cranks, government, do-gooders, control-freaks and idiots is all that we ask for.


Friday, April 10. 2015

Our department was given a briefing on yet another huge company-wide initiative to aggregate and coalesce all data, allowing us to develop relationships across whole departments and sectors of the businesses we run. It's a tremendous opportunity, and one which is needed if you consider what Facebook and Google are doing with data (among many other firms that have well-developed data management groups).

I had several questions about the project. For one, was there a revenue impact which was expected to offset the cost, and if so how was it calculated? What was the timeline for introduction at departmental and company-wide levels? What were the expectations of the use of the data? Was it better to implement in a piecemeal fashion, department by department - continuing the current path we are on - or was their top-down approach more efficient and likely to yield better results? Each question received an answer, sometimes dismissive, which led to more questions.

I was viewed negatively for my inquisitiveness. I explained I wasn't opposed to the project, but that I'd seen projects like this many times. None have worked as expected and most never paid off. These were not reasons to avoid doing it, but it is good to ask questions and be sure. I was told to 'trust' the data scientists, none of whom I know, and not to stand in the way. I acquiesced, and ceased my questions. Groupthink is a powerful thing. Data was here to save our business, I was assured.

On the train ride home, I ran into a colleague from another department who is much closer to this project, and he told me even more about it. For one, it was the third attempt by this team to implement the 'vision' (so much for trust!). For another, they were abandoning all the work done in the previous two attempts and starting from scratch, meaning work which had been done on all the old systems had to be reassessed and either tossed or transferred to newer platforms. Finally, they'd spent exorbitant sums of money already, to the point that break-even was probably 10 years off, assuming they met their 4-year timeline. He listened to my questions and nodded, saying they were all the right questions and there was good reason to question the nature and scope of this project.

Google, Facebook and all the other firms with huge data systems have the benefit of being young and starting from scratch while new technologies were being introduced. This is how business works; it's part of the process of creative destruction. The newer companies benefit from untried, but potentially beneficial products, living or dying by their ability to manage and incorporate these ideas and technology. Older companies have to try to keep up, and many are incapable of doing so. However, these older firms need to be careful about the implementation. Data is as much about art as it is about what the data tells us. Sometimes less is more. Sometimes your gut tells you as much as $10mm worth of information does. I have seen people collect information on months-long projects only to confirm suggestions which were made at the outset. The delays cost money. There are rare, very rare, occasions when the data tells us something different. Sometimes the reason it tells us something different is due to the time delay in collecting the data. Perhaps this is a form of Heisenberg's Cat played out in the realm of business.

I am a huge believer in collecting and managing data. My job relies on it. But as I tell my boss, data and technology are like Stradivarius violins. You can give me a Stradivarius and I will make awful noise with it. Give it to a concert violinist, and beautiful music is made. The same is true of data. Many data scientists today, I've found, make very basic mistakes in their assumptions about what data tells them. The most common is the confusion over causation and correlation. I have had arguments with PhDs over this very issue when they present correlative data without proving the linkage to causation.
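For anyone who wants to see the correlation trap in action, here is a minimal Python sketch. Everything in it is invented for illustration: the names (temperature, ice_cream_sales, sunburn_cases), the noise levels, and the sample size are assumptions, not figures from any real project. Two made-up metrics are both driven by a hidden third factor, so they correlate strongly even though neither one causes the other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical hidden driver (assumption for illustration only).
temperature = [rng.gauss(0, 1) for _ in range(5000)]

# Two made-up metrics, each equal to the driver plus its own independent noise.
# Neither appears in the other's formula, so neither causes the other.
ice_cream_sales = [t + rng.gauss(0, 0.5) for t in temperature]
sunburn_cases = [t + rng.gauss(0, 0.5) for t in temperature]

r_raw = pearson(ice_cream_sales, sunburn_cases)

# Strip out the shared driver and correlate what is left over.
# Subtracting t directly works only because the simulation uses slope 1.
ice_residual = [s - t for s, t in zip(ice_cream_sales, temperature)]
burn_residual = [c - t for c, t in zip(sunburn_cases, temperature)]
r_controlled = pearson(ice_residual, burn_residual)

print(f"raw correlation: {r_raw:.2f}")
print(f"after removing the shared driver: {r_controlled:.2f}")
```

The raw correlation comes out high, while the leftover correlation sits near zero once the shared driver is accounted for. In real data you rarely know what the hidden driver is, which is exactly why a strong correlation on a slide proves so little by itself.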

Baseball is a great example of this point. Sabermetrics have revived and increased my interest in the game. Yet Sabermetrics have limits. A cute, sappy movie, Trouble With The Curve, illustrates where data intersects with knowledge and experience. Data can provide support, but it takes experience to know what that data is telling you.

Dr. Joy Bliss recently posted about this issue, as the problem has infected even the realm of medicine and health.

Michael Crichton famously warned us of the problem of politicized science and data. Sadly, many intelligent people remain blind to the risks of misplaced trust in data, demonizing critics without explaining fully why the critics' logic is flawed.

A company, like the one which employs me, is just as likely to politicize positions. We call it groupthink. In my briefing, I was not part of the groupthink. I enjoy being on the outside. I may be wrong at times, but when I am, I'm happy to know that I have played the role of Captain Obvious, asking difficult questions in a fashion to open up the thought process further - if it can be opened up further. Sadly, as I watch what happens in the office, I begin to understand why Progressives remain so prevalent in our society. They are incapable of moving past groupthink. If everyone else is doing it, it must be good - right?

That groupthink is so prevalent can be explained by our history as hunter-gatherers. We spent ~80,000 generations in small groups where consensus was second nature. Groups that weren't thinking together didn't survive.
We've been only about 500 generations in agricultural-commercial society where individuals could start to think and act for themselves. Many in the group will still be 'concerned' with those who are independent (especially if they are successful) but new ideas that benefit society will be seen by most as good.

I think that is backwards. It is pretty clear that agriculture and industry made coercion more effective because there was no escape. A disaffected hunter/gatherer could always split off. As Greg Cochran says, we are the descendants of those who went down on their knees.

By the way, 20 generations is enough for major evolutionary change: see the Russian fox experiment. You can get a new species with 500 generations.

"The most common is the confusion over causation and correlation."
Lots of people believe that you can prove causality with statistics. They miss the obvious. Causality is a deterministic process. The outcome is predictable. Statistics, by definition, deals with random processes where the outcome is unpredictable. Simple examples of random processes are flipping a coin and rolling a pair of dice.

"Data can do many things. But the last thing it should be used for is policy-making, because data is typically utilized under the 'pretense of knowledge' and applied in a fashion that has unintended consequences. They may also have politics, which don't benefit you, built in."

Figures don't lie, but liars figure. "Beat on those numbers enough, and they'll tell you what you want to hear."

I am a data person myself, and as I was explaining to a friend the other night whose university is launching an ultra-big Big Data project with no clear goals or purpose:

Big data is often Bad data. If you're drawing information from multiple different datasets into one huge Borg-like cube, a lot of garbage can get into your database and you will need to spend a lot of time and money cleaning and adjusting it - mind you, in order to do that you need to know where the problems are! Sometimes I watch these PowerPoint presentations based on integrated data of dubious quality and I think to myself, "This is like a Potemkin Village!"

Depending on what you're planning on doing, a dirty dataset is probably not going to be that big a deal, but you still have some unknown unknowns and you can't know how big an effect these things might have.

Also: data costs money. Just because Big Data has gotten less expensive in recent years doesn't mean it's cheap.

I work in the database world as an architect and developer. Been doing so for about fifteen years.

I've seen "big data" projects for going-concern companies that worked really, really well. One example allowed a marketing company to cut time-to-deliver on marketing campaigns by 40%. This gave that company a huge competitive edge over everyone in its particular focus. Pushed sales and profits through the roof for years.

I've also seen them soak resources for years without any identifiable payoff.

The difference has really been whether the top execs (or owners) of the company let the data people do the job and present undoctored data, which results in success, vs when every manager in the shop had to pee in the data (figuratively, of course), which resulted in "yep, it says exactly what I've been saying for years!"

The second option does have the benefit of a lot of ego inflation. I haven't found a way yet to monetize that, except in politics and self-help seminars.

I agree, and it is the frustration of ego-driven projects that led to what I experienced yesterday.

The presenters used lots of really good, but vague, language.

"We will incorporate clean data." Well, sure. That should go without saying. But what, exactly, does that mean, and how do you intend to get it or make it clean?

That's the tricky part, and that's the part that raises my skepticism meter. When I asked that I received a dismissive response.

My boss, a rather uninformed but enthusiastic neophyte, decided that this marked me as 'negative', and she spent time asking me to get on board. I told her I AM on board - this is needed. But I am skeptical of vague assurances and misleading information. Sure, it all sounds good - you can make anything sound good when you're trying to win approval. But to me, winning approval means providing detail that shows you know what you're doing.

The fact they've failed twice already, which I found out after the meeting, says to me they are just looking to keep their jobs going for as long as they can.

And how much do the data scientists actually know about the company's products, its customers, its sales process, its support process? Not just as a primitive quantitative framework, but in terms of deep understanding? Not much, I'd guess.

"Fifty years or more ago the Uncle Henrys and the Charlie Kellstadts dominated; then it was necessary for Son Irvin to emphasize systems, principles, and abstractions. There was need to balance the overly perceptual with a little conceptual discipline....But now we again need the Uncle Henrys and Charlie Kellstadts. We have gone much too far toward dependence on untested quantification, toward symmetrical and purely formal models, toward argument from postulates rather than from experience, and toward moving from abstraction to abstraction without once touching the solid ground of concreteness."
