Posts categorized "Books"

My review of Jim Albert's book Visualizing Baseball is on the sister blog. As I mentioned in the review, a more appropriate title would have been "Fast Intro to Baseball Analytics". Thus, the book may be of interest to some of the readers here.

***

Over the past year, I've been experimenting with the video medium, producing clips on various topics.

***

I was lucky to earn my Operations Research degree in a department filled with professors who have advanced our thinking about stochastics, learning from the likes of Erhan Cinlar, John Mulvey, Warren Powell, and Rob Vanderbei. In operations research, much of the early seminal work deals with "deterministic" problems, in which the data are assumed to be fixed and available, but most real-life problems involve data that are highly variable or missing. Replacing the distribution of values by their averages allows those deterministic solutions to be computed, although it is widely known that such solutions are not optimal. In other words, we must embrace variation.

These thoughts came to mind as I started reading Algorithms to Live By by Christian and Griffiths. In the first 20 pages, they kept returning to the altar of deterministic, precise engineering solutions.

The book begins with the classic optimal stopping problem, in which one faces a sequential search: at each step, one can either make the decision and stop searching, or delay the decision and keep searching. House hunting is an example of such a problem. The authors pronounce that there is a provably optimal solution: spend the first 37 percent of the house-search period delaying the decision, after which one takes the next house that is better than any previously seen.

They eventually do admit that this 37-percent solution is optimal only under a set of highly unrealistic assumptions, such as that one can never bid on a house one passed over before, and that the seller will always accept one's offer.
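
Under those idealized assumptions, the 37-percent rule is at least easy to verify by simulation: skip the first n/e candidates, then take the first one that beats them all, and you end up with the best house about 37 percent of the time, which is 1/e. A minimal sketch in Python (the function names and parameters are mine, not the authors'):

```python
import math
import random

def run_trial(n, cutoff):
    """One house search: skip the first `cutoff` candidates, then take the
    first candidate better than everything seen so far."""
    order = random.sample(range(n), n)       # arrival order of n distinct qualities
    best_skipped = max(order[:cutoff], default=-1)
    for quality in order[cutoff:]:
        if quality > best_skipped:
            return quality == n - 1          # stopped here; win iff best overall
    return False                             # the best was among those skipped

def success_rate(n=100, trials=100_000):
    cutoff = round(n / math.e)               # skip ~37% of the candidates
    return sum(run_trial(n, cutoff) for _ in range(trials)) / trials

print(success_rate())                        # ~0.37, i.e. roughly 1/e
```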

It's not the over-selling of the math that annoys me. Assumptions can be relaxed and complexity added to the base scenario. It's the constant harping on 37 percent. It's not 35 percent, it's not 39 percent. It is precisely 37 percent, and you better believe it!

No one should believe it, because another huge assumption behind these "provably optimal" solutions is that the data supplied to the problem are available and precise. That's far from the truth! In a variant of the optimal stopping problem, which the authors also describe, the decision-maker is supposed to have "full information" about the value of the house; that is to say, one knows the value of each house being viewed, expressed as its precise percentile in the distribution of all house values!

This leads to the hair-raising moment in which the authors declare on page 21, "if the cost of getting another offer is only a dollar, we’ll maximize our earnings by waiting for someone willing to offer us $499,552.79 and not a dime less." (By now, the problem setting has flipped to selling rather than buying a home.)

As formulated, the problem yields a "closed-form" formula, which spits out an answer to two decimal places. If one accepts uncertainty and embraces variation, one would not care about dimes.
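
For the curious, that threshold appears to fall out of a simple marginal calculation: keep soliciting offers as long as the expected gain from one more offer exceeds its cost. Here is a sketch in Python, assuming offers are uniformly distributed over a $400,000-to-$500,000 range (an assumption on my part, chosen because it reproduces the book's figure) and a $1 cost per offer:

```python
import math

def reservation_price(lower, upper, cost):
    """Break-even selling threshold, assuming offers uniform on [lower, upper].

    The expected improvement from one more offer over a standing threshold t
    is (upper - t)**2 / (2 * (upper - lower)); setting this equal to the cost
    of soliciting that offer and solving for t gives the closed form below.
    """
    return upper - math.sqrt(2 * cost * (upper - lower))

print(f"{reservation_price(400_000, 500_000, 1):,.2f}")   # 499,552.79
```

The two decimal places come from the formula, not from anything knowable about house offers; shift the assumed range by a few thousand dollars and the dimes (and many dollars) move with it.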

***

I still plan on finishing the book, and will write a detailed review when I do. This post is not a refutation of the entire book. It's all too common to take formulas as god-given objects. Between unrealistic assumptions and uncertain data, one can only hope for general guidance and approximate solutions to these problems.

Seth Stephens-Davidowitz has written a fascinating book calling for social scientists to use data collected by Google or Facebook in their research. This is a controversial issue; if it weren't, it wouldn't warrant a full-length book. Google does not publicly release its search data, but provides some pre-processed, aggregated statistics through services such as Google Trends and AdWords. Researchers who use these data have no control over their collection or processing. (Seth previously worked at Google, and has written some columns for the New York Times.) Facebook does not publish its data either, although social media users have a lower expectation of privacy.

Such big datasets come with a set of knotty problems, which I have previously summarized as OCCAM. In addition to having little control over the data's origin, the researcher's purpose typically diverges from that of the data collector. The data are found or observed, not experimental. They are often treated as "complete," or essentially complete, by the researcher, which is an assumption, not a fact. And when other data sources are merged in, the researcher inevitably introduces errors. Here is my previous post about OCCAM datasets (link). It is not surprising that classically trained scientists have reservations about such datasets, especially if they are interested in causal mechanisms. Nevertheless, I agree with Seth that we can make progress on solving these problems if we start taking them seriously.

In writing the book, Seth carried out a number of mini-studies using mostly Google Trends data. Here are 8 things I learned from reading Everybody Lies:

Some people use search engines as confessionals. They type complete sentences like “I am sad” or open-ended questions like “Is my daughter ugly?”

People assume machines (like the Google search engine) will keep their secrets. For sensitive topics, Google may generate more honest data than surveys. People ask Google many questions that I'm sure they would never pose to a librarian.

Google searches for “Obama” are frequently paired with “kkk” and the “n” word. The prevalence of racist searches does not exhibit a North-South divide; it's East-West.

As President of Harvard, Larry Summers spent quite a bit of time brainstorming with Economics PhD students on how to beat the stock market using new data. (And they came up empty-handed, or so they say.)

Anthony Weiner got rejected from Stuyvesant High School (the famous NYC public school), missing the cutoff score on the admissions test by one point.

Some economists found that going to Stuyvesant conferred no meaningful benefit to one's career; at least, this is the case for those who scored close to the cutoff on the admissions test.

There are 6,000 searches on Google a year for “how to kill your girlfriend,” while there are 400 murders of girlfriends.

At the aggregate level, “big data” does not provide any insights that surveys can't, so people slice and dice the data to examine “micro” segments, which means they are analyzing a huge collection of small data sets.

The research mentioned in Seth's book comes primarily out of the economics discipline, and can be considered in the tradition of Freakonomics (2005). There are examples of natural experiments, as popularized by Steven Levitt. Seth brings the coverage up to date, describing regression discontinuity, field experiments, and other techniques currently favored by econometricians. For those interested in what comes after Steven Levitt, this is a good place to start.
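
For readers who haven't met regression discontinuity: applicants scoring just below and just above an admissions cutoff are essentially interchangeable, so a jump in their later outcomes at the cutoff estimates the causal effect of admission. Below is a toy sketch with simulated data (all numbers invented, not from the actual study) in which admission confers no benefit, mirroring the Stuyvesant result:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated applicants: admissions test score and later earnings.
# By construction, being admitted adds nothing to earnings.
n = 10_000
score = rng.uniform(400, 600, n)
cutoff = 500
admitted = score >= cutoff
earnings = 20_000 + 60 * score + rng.normal(0, 4_000, n)   # no admission bump

def value_at_cutoff(x, y):
    # Fit a local line and read off its value at the cutoff.
    slope, intercept = np.polyfit(x, y, 1)
    return intercept + slope * cutoff

# Compare the two fitted lines where they meet at the cutoff.
band = 25
below = ~admitted & (score >= cutoff - band)
above = admitted & (score <= cutoff + band)
jump = value_at_cutoff(score[above], earnings[above]) \
     - value_at_cutoff(score[below], earnings[below])
print(f"Estimated effect of admission: {jump:,.0f}")       # ~0, up to noise
```

A real analysis would use actual records and more careful local fitting, but the logic, comparing near-winners to near-losers at the cutoff, is the same.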

In his book, Everything is Obvious (Once You Know the Answer): Why Common Sense Fails, Duncan Watts, a professor of sociology at Columbia, imparts urgent lessons that are as relevant to his students as to self-proclaimed data scientists. It takes only nominal effort to generate narratives that retrace the past, Watts contends, but developing lasting theory that produces valid predictions requires much more than common sense.

Watts’s book is a perfect foil to pop sociology books such as The Tipping Point and Outliers. Many a time while reading Everything is Obvious, I was reminded of Malcolm Gladwell by a passage here or there, but the lessons drawn by Watts are decidedly more complex and nuanced.

Several chapters in the book begin with stories that could easily have come from Gladwell's pen. In Chapter 9, Watts recalls the night James Gray, an NYPD veteran, killed a family while driving drunk.

What happened next isn’t completely clear, but the record indicates that as Officer Gray drove north on Third Avenue, under the Gowanus Expressway overpass, he ran a red light. Definitely not good, but also perhaps not a big deal. On any other Saturday evening, he might have sailed right on through and gotten safely to Staten Island…

Using this case study, Watts shows how society cares more about outcomes than processes. To replay the scenario in one's mind's eye is to realize that the unfortunate victims would have been safe had any number of events intervened to make Officer Gray “late” to the scene by just one second.

I imagine Gladwell would have embraced the James Gray story and folded it into one of his crowd-pleasing narratives wrapped around some simple truth about humans, such as “stars are made, not born.” One anecdote, a couple of research papers, a trifle of interviews: with this ammo, Gladwell has dazzled and swayed a public that craves broad-brush descriptors of our world.

Thankfully for this reader, Watts does not go there. The milieu of Everything is Obvious is far more complex. Watts resists reducing sociological phenomena to simple terms. His stories tend to leave readers hanging; we learn that researchers have not quite figured out the goings-on. The James Gray story, Watts explains, is about how bad deeds can have harmless outcomes, good deeds can have bad outcomes, bad outcomes can arise from harmless deeds, and so on. Watts challenges us to think more broadly, to look beyond the outcomes to the processes.

On this very issue, I wish sportscasters would read Everything is Obvious; then they might cease calling for the coach's head when he orders a pass play that fails on the last play of a football game, with his team inches from the goal line. The coach is deemed to have made a bad call because the play failed, but the same call would have been celebrated as the clinching moment on his resume had a touchdown been the outcome.

Watts is not afraid to admit when sociologists have no answers. Such indeterminacy may frustrate some readers, or startle them. Popularizers of science, following the tradition of Gladwell, have invented a world high on consensus and low on uncertainty. In Everything is Obvious, Watts, who converted to sociology from physics, repeatedly describes failed theories and false starts.

The book also works well as an overview of the field of sociology, its key research areas and avenues. Many of the topics covered are thoroughly modern, such as the rise of crowd-sourced data, the running of large-scale experiments, the complexity of systems, and the futility of chasing after a “grand unifying theory” of sociology (and by extension, economics, and other social sciences).

Watts and his associates have made valuable contributions to some of these subjects. He held a research position at Yahoo! Research Labs at the time the book was written, and thus had access to industrial-grade systems and datasets. Several chapters of the book are concerned with the nature of making predictions. I particularly enjoyed the material about choosing one's battles, the idea that some things are just not meant to be predicted. This important lesson has been laid to waste in the Big Data age: too many data scientists issue pointless predictions just because they have datasets that can be fed to algorithms.

I share Watts’s obsession with measuring the accuracy of predictions. One of his research papers investigated prediction markets, concluding that they were not meaningfully better than experts, contradicting the unwarranted hype about such markets (and their crowd-sourcing ilk). The only surprise here is the scant mention in the media, who otherwise eagerly publish any hearsay promoting the wisdom of crowds.

The title of the book is worth another moment’s thought. Everything is obvious, Watts says, when viewed in hindsight. The hyperventilation after each “terrorist” attack is instructive. Inevitably, a reporter files a story disclosing that the accused killer left a warning on his Facebook account. Such hindsight is taken as evidence of law enforcement’s failure to secure the nation. But if, right now, we were to flag every Facebook message that contains a threat of violence, we might have to investigate thousands, if not hundreds of thousands, of people, nearly all of whom will never become killers. The menacing Facebook note becomes known only after its writer has committed a heinous act.

By its nature, Watts’s book does not read as quickly as your average pop sociology book. It contains quite a bit of philosophising, and urges readers not to take things at face value. Common sense is helpful, but limited.

I've learned one thing about book readers. There is a lag between buying a book and reading it. In fact, I imagine an author has two battles to win: one is at the bookstore (or Amazon) getting you to purchase the book; the next is to get you to pull out the book from your shelves and start reading it. I mean, what am I to say? I own shelves of unread books.

Anyway, reviews of Numbersense are trickling in. The following two long reviews make me happy, as these readers get the points I'm trying to get across.

Here's Thinkanator's review (link). He is apparently tired of the heavily-groomed media coverage of Big Data, and wanted the world to know about "Real Big Data," the less glamorous side of the job. He noted: "In the epilogue Kaiser shares one of his own Real Data Science stories and I found myself nodding my head and saying, “Yup, that’s how I spent several days in the last couple of weeks!”"

Thinkanator probably wrote a better summary of the book than I have: "Numbersense is a wonderful and accessible book that consists of a series of stories about data that illustrate how to think about the kinds of statistics you read about on a daily basis. The emphasis isn’t mathematical, it’s more about when you should think, “Hmmm… that doesn’t sound right”, when you hear some statistics thrown around."

Another reader reviewed the book in a series of three posts. In Post 1, she highlights the sections of the book that she enjoyed most. In Post 2, she applies the Groupon analysis to her own business. In Post 3, she draws lessons from the chapter on unemployment statistics.

Just a little while ago, I showed an example of imprecise algorithms and how they cause incorrect historical facts to be promulgated. The point is not that algorithms are scary things but that we should not confuse efficiency with accuracy (or truth).

So this past week, I had another encounter with imprecise machines, and this time, it's personal.

***

If you go to Amazon right now, and search for my name in quotes "Kaiser Fung", you will get several versions of my 2010 book Numbers Rule Your World, including the recently published Chinese translation, but you will not see my 2013 book Numbersense at all.

If you instead search for Kaiser Fung without quotes, the first match is Numbersense, followed by the older book.

To me, this is a clear mistake.

However, Amazon doesn't think so. This is what the customer service rep wrote me:

I understand your recent book "Numbersense" is not appearing in the search results when you search with your name "kaiser fung", including the quotes.

When you use our search engine to look for books, our system attempts to find the products you're most likely to be looking for based on the words you entered. Our search methods go beyond simple keyword matching and may also be using information not visible on the search results page, including attributes provided by the publisher.

So apparently, people who search for my name in 2013-4 are looking for my 2010 book instead of my 2013 book. In addition, my publisher has given them attributes to suppress my recent book from the search results.

Search results for books may be based on the text of each book, not just its title. That's why you may sometimes see results you weren't expecting.

I don't even understand what this sentence could mean in the context of a search for an author's name.

I regret that we haven't been able to address your concerns to your satisfaction.

We won't be able to provide further insight or assistance for your request.

Thank you for contacting us.

This is rather shocking coming from the gatekeeper of most book sales.

***

Needless to say, this type of error costs authors, as some people won't find the book. Yet Amazon is unwilling or unable to fix the issue. Are there any readers here who have insight into why this is happening and how I might correct this error?

Arati Mejdal has compiled a list of 20 books, which should interest my readers: link

Alberto Cairo selects Numbersense as one of his favorites of 2013: link. Pay attention to his recommendations for visual design books.

Rakesh Arya writes a review of the book, concluding: "The examples are quite apt with the topics and easily understood as they are explained in simple language without using the statistical jargons. It’s a good-read for those who are interested in behind-the-scene-theories and like to crunch numbers to find out the truth. But obviously it’s not a casual read." link.

***

In related news, my first book, Numbers Rule Your World, appeared in China (Mainland) this fall. I'm proud to be published by the press of Renmin University, one of the top schools in China.

There is a really nice article about the book. I'll translate it for you when I have time. If you can read Chinese, it's here. The title of the article is: "In the Big Data era, how should you analyze data?"

The book was chosen as one of the top Finance/Business books in the Chinese press. (link)

***

To gift Numbersense or Numbers Rule Your World to your data-loving friends for the holidays, click any of the purchase links.

I left a comment on one of Andrew Gelman's recent posts about Malcolm Gladwell (link). That post briefly discusses a review of Gladwell's recent book.

A commenter ("Haile") made the following defense of Gladwell:

My point is that many of these criticisms are based on Gladwell’s failure to present rigorous statistical evidence of arguments that are not statistical in nature in the first place.

when (if) reading Gladwell, it’s time to put down the statistical hammer, because virtually nothing that he talks about has anything to do with phenomena that are statistical in nature.

The argument is simply that if the author is merely describing one person, a sample size of one, and not attempting a generalization, then statistics has no standing in this conversation.

In my response, I will gloss over the important issue of whether Gladwell is making general comments about human psychology or sociology in his books. I will consider what it means to study an individual sample by itself.

It turns out that reducing the scope of the data makes the analysis harder, not easier. This may feel counterintuitive.

In my response to the comment, I used one of the examples from the new book. The greatness of David Boies, the famous lawyer, is argued to be a result of his dyslexia, which is billed as a "desirable difficulty". Here is what I said:

Let’s examine the theory that Boies is a great lawyer because he is dyslexic. How would you prove/disprove such a statement? All we observed are two separate things: Boies is a great lawyer; and Boies has dyslexia. Well, maybe a third thing, which is that Boies said he became a great lawyer because he is dyslexic. A correlation does not a causation make. First-person speculation is not science either.

If Gladwell is attempting a generalization, he could expand his data set to multiple Boies, and hope to find stronger evidence for the theory. If he is forced to use N=1 samples, his task is more formidable, not less.

Note that saying Boies benefited from his dyslexia is very different from saying Boies is X feet tall. The latter is purely descriptive and can be verified easily. The former is a causal claim, and a much harder problem.

Andrew's other post on the Gladwell genre is really good. It captures many of my own sentiments.

Yesterday, Larry Cahoon, a 29-year veteran at the Census Bureau, answered some questions. The rest of the interview is printed below.

***

KF: How can a data analyst improve his or her skills?

LC:

I have to say my best training has been many, many hours of just playing with statistics, playing with graphics, and reading the analysis others have done. The more data I see, the more analysis I do, the more graphics I look at and produce, the more I learn about how to look at data and how to see the pitfalls in the data analysis. This is getting down in the mud and dealing with real data with all of its warts. My wife tells people I’m a statistician through and through, as every time she looks at my computer, there is a graphic of some type on the screen.

Finally, I have always been an avid reader of Science Fiction. No one can read Science Fiction without being forced to consider any problem from different perspectives and to take into consideration differing assumptions. This in turn has helped me develop the ability to question the assumptions being made in any data analysis.

***

KF: What are your pet peeves with published data interpretations?

LC:

I seem to return again and again to the same issues with the analyses I see in the newspapers, online, and just about anywhere. The most basic problem is one of incomplete analysis.

We see so many papers and news reports where a data difference is observed and then the author goes off on an entire line of speculation without any data to justify it. This line of thinking frequently ends up with the claim that these two things are correlated, and therefore we have cause and effect.

The media then fan these reports by writing a story without asking basic questions, such as whether the data itself is any good or whether there is any evidence for the claims being made. The media act as if the claims have been proven, especially in how they headline the story.

My second pet peeve is what I call an emphasis on a one-dimensional world. This is usually reflected in simple statements like: A causes B. The world is much more complex than that. Those who investigate airline accidents have been telling us for some time that there is seldom just one cause for each accident. Rather, there are a number of causes. We need to carry that knowledge over to our statistical analysis and reporting.

***

KF: Which source(s) do you turn to for reliable data analysis?

LC:

I can’t say that I have any favorite source for data analysis. If forced to name one, I would say that I tend to like the work of the Pew Research Center (link). Their surveys seem to be well designed, the questions they ask well thought out, and the analysis something I can trust.

I like the data that is available from the Federal Government. But the government agencies rightly avoid most detailed data analysis in an effort to remain nonpartisan.

***

KF: Thank you so much for your time. We're lucky that you continue to blog in your retirement.