
That was the question I asked when first hearing about the mobile app that ages you 30 years. FaceApp went viral, and became even better known when some politicians proclaimed it unsafe because it was run by Russians.

It turns out FaceApp also allows users to make themselves look younger. It seems that the aged photos are more likely to be shared than the younger-looking ones. Why is that?

The issue raised in the clip is important for any machine-learning developer. Depending on your application, the errors made by the algorithm may be more or less visible to the users.

I discussed this aspect of decision-making in Chapter 4 of Numbers Rule Your World (link), the chapter about steroid testing. When the lab makes a false-positive error, the falsely accused athlete will complain very loudly. When the lab commits a false-negative error, the cheating athlete will not make a sound. This creates a situation in which the lab's false-positive errors are highly visible while almost all false-negative errors are hidden from view.

It is generally believed that athletes and testing labs are playing a cat-and-mouse game. Given the above, is that really true?

For the sports stuff, you can read my book. For how this applies to FaceApp, play my video:

If you like our videos, support us by subscribing to the channel and sharing.

My new article, joint work with Pravin and Sriram, appeared in FiveThirtyEight this week. It was motivated by this chart, showing the lengths of baseball games:

Since 2000, the average Major League Baseball game has lasted over three hours; one out of ten games takes over 3.5 hours. The added time is almost all non-action time. In fact, the Wall Street Journal once logged a bunch of games and discovered that less than 10 percent of game time could be classified as "action." For the uninitiated, "non-action" time includes pitchers staring at the batter, batters taking practice swings, managers slow-walking to the mound to chit-chat with the pitcher, and so on.

To its credit, the league office has started to pay attention to this problem, although an agreement with the players' union seems hard to come by. Some fans are taking matters into their own hands - why fight the crowd to get out of the stadium when you can leave in the middle of the game, with one team "comfortably" ahead?

In our article, we frame this as a stay-or-leave decision, and use Wald's Sequential Probability Ratio Test to scientifically determine the optimal time to leave. Here's the bottom line:

A back-of-the-envelope estimate pegs the total time saved at 1,612 hours for the season. The price to pay was 61 mistakes — dramatic comebacks heard on the radio from the parking lot by the fans who left early.
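For readers curious about the mechanics, here is a minimal sketch of an SPRT-style stay-or-leave rule. The inning-level probabilities and error rates below are illustrative assumptions, not the parameters from our article:

```python
import math

def sprt_leave_decision(innings, p1=0.8, p0=0.5, alpha=0.05, beta=0.05):
    """Wald's SPRT for the stay-or-leave decision (illustrative sketch).

    H1: the lead is safe; the trailing team fails to score in a given
        inning with probability p1. H0: a comeback is brewing, and that
        probability is only p0. Each entry of `innings` is 1 if the
        trailing team failed to score that inning, else 0.
    """
    upper = math.log((1 - beta) / alpha)  # cross above: accept H1, leave
    lower = math.log(beta / (1 - alpha))  # cross below: accept H0, stay
    llr = 0.0                             # cumulative log-likelihood ratio
    for i, x in enumerate(innings, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return f"leave after inning {i}: the lead looks safe"
        if llr <= lower:
            return f"stay after inning {i}: a comeback is plausible"
    return "undecided: watch to the end"

# Example: the trailing team goes scoreless for seven straight innings.
print(sprt_leave_decision([1, 1, 1, 1, 1, 1, 1]))
```

The appeal of the sequential test is exactly what the fan wants: you update the evidence inning by inning and leave as soon as it crosses a threshold, rather than committing to a fixed decision time in advance.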

Yet another Tour de France winner is entangled in a doping scandal - what a surprise. This time, it is the British rider Bradley Wiggins, who is the subject of a damning report.

His team is accused of supplying Wiggins and his support riders with a banned substance under the guise of medical need. The reality is they have not violated anti-doping rules because the authorities allow so-called Therapeutic Use Exemptions (TUEs).

I addressed this and various other issues surrounding anti-doping tests in my book, Numbers Rule Your World, years before the Lance Armstrong scandal broke the silence.

TUEs are widely abused in the sports world. For example, the data suggest that either swimming causes asthma, or that people with asthma are much more likely than the general population to engage in world-class competitive swimming.

TUEs are a major source of false negatives in the world of anti-doping tests - here, I define a false negative as failing to catch a doper, which is broader than a doper passing the tests. In the book chapter, I explain why dopers face very little risk of getting caught by anti-doping measures.

ABC News reported that Ricky Williams, the former NFL star, proclaimed himself the holder of "the world record for most times drug tested". (link) He said he was tested 500 times.

During his 11-year career, Williams failed the test four times. So there is one thing we know - the drug-testing regime is not much of a deterrent.

Since the athlete knows when he is juicing, he is privy to an estimate of the false-negative and false-positive rates of the testing regime. If someone keeps rolling the dice, he or she probably knows the tests are not that effective. In my book, I showed with some simple math that almost all juicers would pass these tests.
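To give a flavor of that simple math, suppose (purely as an assumption for illustration - the true figure is unknown) that each test has only a 1 percent chance of catching an athlete who dopes throughout. Then a 500-test record like Williams's is entirely consistent with constant doping:

```python
# Back-of-the-envelope only: the 1% per-test sensitivity is an assumed
# number for illustration, not a published figure.
s = 0.01          # assumed chance that a single test catches an active doper
n_tests = 500     # Williams's self-reported count
print(s * n_tests)         # expected failed tests: 5.0 (he failed 4)
print((1 - s) ** n_tests)  # chance of a spotless record: ~0.0066
```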

The number 500 itself is useless. It's all about the protocols. Is the testing really random? When are athletes informed of the test? How is the sample collected? (For example, his wife disclosed that the tester left the sample standing for 45 minutes to "go get stickers" to identify the source of the sample. Sure.) What checks are in place to prevent tricks like using other people's urine, diluting the sample, and so on? Is there off-season testing?

It doesn't matter how many times he is tested. What we should care about are the protocols used in these tests.

I have really been enjoying ProPublica's pieces lately. Several recent articles cover topics of great interest to me, and readers of my books will be familiar with the themes.

My favorite is an article that speaks a truth about data projects -- much as we sweat over data collection, data integrity and statistical models, the true challenge is persuading the rest of the world to adopt our end products. The title of this piece says it all: "The FBI built a database that can catch rapists--and almost nobody uses it." (link)

The data project in question is an early effort to link data from multiple sources, leveraging correlations to identify serial offenders. However, less than 10% of local police departments contribute data to the system, rendering it toothless. In my experience, it is common to find data projects stuck in first gear, failing to make any real-world impact.

Kudos to the authors for asking the dirty question of the return on investment of such a system. It is believed that in 12 years, the system may have helped solve 33 crimes. It costs $800,000 per year to maintain (most likely, contractor expenses). You do the math!
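Doing that math with the figures quoted in the article:

```python
# Cost per crime solved, using the numbers cited above.
annual_cost = 800_000            # dollars per year to maintain the system
total_cost = annual_cost * 12    # 12 years of operation
print(total_cost / 33)           # ~ $290,909 per crime possibly solved
```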

For managers, the key is to diagnose properly the reasons for inaction. Lack of adoption is frequently blamed on technology but the reality is much more complicated.

David Epstein reports on raids on steroid labs. (link) Law enforcement is the most effective way to catch cheaters in sports. In Numbers Rule Your World, I explained why anti-doping tests are ineffective, in the sense that false negatives are rampant, letting lots of dopers off the hook. This conclusion comes from a simple statistical calculation. In the chapter on a lottery cheat, I described how statistics can be used to prove that "someone has beyond reasonable doubt cheated" but physical evidence is required to nail the perpetrator.

Epstein then expanded the conversation: "World-class athletes are merely the fine layer of frost atop the iceberg’s tip when it comes to the steroid economy." The headline of the piece is "Everyone's Juicing".

I find it interesting that Epstein said "In years of reporting on performance-enhancing drugs, I’ve frequently been asked why athletes in smaller sports or facing lower stakes would dope, given that there’s little money in it for them." This feels odd because when I was researching my book five or six years ago, I heard the opposite claim: that elite athletes couldn't possibly be doping because they don't need steroids. (Think Barry Bonds back in the day.) This tells me that (a) public opinion has shifted due to the Armstrong revelations and (b) the human mind will rationalize any story, even when the story flips.

Epstein wrote another article in August about false negatives, which should be familiar territory for my readers (link).

***

Joaquin Sapien reports on the case of one Ruddy Quezada, who was released after spending 24 years in prison for murder. This case reminds me of the Innocence Project, whose amazing work I featured in Numbers Rule Your World. In the current case, though, we don't know whether Quezada was innocent, only that the prosecution lied about how it coerced the witness to testify. The witness's testimony was the only piece of evidence in the case, which means the prosecution is left with no avenue to re-try it.

The case I used in my book concerns false confessions so both cases deal with coerced evidence.

The assumptions seem very dubious to me, and I would love to see a critique of their methods.

Andrew forwarded the email to me. I wrote about steroid testing and the Tour de France in my first book, Numbers Rule Your World. The gist was that those tests fail to catch most dopers, which of course is not a controversial statement today. But before the Armstrong confession, people believed all sorts of things.

I clicked on the link, read through the nicely written summary of how they estimated the probability of doping, and wrote back to Gabe:

It looks like a reasonable way of proceeding to me. When you say "The assumptions seem very dubious to me, and I would love to see a critique of their methods.", what do you mean exactly? Are you unhappy with the method or are you unhappy with the assumptions (such as the prevalence of doping, the effect of doping, etc.)?

He replied:

The part of the analysis that seemed most dubious to me was the estimate of the prevalence of doping in the peloton today, estimated at 25%. The authors seemed to think that was a conservative estimate, but it sounds very high to me. If it was 10 or 15 years ago, sure, I think the number would be even higher than that. But given the monitoring that has been put into place by UCI since then, including the biological passport and year-round monitoring, if the number of riders doping was nearly 25% then I would expect that the number of positive doping tests would be much higher than it is. But of course it's hard to know, and perhaps the doping is just very sophisticated micro-dosing.

Also, for the distribution of W/kg, the parameters of the Gaussian distribution for clean riders (shown in the first plot) suggest that there is a very narrow range of W/kg values for clean riders in the peloton. I don't know enough about sports physiology to know whether this is true, which is part of why I was looking for critical analysis from others. I would have expected the standard deviation to be larger.

Finally, I wonder if the Gaussian distribution is even appropriate here when modelling the distribution of W/kg values for the most elite athletes in the world.

That was a nice deconstruction of the model used for the prediction. The model postulates two normal distributions, one for clean riders and one for dopers, and then mixes them.
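Here is a minimal sketch of such a mixture model in Python. The means, standard deviations, and the 25% prior below are stand-ins I made up for illustration; they are not the VeloClinic parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def prob_doping(wkg, prior=0.25,
                mu_clean=5.8, sd_clean=0.15,   # assumed clean-rider bell curve
                mu_doped=6.2, sd_doped=0.25):  # assumed doper bell curve
    """Posterior probability of doping given observed W/kg, via Bayes' rule."""
    f_doped = prior * normal_pdf(wkg, mu_doped, sd_doped)
    f_clean = (1 - prior) * normal_pdf(wkg, mu_clean, sd_clean)
    return f_doped / (f_doped + f_clean)

# The further into the right tail, the more the doped curve dominates:
for wkg in (5.8, 6.0, 6.2, 6.4):
    print(wkg, round(prob_doping(wkg), 3))
```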

Now my response to Gabe's critique.

Generally speaking, this type of conversation falls into the same vein as "deflategate" in the NFL. Statistical modeling allows us to put a probability estimate on the chance of foul play. Note that it is an estimate, based on a model of the world. Further, note that cheaters are adversaries: they hide their tracks, and most analyses simply fail to take that into account.

All models can be challenged on the grounds that you don't like the structure of the model. I personally don't like that kind of critique unless someone can offer an alternative model that both sides agree to be better.

On the specifics about the prevalence of doping, I wrote:

This is the issue of the "prior" in Bayesian statistics. In theory, you can run sensitivity analysis to see how the 25% assumption changes things.

In reality, I don't think it matters. Their model shifts the bell curve upwards for dopers (which I think is reasonable). This means that if you focus on the extreme tail of the combined distribution, the majority of those extreme performers will always come from the doping distribution.
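To see this concretely, reuse the prob_doping sketch from above and vary the prior at an assumed extreme performance level:

```python
# Sensitivity check: at an extreme W/kg (6.5 is an assumed value), the
# posterior barely moves as the prior swings from 5% to 40%, because the
# clean-rider density out there is vanishingly small.
for prior in (0.05, 0.15, 0.25, 0.40):
    print(prior, round(prob_doping(6.5, prior=prior), 3))  # all close to 1
```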

Later, Gabe confirmed what I just said, as he played with the spreadsheet from VeloClinic:

Based on your comments, I should clarify that I am not opposed to attempts to estimate the probability of doping given power, and I accept that all models are approximations. I do think it's appropriate to question whether the parameters of a model are sensible, and I gave some specific examples that I thought could be improved. Also, I am not at all opposed to them using W/kg as a proxy measure.

Playing around with their provided spreadsheet ... does confirm that changing the prior does not have a big impact. However, changing the standard deviation of the clean rider distribution does.

To which I responded:

I think the question that gets to the core of this is: what should a reasonable model predict to be the chance of doping given extreme levels of performance? When you get into the extremes, it seems to me any reasonable model would say the same thing this model is saying - that the more extreme the performance, the more likely the athlete has doped. I don't think statistical modeling can give you the kind of answers you are looking for.

I'd add that, given the revelations around Armstrong, we should be modeling the opposite: the more naturally gifted a performer, the more likely he/she dopes. People used to say the reverse, but they were wrong. Because of this, it is even more difficult to separate doping from talent!

In the email exchange, I didn't comment on the standard-deviation point that Gabe raised. By increasing the standard deviation of the clean riders, you explicitly allow for more extreme performers who are clean, so what he said sounded right. In that scenario, I'd have shifted the dopers' distribution proportionally, in units of standard deviations.

I say there is great progress because reporters no longer believe the story that if you pass dozens of tests, you must be clean!

[I should add that Gabe later pointed out that it is no longer clear whether Froome was an outlier or not on that proxy metric for performance.]

I like this passage very much; it really drives home the point that good analytics requires intuition:

The hours of waiting [during draft meetings] were often filled with watching film of prospects. It helped me refine my analysis, as I soaked up details from scouts that I never would have seen on my own. ("Rewind that. ... Did you see his foot placement there, getting ready for the rebound? That's NBA ready.") During one of these sessions, we were watching film of Syracuse point guard Jonny Flynn. I mentioned that, based on the rate at which he collected steals, he was likely a good defender. But one of the scouts explained that Flynn's steal total was likely higher than other point guards' because Syracuse played mostly zone defense, which allowed guards to attack the ball more. I checked that insight against the data and it seemed true, so I adjusted my defensive statistics to account for the dominant style of defense used by a player's team.

I'm glad to hear that style of play is included in the models. I cringe every time I hear a (usually English) football (i.e. soccer) commentator claim that a team "deserves" to be in the lead because it is dominating time of possession when, in fact, the other team is playing a counter-attacking strategy. When the other team ekes out a 1-0 victory on a sneak attack, the commentator is at a loss for words.

Alamar sees the next big challenge in NBA analytics as deriving value from the SportsVU data. What is SportsVU? Alamar tells us they installed cameras everywhere that "capture the coordinates of 10 players plus the ball 25 times every second." This is the typical "Big Data" scenario--data is collected without any design or any research question in mind. It raises a few intriguing questions:

The granularity of such data (here, 25 measurements a second, i.e. four hundredths of a second apart) can be made arbitrarily fine. When have we reached the point of picking up just background noise?

Relating such data as "predictors" to outcomes such as scoring statistics presupposes a model in which the precise movements of the players or the ball are correlated with those outcomes. Whether we like it or not, any resulting analysis will take on a causal interpretation--this is what separates trivia from actionable insight. Is this type of predictor the most relevant for explaining outcomes? If we are not careful, we may believe this story simply because it's the one we started with.

Last week, I pointed out the futility of using data as proof or disproof in Deflate-gate. Emphatically, a case of "N=All" does not make things better. I later edited the post for HBR (link).

In this post, I want to address a couple of more subtle technical issues related to the Sharp analysis, which can be summarized as follows:

1. New England is an outlier in the plays per fumbles lost metric, performing far better than any other team (1.8x above league average).

2. Different ways of visualizing and re-stating the metric yield the same conclusion that New England is the outlier.

3. There is a dome effect of about 10 plays per total fumbles, meaning that teams playing indoors ("dome" teams) typically record about 10 more plays per fumble than teams playing outdoors ("non-dome" teams). New England is an outdoor team that performs better than most dome teams on the plays per total fumble metric. If dome teams are removed from the analysis, New England is an outlier.

4. Assuming that the distribution of the metric by team is a bell curve, the chance that New England could have achieved such an extraordinary level of play per fumbles lost is extremely remote.

5. Therefore, it is "nearly impossible" for any team to have the New England type ability to prevent fumbles... unless the team is cheating.

***

Focus on Point 4 for the moment. This is a standard technique used by statisticians, and the basis of any analysis of "statistical significance". In statistical significance testing, we appeal to the normal distribution (bell curve) to estimate how close the observed sample is to the "average sample". The big question being addressed is: IS THIS AVERAGE?

Let's say we want to measure the effect of genetic modification on the size of fish. If the Franken-fish sample is far from the average of natural-fish samples, we conclude that Franken-fish are statistically different (larger) from natural fish. A crucial requirement of this analysis is that the samples are randomly drawn.
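As a bare-bones illustration of such a test, with made-up numbers and assuming a one-sided z-test:

```python
from statistics import NormalDist

# Made-up example: are GM fish larger than natural fish on average?
mu, sigma = 30.0, 4.0        # assumed natural-fish mean and sd (cm)
n, sample_mean = 25, 33.0    # assumed GM-fish sample size and mean
z = (sample_mean - mu) / (sigma / n ** 0.5)   # distance in standard errors
p_value = 1 - NormalDist().cdf(z)             # one-sided tail probability
print(f"z = {z:.2f}, p = {p_value:.5f}")      # z = 3.75, p ~ 0.00009
```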

But for Deflate-gate, the big question is: IS THIS EXTREME? The statistical significance tool is not designed to answer this question. The analysis tells us that the Patriots do not look like the average random sample from the NFL. Saying that something is not average is far from saying that it is an outlier! Indeed, statistical significance testing is frequently (and controversially) used to detect "small effects".

If the Patriots sample had been randomly drawn from the NFL, then Point 4 would have provided evidence of an extreme value, but there is no random selection here. This takes us back to the point of my first post: the Patriots could belong to a group of elite NFL teams with more "skill" at preventing fumbles, or there could be many other explanations.
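A quick simulation shows why the lack of random selection matters. If you always single out the best of 32 teams and then test it as if it were a random draw, it looks "extreme" about half the time. The numbers here are invented for illustration:

```python
import random

# Selection effect: the best of 32 identical, average teams routinely
# looks "extreme" when judged as if it had been randomly selected.
random.seed(1)
trials = 10_000
extreme = 0
for _ in range(trials):
    league = [random.gauss(0, 1) for _ in range(32)]  # 32 average teams
    if max(league) > 2.0:  # the top ~2.3% cutoff for one random team
        extreme += 1
print(f"The league's best team exceeds the cutoff in {extreme / trials:.0%} of seasons")
```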

***

The other point of interest is that Points 1-4 say essentially the same thing: that the Patriots are far different from the rest of the NFL on the plays per fumble metric. Point 2 is the visual equivalent of the mathematics of Point 4.

Point 3 sounds different, but it really isn't. Points 2 and 4 say the Patriots don't fumble much. But dome teams fumble less because they play indoors; thus, their presence in the analysis makes the advantage of the Patriots (a non-dome team) less pronounced. So, in constructing Point 3, Sharp removed dome teams. It's the same data, viewed through a different lens.

Stating the same statistic over and over does not build an argument. I'm not saying Sharp should not have performed these steps; I'd have done many of these analyses myself. But they play the role of quality control. The reiterations don't strengthen the argument, and they sound a bit like Sunday-morning talk shows.

I was asked to adapt my earlier post for the HBR audience, and the new version is now up on HBR. Here is the link.

I'm happy that they picked up this post because most business problems concern reverse causation. A small subset of problems can be solved using A/B testing, but only those in which causes are known in advance and subject to manipulation. Even then, Facebook got into trouble for running such an experiment (not in my eyes though).

A number of readers sent me Warren Sharp's piece about the ongoing New England Patriots deflate-gate scandal (link to Slate's version of this), so I suppose I should say something about it. For those readers who are not into American football, the Super Bowl is soon upon us. New England, one of the two finalists, has been accused of using footballs inflated below the pressure requirements in the rulebook, hence "deflate-gate".

The Slate piece is a good example of the brand of data journalism that is possible in today's world where anyone can get a hold of a lot of data. The quality of the analysis is above average as far as these pieces go. I like the use of different visualizations to understand the variability of the plays per fumble across teams. It's also clear to me that the histogram is easily the best of the bunch.

The chart is just missing a label: the one team standing far right is the New England Patriots.

***

When reading these pieces, pay attention to the structure of the statistical argument. Here is how I would summarize the argument:

1. New England is an outlier in the plays per fumbles lost metric, performing far better than any other team (1.8x above league average).

2. Different ways of visualizing and re-stating the metric yield the same conclusion that New England is the outlier.

3. There is a dome effect of about 10 plays per total fumbles, meaning that teams playing indoors ("dome" teams) typically record about 10 more plays per fumble than teams playing outdoors ("non-dome" teams). New England is an outdoor team that performs better than most dome teams on the plays per total fumble metric. If dome teams are removed from the analysis, New England is an outlier.

4. Assuming that the distribution of the metric by team is a bell curve, the chance that New England could have achieved such an extraordinary level of play per fumbles lost is extremely remote.

5. Therefore, it is "nearly impossible" for any team to have the New England type ability to prevent fumbles... unless the team is cheating.

Actually, to his credit, Sharp did not argue Point 5. If you jump to the end of the article, all he said was that the extreme value of New England's plays per fumble performance is not random fluctuation, and that there are many possible explanations, both legitimate and illegitimate. So, not a lot of bang for the buck from running the analysis.

However, any reader of the Slate article, especially an anti-Patriots fan, will be tempted to cite these statistics and conclude that the Patriots cheated. All he or she requires is some story about how deflating footballs reduces the chance of fumbling. Notice I say story, because I don't think we have any scientific theory connecting the two yet.

This is a good example of the limitation of data analysis, including Big Data analysis. However suggestive, the data cannot prove guilt. In this case, it is hard to find data that rule out the possibility that the Patriots achieved the result legally.

In fact, the "dome" analysis, for me, weakens rather than strengthens the argument. Sharp switched from plays per fumbles lost to plays per total fumbles, the difference being fumbles recovered by the fumbling team. Since I can't see how deflating the football could help the Patriots recover more fumbles after the ball hits the ground, I prefer plays per total fumbles as the measure. On this measure, the Patriots are not an outlier at all, coming in second to the Falcons--only when Sharp removed all dome teams (the Falcons being one) could he argue that the Patriots are the outlier. But this tells me there are legitimate ways to perform as well as or slightly better than the Patriots did - just look at the Falcons.

The only true proof of cheating is New England being caught red-handed. Show me a record of the pressures of the footballs and you have a believer. Show me someone who saw the footballs being deflated. Similarly, drug testing (statistics) did not nail Lance Armstrong (I wrote about this a lot); what finally brought him down was eyewitness evidence and law-enforcement investigations.

Also, this is an example of what Andrew Gelman has been calling "reverse causation" problems (link). We learn that New England did spectacularly well on a metric, and we want to know what caused it. This is the opposite of the structure of an A/B test, in which we vary some causes and observe how the variations affect an outcome. The reverse-causation problem is one of the big issues of the Big Data era that isn't getting enough attention.

***

As this post is dragging on, I will leave other comments to Part 2. One issue I have with the statistical argument outlined above is that Points 1, 2, 3 and 4 essentially re-state the same thing four times.