Sabermetric Research

Sunday, April 29, 2012

Puzzle

Here's a puzzle that occurred to me a few days ago. I don't know what to do with it, so I might as well post it here.

------

In a certain country, the "daily numbers" lottery works like this: you buy a $1 ticket with your choice of 3-digit number. The winning 3-digit number is announced. If you match exactly, you win $1000.

However, the draw isn't random. Instead, the winning number is always the *least popular* number chosen by ticket-buyers -- that is, the number that is on the fewest tickets.

There's one catch: if there is a "tie" for least popular number, every number in the tie counts as a winning number. For instance, if there are 1,359 tickets with number "000", 1,359 tickets with number "844", and every other number is on more than 1,359 tickets, that's a two-way tie for least popular, so there are two winning numbers. Holders of either number win the full $1000.

That means, in theory, that *every* ticket could win. For instance, if all 1,000 numbers are bought exactly 3,453 times, it's a 1000-way tie, and every ticket buyer wins $1000. (Presumably, the lottery authority has a large cash reserve to cover this possibility.)

Assume that every ticket buyer chooses his or her number randomly. Is the chance of winning less than 1 in 1000, more than 1 in 1000, or exactly 1 in 1000?

------

I think there's an answer that requires almost no math, just logic and a very basic common-sense understanding of how probability works. (There might be other answers, and it could be that I missed an answer so obvious that the question isn't very interesting.)
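If you'd rather check than reason, the setup is easy to simulate. Here's a minimal Monte Carlo sketch in Python, scaled down to toy parameters of my choosing (100 numbers and 5,000 tickets, not the real 1,000 numbers) so it runs quickly:

```python
import random
from collections import Counter

def estimate_win_prob(num_numbers=100, num_tickets=5000, trials=500, seed=1):
    """Monte Carlo estimate of the chance that a randomly chosen ticket
    wins, when every ticket bearing a least-popular number pays off.
    Toy parameters (100 numbers, not 1,000), chosen for speed."""
    rng = random.Random(seed)
    winning_tickets = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(num_numbers) for _ in range(num_tickets))
        fewest = min(counts.values())
        # Every ticket on a number tied for least popular is a winner.
        # (Numbers nobody bought are ignored -- an assumption on my part.)
        winning_tickets += sum(c for c in counts.values() if c == fewest)
    return winning_tickets / (trials * num_tickets)
```

Comparing the estimate to 1 / num_numbers gives a hint, at least for these toy parameters.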

Friday, April 27, 2012

Why 10 runs equals 1 win

It's a rule of thumb that in baseball, every additional 10 runs you score turns one loss into a win. When I first heard that, it seemed like 10 was a lot ... but I managed to convince myself that it made sense. Here's how I explained it to myself.

------

Imagine a reasonably large number of baseball games -- a team-season, or decade, or whatever. Pick 10 games at random, and then pick one of the teams randomly in each of those 10 games. Add 1 run to those ten teams' score.

You've now added 10 runs. How does that change things?

Well, for many of those games, it won't change things at all. If the game didn't go into extra innings, and was won by 2 runs or more, then adding one extra run can't change the outcome.

In the 1990s, 68.4 percent of games were decided by more than one run. That means that 6.84 of those extra 10 runs are "wasted", and don't do anything.

Now, consider the 9-inning games decided by exactly one run. That was 22.5 percent of all games. Half of the time, the extra run will go to the winning team -- so that run doesn't do anything.

That leaves the 11.3 percent of games where the run goes to the team that lost by one run. In those games, the extra run sends the game into extra innings, and the team that gets the run will win half of them. That means each extra run turns a loss into a win 5.6 percent of the time. Over the 10 runs, that's 0.56 wins.

That leaves only games that went into extra innings. In the 1990s, that was 9 percent of all games.

If we add a run to one of those teams, that team now wins the game outright. It would have won half of them anyway, so half of those runs don't do anything. But, the other half, the run turns a loss into a win. That's 4.5 percent of all games, or 0.45 wins.

Add 0.56 wins to 0.45 wins, and you get ... 1.01 wins.

That's how every 10 runs leads to one win.
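The arithmetic above is simple enough to check in a few lines. A sketch in Python, using the 1990s percentages from the text:

```python
# Fractions of 1990s games, from the text
one_run_games = 0.225       # decided by exactly one run in nine innings
extra_inning_games = 0.090  # tied after nine

# A randomly assigned extra run flips a loss into a win when:
# (a) it goes to a team that lost by one (half of one-run games),
#     forcing extra innings, which that team then wins half the time; or
# (b) it goes to the eventual loser of a game that was tied after nine
#     (half of extra-inning games), which now wins outright.
wins_per_run = one_run_games * 0.5 * 0.5 + extra_inning_games * 0.5

wins_per_10_runs = 10 * wins_per_run
print(round(wins_per_10_runs, 2))  # -> 1.01
```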

------

Another way of getting the same answer (with all numbers rounded):

If you assign 10 extra runs randomly, 5 will be assigned to the team that won the game anyway, so those are wasted. Another 3 will be assigned to teams that lost by two or more runs, so those are wasted too. That leaves 2 runs.

One of those runs will turn a nine-inning tie -- half a win -- into a full win. So that's 0.5 wins.

The other one will turn a one-run deficit into an extra inning game -- which turns a loss into a half win. So that's the other 0.5 wins.

------

I guess you can generalize this to other sports: the number of wins per "point" is half the percentage of games that are tied in regulation, plus half the percentage of games that are lost in regulation by exactly one point.

For basketball, you'd have to also adjust for deliberate fouls at the end of the game. For hockey, you have to adjust for empty-net goals. And for football ... well, I don't know how you'd do it for football, since points usually come in large bunches of 3 or 7. Maybe you could add a field goal to get wins per extra 3 points.

------

Disclaimer: this analysis makes a few simplifications. It ignores bottom-of-the-ninth issues. It assumes all teams are .500. It assumes that none of the extra runs was allocated to an extra inning. And it assumes you never randomly choose the same game twice (which is related to assuming all teams are .500).

But, if you fix all those things, you'll still get a number close to 10 runs.

Friday, April 20, 2012

Don't use regression to calculate Linear Weights, Part II

I've written a couple of times now about why it's better to calculate linear weights via the "traditional" play-by-play method than by regression. That was all theory ... this time, I figure I might as well go into more detail, including doing the work and showing the real numbers.

I ran the play-by-play method using all major league games from 1990 to 1998. You probably know this already, but here's how the method works:

1. Look through every baseball game for the period you're interested in (probably by computer, through Retrosheet). For every plate appearance, note the number of outs, bases occupied, and the number of runs that scored in the remainder of that inning.

2. For each of the 24 game states (number of outs, bases occupied), group all the observations and average them, giving you the average number of runs scored in the remainder of that inning.

For instance: from 1990 to 1998, there were 27,580 instances of a runner on second and nobody out. A total of 31,749 runs scored in the remainder of those innings, for an average of 1.151 runs.

There were 9,090 instances of runners on first and third with no outs. The average there was 1.807 runs.

3. Once you have those 24 averages, go through every baseball game again. Find every single. Look at the game state before the single, and after the single. Calculate the difference in run expectation between the two states. The difference, plus the number of runs scored on the play, is what that single was worth.

For instance: with nobody out and a runner on second, the batter singles, runner to third. The value of that single is 1.807 minus 1.151, which is 0.656. So a single in that situation is worth 0.656 runs, on average.

Repeat for every single, and average out all the values. That's your answer, your linear weight for the single.

4. Repeat step 3 for every other event: doubles, home runs, walks ... whatever you want to calculate the linear weight for.
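For what it's worth, here's a bare-bones sketch of steps 2 and 3 in Python. The state encoding and the input format are mine, not from any particular implementation:

```python
from collections import defaultdict

def run_expectancy(observations):
    """Average runs scored in the remainder of the inning, per game state.
    observations: iterable of ((outs, bases), runs_rest_of_inning) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for state, runs in observations:
        sums[state] += runs
        counts[state] += 1
    return {state: sums[state] / counts[state] for state in sums}

def event_value(re_table, before, after, runs_on_play):
    """Linear-weight value of one event: change in run expectancy plus
    runs scored on the play. A third out leaves zero expectancy."""
    return re_table.get(after, 0.0) - re_table[before] + runs_on_play
```

Plugging in the 1990-1998 values from the text -- 1.151 for a runner on second with nobody out, 1.807 for first and third with nobody out -- event_value() returns 0.656 for the single in the worked example above.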

The logic of this method, I think, is very solid. Because we looked at all the data for those years, it is a *fact* -- not an estimate -- that on average, 1.151 runs were scored with a runner on second with nobody out. It is a *fact* that an average 1.807 runs were scored with runners on first and third with nobody out. Therefore, there is a very strong presumption that a single that moves the runner to third is worth .656 runs.

Technically, the value of the single is only a very close estimate. The reason is that we used three different samples. The 1.151 figure is from a group of 27,580 teams that put a runner on second with nobody out. The 1.807 figure is from a *different* group of 9,090 that put a runner on first and third with nobody out. And, the singles we're investigating were hit by a *third* group of teams -- 139,477 of them. Teams in the third group, by definition, form the weighted average of all teams that hit singles. But the other two groups are probably very slightly better than average, which means the calculation isn't quite exact.

But that's a very picky objection. Any bias caused by these circumstances is going to be very small.

-----

Here are the values the play-by-play method gives for the "big five" events:

0.468 single
0.768 double
1.076 triple
1.403 home run
0.314 unintentional walk

These are reasonably in line with the "traditional" Pete Palmer linear weights, circa 1984. That's not surprising -- Pete used the play-by-play method himself. (Actually, since Retrosheet didn't exist at the time, Pete ran a simulation to get random play-by-play data, then ran this method on that.)

-----

Then, there's the regression method.

In terms of the amount of work you have to do, this method is much, much simpler than the traditional method. All you do is take team batting lines, and ask your regression software to give you the best predictor of runs scored based on the other factors.

I ran this regression for the same seasons as the play-by-play method: 1990-1998. That comprised 248 team batting lines in total. I used only 1B, 2B, 3B, HR, BB, K, and outs. Here are the resulting coefficients:

0.494 single
0.730 double
1.343 triple
1.465 home run
0.342 walk

You'll notice that there are some significant differences between the two methods, especially singles and walks.

So, what's going on?

Well, first, I should point out that the differences aren't statistically significant. All the regression coefficients are within two standard errors of the traditional ones. So, you'd be within your rights to insist that the two sets of results aren't that much different.

And, of course, it's possible that it's just a programming error on my part.

However, I don't think either of those things is what's happening. I think that the differences are real.

I think the play-by-play method gives the correct result -- or close to it -- and the regression method is inherently biased.

Let me show you why I think that, in a bit of a roundabout way. Let me know if I'm wrong.

I ran another version of the same regression, but this time, instead of using team-seasons in the regression, I used individual games (both teams combined into one batting line). So, I had 81 times as much data in the regression. Well, almost: because of games lost to the 1994-95 players' strike, it was actually only about 77 times.

This time, the value of the walk is huge, much bigger than it should be -- and much bigger than the value we got when we used 1/77 the data. Why?

Start with the observation that the value of the walk (or, indeed, any other event) depends on context. On a high-offense team, the base on balls will be worth a lot, because the baserunner has a higher chance of being driven in. For low offense teams, the value of the walk will be lower, as it's more likely he'll be stranded.

So far, so good. The problem, though, is that the more walks a team gets, the stronger its offense -- and so, the more valuable each individual walk becomes.

Suppose a team has N walks. At the margin, the (N+1)th walk a team gets is worth so many runs. But the (N+2)th walk has to be worth more -- because now it comes in the context of a stronger team, a team that has N+1 walks, instead of a team that has only N walks.

It doesn't seem like it should be a big deal -- and it isn't that big, for a season. The range of actual major-league team offenses, in general, is pretty small. In 1998, the Yankees led the major leagues with 678 walks. The Pirates were last, at 393. That means the Yankees walked 73 percent more than the Pirates.

But for games, the difference can be much larger. Some games have 7, 8, or more walks. Some games have only 1 or 2. Now, the difference between most and least is in the hundreds of percents.

The difference between the 8th walk and the 2nd walk, over one game, is much bigger than the difference between the 393rd walk and the 678th walk, over a season.

And so, the problem of non-linearity -- of increasing returns on walks -- is bigger when you go game-by-game.

Now, what happens in the regression when you have increasing returns? Well, first, you probably shouldn't be using linear regression, which, by name and by definition, assumes things are linear. If you do it anyway, your coefficient gets inflated. In this case, the coefficient for the walk got inflated all the way to .433.

If you want a simpler example: Suppose I offer to pay you $1 for one hit, $4 for two hits, and $9 for three hits. That means hits have increasing returns -- the first hit is worth $1, the second is worth $3, and the third is worth $5.

In six games, you get one hit four times, two hits once, and three hits once. You made $17 on 9 hits, so the average hit is worth $1.89.

But, if you run a regression on the six games, you get each hit worth $2.29. (Try it if you don't believe me.)
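If you don't want to fire up regression software, the check fits in a few lines. This sketch assumes the regression is run through the origin (no intercept term), which is what reproduces the $2.29:

```python
# Six games: four with 1 hit ($1), one with 2 hits ($4), one with 3 hits ($9)
games = [(1, 1)] * 4 + [(2, 4), (3, 9)]

# Simple average: $17 over 9 hits
avg_per_hit = sum(pay for _, pay in games) / sum(hits for hits, _ in games)

# Least-squares slope with no intercept: sum(x*y) / sum(x*x)
slope = sum(h * p for h, p in games) / sum(h * h for h, _ in games)

print(round(avg_per_hit, 2), round(slope, 2))  # -> 1.89 2.29
```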

The more increasing returns you have, the worse the bias. For games, the bias was high, pushing the walk to .433. For seasons, the bias is lower -- but not zero. So the regression's value for the walk will still be higher than it should be.

The regression software I use has a test for non-linearity. For games, it comes back that walks (and singles) are definitely non-linear. For seasons, it finds non-linearity, but not enough to be statistically significant. (That insignificance, I think, is why academic studies that use regression don't notice there's a problem.)

------

Here are the results for two other sets of seasons. 2000-2009 NL:

          PBP     Reg     Reg (games)
Single    0.454   0.542   0.481
Double    0.765   0.753   0.794
Triple    1.063   1.159   1.010
Home run  1.395   1.576   1.576
Walk      0.303   0.283   0.416

And 2000-2009 AL:

          PBP     Reg     Reg (games)
Single    0.478   0.527   0.495
Double    0.780   0.776   0.783
Triple    1.051   1.529   1.000
Home run  1.397   1.396   1.415
Walk      0.334   0.349   0.430

And, for completeness, I'll re-run the original numbers for 1990-1998:

          PBP     Reg     Reg (games)
Single    0.468   0.494   0.490
Double    0.768   0.730   0.779
Triple    1.076   1.343   1.097
Home run  1.403   1.465   1.433
Walk      0.314   0.342   0.433

In every case, the regression on individual games ("reg games") seriously inflates the value of walks and singles. The regression on batting lines ("reg") inflates the value of walks and singles five out of six times.

-----

Why does the inflation only seem to apply to walks and singles? Here's my hypothesis. It might be wrong.

Singles and walks tend to be more concentrated in games than other events are. Some pitchers give up a lot of walks, some give up only a few. The difference between pitchers is not as big for hits. Yes, pitchers do vary a lot in strikeouts, which leads to differences in hits, but each strikeout difference is only 3/10 of a hit (since batting average on balls in play is fairly constant at .300).

Doubles, triples, and home runs have less non-linearity: two triples in the same inning aren't that much more valuable than two triples in separate innings. Also, home runs may have *diminishing* returns: two HRs in the same inning are probably worth less than two HRs in separate innings.

Also: doubles, triples, and home runs are rare enough that multiples don't happen that often. The number of repeats, I think, should be proportional to the square of the frequency. So if there are twice as many walks as doubles, there are four times as many consecutive walks as consecutive doubles.
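That square-of-the-frequency claim is what you'd expect if events arrived independently -- a modeling assumption on my part. For an event that occurs Poisson(rate) times per game, the expected number of same-game pairs (a stand-in for the "repeats" above) is E[k(k-1)/2] = rate squared over 2:

```python
def expected_same_game_pairs(rate):
    """Expected number of same-game pairs of an event that occurs
    Poisson(rate) times per game: E[k * (k - 1) / 2] = rate**2 / 2."""
    return rate ** 2 / 2

# Twice the walks per game -> four times as many same-game pairs
print(expected_same_game_pairs(2.0) / expected_same_game_pairs(1.0))  # -> 4.0
```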

That's why I think we see most of the bias in the value of the walk and single.

------

So, that's one reason I think it's preferable to use the traditional method over the regression method -- the regression method is biased too high.

If that's not enough, here's another reason: the traditional method has a lower random error.

Intuitively, that makes sense. The traditional method uses all the play-by-play data available, at a very granular level. The regression method uses only season statistics, the aggregation of maybe 6,000 plate appearances. It seems obvious that the method that uses six thousand times as much data should be more accurate.

But, that's easy to show empirically.

Here's what I did. I used the same years of data, and the same play-by-play method -- but I divided the data into 13 parts. I then calculated the linear weight independently, for each of the 13 parts.

I took those 13 linear weights, and calculated their standard deviation.

The best estimate of the true value, of course, is the average of the 13 estimates. And the standard error of that average is simply the SD of the 13 estimates, divided by the square root of 13.
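That calculation is just the ordinary standard error of a mean. As a sketch:

```python
import math
import statistics

def mean_and_se(estimates):
    """Combine independent estimates of the same quantity: the best
    single estimate is their mean, with standard error sd / sqrt(n)."""
    n = len(estimates)
    return statistics.mean(estimates), statistics.stdev(estimates) / math.sqrt(n)
```

Feed it the 13 per-part linear weights for an event, and you get that event's overall weight and its standard error.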

So, here are the results of the play-by-play method again, this time including the standard errors I wound up with:

0.468 (.0022) single
0.768 (.0027) double
1.076 (.0072) triple
1.403 (.0028) home run
0.314 (.0023) unintentional walk

(Technical note: because the way I constructed the 13 groups was random-ish, the standard errors are random too. I tried a different randomization and the values changed somewhat, but were the same order of magnitude. If you really, really wanted a precise estimate of the standard error, you could just do the randomization a couple of hundred times and average them all out.)

If you compare those to the standard errors from the regression, you'll see that they're smaller by at least a factor of 10. Which makes sense, considering they use so much more data, at so granular a level.

-------

So, this is a case where the more technical, mathematicky method is actually less accurate, less precise -- and less rigorous -- than the grade-school level method. Other than ease of computation, I don't see any reason to prefer the regression.

Tuesday, April 10, 2012

Academic rigor

At the SABR Analytics Conference last month, a group of academics, led by Patrick Kilgo and Hillary Superak, presented some comments on the differences between academic sabermetric studies, and "amateur" studies. The abstract and audio of their presentation are here (scroll down to "Friday"). Also, they have kindly allowed me to post their slides, which are in .pdf format here.

I'm not going to comment on the presentation much right now ... I'm just going to go off on one of the differences they spoke about, from page 11 of their slides:

-- Classical sabermetrics often uses all of the data -- a census.

-- [Academic sabermetrics] is built for drawing inferences on populations, based on the assumption of a random sample.

That difference hadn't occurred to me before. But, yeah, they're right. You don't often see an academic paper that doesn't include some kind of formal statistical test.

That's true even when better methods are available. I've written about this before, about how academics like to derive linear weights by regression, when, as it turns out, you can get much more accurate results from a method that uses only logic and simple arithmetic.

So, why do they do this? The reason, I think, is that academics are operating under the wrong incentives.

If you're an academic, you need to get published in a recognized academic journal. Usually, that's the way to keep your job, and get promoted, and eventually get tenure. With few exceptions, nobody cares how brilliant your blog is, or how much you know about baseball in your head. It's your list of publications that's important.

So, you need to do your study in such a way that it can get published.

In a perfect world, if your paper is correct, whether you get published would depend only on the value of what you discover. But, ha! That's not going to happen. For one thing, when you write about baseball, nobody in academia knows the value of what you've discovered. Sabermetrics is not an academic discipline. No college has a sabermetrics department, or a sabermetrics professor, or even a minor in sabermetrics. Academia, really, has no idea of the state of the science.

So, what do they judge your paper on? Well, there are unwritten criteria. But one thing I'm pretty sure about is that your methodology must use college-level math and statistics. The more advanced, the better. Regression is OK. Logit regression is even better. Corrections for heteroskedasticity are good, as are methods to make standard errors more robust.

This is sometimes defended under the rubric of "rigor". But, often, the simpler methods are just as "rigorous" -- in the normal English sense of being thorough -- as the more complicated methods. Indeed, I'd argue that computing linear weights by regression is *less* rigorous than doing it by arithmetic. The regression is much less granular. It uses innings or games as its unit of data, instead of PA. Deliberately choosing to ignore at least 3/4 of the available information hardly qualifies as "rigor", no matter how advanced the math.

Academics say they want "rigor," but what they really mean is "advanced methodology".

A few months ago, I attended a sabermetrics presentation by an academic author. He had a fairly straightforward method, and joked that he had to call his model "parsimonious," because if he used the word "simple," they'd be reluctant to publish it. We all laughed, but later on he told me he was serious. (And I believe him.)

If you want to know how many cars are in the parking lot today, April 10, you can do a census -- just count them. You'll get the right answer, exactly. But you can't get published. That's not Ph.D.-level scholarship. Any eight-year-old can count cars and get the right answer.

So you have to do something more complicated. You start by counting the number of parking spots. Then, you take a random sample, and see if there's a car parked in it. That gives you a sample mean, and you can calculate the variance binomially, and get a confidence interval.
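In case the sampling version sounds abstract, here's a toy sketch of the binomial estimate. All the numbers and names are invented for illustration:

```python
import math
import random

def estimate_cars(total_spots, occupied, sample_size, seed=0):
    """Sample parking spots at random, estimate the occupancy rate, and
    scale up to the whole lot, with a rough 95% confidence interval.
    'occupied' is a list marking which spots have cars -- invented data.
    (The plain binomial SE is used; no finite-population correction.)"""
    rng = random.Random(seed)
    sample = rng.sample(range(total_spots), sample_size)
    p_hat = sum(occupied[s] for s in sample) / sample_size
    se = math.sqrt(p_hat * (1 - p_hat) / sample_size)
    estimate = p_hat * total_spots
    margin = 2 * se * total_spots
    return estimate, (estimate - margin, estimate + margin)
```

The census, of course, is just sum(occupied) -- exact, with no confidence interval required.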

But, again, that's just too simple -- a t-test based on a binomial. You still won't get published. So, maybe you do this: you hang out in the parking lot for a few weeks, and take a detailed survey of parking patterns. (Actually, you get one of your grad students to do it.) Then, you run regressions based on all kinds of factors. What kind of sales were the stores having? What was the time of day? What was the price of gas? What day of the week was it? How close was it to a major holiday? How long did it take to find a parking spot?

So, now you're talking! You do a big regression on all this stuff, and you come up with a bunch of coefficients. That also gives you a chance to do those extra-fancy regressiony tests. Then, finally, you plug in all the independent variables for today, April 10, and, voila! You have an estimate and a standard error.

Plus, this gives you a chance to discuss all the coefficients in your model. You may notice that the coefficient for "hour 6", which is 12pm to 1pm, is positive and significant at p=.002. You hypothesize that's because people like to shop at lunch time. You cite government statistics, and other sociological studies, that have also found support for the "meridiem emptor" hypothesis. See, that's evidence that your model is good!

And, everyone's happy. Sure, you did a lot more work than you had to, just to get a less precise estimate of the answer. But, at least, what you did was scholarly, and therefore publishable!

It seems to me that in academia, it isn't that important to get the right answer, at least in a field of knowledge that's not studied academically, like baseball. All journals seem to care about is that your methodology isn't too elementary, that you followed all the rules, and that your tone is suitably scholarly.

"Real" fields, like chemistry, are different. There, you have to get the right answer, and make the right assumptions, or your fellow Ph.D. chemists will correct you in a hurry, and you'll lose face. But, in sabermetrics, academics seem to care very little if their conclusions or assumptions about baseball are right or wrong. They care only that the regression appears to find something interesting. If they did, and their method is correct, they're happy. They did their job.

Sure, it could turn out that their conclusion is just an artifact of something about baseball that they didn't realize. But so what? They got published. Also, who can say they're wrong? Just low-status sabermetricians working out of their parents' basement. But the numbers in an academic paper, on the other hand ... those are rigorous!

And if the paper shows something that's absurd, so much the better. Because, nobody can credibly claim to know it's absurd -- it's what the numbers show, and it's been peer reviewed! Even better if the claim is not so implausible that it can't be rationalized. In that case, the author can claim to have scientifically overturned the amateurs' conventional wisdom!

The academic definition of "rigor" is very selective. You have to be rigorous about using a precise methodology, but you don't have to be rigorous about whether your assumptions lead to the right answer.

-----

Just a few days ago, after I finished my first draft of this post, I picked up an article from an academic journal that deals with baseball player salaries. It's full of regressions, and attention to methodological detail. At one point, the authors say, "... because [a certain] variable is potentially endogenous in the salary equation, we conduct the Hausman (1978) specification test ..."

I looked up the Hausman specification test. It seems like a perfectly fine test, and it's great that they used it. When you're looking for a small effect, every little improvement helps. Using that test definitely contributed to the paper's rigor, and I'm sure the journal editors were pleased.

But, after all that effort, how did their study choose to measure player productivity? By slugging percentage.

Sometimes, academia seems like a doctor so obsessed with perfecting his surgical techniques that he doesn't even care that he's removing the wrong organ.

Sunday, April 08, 2012

How stable is a baseball player's talent? Part III

(Note: this is a continuation of a technical post, probably not of general interest.)

Following the pattern of what I did last post for the odd/even split, here are the calculations for confidence intervals for the real season-to-season cases. Skip the math and head right to the bold parts, if you like.

-----

For all 38 pairs of seasons from 1973/4 to 2010/11, I calculated the SD of the binomial Z-scores for all pairs of players with at least 400 PA both seasons.

The average of all 38 SDs was 1.118. The SD of the 38 SDs was .0755. To get the standard error of the average, we divide .0755 by the square root of 38, which gives .0122.

I'm not sure if the SD is normally distributed, but, even if it's not, 2 SE in both directions is a reasonable confidence interval. So, adding/subtracting twice .0122 from the average of 1.118 gives us an interval of (1.094, 1.142).

(Last post, I said this method was imprecise. I think I was wrong when I said that. The method is imprecise only when the effect you're looking for is very close to zero (like odd/even). For larger effects, the method works much better.)

Those are the total SDs. They comprise the variance caused by binomial randomness, and also the variance caused by talent (and circumstance) changes.

If the actual SD is 1.094, that means the SD attributable to talent (actually, anything other than binomial variance) is the square root of (1.094 squared minus 1 squared), which is 0.444. If the actual SD is 1.142, the SD attributable to talent is .552.

So the confidence interval of the SD of talent change is (.444, .552), in terms of binomial Z-scores.

We now need to convert that to OBP. The average player in the study had 574 PA the first season, and 570 the second season. That gives a binomial single-season SD of the square root of (.333 times .667 divided by 570), which works out to .0197. For the difference between two seasons, multiply that by the square root of 2, giving .0279 points of OBP.

Now, we just multiply the Z-score confidence interval by .0279 to convert to an OBP confidence interval. And the result:

The confidence interval for SD of talent (and circumstance) change between seasons, for a single player, runs from .012 to .015.
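The whole chain above fits in a few lines. A sketch in Python, using the numbers from this post:

```python
import math

# Step 1: confidence interval for the total SD of the Z-scores
n_season_pairs = 38
avg_sd, sd_of_sds = 1.118, 0.0755
se = sd_of_sds / math.sqrt(n_season_pairs)             # ~.0122
total_lo, total_hi = avg_sd - 2 * se, avg_sd + 2 * se  # (1.094, 1.142)

# Step 2: strip out the binomial variance (which is 1 in Z-score units)
def talent_sd(total_sd):
    return math.sqrt(total_sd ** 2 - 1)

# Step 3: convert from Z-score units to points of OBP
obp, pa = 0.333, 570
binomial_sd = math.sqrt(obp * (1 - obp) / pa) * math.sqrt(2)  # ~.0279

interval = (talent_sd(total_lo) * binomial_sd,
            talent_sd(total_hi) * binomial_sd)
print(round(interval[0], 3), round(interval[1], 3))  # -> 0.012 0.015
```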

I did the same thing for strikeout rate, walk rate, and extra-base hit rate (all per PA). Results:

That's a pretty significant split -- bigger than his home/road or his day/night. Line 1 is not that great, but Line 2 is MVP material.

What is it? The first line is how he hit on even days of the month. The second is how he hit on odd days of the month.

There's actually nothing special about day of the month, or about David Justice in particular. I ran splits for every player-season from 1974 to 2009, and this was one of the biggest, so I just thought I'd show it.

------

Anyway, the reason I did this was because of a suggestion from Tango. In the previous post, I showed a method where we could estimate how much a player's talent in batting average fluctuates from year to year. (The estimate we got was that the SD of talent changes is around .010.)

Tango asked me to repeat the analysis for OBP, K rate, and BB rate. For OBP, it was .0136. For K, it was .0161 (that's strikeouts per plate appearance). And, for BB, it was .0161. I posted those results at his blog.

Those were larger than Tango expected, and so he wondered if it could just be random, and suggested the odd/even analysis. If the method was correct, it should give a talent change of close to zero for the odd/even split.

So I did it.

For every player-season from 1974 to 2009, where the player had at least 200 AB on both odd and even days of the month, I calculated the binomial Z-score of the difference between his odd performance in OBP, and his even performance in OBP.

Then, I took the mean and SD of those Z-scores. If everything was just random, we would get a mean of zero, and an SD of 1.00.

It was close. The mean was -.009, and the SD was 1.02.

By the same method used in the previous post, we can figure that the SD of "talent" changes between odd days and even days is .2 SD (the square root of 1.02 squared, minus 1 squared). With an average of 282 PA in each group, one SD of the binomial difference for a single player is .040. So, the talent change, in terms of OBP, is the product of the two, which is .008.

Is .008 in the right range? We wouldn't expect it to be zero, because, even though the days are random, the circumstances aren't. It could be that, just by chance, a certain player had more home games in one group than the other, or faced better pitching, or had more day games, or more games that the wind was blowing out, and so on.

But my gut says that .008 is too large to be just the result of random circumstances. Is that really true, though? .008 is only two plate appearances in 250. Is it reasonable to expect that, typically, if you divide a season into two parts of 250 PA each, the random combination of home/day/baserunners/opposition pitching would lead to an expectation of one fewer out in one group, and one extra out in the other? When I put it that way, it seems more reasonable ... but my gut still says it's too high.

However: there's a lot of randomness involved. The numbers varied quite a bit from year to year. Recall that the overall SD of the Z-scores, for all 36 seasons combined, was 1.02. Here are the first few single season numbers:

There are large differences between seasons. Indeed, for some seasons, the SD is less than 1, which really shouldn't happen for any reason except random chance. That is, there's no reason to expect that players' differences between odd and even should be less than if you just assigned plate appearances randomly.

If I continued the series all the way to 2009, we'd find that the SD of all those numbers is 0.051. Since there were 36 seasons in the study, and 36 is exactly 6 squared, the SD of the overall average is 1/6 of 0.051, which is around 0.0085.

The average of the 36 numbers was 1.0156, which is .0156 from 1.000. Since the SD is .0085, what we observed is 1.8 SD from 1.000.

But, as I said, we shouldn't expect exactly 1.000, because the two groups are not actually identical in the circumstances of those plate appearances. I don't know how much greater than 1.000 we'd expect, though. It might be only a tiny bit.

(By the way, I apologize if all these different SDs are getting confusing. There's the SD of the difference in OBA for a single season, which we talked about in the previous post. There's the SD of the Z-scores for a single season's worth of players, which is the list of numbers above. And, now, there's the SD of all those SDs! Sorry about that.)

-----

The bottom line is: we get our estimate of the SD of "OBP talent change" (which actually includes circumstance change) between odd and even to be around .008. But, the standard error of that is so large, that it could be anywhere between -.008 and .025 -- or, actually, .000 and .025, because negative SDs don't exist.

So, what we learned from this exercise is that this method isn't all that precise, even with 36 seasons worth of data to work with. That means that our previous estimate, that between-season talent of OBP varies with an SD of .0136, is similarly imprecise.

Maybe for next post, I'll repeat this "SD of the SD" analysis for the real season-to-season data, instead of this odd/even data, and see what the confidence intervals look like.