There are a lot of other forecasting systems in the wild; I chose to look at Marcel and CHONE in comparison because they’ve fared well historically and they have sound underpinnings.

Let’s look at how well each system forecasted the overall offensive levels of the league as a whole. We’ll use OPS, since it’s a “good enough” offensive estimate for the sort of study we’re doing, and it’s typically calculated by every forecaster, so it’s a very transparent way to compare systems. Looking only at players in common between the four projection sets, the weighted average of OBP, SLG, and OPS for each system:
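A PA-weighted league average like this can be sketched in a few lines. This is a generic illustration, not BP's actual code, and the field names (`pa`, `obp`, `slg`) are placeholders:

```python
def weighted_league_line(players):
    """players: list of dicts with projected pa, obp, slg.
    Returns the PA-weighted league OBP, SLG, and OPS."""
    total_pa = sum(p["pa"] for p in players)
    obp = sum(p["obp"] * p["pa"] for p in players) / total_pa
    slg = sum(p["slg"] * p["pa"] for p in players) / total_pa
    # OPS is just OBP + SLG, so the weighted OPS is the sum of the
    # weighted components.
    return {"obp": round(obp, 3), "slg": round(slg, 3), "ops": round(obp + slg, 3)}
```

Weighting by plate appearances keeps a bench player's line from counting as much as an everyday regular's, which matters when you're comparing a projected league line to the observed one.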

This was a down year for offense on the whole; the most recent PECOTAs and CHONE forecasts were a shade closer in projecting the league offensive environment, but even then there were .023 points of OPS between them and the observed results.

So I adjusted each set of forecasts to line up with the lower offensive environment, and looked at the root mean square error of each forecast from the observed result, weighted by the number of plate appearances that player had. (Root mean square error approximates the standard error—in other words, assuming roughly normal errors, about 68 percent of the time you should expect outcomes to fall within that distance of the forecast.) I now present the most boring chart I have ever had the honor of showing you:
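For concreteness, a PA-weighted RMSE can be computed like the following sketch (not the actual test harness; the tuple layout is illustrative):

```python
import math

def weighted_rmse(pairs):
    """pairs: list of (projected, actual, pa) tuples.
    Returns the PA-weighted root mean square error."""
    total_pa = sum(pa for _, _, pa in pairs)
    # Weight each squared error by that player's plate appearances,
    # then take the square root of the weighted mean.
    sq = sum(pa * (proj - act) ** 2 for proj, act, pa in pairs)
    return math.sqrt(sq / total_pa)
```

Lower is better, and because the errors are squared before averaging, a few big misses hurt a system more than many small ones.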

We see a bit more separation here, but not much. PECOTA and CHONE were tops at predicting batting average, the Marcels were best at predicting home runs, and PECOTA was tops at predicting runs scored, RBI, and stolen bases. (PECOTA looks better at these projections because we use our Depth Charts to model a player’s specific role and playing time—that has little to no impact on his OPS, but has a big effect on his counting stats and thus his fantasy value.)

Now, let’s look at pitchers, using ERA as our measurement. First, the predicted versus observed ERAs, as a group:

Again, the run environment this year was lower than any of these forecasts expected it to be. After adjusting the forecasts to account for the difference between expected and actual run environment, a look at how each forecasting system did:
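One simple way to make that adjustment is to scale every forecast by the ratio of the observed league ERA to the forecast league ERA, so the innings-weighted means line up. I don't know the exact method used here, so treat this as an illustrative sketch:

```python
def adjust_to_environment(forecasts, observed_mean):
    """forecasts: list of (era, ip) tuples.
    Scales each projected ERA so the IP-weighted mean of the set
    matches the observed league ERA."""
    total_ip = sum(ip for _, ip in forecasts)
    forecast_mean = sum(era * ip for era, ip in forecasts) / total_ip
    factor = observed_mean / forecast_mean  # e.g. < 1 in a down year for runs
    return [(era * factor, ip) for era, ip in forecasts]
```

After the adjustment, the comparison measures how well each system ranked and spaced the pitchers, rather than penalizing it for missing the overall run environment.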

We see a little more separation in the pitcher projections than we did in the hitter projections, but not a lot. Looking at the other roto categories for pitchers (CHONE doesn't project saves, so it received no score in that category):

Now, of course, PECOTA has always done more than simply project a player's basic stat line—we have a lot of other things going on, like the 10-year forecasts, the percentiles and the upside/downside ratings. That complexity is one of PECOTA's main attractions, but it shouldn't become a major downside as well.

One of the drawbacks of PECOTA's additional complexity is simply how long it takes to produce forecasts. But that was also a consequence of using Excel to generate them. We cut that dependency a while ago, and we're continuing to work to integrate PECOTA more closely with our other statistical offerings. That's important to you because it means you get your forecasts sooner—because the word "fore" is, of course, a major component of forecasting.

But it’s also important for the accuracy of any individual forecast. I can take one hitter’s forecast and substitute any number of outlandish findings for him, and that on its own won’t move the needle on those RMSE figures I showed you—it takes a systemic problem affecting a lot of forecasts to show up in that sort of test.

And PECOTA is a computer program—essentially, a list of instructions. It will follow those instructions unerringly, regardless of whether those instructions are correct. It takes a human to write instructions for the computer to follow, and as we all know, humans make mistakes now and then.

Some of you may remember the PECOTA forecast for one Matt Wieters' debut season. It struck a lot of people as being outlandish—I was certainly one of them. In this case, PECOTA was a victim of its own complexity—because the forecasts took so long to produce, there wasn't enough time to properly proof the PECOTAs.

Forecasts for minor league players are heavily dependent on methods of translating their stats into expected major league performance; in this case, that method was the Davenport Translations. The foundation of this process is a set of league difficulty ratings that establish how a league compares to the majors.

What seems to have happened is that, when spitting out translations for the two leagues that Wieters played in (and only those two leagues, mind you), those league difficulty factors were significantly inflated from what they should have been—the Eastern League was rated not only higher than the other two Double-A leagues, but above both Triple-A leagues as well, and the High-A Carolina League placed above the other two Double-A leagues.

For most players, that wasn’t going to make a noticeable impact—very few players who are expected to be anywhere close to the majors have only one year of stats split between the Eastern and Carolina Leagues. But one is enough to produce the Wieters forecast.

For last year’s book, we had a fairly involved proofing process for the PECOTAs. (Notably, we neglected to do that for the first run of the Depth Charts. There’s a lesson to be learned there, and we’ve learned it—proof everything before publishing.) That’s good, but we want to do better than that. So in addition to having humans proof the PECOTAs, we’re building a set of unit tests to run alongside the PECOTAs, testing each element to make sure it’s functioning properly. What this means is that the output of the PECOTAs is going to be checked at several steps along the process, to ensure that everything is functioning correctly.
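A unit test of this sort might look like the following sketch, which checks the kind of invariant the Wieters bug violated: a league's difficulty factor shouldn't exceed that of a league at a higher level. The league names, level ranks, and factor values here are all hypothetical:

```python
def check_level_ordering(factors, levels):
    """factors: {league: difficulty factor}
    levels:  {league: level rank} (higher rank = closer to MLB).
    Returns the leagues whose factor exceeds that of some league
    at a strictly higher level -- a red flag worth proofing."""
    flagged = []
    for league, f in factors.items():
        for other, g in factors.items():
            if levels[league] < levels[other] and f > g:
                flagged.append(league)
                break
    return flagged
```

A test like this doesn't know what the "right" difficulty factors are, but it can catch an ordering that makes no baseball sense before a forecast built on it ever ships.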

We’re also utilizing these tests to make sure that when changes are made to the PECOTAs, they’re actually improving the underlying accuracy of the product. And we will be notifying subscribers when changes are made to the methods between PECOTA updates.

Of course, PECOTA has had some infamously mistaken forecasts that wouldn’t have been caught regardless of the amount of proofing. Most of them have been of Ichiro Suzuki. Tomorrow, we’ll go ahead and address how PECOTA missed the boat on Ichiro, and what we’ve learned from those mistakes.

Not exactly the proper venue, but with regards to the PFM, why even with a straight draft do you still have to input a total budget with an allocation towards hitting and pitching?

I'd rather put in my league's parameters and have you tell me how valuable your projections think that player will be in that league (perhaps VORP?). Not something based on an (I assume) arbitrary value where the total allocation defaults to 260, 180 of which is assigned to hitting.

Thanks Colin for the much-appreciated transparency. While I don't think any projection system is "deadly accurate," this restores my lost belief that PECOTA is in the same group at the top as the other major players in projections.

Is it fair to normalize every projection to the league offensive environment? My instinct says no; the projections are the projections. However, I'd like to hear some arguments from both sides on this one as I'm not too sure what is most fair.

It probably depends on what you want to do with the projections. If you are playing fantasy baseball, it is probably fair... I believe that relative performance between players is what matters most, not the absolute level at which they collectively perform.

I don't know if it's "fair" or not, but I think it models the way most people use projections - I don't necessarily care what a guy's OPS is going to be, I care how good he's going to be relative to the other players in the league. We use OPS as a proxy for that, but projecting the league average OPS is less useful than being able to identify who the exceptional and subpar hitters are.

That said - it's not like doing it that way helped PECOTA in those tests. It was the leader in identifying the average OPS (tied with CHONE) and ERA.

I would like to respond to the "Deadly Accurate" thing once and for all, since I'm the SOB that coined it. As I recall, we were asked by our then-publisher for some ways to describe what we do, and I jokingly offered a number of things, one of which was "deadly accurate." There was much chuckling at the time, because who but Annie Oakley would say "deadly accurate" about anything? Six months later, there it was on the cover. It was never intended to be more than an obviously hyperbolic boast, something out of the Stan Lee school of breathless cover blurbs. I see how some might read it as written, but even so, it's a line on a book cover, not a slap across the face, and I've never understood why some folks seem so exercised about a harmless bit of self-evident braggadocio.

Steven, thanks for sharing the backstory on that one. Some folks may have gotten too exercised about it, and I was one of them, but at the time it appeared that Baseball Prospectus thought they were better than the rest of us and their excrement didn't stink. Not simply from that line on the book cover but also from other things that were happening. By now, it's clear that BP is serious about addressing and fixing errors and playing on the same field with everyone else, and that's very good. It wasn't so clear back then that it was made in humor and not a claim to be better than the rest of the sabermetric community.

I think it's understandable that it tweaked a few people and good to know that it wasn't meant that way.

I'm one of the people who is irked. The problem is when people believe this press release. People start acting like PECOTA is the leader, when test after test shows that it's possibly above-average, and possibly below Marcel. It's not something to be boastful about.

If the intent was limited to a quasi joke, why do you have it here:
http://www.baseballprospectus.com/subscriptions/

"Complete depth charts and forecasts for AL and NL pitchers and hitters using Baseball Prospectus' deadly-accurate PECOTA projection system--the same one used in MLB front offices."

Well, first off, let me say that I'm sorry for the snarky comment. I just can't resist an opportunity to quote Kenny Bania.

I guess that I am just bemused by the amount of invective that the "deadly accurate" marketing slogan has generated. The purpose of a marketing slogan is to generate interest in your product. The cover of the annual isn't designed to attract buyers from the sabermetric community, it's designed to generate interest from the public at large.

My sense is that others doing great analysis have landed on the "deadly accurate" thing as a slight on their work. If that is the case, I have to disagree. The slogan exists to promote BP's product (and it isn't like they were selling snake oil). I don't see any obligation to compare their product to competing products, or to even acknowledge that there are competing products.

I'm all for accountability, and I salute Colin's efforts in this regard. But .

Thanks. The apology was because I think my first comment came across as unappreciative of your work (which I think is brilliant, BTW).

The point I was trying to make is that I don't want to discourage marketing of advanced analysis. I get tired of explaining to every Tiger fan I know that Austin Jackson is more likely to hit .260 next year than hit .300 again. And have them look at me like I'm nuts.

So I get frustrated when I see any discord among the sabermetric community (real or imagined by me). Because I'd like to see everyone tugging on the same rope and getting the word out. My introduction to sabermetrics went a little something like Rob Neyer-->Baseball Prospectus-->Hardball Times-->Bill James (yes, he came 4th to me)-->Tom Tango-->Fangraphs. I had to get started somewhere, and for me it was ESPN.com in 1998.

If it takes a person reading the BP2010 annual for the first time another three years to realize that Chone, Pecota, and Marcel all tell a pretty similar story, that's OK with me - at least they are reading a part of the story and hopefully their curiosity is piqued to read further.

I guess I'm saying that even if the marketing is imperfect, at least the marketing is happening. I think in the long run it benefits the entire sabermetric community, not just BP. I view this as a good thing.

An excellent point, and very well said. It would act as a great mission statement.

As long as we can get away from "mine is better than yours" and into "mine works best here, and yours works best there", that would go a long way. Unless, of course, something is genuinely deficient and should be supplanted.

So, Chone, Marcel, ZiPS, PECOTA can all live happily together, each having its own strengths, with none deserving to be discarded.

I always get a bemused chuckle from the "deadly accurate" claim, because it is a big joke, and a marketing ploy, and yet I also get the feeling that underneath it all, there's an undeniable element of pride and morale-building, too.

Back in the day when I played competitive sports, we used to give ourselves nicknames for all the same reasons. Well, except for the marketing angle.

I appreciate the candor of this look into PECOTA and the process. Certainly, the set of unit tests is a gigantic step forward in improving value of your products.

My major concern is with the use of standard error comparisons to demonstrate anything useful or important. It says to me that the other major projections are about the same over the set of all players. I am more interested in the projections on some groupings, e.g.: "NL", "AL", "Stars", "Bums", "Everyday", ....

My reason for using Prospectus is to help brighten my team's future performance by improving my drafting. It is the distribution of projection error within groups that is most important, I think.

While most "grading the projections" articles have to do with the mean projection, has there ever been a retrospective look at the accuracy of PECOTA's percentile forecasts (i.e. the distribution of errors)?

Obviously, this would be a bit more labor intensive, and you couldn't cross-compare to other systems (which don't provide percentile bands), but it would be VERY interesting to know whether, for example, approximately 10% of players actually meet their 90th percentile forecast. Does PECOTA accurately assess the overall number of "breakouts" and "collapses"?
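Checking that kind of calibration is straightforward once the percentile forecasts are in hand; here's a minimal sketch (the pairing of actuals with 90th-percentile forecasts is illustrative, not PECOTA's output format):

```python
def calibration(actuals_vs_p90):
    """actuals_vs_p90: list of (actual, p90_forecast) pairs.
    Returns the share of players who met or beat their 90th-percentile
    forecast; a well-calibrated system should land near 0.10."""
    hits = sum(1 for actual, p90 in actuals_vs_p90 if actual >= p90)
    return hits / len(actuals_vs_p90)
```

The same tally run against the 10th-percentile forecasts would tell you whether the downside bands are honest, too.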

I've often wondered the same thing. What percentage of the time does PECOTA correctly predict individual players' stats? The definition of what is 'correct' can be flexible (within 5%, 20%, etc.). Does it predict some stats better than others? Does it do a better job with certain positions or player types? That info would be a lot more interesting than looking at league-wide averages.

Or even very simply, I would like to see a PECOTA weighted means spreadsheet that shows the variance between what was predicted before the season and the final actual totals. I'm talking about the variance for every category on the PECOTA spreadsheet for every player.

From there, we could sort by stat, position, league, etc..to determine if there are specific trends or categories that we might want to pay attention to with the next iteration.

To make it up to us -- how about going back and doing a new 2009 PECOTA forecast for Wieters, with the correct Davenport Translations? It would be interesting to see whether his performance over the past two years is in line with that.

There is a reason I advocated his 10th percentile forecast as the one you should be paying attention to on draft day, and that even that one was possibly high. PECOTA and I have to work together to make sure we both do our best work :-)

Well, as I recall his 10th percentile forecast was still HOF material. That's exactly why I'd like to see the forecast re-run -- what would his 10th or 20th percentile forecast have been with correct translations?

I don't think anyone but Nate Silver could accommodate you here, exactly. I could provide you with a "retrojection" based upon the current PECOTA setup, but I can't guarantee that all the differences between the Wieters forecast then and the one I give you are due to the changed DTs - in fact, I can guarantee you they aren't.

Have you ever done "retrojections" with PECOTA? That is, use the latest version of PECOTA but only using data available at the time (so we could have a forecast for 2009 using only data available through March 2009, etc). That would give us much larger samples to at least judge PECOTA against itself (this could also be done with Marcel since it's open-source). Also if the algorithm for PECOTA is tweaked we could examine the new retrojections to see how much they improved (although much care must be taken to avoid over-fitting).

I recognize that this would be a huge undertaking, you might not have access to all the right data required, and you are surely busy with other stuff, but it would be cool to have.

I have a database table for every player who played in the majors from '50 to '09, with what their PECOTA baseline projection (minus aging) would have been for the next season. I have another database table for that player's Marcel forecast. Then a similar forecast comparison to the one presented here is used. Once we finish work on the revised age adjustments, those will be going into the test suite as well.

(Obviously with some things, like minor league stats, we're not going to be able to do 60 years worth of tests, but the concept is similar.)

For those interested, tomorrow we're going to be presenting the baseline forecasts (again, without age adjustments) for Ichiro using the current PECOTA methodology.

B) Was an uber prospect based largely on scouting and college performance - not his 2009 PECOTAs.

There's still a pretty good chance he emerges as one of the best catchers in baseball. Let's not write his obit just yet. Though obviously, if you've watched him recently, he doesn't look like an impact bat... the tools are still there, and no one should be shocked in the slightest if he turns it around.

"Wieters was regarded as the best position player in last year's draft and possibly the best college catcher in recent memory, combining good defense, a high average, and good power. The Orioles got him with the fifth overall pick because the top four teams were scared away by his bonus demands and his agent, Scott Boras." - BP 2008

Most places would gloss over issues like the ones you had to deal with at the beginning of the year; you guys have overcome it. I'm still a bit burned over the extremely low BABIP listed for Oakland pitchers in their Depth Charts.

I never could get a straight answer as to what ended up causing it....

+1 I would guess that a large amount of subscribers want PECOTA for fantasy purposes, and base things off of the projections the PFM spits out. Accurate runs and RBI projections aren't so meaningful for "real baseball," but them's the stuff fantasy championships are built upon.