HHP: The Comprehensive Report on Statistical Prospecting

Introduction

For a long time, I've been working on a project to use the statistics generated in the various minor leagues to predict how well a given prospect will do. I've posted in the past some early attempts at this. I have finally got a product that I consider worthy of publishing, at least here on the OH. It isn't finished or complete, yet, but this particular section is.

I've studied every single hitting prospect who got significant playing time between 1991 and 2005 in four leagues: the Gulf Coast League, the New York-Penn League, the South Atlantic League, and the Carolina League. These are the leagues that the Orioles' minor league affiliates play in at their respective levels. From this set of data, I've made observations and drawn some preliminary conclusions. I've also developed a quick-and-dirty prediction method that separates prospects from non-prospects.

This has been a pretty big project, and it isn't over. Next, I'm going to try to refine the prediction methods to be more accurate and more precise (not the same thing). After that, I'm going to try the same methods to predict pitchers. No estimate for when those will be done.

Methods

I used Baseball Reference to create a dataset that included every position player who played in the GCL, NYPL, SAL, or CARL between 1991 and 2005. I removed all players who had fewer than 150 plate appearances. I found those players who made the majors and their respective totals of MLB plate appearances and rWAR. I then categorized all the players with the following codes.

-1: This prospect never played in the major leagues.
0: This player has fewer than 600 MLB PAs (cup of coffee).
1: This player has 600-1600 MLB PAs or this player has <1 rWAR.
2: This player has at least 1600 MLB PAs and 1-14 rWAR.
3: This player has at least 1600 MLB PAs and at least 14 rWAR.
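As a rough illustration, the category codes can be written as a small Python function. This is only a sketch: the function names, the use of `None` for a player who never reached the majors, and the choice to count exactly 1600 PA as enough for categories 2 and 3 are assumptions made here, not details from the dataset.

```python
from typing import Optional


def categorize(mlb_pa: Optional[int], rwar: float = 0.0) -> int:
    """Return the outcome code (-1, 0, 1, 2, or 3) for one player.

    mlb_pa is None for a player who never reached the majors
    (an assumed convention for this sketch).
    """
    if mlb_pa is None:
        return -1  # never played in the major leagues
    if mlb_pa < 600:
        return 0   # cup of coffee
    if mlb_pa < 1600 or rwar < 1.0:
        return 1   # too little playing time, or under 1 rWAR
    if rwar < 14.0:
        return 2   # at least 1600 PA and 1-14 rWAR
    return 3       # at least 1600 PA and at least 14 rWAR


def is_success(category: int) -> bool:
    """A 'successful' prospect is one in category 2 or 3."""
    return category >= 2


# Players cited elsewhere in this post, with their PA and rWAR totals.
print(categorize(1081, 2.2))   # Nolan Reimold
print(categorize(1845, -1.8))  # Wily Mo Pena
print(categorize(3048, 0.5))   # Jeff Keppinger
```

All three of those players come out as category 1, either on playing time or on rWAR.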

I consider any player categorized as a "2" or a "3" to be a successful prospect. Even with this fairly low bar, only 5.65% of the 7630 prospects in the dataset qualify as successful.

Separate seasons by one player were treated as completely separate datapoints. If a player split his season between two leagues in one year, and got at least 150 PA in each, that was treated as two independent datapoints. If a player played in the same league in multiple years, that was also treated independently. This is not an ideal solution, but I think it is acceptable for now.
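A minimal sketch of that datapoint rule, assuming the raw data is already one row per player, year, and league. The `Stint` type and the player names are hypothetical and exist only for illustration.

```python
from typing import NamedTuple

MIN_PA = 150  # the hard plate-appearance cutoff used throughout this post


class Stint(NamedTuple):
    player: str
    year: int
    league: str  # "GCL", "NYPL", "SAL", or "CARL"
    pa: int


def build_datapoints(stints):
    """Keep every (player, year, league) row with at least MIN_PA plate appearances.

    Each surviving row is its own datapoint, even when several rows
    belong to the same player.
    """
    return [s for s in stints if s.pa >= MIN_PA]


rows = [
    Stint("Hypothetical Player A", 1999, "SAL", 310),   # kept
    Stint("Hypothetical Player A", 1999, "CARL", 180),  # kept: same year, second league
    Stint("Hypothetical Player A", 2000, "CARL", 420),  # kept: same league, later year
    Stint("Hypothetical Player B", 1999, "GCL", 120),   # dropped: under 150 PA
]
datapoints = build_datapoints(rows)
print(len(datapoints))
```

Player A contributes three datapoints here and Player B contributes none.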

During dataset construction, I removed any prospect who made the majors as a pitcher. It seems impossible to me to be able to predict the chance that a position-player-to-pitcher conversion will work from a prospect's hitting stats. But, I did not want to simply list them as failed hitters. So I simply excluded them from the dataset. This is probably not the best solution, but there were fewer than 10 of these players in total, anyway.

When making predictions or developing systems using this data, it is important to note that the 150 PA cutoff is a completely hard line. Since players with fewer than 150 PA were excluded, I have quite literally NO idea how well they do. THIS DATASET CANNOT MAKE ANY PREDICTION ABOUT A PROSPECT WITH FEWER THAN 150 PA! No excuses, no exceptions.

My basic method of analysis is to divide the prospects into groups based on known criteria, such as age, OPS, or K%. Attempting to perform direct regressions between, say, OPS and rWAR leads to poor results, since the massive number of minor leaguers who go nowhere overwhelms the actual prospects. Instead, I look to examine how well various categories of players do, then work from there. In this post, I'm not going very deep with prediction, and it's important to be careful.

For instance, suppose a 21 year old player in the New York-Penn League has an .834 OPS, a K% of 16.2%, and an ISO of .121 (these are, in fact, Trey Mancini's stats). We can look at the NYPL categorizations and find that his chances of success (cat. 2 or 3) given these values are 7.2% (age), 9.4% (OPS), 4.1% (K%), and 6.1% (ISO). BUT IN THE ABSENCE OF TESTING, THESE NUMBERS CANNOT BE COMBINED! Because there is a correlation between the predictive factors (eg, high OPS usually also means high ISO), the predictions are not independent, and any method of combining them must be tested for accuracy, via, for instance, a Brier score. I've started trying some ways, and the best-performing so far is a weighted geometric mean taking only age and OPS into account - but this is a topic for a future post.

Until I (or someone else) has this sort of evaluation thoroughly tested, IT IS WRONG TO COMBINE PREDICTIONS! Either report them separately, or rely only on age and OPS, which usually have the highest accuracy.
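For readers who haven't met it, the Brier score is just the mean squared difference between each predicted probability and the 0/1 outcome, and lower is better. The sketch below uses made-up outcomes and made-up "sharper" predictions; only the 5.65% base rate comes from the dataset.

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(predictions) != len(outcomes) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - o) ** 2 for p, o in zip(predictions, outcomes))
    return total / len(predictions)


# Made-up example: five player-seasons, one of which became a success.
outcomes = [0, 0, 0, 1, 0]

# Baseline: give everyone the dataset-wide success rate of 5.65%.
flat = [0.0565] * 5

# Hypothetical predictions that lean toward the eventual success.
sharper = [0.02, 0.02, 0.05, 0.30, 0.02]

print(brier_score(flat, outcomes))
print(brier_score(sharper, outcomes))
```

A combination method is only worth using if it scores lower than the simpler alternatives on the full dataset, which is the test described above.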

In any event, the main point here is to present the base statistics, the success rates broken down by components, and my attempt at a first pass to determine who should qualify as a prospect, period.

Basic Results

There were 7,630 player-seasons analyzed in this dataset. Of these, 6,140 (80.5%) failed to make the major leagues at all. Another 781 (10.2%) had fewer than 600 PA in the majors, and are in category 0 - the "cup of coffee" category. 278 players (3.6%) were active for fewer than 3 seasons (1600 PA) or failed to produce significantly above replacement level (<1 rWAR); these players are in category 1, loosely defined as AAAA players. Examples include Nolan Reimold (who has 2.2 rWAR but only 1081 PA), Wily Mo Pena (has 1845 PA but -1.8 rWAR), Geronimo Gil (887 PA, 1.1 rWAR), and Jeff Keppinger (3048 PA, 0.5 rWAR). Decent starters, who have at least 1600 MLB PAs and 1.0 rWAR, but less than 14.0 rWAR, are in category 2, which includes players like Cesar Izturis, Russell Branyan, Hank Blalock, Rajai Davis, Mike Morse, and Nate McLouth. Stars are defined as any player with at least 14 rWAR; this includes Magglio Ordonez, Adrian Beltre, and Nick Markakis, and, at the low end, Franklin Gutierrez, Joe Crede, and Edwin Encarnacion.

Some players are still active and thus able to increase or decrease their rWAR. In the vast majority of cases, these players are already category 2 or 3. The relatively small number of borderline players who are active (Jeff Keppinger is a good example) is an issue for this analysis, but I think a fairly small one.

The category divisions are somewhat arbitrary, especially the line between 2 and 3. However, for the most part I am only concerned with determining WHETHER a prospect will "succeed" or not, which I define as belonging to either category 2 or category 3, rather than trying to determine just how good he'll be if he does. I am also somewhat concerned with predicting who will make the majors at all. Luckily, the three chances tend to be rather parallel: a higher chance of stardom is indicative of a higher chance of making the majors, and vice versa.

This suggests to me the first important conclusion: the difference between "ceiling" and "floor" is probably overemphasized by prospect analyzers. You'll often read that a particular prospect has "limited upside" but a "high floor." But it's pretty unlikely to find a prospect that has any kind of floor at all. Prospects fail all the time, and unless you're talking about someone ranked in the national top 25-50 or so (see my post from about 1.5 years ago on the BA top 100 ranking), they're more likely to fail than to succeed. It's possible that these guys are trying to distinguish the future perennial all-star Robinson Canos of the world from the Nick Markakises, but that's a cut that goes pretty fine. Any prospect below AA can fail and is in fact reasonably likely to do so. In most cases, you can predict a prospect's chance of success, then divide it evenly between "future starter" and "future star." One caveat applies: if the prospect is in the absolute YOUNGEST age category for his league (17 and under in the GCL, 18 and under in the SAL or NYPL, 19 and younger or 20 in the CARL), their chance of stardom is a significantly greater portion of their chance of success.

Below is the first of many graphs, showing prospect success rates by league. This is one of the best ways I've got to show just how few guys actually make it. The blue bar represents the percent that never see the majors; successful players are the purple and teal bands at the top. The other major lesson here is that the NYPL is pretty weak: it has the lowest rates of players both making the majors and having success, presumably due to the presence of non-toolsy college players. We also see the relevance of what is usually called "major-league readiness:" the Carolina League sends the most players and has the highest success rate. (Also, my preliminary Eastern League data, which isn't published here, suggests that the success rate there is even higher.)

The most important factor in determining prospect success is age relative to league. The next graph shows just the success rate (chance of being in category 2 or 3) for each age/league combination. First, follow the bars of each color, as they decay to the right: this shows that for any league, an older player is less likely to be successful. Then, look at each individual age, and note how much being in a higher league improves a prospect's chance of success.

One note on this graph: There are nowhere near enough 17-year-old SAL or NYPL players, or enough 17 or 18-year-old CARL players, to plot. They are lumped in with the youngest age category for their respective league. Likewise, the GCL tops out at age 22, the NYPL at age 24, and the SAL and CARL at age 25, with any older players lumped into those values. Only the Carolina League had ANY successful players come out of its highest age category.

Also note that the highest success rate is only about 30%, for the youngest players in both the SAL and CARL. Without taking performance into account, this is as good as statistical prediction can get. Of course, the goal is to take performance into account.

As an extension to the previous graph, I'm going to present 4 more, one for each league, that shows the prospect outcome breakdown by age. The bottom two bars of each column of these graphs add to form the columns in the above graph. Note that the chance a prospect of the given age/league did NOT make the majors is given by one minus the total height of the column - so a 20-year-old in the GCL has about a 90% chance of missing the majors.

Looking at these graphs, we can develop an answer to a very important question: how old is too old for each league? At what age does good performance no longer really matter?

Your answer to this question will depend on what level of risk you want. At the most extreme, you'd probably pick an age with a literal 0% chance of success. These ages are:

GCL: 22
NYPL: 24
SAL: 25
CARL: 25

But we could also go with ages that just severely restrict a prospect's chance. Feel free to use the graphs and decide for yourself! My own preference is for GCL 21, NYPL 22, SAL 23, CARL 24.

In addition to investigating the relevance of age, I also investigated the relevance of performance, looking at 5 key statistics for each prospect-season: OPS, K%, BB%, ISO, and PA. Plotting these numbers against, say, rWAR is an approach that rarely gives good results, because the immense amount of zeroes overwhelms the signal. Instead, I group these statistics into categories, then find the rate of each result for each category.

I used four categorization methods. First, I grouped by same-distance changes in the predictive statistic. For example, the OPS groupings were <.600, .600-.699, .700-.799, .800-.899, and >.899. Second, I grouped by three different same-size methods: quintiles, sextiles, and octiles. I investigated the resulting rates by hand. I also plotted the values and attempted to generate a line of best fit, using each statistic to predict the odds of MLB success.
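The grouping step can be sketched in a few lines of Python. The OPS edges below are the same-distance groupings listed above; the helper names and the sample values are made up for illustration, and `statistics.quantiles` is just one way to get same-size cut points, not necessarily how the groups in this post were built.

```python
from statistics import quantiles

# Same-distance OPS groups: <.600, .600-.699, .700-.799, .800-.899, >.899
OPS_EDGES = [0.600, 0.700, 0.800, 0.900]


def bin_index(value, edges):
    """Index of the group a value falls in, given ascending lower edges."""
    return sum(value >= e for e in edges)


def success_rate_by_bin(values, successes, edges):
    """Return (n_players, success_rate) per group; rate is None for an empty group."""
    counts = [0] * (len(edges) + 1)
    hits = [0] * (len(edges) + 1)
    for v, s in zip(values, successes):
        i = bin_index(v, edges)
        counts[i] += 1
        hits[i] += int(s)
    return [(n, (h / n if n else None)) for n, h in zip(counts, hits)]


def equal_size_edges(values, n_groups):
    """Cut points for same-size groups (5 = quintiles, 6 = sextiles, 8 = octiles)."""
    return quantiles(values, n=n_groups)


# Made-up sample: OPS values and whether each player ended up in category 2 or 3.
ops = [0.550, 0.610, 0.720, 0.834, 0.910, 0.905]
succeeded = [0, 0, 0, 1, 1, 0]

print(success_rate_by_bin(ops, succeeded, OPS_EDGES))
print(equal_size_edges(ops, 5))
```

The same `success_rate_by_bin` call works with either set of edges, so the same-distance and same-size groupings can be compared directly.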

A few conclusions could be drawn here:

1) Higher OPS is a very good sign for any prospect in any league. Generally, being in the top OPS category (over .899) gave a prospect the same odds of success as being in the second-youngest age category for that league. Also, very, very few prospects succeed after even one bad year. Of the 1138 prospects that had a <.600 OPS, only 14 (1.2%) succeeded. A low OPS is most forgivable in the GCL: poor-performing prospects there had a 1.8% chance of success; in the other three leagues, it's 0.8%.

2) Strikeouts matter, and high strikeouts are worst at the lowest levels. No GCL or NYPL prospect succeeded after striking out over 28% of the time. Generally, the fewer strikeouts, the better; however, being in the lowest strikeout category does not boost odds of success as high as being in the highest OPS category.

3) Walks don't matter. There is near-zero correlation between a prospect's walk rate and his chances of future success. In one or two cases, it appears that walk rates at both extremes (below 5% and above 15%) reduce chances of success, but I suspect this is a sample size artifact, and even if it is real, the effect is small. I believe that a prospect's walk rate can safely be ignored.

4) ISO is indicative, but probably unhelpful. Higher ISOs generally tend to correlate with improved chances of success. However, the effect isn't enormous, and small ISOs do not tend to reduce chances nearly as much as small OPS or high K%. I suspect that the ISO effect is simply a statistical echo of the OPS effect, since players with high OPS usually also have a high ISO. It's possible that ISO could be useful as a predictor (I can't rule it out), but I don't think it's especially good.

5) Incredibly, the number of PAs a prospect gets is quite a good predictor of their odds of success. The usual pattern is that PA values below the median or so (180 PA in the GCL, 220 PA in the NYPL, 400 PA in the SAL or CARL) are unpredictive (best fit is a flat line), then the chance of success rapidly increases from that PA value going up. I don't have a good explanation for this: perhaps it's indicative of athletic guys who don't get injured, perhaps teams give their best players the most PA, and perhaps it's a sign that prospects spending an entire year at one level is a superior development system to that of midseason promotions. But the effect is definitely there and, at the high end, quite strong.

The best-fit graphs for OPS and K% are presented here. The fact that poor performance is more forgivable at the lower levels is clear, as is the fact that high strikeouts are less forgivable at the lower levels. The NYPL fits are very weird, which I think is due to the fact that the base rate of success is lowest in the NYPL. Still, it's a little surprising that it would be so different.

And here are the equations used to make those lines. Note that they are least reliable at the edges, where there are fewer points to constrain them. I especially wouldn't trust the NYPL predictions on the extremes. For the K% equations, input the K% as a decimal (i.e. .20 for 20%).

What About Using These In Combination?

The obvious next step is to try to figure out how these predictors overlap. If being young for the league is good, and having a high OPS is good, how good is having both at the same time? Unfortunately, this is tough to do. At the extremes, the sample sizes tend to be small, and when you have two intersecting extremes, they get so small as to preclude this group-and-calculate-percentages method from being effective. For instance, there are only 63 players in the Sally League age 18 and younger, and of those, only 2 had an OPS above .900: one was Adrian Beltre (cat 3) and one was Delmon Young (cat 2). I'm not comfortable taking a 2-for-2 result and predicting 100% chance of success for future over .900, under 19 SAL players. Likewise, seven players in this group had an OPS between .800 and .900. Five succeeded; two failed. Can we then reliably predict a 71% success rate? I don't think so.

This sample size problem is frequently compounded by low rates of success. Even in a 150 person sample, if there are only 2 successes, you retain a high rate of variance.

There are possible solutions; one is to add uncertainties based on the size of the age/league/OPS/K% category a given prospect falls in; another is to avoid grouping so severely and use the best-fit predictors instead. Ideally, we'll test some of these against each other to generate full sets of predictions that can be compared via Brier score. This is the subject of my current research.
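One standard way to attach an uncertainty to a small group is a Wilson score interval on the success rate. To be clear, this is a textbook technique swapped in for illustration, not the method used or tested in this post; the function name and the 95% level (z = 1.96) are assumptions.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Approximate confidence interval for a success rate (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Groups discussed in this post: 2-for-2, 5-for-7, and 2 successes in 150.
for hits, size in [(2, 2), (5, 7), (2, 150)]:
    low, high = wilson_interval(hits, size)
    print(f"{hits} of {size}: {low:.1%} to {high:.1%}")
```

Under these assumptions, the 5-for-7 group comes out at roughly 36% to 92%, and the 2-for-2 group at roughly 34% to 100%, which puts numbers on why neither result can be taken at face value.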

There are also issues of independence. As discussed above for ISO, many of the performance indicators correlate with each other: high K% usually goes hand-in-hand with low OPS, for example. When generating predictions, this reinforcement effect must be accounted for and removed. Testing via Brier score should reveal when overlapping indicators are reinforcing each other to the point that they reduce the effectiveness of the prediction.

Additional Issues

One of the biggest problems with this method of prospect evaluation is that it ignores defensive and positional value. Many of the prospects who succeed with worse minor league performance are shortstops or catchers (Jhonny Peralta struck out a ton one year). Taking this into account is very difficult.

As always, there is no end-all, be-all method of predicting prospects. Scouting is important, and the predicted chance of success will always be a guideline that can be adjusted by perceived defensive ability and/or toolsiness. I would be wary of making BIG adjustments, however. Shifts of 1% are actually quite large, and shifts of 8% are enough to move someone from the fringes to be a top prospect.

Here's a Recap of the Most Important Conclusions:

1) Very few prospects make it - for prospects below AA, there is no "floor" of solid-regular, just about ever, unless the prospect is very young.
2) Age/league is most important.
3) There is such a thing as too old.
4) Performance also matters: OPS and K% are important, BB% is not.
5) The New York-Penn League is weaker and harder to predict than the others.
6) PAs have an effect, but the cause is unclear.

Quick-and-Dumb Prediction System

Ideally, I would like to test a variety of prediction methods by generating full prediction sets and comparing Brier scores for each. Hopefully I'll find the time to do this eventually. Until then, I've got a first step ready.

The idea is this: we can identify at the very least those players who are very unlikely to succeed and remove them from the pool. Then we can recalculate the success rate for the remaining prospects, and simply use that as the prediction. In essence, it's a simple 2-bin categorization. As a plus, I should be able to use it as I generate a prediction system, as I can ignore those who have already been identified as non-prospects.

My goal was to have literally zero ultimately successful players in the non-prospect pool; in this, I did not succeed. However, I did manage to keep the non-prospect success rate below 1% in all cases, and there were so few ultimately successful non-prospects that I'm actually able to list all of them.

So here's the set of criteria that make a given player a non-prospect. Note that meeting ANY of these criteria is enough to rule a player out. The successful players identified as non-prospects by the system are listed by the criterion that ruled them out.

Code:

GCL:
Any player age 22 or older
Any player age 21 with <.800 OPS
Any player age 20 with <.750 OPS (causes miss Melvin Mora)
Any player age 19 with <.650 OPS (causes miss Laynce Nix)
Any player age 18 or 17 with <.540 OPS
Any player age 21 who strikes out over 12% of the time
Any player age 20 who strikes out over 13% of the time (causes miss Travis Hafner)
Any player age 19 who strikes out over 15% of the time
Any player age 18 or 17 who strikes out over 24% of the time (causes miss Garrett Jones)
No restrictions for players under 17
NYPL:
Any player age 23 or older (causes misses Ben Zobrist and Jose Macias)
Any player age 22 with <.650 OPS
Any player age 21 with <.640 OPS (causes misses Matt Diaz and Toby Hall)
Any player age 20 or 19 or 18 with <.575 OPS
No OPS restrictions for players under 18
Any player age 22 who strikes out over 25% of the time
Any player age 21 or less who strikes out over 28% of the time
SAL:
Any player age 24 or older (causes misses Ben Zobrist, Luke Scott, and Jose Macias)
Any player age 23 with <.850 OPS (causes miss Nyjer Morgan)
Any player age 22 with <.675 OPS (causes miss Tony Womack)
Any player age 21 or 20 with <.650 OPS (causes misses Kevin Stocker, Endy Chavez, and Chris Woodward)
Any player age 23 who strikes out over 18% of the time
Any player age 22 who strikes out over 28% of the time*
Any player age 21 who strikes out over 21% of the time (causes miss Travis Hafner)
Any player age 20 who strikes out over 31% of the time
No restrictions for players under 20
*If you set it to 21%, you miss Ryan Howard, Brad Hawpe, Fred Lewis, and Travis Hafner (and eliminate 90 misses). Notably, all four were drafted out of college, and, except for Hafner, in their second pro year.
CARL:
Any player age 24 or older (causes misses Ben Zobrist, Luke Scott 2002, Luke Scott 2003, Luke Scott 2004, Nyjer Morgan)
Any player age 23 with <.700 OPS
Any player age 22 with <.620 OPS
Any player age 21 or 20 with <.580 OPS
Any player age 23 who strikes out over 24% of the time
Any player age 22 who strikes out over 25% of the time
Any player age 21 or 20 who strikes out over 27% of the time
No restrictions for players under 20

In most cases, this all makes a lot of sense. The older the player, the better the performance must be to retain prospect status. The only oddity applies to players who were 22 in the Sally League and struck out a lot - but I'm hopeful future work that breaks out recently drafted college players will help there.
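To show how the criteria work as a filter, here is a minimal sketch that encodes only the Carolina League rows of the list above. The function and table names are made up for illustration, and it assumes age is a whole number and K% is given as a decimal. The other three leagues would need their own tables.

```python
MIN_PA = 150  # below this, the dataset cannot say anything about a player

# Carolina League criteria: age -> (minimum OPS, maximum K% as a decimal)
CARL_RULES = {
    23: (0.700, 0.24),
    22: (0.620, 0.25),
    21: (0.580, 0.27),
    20: (0.580, 0.27),
}
CARL_TOO_OLD = 24  # any player this age or older is ruled out


def is_carl_non_prospect(age, ops, k_rate, pa):
    """True if ANY Carolina League criterion rules the player out."""
    if pa < MIN_PA:
        raise ValueError("fewer than 150 PA: no prediction is possible")
    if age >= CARL_TOO_OLD:
        return True
    if age not in CARL_RULES:
        return False  # no restrictions for players under 20
    min_ops, max_k = CARL_RULES[age]
    return ops < min_ops or k_rate > max_k


# Hypothetical player-seasons, not real stat lines.
print(is_carl_non_prospect(age=23, ops=0.750, k_rate=0.26, pa=410))  # True (K% over 24%)
print(is_carl_non_prospect(age=22, ops=0.700, k_rate=0.20, pa=380))  # False
print(is_carl_non_prospect(age=19, ops=0.550, k_rate=0.35, pa=200))  # False (under 20)
```

Raising an error for players under 150 PA, rather than returning an answer, keeps the hard cutoff from being skipped by accident.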

Once you've applied these criteria, you get the following breakdown, given as League: success rate for non-prospects, success rate for prospects, percent of ultimately successful players identified as non-prospects

I think that's pretty good! We miss about 5% of the ultimately successful players, but we rule out huge numbers of those who are unsuccessful. And it's worth noting that many of the same names keep showing up, and often have unusual circumstances associated with them: of the 22 misses, Ben Zobrist is 3 of them (older college player, debuted at 23); Luke Scott is 4 of them (older college player who then needed time to recover from TJ surgery); Nyjer Morgan is 2 of them; and Travis Hafner is 2 of them. Of course, people often look for excuses, and much of the strength of this method comes from ignoring narratives that are often built by people who are rooting for a particular prospect. So we have to accept the 5% miss rate, at least for now. Hopefully I can do better when I try to work out a comprehensive, prospect-by-prospect prediction system.

Though it isn't designed to, and there is certainly a different and better set of criteria out there, we can also look at how this prediction method does at telling us who will make the majors. I find these data less useful, since who cares if someone got 40 PA over two seasons with -0.1 rWAR? But anyway, here are the results:

So that's the quick and dumb prediction method. Use the above criteria to decide whether a player falls into the "prospect" or "non-prospect" category, then predict the corresponding chance of success. Remember, the player MUST have at least 150 PAs within a single league.

List of Prospects in 2006 Carolina League/list of non-prospects who made majors

As a means of testing this prediction system, I applied it to the Carolina League in 2006. Below are two lists, one of all the players identified as prospects, the other of all those players identified as non-prospects who have made the majors.

So of the 105 players who had at least 150 PA in the Carolina League that year, I identify 39 prospects and 66 non-prospects. Of the non-prospects, only 5 made the majors (8%) and only 1 was a successful player (1.5%). Of the prospects, 20 made the majors (51%) and 2 were successful (5.1%). The success rate for the prospects is much lower than I would like, but this was a bit of a weak year for the Carolina League, and there is still the chance that some of these players could perform enough in the future to move them to the successful category (Reimold, for instance, needs about 550 more PA to cross the threshold for success).
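The success rates in that paragraph are simple ratios, and the same arithmetic gives the share of successful players who were missed. The sketch below reproduces them from the 2006 Carolina League counts; the function name and dictionary keys are made up for illustration.

```python
def two_bin_summary(n_prospects, prospect_successes, n_non_prospects, non_prospect_successes):
    """Success rate in each bin, plus the share of all successes that were missed."""
    total_successes = prospect_successes + non_prospect_successes
    return {
        "prospect_rate": prospect_successes / n_prospects,
        "non_prospect_rate": non_prospect_successes / n_non_prospects,
        "missed_share": (
            non_prospect_successes / total_successes if total_successes else None
        ),
    }


# 2006 Carolina League: 39 prospects (2 successful), 66 non-prospects (1 successful).
summary = two_bin_summary(39, 2, 66, 1)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

With only three successful players in total, one miss is a third of them, which is another reminder of how noisy a single league-season is.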

2012 Orioles prospects, 2013 Orioles prospects so far

Using the prediction method, I also went and looked at the Orioles' minor league hitters' statistics from last season (2012) and also how they've done so far this season, and gave them the appropriate prediction after sorting. Remember that this is a quick-and-dumb prediction method! Don't try to read too much into the numbers. It's just a sort. Also, it's only those hitters with at least 150 PAs, which limits the numbers for the 2013 GCL and Aberdeen. And of course the 2013 numbers will change between now and the end of the season.

The early results from my Brier testing suggest that a decent result can be achieved by taking the age/league chance of success, giving it a weight of .6, and the OPS best-fit chance of success, giving it a weight of .4, and finding the mean. I'm presenting that chance of success as well, just for illustration purposes.
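A minimal sketch of that weighting follows. The post mentions a weighted geometric mean elsewhere and simply "the mean" here, so both forms are shown and neither is claimed to be the exact formula behind the published numbers. The inputs are the NYPL age and OPS category rates quoted earlier for Trey Mancini; the OPS category rate is only a stand-in for the best-fit value, which isn't given in the text.

```python
def weighted_arithmetic_mean(p_age, p_ops, w_age=0.6, w_ops=0.4):
    """Weighted average of the two chances of success."""
    return (w_age * p_age + w_ops * p_ops) / (w_age + w_ops)


def weighted_geometric_mean(p_age, p_ops, w_age=0.6, w_ops=0.4):
    """Weighted geometric mean of the two chances of success."""
    return (p_age ** w_age * p_ops ** w_ops) ** (1 / (w_age + w_ops))


# 7.2% by age and 9.4% by OPS category (stand-in for the OPS best-fit value).
p_age, p_ops = 0.072, 0.094

print(f"arithmetic: {weighted_arithmetic_mean(p_age, p_ops):.1%}")
print(f"geometric:  {weighted_geometric_mean(p_age, p_ops):.1%}")
```

When the two inputs are this close together, the two means barely differ; the choice matters more when one input is much smaller than the other, because the geometric mean gets pulled toward the smaller value.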

Please note that a lot of the 2013 draftees (especially in the GCL) haven't reached 150 PAs, and therefore can't be judged by this system yet.

I'm encouraged by these lists. For the most part, it confirms what we already know - which is a good sign. This fairly simple and fairly unsophisticated statistical tool can rule out most of the guys who won't make it. It can't reliably rank the guys who might make it, but I'm working on that. And the early version of it is really promising, because that's more or less the order people would be ranking Orioles hitters, with the exception of Sawyer and Davis above Mancini and Yastrzemski.

What to take away here: if you are ranking prospects and you include one of the non-prospects on your list (unless it's a very long one), you may want to reconsider.

What next?

This is a big project, and it's taken a long time, both in terms of the duration between start date and this post, and the actual hours invested. Next, I'm going to try to duplicate what I just did in this post for pitching prospects, and get a handle on the base expected success rates by age, league, and performance. Unfortunately, there is no single stat for pitchers that is both (1) as informative and (2) as easy to calculate/look up as OPS is for hitters. Of course I intend to look at ERA, K/9, BB/9, K/BB, and WHIP. I expect my IP cutoff to be around 40; aside from just sounding good, it should be about the same amount of game time as 150 PA (usually just over 4 PA/IP). The biggest problem will be with evaluation: just using straight WAR probably won't be enough, as I'd like to be able to tell future starters from relievers. But those are bridges to cross when I reach them.

Additionally, I'd like to develop and refine a prediction method to improve those simple 13% or 14% numbers. The plan is to just calculate Brier scores for a whole bunch of different methods and pick the best. I'm going to try a variety of weighted means, predictions via categorization or fits, etc. etc. I'll let you know when I've got something I like.

I'd like to develop a prediction system for pitchers. That is quite far in the future, right now.

I'd also like to add data from the Eastern League, but it takes quite a while to collate, as there are far more major leaguers to look up and the AAAA guys who aren't prospects have to be removed.

As always, feedback is very welcome. I'm happy to explain my reasoning behind any decision made during the research process or to defend my conclusions. Please remember that prospect prediction is a very inexact science even when you try to put some numbers on it, and I'm not trying to claim otherwise. I consider this a simple first attempt to quantify the definition of a "prospect" and to begin to understand just how likely a given player is to succeed or fail in baseball.