Introduction

Often when I’m watching Major League Baseball games a player will come
up to bat or pitch and I’ll comment “former Oakland Athletic” and the
player’s name. It seems like there are always one or two players on the roster of every team who used to be an Athletic.

Let’s find out. We’ll use the Retrosheet database again, this time using
the roster lists from 1990 through 2014 and comparing it against the
40-man rosters of current teams. That data will have to be scraped off
the web, since Retrosheet data doesn’t exist for the current season and
rosters change frequently during the season.

Methods

As usual, I’ll use R for the analysis, and rmarkdown to produce this
post.

We’re using plyr and dplyr for most of the data manipulation and
rvest to grab the 40-man rosters for each team from the MLB website.
Setting stringsAsFactors to FALSE prevents various base R functions from converting everything to factors. We're not doing any statistics with this data, so factors aren't necessary, and they make comparisons and joins between data frames more difficult.
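
Here's a minimal sketch of that setup, assuming only the packages named above; the original chunk may have looked a little different:

    library(plyr)    # older split-apply-combine tools
    library(dplyr)   # data manipulation and database access
    library(rvest)   # scraping the 40-man rosters from the web

    # keep character columns as characters so joins and comparisons work on
    # plain strings rather than factor levels
    options(stringsAsFactors = FALSE)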

The Retrosheet database lives in PostgreSQL on my computer, but one of
the advantages of using dplyr for retrieval is that it would be easy to change the source statement to connect to another sort of database (SQLite, MySQL, etc.), and the rest of the commands would stay the same.
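
A sketch of the connection; the database and table names (retrosheet, rosters) are placeholders rather than the names actually used:

    # connect to the local PostgreSQL Retrosheet database
    retrosheet_db <- src_postgres(dbname = "retrosheet")
    rosters <- tbl(retrosheet_db, "rosters")

    # switching to another backend would only change the src_* line, e.g.
    # retrosheet_db <- src_sqlite("retrosheet.sqlite3")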

We only grab data since 1990 and we combine the first and last names
into a single field because that’s how player names are listed on the
40-man roster pages on the web.
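
Something along these lines, assuming the roster table has year, first_name, last_name and team columns (the real column names may differ):

    rosters_since_1990 <- rosters %>%
        filter(year >= 1990) %>%
        collect() %>%
        mutate(player = paste(first_name, last_name))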

Now we filter the list down to Oakland Athletics players and combine the rows for each player, summarizing the years they played for the A’s into a single column.
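
A hedged sketch of that step, using "OAK" for the Oakland team code:

    former_athletics <- rosters_since_1990 %>%
        filter(team == "OAK") %>%
        group_by(player) %>%
        summarise(years = paste(sort(unique(year)), collapse = ", "))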

Current 40-man rosters

Major League Baseball has the 40-man rosters for each team on their
site. In order to extract them, we create a list of the team identifiers
(oak, sf, etc.), then loop over this list, grabbing the team
name and all the player names. We also set up lists for the team names
(“Athletics”, “Giants”, etc.) so we can replace the short identifiers
with real names later.
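
Roughly like this; the URL pattern and CSS selector are my guesses at the 2015-era MLB site rather than the post's actual code, and only three teams are shown:

    team_ids <- c("oak", "sf", "nyy")      # the real code lists all 30 teams
    team_names <- c("Athletics", "Giants", "Yankees")

    rosters_40man <- ldply(seq_along(team_ids), function(i) {
        url <- paste0("http://mlb.mlb.com/team/roster_40man.jsp?c_id=",
                      team_ids[i])
        players <- read_html(url) %>%
            html_nodes("td.dg-name_display_first_last a") %>%   # selector is a guess
            html_text()
        data.frame(team = team_names[i], player = players)
    })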

Combine the data

To find out how many players on each Major League team used to play for
the A’s we combine the former A’s players with the current rosters using
player name. This may not be perfect due to differences in spelling
(accented characters being the most likely issue), but the results look
pretty good.
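
A sketch of the join and the per-team counts, carrying over the column names from the earlier sketches:

    # join on player name; imperfect, as noted above, but close enough
    former_as_now <- rosters_40man %>%
        inner_join(former_athletics, by = "player")

    former_as_now %>%
        count(team, sort = TRUE)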

Pretty cool. I do notice one problem: there are actually two Chris
Youngs playing in baseball today. Chris Young the outfielder played for
the A’s in 2013 and now plays for the Yankees. There’s also a pitcher
named Chris Young who shows up on our list as a former A’s player who
now plays for the Royals. This Chris Young never actually played for the
A’s. The Retrosheet roster data includes which hand (left and/or right)
a player bats and throws with, and it’s possible this could be used with
the MLB 40-man roster data to eliminate incorrect joins like this, but
even with that enhancement, we still have the problem that we’re joining
on things that aren’t guaranteed to uniquely identify a player. That’s
the nature of attempting to combine data from different sources.

One other interesting thing. I kept the A’s in the list because the
number of former A’s currently playing for the A’s is a measure of how
much turnover there is within an organization. Of the 40 players on the
current A’s roster, only 22 of them have ever played for the A’s. That
means that 18 came from other teams or are promotions from the minors
that haven’t played for any Major League teams yet.

All teams

Teams with players on other teams

Now that we’ve looked at how many A’s players have played for other
teams, let’s see how the number of players playing for other teams is
related to team. My gut feeling is that the A’s will be at the top of
this list as a small-market, low-budget team that is forced to turn players over regularly in order to try to stay competitive.

We already have the data for this, but need to manipulate it in a
different way to get the result.
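
Here's one way the reshaping might look, again with the caveat that the column names are assumptions carried over from the earlier sketches:

    # every former team × player pair since 1990
    former_players <- rosters_since_1990 %>%
        distinct(team, player) %>%
        rename(former_team = team)

    former_players %>%
        inner_join(rosters_40man, by = "player") %>%
        count(former_team, team) %>%
        group_by(former_team) %>%
        arrange(desc(n)) %>%
        slice(-1) %>%        # drop the largest row: players back on the same team
        summarise(other_team_players = sum(n)) %>%
        arrange(desc(other_team_players))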

This is a pretty complicated set of operations. The main trick (and
possible flaw in the analysis) is to get a list similar to the one we
got for the A’s earlier, and eliminate the first row (the number of
players on a team who played for that same team in the past) before
counting the total players who have played for other teams. It would
probably be better to eliminate that case using team name, but the team
codes vary between Retrosheet and the MLB roster data.

The A’s are indeed at the top of the list, but surprisingly, so are the Padres. I had no idea the Padres had so much turnover. At
the bottom of the list are teams like the Giants and Phillies that have
been on recent winning streaks and aren’t trading their players to other
teams.

Current players on the same team

We can look at the reverse situation: how many players on the current
roster played for that same team in past years. Instead of removing the
current × former team combination with the highest number, we include
only that combination, which is almost certainly the combination where
the former and current teams are the same.
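
The flipped version of the previous sketch:

    former_players %>%
        inner_join(rosters_40man, by = "player") %>%
        count(team, former_team) %>%
        group_by(team) %>%
        arrange(desc(n)) %>%
        slice(1) %>%         # keep only the most common former-team combination
        arrange(desc(n))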

The A’s are near the bottom of this list, along with other teams that
have been retooling because of a lack of recent success such as the
Yankees and Dodgers. You would think there would be an inverse relationship
between this table and the previous one (if a lot of your former players are
currently playing on other teams they’re not playing on your team), but this
isn’t always the case. The White Sox, for example, only have 20 players on
their roster that were Sox in the past, and there aren’t very many of them
playing on other teams either. Their current roster must have been developed
from their own farm system or international signings, rather than by exchanging
players with other teams.

Yesterday I saw something I’ve never seen in a baseball game before: a runner
getting hit by a batted ball, which according to Rule 7.08(f) means the runner
is out and the ball is dead. It turns out that this isn’t as unusual an event
as I’d thought (see below), but what was unusual is that this ended the game
between the Angels and Giants. Even stranger, this is also how the game
between the Diamondbacks and Dodgers ended.

Let’s use Retrosheet data to see how often this
happens. Retrosheet data is organized into game data, roster data and event
data. Event files contain a record of every event in a game and include the
code BR for when a runner is hit by a batted ball. Here’s a SQL query to
find all the matching events, who got hit and whether it was the last out in the
game.
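
Here's a sketch of the query's shape, run from R via DBI. The table and column names (events, event_tx, runner_id, player_id, and so on) are hypothetical, since the actual schema depends on how the Retrosheet files were loaded, and the batting-team and roster-year logic is omitted:

    library(DBI)

    con <- dbConnect(RPostgres::Postgres(), dbname = "retrosheet")

    # BR in the event text marks a runner hit by a batted ball
    hit_by_batted_ball <- dbGetQuery(con, "
        SELECT sub.game_id, sub.runner_id, r.first_name, r.last_name,
               (sub.event_id = max_events.max_event_id) AS last_out_of_game
        FROM (SELECT game_id, event_id, runner_id
              FROM events
              WHERE event_tx LIKE '%BR%') AS sub
        JOIN rosters AS r
          ON r.player_id = sub.runner_id
        JOIN (SELECT game_id, max(event_id) AS max_event_id
              FROM events
              GROUP BY game_id) AS max_events
          ON max_events.game_id = sub.game_id")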

Here's what the query does. The first sub-query, sub, finds all the events
with the BR code, determines which team was batting and finds the id for the
player who was running. This is joined with the roster table so we can assign a
name to the runner. Finally, it’s joined with a subquery, max_events, which
finds the last event in each game. Once we’ve got all that, the SELECT
statement at the very top retrieves the columns of interest, and records whether
the event was the last out of the game.

Retrosheet has event data going back to 1922, but the event files don’t include
every game played in a season until the mid-50s. Since 1955, a runner being hit by a batted ball has ended a game twelve times, most recently in 2010.
On average, runners get hit (and are called out) about fourteen times a season.

Here are the twelve times since 1955 that a runner got hit to end a game. That is, until yesterday, when it happened twice in one day:

Yesterday, the Baltimore Orioles and Chicago White Sox played a game at Camden
Yards in downtown Baltimore. The game was “closed to fans” due to the riots
that broke out in the city after the funeral for a man who died in police
custody. It’s the first time a Major League Baseball game has been played
without any fans in the stands, but unfortunately it’s not the first time
there have been riots in Baltimore.

After Martin Luther King, Jr. was murdered in April 1968, Baltimore rioted for
six days, and local police plus more than eleven thousand National Guard, Army troops, and Marines were brought in to restore order. According to Wikipedia, six people died, more than 700 were injured, 1,000 businesses were damaged and close to six thousand people
were arrested.

At that time, the Orioles played in Memorial Stadium, about 4 miles
north-northwest of where they play now. I don’t know much about that area of
Baltimore, but I was curious to know whether the Orioles played any baseball
games during those riots.

Retrosheet has one game, on April 10, 1968, with
a reported attendance of 22,050. The Orioles defeated the Oakland Athletics by
a score of 3–1. Thomas Phoebus got the win over future Hall of Famer Catfish
Hunter. Other popular players in the game included Reggie Jackson, Sal Bando,
Rick Monday and Bert Campaneris for the A’s and Brooks Robinson, Frank Robinson,
Davey Johnson, and Boog Powell for the Orioles.

Introduction

One of the best sources of weather data in the United States comes from
the National Weather Service's Cooperative Observer Network (COOP),
which is available from
NCDC.
It's daily data, collected by volunteers at more than 10,000 locations.
We participate in this program at our house (station id DW1454 /
GHCND:USC00503368), collecting daily minimum and maximum temperature,
liquid precipitation, snowfall and snow depth. We also collect river
heights for Goldstream Creek as part of the Alaska Pacific River
Forecast Center (station GSCA2). Traditionally, daily temperature
measurements were collected using a minimum-maximum thermometer, which
meant that the only way to calculate average daily temperature was by
averaging the minimum and maximum temperature. Even though COOP
observers typically have an electronic instrument that could calculate
average daily temperature from continuous observations, the daily
minimum and maximum data is still what is reported.

In an earlier post we looked
at methods used to calculate average daily temperature, and whether there are any biases present in the way the National Weather Service calculates
this using the average of the minimum and maximum daily temperature. We
looked at five years of data collected at my house every five minutes,
comparing the average of these temperatures against the average of the
daily minimum and maximum. Here, we will be repeating this analysis
using data from the Climate Reference
Network stations in the United
States.

The US Climate Reference Network is a collection of 132 weather stations
that are properly sited, maintained, and include multiple redundant
measures of temperature and precipitation. Data is available from
http://www1.ncdc.noaa.gov/pub/data/uscrn/products/ and includes monthly,
daily, and hourly statistics, and sub-hourly (5-minute) observations.
We’ll be focusing on the sub-hourly data, since it closely matches the
data collected at my weather station.

The code to insert all of this data into a database can be found
here.
Once inserted, I have a table named crn_stations that has the
station data, and one named crn_subhourly with the five minute
observation data.

Methods

Once again, we’ll use R to read the data, process it, and produce plots.

Remove observations without temperature data, group by station and date,
calculate average daily temperature using the two methods, remove any
daily data without a full set of data, and collect the results into an R
data frame. This looks very similar to the code used to analyze the data
from my weather station.
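
Here's a sketch of what that might look like. The crn_stations and crn_subhourly table names come from the text above, while the database name and the station_id, observation_time and air_temperature column names are assumptions:

    library(dplyr)

    crn_db <- src_postgres(dbname = "crn")
    crn_stations <- tbl(crn_db, "crn_stations")
    crn_subhourly <- tbl(crn_db, "crn_subhourly")

    daily_comparison <- crn_subhourly %>%
        filter(!is.na(air_temperature)) %>%
        mutate(obs_date = date(observation_time)) %>%   # date() runs in Postgres
        group_by(station_id, obs_date) %>%
        summarise(n = n(),
                  true_mean = mean(air_temperature),
                  minmax_mean = (min(air_temperature) + max(air_temperature)) / 2) %>%
        filter(n == 288) %>%     # a full day of five-minute observations
        mutate(anomaly = minmax_mean - true_mean) %>%   # positive = min/max overestimates
        collect()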

Results

The average anomaly across all stations and all dates is 0.44 degrees
Celsius (0.79 degrees Fahrenheit). That’s a pretty significant error.
Half the data is between −0.1 and 1.0°C (−0.23 and +1.8°F) and the full
range is −11.9 to +10.8°C (−21.4 to +19.4°F).

Plots

Let’s look at some plots.

Raw data by latitude

To start, we’ll look at all the anomalies by station latitude. The plot
only shows one percent of the actual anomalies because plotting 512,460
points would take a long time and the general pattern is clear from the
reduced data set.
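
Something like this, joining the station latitudes in and sampling one percent of the rows; the latitude column name is an assumption:

    library(ggplot2)

    daily_with_latitude <- daily_comparison %>%
        inner_join(collect(crn_stations), by = "station_id")

    daily_with_latitude %>%
        sample_frac(0.01) %>%
        ggplot(aes(x = latitude, y = anomaly)) +
            geom_point(alpha = 0.25) +
            geom_smooth(method = "lm", colour = "blue") +
            labs(x = "Station latitude", y = "Anomaly (°C)")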

The clouds of points show the differences between the min/max daily
average and the actual daily average temperature, where numbers above
zero represent cases where the min/max calculation overestimates daily
average temperature. The blue line is the fit of a linear model relating
latitude with temperature anomaly. We can see that the fitted anomaly is positive everywhere, averaging around half a degree at lower latitudes and dropping somewhat as we proceed northward. You also get a sense from the actual
data of how variable the anomaly is, and at what latitudes most of the
stations are found.

The overall model and coefficients are highly significant, and show a
slight decrease in the positive anomaly as we move farther north.
Perhaps this is part of the reason why the analysis of my station (at a
latitude of 64.89) showed an average anomaly close to zero (−0.07°C /
−0.13°F).
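
A quick look at the linear model behind the blue line, using the data frame from the sketch above:

    anomaly_lm <- lm(anomaly ~ latitude, data = daily_with_latitude)
    summary(anomaly_lm)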

Anomalies by month and latitude

One of the results of our earlier analysis was a seasonal pattern in the
anomalies at our station. Since we also know there is a latitudinal
pattern in the data, let’s combine the two, plotting anomaly by month and faceting by latitude.

Station latitudes are binned into groups for plotting, and the plots themselves show the range that covers half of all anomalies for each latitude category × month combination. Including the full range of anomalies in each
group tends to obscure the overall pattern, and the plot of the raw data
didn’t show an obvious skew to the rarer anomalies.
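
The binning might look roughly like this; the break points are placeholders, not the categories used for the actual plot:

    daily_with_latitude <- daily_with_latitude %>%
        mutate(lat_bin = cut(latitude, breaks = c(20, 30, 40, 50, 60, 70)),
               month = as.integer(format(obs_date, "%m")))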

And the plot itself. At the end, we’re using a function called
facet_adjust, which adds x-axis tick labels to the facet on the
right that wouldn't ordinarily have them. The code comes from this
stack overflow
post.

Each plot shows the range of anomalies from the first to the third
quartile (50% of the observed anomalies) by month, with the dot near the
middle of the line at the mean anomaly. The orange horizontal line shows
the overall mean anomaly for that latitude category, and the count at
the bottom of the plot indicates the number of “station years” for that
latitude category.

It’s clear that there are seasonal patterns in the differences between
the mean daily temperature and the min/max estimate. But each plot looks
so different from the next that it’s not clear if the patterns we are
seeing in each latitude category are real or artificial. It is also
problematic that three of our latitude categories have very little data
compared with the other two. It may be worth performing this analysis in
a few years when the lower and higher latitude stations have a bit more
data.

Conclusion

This analysis shows that there is a clear bias in using the average of
minimum and maximum daily temperature to estimate average daily
temperature. Across all of the CRN stations, the min/max estimator
overestimates daily average temperature by almost half a degree
Celsius (0.8°F).

We also found that this error is larger at lower latitudes, and that
there are seasonal patterns to the anomalies, although the seasonal
patterns don’t seem to have clear transitions moving from lower to
higher latitudes.

The current length of the CRN record is quite short, especially for the
sub-hourly data used here, so the patterns may not be representative of
the true situation.

Abstract

The following is a document-style version of a presentation I gave at work a
couple weeks ago. It's a little less useful for a general audience because you
don't have access to the same database I have, but I figured it might be useful
for someone who is looking at using dplyr or at manipulating the GHCND data
from NCDC.

Introduction

Today we’re going to briefly take a look at the GHCND climate database
and a couple new R packages (dplyr and tidyr) that make data
import and manipulation a lot easier than using the standard library.

For further reading, consult the vignettes for dplyr and tidyr,
and download the cheat sheet:

GHCND database

The GHCND
database
contains daily observation data from locations around the world. The
README linked above describes the data set and the way the data is
formatted. I have written scripts that process the station data and the
yearly download files and insert it into a PostgreSQL database
(noaa).

“Tidy” data

Without going into too much detail on the subject (read Hadley Wickham’s paper for more information), the basic idea is that it is much easier to analyze data when it is in a particular, “tidy”, form. A tidy dataset
has a single table for each type of real world object or type of data,
and each table has one column per variable measured and one row per
observation.

For example, here’s a tidy table representing daily weather observations
with station × date as rows and the various variables as columns.

dplyr and tidyr are the data import and manipulation libraries
we will use, knitr is used to produce tabular data in report-quality
forms, ggplot2 and scales are plotting libraries, and
lubridate is a library that makes date and time manipulation easier.
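
The packages, loaded in one go:

    library(dplyr)      # data import and manipulation
    library(tidyr)      # spread and gather for reshaping
    library(knitr)      # report-quality tables
    library(ggplot2)    # plotting
    library(scales)     # axis and label formatting for ggplot2
    library(lubridate)  # easier date and time handling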

Also note the warnings about how several R functions have been “masked”
when we imported dplyr. This just means we'll be getting the
dplyr versions instead of those we might be used to. In cases where
we need both, you can preface the function with its package: base::filter would use the normal filter function instead of the
one from dplyr.

Each row in the observation table contains the station_id, date,
a variable code, the raw value for that variable, and a series of flags
indicating data quality, source, and special measurements such as the
“trace” value used for precipitation under the minimum measurable value.

Each row in the variables table contains a variable code, description
and the multiplier used to convert the raw value from the observation
table into an actual value.

This is an example of completely “normalized” data, and it’s stored this
way because not all weather stations record all possible variables, and
rather than having a single row for each station × date with a whole
bunch of empty columns for those variables not measured, each row
contains the station × date × variable data.

We are also missing information about the stations, so let’s load that
data:
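
Something like the following, where the connection mirrors the earlier setup and the station_ids are examples rather than the full list used in the talk:

    noaa_db <- src_postgres(dbname = "noaa")

    fairbanks_stations <- tbl(noaa_db, "ghcnd_stations") %>%
        filter(station_id %in% c("USW00026411",     # Fairbanks Intl Airport
                                 "USC00509999"))    # placeholder for the other long-term stations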

The first part is the same as before, loading the ghcnd_stations
table, but we are filtering that data down to just the Fairbanks area
stations with long term records. To do this, we use the pipe operator
%>% which takes the data from the left side and passes it to the
function on the right side, the filter function in this case.

filter requires one or more conditional statements with variable
names on the left side and the condition on the right. Multiple
conditions can be separated by commas if you want all the conditions to
be required (AND) or separated by a logic operator (& for AND, |
for OR). For example: filter(latitude > 70, longitude < -140).

When used on database tables, filter can also use conditionals that
are built into the database which are passed directly as part of a
WHERE clause. In our code above, we’re using the %in% operator
to select the stations from a list.

Now we have the station_ids we need to get just the data we want
from the observation table and combine it with the other tables. Each step is described below, followed by a sketch of the full pipeline.

Join the observation table with the filtered station data, using
station_id as the variable to combine against. Because this is an
“inner” join, we only get results where station_id matches in
both the observation and the filtered station data. At this point we
only have observation data from our long-term Fairbanks stations.

Join the variable table with the Fairbanks area observation data,
using variable to link the tables.

Add a new variable called value which is calculated by
multiplying raw_value (coming from the observation table) by
raw_multiplier (coming from the variable table).

Remove rows where the quality flag is not an empty space.

Select only the station name, date, variable and actual value columns
from the data. Before we did this, each row would contain every
column from all three tables, and most of that information is not
necessary.

Finally, we “collect” the results. dplyr doesn’t actually perform
the full SQL until it absolutely has to. Instead it’s retrieving a
small subset so that we can test our operations quickly. When we are
happy with the results, we use collect() to grab the full data.
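
Putting those steps together, the pipeline might look roughly like this; the observation and variable table names (ghcnd_obs, ghcnd_variables) and the date, flag and name column names are assumptions about the local schema:

    fairbanks_raw <- tbl(noaa_db, "ghcnd_obs") %>%
        inner_join(fairbanks_stations, by = "station_id") %>%
        inner_join(tbl(noaa_db, "ghcnd_variables"), by = "variable") %>%
        mutate(value = raw_value * raw_multiplier) %>%
        filter(quality_flag == " ") %>%
        select(station_name, date, variable, value) %>%
        collect()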

De-normalize it

The data is still in a format that makes it difficult to analyze, with
each row in the result containing a single station × date × variable
observation. A tidy version of this data requires each variable be a
column in the table, each row being a single date at each station.

To “pivot” the data, we use the spread function, and we'll also
calculate a new variable and reduce the number of columns in the result.

spread takes two parameters, the variable we want to spread across
the columns, and the variable we want to use as the data value for each
row × column intersection.
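
In sketch form (simplified: the extra calculated variable mentioned above is omitted):

    fairbanks_daily <- fairbanks_raw %>%
        spread(variable, value)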

Examples

Now that we've got the data in a format we can work with, let's look at a few
examples.

Find the coldest temperatures by winter year

First, let’s find the coldest winter temperatures from each station, by winter
year. “Winter year” is just a way of grouping winters into a single value.
Instead of the 2014–2015 winter, it’s the 2014 winter year. We get this by
subtracting 92 days (the days in January, February, March) from the date, then
pulling off the year.
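
A sketch of that calculation, assuming the spread step produced a tmin column; the completeness threshold here is a guess:

    coldest <- fairbanks_daily %>%
        mutate(winter_year = year(date - days(92))) %>%
        filter(!is.na(tmin)) %>%
        group_by(station_name, winter_year) %>%
        summarise(min_tmin = min(tmin), n = n()) %>%
        filter(n > 365 - 14) %>%    # drop winter years missing more than two weeks
        ungroup() %>%
        select(-n) %>%
        spread(station_name, min_tmin)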

See if you can follow the code above. The pipe operator makes it easy to
see each operation performed along the way.

There are a couple new functions here, group_by and summarize.
group_by indicates at what level we want to group the data, and
summarize uses those groupings to perform summary calculations using
aggregate functions. We group by station and winter year, then we use
the min and n functions to get the minimum temperature and the number of
days in each year where temperature data was available. You can see we
are using n to remove winter years where more than two weeks of data
are missing.

Also notice that we’re using spread again in order to make a single
column for each station containing the minimum temperature data.

Here’s how we can write out the table data as a restructuredText
document, which can be converted into many document formats (PDF, ODF,
HTML, etc.):
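
One way to do this is with kable from knitr, which can emit reStructuredText tables directly (the original code may have used something else):

    kable(coldest, format = "rst")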

To plot the data, we need the data in a slightly different format with
each row containing winter year, station name and the minimum
temperature. We’re plotting minimum temperature against winter year,
coloring the points and trendlines using the station name. That means
all three of those variables need to be on the same row.

To do that we use gather. The first parameter is the name of
the variable the columns will be moved into (the station names, which are
currently columns, will become values in a row named station_name).
The second is the name of the column that stores the observations
(tmin) and the parameters after that are the list of columns to
gather together. In our case, rather than specifying the names of the
columns, we're specifying the inverse: all the columns except winter_year.
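
That reshaping looks something like this:

    coldest_long <- coldest %>%
        gather(station_name, tmin, -winter_year)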