Sean Lahman, award-winning database journalist and author. Home of the Lahman Baseball Database, a free collection of statistics for Major League Baseball teams, players, and seasons from 1871 to present.

Baseball in the Age of Big Data:
Why the revolution will be televised

What is big data? Everybody’s heard the buzzword, but what does it mean? Today I’m going to talk about what we mean when we say “big data,” how it’s transforming the field of information technology, and how I think it’s going to impact baseball research in the next few years.

“Big data” does not simply mean a lot of data. It really means collecting all available data… every scrap and morsel of information that exists. When we talk about big data, we’re talking about a quantity of information so vast that it might as well be infinite.

It can be a tough concept to wrap your head around. A recent example is the NSA’s surveillance programs, such as PRISM, which collected data on every call and every text from every cell phone in the United States.

But the retail world is where big data has really made the most visible impact. The “Big Box” stores understand consumer behavior in ways that were never before possible, by collecting data on every transaction, on every item, in every store.

Netflix is a great example. They have 25 million users for their streaming service. They deliver 30 million video views a day. And they don’t just keep track of what you watch. They know when you pause or rewind. They know when you give up on a movie after five minutes, and they know when you watch 12 episodes of “The Office” in one sitting.

Big data gives them an incredible competitive advantage compared to broadcast networks, who rely on Nielsen ratings. The Nielsen ratings rely on surveys of very small samples. Participants keep a diary of what they watch, and who knows whether they’re telling the truth, or whether their viewing habits are representative of everyone else’s.

It’s fascinating, for example, for Netflix to observe the difference between what people say they want to watch and what they actually watch. Folks put Citizen Kane and Casablanca in their instant queue, but they watch “Breaking Bad” and the Hangover movies.

Big data has revolutionized the business world. Retailers are not simply guessing or sampling what’s happening; they are collecting detailed data in real time, which can be sliced in an infinite number of ways.

Traditionally, when we talk about working with data, we think of very well-defined information: rows and columns of numbers in spreadsheets, or structured tables in a relational database.

That model is becoming outdated, because with big data we’re often talking about collections of data that aren’t well structured, but are more like amorphous blobs. Rather than investing the work into organizing the data on the front end (what we call normalizing the data), we are increasingly moving toward systems that use artificial intelligence to extract answers.

IBM’s Watson is the best-known example. Watson used a combination of machine learning and natural language processing, combined with vast stores of information: 4 terabytes of data, including all of Wikipedia. IBM researchers were able to build a computer that could answer questions, and not just return an answer, but calculate the likelihood that the answer it came up with was right. If the confidence level was too low, it wouldn’t buzz in. And the best part was that when it got an answer wrong, it learned, so it could improve the next time.
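That confidence-gating idea is simple to sketch. Here’s a toy illustration of it, where the candidate answers and their scores are invented for the example; in the real system, scores came from many evidence-scoring models combined by a learned ranker.

```python
# Toy sketch of Watson-style confidence gating: only "buzz in" when the
# best candidate answer's estimated confidence clears a threshold.

def should_buzz(candidates, threshold=0.5):
    """Return the best answer if its confidence clears the bar, else None."""
    best_answer, best_conf = max(candidates, key=lambda c: c[1])
    return best_answer if best_conf >= threshold else None

# Low confidence across the board: stay silent rather than risk a wrong answer.
print(should_buzz([("Toronto", 0.32), ("Chicago", 0.14)]))          # None
# One candidate is a clear winner: buzz in.
print(should_buzz([("Casablanca", 0.91), ("Citizen Kane", 0.05)]))  # Casablanca
```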

We are teaching computers to think, not just to process a list of canned commands in order but to analyze information in an abstract way. And that’s what’s behind the push for vast, limitless collections of data to feed them.

So that’s a quick overview of big data: nearly infinite amounts of data, with a focus on collecting every possible detail that can be recorded, and using powerful computers to analyze it and deliver answers.

—

To tie that all into baseball, let’s step back and take a look at the history of baseball data, to get a sense for where we are.

I want to show you a few data points that will help illustrate the scope of what I’m talking about.

This is 1951, the year Turkin and Thompson published the first Barnes baseball encyclopedia. We had roughly 1,800 data points per season.

In 1969, when the first edition of the Macmillan baseball encyclopedia came out — Big Mac — we had about 12,000 data points per season.

If you go back and look at Bill James’s early Baseball Abstracts, much of his early analysis was not even really analysis; it was a call for improved data gathering. James introduced things like pitcher run support, umpiring statistics, and stolen-base stats for catchers. He would pore through box scores and compile data that wasn’t being compiled by anyone else. None of these things involved the creation of new formulas. He was simply counting things that weren’t being counted, building data sets and pulling out interesting bits.

As far as I’m concerned, that was the real genius of Bill James. He clearly understood that to make advances in our understanding of the game, we needed to make a quantum leap in the amount of information we had available.

Bill helped launch Project Scoresheet, as many of you know, which started collecting and sharing play-by-play data. Shortly thereafter, pitch-by-pitch data started to become available. Increasingly larger data sets.

And what happened?

Once we had the play-by-play data, a whole new world opened up. We started looking at player splits and situational stats: lefty vs. righty matchups, batting with runners in scoring position. Those differences became apparent because new data sets were available.

The availability of pitch-by-pitch data opened our eyes to pitch counts, and the fact that maybe it’s not a great idea for your 21-year-old phenom to throw 135 pitches every fifth day.

And here’s where we are with Pitch f/x data, which Major League Baseball has collected for every game since 2007.

This was made possible by technological advances in our ability to gather, store, and share this volume of data.

We have created more data about games played in the last five years than in the 140-odd years before that combined.

Every time there has been a surge in the amount of data available, there has been a corresponding surge in the quality of analysis, and thus in our understanding of the game.

And I would argue that we are in a golden era of baseball analysis.

But we are just beginning to scratch the surface. Technology is advancing so fast. In 3 or 4 years, we’ll look back at the Pitch f/x data and scoff at how primitive it was.

Here’s why: video.

Video has been a boon for fans. Access through MLB.TV or MLB At Bat has given more people more access to more games than ever before. That’s a great thing for fans, but an opportunity that we as a research community have not really begun to exploit.

Field f/x records high-resolution shots 15 times a second, identifying every human on the field. Each image is time-stamped, and the computer recognizes and records events that occur on the field: when the pitcher releases the ball, when the batter hits the ball, when a fielder gains possession of the ball, and when a fielder throws the ball.

That comes out to something like 2.4 billion data points per season
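A figure in the billions is roughly what a back-of-the-envelope calculation gives. The 15-snapshots-per-second rate is from the system’s specs as described above; the number of people tracked per snapshot and the amount of tracked action per game are my own assumptions, chosen only to show the order of magnitude.

```python
# Back-of-the-envelope check of "billions of data points per season."
# Frame rate comes from the talk; the other inputs are illustrative guesses.
SNAPSHOTS_PER_SECOND = 15     # Field f/x capture rate
PEOPLE_TRACKED = 18           # 9 fielders + batter + runners + umpires (rough)
TRACKED_SECONDS_PER_GAME = 3600  # assume ~1 hour of tracked action per game
GAMES_PER_SEASON = 2430       # 30 teams x 162 games / 2

points_per_game = SNAPSHOTS_PER_SECOND * PEOPLE_TRACKED * TRACKED_SECONDS_PER_GAME
points_per_season = points_per_game * GAMES_PER_SEASON

print(f"{points_per_season:,}")  # 2,361,960,000 -- about 2.4 billion
```

With those assumed inputs the total lands right around the 2.4 billion figure cited in the talk.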

It would be incredibly labor intensive for a human being to go through the video of a single game and make all of those measurements. And there would be human error and variability. But the Field f/x system isn’t constrained by those human limits. All of those measurements are made by the computer.

And you can imagine the insights such a data set might yield: true measures of fielding range, reaction times of fielders, runners’ speed from first to third.

Some of you may have seen the presentation Thursday by Mike Eckstein of KinaTrax. His company is in the early stages of deploying a motion-capture system used to generate biomechanical analysis of pitchers’ throwing motions. He described how it uses high-speed cameras to capture 10-12 gigabytes of video for each pitch. That’s 1.5 to 1.8 terabytes per pitcher per game.
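The arithmetic behind those terabyte figures is straightforward. The per-pitch sizes are from the talk; the pitch count below is my assumption, picked because roughly 150 captured pitches is what makes the two stated numbers consistent (it likely includes warm-up throws, since a starter typically throws about 100 pitches in game action).

```python
# Rough arithmetic behind "1.5 to 1.8 terabytes per pitcher per game."
GB_PER_PITCH_LOW = 10    # from the talk
GB_PER_PITCH_HIGH = 12   # from the talk
PITCHES_CAPTURED = 150   # assumed captures per pitcher per game (illustrative)

low_tb = GB_PER_PITCH_LOW * PITCHES_CAPTURED / 1000    # 1.5 TB
high_tb = GB_PER_PITCH_HIGH * PITCHES_CAPTURED / 1000  # 1.8 TB
print(low_tb, high_tb)  # 1.5 1.8
```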

This is the future of data analysis. It’s not increasingly larger spreadsheets. It’s raw video of games and smart computer systems that can analyze them.

How many people here today have an iPad?

That technology did not even exist 5 years ago. Today 90 million people have one, one-third of Internet users in the US.

Baseball teams have been using video for some time, particularly as a scouting tool. And if you’ve been around a clubhouse, either in the majors or the minors, you know that iPads are everywhere.

It used to be that a player had to go into the video room and watch cassette tapes of opposing pitchers. The theory remains the same, but the delivery systems for video have vastly improved.

When I was covering the NFL ten years ago, teams were cutting video after every game and passing out DVDs to each player. Now, pro and college teams have moved their playbooks to tablets. A coach can insert a new play from his desk and have it show up instantly in his team’s hands. And because the systems are interactive, he knows which players have seen it.

In my day job I write about technology for Gannett. I get to talk to some of the top research labs in the world, both at universities and large companies. I’ve been surprised at how many of them are working on video analysis.

They’re working on advances in computer vision – teaching computers to look at images and understand what they see.

You’re probably familiar with things like facial recognition software, but here are a few of the other cutting edge technologies I’ve encountered in my reporting.

License plate readers, mounted on police cars, can read the number from a passing car and check to see if the vehicle is stolen or has an expired registration. These systems can read four vehicles a second while moving at full speed.

Researchers at Xerox developed software that recognizes human gestures. They can tell, for example, that a patient who had hip surgery is trying to get out of bed.

At MIT, they’re using motion amplification and color amplification to detect heartbeat and respiration from a video image. It’s not infrared or some other sort of special video. The technology can be applied to existing videos; I saw a demo using footage from the latest Batman film.

Microsoft is working on visual tracking, teaching computers to identify people or objects and follow them. In the UK, where they have 1.85 million CCTV cameras, they’re teaching computers to recognize when a human passenger separates from his backpack. At UCSD, they have computers that learn how to drive by observing how people do it.

But for the civilian research community, we are not there. We don’t have access to data from systems like Field f/x — either the raw video footage or the advanced metrics that come out of it. We don’t have access to cutting edge technologies like high speed cameras or to supercomputers.

What we do have is broadcast footage of major league games back to 2010, and while it’s a cumbersome process, you can use the Pitch f/x data or play-by-play data to identify plays and then go look up the video. And the results can be pretty impressive, even though we are in the very earliest days of digital video as a tool for research.

Ben has also published some interesting video studies on pitch framing, a topic I don’t think anyone was even talking about a few years ago. And rather than just pontificating, he uses the Pitch f/x data and video to show how some catchers get a pitch called a strike that others don’t.