I joke sometimes that if I were ever marooned on a desert island and could bring one thing, I might take Baseball-Reference.com with me. The world’s greatest baseball website has enough content to keep a hardened fan or researcher occupied for months if not years. I’ve certainly killed weekends on it.
I was lucky enough recently to talk with Sean Forman, the founder of Baseball-Reference.com and its overarching group of businesses, Sports Reference LLC. Excerpts from our phone conversation are as follows:
Forman: Oh yeah. Just box scores, we have 200,000 box scores.
And then, we have probably 200,000-300,000 pages in the minor league players, total, who have pages. Then you’ve got another 40,000 teams. You’ve got 2,000 teams in the major leagues, plus then we’ve got like 10 different pages for each season. I’m guessing we’re into the millions, just to count splits, game logs, the whole nine yards. It’s probably well over a million. I’m sure there are over a million. There could be over two million distinct pages on the site.
When you’re dealing with as large an entity as Sports-Reference, what keeps you on-track and keeps you focused? I would think it would be so easy to get off-track and go off in a bunch of different directions.
It is. It’s… yeah, I’m not sure we are on-track. [laughs] We try to do some planning. We obviously have to fix any mistakes or bugs that people find. But yeah, it’s always challenging to make sure we’re making big strides rather than small steps that aren’t getting us where we want to go. It’s a challenge. That’s probably true of anybody trying to stay focused on what they want to be doing.
* * *
As far as the scope of minor league data on the site goes, do you have any ideas in mind to expand the amount of minor league data that’s up there?
The Japanese leagues are obviously not minor leagues, per se, but I think at some point this summer, we’re going to put up pretty complete Japanese league stats back to like the ’30s or ’40s. That’s kind of the big thing. Continuing to make progress on the Negro League stats, which we got from the Hall of Fame and from Outsider Baseball. It’s things like that and just continuing to get more league coverage and more complete leagues… in the minor leagues. It’s just ongoing. It’s one of those things where you work on it on a daily or weekly basis or there are other people who are working on it on a daily or weekly basis. You look up in two years, and you’ve gotten pretty far into the project. It seems daunting but if you try to make progress every day, you can move pretty far in not too long of a time.
I was talking to (Major League Baseball historian and author) John Thorn a few years ago, and he said one of the potential pitfalls with Negro League stat research is that some researchers will go so far as to hypothesize box scores. Have you heard of that kind of thing?
It’s very hard. The leagues were not well-defined. The barnstorming was obviously endemic and very important to the game, so how do you count those? It’s a messy situation… I think we have like 140 home runs for Josh Gibson, or something like that, but you could probably defend any number between 140 and 500 and make it sound reasonable. We’ll never know. We’re just never gonna know what those numbers are. And I mean, it’s unknowable, because different people are gonna have different views as to what should and shouldn’t be counted. Even if we knew what all the game results were, different people would count them differently…
There’s a famous mathematician, Paul Erdos, who would joke that he was excited to go to Heaven, because he figured God had all the proofs for all the theorems that we didn’t yet know how to solve. So he called it ‘The Book.’ He wanted to go to Heaven and see ‘The Book’ so he could learn what all these beautiful proofs that God had worked were, all these mathematical theorems. I figure God also has the Baseball Encyclopedia so when we go to Heaven, we’ll actually know what Ty Cobb’s hit total was and how many home runs Josh Gibson hit in his career. It’s unknowable. We’re doing the best we can but it’s not possible to really get those numbers. Even Ty Cobb’s hit totals, we don’t know exactly what that was.
* * *
I’m guessing you’re kind of limited on time and there’s probably certain things that you’d like to be able to do that you simply don’t have time to do. What’s one thing that you would expand on for Baseball-Reference if you had more time?
It would be some of the more modern stuff, like the PITCHf/x. I would love to go in and create some data presentations for that material but I just have not been able to set aside a three-month period to work diligently on that. I’m not sure how much of a big payoff that would be, either. It’s something that I’d love to do, but I just haven’t had time.
The PITCHf/x stuff, that’s become a big thing in the last few years, right?
Right. It’s a remarkable data set. It really turns the analysis of pitching on its head. You’re able to look at things at a granularity. And even catching, it’s revolutionized defensive [analysis for] catching. People are getting 30, 40, 50 run estimates for what Bengie Molina adds or Jose Molina adds in framing pitches. It’s fairly compelling stuff, so it’s interesting to see that. More data just creates better science and more interesting results.
That’s interesting, I didn’t realize PITCHf/x also lent itself to pitch framing. I’ve thought of it more as a pitcher’s stat but that totally makes sense. It’d be one of those stats that kind of goes both ways.
Right, because you’re able to see the location of the pitch and whether it was called a ball or a strike, so you can say this catcher, for whatever reason, he gets more strike calls on these pitches than the typical catcher does. There’s some really interesting articles on Baseball Prospectus and Fangraphs on it.
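For readers curious what that comparison looks like in practice, here is a minimal Python sketch of the idea Forman describes. It is not how Baseball-Reference, Baseball Prospectus, or Fangraphs compute framing. The record layout (catcher name, horizontal and vertical pitch location in feet, and whether the taken pitch was called a strike), the function names, and the half-foot grid are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def zone_bin(px, pz, size=0.5):
    """Map a pitch location (feet) to a coarse grid cell. Hypothetical helper."""
    return (math.floor(px / size), math.floor(pz / size))


def extra_strikes(pitches, size=0.5):
    """Estimate called strikes above average for each catcher.

    `pitches` is an iterable of (catcher, px, pz, called_strike) tuples for
    taken pitches only. This layout is an assumption, not a real data format.
    """
    pitches = list(pitches)

    # Step 1: how often is a pitch in each grid cell called a strike overall?
    totals = defaultdict(lambda: [0, 0])  # cell -> [strikes, pitches]
    for _, px, pz, strike in pitches:
        cell = totals[zone_bin(px, pz, size)]
        cell[0] += int(strike)
        cell[1] += 1

    # Step 2: credit each catcher with (actual call - typical strike rate).
    result = defaultdict(float)
    for catcher, px, pz, strike in pitches:
        strikes, n = totals[zone_bin(px, pz, size)]
        result[catcher] += int(strike) - strikes / n
    return dict(result)
```

A positive total means the catcher received more strike calls than the typical catcher did on pitches in the same locations. Turning that into a run estimate would need further steps, such as adjusting for the count, the umpire, and the pitcher, that this sketch leaves out.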
* * *
With Sports-Reference, do you ever get the feeling that you’re preserving history?
I’d say we’re putting a friendly face on it so people can find it more easily. I think our goal is to answer user questions and a big part of that is obviously the question of what happened… and who was this person and what did they accomplish and things like that. So yeah, definitely, we’re working to preserve history.