An interview with Sean Forman

I joke sometimes that if I were ever marooned on a desert island and could bring one thing, I might take Baseball-Reference with me. The world’s greatest baseball website has enough content to keep a hardened fan or researcher occupied for months if not years. I’ve certainly killed weekends on it.

I was lucky enough recently to talk with Sean Forman, the founder of Baseball-Reference and its overarching group of businesses, Sports Reference LLC. Excerpts from our phone conversation are as follows:

BPP: How many pages do you have overall on Baseball-Reference? I’ll say just Baseball-Reference right now. Do you have over a million pages on the site?

Forman: Oh yeah. Just box scores, we have 200,000 box scores.

Oh whoa.

And then, we have probably 200,000-300,000 pages in the minor league players, total, who have pages. Then you’ve got another 40,000 teams. You’ve got 2,000 teams in the major leagues, plus then we’ve got like 10 different pages for each season. I’m guessing we’re into the millions, just to count splits, game logs, the whole nine yards. It’s probably well over a million. I’m sure there are over a million. There could be over two million distinct pages on the site.

When you’re dealing with as large of an entity as Sports-Reference, what keeps you on-track and keeps you focused? I would think it would be so easy to get off-track and go off in a bunch of different directions.

It is. It’s… yeah, I’m not sure we are on-track. [laughs] We try to do some planning. We obviously have to fix any mistakes or bugs that people find. But yeah, it’s always challenging to make sure we’re making big strides rather than small steps that aren’t getting us where we want to go. It’s a challenge. That’s probably true of anybody trying to stay focused on what they want to be doing.

* * *

As far as the scope of minor league data on the site goes, do you have any ideas in mind to expand the amount of minor league data that’s up there?

The Japanese leagues are obviously not minor leagues, per se, but I think at some point this summer, we’re going to put up pretty complete Japanese league stats back to like the ’30s or ’40s. That’s kind of the big thing. Continuing to make progress on the Negro League stats, which we got from the Hall of Fame and from Outsider Baseball. It’s things like that and just continuing to get more league coverage and more complete leagues… in the minor leagues. It’s just ongoing. It’s one of those things where you work on it on a daily or weekly basis or there are other people who are working on it on a daily or weekly basis. You look up in two years, and you’ve gotten pretty far into the project. It seems daunting but if you try to make progress every day, you can move pretty far in not too long of a time.

I was talking to (Major League Baseball historian and author) John Thorn a few years ago, and he said one of the potential pitfalls with Negro League stat research is that some researchers will go so far as to hypothesize box scores. Have you heard of that kind of thing?

It’s very hard. The leagues were not well-defined. The barnstorming was obviously endemic and very important to the game, so how do you count those? It’s a messy situation… I think we have like 140 home runs for Josh Gibson, or something like that, but you could probably defend any number between 140 and 500 and make it sound reasonable. We’ll never know. We’re just never gonna know what those numbers are. And I mean, it’s unknowable, because different people are gonna have different views as what should and shouldn’t be counted. Even if we knew what all the game results were, different people would count them differently…

There’s a famous mathematician, Paul Erdős, who would joke that he was excited to go to Heaven, because he figured God had all the proofs for all the theorems that we didn’t yet know how to solve. So he called it ‘The Book.’ He wanted to go to Heaven and see ‘The Book’ so he could learn what all these beautiful proofs that God had worked were, all these mathematical theorems. I figure God also has the Baseball Encyclopedia so when we go to Heaven, we’ll actually know what Ty Cobb’s hit total was and how many home runs Josh Gibson hit in his career. It’s unknowable. We’re doing the best we can but it’s not possible to really get those numbers. Even Ty Cobb’s hit totals, we don’t know exactly what that was.

* * *

I’m guessing you’re kind of limited on time and there’s probably certain things that you’d like to be able to do that you simply don’t have time to do. What’s one thing that you would expand on for Baseball-Reference if you had more time?

It would be some of the more modern stuff, like the PITCHf/x. I would love to go in and create some data presentations for that material but I just have not been able to set aside a three-month period to work diligently on that. I’m not sure how much of a big payoff that would be, either. It’s something that I’d love to do, but I just haven’t had time.

The PITCHf/x stuff, that’s become a big thing in the last few years, right?

Right. It’s a remarkable data set. It really turns the analysis of pitching on its head. You’re able to look at things at a granularity. And even catching, it’s revolutionized defensive [analysis for] catching. People are getting 30, 40, 50 run estimates for what Bengie Molina adds or Jose Molina adds in framing pitches. It’s fairly compelling stuff, so it’s interesting to see that. More data just creates better science and more interesting results.

That’s interesting, I didn’t realize PITCHf/x also lent itself to pitch framing. I’ve thought of it more as a pitcher’s stat but that totally makes sense. It’d be one of those stats that kind of goes both ways.

Right, because you’re able to see the location of the pitch and whether it was called a ball or a strike, so you can say this catcher, for whatever reason, he gets more strike calls on these pitches than the typical catcher does. There’s some really interesting articles on Baseball Prospectus and FanGraphs on it.

* * *

With Sports-Reference, do you get the feeling ever that you’re preserving history?

I’d say we’re putting a friendly face on it so people can find it more easily. I think our goal is to answer user questions and a big part of that is obviously the question of what happened… and who was this person and what did they accomplish and things like that. So yeah, definitely, we’re working to preserve history.

Does he belong in the Hall of Fame? Sean Forman

Claim to fame: I’ll preface this by saying I was planning to write a column on Sean Forman before he bailed me out of a jam this morning. I signed up about a month ago for a free 30-day trial of the Play Index, a nifty tool on Forman’s website that allows for the kind of searches that used to take me hours. Want to know how many players in baseball history have at least 500 home runs and an OPS+ of 140? A quick Play Index search shows there to be 19.

My free trial expired on Sunday, and I put up $36 that evening for a year-long subscription. By some glitch in the system, though, perhaps a quirk of PayPal, my order was delayed for a few days during which time I couldn’t see the results of my P-I searches. I already don’t want to fathom writing regularly about baseball history without the index, so I sent an email this morning, and they fixed the glitch within an hour or so.

Such is the power of the most important baseball website ever. I’ll go a step further and say that I think Forman’s the most influential person in baseball research today. He’s a modern version of Henry Chadwick, a 19th-century statistician who invented the box score, batting average, and earned run average among other things. If Chadwick can have a place in the Hall of Fame, I’d argue for an eventual spot for Forman as well.

Current Hall of Fame eligibility: Chadwick has had a plaque in the Executives & Pioneers section of Cooperstown since 1938. At quick glance, he might be the only statistician enshrined, even if modern godfather of statistics Bill James is sorely overdue. That’s a story for another time, though James’ case and Forman’s as well could reasonably come before the Veterans Committee in the next decade or so.

Does he belong in the Hall of Fame? Some may sooner call James the most important baseball researcher today. But James has slowed in recent years, and while I respect his scholarship, he remains a highly polarizing figure. Some people zealously defend his work. Others have little use for it. Forman, meanwhile, continues to refine a website that appeals to analysts and traditionalists alike and draws several hundred thousand people a month. Just past his 40th birthday, Forman’s hopefully just getting started.

Consider how far baseball research online has come since Forman launched Baseball-Reference in 2000. A former college mathematics professor, he created his site after being unable to find stats for the likes of Ty Cobb on the Internet. By 2007, B-R was up to pages for all 17,000 players in MLB history, as well as 40,000 pages of Wikipedia-style content and 98,000 pages of box scores. Forman said that year:

I haven’t necessarily found all the data. The people at Retrosheet and the Society for American Baseball Research (SABR), they just do incredible work. I often say that I’m just putting a friendly face on the things that they’re doing. I certainly can’t take credit for getting the data in the raw format. But one of the things that I think the site does well is make this data easy to find. That’s always been a goal of mine, is to make things as quick and easy as possible.

I love that attitude, and at a time when people who’ve devised metrics like Wins Above Replacement are taking heat for a lack of transparency, I respect what Forman’s doing. More than that, I try to follow his example here.

End of day, I can only speak for myself, a blogger with no idea how much worse my work would be without Forman’s influence. Giving his organization $36 was the least I could do, and truth is, Forman’s done more for me than I’ve ever done for him. $36? Heck, I joke that I spend so much time on Baseball-Reference I may as well be paying the site rent.

