Baseball – God of the Machine
Jan 04 2007

Begin with a data set, preferably one in which many people are interested. Let’s say, World Series results from 1903 to the present.

Now ask a question about the data, one that should be easy to answer with a highly simplified model. Our question will be: have World Series teams, historically, been evenly matched?

Our model will ignore home-field advantage. In baseball the home team wins 53% or 54% of the time; nonetheless, we will assume that each team has a probability of 0.5 of winning each game. This gives the following expected probabilities for a best-of-seven series running four, five, six, or seven games:

P(4) = 0.125
P(5) = 0.250
P(6) = 0.3125
P(7) = 0.3125
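For anyone who would rather check these four numbers than trust them, here is a minimal sketch of the arithmetic. The function name and the `p` parameter are my own illustrative choices; the only assumption is the one stated above, that each game is an independent coin flip.

```python
from math import comb

def series_length_probs(p=0.5, wins_needed=4):
    """Probability that a best-of-(2*wins_needed - 1) series lasts each
    possible number of games, if one team wins each game with probability p."""
    q = 1 - p
    probs = {}
    for losses in range(wins_needed):
        games = wins_needed + losses
        # The series winner must take the final game, so the loser's wins
        # are scattered among the first (games - 1) games.
        ways = comb(games - 1, losses)
        # Either team can be the series winner.
        probs[games] = ways * (p**wins_needed * q**losses
                               + q**wins_needed * p**losses)
    return probs

probs = series_length_probs()
for games, prob in probs.items():
    print(f"P({games}) = {prob}")
```

Passing a value of `p` other than 0.5 shows how quickly sweeps become more likely once the teams are not evenly matched, which is the question the model is supposed to answer.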

Remember that if the model is too simple to fit the data, you can clean the data. Since 1903, the World Series has been played every year but two. There were a few best-of-nine series and a few more that included ties, which are too complicated to deal with. Throw them out. This leaves 95 series. Draw up a little chart comparing actual and expected probabilities, like so:

Possible outcomes   P(Expected)   P(Actual)
4-0                 0.125         0.179
4-1                 0.250         0.221
4-2                 0.3125        0.242
4-3                 0.3125        0.358

Now answer your own question. If the teams were evenly matched, the results would hew reasonably closely to the expected probabilities from the model. In fact there are anomalies. There are always anomalies. The World Series has been swept 17 times, five more than the model would predict. Plug this into the BINOMDIST function in Excel. (Understanding how this function works is optional and may in some cases be a disadvantage.) You find that, if the probabilities in the model were correct, there would be 17 or more sweeps in 95 occurrences only 8% of the time. A rotten break: you’re three lousy percent under statistical significance. But that aside, eleven of those were won by the team with the better regular-season record, several by teams considered among the all-time greats, including the 1927, 1939 and 1998 Yankees. That probably means something. On the other hand, the team that held the American League record for wins before 1998, the 1954 Indians, was swept by the Giants. Conclude judiciously that, on the whole, the data imply an occasional mismatch.
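For readers without Excel, a rough equivalent of the BINOMDIST step can be sketched in a few lines. With its cumulative flag set, BINOMDIST returns the probability of at most a given number of successes, so the chance of 17 or more sweeps is one minus the cumulative value at 16. The helper name below is mine, not Excel's, and this is a sketch of the calculation rather than a record of how the original figure was produced.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 17 or more sweeps in 95 series, if each series is swept with probability 0.125.
sweeps = binom_upper_tail(17, 95, 0.125)
print(f"P(17+ sweeps in 95 series) = {sweeps:.3f}")
```

The same helper handles every later tail probability in this post; only the three arguments change.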

Look for any bonus anomalies. It doesn’t matter if they have nothing to do with your original question. Our data set turns up a nice one; the series went to seven games 34 out of 95 times — five too many, according to the model. This would occur randomly, assuming correct probabilities, only 20% of the time.

Damn, we’ve missed out on statistical significance again. Instead of looking at how often the series went seven, we can look at how often the team behind 3-2 won the sixth game. 34 out of 57, a somewhat more unusual result. Plug it back into BINOMDIST: we’re down to 9%, which is close but not close enough.

It has become inconvenient to look at the entire data set; let’s take just a chunk of it, say, 1945 to 2002. In those 58 years the World Series lasted seven games 27 times, which would happen by chance a mere 1% of the time. Furthermore, the team behind 3-2 won the sixth game 27 of 39 times; again, a 1% chance. Statistical significance at last!
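The chunking step is worth seeing in code, because it works on pure noise. The sketch below first recomputes the Game Six figure quoted above, then generates 95 imaginary series in which seven-game outcomes occur with exactly the model's probability and hunts through every contiguous window of at least 30 series for the most impressive p-value. The simulated data, the seed, the 30-series minimum, and the function names are all assumptions of mine for illustration; none of this uses real World Series results.

```python
import random
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def best_window(outcomes, p, min_len=30):
    """Scan every contiguous window of at least min_len series and return
    (p_value, start, end) for the window with the smallest upper-tail p-value."""
    prefix = [0]
    for x in outcomes:
        prefix.append(prefix[-1] + x)
    best = (1.0, 0, len(outcomes))
    for start in range(len(outcomes)):
        for end in range(start + min_len, len(outcomes) + 1):
            hits = prefix[end] - prefix[start]
            p_value = upper_tail(hits, end - start, p)
            if p_value < best[0]:
                best = (p_value, start, end)
    return best

# The figure from the text: the trailing team wins Game Six 27 times in 39.
game_six = upper_tail(27, 39, 0.5)
print(f"P(27+ of 39) = {game_six:.3f}")

# Simulated "history": 95 series generated by the model itself (hypothetical data).
P_SEVEN = 0.3125
rng = random.Random(0)
outcomes = [1 if rng.random() < P_SEVEN else 0 for _ in range(95)]

full_p = upper_tail(sum(outcomes), len(outcomes), P_SEVEN)
best_p, start, end = best_window(outcomes, P_SEVEN)
print(f"all 95 series: p = {full_p:.3f}")
print(f"best window [{start}, {end}): p = {best_p:.3f}")
```

Because the full data set is itself one of the windows, the best window can never look worse than the whole, and it usually looks a good deal better. That is the entire trick.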

Next, concoct plausible explanations for your new, statistically significant anomaly. Maybe the team that is behind plays harder, with their backs against the wall. Maybe they use all of their best pitchers, holding nothing in reserve for the seventh game. Maybe the team that is ahead chokes and cannot close it out.

Under no circumstances should you test these explanations. In the World Series the team that won Game Six also won Game Seven 18 times out of 34 — not likely if they had squandered their resources to win Game Six. In basketball, in the NBA Finals, the team that led 3-2 won Game Six 26 times out of 45. This is the opposite of what we found in baseball, in a sport that rewards hard play more and is far more conducive to choking, as anyone knows who has tried to shoot a free throw in a big game. In other words, your explanations, though plausible, are false. The result is probably due to random variation. This should not discourage you from completing your article. Write up your doubts in a separate note several months later.

Finally, check the literature to make sure your idea is original. If it isn’t, which is likely, mention your predecessor prominently in your acknowledgements, and include a footnote in which you pick a few nits.

Submit to suitable journals. Repeat unto death, or tenure, whichever comes first.

Update: Actual professional statisticians comment. Evolgen, who may or may not be a professional statistician, comments.

Feb 21 2004

Five years ago, after the 1999 season, a fellow fantasy league baseball owner and I fell into an argument about Roger Clemens. Clemens was 37 years old. In 1998 he had a brilliant season with Toronto, winning the pitching triple crown — ERA, wins, and strikeouts — and his fifth Cy Young Award. In 1999, his first year with the Yankees, he slipped considerably, finishing 14-10 with an ERA higher than league average for the only time since his rookie season. His walks and hits were up, his strikeouts were down, and my friend was sure he was washed. He argued that Clemens had thrown a tremendous number of innings, that old pitchers rarely rebound from a bad season, and that loss of control, in particular, is a sign of decline. I argued that Clemens is a classic power pitcher, a type that tends to hold up very well, that his strikeout ratio was still very high, that his walks weren’t up all that much, and that his diminished effectiveness was largely traceable to giving up more hits, which is mostly luck.

Of course Clemens rebounded vigorously in 2000 and won yet another Cy Young in 2001. He turned out not to be finished by a long shot, and still isn’t. Does this mean I won the argument? It does not. Had Clemens hurt his arm in 2000 and retired, would my friend have won the argument? He would not.

Chamberlain wasn’t wrong about “peace in our time” in 1938 because the history books tell us Hitler overran Europe anyway. He was wrong because his judgment of Hitler’s character, based on the available information in 1938, was foolish; because, to put it in probabilistic terms, he assigned a high probability to an event — Hitler settling for Czechoslovakia — that was in reality close to an engineering zero. He would still have been wrong if Hitler had decided to postpone the war for several years or not to fight it at all.

“Time will tell who’s right” is a staple of the barroom pedant. Of course it will do no such thing: time is deaf, blind, and especially, mute. Yet it is given voice on blogs all the time; here’s Richard Bennett in Radley Balko’s comments section: “Regarding the Iraq War, your position was what it was and history will be the judge.” It’s not an especially egregious instance, just one I happened to notice.

Now you can take this too far. If your best-laid predictions consistently fail to materialize, perhaps your analyses are not so shrewd as you think they are. You might just be missing something. Or not. But this should be an opportunity for reflection, not for keeping score.

We fumble in the twilight, arguing about an uncertain future with incomplete knowledge. Arguments over the future are simply differences over what Bayesian probability to assign the event. There is a respectable opposing school, frequentism, which holds that Bayesian probability does not exist, and that it makes no sense to speak of probabilities of unique events; but it has lost ground steadily for the last fifty years, and if it is right then most of us spend a great deal of time talking about nothing at all. Like Lord Keynes, one of the earliest of the Bayesian theorists, we are all Bayesians now.

This, for argument, is good news and bad news. The good news is that history won’t prove your opponent out. The bad news is that it won’t prove you out either. You thrash your differences out now or not at all. Then how do you know who won the argument? You don’t. Argument scores like gymnastics or diving, not football. It will never, for this reason, be a very popular American indoor sport.

Jun 20 2003

Niceness counts, your mother used to tell you, and so it does, for you and me. When you are one of the best in the world at what you do, niceness stops counting. I am reminded of this by the sportswriters’ treatment of Barry Bonds.

Barry Bonds is one of the greatest hitters who ever lived, and his unearthly bat speed, unerring plate discipline and perfect balance make him a joy to watch. The pleasure he has given anyone who enjoys baseball, including some sportswriters, can never be repaid. He is also rather surly with the media and disinclined to give interviews. Tough. Nobody cares about Barry Bonds’ relations with the press except the press, and if they had any respect for greatness they would keep quiet about it.

Babe Ruth, in another era, was celebrated for promising to hit home runs for sick children, although by the authoritative account he was a lout. But really, does anything matter about him except the way he played baseball?

I have quoted Yvor Winters before on the relations between distinguished poets and scholars, but his words serve equally well to describe the relations between great athletes and sportswriters:

To the scholar in question, the poet is wrong-headed and eccentric, and the scholar will usually tell him so. This is bad manners on the part of the scholar, but the scholar considers it good manners. If the poet, after some years of such experiences, loses his temper occasionally, he is immediately convicted of bad manners. The scholar often hates him (I am not exaggerating), or comes close to hating him, but if the poet returns hatred with hatred (and surely this is understandable), he is labeled as a vicious character, for, after all, he is a member of a very small minority group.

David Halberstam, he’s talking to you.

Jacques Barzun, in The House of Intellect, has an anecdote about a distinguished jurist, a member of the Supreme Court, who was profiled in a newspaper article the largest point of which was that the jurist rose early every morning and cooked breakfast for his family. In the forty-odd years since Barzun’s book was published his anecdote has been reprised countless times, almost exactly in the case of Justice Rehnquist, about whom ten people could tell you that he put stripes on his gown and sings Christmas carols for every one who could tell you a thing about his jurisprudence. This is supposed to “humanize” great men. By “humanizing” is meant “making seem more like you and me,” although what is interesting about the great is precisely what makes them unlike the rest of us. These “human” qualities are attractive or unattractive, according to the disposition of the writer: they are always irrelevant. I don’t want to see great men humanized. I want to see them praised, or even damned, for the qualities that make them great. Everything else is pornography.

(Update: Howard Owens comments.)

Jun 02 2003

I just finished Michael Lewis’s terrific book about Billy Beane, the Oakland A’s general manager who consistently fields a great team with one of the lowest payrolls in the major leagues. The A’s are baseball commissioner Bud Selig’s particular albatross. Selig harps on the need for more baseball socialism (“revenue-sharing”) because of the alleged “inability of small market teams to compete,” when in fact it is only incompetently managed small market teams who can’t, Selig’s own Milwaukee Brewers prominent among them. Beane must drive him to drink. Now to anyone who has played fantasy baseball and read Bill James, which seems to be half of the male portion of the blogosphere, how to put together a winning baseball team with little money is no secret. You exploit inefficiencies, which is to say, you take advantage of the fact that many baseball executives are stupid. Certain traits are overvalued by other teams, like sculpted physiques or blazing speed or cannon arms. These don’t translate very well into on-field success anyway, and you ignore them. Other, more useful traits, like a deceptive pitching motion or the ability to draw walks, are undervalued, and these are what you look for.

The golden rule is that past performance indicates future performance, and ugly doesn’t count. Essentially you work from the spreadsheet instead of the scouting report. Scouts hate that. So do fans, stat geeks like me excepted, because it slights any knowledge of the game that comes from actually watching it. When I played in a fantasy league I would regularly tell other owners that they watched too much baseball, and that they needed to stop believing their own eyes. I was delighted to note that Beane often tells his scouts the same thing.

Beane himself is a former major-league player and hot prospect of exactly the type that he has trained himself, and his staff, to ignore. He was a high-school “tools” player, the type who looks better playing than he actually plays, and so highly regarded that many scouts and executives wanted to draft him first in his class, ahead of such future luminaries as Darryl Strawberry. But Beane’s tools never translated into major-league success. By his own account, his temper destroyed him as a player: he couldn’t cope with failure, and one bad at-bat would wreck his game, or his week.

In other words, Beane, instead of hiring in his own image, has become a brilliant success by doing the opposite. If there are other executives who have done this, I don’t know who they are.

(Dr. Manhattan reviews the book at greater length.)

(Update: Floyd McWilliams comments.)

(Update: Robert Birnbaum has an interesting interview with Lewis.)

Apr 30 2003

It requires a certain type of mind to excite itself over “fragments of fragments,” but the normally sober baseball analyst Rob Neyer exults giddily over them in his column the other day.

The question at issue is how lucky the 2002 Detroit Tigers were. On the one hand, they lost 106 games. On the other, if you apply Pythagorean analysis to their run margin, they “should” have lost 112 games. So they were lucky. But on the third hand, as one of Neyer’s correspondents points out, they scored fewer runs than one would expect from their offensive components, and allowed more than one would expect from the offensive components of their opponents, and they really should have lost 98 games. So they were unlucky.

But why stop there?

All hits, for example, are not created equal. If two players hit 120 singles, we consider those accomplishments the same. But what if one of the players hit 80 line drives and 40 ground balls with eyes, and the other hit 120 line drives? Would we expect them to match performances the next season?

No, we wouldn’t. We’d expect the guy with 120 line drives to outperform the guy who got lucky with the grounders.

That is just one tiny example, of hundreds we could come up with. And for the people who care about such things, finding the fragments of the fragments of the fragments is the next great frontier.

Ah, fragments of fragments of fragments. Perennial employment for baseball analysts! More work for Rob Neyer!

Neyer analogizes this process to pricing financial derivatives, which I happen to know something about, having worked as a programmer for several years for a software company that did exactly that. On slow afternoons the analytics boys would quarrel over whether to construct the yield curve using a two- or three-factor Heath-Jarrow-Morton model. Sure, with a two-factor model you might be able to price the bond to four decimal places, but with a three-factor model you can price it to seven! Eventually someone, usually me, would have to rain on their parade by pointing out that bonds are priced in sixteenths (of a dollar), and that the bid/offer spread dwarfs anything beyond the first decimal place.

In baseball granularity is not measured in sixteenths, but in wins. Since it takes about eight to ten additional runs for each additional win, any variance below five runs or so is a big, fat engineering zero. And I can assure Rob Neyer without even firing up a spreadsheet that a team’s line drive/ground ball ratio when hitting singles won’t get you anywhere near five runs. It’s barely conceivable that it could help you draft a fantasy team. Knock yourself out.
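The runs-per-win figure can be sanity-checked with the Pythagorean formula mentioned earlier. The sketch below uses the classic form, runs scored squared over the sum of the squares of runs scored and runs allowed; the exponent of 2, the 162-game schedule, and the run totals are illustrative assumptions, not numbers for any real team.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins from run totals, using the classic Pythagorean form."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)

# A hypothetical team that scores and allows 750 runs, then finds ten more runs.
base = pythagorean_wins(750, 750)
plus_ten = pythagorean_wins(760, 750)
print(f"750 scored, 750 allowed: {base:.1f} wins")
print(f"760 scored, 750 allowed: {plus_ten:.1f} wins")
print(f"extra wins from 10 runs: {plus_ten - base:.2f}")
```

For an average team, ten runs buy roughly one win, so an effect worth two or three runs disappears into rounding.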

Hitting has been well understood since John Thorn and Pete Palmer published The Hidden Game of Baseball twenty years ago. All work since has been on the margins. The new frontiers in baseball analysis lie elsewhere. Pitching is still imperfectly understood, because its results are mixed with fielding, which, until Bill James’s new book on Win Shares, was not understood at all. Voros McCracken (where do you sign up for a name like that?) recently demonstrated that a pitcher’s hits allowed, relative to balls in play, is almost entirely random. That’s serious work. Fragments of fragments is masturbation.

The lesson here, which applies more broadly to the social sciences, is not to seek more precision than is proper to your subject. Fortunately Professors Mises and Hayek have already given this lecture, and I don’t have to.

(Update: Craig Henry comments.)

Apr 22 2003

It’s been a while since I’ve thrown a sop to my baseball-oriented readers and the season is under way, so I’m gonna make it up to you with a new statistic, because the one thing baseball suffers from is not enough statistics.

I was trying to explain the game to an Icelandic friend of mine the other day. What’s with guys charging the mound? he wanted to know. (This from a hockey fan.) Well, they get upset when pitchers throw at them, I said. So why do the pitchers throw at them? he asked. To instill fear, I said. It’s a lot harder to hit when you’re worrying that the next pitch might come at your head. Don’t pitchers get thrown out for doing that? he asked. Yes and no, I explained. It’s complicated. Can’t they at least keep track of the pitchers who do it all the time and punish them later? he asked. Why yes, I mused. Yes they can. And then and there I conceived the VI, or Viciousness Index.

VI relies on the premise that a pitcher’s true wildness can be roughly judged by the number of walks he allows. The fewer he allows, the better idea he has of where the ball is going most of the time. So if he allows very few walks and still hits a lot of batters, the way Pedro Martinez does, one can assume that it’s not entirely or even mostly by accident. Therefore VI = HBP/BB. I submit this will prove an excellent index to pitcher viciousness.
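As a formula this is about as simple as baseball statistics get. A minimal sketch follows; the function name is mine, and the two pitchers and their totals are invented purely to show the comparison, not drawn from any real season.

```python
def viciousness_index(hit_batsmen, walks):
    """VI = HBP / BB. Undefined for a pitcher with no walks."""
    if walks <= 0:
        raise ValueError("VI needs at least one walk in the denominator")
    return hit_batsmen / walks

# Two hypothetical pitchers with the same number of hit batsmen.
control_artist = viciousness_index(hit_batsmen=15, walks=40)
wild_thing = viciousness_index(hit_batsmen=15, walks=100)
print(f"control artist VI = {control_artist:.3f}")
print(f"wild thing VI     = {wild_thing:.3f}")
```

The control artist's higher VI reflects the premise: a pitcher who rarely misses the strike zone but hits batters anyway is probably not missing by accident.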

I’d like to oblige you with some actual numbers, but HBP pitcher data turns out to be scarce. It’s not in the Lahman database, Baseball Reference doesn’t have it, and that means I don’t have it either. In lieu of numbers, I offer two hypotheses. First, pitchers with headhunting reputations, like Bob Gibson and Don Drysdale, will have high VIs. Second, the VI leaders, seasonally and career, will be a better set of pitchers than the VI trailers. (This is of course largely because the trailers walk more hitters. A stronger version is that if you match pitchers with similar walk/inning ratios, the ones with the higher VIs will tend to be better.) If somebody out there has HBP data for pitchers and wants to share it with me so I can confirm or deny, I pledge that I will not only publish the lifetime and 2002 leaders for the Viciousness Index, but I will add the data to my pitching search engine. Now is that a deal or what?

(Update: I’ve mentioned before how impressed I am with my commenters — it’s a regular little salon around here — but Greg Padgett has outdone himself. He actually grabbed the HB numbers from the Yahoo MLB database and posted the VI leaders and trailers, both raw and adjusted, for 2002 in the comments. He also extracted a few VI comparisons for pitchers with similar walk rates. I’ll have more to say about this later, but on casual inspection the results are inconclusive. There are some excellent pitchers at the top, like Pedro Martinez, Brad Radke, Derek Lowe, and Mark Mulder, but there are some pretty good pitchers at the bottom too, like Bartolo Colon and Jason Schmidt. Each end of the list has its share of washouts too. I suspect career results will be more conclusive, since we’re dealing with a relatively rare occurrence. Adjusted VI doesn’t range much beyond plus or minus 5 in a single season. But go read Greg’s comment.)

Jul 16 2002

Read Part 1.

Five things I learned about fielding in baseball from reading Bill James’s new book Win Shares:

1. Defensive efficiency, the percentage of balls put into play that is turned into outs, defined, if we ignore the small peripherals, as total outs minus strikeouts divided by the total number of balls in play, accurately measures how well a team performs defensively. You might think that pitchers would influence this statistic. They don’t.

2. You can measure range for first basemen, but this requires not just assists, but unassisted putouts, which are usually made when he runs to the bag himself instead of flipping to the pitcher. You can approximate this second component by subtracting all of the other infielders’ assists from the first baseman’s total putouts.

3. Catcher fielding percentages are a lot more meaningful when you remove strikeouts, which absurdly bloat the catcher’s total chances and never should have been there in the first place.

4. Because fielding, unlike pitching or hitting, is a cooperative effort, it must be evaluated top down — first on the team level, and only then by assigning contributions to individuals. (James also argues that this is the best way to evaluate everything, but I’m sure he would agree that you can get a lot further with pitching and hitting by working from the bottom up.)

5. Fielding statistics, like many things, make a lot more sense in context. If Bill Mazeroski, who has the best all-time double play statistics of any second baseman, turned a lot of double plays, we need to figure out how many he had a chance to turn, and we can. If Richie Ashburn, who has the best all-time fielding statistics of any outfielder, caught a lot of fly balls, we need to figure out how many he had a chance to catch, and we can again. Ashburn achieved them partly because he was in fact a superb defensive player, but mostly because the Phillies’ pitching staff in the 1950s gave up more fly balls than any other pitching staff ever, by far. For the same reason that team’s shortstop, Granny Hamner, has lousy fielding statistics, even though his defensive reputation was excellent. Mazeroski, on the other hand, had more or less the normal number of opportunities to turn double plays. He really was that good.

Maybe these things are obvious. But I didn’t think of them, and neither did you.
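The first three items reduce to simple arithmetic, sketched below as I have paraphrased them above. These are not James's exact formulas, which include the peripherals ignored here; the function names and all the input totals are made up for illustration.

```python
def defensive_efficiency(total_outs, strikeouts, balls_in_play):
    """Share of balls in play turned into outs, ignoring the small peripherals."""
    return (total_outs - strikeouts) / balls_in_play

def estimated_unassisted_putouts(first_base_putouts, other_infielder_assists):
    """Approximate a first baseman's unassisted putouts by subtracting
    the other infielders' assists from his total putouts."""
    return first_base_putouts - sum(other_infielder_assists)

def catcher_fielding_pct_without_strikeouts(putouts, assists, errors, strikeouts):
    """Ordinary fielding percentage, (PO + A) / (PO + A + E), after removing
    the putouts a catcher is credited with on strikeouts."""
    real_putouts = putouts - strikeouts
    chances = real_putouts + assists + errors
    return (real_putouts + assists) / chances

# Hypothetical season totals.
team_de = defensive_efficiency(total_outs=4300, strikeouts=1000, balls_in_play=4700)
unassisted = estimated_unassisted_putouts(1400, [450, 480, 350])
catcher_pct = catcher_fielding_pct_without_strikeouts(900, 60, 8, 800)
print(f"defensive efficiency:         {team_de:.3f}")
print(f"est. unassisted putouts (1B): {unassisted}")
print(f"catcher fielding pct, no Ks:  {catcher_pct:.3f}")
```

With strikeouts left in, the same hypothetical catcher would show a fielding percentage above .990; taking them out drops it to about .952, which says a good deal more about how he handles the chances that are actually his.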

Jun 21 2002

Read Part One. Go on, it’s short.

Watching baseball actually impedes understanding. When I was fourteen my father took me to a Yankee game. The Yankees lost and Bobby Bonds struck out four times, twice on changeups in the dirt. After the game my father said, “I never realized Bonds was such a bum.” Now, of course, Bonds wasn’t a bum; Bonds was a borderline Hall of Fame player in the middle of one of the best seasons of his career. But that’s what happens when you string up a hammock at some local minimum or maximum and proceed to draw conclusions about the shape of the graph.

When blowhards like Joe Morgan and Tim McCarver exult over “the little things that don’t show up in the box scores” this should be regarded as a paid commercial announcement — as if you have to listen to them to know what’s going on. Just about everything shows up in the box scores, and if it doesn’t, then we just need better box scores. Box scores used to show next to nothing, not even walks. And then they showed hit-by-pitches, and intentional walks, and pitch counts, and ball-strike ratios, and stolen-base attempts, and caught-stealings. Soon they will show runners advanced, and groundball/flyball ratios, and out charts, and the margin of baseball events that don’t show up in the box scores (what Bill James used to call “the swamp”) will dwindle, inexorably, to zero, just as science gradually asserts its dominion over all kinds of problems that used to belong to philosophy.

In the meantime, at least turn the sound down.

Jun 18 2002

Many people actually watch baseball (though not as many as there used to be), which amazes me. Have you ever tried to watch a baseball game with someone who knows nothing about sports, like your girlfriend? Mine can appreciate, at least for five minutes, the balletic grace of basketball or soccer, the raw violence of football, even the ebb and flow of hockey, but when baseball comes on the channel is changed. Immediately.

But there is one beautiful thing about baseball, and it isn’t the Cartesian symmetry of the diamond. Baseball playing fields aren’t even symmetrical, actually, except in the ugliest parks. Perhaps the fact that it’s played in the summer?

It breaks your heart. It is designed to break your heart. The game begins in the spring, when everything else begins again, and it blossoms in the summer, filling the afternoons and evenings, and then as soon as the chill rains come, it stops and leaves you to face the fall alone.

Giamatti doth protest too much, methinks.

No, the beautiful thing about baseball is that it’s transparent to statistical analysis. This is fortunate, because it means you don’t have to watch baseball to understand it. All you have to do is read the box scores.

Jun 16 2002

Several premises underlie Bill James’s new book, some of them radical, and it seems best to examine them individually.

1. It would be valuable to have a single number to represent the value of a player’s season. Well, sure. Wouldn’t it? People spend a lot of time rating things on a scale “from 1 to 10.” James proposes to do the same thing with baseball players, except it’s on a scale of 0 to about 50 or so (there are only about a dozen seasons in baseball history that rate higher than 50, including Barry Bonds’ 2001, which clocks in at 54). Each integer represents a third of a win, a “Win Share”. This would come in handy to resolve salary disputes, trade questions and bar arguments, for sure.

2. It is impossible to evaluate players by taking the average as a baseline. This too is true, and it has already been acknowledged by many other analysts who have introduced the concept of replacement value (the Baseball Prospectus boys with VORP, among others). James justly says that the value of a player is not in how far above the average he is but in the fact that he can play at this level at all. Poor Pete Palmer and his Linear Weights system take it on the chin for using average performance as a baseline, implying that a slightly-below average major league player has a value of less than zero.

3. The best way to analyze performance, particularly fielding, is to look at the team’s performance first, and then allocate it among the individual players. True again. For hitters it’s easy to separate individual hitting performance by ignoring situation-dependent statistics like RBI; for pitchers it’s a little more difficult; but fielders cannot be evaluated properly apart from the team.

4. The Win Shares of the players must add up to the wins of the team. Here we start to get into trouble. A team’s record can be predicted quite accurately by a Pythagorean formula, and in Win Shares James introduces a new formula, based on “marginal runs”, that is nearly as accurate. (In fact calculating Win Shares is just a matter of figuring a team’s marginal runs and allocating them among its members.) Some teams, however, significantly overperform or underperform, relative to the number of runs they score and allow. The 1984 New York Mets, who finished 90-72 despite being outscored by 24 runs, are a notorious example. It follows from James’s formulae that each marginal run a 1984 Met contributed was more valuable than each marginal run from an underperforming team, say, the 1984 Pittsburgh Pirates, who finished 75-87 despite outscoring their opponents by 48 runs. The adjustments here can range to 20% or more, dwarfing things like park factors. And it follows further that Hubie Brooks and Tony Pena both get 21 Win Shares, despite the fact that Pena outhit Brooks slightly and played a far more difficult defensive position far better. In other words, Brooks, according to James, was a better player in 1984 because his team was lucky. (All analysts agree–James himself may have been the first to say so–that when teams win a lot of close games, it’s mostly luck.) James defends such conclusions obliquely in an essay called The Snider/Mays Dilemma, as follows:

But in that case [of another overachieving team, the 1969 Mets], it seems OK, because, after all, we know what this team accomplished. We all understand that this isn’t the usual case. But in the Win Shares system, we follow the logic that whatever is accomplished by the team is credited to the players, wherever that leads us.

Now maybe we all know what the 1969 Mets accomplished, and won’t be thrown by Cleon Jones’s 30 Win Shares, but do we all know what the 1984 Pirates failed to accomplish, or will we take Tony Pena’s and Hubie Brooks’ 21 Win Shares apiece at face value? I am not certain that we should evaluate players relative to their teams’ expected instead of actual won-loss records, but I am certain that James can give a better defense of his method than he does here.

Smaller points: James says in The New Historical Baseball Abstract and elsewhere that closers are overrated and managers ought to ignore save situations and pitch their best relief pitchers when they are most valuable, i.e., in tied and one-run games. Couldn’t agree more. But in Win Shares he appears to ignore that argument, giving relievers arbitrary extra credit for saves to make his numbers work out. James also asserts that stopping the running game constitutes 50% of a catcher’s defense, although he admits that such an estimate is desperate work. This should be easy to add up, however, because we know, offensively, what stolen bases and caught stealings are worth. When I am feeling less lazy I will add up the alleged defensive contributions of some cannon-armed catcher like, say, Ivan Rodriguez, and see if they match up with what the offensive result of his SB/CS record would be worth.

(Update: Read Part 2.)