Wednesday, April 06, 2011

SABRMetrics 101: Predicting Wins

First, let’s show the magic trick. People love magic tricks.

The Twins scored 781 runs last year and gave up 671 runs. How many wins did they have?

OK, you probably know that they had 94 wins. Bad example. Let’s use Cleveland, instead. (Everyone else does.) The Indians scored 646 runs and gave up 752 runs. What was their record?

What Bill James showed in the 80s is that if you have a calculator, I’ve given you enough information to predict how many games a team won. He called his little trick the Pythagorean Formula, which is an incredibly unfortunate name, because Pythagoras had already claimed it, but it stuck. We’ll walk through it using Cleveland’s numbers above.

1. Square the runs scored. (646 * 646 = 417,316)
2. Square the runs against (752 * 752 = 565,504)
3. Add those two numbers together. (417,316 + 565,504 = 982,820)
4. Divide the 1st number by the 3rd number (417,316/982,820 = .4246)
5. That is the team’s winning percentage. So just multiply that number by 162, or however many games the team played (.4246 * 162 = 68.8)
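
For anyone who would rather let a computer do the squaring, the five steps above can be sketched in a few lines of Python. This is only an illustration: the function name `pythagorean_wins` and its parameters are made up for this post, and the 162-game default is just the usual season length.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162):
    """Estimate wins from runs scored and runs allowed (steps 1-5 above)."""
    scored_sq = runs_scored ** 2      # step 1: square the runs scored
    allowed_sq = runs_allowed ** 2    # step 2: square the runs against
    total = scored_sq + allowed_sq    # step 3: add the two squares
    win_pct = scored_sq / total       # step 4: divide the 1st by the 3rd
    return win_pct * games            # step 5: scale to games played


# Cleveland's 2010 numbers from the walkthrough
cleveland = pythagorean_wins(646, 752)
print(round(cleveland))  # 69
```

Pass a different `games` value if the team played more or fewer than 162.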

So the formula says the Tribe won 69 games, which is exactly as many as they won. If you do the same thing with the Twins’ numbers, you’ll see it predicts they won 93 games, one fewer than they won. And if you do it for all major league teams, you’ll see that it predicted sixteen teams’ records within two games. All but three teams are within 5 games of its prediction. It also accurately predicted all eight of the teams that made the playoffs.

Ta-DA! (Deep bow)

The basis of this formula is simple enough for anyone to understand: the more runs you score, and the fewer runs you give up, the more games you’re going to win. Nobody argues with that idea. But what was revolutionary was how precise it seemed. And how FUN it was. With a calculator (remember, this was the 80s) and an imagination, you could come up with all kinds of insights.

For instance, Nick Blackburn gave up 101 runs last year in 161 innings. What if we had a more average pitcher, who gave up just 75 runs? Just subtract those 26 runs from the Twins’ runs against, rerun the numbers and see how many more games the Twins might have won.

(I’ll let you go ahead and crunch that one yourself. It’s good practice. Have fun.)
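
If you’d rather not punch it into a calculator, here is a rough sketch of that what-if in Python. The helper `pythagorean_wins` and the variable names are invented for illustration; the run totals come from this post, and the sketch assumes the replacement pitcher throws the same innings and nothing else about the team changes.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162):
    """Pythagorean estimate: RS^2 / (RS^2 + RA^2), scaled to a season."""
    win_pct = runs_scored ** 2 / (runs_scored ** 2 + runs_allowed ** 2)
    return win_pct * games


twins_scored, twins_allowed = 781, 671  # 2010 Twins totals from the post
runs_saved = 101 - 75                   # Blackburn's runs minus the hypothetical pitcher's

actual = pythagorean_wins(twins_scored, twins_allowed)
what_if = pythagorean_wins(twins_scored, twins_allowed - runs_saved)

print(f"Estimated wins as played: {actual:.1f}")
print(f"Estimated wins with the swap: {what_if:.1f}")
print(f"Difference: {what_if - actual:.1f}")
```

Change `runs_saved` to try any other swap you can dream up.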

It became a favorite plaything of anyone doing analysis on their favorite team. It became a powerful tool for insight. It became widely misunderstood. But most importantly…(hold it, this requires caps.)


Runs, it turns out, are a lot easier to study with baseball stats than wins. And that was especially true when James dropped his next bombshell. We’ll get to that next time.

If you’re going to any of the games this weekend, I’d highly recommend plunking down $1 for the Twins Official Scorecard. TwinsCentric writers and other independent bloggers will be providing the content for the Dugout Splinters, which is a preview of both teams within the Scorecard. For the A’s series, I’m writing the Twins side while Kyle Eliason (who has been a key contributor for years) looks at what the A’s are up to.

It’s easily the best bargain at Target Field, and you can buy it at any souvenir or program stand. You’ll love it.

Over at Seth Speaks, Seth reviews the prospect hounds’ choices for the minor league pitchers most likely to break out this upcoming season.


TT said...

"what was revolutionary was how precise it seemed. "

"All but three teams are within 5 games of its prediction"

With that information you can predict the standings, give or take 10 games. Two teams with the same pythagorean record could be that far apart. The emphasis is that it "seems" precise; it's actually about as accurate as Shooter Hunt's pitches.

This discussion says more about our ability to intuitively evaluate differences than it does how accurately you can predict wins from run differential.

There is a carnival game like this at the state fair, or at least there used to be, where the operator guesses your age within 5 years. Like all carnival games, it is a bit of a con, much easier than it "seems". People fall for it.

John said...

TT, I actually agree with you somewhat. I'll be talking about that (I hope) in the next post on this topic. It is one of the weaknesses of a lot of sabrmetric study that not a lot of people talk about.

TT said...

BTW - the simple arithmetic of run differential predicted the playoff teams in 2010. Run differential predicted 7 of 8 in 2009, with only Colorado edging Atlanta for the wild card spot. Not surprisingly, there is almost a perfect correlation between simple run differential and the pythagorean "prediction" with only Houston out of order.

In 2009 - pythagorean missed the Yankees record by 7 games, the Dodgers by 4, the Angels by 5, and the Braves by 6 games. Seattle was off by 9 games. So even a 5 game differential is not really an outlier.

On the other hand, if you look at the formula, what is interesting is the calculations done on run differential to create a team's record. The squaring of both runs scored and runs given up indicates that the actual value of a run in determining a team's record is not static.

As an example, let's use very simple numbers. Team A scores 6 runs and gives up 4. Team B scores 12 runs and gives up 10 in the same number of games. Team A's pythagorean number would be .692, Team B's would be .590. Let's change Team B's numbers so they score 12 runs and give up 8, proportionally the same as Team A's. Their pythagorean number is then .692, identical to Team A's, because the formula depends only on the ratio of runs scored to runs given up.
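
The same-differential comparison above (6 and 4 versus 12 and 10) is easy to check in Python. This is just a sketch: the function name `pythagorean_pct` is made up, and the extra scoring levels in the loop are arbitrary numbers chosen to show the same two-run edge shrinking as total scoring rises.

```python
def pythagorean_pct(runs_scored, runs_allowed):
    """Pythagorean winning percentage: RS^2 / (RS^2 + RA^2)."""
    return runs_scored ** 2 / (runs_scored ** 2 + runs_allowed ** 2)


team_a = pythagorean_pct(6, 4)     # two-run edge, low scoring
team_b = pythagorean_pct(12, 10)   # same two-run edge, higher scoring
print(f"{team_a:.3f} {team_b:.3f}")  # 0.692 0.590

# Hypothetical scoring levels, each with the same +2 run differential
for allowed in (4, 10, 22, 100):
    scored = allowed + 2
    print(scored, allowed, f"{pythagorean_pct(scored, allowed):.3f}")
```

The more runs there are overall, the less a fixed two-run edge is worth.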

James created pythagorean, using a computer, to translate run differential into a winning percentage. To make it mimic actual winning percentage as closely as possible, he had to weight runs differently by squaring the results.

What that showed, not surprisingly, is that the fewer runs scored, the more important each run becomes. Most people intuitively believed that about individual games, but it is true for teams over the course of the season as well.

Anonymous said...

"It also accurately predicted all eight of the teams that made the playoffs."

It didn't "predict" anything. How could it predict something that had already happened? Predictions occur before the event, not after.

In order for the P Theorem to "predict" something, it would have to do so before the season started, not after, when you (and "it") know which teams had the biggest run differential.

If the P Theorem could indeed predict which teams would have the biggest run differentials in an UPCOMING season, you'd have something useful.

As it is, you have a toy that is sorta neat, but ultimately as useful as a weatherman who predicts yesterday's weather.