Of all Bill James’ sabermetric innovations, my favorite has always been the log5 formula for determining matchup probabilities. It provides a method for taking the strengths of two teams and estimating the probability of one team beating the other. It can also be applied to individual player matchups, such as a batter facing a pitcher.
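Here is a common way to express the formula for the context of team matchups, where $p_A$ and $p_B$ are the two teams’ winning percentages:

$$P(A \text{ beats } B) = \frac{p_A - p_A p_B}{p_A + p_B - 2 p_A p_B}$$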
Unfortunately this formulation doesn’t shed much light on why James called this log5, or where the formula came from in the first place.
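As a quick illustration, here is a minimal Python sketch of the probability form of log5, $(p_A - p_A p_B) / (p_A + p_B - 2 p_A p_B)$. The function name `log5` and the .600 and .400 winning percentages are my own choices for the example, not anything taken from James.

```python
def log5(p_a: float, p_b: float) -> float:
    """Estimate the probability that team A beats team B.

    p_a and p_b are the teams' winning percentages, strictly between 0 and 1.
    (If both are 0 or both are 1 the denominator is zero.)
    """
    numerator = p_a - p_a * p_b
    denominator = p_a + p_b - 2 * p_a * p_b
    return numerator / denominator


# Hypothetical matchup: a .600 team against a .400 team.
print(round(log5(0.600, 0.400), 4))  # 0.6923
```

Two sanity checks fall out of the formula: any team facing a .500 opponent keeps its own winning percentage, and the two teams’ probabilities of beating each other sum to 1.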
James’ Original Formulation
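James introduced the formula in the 1981 Baseball Abstract, which is excerpted here. In his initial presentation, James first converted each team’s winning percentage (or $p$, their probability of success) into what he called their log5, defined by:

$$p = \frac{\text{log5}}{\text{log5} + .500}$$

Solving for log5:

$$\text{log5} = .500 \times \frac{p}{1 - p}$$

After this conversion, the formula is simple:

$$P(A \text{ beats } B) = \frac{\text{log5}_A}{\text{log5}_A + \text{log5}_B}$$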
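The two-step procedure can be sketched in Python as follows. This assumes the log5 value is $.500 \times p / (1 - p)$, as obtained by solving James’ definition for log5. The function names and the sample winning percentages are hypothetical.

```python
def james_log5_value(p: float) -> float:
    """Convert a winning percentage p (0 < p < 1) into James' log5 value."""
    return 0.500 * p / (1.0 - p)


def matchup_from_log5_values(log5_a: float, log5_b: float) -> float:
    """Probability that A beats B, given each team's log5 value."""
    return log5_a / (log5_a + log5_b)


# Hypothetical matchup: a .600 team against a .400 team.
a = james_log5_value(0.600)  # about 0.75
b = james_log5_value(0.400)  # about 0.3333
print(round(matchup_from_log5_values(a, b), 4))  # 0.6923
```

A .500 team has a log5 value of exactly .500, which is presumably why James scaled it that way.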
Logarithms, Odds, and Odds Ratios
So where does the “log” in log5 come from? I’m not sure exactly where James got it, but there is a connection to the logit function:

$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$

The $\frac{p}{1 - p}$ term was present in James’ formulation. It is what is known as the odds. It’s a common term in gambling: if some event has a .75 probability, the odds are $.75 / .25 = 3$, typically expressed as 3:1 or 3 to 1.
Framing things in terms of odds rather than probabilities can be helpful.
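We can replicate the log5 formula by simply taking the odds ratio, which is just the odds for team A divided by the odds for team B. Writing $\text{Odds}_X = \frac{p_X}{1 - p_X}$ for each team:

$$\text{Odds}(A \text{ beats } B) = \frac{\text{Odds}_A}{\text{Odds}_B} = \frac{p_A / (1 - p_A)}{p_B / (1 - p_B)}$$

To convert this back to a probability we need one final step:

$$P = \frac{\text{Odds}}{1 + \text{Odds}}$$

Combining these steps, we have a simple formulation of log5:

$$P(A \text{ beats } B) = \frac{\text{Odds}_A}{\text{Odds}_A + \text{Odds}_B}$$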
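For readers who want to check the algebra, here is a short worked derivation. It assumes only the definitions already used: $\text{Odds}_X = p_X / (1 - p_X)$ and $P = \text{Odds} / (1 + \text{Odds})$. It shows that the odds-ratio version reduces to the probability form commonly quoted for team matchups.

```latex
\begin{align*}
P(A \text{ beats } B)
  &= \frac{\text{Odds}_A / \text{Odds}_B}{1 + \text{Odds}_A / \text{Odds}_B}
   = \frac{\text{Odds}_A}{\text{Odds}_A + \text{Odds}_B} \\[4pt]
  &= \frac{\dfrac{p_A}{1 - p_A}}{\dfrac{p_A}{1 - p_A} + \dfrac{p_B}{1 - p_B}}
   = \frac{p_A (1 - p_B)}{p_A (1 - p_B) + p_B (1 - p_A)} \\[4pt]
  &= \frac{p_A - p_A p_B}{p_A + p_B - 2 p_A p_B}
\end{align*}
```

The second line multiplies the numerator and denominator by $(1 - p_A)(1 - p_B)$.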
This matches James’ original formulation, but here we see that one can use simple odds rather than James’ log5 term (which contains an unnecessary .500 multiplier).
Tying this back into the logit function, we can reformulate log5 to say that the matchup probability is equal to the inverse-logit of the log of the odds ratio (of course, it’d be simpler to just say that the matchup odds are equal to the odds ratio, but then we’d be leaving the “log” out of “log5”).
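To make the equivalence concrete, here is a small Python sketch that computes the matchup probability two ways: directly from the odds ratio, and as the inverse-logit of the difference of the two logits (which is the log of the odds ratio). The helper names are my own, and the inputs are hypothetical winning percentages strictly between 0 and 1.

```python
import math


def odds(p: float) -> float:
    """Odds corresponding to probability p: p / (1 - p)."""
    return p / (1.0 - p)


def logit(p: float) -> float:
    """Log of the odds."""
    return math.log(odds(p))


def inv_logit(x: float) -> float:
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def log5_via_odds_ratio(p_a: float, p_b: float) -> float:
    """Matchup probability from the odds ratio: OR / (1 + OR)."""
    ratio = odds(p_a) / odds(p_b)
    return ratio / (1.0 + ratio)


def log5_via_logit(p_a: float, p_b: float) -> float:
    """Matchup probability as inverse-logit of the log odds ratio."""
    return inv_logit(logit(p_a) - logit(p_b))


# Hypothetical matchup: a .600 team against a .400 team.
print(round(log5_via_odds_ratio(0.600, 0.400), 4))  # 0.6923
print(round(log5_via_logit(0.600, 0.400), 4))       # 0.6923
```

The two functions agree up to floating-point rounding, since the log of a ratio is the difference of the logs.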
The “5” in “log5” and a More General Formulation
The “5” part of “log5” was in reference to the fact that teams were being compared to .500, the average winning percentage. But when we’re dealing with individual matchups, the league average isn’t always .500 (for a batter/pitcher matchup to estimate the probability of a hit, we would need to use the league-wide batting average). To deal with this we need to add another term to the formula representing the league average probability (or odds).
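In the odds ratio formulation, this is easy. We just divide by the league average odds ($\frac{p_L}{1 - p_L}$). When the league average probability is .500, the odds are $.500 / .500 = 1$, so the term can be omitted without consequence. Writing $\text{Odds}_X = \frac{p_X}{1 - p_X}$:

$$\text{Odds}(A \text{ beats } B) = \frac{\text{Odds}_A / \text{Odds}_B}{\text{Odds}_L}$$

Here $p_B$ is B’s own probability of success, as in the team case (for a pitcher, that is the probability of preventing a hit), while $p_L$ is the league average probability of the outcome being estimated.

Converting this to a probability, we have what I find to be the clearest formulation of the generalized log5 formula:

$$P(A \text{ beats } B) = \frac{\text{Odds}_A}{\text{Odds}_A + \text{Odds}_B \cdot \text{Odds}_L}$$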
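Here is a minimal Python sketch of the generalized formula, under one stated assumption: `p_b` is the opponent’s own success probability (for a pitcher, the chance of preventing a hit), and `p_league` is the league average probability of the outcome credited to A. The function names and the batter, pitcher, and league figures are hypothetical, chosen only to illustrate the arithmetic.

```python
def odds(p: float) -> float:
    """Odds corresponding to probability p: p / (1 - p)."""
    return p / (1.0 - p)


def generalized_log5(p_a: float, p_b: float, p_league: float = 0.5) -> float:
    """Matchup probability with a league-average adjustment.

    Assumes p_b is B's own success rate (e.g. a pitcher's out rate),
    and p_league is the league rate of A's kind of success.
    """
    matchup_odds = odds(p_a) / odds(p_b) / odds(p_league)
    return matchup_odds / (1.0 + matchup_odds)


# Hypothetical: a .300 hitter faces a pitcher who allows a .250 average
# (so the pitcher "succeeds" .750 of the time) in a .260 league.
print(round(generalized_log5(0.300, 0.750, 0.260), 4))  # 0.2891

# With the default .500 league average this reduces to plain log5.
print(round(generalized_log5(0.600, 0.400), 4))  # 0.6923
```

One useful check on the convention: an average batter (.260) against an average pitcher (.740 out rate) comes back as the league average, .260.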
So was Bill James the first to discover the log5 formula? Not exactly. It turns out that log5 is a variation of the Bradley-Terry model for pairwise comparison, which was first published in 1952 (and which itself was a variation on a 1929 work of German mathematician Ernst Zermelo). The formula given on the Wikipedia page is equivalent to the inverse-logit formulation I discussed above, if the logs of each team’s odds are used for the scale locations (their formula uses the difference of the logs of the odds, which is equal to the log of the ratio of the odds that I used). Jim Albert and Jay Bennett discussed the Bradley-Terry model in Chapter 12 of their excellent book, “Curve Ball.” The Bradley-Terry model has been used for rating systems in many sports, including hockey and chess (I highly recommend Mark Glickman’s article “A Comprehensive Guide to Chess Ratings” for more background on paired comparisons and the connection between log5/Bradley-Terry and the logistic distribution).
For more on log5, here’s a good early piece by Dean Oliver, which includes a shortcut formula that mirrors one discussed by Joe Arthur in a great thread from Tango’s blog. Mike Tamada has also written some lucid intros to log5 here and here. Hal Stern’s work on paired comparisons is worth hunting down – he explicitly makes the link between the logit function and log5 in Chapter 9 of “Statistical Thinking in Sports” (for more references to his work see this comprehensive bibliography on sports ranking systems, which points to a lot of other relevant articles). Padres analyst Chris Long also made the connection between log5 and Bradley-Terry in a presentation he gave last year. And finally, Steven Miller has written a nice short paper that provides a justification of log5 using the geometric series formula.