## Run Expectancy and Markov Chains

August 14th, 2011

Sorry for the long interval between entries – I hope to get back to posting on a more regular basis. Continuing in the vein of my previous two posts, I’m still working my way towards baseball win expectancy, but I’m going to pause to examine run expectancy in a more detailed manner.

First, let’s look back at the run expectancy matrix from my last post. It was built by looking at each time a given base-out state occurred, and seeing how many runs were scored in the remainder of those innings (using the FATE_RUNS_CT field from Chadwick). I will refer to this as empirical run expectancy, as it is based on how many runs were actually scored following each base-out state.

**Run Expectancy Matrix, Empirical**

| Bases | 0 Outs | 1 Out | 2 Outs |
|:-----:|-------:|------:|-------:|
| `___` | 0.539 | 0.289 | 0.111 |
| `1__` | 0.929 | 0.555 | 0.240 |
| `_2_` | 1.172 | 0.714 | 0.342 |
| `__3` | 1.444 | 0.984 | 0.373 |
| `12_` | 1.542 | 0.948 | 0.464 |
| `1_3` | 1.844 | 1.204 | 0.512 |
| `_23` | 2.047 | 1.438 | 0.604 |
| `123` | 2.381 | 1.620 | 0.798 |
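The empirical calculation itself is straightforward: group every plate appearance by its base-out state and average the runs that scored in the remainder of those innings. Here is a minimal sketch in Python (the function name and the tiny sample data are mine, invented for illustration; the real calculation pulls FATE_RUNS_CT from a Chadwick-parsed Retrosheet database):

```python
from collections import defaultdict

def empirical_run_expectancy(plays):
    """Average, for each base-out state, the runs scored in the
    remainder of the innings in which that state occurred."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for bases, outs, fate_runs in plays:
        totals[(bases, outs)] += fate_runs
        counts[(bases, outs)] += 1
    return {state: totals[state] / counts[state] for state in totals}

# Tiny made-up sample: (base state, outs, runs scored in rest of inning)
plays = [
    ("___", 0, 0), ("___", 0, 1), ("___", 0, 0), ("___", 0, 1),
    ("1__", 0, 2), ("1__", 0, 0),
]
re_matrix = empirical_run_expectancy(plays)
print(re_matrix[("___", 0)])  # → 0.5
print(re_matrix[("1__", 0)])  # → 1.0
```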

## Run Expectancy and Base-Out Leverage Index

December 19th, 2010

Before I get into win expectancy, Win Probability Added, Leverage Index, and WPA/LI, I want to take a look at run expectancy, RE24, Base-Out Leverage Index, and RE24/boLI. Most of these stats were created and/or popularized by Tangotiger (his intro to Leverage Index is here). I have tried to mimic his methodology as closely as possible, but there may be some differences.

## Building a Retrosheet Database, Part 1

October 27th, 2010

I want to be able to calculate Tangotiger’s WPA/LI stat (Win Probability Added/Leverage Index, a.k.a. situational wins, context neutral wins, or game state linear weights). To do that, I need to be able to calculate WPA and LI. To do that, I need to construct a Win Expectancy matrix. To do that, I need to build a Retrosheet database. So that’s where I’m going to start. I’ve never worked with a database or explored any Retrosheet data before, so I am starting from scratch (though I will be utilizing a lot of great resources from around the web). In a series of posts I will describe my process step-by-step. If you want to follow along, make sure you have a lot of free disk space (the parsed data files for all seasons take up over 5 GB). Also be aware that some of my instructions will be Windows-specific.

## The Distribution of Talent Between Teams

October 20th, 2010

Four years ago Tango had a very interesting post on how talent is distributed between teams in different sports leagues. I want to revisit and expand upon some of the points that came up in that discussion.

First, let’s look at some empirical data. I scraped end-of-season records from the last ten years for the NFL, NBA, and MLB from ShrpSports (I decided to omit the NHL from this analysis due to the prevalence of ties). The data is available here (click through) as a tab-delimited text file. I used R to analyze the data. If you don’t have R you can download it for free (if you use Windows I recommend using it in conjunction with Tinn-R, which is great for editing and interactively running R scripts). Here is the R code I used:

```r
records = read.delim(file = "records.txt")
lgs = data.frame(league = c("NFL","NBA","MLB"), teams = c(32,30,30), games = c(16,82,162))
lgs$var.obs[lgs$league == "NFL"] = var(records$win_pct[records$league == "NFL"])
lgs$var.obs[lgs$league == "NBA"] = var(records$win_pct[records$league == "NBA"])
lgs$var.obs[lgs$league == "MLB"] = var(records$win_pct[records$league == "MLB"])
lgs$var.rand.est = .5*(1-.5)/lgs$games
lgs$var.true.est = lgs$var.obs - lgs$var.rand.est
lgs$regress.halfway.games = lgs$games*lgs$var.rand.est/lgs$var.true.est
lgs$regress.halfway.pct.season = lgs$regress.halfway.games/lgs$games
lgs$noll.scully = sqrt(lgs$var.obs)/sqrt(lgs$var.rand.est)
lgs$better.team.better.record.pct = 0.5 + atan(sqrt(lgs$var.obs - lgs$var.rand.est)/sqrt(lgs$var.rand.est))/pi
lgs
```

Here is the resulting table:

## The Origins of Log5

October 3rd, 2010

Of all Bill James’ sabermetric innovations, my favorite has always been the log5 formula for determining matchup probabilities. It provides a method for taking the strengths of two teams and estimating the probability of one team beating the other. It can also be applied to individual player matchups, such as a batter facing a pitcher.

Here is a common way to express the formula for the context of team matchups:

$Win\%_{A vs. B} = \dfrac{Win\%_A \times (1 - Win\%_B)}{(Win\%_A \times (1 - Win\%_B)) + ((1 - Win\%_A) \times Win\%_B)}$

Unfortunately this formulation doesn’t shed much light on why James called this log5, or where the formula came from in the first place.

### James’ Original Formulation

James introduced the formula in the 1981 Baseball Abstract, which is excerpted here. In his initial presentation, James first converted each team’s winning percentage (or $p$, their probability of success) into what he called their $log5$.

$\dfrac{log5}{log5 + .500} = p$

Solving for $log5$:

$log5 = .500 \times \dfrac{p}{1 - p}$

After this conversion, the formula is simple:

$p_{AvB} = \dfrac{log5_A}{log5_A + log5_B}$

### Logarithms, Odds, and Odds Ratios

So where does the “log” in log5 come from? I’m not sure exactly where James got it, but there is a connection to the logit function:

$logit(p) = log\left(\dfrac{p}{1 - p}\right)$

That $\frac{p}{1 - p}$ term was present in James’ formulation. It is what is known as the odds. It’s a common term in gambling — if some event has a .75 probability, the odds are $\frac{.75}{(1 - .75)} = 3$, typically expressed as 3:1 or 3 to 1.

Framing things in terms of odds rather than probabilities can be helpful.

$Odds = \dfrac{p}{1 - p}$

We can replicate the log5 formula by simply taking the odds ratio, which is just the odds for team A divided by the odds for team B.

$OddsRatio_{AvB} = \dfrac{Odds_A}{Odds_B}$

To convert this back to a probability we need one final step:

$p_{AvB} = \dfrac{OddsRatio_{AvB}}{1 + OddsRatio_{AvB}}$

Combining these steps, we have a simple formulation of log5:

$p_{AvB} = \dfrac{Odds_A}{Odds_A + Odds_B}$

This matches James’ original formulation, but here we see that one can use simple odds rather than James’ log5 term (which contains an unnecessary .500 multiplier).
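To check that the two routes agree, here is a small Python sketch (the function names are mine, not James’):

```python
def log5_james(p_a, p_b):
    """James' original route: convert each winning percentage to his
    'log5' term, then take the ratio."""
    l5_a = 0.5 * p_a / (1 - p_a)
    l5_b = 0.5 * p_b / (1 - p_b)
    return l5_a / (l5_a + l5_b)

def log5_odds(p_a, p_b):
    """The plain odds formulation; the .500 multiplier cancels out."""
    odds_a = p_a / (1 - p_a)
    odds_b = p_b / (1 - p_b)
    return odds_a / (odds_a + odds_b)

# A .600 team vs. a .400 team: both routes give the same answer
print(round(log5_james(0.6, 0.4), 3))  # → 0.692
print(round(log5_odds(0.6, 0.4), 3))   # → 0.692
```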

Tying this back into the logit function, we can reformulate log5 to say that the matchup probability is equal to the inverse-logit of the log of the odds ratio (of course, it’d be simpler to just say that the matchup odds are equal to the odds ratio, but then we’d be leaving the “log” out of “log5”).

### The “5” in “log5” and a More General Formulation

The “5” part of “log5” was in reference to the fact that teams were being compared to .500, the average winning percentage. But when we’re dealing with individual matchups, the league average isn’t always .500 (for a batter/pitcher matchup to estimate the probability of a hit, we would need to use the league-wide batting average). To deal with this we need to add another term to the formula representing the league average probability (or odds).

In the odds ratio formulation, this is easy. We just divide by the league average odds ($Odds_{LG}$). When the league average probability is .500, the odds are $\frac{.500}{(1 - .500)} = 1$, so the term can be omitted without consequence.

$OddsRatio_{AvB} = \dfrac{\frac{Odds_A}{Odds_B}}{Odds_{LG}}$

Converting this to a probability, we have what I find to be the clearest formulation of the generalized log5 formula:

$p_{AvB} = \dfrac{Odds_A}{Odds_A + (Odds_B \times Odds_{LG})}$
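A quick Python sketch of the generalized formula (function names are mine). With a .500 league average the extra term is 1 and the formula reduces to the basic log5:

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def log5_general(p_a, p_b, p_lg):
    """Generalized log5: divide the odds ratio by the league-average
    odds before converting back to a probability."""
    return odds(p_a) / (odds(p_a) + odds(p_b) * odds(p_lg))

# With a .500 league average the league term drops out:
print(round(log5_general(0.6, 0.4, 0.5), 3))  # → 0.692
# A league average above .500 pulls A's probability down:
print(round(log5_general(0.6, 0.4, 0.6), 3))  # → 0.6
```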

### Precedents

So was Bill James the first to discover the log5 formula? Not exactly. It turns out that log5 is a variation of the Bradley-Terry model for pairwise comparison, which was first published in 1952 (and which was itself a variation on a 1929 work by the German mathematician Ernst Zermelo). The formula given on the Wikipedia page is equivalent to the inverse-logit formulation I discussed above, if the logs of each team’s odds are used for the scale locations (their formula uses the difference of the logs of the odds, which is equal to the log of the ratio of the odds that I used). Jim Albert and Jay Bennett discussed the Bradley-Terry model in Chapter 12 of their excellent book, “Curve Ball.” The Bradley-Terry model has been used for rating systems in many sports, including hockey and chess (I highly recommend Mark Glickman’s article “A Comprehensive Guide to Chess Ratings” for more background on paired comparisons and the connection between log5/Bradley-Terry and the logistic distribution).

For more on log5, here’s a good early piece by Dean Oliver, which includes a shortcut formula that mirrors one discussed by Joe Arthur in a great thread from Tango’s blog. Mike Tamada has also written some lucid intros to log5 here and here. Hal Stern’s work on paired comparisons is worth hunting down – he explicitly makes the link between the logit function and log5 in Chapter 9 of “Statistical Thinking in Sports” (for more references to his work see this comprehensive bibliography on sports ranking systems, which points to a lot of other relevant articles). Padres analyst Chris Long also made the connection between log5 and Bradley-Terry in a presentation he gave last year. And finally, Steven Miller has written a nice short paper that provides a justification of log5 using the geometric series formula.

## A Perl version of Tango’s Markov Model

September 28th, 2010

I have created a Perl version of Tangotiger’s excellent Markov run modeler. Tango’s original HTML/Javascript version can be found here, with further discussion here.

This is just a basic adaptation – I have not added any new features, though I hope to in the future (at the very least I would like to make a Perl version of Bill Skelton’s modification of Tango’s original).

To use my version, first download the zip file (markov.zip), extract the Perl script (markov.pl) and the example input file (input.csv), and place them in the same directory. Change the values in the input.csv file to alter the batting line and the chances of taking an extra base (but make sure not to alter the formatting of the file). Then just run the Perl script, which will produce a file named output.txt that is tab-delimited. If you open that in Excel you should be able to view all the results in table form. For simplicity’s sake I didn’t include any command line arguments to specify the names of the input or output files, so if you want to run the script multiple times and save your results you’ll either have to rename/copy the output file or alter the Perl script (note that the output file does include the input values inside it).
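To give a feel for what such a modeler computes, here is a heavily simplified Python sketch. It is a Monte Carlo approximation rather than the exact Markov transition matrices Tango’s version uses, and the event set (outs, walks, one-base-advance singles, home runs) and per-PA probabilities are made up purely for illustration:

```python
import random

# Made-up per-plate-appearance event probabilities (illustration only)
EVENTS = [("out", 0.68), ("walk", 0.08), ("single", 0.16), ("homer", 0.08)]

def advance(bases, event):
    """Apply a non-out event to a base state (tuple of booleans for
    1B, 2B, 3B occupied); returns (new_bases, runs_scored)."""
    b1, b2, b3 = bases
    if event == "homer":
        return (False, False, False), 1 + b1 + b2 + b3
    if event == "walk":
        # A walk only advances forced runners
        if not b1:
            return (True, b2, b3), 0
        if not b2:
            return (True, True, b3), 0
        if not b3:
            return (True, True, True), 0
        return (True, True, True), 1  # bases loaded: a run walks in
    # single: batter to first, every runner advances exactly one base
    return (True, b1, b2), int(b3)

def simulate_inning(rng):
    """Play out one half-inning; return the runs scored."""
    bases, outs, runs = (False, False, False), 0, 0
    while outs < 3:
        r, cum = rng.random(), 0.0
        for event, p in EVENTS:
            cum += p
            if r < cum:
                break
        if event == "out":
            outs += 1
        else:
            bases, scored = advance(bases, event)
            runs += scored
    return runs

rng = random.Random(42)
n = 50_000
avg = sum(simulate_inning(rng) for _ in range(n)) / n
print(round(avg, 3))  # average runs per inning under these assumed rates
```

The real model replaces the random sampling with exact probability bookkeeping over the 24 base-out states, which is why it returns a full run distribution rather than a simulated average.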

For those unfamiliar with Markov models of baseball, there are a lot of great resources on the web. Outside of Tango’s site, I recommend work by Mark Pankin, Joel Sokol (includes Matlab code), Bruce Bukiet (scroll down for “A Markov Chain Approach to Baseball”), Carl Morris, John Beamer (includes Excel spreadsheet with purchase), Tom Ruane, and Berselius (includes Matlab code, though link appears to be down).