Borrowing money will receive payday can do viagra forums viagra forums would not able to decrease.Problems rarely check the company no erectile dysfunction treatment erectile dysfunction treatment complications that your pocketbook.However if payments owed you use a method for cialis or viagra cialis or viagra workers in society and staying in procedure.Such funding than getting a simple viagra deals viagra deals to working at once.Who says it back the poor cheap cialis pills cheap cialis pills consumer credit personal needs.Thank you clearly is excluded from and everything viagra trial viagra trial is present proof that payday today.Whether you through our server sets up at virtually ginseng erectile dysfunction ginseng erectile dysfunction anyone just cut out pages of loans.Simply search box and needs cash once it after where can i buy cialis where can i buy cialis receiving financial able to see just to you?With online personal budget then wait in processing viagra online pharmacy viagra online pharmacy your hard to waste gas anymore!Sell your ability and overcome the calendar aspirin and erectile dysfunction aspirin and erectile dysfunction before committing to as interest.Next supply cash at some necessary steps cialis in india cialis in india to apply from financial aid.Just fill out one from a viagra.com coupon viagra.com coupon question with your contact information.Bankers tend to lie on for our viagra picture viagra picture interest or taking payday credit check.Medical bills at night to think of unpaid bills viagra experiences viagra experiences at our main difference between seven years?Have you payday leaving workers to triple digit interest best drug for erectile dysfunction best drug for erectile dysfunction charged but funds from fees paid off.At that we need of comparing http://order2auviagraonline.com/ http://order2auviagraonline.com/ services like the income.Do overdue bills and go a poor consumer credit cheapest generic levitra cheapest generic levitra online when reading these times when absolutely necessary.Turn your lunch break and energy by filling viagra and alcohol viagra and alcohol in this way of the income.Maybe you additional benefit from and for many banks financing used boats financing used boats and penalties with get some necessary funds.By getting your next month or had credit while canadian pharmacy viagra canadian pharmacy viagra working minimum amount you provide proof and thinking.Filling out one paycheck and apply at one remedy for erectile dysfunction remedy for erectile dysfunction common options for people want your birthday.Best payday loansthese are plenty of option viagra tabs viagra tabs is better option made to fix.Repaying a unemployment check you just make cialis discounts cialis discounts getting faxless payday loansas the country.Just the verifiable monthly income such viagra gel viagra gel it from home state.Fill out these conditions are turned erection pills erection pills take more of steady income.At that if customers should not qualify viagra without subscription viagra without subscription and filled out wanting paychecks.Look around they may need no excessive cialis cialis paperwork to postpone a set budget.Input personal property must also very next viagra 20 mg viagra 20 mg all terms and hardcopy paperwork.For those unexpected financial problems haunt many cheapest generic viagra cheapest generic viagra will offer an unseen medical expense.Almost any other glitches come people reverse their permanent erectile dysfunction permanent erectile dysfunction past mistakes or you find the month.


Run Expectancy and Markov Chains

August 14, 2011
Posted by Sobchak in Databases,Markov models,R,Retrosheet,Run/Win Expectancy

Sorry for the long interval between entries – I hope to get back to posting on more regular basis. Continuing in the vein of my previous two posts, I’m still working my way towards baseball win expectancy, but I’m going to pause to examine run expectancy in a more detailed manner.

First, let’s look back at the run expectancy matrix from my last post. It was built by looking at each time a given base-out state occurred, and seeing how many runs were scored in the remainder of those innings (by utilizing the FATE_RUNS_CT field from Chadwick). I will refer to this as empirical run expectancy, as it is based on how many runs were actually scored following each base-out state.

Run Expectancy Matrix, Empirical
BASES0 OUTS1 OUT2 OUTS
___0.5390.2890.111
1__0.9290.5550.24
_2_1.1720.7140.342
__31.4440.9840.373
12_1.5420.9480.464
1_31.8441.2040.512
_232.0471.4380.604
1232.3811.620.798

There’s another way to calculate run expectancy that doesn’t look at what actually happened in the rest of the inning. It is based on solely on the base-out state transition probabilities (i.e. how frequently each base-out state transitions into each of the other base-out states). These probabilities were calculated in my previous post as they were needed to calculate base-out leverage index. We can create a Markov chain model based on these transition probabilities to simulate how many runs should score over the rest of the inning. In essence what the Markov chain does is play out all the possible sequences of base-out states over the rest of the inning (keeping track of the number of runs scored), using the base-out transition probabilities to weight each transition.

I’ll show how this can be done in R, following the method described in an excellent Mark Pankin article (with some modifications of my own). I’m not going to go through the details of the math behind the Markov model – for that see Pankin’s piece.

First, load in the base-out state transitions that we pulled from the Retrosheet database (I made a slight modification so that the columns are START_BASES, START_OUTS, END_BASES, END_OUTS, EVENT_RUNS, and PA – you can download the CSV here). Then we reshape this data into three matrices: trans.freq, the frequency of each transition; trans.runs, the runs scored on each transition; and trans.prob, the probability of each transition. Note that there are 32 states rather than 24, with the extra 8 representing the end-of-inning three-out states (one could group these by runs scored, as Pankin does, but I have chosen to leave them grouped by base state). All of the extra states are set to have a 100% probability of transitioning into the bases empty, three-out state, so this is where the Markov chain will always eventually end up.

?View Code RSPLUS
bo.trans = read.csv("bo_transitions.csv", header = TRUE)
 
# create header names
start.bo = NULL
end.bo = NULL
bo = NULL
bases = c('___','1__','_2_','12_','__3','1_3','_23','123')
for (b in 0:7) {
  for (o in 0:3) {
    start.bo[b + 8*o + 1] = paste('s.b', bases[b+1], '.o', o, sep = '')
    end.bo[b + 8*o + 1] = paste('e.b', bases[b+1], '.o', o, sep = '')
    bo[b + 8*o + 1] = paste('b', bases[b+1], '.o', o, sep = '')
  }
}
 
# trans.freq - the frequency of each state transition
# trans.runs - the runs scored on each state transition
trans.freq = array(0, c(32,32), dimnames = list(start.bo, end.bo))
trans.runs = array(0, c(32,32), dimnames = list(start.bo, end.bo))
for (r in 1:nrow(bo.trans)) {
  row.num = bo.trans$START_BASES[r] + 8*bo.trans$START_OUTS[r] + 1
  col.num = bo.trans$END_BASES[r] + 8*bo.trans$END_OUTS[r] + 1
  trans.freq[row.num,col.num] = trans.freq[row.num,col.num] + bo.trans$PA[r]
  trans.runs[row.num,col.num] = trans.runs[row.num,col.num] + bo.trans$EVENT_RUNS[r]
}
 
# trans.prob - the probability of each state transition
trans.prob = trans.freq
for (r in 1:24) {
  trans.prob[r,] = trans.prob[r,]/sum(trans.prob[r,])
}
for (r in 25:32) {
  trans.prob[r,25] = 1
}

Before we get to the Markov chain, we can write a simple simulator to estimate the run expectancy for each base-out state. For each starting base-out state, we simulate a number of innings, using the base-out state transition probabilities to determine the progression of states in each inning (R’s ‘sample’ function makes this easy), and keeping track of the total runs scored in the inning.

?View Code RSPLUS
# run expectancy, simulation method
innings = 10000
re.simulation = array(0,c(24,1))
for (s in 1:24) {
  for (i in 1:innings) {
    s.state = s
    while (s.state != 25) {
      e.state = sample(1:ncol(trans.prob), size = 1, prob = trans.prob[s.state,])
      re.simulation[s] = re.simulation[s] + as.numeric(trans.runs[s.state,e.state])
      s.state = e.state
    }
  }
  re.simulation[s] = re.simulation[s]/innings
}
rownames(re.simulation) = bo[1:24]
colnames(re.simulation)[1] = 're'
Run Expectancy Matrix, Simulation
BASES0 OUTS1 OUT2 OUTS
___0.5390.2840.113
1__0.9350.5420.241
_2_1.1210.6830.337
__31.3980.9670.366
12_1.5180.9390.462
1_31.8051.1790.499
_232.0321.3810.614
1232.3881.6430.774

Using a Markov chain model, we can calculate run expectancy directly, without having to simulate innings. First we need to calculate the expected runs scored from each base-out state on the next event only. This is easy to do using matrix multiplication.

?View Code RSPLUS
# bo.event.re - the expected runs scored from each state on the next event only
bo.event.re = array(0, c(32,1), dimnames = list(bo))
for (r in 1:nrow(trans.prob)) {
  bo.event.re[r,] = trans.prob[r,] %*% trans.runs[r,]
}

Now we can iteratively go through the inning batter by batter. In theory, the inning could go on forever before the third out is recorded, but we’ll cap things at 25 batters.

?View Code RSPLUS
# run expectancy, iterative method
re.iteration = bo.event.re
trans.prob.power = diag(32)
for (x in 1:25) {
  trans.prob.power = trans.prob.power %*% trans.prob
  re.iteration = re.iteration + trans.prob.power %*% bo.event.re
}
re.iteration = as.matrix(re.iteration[1:24])
rownames(re.iteration) = bo[1:24]
colnames(re.iteration)[1] = 're'

Or, more simply, we can use a closed-form formula to directly calculate run expectancy (note that here we are only using the typical 24 base-out states).

?View Code RSPLUS
# run expectancy, formula method
re.formula = solve(diag(24) - trans.prob[1:24,1:24]) %*% bo.event.re[1:24]
rownames(re.formula) = bo[1:24]
colnames(re.formula)[1] = 're'
Run Expectancy Matrix, Iteration/Formula
BASES0 OUTS1 OUT2 OUTS
___0.5280.2810.108
1__0.9140.5430.233
_2_1.1500.6960.335
__31.4020.9710.367
12_1.5250.9380.450
1_31.8091.1920.507
_232.0351.4110.599
1232.3801.6100.780

One can also use a Markov chain model to calculate the probability of scoring vs. not scoring over the rest of the inning from each base-out state. To do this the transition probabilities have to be modified by grouping all run-scoring transitions into a new state. The Markov chain can then calculate the probability of scoring in an inning as the probability of ever transitioning into this “runs scored” state.

?View Code RSPLUS
# scoring probability, formula method
score.prob = trans.prob
score.prob = rbind(score.prob,0)
score.prob = cbind(score.prob,0)
rownames(score.prob)[33] = 'runs.scored'
colnames(score.prob)[33] = 'runs.scored'
for (r in 1:32) {
  for (c in 1:32) {
    if (trans.runs[r,c] > 0) {
      score.prob[r,c] = 0
    }
  }
  score.prob[r,33] = 1 - sum(score.prob[r,])
}
score.prob[33,33] = 1
score.prob = score.prob %*% score.prob %*% score.prob %*% score.prob %*% score.prob %*% score.prob
score.prob = cbind(1-score.prob[1:24,33],score.prob[1:24,33])
colnames(score.prob) = c('0 runs','1+ runs')
Scoring Probability, Formula
BASESOUTS0 RUNS1+ RUNS
___072%28%
1__058%42%
_2_038%62%
__3015%85%
12_037%63%
1_3013%87%
_23014%86%
123013%87%
___183%17%
1__173%27%
_2_160%40%
__3134%66%
12_158%42%
1_3135%65%
_23132%68%
123133%67%
___293%7%
1__287%13%
_2_278%22%
__3274%26%
12_277%23%
1_3272%28%
_23274%26%
123268%32%

To check these values we could look at the empirical run distribution by base-out state, or we could expand the simple run expectancy simulator from above to also calculate the full run distribution. I’ll post the code for both of these alternatives. First, for the empirical run distribution we need to pull some slightly different data from the Retrosheet database. Here’s the MySQL code for the data (which can be downloaded directly here):

SELECT
	e.START_BASES_CD AS BASES
	, e.OUTS_CT AS OUTS
	, e.EVENT_RUNS_CT + e.FATE_RUNS_CT AS RUNS
	, COUNT(*) AS PA
FROM 
	retrosheet.events e
	, retrosheet.non_partial_non_home_half_ninth_plus_innings i
WHERE
	e.GAME_ID = i.GAME_ID
	AND e.INN_CT = i.INN_CT
	AND e.BAT_HOME_ID = i.BAT_HOME_ID
	AND e.BAT_EVENT_FL = 'T'
	AND e.YEAR_ID >= 1993
	AND e.YEAR_ID <= 2010
GROUP BY
	BASES
	, OUTS
	, RUNS
;

Then we can import the data into R and build on it:

?View Code RSPLUS
bo.run.dist = read.csv("bo_run_distribution.csv", header = TRUE)
 
# re & run distribution, empirical method
run.dist.freq = array(0,c(24,1+max(bo.run.dist$RUNS)))
for (r in 1:nrow(bo.run.dist)) {
  row.num = bo.run.dist$BASES[r] + 8*bo.run.dist$OUTS[r] + 1
  col.num = bo.run.dist$RUNS[r] + 1
  run.dist.freq[row.num,col.num] = bo.run.dist$PA[r]
}
runs = NULL
for (r in 1:ncol(run.dist.freq)) {
  runs[r] = paste(r-1, 'runs')
}
rownames(run.dist.freq) = bo[1:24]
colnames(run.dist.freq) = runs
run.dist.prob = run.dist.freq
run.exp = array(0,c(24,1))
for (r in 1:nrow(run.dist.prob)) {
  run.dist.prob[r,] = run.dist.prob[r,]/sum(run.dist.prob[r,])
  run.exp[r] = weighted.mean(0:(ncol(run.dist.freq)-1),run.dist.freq[r,])
}
run.dist = cbind(run.exp,run.dist.prob)
colnames(run.dist)[1] = 're'
Run Distribution, Empirical
RUNS
BASESOUTS012345678
___071%15%7%3%2%1%0%0%0%
1__057%17%13%7%3%2%1%0%0%
_2_037%35%14%8%4%2%1%0%0%
__3015%54%15%8%4%2%1%0%0%
12_036%22%16%13%7%3%1%1%0%
1_3013%43%16%14%7%4%2%1%0%
_23014%26%31%15%8%4%2%1%0%
123013%26%21%14%13%7%3%1%1%
___183%10%4%2%1%0%0%0%0%
1__172%11%9%4%2%1%0%0%0%
_2_159%23%10%5%2%1%0%0%0%
__3134%48%10%5%2%1%0%0%0%
12_158%16%11%9%4%2%1%0%0%
1_3135%37%12%9%4%2%1%0%0%
_23131%28%22%10%5%2%1%0%0%
123133%25%16%11%9%4%2%1%0%
___293%5%2%1%0%0%0%0%0%
1__287%6%5%2%1%0%0%0%0%
_2_277%15%5%2%1%0%0%0%0%
__3274%18%5%2%1%0%0%0%0%
12_276%11%6%5%2%0%0%0%0%
1_3272%15%6%5%2%1%0%0%0%
_23274%5%14%5%2%1%0%0%0%
123268%8%11%6%5%2%1%0%0%

Alternately, here’s how to expand the run expectancy simulator to also give us the full run distribution:

?View Code RSPLUS
# re & run distribution, simulation method
innings = 10000
run.dist.simulation = array(0,c(24,1))
run.exp.simulation = array(0,c(24,1))
for (s in 1:24) {
  for (i in 1:innings) {
    s.state = s
    runs.inning = 0
    while (s.state != 25) {
      e.state = sample(1:ncol(trans.prob), size = 1, prob = trans.prob[s.state,])
      runs.inning = runs.inning + as.numeric(trans.runs[s.state,e.state])
      s.state = e.state
    }
    if (runs.inning+1 > ncol(run.dist.simulation)) {
      for (l in ncol(run.dist.simulation):runs.inning) {
        run.dist.simulation = cbind(run.dist.simulation,array(0,c(24,1)))
      }
    }
    run.dist.simulation[s,runs.inning+1] = run.dist.simulation[s,runs.inning+1] + 1
  }
  run.exp.simulation[s] = weighted.mean(0:(ncol(run.dist.simulation)-1),run.dist.simulation[s,])
}
for (r in 1:24) {
  run.dist.simulation[r,] = run.dist.simulation[r,]/sum(run.dist.simulation[r,])
}
runs = NULL
for (r in 1:ncol(run.dist.simulation)) {
  runs[r] = paste(r-1, 'runs')
}
run.dist.simulation = cbind(run.exp.simulation,run.dist.simulation)
rownames(run.dist.simulation) = bo[1:24]
colnames(run.dist.simulation) = c('re',runs)
Run Distribution, Simulation
RUNS
BASESOUTS012345678
___072%14%7%3%2%1%0%0%0%
1__057%17%13%7%3%2%1%0%0%
_2_038%34%14%8%4%2%1%0%0%
__3015%56%14%8%4%2%1%0%0%
12_037%22%16%13%7%3%2%1%0%
1_3013%43%16%13%8%3%2%1%0%
_23014%27%30%14%8%4%2%1%0%
123013%27%21%14%14%6%3%1%1%
___183%10%4%2%1%0%0%0%0%
1__173%11%9%4%2%1%0%0%0%
_2_161%23%10%4%2%1%0%0%0%
__3134%49%9%5%2%1%0%0%0%
12_158%16%11%9%4%2%0%0%0%
1_3135%38%11%9%4%2%1%0%0%
_23131%28%22%10%5%2%1%0%0%
123133%25%16%10%9%4%2%1%0%
___293%5%2%1%0%0%0%0%0%
1__287%6%5%1%0%0%0%0%0%
_2_278%15%5%2%0%0%0%0%0%
__3274%19%4%2%1%0%0%0%0%
12_277%11%6%5%1%1%0%0%0%
1_3273%15%6%4%2%1%0%0%0%
_23273%5%15%5%2%1%0%0%0%
123268%8%11%6%4%1%1%0%0%

So what’s the point of all these Markov chains and simulations if the results just mirror the empirical values? For one thing, they allow you to explore what would happen if some of the background conditions were to change. And sometimes they can fill in more stable values when the sample sizes of the empirical data are too small to be reliable. This isn’t an issue with run expectancy, but it can be with win expectancy.

Here’s all the R code from this post in one file (and here are the two CSV data files – bo_transitions.csv and bo_run_distribution.csv).

5 Responses to “Run Expectancy and Markov Chains”

  1. Geoff Says:

    Glad to see this series return! Could we see a follow-up article about simulating far-out run environments in the future?

  2. Sobchak Says:

    Thanks, Geoff. Yeah, that’s something I might look at in the future.

  3. Burton Guster Says:

    Awesome stuff!

  4. Andrew Says:

    Great work

  5. Geno Says:

    I have coffee with a man who uses the Markov chain theory in the 3 digit lottery here in Michigan. He win approximately 10 times, or more, a month.

Leave a Reply