Building a Retrosheet Database, Part 1October 27, 2010
Posted by Sobchak in Databases,Retrosheet,Run/Win Expectancy
I want to be able to calculate Tangotiger’s WPA/LI stat (Win Probability Added/Leverage Index, a.k.a. situational wins, context neutral wins, or game state linear weights). To do that, I need to be able to calculate WPA and LI. To do that, I need to construct a Win Expectancy matrix. To do that, I need to build a Retrosheet database. So that’s where I’m going to start. I’ve never worked with a database or explored any Retrosheet data before, so I am starting from scratch (though I will be utilizing a lot of great resources from around the web). In a series of posts I will describe my process step-by-step. If you want to follow along, make sure you have a lot of free disk space (the parsed data files for all seasons take up over 5 GB). Also be aware that some of my instructions will be Windows-specific.
Getting the Retrosheet data
First I downloaded Retrosheet’s Regular Season Event Files by decade from here. I saved them into C:\retrosheet\data\zipped\ (I will be following the directory structure suggested by Tango, Colin Wyers, and others – if you use a different structure you may have to modify some of my instructions). These are six zip files (1950seve.zip, etc.) totaling 90 MB. I unzipped all the files into C:\retrosheet\data\unzipped\. This creates 2944 text files totaling 482 MB. There are four types of files:
- TEAM[SEASON] files (x60): lists the teams from each season (fields are Team ID, League, City, Nickname)
- [TEAM][SEASON].ROS files (x1442): lists each team’s roster (fields are Player ID, Last Name, First Name, Batting Hand, Throwing Hand, Team ID, Position)
- [SEASON][TEAM].EVA files (x726): play-by-play data for home games of American League teams (fields described here and here)
- [SEASON][TEAM].EVN files (x716): play-by-play data for home games of National League teams
These files are not in a format that can be easily imported into a database. They first have to be pre-processed using some free programs. There are two versions of these programs – those provided by Retrosheet, available here (BEVENT, BOX & BGAME); and those developed by Ted Turocy called Chadwick, available here (cwevent, cwgame, cwbox, cwsub & cwcomment). I will be using Chadwick, which provides some extra functionality. Download Chadwick 0.5.2 here (thanks to Colin Wyers for compiling this version for Windows) and extract the files to C:\retrosheet\common\programs\.
We need to run cwevent, cwgame, and cwsub on all of the 1442 EVA and EVN files to convert the data into yearly CSV files containing fields that can be imported into a database. Here are three batch files that automate this process, adopted from some from Colin Wyers. Unzip the batch files into C:\retrosheet\data\unzipped\ and run each of the three files. They will generate 180 CSV files (WARNING: totaling 5.11 GB!) in C:\retrosheet\data\parsed\ (note that these CSV files do not contain a header row with the field names). The cwevent batch file will generate 60 “all[season].csv” files containing the 97 basic fields (those that would be generated by BEVENT) for each play/event in each game, plus 61 extended fields. Descriptions of these fields can be found here. The cwgame batch file will generate 60 “games[season].csv” files containing 84 basic fields summarizing each game as listed here, and the cwsub batch file will generate 60 “subs[season].csv” files containing 10 fields describing in-game player substitutions as listed here.
Importing the data into MySQL
At this point we have the Retrosheet data in a form that can be imported into a database. I will be using MySQL, because it’s free and there is a lot of documentation for it around the web. I downloaded the ”mysql-essential-5.1.51-win32.msi” file (39 MB) from here. I selected the “Typical” installation, choosing the following settings as they came up (and the defaults everywhere else): “Detailed Configuration,” “Developer Machine;” “Non-Transactional Database Only;” “Decision Support (DSS)/OLAP;” “Include Bin Directory in Windows PATH;” and I set a root password. For a GUI to interact with the database, I chose SQLyog by downloading and installing “SQLyog-8.6.3-0Community.exe” (12MB) from here (some alternative GUIs are MySQL Workbench and HeidiSQL).
Here is a zip file containing six .SQL files that need to be run to set the initial structure of our Retrosheet database, and load in the data. Again, these are based on work done by Colin Wyers, with some minor adjustments. Unzip these into C:\retrosheet\loaders, and run them in order (the files are numbered). To run one of them, in SQLyog select “Restore from SQL dump” from the “Tools” menu, locate the .SQL file, and click “Execute.”
The first file, “01 create structure.sql,” builds the table structure for our Retrosheet database. For the three main tables we will create (“events” – from the “all[season].csv” files, “games” – from the “games[season].csv” files, and “subs” – from the “subs[season].csv” files) we have to name each field and specify what type of data it will contain (e.g. a number, a text string, etc.). This is a real pain (it’s one of the main things that makes working with a database more imposing that working with data in Excel), but it has to be done. The file contains comments to give an idea of what each command is doing.
The next three .SQL files load the .CSV files we created with Chadwick into the database. “05 partition.sql” partitions the data by year, which will improve the speed of queries (this may take a while to run). “06 lookup codes.sql” creates tables that make it easy to look up descriptions of various Retrosheet codes (as described here and here).
Creating a Run Expectancy Matrix
Now that we have the data loaded, we can begin to query it for information. To start, we can create a basic run expectancy matrix with the following code (which I found here):
SELECT e.OUTS_CT , e.START_BASES_CD , AVG(e.EVENT_RUNS_CT + e.FATE_RUNS_CT) AS RUNS FROM retrosheet.events e GROUP BY OUTS_CT, START_BASES_CD;
In the FROM line we say what table we want data from, and give it an abbreviated name “e,” which makes it easier to refer to its columns by just having to prefix them with “e.” The SELECT statement lists the columns we want to display – OUTS_CT, the number of outs; START_BASES_CD, one of the extended fields from Chadwick that specifies what bases are occupied at the start of the play; and RUNS, which we calculate by summing EVENT_RUNS_CT (the runs scored on the play) and FATE_RUNS_CT (the runs scored in the rest of the half inning after the play). The AVG takes the average runs for the base-out state. The GROUP BY specifies that we want to group the data by both the number of outs and the base state. To understand the data that is displayed you have to know what the base state codes represent. You can see this by glancing at the “lkup_cd_bases” table, which shows that 0 = bases empty, 1 = man on 1st, 2 = man on 2nd, 3 = men on 1st & 2nd, 4 = man on 3rd, 5 = men on 1st & 3rd, 6 = men on 2nd & 3rd, and 7 = bases loaded.
There’s a problem with this simple run expectancy matrix. As Tango mentions in The Book, not all innings should be included in the data set. All partial innings (where the inning ends before there are three outs) should be excluded, because we don’t know how many more runs would have scored before the third out was recorded. So we have to exclude the home halves of the 9th inning and later where the home team scores enough runs for a walk-off win. However, if we only exclude those innings, we introduce bias into the data by excluding home halves of the 9th or later where the home team scored at least one run, but including home halves of the 9th or later where the home team didn’t score (or didn’t score much). So to correct for this bias it is best to exclude all home half innings of the 9th or later. To do this it is helpful to first create a new table that lists only the innings we want to include.
CREATE TABLE non_partial_non_home_half_ninth_plus_innings AS SELECT YEAR_ID, GAME_ID, INN_CT, BAT_HOME_ID FROM retrosheet.events e WHERE INN_END_FL = "T" AND EVENT_OUTS_CT + OUTS_CT = 3 AND IF(INN_CT >= 9 AND BAT_HOME_ID = 1,1,0)=0;
Then we can use this slightly modified version (adopted from here) of our original run expectancy code:
SELECT e.OUTS_CT , e.START_BASES_CD , AVG(e.EVENT_RUNS_CT + e.FATE_RUNS_CT) AS RUNS FROM retrosheet.events e, retrosheet.non_partial_non_home_half_ninth_plus_innings i WHERE e.GAME_ID = i.GAME_ID AND e.INN_CT = i.INN_CT AND e.BAT_HOME_ID = i.BAT_HOME_ID AND e.YEAR_ID >= 1999 AND e.YEAR_ID <= 2002 GROUP BY OUTS_CT, START_BASES_CD;
I also added in some year constraints so that we are only looking at 1999-2002 data, which allows us to compare our results to those posted by Tango here. Almost a perfect match. Here’s what the query returned:
That’s all for now. Win Expectancy will have to wait till next post. In the meantime a useful resource to check out is the RetroSQL group.