I wrote up a crude Markovian lineup simulator yesterday. It definitely needs more tweaking, but I want to explain the basics of how this works for those who are unfamiliar with it.
The inputs
Obviously, the lineup order itself! For now, because this is the simplest possible one I can make, the inputs needed for each player are:
Lineup position
Projected PAs
Projected singles
Projected doubles
Projected triples
Projected home runs
Projected walks
Under the hood
Once we have these, we divide each of the outcomes (singles, doubles, etc.) by the PAs to get the rate at which each one occurs. We use these rates to build a set of probabilities for each player. Basically, it looks like
Chance of single | Chance of double | Chance of triple | Chance of HR | Chance of walk | Chance of out
Obviously these chances have to add up to 1, so the chance of an out is whatever is left over after the other five.
For example, for Ryan Theriot's PECOTA projection we have
Counting stats:
PAs | 1b | 2b | 3b | HR | BB |
647 | 132 | 29 | 3 | 5 | 63 |
Rate stats:
1b | 2b | 3b | HR | BB |
0.2040 | 0.0448 | 0.0046 | 0.0077 | 0.0974 |
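Here is a rough Python sketch of that division step. It is only an illustration, not the simulator's actual MATLAB code, and the function name `outcome_rates` and the dictionary layout are placeholders I made up for this example.

```python
def outcome_rates(pa, singles, doubles, triples, home_runs, walks):
    """Convert projected counting stats into per-PA probabilities."""
    counts = {"1b": singles, "2b": doubles, "3b": triples,
              "HR": home_runs, "BB": walks}
    rates = {name: count / pa for name, count in counts.items()}
    # Whatever probability is left over is the chance of an out,
    # which is what makes the six chances add up to 1.
    rates["out"] = 1.0 - sum(rates.values())
    return rates


# Theriot's projected counting stats from the table above.
theriot = outcome_rates(pa=647, singles=132, doubles=29, triples=3,
                        home_runs=5, walks=63)
for name, rate in theriot.items():
    print(f"{name}: {rate:.4f}")
```

The "out" entry is never entered by hand; it falls out of the other five, which is why outs are not in the input list.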
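Once we have these rates, we add them up cumulatively to define bins that determine what happens in each PA: each number below is the running total of the rates up to and including that outcome. Thus, continuing with this example, we would have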
1b | 2b | 3b | HR | BB |
0.2040 | 0.2488 | 0.2535 | 0.2612 | 0.3586 |
In each PA, we generate a random number between 0 and 1. If it is less than the first number, it is a single. If not, we check to see if it is less than the next number, which gives a double, and so on down the line. If it's bigger than the last one, the player makes an out.
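The bin lookup can be sketched in Python like this. Again, this is an illustration under my own naming (`make_bins`, `outcome_for`, and `plate_appearance` are hypothetical helpers), not the MATLAB code itself, and the seed is arbitrary.

```python
import bisect
import random
from itertools import accumulate

OUTCOMES = ["1b", "2b", "3b", "HR", "BB"]


def make_bins(rates):
    """Running totals of the rates: the upper edge of each outcome's bin."""
    return list(accumulate(rates))


def outcome_for(draw, bins):
    """Return the outcome whose bin contains a number drawn from [0, 1)."""
    # bisect_right counts how many bin edges are <= draw,
    # i.e. how many bins the draw has already passed.
    index = bisect.bisect_right(bins, draw)
    return OUTCOMES[index] if index < len(OUTCOMES) else "out"


def plate_appearance(bins, rng):
    """Simulate one PA by drawing a uniform random number."""
    return outcome_for(rng.random(), bins)


theriot_rates = [132 / 647, 29 / 647, 3 / 647, 5 / 647, 63 / 647]
theriot_bins = make_bins(theriot_rates)
print([round(edge, 4) for edge in theriot_bins])

rng = random.Random(2010)  # fixed seed only so the run is repeatable
print([plate_appearance(theriot_bins, rng) for _ in range(10)])
```

Using a sorted list of bin edges means one random number settles the whole PA, and anything past the last edge is an out without needing a sixth bin.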
Now, we simply do this moving through the lineup, keeping track of outs and the number of innings. As this is a first-cut lineup simulator, I'm making the unrealistic (though perhaps not for the Cubs...) assumption that all baserunning is done station-to-station. Thus if someone hits a single, everyone moves up only one base. There are no productive outs. Obviously the next step is to add in a few baserunning possibilities since many players score from 2b on singles.
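A minimal Python sketch of that station-to-station loop might look like the following. It is not the MATLAB code, and all of the names are made up for illustration. One thing the description above doesn't pin down is what a walk does to the runners; I'm assuming here that a walk only moves runners who are forced. The nine-Theriot lineup is just a placeholder so the example runs.

```python
import random

OUTCOMES = ["1b", "2b", "3b", "HR", "BB"]
# Bases the batter and every runner move on a hit (station-to-station).
ADVANCE = {"1b": 1, "2b": 2, "3b": 3, "HR": 4}


def draw_outcome(bins, rng):
    """Walk through the cumulative bins; past the last edge is an out."""
    draw = rng.random()
    for name, edge in zip(OUTCOMES, bins):
        if draw < edge:
            return name
    return "out"


def advance_runners(bases, outcome):
    """bases = (first, second, third) as booleans. Returns (bases, runs)."""
    first, second, third = bases
    if outcome == "out":
        return (first, second, third), 0  # no productive outs
    if outcome == "BB":
        # Assumption: a walk only pushes runners who are forced.
        if first and second and third:
            return (True, True, True), 1
        if first and second:
            return (True, True, True), 0
        if first:
            return (True, True, third), 0
        return (True, second, third), 0
    n = ADVANCE[outcome]
    # Batter starts at 0, runners at 1..3; everyone moves exactly n bases.
    positions = [0] + [i + 1 for i, on in enumerate(bases) if on]
    moved = [p + n for p in positions]
    runs = sum(1 for p in moved if p >= 4)
    return tuple(base in moved for base in (1, 2, 3)), runs


def simulate_game(lineup_bins, rng, innings=9):
    """Play one game; the batting order carries over between innings."""
    runs, batter = 0, 0
    for _ in range(innings):
        outs, bases = 0, (False, False, False)
        while outs < 3:
            outcome = draw_outcome(lineup_bins[batter], rng)
            if outcome == "out":
                outs += 1
            else:
                bases, scored = advance_runners(bases, outcome)
                runs += scored
            batter = (batter + 1) % len(lineup_bins)
    return runs


theriot_bins = [0.2040, 0.2488, 0.2535, 0.2612, 0.3586]
lineup = [theriot_bins] * 9  # placeholder lineup: nine copies of Theriot
rng = random.Random(0)       # arbitrary seed for repeatability
games = 10_000
average = sum(simulate_game(lineup, rng) for _ in range(games)) / games
print(f"{average:.3f} runs per game")
```

Adding better baserunning later would mostly mean replacing `advance_runners` with something that draws the advancement at random.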
The key feature that defines a Markov process is that what happens next depends only on the current state, not on how we got there. In this simulator, that means each PA is drawn independently of the ones before it. Thus each batter in this simulator bats the same whether the generic pitcher is throwing a no-hitter or has just given up 5 straight HRs. Another, more glaring consequence of this simple model is that there is no situational hitting or baserunning (sacrifices, hit and runs, stolen bases, sacrifice flies, etc.). We should keep this in mind when we see the results of the simulator.
Some results
In a million simulations, the Cubs lineup of
- Theriot
- Fukudome
- Lee
- Ramirez
- Byrd
- Soriano
- Soto
- Baker
- Zambrano (using career numbers, since he doesn't have a hitting projection)
scores an average of 3.2862 runs per game.
The Cardinals lineup of
- Schumaker
- Rasmus
- God
- Holliday
- Ludwick
- Molina
- Freese
- Wainwright
- Ryan
scores an average of 3.3504 runs per game over a million sims.
I'm not sure that even a million sims are enough to really pin down a good number. The Central Limit Theorem is nice in theory but can get a bit frustrating in practice. Running 100,000 games takes roughly a minute.
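One way to put a number on that worry is the standard error of the average, which is the per-game standard deviation divided by the square root of the number of games. Here is a small Python sketch of that calculation. The function name is a placeholder, and the run totals in `sample` are made up purely to show the call; they are not output from the simulator.

```python
import math
import statistics


def mean_and_standard_error(run_totals):
    """Sample mean of per-game runs and the standard error of that mean."""
    n = len(run_totals)
    mean = statistics.fmean(run_totals)
    sd = statistics.stdev(run_totals)  # sample standard deviation (n - 1)
    return mean, sd / math.sqrt(n)


# Made-up run totals for ten games, only to demonstrate the function.
sample = [2, 5, 0, 3, 7, 1, 4, 3, 2, 6]
mean, se = mean_and_standard_error(sample)
print(f"mean = {mean:.3f}, standard error = {se:.3f}")
```

Because the standard error shrinks like one over the square root of the number of games, running 100 times as many games only cuts the noise by a factor of 10.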
Up next:
- Obviously, these runs/game numbers are pretty low. Changing the baserunning should make a big difference, especially for scoring from second on singles, as mentioned above. I'll try to find some average advancement numbers to factor this into the code.
- It would be nice to add a lineup optimizer option that takes 9 players and finds the optimal batting order. It would be pretty expensive to brute force, though, since there are 9! = 362880 possible orderings of 9 players, so we would either need some sort of mixed-integer discrete optimization (ugly and expensive) or pretty much just have to plug in some guesses.
- Figure out how much noise there is in the runs per game depending on the number of simulations. I should be able to find a standard deviation, at least (empirically or not).
- Set it up to draw players from some sort of database or other external file. Having to enter the numbers in by hand is annoying.
- Write this in some sort of cross-platform language. I think Perl is probably the best bet, though I don't know a ton about running Perl scripts on Windows machines.
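To put the brute-force lineup optimizer from the list above in perspective, here is a quick back-of-the-envelope calculation in Python. The 100,000 games per minute figure is the timing quoted earlier; running a million games per ordering is my assumption, chosen to match the sample size used for the results.

```python
import math
from itertools import islice, permutations

players = ["Theriot", "Fukudome", "Lee", "Ramirez", "Byrd",
           "Soriano", "Soto", "Baker", "Zambrano"]

orderings = math.factorial(len(players))  # 9! = 362880

games_per_minute = 100_000      # timing quoted earlier in the post
games_per_ordering = 1_000_000  # assumption: same sample size as the results

minutes = orderings * games_per_ordering / games_per_minute
days = minutes / (60 * 24)
print(f"{orderings} orderings, about {days:.0f} days of simulation")

# itertools.permutations yields orderings lazily, so peeking at a few is cheap.
for order in islice(permutations(players), 3):
    print(order)
```

Under those assumptions the full search works out to 2,520 days, or nearly seven years, which is why something smarter than brute force (or far fewer games per ordering) would be needed.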
If you want to download and play with my 0.1 code (and can run MATLAB), you can find it at this link.
(EDIT: you should be able to run this code in Octave, which is an open-source program largely compatible with MATLAB and can be found at http://www.gnu.org/software/octave/index.html )