February 28, 2010

Building a Markov lineup simulator (version 0.1)

I wrote up a crude Markovian lineup simulator yesterday. It definitely needs more tweaking, but I want to explain the basics of how this works for those who are unfamiliar with it.

The inputs

Obviously, the lineup order itself! For now, because this is the simplest possible simulator I can make, the inputs needed for each player are:

Lineup position
Projected PAs
Projected singles
Projected doubles
Projected triples
Projected home runs
Projected walks

Under the hood
Once we have this, we divide each of the outcomes (singles, doubles, etc.) by the PAs to get the rate at which each of them occurs. We use these rates to develop a range of probabilities for each player. Basically, it looks like

Chance of single | Chance of double | Chance of triple | Chance of HR | chance of walk | chance of out

Obviously these chances have to add up to 1.

For example, for Ryan Theriot's PECOTA projection we have
Counting stats:

PAs   1b    2b   3b   HR   BB
647   132   29   3    5    63

Rate stats:
1b       2b       3b       HR       BB
0.2040   0.0448   0.0046   0.0077   0.0974
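A minimal Python sketch of this step, not the post's MATLAB code. The function name and the dictionary keys are made up for illustration; the numbers are the Theriot counting stats above.

```python
def outcome_rates(pa, singles, doubles, triples, home_runs, walks):
    """Convert projected counting stats into per-PA probabilities."""
    counts = {"1b": singles, "2b": doubles, "3b": triples,
              "HR": home_runs, "BB": walks}
    rates = {name: count / pa for name, count in counts.items()}
    # Whatever probability is left over is the chance of an out,
    # so the six chances add up to 1 by construction.
    rates["out"] = 1.0 - sum(rates.values())
    return rates


theriot = outcome_rates(pa=647, singles=132, doubles=29, triples=3,
                        home_runs=5, walks=63)
for name, rate in theriot.items():
    print(f"{name}: {rate:.4f}")
```

Computing the out chance as the remainder means the input never has to include a projected number of outs.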


Once we have these rates, we use these to define bins to determine what happens in each PA. Thus, continuing with this example, we would have
1b       2b       3b       HR       BB
0.2040   0.2488   0.2535   0.2612   0.3586

In each PA, we generate a random number between 0 and 1. If it is less than the first number, it is a single. If not, we check to see if it is less than the next number, which gives a double, and so on through the bins. If it is bigger than the last number, the batter makes an out.
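The bin lookup can be sketched in Python roughly like this. It is an illustration under my own naming (OUTCOMES, make_bins, plate_appearance are hypothetical), not the post's MATLAB code.

```python
import random
from itertools import accumulate

OUTCOMES = ["1b", "2b", "3b", "HR", "BB"]


def make_bins(rates):
    """Turn per-PA rates (in OUTCOMES order) into cumulative upper bin edges."""
    return list(accumulate(rates))


def plate_appearance(bins, rng=random):
    """Draw one random number and return the first outcome whose bin holds it."""
    r = rng.random()
    for outcome, edge in zip(OUTCOMES, bins):
        if r < edge:
            return outcome
    return "out"  # r fell past the last bin edge


theriot_rates = [132 / 647, 29 / 647, 3 / 647, 5 / 647, 63 / 647]
theriot_bins = make_bins(theriot_rates)
print([round(edge, 4) for edge in theriot_bins])
print(plate_appearance(theriot_bins, random.Random(0)))
```

The rounded edges should reproduce the bins listed above. Passing in a seeded random.Random makes a run repeatable, which helps when comparing two versions of the simulator.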

Now, we simply do this moving through the lineup, keeping track of outs and the number of innings. As this is a first-cut lineup simulator, I'm making the unrealistic (though perhaps not for the Cubs...) assumption that all baserunning is done station-to-station. Thus if someone hits a single, everyone moves up only one base. There are no productive outs. Obviously the next step is to add in a few baserunning possibilities since many players score from 2b on singles.
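Here is a rough Python sketch of the game loop with station-to-station baserunning. It is not the post's MATLAB code, and all names are hypothetical. One assumption to flag loudly: the post does not say how walks move runners, so this sketch advances only forced runners on a walk. The lineup in the usage example is nine copies of the Theriot bins, purely as a stand-in.

```python
import random

OUTCOMES = ["1b", "2b", "3b", "HR", "BB"]


def plate_appearance(bins, rng):
    """Return the first outcome whose cumulative bin edge exceeds a random draw."""
    r = rng.random()
    for outcome, edge in zip(OUTCOMES, bins):
        if r < edge:
            return outcome
    return "out"


def advance(bases, outcome):
    """Station-to-station advancement. bases = [on1, on2, on3] -> (bases, runs)."""
    on1, on2, on3 = bases
    if outcome == "BB":
        # ASSUMPTION: on a walk only forced runners move up.
        runs = 1 if (on1 and on2 and on3) else 0
        return [True, on2 or on1, on3 or (on1 and on2)], runs
    n = {"1b": 1, "2b": 2, "3b": 3, "HR": 4}[outcome]
    # Treat the batter as a runner at "base 0"; everyone moves up n bases.
    runners = [0] + [b + 1 for b, occupied in enumerate(bases) if occupied]
    moved = [r + n for r in runners]
    runs = sum(1 for r in moved if r >= 4)
    return [any(r == b for r in moved) for b in (1, 2, 3)], runs


def simulate_game(lineup_bins, rng, innings=9):
    """Play one game and return runs scored. No productive outs."""
    runs, batter = 0, 0
    for _ in range(innings):
        outs, bases = 0, [False, False, False]
        while outs < 3:
            outcome = plate_appearance(lineup_bins[batter % len(lineup_bins)], rng)
            batter += 1  # the lineup carries over from inning to inning
            if outcome == "out":
                outs += 1
            else:
                bases, scored = advance(bases, outcome)
                runs += scored
    return runs


theriot_bins = [132 / 647, 161 / 647, 164 / 647, 169 / 647, 232 / 647]
rng = random.Random(2010)
games = [simulate_game([theriot_bins] * 9, rng) for _ in range(2000)]
print(sum(games) / len(games))
```

Keeping advance separate from the game loop means a less rigid baserunning rule could later be swapped in without touching the rest.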

The key feature that defines a Markov process is that what happens next depends only on the current state (here, the outs, the baserunners, and who is at bat), not on the history that led to it. Thus each batter in this simulator bats the same whether the generic pitcher is throwing a no-hitter or has just given up 5 straight HRs. Another more glaring consequence of keeping this version so simple is that there is no situational hitting or baserunning (sacrifices, hit and runs, stolen bases, sacrifice flies, etc.). We should keep this in mind when we see the results of the simulator.

Some results
In a million simulations, the Cubs lineup of
  • Theriot
  • Fukudome
  • Lee
  • Ramirez
  • Byrd
  • Soriano
  • Soto
  • Baker
  • Zambrano (using career numbers, since he doesn't have a hitting projection)

scores an average of 3.2862 runs per game.

The Cardinals lineup of
  • Schumaker
  • Rasmus
  • God
  • Holliday
  • Ludwick
  • Molina
  • Freese
  • Wainwright
  • Ryan

scores an average of 3.3504 runs per game over a million sims.

I'm not sure that even a million runs are enough to really get a good number. The Central Limit Theorem is nice in theory but can get a bit frustrating in practice. Running 100,000 games takes roughly a minute.
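One way to put a number on this is the standard error of the mean, which is the per-game standard deviation divided by the square root of the number of games. A minimal Python sketch follows; the function name is hypothetical and the sample list is made-up toy data, not output from the simulator.

```python
import math
import statistics


def mean_and_standard_error(runs_per_game):
    """Return (mean, standard error of the mean) for a list of per-game run totals."""
    n = len(runs_per_game)
    mean = statistics.mean(runs_per_game)
    sd = statistics.stdev(runs_per_game)  # sample standard deviation
    return mean, sd / math.sqrt(n)


# Made-up run totals for eight games, only to show the call.
sample = [3, 0, 5, 2, 7, 1, 4, 2]
mean, se = mean_and_standard_error(sample)
print(f"{mean:.3f} +/- {se:.3f}")
```

Since the error shrinks with the square root of the number of games, running four times as many sims only halves it. If the per-game standard deviation were around 3 runs (an assumption for illustration, not something measured here), a million games would give a standard error of about 0.003 runs per game.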

Up next:
  • Obviously, these runs/game numbers are pretty low. Changing the baserunning should make a big difference, especially for the scoring from second stuff mentioned above. I'll try to find some average advancement numbers to factor this into the code.
  • It would be nice to add a lineup optimizer option that takes 9 players and finds the optimal batting order. It would be pretty expensive to brute force though, since there are 9! = 362880 possible orderings of 9 players and each one needs its own batch of simulated games, so there would either have to be some sort of mixed-integer type discrete optimization (ugly and expensive) or we'll pretty much just have to plug in some guesses.
  • Figure out how much noise there is in the runs per game depending on the number of simulations. I should be able to find a standard deviation, at least (empirically or not).
  • Set it up to draw players from some sort of database or other external file. Having to enter the numbers in by hand is annoying.
  • Write this in some sort of cross-platform language. I think Perl is probably the best bet, though I don't know a ton about running Perl scripts on Windows machines.
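For the lineup optimizer item, the "plug in some guesses" route could look something like the Python sketch below. All names are hypothetical. The score function would really be the average runs per game from the simulator; here toy_score is a made-up stand-in so the example runs on its own.

```python
import math
import random


def best_of_random_orders(players, score, n_guesses, rng):
    """Try n_guesses random batting orders and keep the best-scoring one."""
    best_order, best_score = list(players), score(players)
    for _ in range(n_guesses):
        order = list(players)
        rng.shuffle(order)
        s = score(order)
        if s > best_score:
            best_order, best_score = order, s
    return best_order, best_score


def toy_score(order):
    """Stand-in for 'average runs per game': rewards big numbers batting early."""
    return sum(value * (len(order) - slot) for slot, value in enumerate(order))


print(math.factorial(9))  # number of possible batting orders for 9 players

players = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # placeholder "players"
order, value = best_of_random_orders(players, toy_score, 500, random.Random(9))
print(order, value)
```

Random guessing gives no guarantee of finding the best order. With a noisy simulator as the score, two orders also need enough games each for their difference to be bigger than the noise.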


If you want to download and play with my 0.1 code (and can run matlab), you can find it at this link.

(EDIT: you should be able to run this code in GNU Octave, an open-source program that is largely compatible with MATLAB and can be found at http://www.gnu.org/software/octave/index.html )

8 comments:

shawndgoldman said...

Awesome.

Unknown said...

Great work Berselius! Keep the good stuff coming!

Anonymous said...

> It would be nice to add a lineup optimizer option

This sounds like a good place to use a genetic optimizer. They tend to provide close to optimum results fairly quickly.

http://dces.essex.ac.uk/staff/rpoli/gp-field-guide/
http://fog.neopages.org/helloworldgeneticalgorithms.php
http://en.wikipedia.org/wiki/Genetic_algorithm

J. Cross said...

Berselius,

First off, this is awesome. I'm playing with it in an attempt to teach myself matlab.

The runner advancements do seem to make a difference.

With all runners going base to base I have the Cubs scoring 3.30 runs/game.

If runners from 2nd score half the time on a single we're up to 3.59 runs/game.

If they all score it's 3.97 runs/game. (100,000 simulations)

I'm surprised it makes this big of a difference and this is without sac flies or going 1st-to-3rd.

Berselius said...

hey, glad you're interested! I've got some newer versions of the code that allow for more exotic baserunning (and strikeouts), but given all the conditionals in there it runs incredibly slowly. Tango had a few suggestions which could be worth building on, though while they're clever I'm not convinced that they cost less overall (at least, not to the level of detail I'm looking to code...)

I'm not surprised that the baserunning adds that much - I'm just talking out of my ass here but a nontrivial number of runs are scored off singles with runners on 2b.

Anonymous said...

I found out about your site. Your code is pretty cool. I did notice one mistake: some base1 were written basel (with the letter L). So basically, players who walked were not getting on base, hence the low scoring rates.

Berselius said...

Holy crap, nice catch. That makes a lot more sense.

Anonymous said...

Also, the randomizer doesn't seem to be reliable (at least on my computer; warning: I'm no analyst, just a baseball fan who happens to have MATLAB on my computer).

In any case, when I plugged in N=162, to simulate a season, I had very weird seasonal stats, 80HR 200 RBIs would occasionally appear. I modified the randomizer, and the stats now totally make sense on a seasonal basis.

But my emphasis again: thanks for the hard work and for sharing.