February 28, 2010

Building a Markov lineup simulator (version 0.1)

I wrote up a crude Markovian lineup simulator yesterday. It definitely needs more tweaking, but I want to explain the basics of how this works for those who are unfamiliar with it.

The inputs

Obviously, the lineup order itself! For now, because this is the simplest possible simulator I can make, the inputs needed for each player are:

Lineup position
Projected PAs
Projected singles
Projected doubles
Projected triples
Projected home runs
Projected walks

Under the hood
Once we have this, we divide each of the outcomes (singles, doubles, etc.) by the PAs to get the rate at which each one occurs. We use these rates to build a set of probabilities for each player. Basically, it looks like

Chance of single | Chance of double | Chance of triple | Chance of HR | chance of walk | chance of out

Obviously these chances have to add up to 1.

For example, for Ryan Theriot's PECOTA projection we have
Counting stats:

PA   1B   2B  3B  HR  BB
647  132  29  3   5   63

Rate stats:
1B      2B      3B      HR      BB
0.2040  0.0448  0.0046  0.0077  0.0974


Once we have these rates, we use these to define bins to determine what happens in each PA. Thus, continuing with this example, we would have
1B      2B      3B      HR      BB
0.2040  0.2488  0.2535  0.2612  0.3586

In each PA, we generate a random number between 0 and 1. If it is less than the first number, it is a single. If not, we check whether it is less than the next number, which gives a double, and so on down the line. If it is bigger than the last number, the player makes an out.
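Here is a minimal sketch of the binning step in Python. It is not the 0.1 code itself, and the names `cumulative_bins` and `simulate_pa` are made up for illustration; the numbers are the Theriot counting stats from above.

```python
import bisect
import random
from itertools import accumulate

OUTCOMES = ["1B", "2B", "3B", "HR", "BB"]

def cumulative_bins(pa, counts):
    """Turn counting stats (1B, 2B, 3B, HR, BB) into cumulative bin edges."""
    rates = [c / pa for c in counts]
    return list(accumulate(rates))

def simulate_pa(bins, rng=random):
    """Draw one uniform number and return the outcome whose bin it lands in."""
    r = rng.random()
    # Index of the first bin edge that is strictly greater than r.
    i = bisect.bisect_right(bins, r)
    return OUTCOMES[i] if i < len(OUTCOMES) else "OUT"

theriot_bins = cumulative_bins(647, [132, 29, 3, 5, 63])
print([round(b, 4) for b in theriot_bins])
print(simulate_pa(theriot_bins))
```

Anything that falls past the last edge (about 64% of the time for Theriot) is an out, which is why outs never need their own bin.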

Now, we simply do this moving through the lineup, keeping track of outs and the number of innings. As this is a first-cut lineup simulator, I'm making the unrealistic (though perhaps not for the Cubs...) assumption that all baserunning is done station-to-station. Thus if someone hits a single, everyone moves up only one base. There are no productive outs. Obviously the next step is to add in a few baserunning possibilities since many players score from 2b on singles.

This is where the "Markov" part comes in. In a Markov process, what happens next depends only on the current state (here the outs, the runners on base, and who is at bat), not on anything that happened earlier. Thus each batter in this simulator bats the same whether the generic pitcher is throwing a no-hitter or has just given up 5 straight HRs. Another, more glaring consequence of the way this simulator is set up is that there is no situational hitting or baserunning (i.e. sacrifices, hit and runs, stolen bases, sacrifice flies, etc.). We should keep this in mind when we see the results of the simulator.
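To make the bookkeeping concrete, here is a minimal Python sketch of one team's game under the station-to-station rules (again, not the 0.1 MATLAB code). The names `advance` and `simulate_game` are made up for illustration, and two things are assumptions of the sketch rather than anything spelled out above: on a walk only forced runners move, and every game is exactly nine innings.

```python
import bisect
import random

OUTCOMES = ["1B", "2B", "3B", "HR", "BB"]
HIT_BASES = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}

def advance(runners, outcome):
    """Move runners station-to-station.

    runners is a set of occupied bases, e.g. {1, 3}.
    Returns (new_runners, runs_scored).
    """
    if outcome == "BB":
        # Assumption: on a walk only forced runners move up.
        base = 1
        while base in runners:
            base += 1              # first empty base; 4 means bases loaded
        if base == 4:
            return set(runners), 1
        return runners | {base}, 0
    n = HIT_BASES[outcome]
    moved = [b + n for b in runners] + [n]   # the batter ends up on base n
    runs = sum(1 for b in moved if b >= 4)
    return {b for b in moved if b < 4}, runs

def simulate_game(lineup_bins, innings=9, rng=random):
    """Runs scored by one lineup; lineup_bins holds one list of cumulative
    bin edges (1B, 2B, 3B, HR, BB) per batter, in batting order."""
    total_runs, batter = 0, 0
    for _ in range(innings):
        outs, runners = 0, set()
        while outs < 3:
            bins = lineup_bins[batter]
            i = bisect.bisect_right(bins, rng.random())
            batter = (batter + 1) % len(lineup_bins)
            if i == len(OUTCOMES):
                outs += 1          # no productive outs: runners hold
                continue
            runners, runs = advance(runners, OUTCOMES[i])
            total_runs += runs
    return total_runs

# Nine copies of the Theriot bins from the example above, just to have a lineup.
theriot = [0.2040, 0.2488, 0.2535, 0.2612, 0.3586]
print(simulate_game([theriot] * 9, rng=random.Random(0)))
```

The whole state of the game at any moment is just the outs, the set of occupied bases, and the index of the next batter. Nothing about earlier plate appearances is stored anywhere.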

Some results
In a million simulations, the Cubs lineup of
  • Theriot
  • Fukudome
  • Lee
  • Ramirez
  • Byrd
  • Soriano
  • Soto
  • Baker
  • Zambrano (using career numbers, since he doesn't have a hitting projection)

scores an average of 3.2862 runs per game.

The Cardinals lineup of
  • Schumaker
  • Rasmus
  • God
  • Holliday
  • Ludwick
  • Molina
  • Freese
  • Wainwright
  • Ryan

scores an average of 3.3504 runs per game over a million sims.

I'm not sure that even a million simulated games are enough to really get a good number. The Central Limit Theorem is nice in theory but can get a bit frustrating in practice. Running 100,000 games takes roughly a minute.
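One way to put a rough number on this: the standard error of the average over N independent simulated games is the standard deviation of single-game runs divided by the square root of N. The sketch below is in Python, and the single-game standard deviation of 3 runs is a made-up placeholder, not something measured from the simulator.

```python
import math

def standard_error(per_game_sd, n_games):
    """Standard error of mean runs per game over n_games independent sims."""
    return per_game_sd / math.sqrt(n_games)

ASSUMED_SD = 3.0  # hypothetical single-game standard deviation, in runs

for n in (10_000, 100_000, 1_000_000):
    print(n, round(standard_error(ASSUMED_SD, n), 4))
```

If the real spread is anywhere near that placeholder, a million games would pin the average down to a few thousandths of a run, but the actual figure has to come from the simulated games themselves.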

Up next:
  • Obviously, these runs/game numbers are pretty low. Changing the baserunning should make a big difference, especially for the scoring from second stuff mentioned above. I'll try to find some average advancement numbers to factor this into the code.
  • It would be nice to add a lineup optimizer option that takes 9 players and finds the optimal batting order. It would be pretty expensive to brute force, though, since there are 9! = 362880 possible orderings of 9 players, so we would either have to do some sort of mixed-integer discrete optimization (ugly and expensive) or pretty much just plug in some guesses.
  • Figure out how much noise there is in the runs per game depending on the number of simulations. I should be able to find a standard deviation, at least (empirically or not).
  • Set it up to draw players from some sort of database or other external file. Having to enter the numbers in by hand is annoying.
  • Write this in some sort of cross-platform language. I think Perl is probably the best bet, though I don't know a ton about running Perl scripts on Windows machines.
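For a sense of how expensive the brute-force optimizer in the list above would be, here is a back-of-the-envelope sketch in Python. The helper name `brute_force_minutes` is made up, the speed is the rough 100,000 games per minute quoted earlier, and a million games per batting order is an assumption borrowed from the runs above.

```python
import math
from itertools import islice, permutations

def brute_force_minutes(n_players, games_per_lineup, games_per_minute):
    """Minutes needed to simulate every batting order of n_players."""
    return math.factorial(n_players) * games_per_lineup / games_per_minute

players = ["Theriot", "Fukudome", "Lee", "Ramirez", "Byrd",
           "Soriano", "Soto", "Baker", "Zambrano"]

# permutations() yields each batting order lazily; show the first two.
for order in islice(permutations(players), 2):
    print(order)

minutes = brute_force_minutes(9, 1_000_000, 100_000)
years = minutes / (60 * 24 * 365)
print(math.factorial(9), "orders,", round(years, 1), "years")
```

Under those assumptions the full search takes years rather than hours, so any practical optimizer would need far fewer games per lineup, a much faster simulator, or a smarter search.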


If you want to download and play with my 0.1 code (and can run MATLAB), you can find it at this link.

(EDIT: you should be able to run this code in Octave, an open-source program that is largely compatible with MATLAB and can be found at http://www.gnu.org/software/octave/index.html )

February 22, 2010

Midwinter Grilling

While I was jaunting around southern Wisconsin last weekend, visiting a dog we might adopt, I came across a local meat shop near Darien and popped in to see what they had. I picked up some decent sausage (which became meatballs at dinner that night), but the main thing that caught my eye was their ribs. It was 'warm' that weekend so I was hoping to have an excuse to grill.



I cobbled together a couple of rub recipes and used salt, pepper, paprika, brown sugar, and garlic powder as my rub. The recipe I had called for more or less steaming the ribs in a covered pan with water in the oven for 2 hours before the grill, which I was kind of skeptical of, but the result was just fine. One of the nice things about grilling in winter here is that the snow catches the ambient light so you can see what you're doing a little better when grilling after dark.



Final result:


I also made a loaf of fellow grad student Jason Murcko's excellent challah bread.





I love making braided breads - they look so cool when you take them out of the oven.

The ribs were pure win - this was my first time cooking anything like them. Maybe in the summer sometime (if I'm here...) I'll experiment with full-on smoking and the like in the grill. I'm looking forward to doing these again.

February 12, 2010

Add it to the pile

I haven't made a new post about books in about 3 months, since I reviewed the most recent Wheel of Tedium book. I surprised myself with how much I enjoyed it, and it really turned around my opinion of one of the characters, whom I had previously lumped in with some of the shrill harpies that populate this series. Anyway, it was enough to get me to reread them, which meant I had to go out and re-buy them as I had sold my original copies off after the author passed away a few years ago (the series has been passed on to another author, with Jordan's blessing). Our B+N had the first 6 books, so I bought them and I'm most of the way through book 5.

While out on our quasi-Valentine's date today, we stopped by the B+N on the other side of town, which had the rest of them, and I grabbed them all...but after about 5 minutes I changed my mind and put them back. I've got a big backlog of books that I've been meaning to read, and I did resolve to make a little more effort to read new stuff rather than re-reading things that I've read many times before (but very much enjoy). In the past 3 months I've read those 5 Jordan books and re-read the excellent Tawny Man trilogy (the last book of which is my sentimental favorite for best fantasy novel). But here's what I have piled up to read:

  • The rest of Bryson's A Short History of Nearly Everything
  • The rest of The Book: Playing the Percentages in Baseball
  • The rest of Simmons's The Book of Basketball
  • Bryson's I'm a Stranger Here Myself
  • Bryson's Notes From a Small Island
  • Bryson's The Life and Times of The Thunderbolt Kid (You can tell I'm a big Bryson fan)
  • Klaw 101 #17, Jonathan Strange and Mr. Norrell
  • Klaw 101 #15, The Wind-Up Bird Chronicle
  • Klaw 101 #42, Brideshead Revisited
  • Marvin Miller's memoir A Whole Different Ball Game
  • Joe Posnanski's excellent The Soul Of Baseball: A Road Trip Through Buck O'Neil's America
  • Posnanski's The Machine
  • The Neyer/James Guide to Pitchers
  • The new BP Annual
  • The Hardball Times Annual
  • The 2010 Cubs 'Anal', with 2 articles by Shawn

I picked up Wind-Up Bird Chronicle after putting the WoT books back. A skim of the back cover makes it seem like it's pretty cool. Not that I agree with Klaw on his rankings in general - I was not a fan at all of his #1, The Master and Margarita, and my wife (who I view as a good authority on this) thinks that Emma, his #3, isn't nearly as good as the other two Austen books (P+P and Persuasion) on the list and isn't as good as one (Sense and Sensibility) that didn't make the cut. A curious selection. He also has Jane Eyre and Wuthering Heights in there, both of which I despised in high school. He does score huge points for including Lonesome Dove though (at 98).

Anyway, I have my work cut out for me. I'm going to finish reading through WoT #6 and then get back to work on these things...

February 08, 2010

Congrats to the Saints

Not only did they win their first Super Bowl in franchise history, but they did so by beating future Hall of Fame QBs (Warner, Favre, Manning) in three straight playoff games. After the Saints scored the TD and 2-pt conversion to put them up by 7, I told my wife 'if I had money to gamble with, I would bet it all on Manning leading a drive for a TD here'. The lesson is, as always, NEVER GAMBLE (dying laughing).

Commercials were pretty lousy again this year - the only ones I really remember were the Betty White/Abe Vigoda one and the Letterman one with Leno and Oprah. My expectations have lowered so much that it wasn't a big deal though.

February 05, 2010

Rumblings about FIP

Today we had a discussion over at ACB about what should be defined as replacement level FIP. MB had just done Jeff Gray's projection and it came out to 4.32 FIP, which he claimed to be right around replacement level. This set off all sorts of alarm bells for me. I still wonder if the replacement level FIP for relievers is set too low. Let's take a look at the actual numbers and arguments and see if they disagree with me.

Here's the main argument. Starting is harder than relieving. This I agree with. The replacement level FIP for an NL starter is 5.35, which seems right to me (since FIP is more or less a proxy for ERA anyway). A quick rule of thumb for converting between the two is that a pitcher's FIP as a starter should be about 1.25 times what it would be as a reliever. This is okay with me too. But given that, a replacement level reliever's FIP should be 5.35/1.25 = 4.28! That seems way too low to me - in my head that's what an average reliever, not a replacement level one, should do. How do the numbers bear this out?

Running the numbers, the average FIP in the National League was 4.1, which is indeed lower than 4.28. This gap doesn't seem as big as, say, the gap between an average batter's wOBA and a replacement level bat's, but I digress.

Where did that replacement level number come from anyway? In a series of posts on Fangraphs, Dave Cameron broke down how replacement level is defined. Basically, it was estimated that with a replacement level starter and average everything else, a team wins 38% of the time, and with a replacement level bullpen and average everything else, a team wins 47% of the time. Next, we look at the number of runs allowed in our league in a year (Dave looks at 2008, since he wrote it after the 2008 season). For example, in the 2008 American League the average FIP based on this should be 4.40, which is what an average (.500) pitcher should generate. Now, what about replacement level? Cameron cryptically says 'running the numbers through the formula gives us a 4.68 FIP'. Here's my dumb hick guess as to what he did:

I'm assuming a linear approximation here, with win% as the variable x. I'm assuming that a pitcher that has a FIP of 0 has a 100% win percentage, and a pitcher that gives up 4.4 runs has a 50% win percentage. Since we have 2 data points, we have the linear approximation:

FIP = 4.4(x-1)/-0.5 = 8.8(1-x)

Thus a 47% reliever and a 38% starter will have

FIP_relief = 8.8(0.53) = 4.66
FIP_start = 8.8(0.62) = 5.46

Which are different than what Cameron found (4.68 and 5.63, respectively). Maybe I'm doing things wrong. For one, a 0 FIP pitcher isn't going to have 100% winning percentage since he still gives up hits and walks and stuff (though he does strike a ton of guys out). But since our league average FIP is being scaled to run scoring, we'll have to do it. Maybe there's some other data point that he didn't mention that he's using for the other component of the linear approximation (or he's not using a linear model at all).
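The guess above is easy to check with a few lines of Python. The function name `fip_from_win_pct` is made up for illustration, and the straight line through (100% wins, 0 FIP) and (50% wins, 4.40 FIP) is only my guess at the method, not anything Cameron published.

```python
def fip_from_win_pct(win_pct, league_fip=4.40):
    """Line through (win% = 1, FIP = 0) and (win% = 0.5, FIP = league_fip)."""
    slope = (league_fip - 0.0) / (0.5 - 1.0)   # -8.8 for a 4.40 league
    return slope * (win_pct - 1.0)

for label, win_pct in [("reliever", 0.47), ("starter", 0.38)]:
    print(label, round(fip_from_win_pct(win_pct), 2))
```

This gives the same relief and starter numbers as the hand calculation to within rounding, so the gap to Cameron's 4.68 and 5.63 comes from the model rather than from an arithmetic slip.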

At least I learned a thing or two.

February 02, 2010

AL West preview: Seattle Mariners - UPDATED

(See below for update. Basically I changed how I estimated playing time)

Quite possibly the most intriguing division this year will be the AL West. The Angels have long dominated the division, but the Mariners made a lot of upgrades in the offseason, managing to acquire Cliff Lee and Milton Bradley for almost nothing; the Texas Rangers have a surge of young talent reaching the majors; and it's tough to count out the A's.

I'm going to take a look at these teams a little more closely, busting out BtB's WAR calculator from last year to examine things.

First, here's a rundown of what I'm doing: for the players' wOBAs and FIPs, I'm taking the average of 5 projection systems that largely have data on all of the teams: CHONE, Bill James, PECOTA, Marcel, and the Fans projections at Fangraphs. For playing time, I'm just using the playing time estimates over at BP's depth charts. I should probably change this to the fan projections' playing times, but we don't have full data on enough players to do that. The defensive numbers come from Jeff Zimmerman's UZR projections on Beyond the Boxscore, and I ignored baserunning, because I'm lazy and it doesn't have a huge impact anyway.

So without further ado, here are the Mariners. I apologize for the strange formatting - Blogger's crappy rich text editor is converting all the newlines in the table HTML into line breaks. Super annoying.

Hitter Pos PA wOBA Hit Pos Fld Rep WAR FA $ WAR
Josh Bard CA 324 .299 -2.16 1.25 0 2.50 1.59 $3.7 0.7
Rob Johnson CA 312 .292 -2.62 1.25 0 2.50 1.13 $2.7 0.5
Casey Kotchman 1B 496 .337 0.12 -1.25 0.50 2.50 1.87 $6.4 1.3
Ryan Garko 1B 211 .343 0.23 -1.25 -0.30 2.50 1.48 $2.3 0.4
Jose Lopez 2B 641 .330 -0.33 0.25 -0.10 2.50 2.32 $10.0 2.1
Jack Hannahan 2B 72 .309 -1.57 0.25 0.20 2.50 1.38 $1.0 0.1
Jack Wilson SS 513 .303 -1.94 0.75 0.60 2.50 1.91 $6.7 1.4
Jack Hannahan SS 136 .309 -1.57 0.75 0.10 2.50 1.78 $2.0 0.3
Chone Figgins 3B 667 .341 0.38 0.25 0.60 2.50 3.73 $16.4 3.6
Jack Hannahan 3B 80 .309 -1.57 0.25 0.90 2.50 2.08 $1.5 0.2
Michael Saunders LF 423 .318 -1.02 -0.75 0.50 2.50 1.23 $3.7 0.7
Milton Bradley LF 125 .369 2.09 -0.75 -0.40 2.50 3.44 $3.2 0.6
Junior LF 68 .326 -0.57 -0.75 -0.30 2.50 0.88 $0.8 0.1
Franklin Gutierrez CF 660 .337 0.10 0.25 1.60 2.50 4.45 $19.3 4.2
Langerhans CF 70 .319 -0.95 0.25 -0.20 2.50 1.60 $1.1 0.2
Suzuki Ichiro RF 710 .349 0.85 -0.75 0.70 2.50 3.30 $15.4 3.3
Langerhans RF 78 .319 -0.95 -0.75 0.20 2.50 1.00 $0.9 0.1
Milton Bradley DH 422 .369 2.09 -2.00 0 2.50 2.59 $7.4 1.6
Junior DH 148 .326 -0.57 -2.00 0 2.50 -0.07 $0.3 0.0
Garko DH 88 .343 0.49 -2.00 0 2.50 0.99 $1.0 0.1
Team 6398 .331 -0.25 0.00 -0.24 0.43 2.44 $100.7 22.3


Pitcher S/R IP FIP LEV FA $ WAR
Felix Hernandez S 220 3.45 1.0 $25.6 5.6
Cliff Lee S 219 3.55 1.0 $24.2 5.3
Ian Snell S 163 4.54 1.0 $9.0 1.9
Ryan Rowland-Smith S 164 4.33 1.0 $10.8 2.3
Erik Bedard S 84 3.76 1.0 $8.4 1.8
Doug Fister S 93 4.6 1.0 $5.0 1.0
Aardsma R 60 3.83 1.8 $5.4 1.1
Lowe R 60 4.05 1.3 $3.0 0.6
White R 60 4.61 1.0 $0.6 0.1
League R 60 4.30 0.9 $1.5 0.2
Kelley R 55 4.18 0.8 $1.6 0.3
Vargas R 60 4.78 0.7 $0.2 0.0
Olson R 57 5.08 0.6 -$0.2 -0.1
Starters 802 4.01 $66.6 14.7
Relievers 412 4.40 $9.8 2.1
Total 1214 4.14 $76.0 16.8


Group WAA WAR
Hit -2.3
BR 0.0
Field 3.7
Hitters 21.7
Pitchers 20.0
Total WAR 41.7
Total FA $ $188.6
Win Talent 85.2


So in summary - the Mariners are an incredibly good fielding team (led by Franklin Gutierrez, of course), but they're not very good hitters, and their rotation behind twin aces Hernandez and Lee is not very good.


UPDATE: I decided to change a few things with regard to playing time, especially the pitching. Garko and Bedard have signed with the team since I wrote this, so I included them too. I used the fans' playing time estimates from Fangraphs instead. Tango has shown that the fans' estimates tend to be too high, so you should temper these a little bit. But for the most part they were pretty close to what PECOTA had. I only did this for the regular position players, and divvied up bench ABs based on PECOTA and my own adjustments. For the 5th pitcher slot, I'm assuming that Bedard will share it with someone else. Since Fister seems to be the leading candidate to get these starts over all the others on the Mariners depth chart (Vargas, Olson, French), I went with him. Anything they get from Bedard should be pretty valuable.

This new WAR estimate is 2 wins and change higher than the old one, and the main difference is pitching. Rowland-Smith was not predicted to throw as many innings in PECOTA, and replacing Yusmeiro Petit (who no one but PECOTA has as the 5th starter) with Bedard/Fister was a big upgrade too.