BBR Mailbag: More Statistical +/- Tidbits
Posted by Neil Paine on April 22, 2009
(Before reading this, read this.)
OK, this post is in response to a few questions by our readers over the past day or so...
A Simpler Version
There were some inquiries into a simpler equation than the monstrous one I posted the other day, and it's true that not all of the variables I included were significant at the 0.10 level (I'm all but sure this was also the case for Dr. Rosenbaum's model back in 2004). By throwing out some of the less important variables, you can use the following simplified equation without a significant loss in accuracy:
SPM = -10 + (0.55*P40) - (1.38*TSA40) + (0.02*TSA40^2) + (0.44*3PA40) + (0.44*FTA40) - (1.72*TO40) + (2.25*ST40) + BL40 + (0.44*PF40) + (1.54*VI) + (0.1*MPG)
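For anyone who would rather plug numbers into code than a spreadsheet, here is a minimal Python sketch of the simplified equation. The function and argument names are my own (`tpa40` stands for the three-point attempts term), and it assumes TSA40^2 means the square of TSA40; treat it as an illustration rather than an official implementation.

```python
def simplified_spm(p40, tsa40, tpa40, fta40, to40, st40, bl40, pf40, vi, mpg):
    """Simplified SPM, term by term from the equation above.

    Argument names are illustrative; the inputs must be computed the same
    way as in the original post (per-40-minute rates, VI, minutes per game).
    """
    return (
        -10
        + 0.55 * p40
        - 1.38 * tsa40
        + 0.02 * tsa40 ** 2   # assumed reading of TSA40^2
        + 0.44 * tpa40
        + 0.44 * fta40
        - 1.72 * to40
        + 2.25 * st40
        + 1.00 * bl40         # BL40 enters with an implied coefficient of 1
        + 0.44 * pf40
        + 1.54 * vi
        + 0.10 * mpg
    )


# With every input at zero, only the constant term remains.
baseline = simplified_spm(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
```

Calling it with a real stat line is just a matter of passing the eleven inputs in the order shown.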
It appears that although breaking down assists and offensive & defensive rebounds into their own categories is nice, their presence in the versatility index picks up the majority of their value. I'm still going to use the full, more complex version, though, which brings us to...
Age-Squared Model
A poster named "Schtevie" over at APBRmetrics wanted to know if I could include a squared term for player age in the regression, which of course would add a more realistic curve shape to the aging effect (rather than the linear effect that was assumed earlier). And I most certainly can -- in fact, I would have originally, save for the fact that Excel's somewhat lousy -- but easy-to-use -- regression package limits you to 16 variables (and Age^2 would have been the 17th). But because you asked for it specifically, I opened up R and ran the regression this morning, which will now be considered the "official" SPM formula:
Variable | Coefficient |
---|---|
(Intercept) | -14.39411 |
Ag | 0.029944 |
Ag^2 | -0.0002405 |
Ht | 0.0412704 |
P40 | 0.5634116 |
TSA40 | -1.316389 |
TSA40^2 | 0.0184027 |
3PA40 | 0.4829075 |
FTA40 | 0.4293357 |
A40 | 0.1820349 |
OR40 | 0.2917254 |
DR40 | 0.0039279 |
TO40 | -1.717041 |
ST40 | 2.328424 |
BL40 | 0.9022185 |
PF40 | 0.3900931 |
VI | 1.307191 |
MPG | 0.0960708 |
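A table-driven sketch of how these coefficients combine into a single rating, in Python. The dictionary keys copy the labels in the table; the `spm` function, the derived squared terms, and the assumption that inputs use the same units as the original regression are mine, so read it as a hedged illustration and not the exact code behind the spreadsheet.

```python
INTERCEPT = -14.39411

# Coefficients copied from the table above, keyed by the table's labels.
COEFFICIENTS = {
    "Ag": 0.029944,
    "Ag^2": -0.0002405,
    "Ht": 0.0412704,
    "P40": 0.5634116,
    "TSA40": -1.316389,
    "TSA40^2": 0.0184027,
    "3PA40": 0.4829075,
    "FTA40": 0.4293357,
    "A40": 0.1820349,
    "OR40": 0.2917254,
    "DR40": 0.0039279,
    "TO40": -1.717041,
    "ST40": 2.328424,
    "BL40": 0.9022185,
    "PF40": 0.3900931,
    "VI": 1.307191,
    "MPG": 0.0960708,
}


def spm(stats):
    """Return the intercept plus the weighted sum of every term.

    `stats` maps each non-squared label to a number; the two squared
    terms are derived here (an assumption about how they were built).
    """
    terms = dict(stats)
    terms["Ag^2"] = stats["Ag"] ** 2
    terms["TSA40^2"] = stats["TSA40"] ** 2
    return INTERCEPT + sum(coef * terms[name] for name, coef in COEFFICIENTS.items())


# An all-zero input is not a real player; it only shows the constant term.
ZERO_LINE = {name: 0.0 for name in COEFFICIENTS if not name.endswith("^2")}
```

Keeping the coefficients in one dictionary makes it easy to swap in a re-fitted model later without touching the function.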
If you want a spreadsheet of these results for every player-season in NBA & ABA history (or since 1952, at least), you can get it here.
Charges
Another commenter inquired into the possibility of including charges drawn as another variable in the regression, and I replied that I'd like to if I could find the data. 82games.com has tracked charges in the past, but with only one year of full results to draw from, I don't think it would make for a very meaningful regression. So for now, I think we should wait until we see a few more years' worth of charges tracked -- all the while crossing our fingers that the league itself will officially keep tabs on the stat someday.
As always, feel free to ask more questions or tell us your thoughts in the comments below...
April 22nd, 2009 at 2:25 pm
This is great stuff. Any chance at getting standard errors? Also, I wonder if height has more impact on defense. If I recall correctly, Dan Rosenbaum found that being a rookie had an impact independent of age - you might want to insert that in too.
April 22nd, 2009 at 2:50 pm
Is it just me, or does it seem odd that steals are, more or less, an order of magnitude more important than anything else?
The other factors that have an absolute weighting above 1 are: TO40, TSA40, and VI which all seem slightly different than steals.
TO40 is the closest to being the inverse of steals in both significance and weighting. But I suspect that TO40 has more of a necessary correlation with positive stats (scoring and assists) than steals. The opportunity for TOs (having the ball) is necessarily connected to the opportunity for scoring or assisting.
TSA40 is also necessarily correlated with P40, 3PA40, and FTA40. To some extent it can be seen as not an independent term, but a way to distinguish players who score mostly on 2-point attempts from players who score a high percentage of their points on 3-pointers or FT.
Finally, VI is explicitly not a specific stat, but a term introduced to describe tendencies (towards specialization or versatility).
Steals seem closer to rebounding and PF stats as not being necessarily correlated with specific other stats (though steal attempts may represent a loss of DR opportunities, and probably correlate negatively with PF. So the high value may represent diminished opportunities for those positive contributions).
So all of that makes the high coefficient for steals stand out even more.
April 22nd, 2009 at 3:05 pm
If you want offensive fouls drawn instead of simply "charges", my CSV data has the opponent for these fouls. So that's now 3 years of data to work with. :)
April 22nd, 2009 at 3:12 pm
Following up on my earlier comments, I wonder if the coefficient for steals is so high because the variance for steals is so much lower.
I may be misinterpreting the regression, but assuming that I'm not, I'd be curious to see what the coefficients would be for each element if you normalized all of the variances by using z-score as the input rather than raw numbers.
It wouldn't change the content of the regression, of course, it would just put the relative weight of the coefficients in perspective. In fact you might not even have to re-do the regression, you could just divide each coefficient by the standard deviation for that stat, and display that value in a separate column.
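A quick way to check the rescaling idea is a toy regression. The sketch below uses synthetic numbers (not NBA data) and a single predictor, which is a simplification of the real model. It suggests one adjustment to the shortcut: when an input is replaced by its z-score, its coefficient ends up multiplied by that input's standard deviation rather than divided by it.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope of the least-squares line for y on x (single predictor)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


random.seed(0)
# Synthetic stand-in for a low-variance stat such as steals per 40 minutes.
steals = [random.gauss(1.5, 0.6) for _ in range(500)]
# Synthetic rating driven by that stat plus noise.
rating = [2.3 * s + random.gauss(0, 1) for s in steals]

mean_steals = statistics.mean(steals)
sd_steals = statistics.pstdev(steals)
z_steals = [(s - mean_steals) / sd_steals for s in steals]

raw_slope = ols_slope(steals, rating)   # rating points per steal
z_slope = ols_slope(z_steals, rating)   # rating points per standard deviation

# The two slopes differ exactly by the factor sd_steals.
print(raw_slope, z_slope, raw_slope * sd_steals)
```

So a stat with a small spread gets a large raw coefficient but a more modest per-standard-deviation one, which is the perspective being asked for here.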
April 22nd, 2009 at 3:18 pm
I believe steals show up as significant for offense as well as defense. It could be simply that they lead to high % fast break opportunities, but it might be that it's a proxy for an unusual level of quickness, basketball IQ, and hand eye coordination that is helping in all kinds of plays not tracked by traditional stats.
April 22nd, 2009 at 3:22 pm
NickS wrote:
The size of the regression coefficient does not really tell you anything meaningful about the relative importance of each predictor.
April 22nd, 2009 at 3:29 pm
it might be that it’s a proxy for an unusual level of quickness . . .
Sure. If steals are "overvalued" that would mean that they were showing up as a proxy for something else. The question then becomes whether there is some other factor that could be used along with steals to capture that "other quality" more precisely.
Imagine, for example, that you added in a term for the number of times per 40min that a player scored an assist, a basket, or a shooting foul within 5 seconds of an opponent's TO or the team's defensive rebound and used that as a different proxy for speed and opportunism.
If that worked then you would expect to see the coefficient for steals go down, as the new factor captures some of that value.
The question then would be how much accuracy that added -- or whether it turned out that steals alone were a perfectly good measure of those other qualities.
I am just trying to think this through out loud. I think that's all correct.
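The reasoning above can be tried on made-up numbers. This Python sketch invents a hidden "quickness" trait, lets both steals and a hypothetical transition-play stat measure it with noise, and compares the steals coefficient with and without the second stat in the regression. Every variable, weight, and noise level here is an assumption chosen for illustration; nothing is fitted to real data.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def slope_one(x, y):
    """OLS slope for y ~ x (with intercept)."""
    return cov(x, y) / cov(x, x)


def slopes_two(x1, x2, y):
    """OLS slopes for y ~ x1 + x2 (with intercept), via the 2x2 normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


random.seed(1)
n = 5000
quickness = [random.gauss(0, 1) for _ in range(n)]       # hidden trait
steals = [q + random.gauss(0, 1) for q in quickness]      # noisy proxy 1
transition = [q + random.gauss(0, 1) for q in quickness]  # noisy proxy 2 (hypothetical stat)
value = [2 * q + random.gauss(0, 1) for q in quickness]   # what the rating tries to explain

steals_alone = slope_one(steals, value)
steals_with_proxy, transition_coef = slopes_two(steals, transition, value)
print(steals_alone, steals_with_proxy, transition_coef)
```

In this setup the steals coefficient shrinks once the second proxy is included, because the two stats now share credit for the hidden trait. Whether a real "transition" stat would behave this way is an open question.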
April 22nd, 2009 at 3:33 pm
The size of the regression coefficient does not really tell you anything meaningful about the relative importance of each predictor.
"Importance" was probably the wrong word there, but isn't it fair to say that the high regression coefficient for steals implies that SPM places a high "value" on steals? (with "value" being deliberately vague there).
I freely acknowledge that I could be completely misreading the significance of that coefficient, and please correct me if I am.
April 22nd, 2009 at 3:40 pm
NickS wrote:
No, because the regression coefficients are dependent upon the measurement scale of each predictor. For example, steals occur much less often than assists, so it's not that surprising that the coefficient for steals is larger than the coefficient for assists.
April 22nd, 2009 at 3:44 pm
Salary would be a fun variable too.
April 22nd, 2009 at 3:46 pm
No, because the regression coefficients are dependent upon the measurement scale of each predictor.
So I was correct in my second comment?
I know I've been dominating the thread somewhat, but would it be interesting to show a column that had each coefficient divided by the standard deviation for that factor?
April 22nd, 2009 at 6:07 pm
Looks like adding the age^2 term doesn't make much difference unless you're Dikembe Mutombo. The quadratic for age peaks at 62, so the model still seems to say that as you get older, your value over and above your stats gets higher and higher (if only a bit).
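The peak quoted above follows from the two age coefficients in the table: a quadratic a*Age + b*Age^2 with negative b tops out at Age = -a / (2b). A small Python check, with variable names of my own choosing:

```python
AGE_COEF = 0.029944        # "Ag" row of the coefficient table
AGE_SQ_COEF = -0.0002405   # "Ag^2" row of the coefficient table

# Vertex of the parabola AGE_COEF * age + AGE_SQ_COEF * age**2.
peak_age = -AGE_COEF / (2 * AGE_SQ_COEF)


def age_effect(age):
    """Combined contribution of the two age terms to SPM."""
    return AGE_COEF * age + AGE_SQ_COEF * age ** 2


print(round(peak_age, 2))                  # roughly 62.25
print(age_effect(20), age_effect(40))      # still rising across playing ages
```

Since the vertex sits well past any playing age, the age contribution keeps increasing over the whole range the model is actually applied to.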
April 22nd, 2009 at 6:33 pm
You're right Biggles. It's hard to see what the age^2 term is adding to the model.
April 23rd, 2009 at 4:03 pm
Steals aren't the only way that a defensive player can force a turnover. Players can also force turnovers through offensive fouls or by tipping the ball out of bounds (as long as it touches an offensive player before it ends up out of bounds). Even when there is a steal, it is often credited to the player who ends up with the ball rather than the player who causes the turnover. That is why NBA teams often keep track of "deflections" instead of just steals and blocks. I would bet that the number of deflections a player gets correlates strongly with the number of steals they get, so the steals is probably acting as a proxy for deflections and for other defensive contributions that don't show up in the box score.