Too many zeros in an independent variable

Andrew Steele

Join Date: Feb 2017

Posts: 6
#1

21 Apr 2018, 23:45

Hello everyone,

Here is the background of my project:

I ran an OLS robust regression on National Basketball Association player data. The dependent variable is a statistic called Offensive RPM. The independent variables are field goals made and field goals attempted in each shooting zone, plus total minutes played.

The regression as a whole is statistically significant, as is each independent variable. Broadly interpreted, the results match what's already known: making more shots on fewer attempts increases a player's offensive RPM, and taking more shots with fewer makes decreases it. In other words, efficient basketball is good.

However, the coefficient for corner three-point shot attempts is almost the same as the one for midrange shot attempts. The coefficients I'm about to list are scaled up by a factor of 100 to make typing and reading easier: -1.25 (midrange attempts) and -1.17 (corner three attempts).

This goes against the accepted beliefs of the NBA statistics community, which hold that midrange shots have less value than three-point shots, especially those from the corner. Three points is greater than two points, and the corner three is the shortest three-point shot on the court. I see two possible explanations:

1. Attempting corner three-point shots is not as beneficial as the norm suggests.

2. So many NBA players never attempt corner three-point shots that the sample contains a lot of 0s, which is forcing/pulling the two variables outward and making the risk-reward ratio look much larger than it really is.

When it comes to adding weights or otherwise controlling for these potential issues, I get in a bit over my head. The zero values aren't an issue of "missing data"; rather, many players simply don't take that many shots.

In all, what I'm asking is whether I should do something to address these potential issues and, if so, what do I do in Stata to adjust my robust OLS regression?
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

22 Apr 2018, 01:56

I suspect that one facet here is a misunderstanding of "robust" in this context. If by OLS robust regression you mean using regress, robust, then that is not robust regression in the sense that you gain robustness to outliers or problematic tails in the data. The regression coefficients (parameter estimates) are exactly the same as without the option; the option changes only the standard errors.

Otherwise I do not know enough about basketball even to understand the difficulty.
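A minimal sketch of the point about robust standard errors, written in Python rather than Stata so that it is self-contained. The data are made up, the names corner_att, orpm and slope_with_two_ses are hypothetical, and the robust standard error uses the HC1 scaling n/(n-k), assumed here to correspond to what regress, robust reports. The slope is computed once; only the standard error formula differs.

```python
import math


def slope_with_two_ses(x, y):
    """Fit y = a + b*x by OLS; return b with classical and HC1 robust SEs."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # The point estimate does not depend on which SE is requested.
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical SE: assumes one common error variance.
    s2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(s2 / sxx)

    # HC1 sandwich SE: weights each squared residual by its own leverage term.
    meat = sum(((xi - xbar) ** 2) * e * e for xi, e in zip(x, resid))
    se_robust = math.sqrt(n / (n - 2) * meat) / sxx
    return slope, se_classical, se_robust


# Hypothetical data: a predictor with many zeros and one large value.
corner_att = [0, 0, 0, 0, 1, 2, 5, 12]
orpm = [1.0, -0.5, 0.2, 0.4, 0.1, -0.3, -1.0, -6.0]

slope, se_classical, se_robust = slope_with_two_ses(corner_att, orpm)
print("slope:", slope)
print("classical SE:", se_classical)
print("robust SE:", se_robust)
```

The player with 12 attempts pulls the slope just as hard under either standard error, which is why the option offers no protection against influential points.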
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

22 Apr 2018, 08:27

Let me just throw this one up. Perhaps the reason the value of 3-point corner attempts is the same as that of midrange 2-point attempts is that the yield on each 3-point corner attempt, despite its being the shortest 3-point shot, is less than that of each 2-point midrange attempt.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

22 Apr 2018, 08:28

Well, I do know something about basketball, but had to look up RPM. It turns out it's a measure of how a player affected overall team performance (plus/minus team points versus opponent) while he was playing. My guess here would be that you would need an interaction term involving corner shots to reflect that for players in certain positions (a 4 or a 5 player), the effect of corner shots is likely negligible, since they're almost never in a position to take them, whereas corner shots are likely meaningful for, say, a 1 or a 2 player. Failing to recognize this would be like analyzing the effect of a goalkeeper's play in (non-U.S.) football without recognizing that (e.g.) his number of shots on goal does not predict his impact on the team. I'm sure Nick could give a better-described analogy here. <grin>
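The interaction idea can be sketched as follows, assuming a single guard/big indicator rather than all five positions. This is an illustration in Python with made-up numbers, not the actual model: the names corner_att, is_guard and orpm and both helper functions are hypothetical. In Stata, factor-variable notation along the lines of c.corner_att##i.is_guard should express the same specification.

```python
def solve_linear(a, b):
    """Solve the square system a * coef = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - s) / m[i][i]
    return coef


def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


# Hypothetical players: five bigs (mostly zero attempts), then five guards.
corner_att = [0, 0, 0, 1, 2, 10, 20, 30, 40, 50]
is_guard = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
orpm = [0.5, -0.2, 0.1, 0.0, 0.3, 0.4, 0.9, 1.1, 1.8, 2.0]

# Columns: constant, attempts, guard dummy, attempts x guard interaction.
rows = [[1.0, c, g, c * g] for c, g in zip(corner_att, is_guard)]
b0, b1, b2, b3 = ols(rows, orpm)

slope_bigs = b1          # effect of one more attempt for a big
slope_guards = b1 + b3   # effect of one more attempt for a guard
print("slope for bigs:", slope_bigs)
print("slope for guards:", slope_guards)
```

Without the interaction column, the model would force one common slope on both groups, so the many zeros among the bigs and the large counts among the guards would be blended into a single coefficient.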
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

22 Apr 2018, 08:34

I yield to Statalist friends on just about any question involving sport.
Andrew Steele

Join Date: Feb 2017

Posts: 6
#6

22 Apr 2018, 10:47

Mike Lacy, that's a good idea. I could create a dummy variable for the positions (point guard 1, shooting guard 2, etc.) as a control in this regression to help address the fact that many centers and forwards just don't take the shot.

I have a follow-up question that I would like to ask, if you don't mind. Would a zero-inflated model help with this? This type of model is out of my comfort zone as I don't really know it well, but from some research it does sound like an option.

I did try a Poisson regression on the data this morning and the corner three attempts variable became statistically insignificant. I'm not sure if I did the model right given that I primarily know OLS, random effects, and fixed effects. For my basketball analyses, I haven't really needed to go outside them... until now.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#7

23 Apr 2018, 07:11

Zero-inflated models apply to situations in which the *response* variable is a count and also has a large number of zeros. Such models aren't at all relevant when the *explanatory variable* has a large number of zeros. I also can't see the relevance of a Poisson regression for your situation because your response variable, RPM, is not a "count" variable.