  • Predicting values while setting specific coefficients to zero

    Hello,

    I have a complex linear cost model of health care utilisation, with around 59 million observations and around 1,000 variables. I need to predict costs from the regression model; however, I need to set some of the coefficients to zero. This is because some of the coefficients on the ethnicity dummies are negative, implying negative cost, and expert advice is that these relate to unmet need rather than being a true reflection of cost - certain ethnic groups are not accessing health care. I am unsure how to predict while setting some of the coefficients to zero. Has anybody come across this before? I know this may not be the ideal response to the situation - but it is something I have been asked to explore.

  • #2
    I am not sure I understood that correctly - perhaps it would be better to adjust your regression model or to fit separate models for subgroups? In any case, after you run a regression you can use margins with the at() option for specific predictions. For example:
    Code:
    reg ...
    margins, at(age=30 female=0 migrant=1 income=2500)
    This creates a prediction for a specific "case" you have in mind, with the covariate values set as desired.
    Best wishes

    (Stata 16.1 MP)

    • #3
      Thank you, but I don't need to set the variables to a particular value; I need to predict as if the coefficients on some of the ethnicity dummies were 0, so that I am not reducing costs for these observations.

      • #4
        I assume, then, that you want something like predict, which gives you an individual prediction for each case in the dataset, right? If so, you might have to generate this manually. For example, run the regression and then:

        Code:
        gen pred = _b[var1] * var1 + _b[var2] * var2 + ... + _b[_cons]
        Then you can manually adjust the desired coefficient: instead of using _b[var], insert the adjusted value (or simply drop the term if it is to be treated as 0).
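        With around 1,000 regressors, typing that line out by hand is error-prone, so looping over the stored coefficients may be safer. A minimal sketch, assuming the regressors are plain (non-factor) variables and using eth2 and eth3 as placeholder names for the dummies whose coefficients should be treated as zero:
        Code:
        * build the linear prediction term by term, skipping the dummies
        * whose coefficients are to be treated as zero (names are placeholders)
        local skip "eth2 eth3"
        matrix b = e(b)
        local allnames : colnames b
        gen double pred = _b[_cons]
        foreach v of local allnames {
            if "`v'" == "_cons" continue
            if strpos(" `skip' ", " `v' ") continue    // treat this coefficient as 0
            replace pred = pred + _b[`v'] * `v'
        }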
        Best wishes

        (Stata 16.1 MP)

        • #5
          Originally posted by Daniel Sutcliffe:
          This is because some of the coefficients on the ethnicity dummies are negative, implying negative cost, and expert advice is that these relate to unmet need rather than being a true reflection of cost - certain ethnic groups are not accessing health care. I am unsure how to predict while setting some of the coefficients to zero.
          So, let's say the white group has the highest healthcare cost. If I set it as the reference group, would I then magically conclude that all the other racial/ethnic groups have "unmet" healthcare need, since their indicators would all be negative? And how about this: if I pick the group with the lowest mean, say American Indian & Alaska Native, as the reference group, I can once more magically turn the healthcare need "excessive" in all the other racial/ethnic groups, as their indicators would now all be positive. The +/- sign of a binary indicator is relative to the reference group, not to an optimal healthcare budget. I am really not sure how to interpret that expert advice.
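          As a concrete illustration, here is a small sketch with a hypothetical categorical variable ethnicity: refitting the same model with a different base level changes the dummy coefficients (and their signs) but leaves every fitted value unchanged.
          Code:
          * same model, different base category: the dummy coefficients change,
          * the fitted values do not (variable names here are hypothetical)
          reg cost i.ethnicity age
          predict double fit_base1
          reg cost ib5.ethnicity age
          predict double fit_base5
          assert reldif(fit_base1, fit_base5) < 1e-6 if !missing(fit_base1, fit_base5)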
          Last edited by Ken Chui; 28 Apr 2021, 07:09.

          • #6
            Thank you, that might work. Can I check: is there a way to find out how Stata has stored the names of the betas for the variables entered into the model? I have over 1,000 variables, so doing this might take time and there is a risk of human error; it would be a good QA check to see how Stata stores the betas.

            • #7
              Thanks Ken, the base category was chosen to allow us to discern this expected result, based on the evidence base supporting the modelling.

              • #8
                Originally posted by Daniel Sutcliffe:
                Thanks Ken, the base category was chosen to allow us to discern this expected result, based on the evidence base supporting the modelling.
                Ah, thanks for the clarification.

                Can I check: is there a way to find out how Stata has stored the names of the betas for the variables entered into the model?
                Yes, set up the regression model as usual, and add this option:
                Code:
                reg y x1 x2 x3 x4 x1000, coeflegend
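                If you also want that list programmatically, for example to loop over it as a QA check, one possible sketch using the stored coefficient vector e(b):
                Code:
                matrix b = e(b)                  // row vector of stored coefficients
                local coefnames : colnames b     // their names, in model order
                foreach name of local coefnames {
                    display "`name'"
                }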

                • #9
                  Daniel, here is what I would do. First, create an extra variable to hold the original values:
                  Code:
                  gen Origx1 = x1
                  Then estimate the regression with the original values (x1):
                  Code:
                  reg depvar x1
                  After the reg estimation command, but before the predict command, replace the x1 values with 0s:
                  Code:
                  replace x1 = 0
                  Then predict:
                  Code:
                  predict predicteddepvar
                  If you need to produce further estimates, you can restore x1 to its original values:
                  Code:
                  replace x1 = Origx1

                  This gives you the predictions as if the x1 values were 0.
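
                  The same idea extends to several dummies at once; a sketch, with eth2 and eth3 as placeholder names for the dummies to zero out:
                  Code:
                  * after the regression: zero out the (placeholder) dummies, predict, restore
                  foreach v of varlist eth2 eth3 {
                      gen Orig`v' = `v'
                      replace `v' = 0
                  }
                  predict double pred_no_eth
                  foreach v of varlist eth2 eth3 {
                      replace `v' = Orig`v'
                      drop Orig`v'
                  }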

                  • #10
                    Setting those coefficients to zero strikes me as completely invalid statistically. It is trying to constrain the model to exclude those variables--but then you have to actually rerun the model excluding those variables, because some of the other coefficients have been estimated from a model that includes those variables. You can't just mix and match coefficients from different models. All coefficients in a model are conditional on the complete list of variables that have been included in the model. If I were reviewing a paper that did just set coefficients to zero like that, I wouldn't even bother reading the rest of the paper: I'd send it back to the editor with a recommendation to reject, in upper case bold face letters.

                    If you want to constrain those coefficients to zero, then do that: rerun the model excluding those variables and use the new coefficients you get from that. You'll have to justify that choice of model, but at least it's a statistically valid approach to the problem. Alternatively, go Bayesian and set the prior on those coefficients to some distribution whose support is entirely non-negative. Again, you'll have to justify that choice, but at least the method has statistical integrity.
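
                    For completeness, one way to implement the "constrain and re-estimate" route in Stata is through constraint and cnsreg; a minimal sketch, with eth2, eth3, x1 and x2 as placeholder variable names:
                    Code:
                    * constrain the (placeholder) ethnicity dummy coefficients to zero, then
                    * re-estimate the whole model subject to those constraints and predict
                    constraint 1 eth2 = 0
                    constraint 2 eth3 = 0
                    cnsreg cost eth2 eth3 x1 x2, constraints(1 2)
                    predict double pred_constrained
                    Numerically this is the same as dropping eth2 and eth3 and rerunning reg, but it keeps the restriction explicit in the estimation output.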
                    Last edited by Clyde Schechter; 28 Apr 2021, 18:06.

                    • #11
                      I fully agree with Clyde Schechter as to the concerns he raises. My response is intended to give a solution to the original question, but it should in no way be construed as suggesting that doing so is a valid approach. That said, such an approach might have some value in coming up with predictions for hypothetical cases that are not in the estimation sample.
