Need Help for Regression with Both Categorical and Numeric Variables

Sarah Gayle

Join Date: Apr 2016

Posts: 16
#16

06 Jan 2017, 09:05

Month of year should also be categorical. Right now it's coded as the numbers 1-12, but in truth the change from month to month is not incremental, and the "highest" month may be month 7 (July).
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#17

06 Jan 2017, 09:10

Sarah,

Good luck with the remainder of your pregnancy, labor and delivery.

To have Stata treat MC as a categorical variable, you specify it as i.MC in your regression. (See -help fvvarlist- for more on factor variable notation.) If you do that, the regression coefficient table will include a separate entry for each value of MC in the sample (except one, which serves as the base category). The same advice applies to month: use i.month.
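A minimal sketch of the factor-variable notation described above, run on simulated data rather than Sarah's. PaidAmount, MC and month follow the names used in this thread; mileage and all the numbers are hypothetical and only there to make the example run.

```stata
* Toy data only: mileage and every coefficient below are made up
clear
set seed 2017
set obs 2000
generate MC = runiformint(1, 5)
generate month = runiformint(1, 12)
generate mileage = runiform(0, 100)
generate PaidAmount = 20000 + 500*MC - 50*mileage + 300*(month == 7) + rnormal(0, 1000)

* i.MC and i.month enter as sets of indicators; the lowest level is the base
regress PaidAmount i.MC i.month mileage

* ib7.month makes July (month 7) the base category instead
regress PaidAmount i.MC ib7.month mileage
```

Changing the base category only changes which level the other coefficients are compared against; the fitted values are the same in both regressions.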
Nick Cox

Join Date: Mar 2014

Posts: 35698
#18

06 Jan 2017, 10:21

I echo Clyde's best wishes but want to point out that the combined age of the males answering is probably well over 200, so we are sometimes a little grumpy.

I am a strong fan of treating time of year sinusoidally when possible although that's not popular, it seems, with business or economics groups.

You seem to be spending degrees of freedom lavishly and perhaps a large dataset will allow that.
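One way to read the sinusoidal suggestion, sketched on simulated data: replace the eleven month indicators with one sine and one cosine term with a 12-month period. The data-generating line and the variable names sin_month and cos_month are assumptions for illustration, not anything from Sarah's dataset.

```stata
* Toy data with a seasonal peak around July (all numbers hypothetical)
clear
set seed 2017
set obs 2000
generate month = runiformint(1, 12)
generate PaidAmount = 20000 + 800*cos(2*_pi*(month - 7)/12) + rnormal(0, 1000)

* One annual cycle: a sine and a cosine term with period 12
generate sin_month = sin(2*_pi*month/12)
generate cos_month = cos(2*_pi*month/12)

* Two parameters for season ...
regress PaidAmount sin_month cos_month

* ... versus eleven with month as a categorical variable
regress PaidAmount i.month
```

The sine and cosine pair can represent a single smooth annual peak at any month, so it costs fewer degrees of freedom, but it cannot capture an irregular pattern the way i.month can.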
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17709
#19

06 Jan 2017, 11:29

Sarah:
I do share Clyde and Nick's wishes for what lies ahead.
It would be interesting to know which equation Stata gave you back after -regress-.

Kind regards,
Carlo
(Stata 19.0)
Sarah Gayle

Join Date: Apr 2016

Posts: 16
#20

09 Jan 2017, 11:50

Hi all,
Hope you had a good weekend.
I am wondering, also based on what Nick said, if it would be preferable to make multiple formulas. Perhaps a formula for each model (26 models) or each metro. I do not think there is enough data to do both. I've got 28,000 observations.
My largest metros (NYC, LA, CHI, etc.) have a lot more data than the smaller metros (Tampa, Charlotte, Seattle, etc.) so I wonder if this would be much easier if I do top 5 metros and top 10 models... and then reassess when I have more data. The project I am working on generates about 100 new observations a day; so a few months from now it should be greatly improved as far as quality of data. I just want to get a working formula before baby.

Thoughts? thanks!!!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#21

09 Jan 2017, 12:36

Well, this is primarily a content issue, although there are also some statistical considerations.

If you use a single model, with indicators for the 26 metro areas, you are stipulating that all of the other predictors in your model have the same associations with the PaidAmount outcome in every metro area--although the different metro areas may be operating at different overall levels of PaidAmount. You need to figure out whether that is a reasonable way to look at the process you are modeling based on your knowledge of this field (or in consultation with others in your field). This is the primary issue, and it is not a statistical question.

If you decide that it is not reasonable to presume that all of the predictors work the same way in all the MC's, then there is the question of the best way to handle that heterogeneity in your modeling. There are a few options, and statistical considerations would play a role in deciding among them. Separate models for each MC, or perhaps for certain groups of MCs that you expect will be heterogeneous, is one approach. A single model with MC X everything else interaction terms is another. So is a random slopes model.

But first decide whether a single model with just MC indicators, and its assumption that other effects are the same across MC's, is reasonable. If you decide it isn't, then we can go down the other path and explore the different approaches for suitability here.
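For orientation, the three options mentioned above (separate models, interaction terms, random slopes) might look roughly like this in Stata. This is a sketch on simulated data: mileage is a hypothetical predictor, and the metro-specific level and slope values are invented so that the effect of mileage really does differ by MC.

```stata
* Toy data: 26 metros, each with its own price level and mileage effect
clear
set seed 2017
set obs 26
generate MC = _n
generate level = 20000 + rnormal(0, 2000)   // metro-specific price level
generate slope = -50 + rnormal(0, 15)       // metro-specific mileage effect
expand 200
generate mileage = runiform(0, 100)
generate PaidAmount = level + slope*mileage + rnormal(0, 1000)

* Option 1: a separate regression for each metro
bysort MC: regress PaidAmount mileage

* Option 2: one model with MC interacted with the other predictor(s)
regress PaidAmount i.MC##c.mileage

* Option 3: random intercept and random slope for mileage across metros
mixed PaidAmount mileage || MC: mileage, covariance(unstructured)
```

With more predictors, option 2 generalizes to something like i.MC##(i.x1 c.x2), which multiplies the number of parameters by the number of metros; the random slopes model is more economical when some metros have little data.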
Sarah Gayle

Join Date: Apr 2016

Posts: 16
#22

09 Jan 2017, 14:33

Thank you so much Clyde.
The metro areas definitely differ vastly. Put simply, we are talking about auto markets that include exports, so there can be a 50% difference between one metro and another...definitely not something I can neglect.

I am going to look tomorrow morning at how best to group them. I am not sure how to group car models, but I will see if I can do that as well (at the very least, models with very similar measures of central tendency).
I wish I had more data so this didn't have to be done.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#23

09 Jan 2017, 15:25

Put simply, we are talking about auto markets that include exports, so there can be a 50% difference between one metro and another...definitely not something I can neglect.

That's not exactly the issue. If the only difference among the MCs is that they operate at different overall price points, a single model with an i.MC term will be sufficient. The question is whether the relationships between your other predictors and the prices differ among MCs. To put it concretely, suppose one of your predictor variables is foreign vs domestic. Even if prices are higher in NYC than they are in Des Moines overall, if the difference in price between foreign and domestic cars is the same in both markets, then a single model which includes i.MC suffices (at least with respect to the foreign vs domestic aspect). But if the difference in price between foreign and domestic cars is different in different MCs, then a more complicated modeling approach is needed. So think about this with regard to all of your planned regression variables. If any of them has a different effect on price depending on which MC you're looking at, then you need a more complicated approach.

So again, it's not a question of whether the overall (say, average) price point is different among markets; it's whether the effects of the various predictors in your model differ across MC's that matters.
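The data can also be consulted on this question, alongside subject-matter judgment. A minimal sketch, assuming a 0/1 variable named foreign (hypothetical, following the example above) and simulated prices in which the foreign premium is deliberately larger in one metro:

```stata
* Toy data: the foreign premium is 2000 in metros 1 and 2 but 5000 in metro 3
clear
set seed 2017
set obs 3000
generate MC = runiformint(1, 3)
generate foreign = runiformint(0, 1)
generate PaidAmount = 20000 + 3000*MC + 2000*foreign ///
    + 3000*foreign*(MC == 3) + rnormal(0, 1000)

* Main effects plus the MC-by-foreign interaction
regress PaidAmount i.MC##i.foreign

* Joint test that the foreign effect is the same in every metro
testparm i.MC#i.foreign

* Estimated foreign vs domestic difference within each metro
margins MC, dydx(foreign)
```

A small p-value from -testparm- suggests the foreign vs domestic difference varies across metros; the -margins- output shows how large that difference is in each one, which is usually more informative than the test alone.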