A Tricky Regression

Muhammod Towhid Ahmed

Join Date: Mar 2023
Posts: 6

30 Mar 2023, 10:10

Hi Statalist,

Hope everyone is well!

I'm running a regression and have run into a few problems that I cannot seem to resolve, so I was hoping to get some insight!

Context

I'm running a multiple linear regression (MLR) with a continuous dependent variable (values between 0 and 2).
All my independent variables are categorical, and save two (Sex and Mortgage_Number), they all have three or more categories.
I'm also running an interaction between Race and Sex.
Previously, I did use a multinomial logistic regression, but this turned out to be unsuitable for my task.

Problems

My base category is White Male. However, Stata doesn't give me the interaction coefficients and standard errors for Male#Black, Male#Asian, Male#Other, and Female#White. I need to work these out, but can't find a way to either (a) calculate them in Stata or (b) calculate them manually.
Also, since my independent variables are categorical, I don't think I can test whether my regression meets the assumptions of a multiple linear regression in the conventional ways.
Furthermore, I'm not sure how to test the robustness and reliability of my results.
Please could someone help me!

Thank you so much!

Code:

gen agegroup = x74r
replace agegroup = 1 if x74r >= 18 & x74r <= 21
replace agegroup = 2 if x74r >= 22 & x74r <= 29
replace agegroup = 3 if x74r >= 30 & x74r <= 39
replace agegroup = 4 if x74r >= 40 & x74r <= 49
replace agegroup = 5 if x74r >= 50 & x74r <= 59
replace agegroup = 6 if x74r >= 60 & x74r <= 69
replace agegroup = 7 if x74r >= 70 & x74r <= 79
replace agegroup = 8 if x74r >= 80 & x74r <= 99
label define Age_Range 1 "Eighteen to Twenty-One" 2 "Twenty-Two to Twenty-Nine" 3 "Thirty to Thirty-Nine" 4 "Forty to Forty-Nine" 5 "Fifty to Fifty-Nine" 6 "Sixty to Sixty-Nine" 7 "Seventy to Seventy-Nine" 8 "Eighty to Ninety-Nine"
label values agegroup Age_Range
rename agegroup Age
label define sex 1 "Male" 2 "Female"
label values x75r sex
rename x75r Sex
label define education 1 "Some Schooling" 2 "High School" 3 "Technical School" 4 "College" 5 "College Graduate" 6 "Postgraduate Studies"
label values x76r education
rename x76r Education
label define race 1 "White" 2 "Black" 3 "Asian" 4 "Other"
label values x78r race
rename x78r Race
label define Household_Income 1 "Less Than $35,000" 2 "$35,000 to $49,999" 3 "$50,000 to $74,999" 4 "$75,000 to $99,999" 5 "$100,000 to $174,999" 6 "More Than $175,000"
label values x83 Household_Income
rename x83 Household_Income
label define Risk_Attitudes 1 "High" 2 "Above Average" 3 "Average" 4 "Averse"
label values x87 Risk_Attitudes
rename x87 Risk_Attitudes
label define Mortgage_Number 1 "First Mortgage" 2 "Not First Mortgage"
label values first_mort_r Mortgage_Number
rename first_mort_r Mortgage_Number
mvdecode Mortgage_Number, mv(-2)
rename ltv LTV
rename score_orig_r Credit_Score
mvdecode Credit_Score, mv(-2)
drop if Credit_Score < 300
drop if Credit_Score > 850
recode x56a (3=0) (2=1) (1=2) 
label define mortgagelitone 0 "Not At All" 1 "Somewhat" 2 "Very"
label values x56a mortgagelitone
rename x56a Mortgage_Literacy_One
recode x56b (3=0) (2=1) (1=2) 
label define mortgagelittwo 0 "Not At All" 1 "Somewhat" 2 "Very"
label values x56b mortgagelittwo
rename x56b Mortgage_Literacy_Two
recode x56c (3=0) (2=1) (1=2) 
label define mortgagelitthree 0 "Not At All" 1 "Somewhat" 2 "Very"
label values x56c mortgagelitthree
rename x56c Mortgage_Literacy_Three
recode x56d (3=0) (2=1) (1=2) 
label define mortgagelitfour 0 "Not At All" 1 "Somewhat" 2 "Very"
label values x56d mortgagelitfour
rename x56d Mortgage_Literacy_Four
recode x56e (3=0) (2=1) (1=2) 
label define mortgagelitfive 0 "Not At All" 1 "Somewhat" 2 "Very"
label values x56e mortgagelitfive
rename x56e Mortgage_Literacy_Five
recode x56f (3=0) (2=1) (1=2) 
label define mortgagelitsix 0 "Not At All" 1 "Somewhat" 2 "Very"
label values x56f mortgagelitsix
rename x56f Mortgage_Literacy_Six
recode x56g (3=0) (2=1) (1=2) 
label define mortgagelitseven 0 "Not At All" 1 "Somewhat" 2 "Very"
label values x56g mortgagelitseven
rename x56g Mortgage_Literacy_Seven
mvdecode Mortgage_Literacy_Seven, mv(-3)
recode x56h (3=0) (2=1) (1=2) 
label define mortgageliteight 0 "Not At All" 1 "Somewhat" 2 "Very"
label values x56h mortgageliteight
rename x56h Mortgage_Literacy_Eight
mvdecode Mortgage_Literacy_Eight, mv(-3)
recode x56i (3=0) (2=1) (1=2) 
label define mortgagelitnine 0 "Not At All" 1 "Somewhat" 2 "Very"
label values x56i mortgagelitnine
rename x56i Mortgage_Literacy_Nine
mvdecode Mortgage_Literacy_Nine, mv(-3)
egen Mortgage_Literacy_Ten = rowmean(Mortgage_Literacy_One Mortgage_Literacy_Two Mortgage_Literacy_Three Mortgage_Literacy_Four Mortgage_Literacy_Five Mortgage_Literacy_Six Mortgage_Literacy_Seven Mortgage_Literacy_Eight Mortgage_Literacy_Nine)
reg Mortgage_Literacy_Ten i.Sex##i.Race i.Education i.Household_Income i.Risk_Attitudes i.Age i.Mortgage_Number, allbaselevels

Muhammod Towhid Ahmed

Join Date: Mar 2023

Posts: 6
#2

30 Mar 2023, 10:13

Also, here is a picture of the regression output!
Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#3

30 Mar 2023, 10:30

My base category is White Male, however, Stata doesn't give me the interaction coefficient and standard errors for Male#Black, Male#Asian, Male#Other, and Female#White.

Actually, it does. You're just looking for them in the wrong place.

The first thing to remember is that in an interaction model, the "main effects" do not mean what they mean in a non-interaction model. Rather, they reflect the effects of those variables conditional on the other variable's being zero. This is another way of saying that these "main effects" coefficients are actually disguised interaction coefficients for terms where the other variable is in its base level.

So, for example, the Male#Asian effect will be found as the coefficient of Asian in the Race variable. And the Female#White effect will be found as the coefficient of Female in the Sex variable. This is because Male is the base category for Sex in your model, and White the base category for Race.
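
To make the point above concrete, here is a minimal sketch on simulated data. The outcome y, the seed, and the effect sizes are made up for illustration; only the coding of Sex (1 = Male, 2 = Female) and Race (1 = White, 2 = Black, 3 = Asian, 4 = Other) follows the original post. It uses -lincom- to read off the "main effects" and to combine them with an interaction term.

```stata
* Simulated stand-in data (assumption: values are invented, only the coding mirrors the thread)
clear
set seed 12345
set obs 2000
generate Sex  = 1 + (runiform() < 0.5)   // 1 = Male, 2 = Female
generate Race = ceil(4 * runiform())     // 1 = White, 2 = Black, 3 = Asian, 4 = Other
generate y = 1 + 0.1*(Sex == 2) - 0.2*(Race == 3) + 0.15*(Sex == 2)*(Race == 3) + rnormal(0, 0.3)

regress y i.Sex##i.Race

* Asian vs White among males (Sex at its base level): the Race "main effect"
lincom 3.Race

* Female vs Male among whites (Race at its base level): the Sex "main effect"
lincom 2.Sex

* Female vs Male among Asians: Sex main effect plus the interaction term
lincom 2.Sex + 2.Sex#3.Race

* Asian female vs white male: both main effects plus the interaction term
lincom 2.Sex + 3.Race + 2.Sex#3.Race
```

Each -lincom- call reports the estimate with its standard error and confidence interval, so none of these contrasts has to be worked out by hand.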
Muhammod Towhid Ahmed

Join Date: Mar 2023

Posts: 6
#4

30 Mar 2023, 10:57

Hi Clyde,

Amazing, thank you so much!

How would I interpret the coefficients then? My understanding is that since my base category is White Male, I use "summarize Mortgage_Literacy_Ten if Sex==1 & Race==1" to get the mean value of Mortgage_Literacy_Ten for White Male, e.g. 1.3, and the coefficients represent the change from that mean value. For example, going from First Mortgage to Not First Mortgage would represent a 0.12 increase in Mortgage_Literacy_Ten from the mean value, so a 9% increase. Is this correct?

Best wishes,
Towhid
Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#5

30 Mar 2023, 13:29

Well, actually, that approach doesn't work well anyway. The problem is that it is highly likely that the variables other than race and sex differ among the values of race and sex, so when you then take means conditioned on the values of race and sex, you are partially undoing the efforts of the regression to adjust the analysis for those differences in the other variables. What you would be calculating with that approach is the expected values of the outcome variable in each race#sex category while holding all of the other variables at zero (i.e., for the categorical predictors, holding them at their base level).

To get fully adjusted estimates of the expected values of the outcome in each race#sex category, you should instead use the -margins- command following the regression. For this regression, the follow-up with -margins- is

Code:

margins Sex#Race

If you are looking for expected values that suppress variation in the variables other than Race and Sex (which is what you would have gotten with your approach, appropriately modified based on what I explained in #3) it is probably more meaningful to constrain the other variables to their means rather than their base levels. You can get that with

Code:

margins Sex#Race, atmeans

Or, if you want expected values that suppress variation in the variables other than Race and Sex but don't like using their mean values, you can choose the specific values you want with:

Code:

margins Sex#Race, at(Education = chosen_value_of_education Risk_Attitudes = chosen_value_of_risk_attitude ETC.)

Replace the placeholder parts (chosen_value_of_education, chosen_value_of_risk_attitude, ETC.; italicized in the original post) by values and variables appropriate to your model.
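
The difference between raw cell means and the -margins- results described in this post can be sketched on simulated data. Everything below is an assumption for illustration: the outcome y, the seed, the effect sizes, and the deliberate imbalance that makes females more likely to be "Not First Mortgage". Only the variable names and codings follow the thread.

```stata
* Simulated stand-in data (assumption: invented values, thread-style variable names)
clear
set seed 2023
set obs 4000
generate Sex  = 1 + (runiform() < 0.5)   // 1 = Male, 2 = Female
generate Race = ceil(4 * runiform())     // 1 = White, ..., 4 = Other
* Unbalanced covariate: P(Not First Mortgage) is 0.7 for females, 0.3 for males
generate Mortgage_Number = 1 + (runiform() < cond(Sex == 2, 0.7, 0.3))
generate y = 1 + 0.1*(Sex == 2) + 0.2*(Mortgage_Number == 2) + rnormal(0, 0.3)

regress y i.Sex##i.Race i.Mortgage_Number

* Raw cell means: these mix the Sex/Race differences with the covariate imbalance
tabulate Sex Race, summarize(y) nostandard nofreq

* Adjusted predictions, averaging over the covariate distribution of the whole sample
margins Sex#Race

* Adjusted predictions with the covariate held at its sample means
margins Sex#Race, atmeans

* Adjusted predictions with the covariate fixed at a chosen level
margins Sex#Race, at(Mortgage_Number = 1)
```

In this setup the raw female-minus-male gap in the -tabulate- output should be noticeably larger than the gap in the -margins- output, because part of the raw gap comes from Mortgage_Number rather than from Sex.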
Muhammod Towhid Ahmed

Join Date: Mar 2023

Posts: 6
#6

30 Mar 2023, 14:08

Ah, I see. Thank you! One last question. My new methodology is:
1. Run the regression without the interaction effects.
2. Use margins to calculate the mortgage literacy value of my base category.
3. Run the regression with the interaction effect, but only to use the interaction effects.

Essentially, I would use the first regression to analyse the individual variables and their significance (they follow the same trend), and the second regression only for the interaction effects.

Does this sound like it would produce reliable results?
Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#7

30 Mar 2023, 14:28

No. If I understand what you are saying, you would like to use both the model with and the model without interaction and draw conclusions from one of them for some variables and from the other for other variables. Definitely a bad idea. You cannot "mix and match" the terms from the two models. (There are certain very stringent conditions under which this would be OK, but they seldom apply in real-world observational data. Moreover, in those conditions, the coefficients of the non-interaction-involved variables would turn out to be the same, or very nearly so, in both models, so it really would make no difference at all which model you looked at for those.)

Perhaps I have misunderstood what you have in mind. In that case, please try to explain more clearly what you are thinking of doing here.
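
For readers who want to see how far the two models can drift apart, here is a rough sketch on simulated data. The outcome y, the seed, the effect sizes, and the stored-estimate names no_inter and with_inter are all hypothetical. It fits both models, tests the interaction terms jointly, and lists the coefficients side by side; it is a way to inspect the two fits, not an endorsement of combining them.

```stata
* Simulated stand-in data (assumption: invented values, thread-style variable names)
clear
set seed 42
set obs 4000
generate Sex  = 1 + (runiform() < 0.5)   // 1 = Male, 2 = Female
generate Race = ceil(4 * runiform())     // 1 = White, ..., 4 = Other
* Covariate whose distribution depends on the Sex-by-Race cell
generate Mortgage_Number = 1 + (runiform() < cond(Sex == 2 & Race == 3, 0.8, 0.3))
generate y = 1 + 0.3*(Sex == 2)*(Race == 3) + 0.2*(Mortgage_Number == 2) + rnormal(0, 0.3)

* Model without the interaction
regress y i.Sex i.Race i.Mortgage_Number
estimates store no_inter

* Model with the interaction
regress y i.Sex##i.Race i.Mortgage_Number
estimates store with_inter

* Joint test of the three interaction terms in the interaction model
testparm i.Sex#i.Race

* Coefficients and standard errors from both models, side by side
estimates table no_inter with_inter, b(%9.4f) se(%9.4f)
```

Whichever model is chosen, all conclusions should then be drawn from that one model.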
Muhammod Towhid Ahmed

Join Date: Mar 2023

Posts: 6
#8

30 Mar 2023, 14:37

Yes, you understood correctly! I will definitely not do that then.

Based on what you've said, I will do the following:
1. Run "reg Mortgage_Literacy_Ten i.Sex##i.Race i.Education i.Household_Income i.Risk_Attitudes i.Age i.Mortgage_Number, allbaselevels".
2. Use margins to get the mortgage literacy value of the base category.
3. Use this value to work out percentage changes in literacy levels based on the different variables, e.g. if the coefficient value for "Not First Mortgage" is +0.2, and the margin for Male#White is 1.3, then the percentage increase would be roughly 15%.

Does this sound right? If so, do you have any thoughts on how to test whether my model meets the assumptions of a multiple linear regression? Having categorical variables makes things more difficult, and in practice my dependent variable is also categorical: the values are either 0, 1, or 2.
Muhammod Towhid Ahmed

Join Date: Mar 2023

Posts: 6
#9

30 Mar 2023, 14:40

If this doesn't sound right, would it be possible to trouble you for a call please?