Two-stage regression model with categorical dependent variable and categorical endogenous variable

Jennifer Castaneda

Join Date: Mar 2018

Posts: 5
#1

20 Mar 2018, 09:46

Dear Statalist,

I am running a model where my dependent variable (y1) has 4 categories. My problem is that one of my regressors (y2, a categorical variable with 4 categories) is also influenced by the other explanatory variables (household characteristics, $x1). I have already identified an instrumental variable (included in $x2). The base outcomes for both variables of interest (y1 and y2) are their largest categories, and I have already checked for IIA by running two separate multinomial logistic models.

I have tried different approaches to account for endogeneity, without success:

I tried cmp:

Code:
cmp (y1 = $x1) (y2 = $x2), ind($cmp_mprobit $cmp_mprobit)

but I could not get it to converge:

Code:
cannot compute an improvement -- discontinuous region encountered
convergence not achieved
convergence not achieved
r(430);

I also tried gsem, using a latent variable to account for the correlation between the error terms:

Code:
gsem (i.y1 <- $x1 L, mlogit) (i.y2 <- $x2 L, mlogit), vce(cluster community) var(L@1)

and I got the following error message:

Code:
initial values not feasible
r(1400);

The gsem command runs properly if I exclude the latent variable:

Code:
gsem (i.y1 <- $x1, mlogit) (i.y2 <- $x2, mlogit), vce(cluster community)

but then I would be ignoring the endogeneity problem.

My dataset has 320 observations and I am using Stata 14.

I would appreciate any advice you could provide to move forward.
Jennifer Castaneda

Join Date: Mar 2018

Posts: 5
#2

21 Mar 2018, 06:26

To complement the previous post and present the code properly:

I have tried supplying initial values both from the model without the latent variable and from a run with the noestimate option, but without success. I keep getting the following error message:

Code:
Refining starting values:

Grid node 0: log likelihood = .
Grid node 1: log likelihood = .
Grid node 2: log likelihood = .
Grid node 3: log likelihood = .
(note: Grid search failed to find values that will yield a log likelihood value.)

Fitting full model:

initial values not feasible
r(1400);

Starting values from the model without the latent variable:

Code:
gsem (i.y1<- $x1, mlogit) (i.y2<- $x2, mlogit), vce(cluster community)
matrix b = e(b)
gsem (i.y1<- $x1 L, mlogit) (i.y2<- $x2 L, mlogit), vce(cluster community) var(L@1) from(b)

...
Fitting fixed-effects model:

Iteration 0: log likelihood = -990.45505
Iteration 1: log likelihood = -957.79754 (backed up)
Iteration 2: log likelihood = -792.03858
Iteration 3: log likelihood = -554.69154
Iteration 4: log likelihood = -490.22375
Iteration 5: log likelihood = -489.69856 (backed up)
Iteration 6: log likelihood = -474.27916 (backed up)
Iteration 7: log likelihood = -461.6616
Iteration 8: log likelihood = -456.60642
Iteration 9: log likelihood = -451.35826
Iteration 10: log likelihood = -450.59582
Iteration 11: log likelihood = -450.57966
Iteration 12: log likelihood = -450.57965

Refining starting values:

Grid node 0: log likelihood = .
Grid node 1: log likelihood = .
Grid node 2: log likelihood = .
Grid node 3: log likelihood = .
(note: Grid search failed to find values that will yield a log likelihood value.)

Fitting full model:

initial values not feasible
r(1400);

Starting values from the noestimate option:

Code:
gsem (i.p1_foodsecurity<- $x1 L, mlogit) (i.gardentypes_recoded<- $x2 L, mlogit), vce(cluster community) var(L@1) noestimate
matrix b = e(b)
gsem (i.p1_foodsecurity<- $x1 L, mlogit) (i.gardentypes_recoded<- $x2 L, mlogit), vce(cluster community) var(L@1) from(b)

...
Refining starting values:

Grid node 0: log likelihood = .
Grid node 1: log likelihood = .
Grid node 2: log likelihood = .
Grid node 3: log likelihood = .
(note: Grid search failed to find values that will yield a log likelihood value.)

Fitting full model:

initial values not feasible
r(1400);
David Roodman

Join Date: Jul 2014

Posts: 472
#3

21 Mar 2018, 09:09

That is a very demanding model to fit, especially with so few observations, because each alternative other than the base alternatives gets its own equation in the model. So that's really 2*3=6 equations, each with its own error term, and at least in the cmp model these are allowed to be correlated, so that's 15 pairwise cross-correlations to estimate among the 6 (21 covariance terms if the 6 variances are counted too).
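
As a quick check of that count, ignoring any identifying normalizations a particular estimator might impose (none are assumed here): with J = 4 outcomes for each variable, every multinomial equation contributes J - 1 non-base equations, and a symmetric covariance matrix of size m has m(m-1)/2 distinct off-diagonal terms.

```latex
m = 2\,(J-1) = 2 \cdot 3 = 6, \qquad
\frac{m(m-1)}{2} = \frac{6 \cdot 5}{2} = 15 \ \text{pairwise correlations}, \qquad
\frac{m(m+1)}{2} = \frac{6 \cdot 7}{2} = 21 \ \text{terms including the 6 variances}.
```

Either way, that is a lot to ask of 320 observations.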

Oh, actually there's a more basic problem. Unless you specify IIA (which gets rid of the problem above by locking down the correlations to zero), you'll need different alternatives to have distinct sets of regressors. That's why Stata's mprobit command imposes IIA while asmprobit (alternative-specific mprobit) does not.

So try:

Code:

cmp (y1 =$x1, iia) (y2= $x2, iia), ind($cmp_mprobit $cmp_mprobit) nolr
Jennifer Castaneda

Join Date: Mar 2018

Posts: 5
#4

22 Mar 2018, 03:54

Thank you for your advice. I will work on simplifying the model.
Jennifer Castaneda

Join Date: Mar 2018

Posts: 5
#5

22 Mar 2018, 12:36

In order to simplify the model, I am trying to gauge how big the 'endogeneity problem' is. I have been looking at different posts on endogeneity, but I have not found a clear answer on how to test for endogeneity in a logistic or multinomial probit model with a categorical (4 categories), presumably endogenous, variable. I am not sure whether the 'traditional OLS technique' of computing the residuals from the reduced-form equation and including them in the original equation works for models with categorical dependent variables.
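
The residual-inclusion idea described above is usually called two-stage residual inclusion (2SRI) when it is carried over to nonlinear models. Below is a minimal sketch of what that could look like here, not something proposed in this thread. It assumes y1 and y2 are coded 1 to 4, that $x1 does not already contain y2, and that $x2 contains the instrument; the names p1-p4 and r2-r4 are made up for illustration. The second-stage standard errors ignore that the residuals are themselves estimated, so the test is best read as a rough diagnostic, and bootstrapping both stages together is commonly recommended.

```stata
* Hypothetical 2SRI sketch; variable coding and names are assumptions.

* Stage 1: reduced form for the suspected endogenous regressor y2
mlogit y2 $x2, vce(cluster community)
predict double p1 p2 p3 p4, pr

* Raw residuals for three of the four categories; the fourth is
* redundant because the residuals sum to zero across categories
forvalues k = 2/4 {
    generate double r`k' = (y2 == `k') - p`k'
}

* Stage 2: outcome equation with y2 plus the first-stage residuals
* (drop i.y2 here if $x1 already includes it)
mlogit y1 i.y2 $x1 r2 r3 r4, vce(cluster community)

* Joint Wald test that the residual terms are zero in every y1 equation;
* a rejection points toward endogeneity of y2
test r2 r3 r4
```

If the joint test is far from rejecting, that is some support for treating y2 as exogenous and fitting the simpler single-equation model, subject to the caveat about standard errors above.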