ivregress for binary variables

Laura Myles

Join Date: Jun 2018

Posts: 153
#1

16 Apr 2021, 07:17

Hi All.

I am analyzing data from a trial looking at the impact of treatment on prognosis (i.e. dead or alive). Participants received either usual care or a new treatment; however, some contamination occurred, and after running an intention-to-treat analysis I would like to adjust for it. I know per-protocol analyses are not perfect, and I have come across a paper suggesting a Complier Average Causal Effect (CACE) analysis where the instrumental variable is the treatment actually received.

In Stata, there is a command called ivregress that can conduct instrumental variable regressions but this seems to require continuous outcomes (and possibly predictors). Is there an equivalent command (or workaround) that can be used when the outcome, predictor and instrumental variables are binary?

Thanks for your help!
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

16 Apr 2021, 07:40

Linear IV regression is appropriate for your case where the "outcome, predictor and instrumental variables are binary".

If you want to do something exotic, but not really necessary, you can try out -biprobit-.
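
As a rough illustration of the linear IV suggestion, here is a minimal Stata sketch on simulated data. The variable names (assigned, received, alive) and all simulated numbers are hypothetical and not taken from the trial in #1; the sketch assumes the randomized allocation serves as the instrument and the treatment actually received is the endogenous regressor.

```stata
* Sketch only: simulated data, hypothetical variable names.
clear
set seed 12345
set obs 2000
generate byte assigned = runiform() < 0.5                // randomized arm (instrument)
generate byte complier = runiform() < 0.8                // latent compliance type
generate byte received = assigned * complier             // treatment actually received
generate byte alive    = runiform() < (0.60 + 0.15 * received)

* Linear IV (2SLS): outcome, endogenous regressor and instrument are all binary.
ivregress 2sls alive (received = assigned), vce(robust)

* The same point estimate by hand: ITT effect divided by the difference in uptake.
regress alive assigned
scalar itt = _b[assigned]
regress received assigned
scalar uptake = _b[assigned]
display "Wald ratio = " itt / uptake
```

With a single binary instrument, the 2SLS coefficient on received equals the ratio of the two -regress- slopes, and it reads as a risk difference rather than an odds ratio or relative risk.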
Laura Myles

Join Date: Jun 2018

Posts: 153
#3

16 Apr 2021, 08:35

Thanks Joro Kolev - is there a way to obtain odds ratios or relative risks using ivregress?
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#4

16 Apr 2021, 09:56

Automatically, I do not think so. -ivregress- can report coefficients or exponentiated coefficients.

Manually everything can be done, if you show the formula of what you want.

Laura Myles

Join Date: Jun 2018

Posts: 153
#5

14 Jun 2021, 10:34

Dear Joro Kolev, I am back on this issue and wanted to follow up on your last comment that manually everything can be done. If I were to run the -ivregress- command with a binary outcome, I would obtain a coefficient but not an RR, which is what I am after.

What options do I have to estimate it manually? It would be great if you could point me towards some resources.

Last edited by Laura Myles; 14 Jun 2021, 10:36.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#6

14 Jun 2021, 16:08

I do not know what odds ratios or RR are in the context of an estimated linear IV regression, Laura. So what I was asking you is the same thing that you ended up asking me, and therefore I cannot point you to sources.

If you are asking for "odds ratios or RR in the context of estimated linear IV regression," you must have seen this somewhere, including the formula that gives odds ratios or RR from the estimated coefficients. If you show this formula, we can do the calculation in Stata.

Matthew Williams

Join Date: Feb 2021

Posts: 195
#7

15 Jun 2021, 03:25

Hi Joro,

I have a question regarding the comparison between OLS and IV estimates. I have read some papers saying that OLS estimates tend to be smaller than IV estimates, but they did not provide explicit explanations. For example, one paper estimates the effect of maternal education on child mortality using IV, where maternal education is measured as mothers' years of schooling (denoted Edu) and child mortality (let's call it Y) is binary (1 = child dead, 0 otherwise). That paper uses the timing of an education reform (let's call it Z) as exogenous variation to instrument for maternal years of schooling. In the results, they find that the IV estimates are greater than the OLS estimates, so I am wondering how to explain or compare the estimates from the two regressions.

Thank you.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#8

15 Jun 2021, 03:59

Matthew, very generally in measurement error problems OLS suffers from something called "attenuation bias," that is OLS estimates are biased towards 0. Therefore in measurement error problems you are pretty much assured that your IV estimates will be larger than the OLS estimates. E.g., in the literature of the effects of financial literacy on various financial decisions, financial literacy is quite clearly measured with error. You basically ask your subject some 4-5 simple questions regarding inflation, stocks, interest rates, etc., and you call some simple function of their answers, say the sum of correct answers, "financial literacy" -- very obviously this is not financial literacy but some very rough mismeasured proxy. So in this literature you find all over the place that IV estimates of the effects of financial literacy on financial decisions are a lot larger than the OLS estimates, presumably due to attenuation bias, see e.g.,
Cupák, Andrej, Gueorgui I. Kolev, and Zuzana Brokešová. "Financial literacy and voluntary savings for retirement: novel causal evidence." The European Journal of Finance 25, no. 16 (2019): 1606-1625.
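
To make the attenuation argument concrete, here is a small simulation sketch in Stata. Everything in it (the variable names, the true slope of 0.5, the noise variances) is made up for illustration and is not taken from the cited paper; a second, independently mismeasured proxy plays the role of the instrument.

```stata
* Sketch only: classical measurement error in the regressor, simulated data.
clear
set seed 2021
set obs 5000
generate xstar = rnormal()                  // true regressor (unobserved in practice)
generate x1 = xstar + rnormal()             // noisy measure used as the regressor
generate x2 = xstar + rnormal()             // second noisy measure, used as instrument
generate y  = 1 + 0.5 * xstar + rnormal()   // true slope is 0.5

regress y x1                                // OLS: slope attenuated toward 0
ivregress 2sls y (x1 = x2)                  // IV: consistent for the true slope
```

In this setup the OLS slope converges to 0.5 x 1/(1 + 1) = 0.25, because the signal and the measurement noise have equal variance, while the IV slope is consistent for 0.5.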

I do not think there are general results when you use instrumentation to solve reverse causality or omitted variable problems. In these cases, depending on the underlying mechanism, IV estimates can be larger or smaller than OLS estimates.

The mother's education might be mismeasured or not; I do not know what the authors of the paper you have in mind are doing or saying.

To your final question:

1. You compare IV and OLS estimates by a) eyeballing them and seeing what is larger, and whether magnitudes make sense. b) doing a Hausman test to see whether they are statistically different.

2. I believe that only in measurement error problems are IV estimates expected to be larger than OLS estimates without further qualification. In all other cases you need to reason through the mechanism that is supposedly causing the endogeneity.
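
A minimal sketch of point 1(b) in Stata, on simulated data with an omitted confounder. All names and numbers are hypothetical. It uses -estat endogenous-, which after -ivregress 2sls- reports Durbin and Wu-Hausman tests of whether the regressor can be treated as exogenous; this is one way to run a Hausman-type comparison, not necessarily the one intended above.

```stata
* Sketch only: omitted-variable endogeneity, simulated data.
clear
set seed 2021
set obs 5000
generate z = rnormal()                      // instrument
generate v = rnormal()                      // unobserved confounder
generate x = 0.8 * z + v + rnormal()        // endogenous regressor
generate y = 1 + 0.5 * x - v + rnormal()    // true slope on x is 0.5

regress y x                                 // OLS, biased by the omitted v
ivregress 2sls y (x = z)                    // IV using z
estat endogenous                            // H0: x is exogenous
```

Here the confounder v raises x and lowers y, so OLS is biased downward relative to IV; flipping the sign on v in the y equation reverses the direction, which is the "reason through the mechanism" point.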

Laura Myles

Join Date: Jun 2018

Posts: 153
#9

15 Jun 2021, 10:37

Thanks Joro Kolev. No, I did not have anything in mind, but I have done some reading on this question and found recommendations to use a two-stage residual inclusion (2SRI) estimator when the outcome is binary. I looked into the -ivpoisson- command: its description says it handles continuous covariates, but previous posts on the Stata forum suggest otherwise, so I am wondering whether it could be the straightforward solution.

The linked paper on the use of 2SRI for trial data suggests the residuals can be obtained from a first-stage least-squares linear regression of arm allocation on the instrument (https://www.sciencedirect.com/scienc...5435617305644#!), so I have modified the code for the bootstrap example in Terza's practitioner's guide. It seems to work OK and produces output close to that obtained using -ivpoisson-. However, I am unsure about my approach for Stage 1 and would welcome your input/views.

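program drop _all
program twosri, rclass

* Stage 1: linear regression of arm on the instrument; keep the residuals
* (drop any earlier copy of resid so the program can be called repeatedly)
capture drop resid
regress arm i.contamination
predict resid, residuals

* Stage 2: Poisson GLM with log link, including the first-stage residuals
glm alive i.arm resid, robust family(poisson) link(log) eform
return scalar barm = _b[1.arm]
end

* Bootstrap both stages together; report the exponentiated coefficient on arm
bootstrap exp(r(barm)), reps(1000) seed(19345): twosri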
Matthew Williams

Join Date: Feb 2021

Posts: 195
#10

16 Jun 2021, 09:04

Thank you for your detailed and helpful explanations. I appreciate that.

I just made a new thread, which can be found here: https://www.statalist.org/forums/for...a-given-period since I think my new questions deserve a new post. Would you like to take a look at that post? Please note that the content of the post is a combination of my idea and the results of the paper mentioned in #7.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#11

17 Jun 2021, 02:12

Laura, I am not familiar with the work of Terza, so I do not know. If she/he shows that what you are doing works, fine.

What you are doing is known in econometrics as the "control function approach," an authoritative reference on the topic being Wooldridge, Jeffrey M. "Control function methods in applied econometrics." Journal of Human Resources 50, no. 2 (2015): 420-445.

The control function approach requires strong distributional assumptions, which make it inappropriate when you have a binary endogenous regressor. To use the control function approach, your dependent variable can be of any nature, and your instrument can be of any nature, but your endogenous regressor has to be roughly continuous.

Stata itself adds to the confusion, because what is called -ivprobit- is not IV but control function probit, and what is called -ivpoisson- is not IV but control function. (There is -ivpoisson gmm-, which might be IV; I have not read the manual carefully enough to figure out what it does.)

In short, I already told you toward the beginning of the thread my opinion regarding your problem:

1) Use linear IV regression, -ivregress-. In your case the linear IV regression has LATE (local average treatment effect) interpretation, e.g., see these papers, or at least one of them, they say pretty much the same thing and are by the same set of authors:
Angrist, Joshua D., and Guido W. Imbens. "Two-stage least squares estimation of average causal effects in models with variable treatment intensity." Journal of the American Statistical Association 90, no. 430 (1995): 431-442.
Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. "Identification of causal effects using instrumental variables." Journal of the American Statistical Association 91, no. 434 (1996): 444-455.
Angrist, Joshua, and Guido Imbens. "Identification and estimation of local average treatment effects." (1995). NBER working paper.

2) If you want something exotic that "appropriately takes care of the nature of your variables" use -biprobit-: -biprobit- is appropriate for binary outcome and binary endogenous regressor.
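
A rough sketch of option 2 on simulated data, assuming a recursive bivariate probit in which the binary treatment received enters the outcome equation and the randomized assignment appears only in the treatment equation. The variable names (assigned, received, alive) and all parameter values are hypothetical.

```stata
* Sketch only: recursive bivariate probit on simulated data.
clear
set seed 12345
set obs 5000
generate byte assigned = runiform() < 0.5
generate e1 = rnormal()
generate e2 = 0.5 * e1 + sqrt(0.75) * rnormal()     // corr(e1, e2) = 0.5, var(e2) = 1
generate byte received = (-0.5 + 1.5 * assigned + e1 > 0)
generate byte alive    = (0.2 + 0.6 * received + e2 > 0)

* Outcome equation first, then the equation for the endogenous binary treatment.
biprobit (alive = received) (received = assigned)
```

The reported rho is the estimated correlation between the two equations' errors (0.5 in this simulation). The coefficient on received is on the probit index scale, so it is not directly comparable to the risk difference from -ivregress-.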

Laura Myles

Join Date: Jun 2018

Posts: 153
#12

22 Jun 2021, 08:07

Thank you Joro Kolev - I will read the papers you recommended!