Modeling Proportions, correcting for Heteroskedasticity

Stefan Sliwa

Join Date: Jun 2019

Posts: 19
#1

07 Jun 2019, 06:47

Dear Stata Forum,

I have a question which is not directly related to Stata, but since the Stata Journal has been a valuable source of information, I hope I can find answers here.

I am running a regression with the vote shares of a party in different districts as my dependent variable.

The excellent article by Baum (2008) (https://journals.sagepub.com/doi/pdf...867X0800800212) answered most of my methodological questions. I cannot use OLS directly, as this requires the dependent variable to lie on the real number line, i.e. it needs to be unbounded (and not a proportion/fraction/share).
However, one question remains. The article says that for response variables that lie strictly inside the interval (0,1), a logit transformation suffices to make OLS a valid estimation technique again. Stata's grouped logistic regression (glogit) is recommended if one wants to correct for heteroskedasticity in the error term.

My question is now: given that I can safely use OLS on the transformed variable, could I not simply correct for heteroskedasticity with robust standard errors instead of using the weighted least squares method glogit?
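For concreteness, the robust-standard-error approach asked about here could be sketched as follows. The data are simulated and the variable names (vote_share, turnout) are hypothetical; this only illustrates the mechanics and says nothing about whether the approach is preferable to glogit.

```stata
* Hypothetical data: vote_share and turnout are illustrative names only
clear
set seed 2019
set obs 200
generate turnout = runiform()
generate vote_share = invlogit(-1 + 0.8*turnout + rnormal(0, 0.5))

* Logit-transform the share (defined because 0 < vote_share < 1)
generate logit_share = logit(vote_share)

* OLS on the transformed outcome with heteroskedasticity-robust standard errors
regress logit_share turnout, vce(robust)
```

The coefficients are on the logit scale of the share, not on the scale of the share itself.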

Thanks for your help and best regards
Stefan

Maarten Buis

Join Date: Mar 2014

Posts: 3439
#2

07 Jun 2019, 07:32

That article was written before Stata included commands explicitly intended for dealing with proportions: fracreg and betareg. So if you want to study proportions, those are the two commands to look at. I recently wrote an encyclopedia entry on various methods of analyzing proportions: http://www.maartenbuis.nl/publications/prop.html
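A minimal sketch of what calling the two commands might look like, assuming Stata 14 or later. The data are simulated and the variable names are hypothetical.

```stata
* Hypothetical data: vote_share and turnout are illustrative names only
clear
set seed 2019
set obs 200
generate turnout = runiform()
generate vote_share = invlogit(-1 + 0.8*turnout + rnormal(0, 0.5))

* Fractional logit: quasi-likelihood, robust standard errors by default
fracreg logit vote_share turnout

* Beta regression: requires 0 < vote_share < 1
betareg vote_share turnout
```

fracreg also accepts outcomes exactly equal to 0 or 1, whereas betareg does not.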

However, I am slightly worried about your dependent variable. I hope you are familiar with the ecological fallacy, and that you won't try to use your results to make statements about individuals' propensity to vote for certain parties: https://en.wikipedia.org/wiki/Ecological_fallacy

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Stefan Sliwa

Join Date: Jun 2019

Posts: 19
#3

07 Jun 2019, 14:40

Dear Maarten,

thanks for your reply!
Indeed, I did not know about these two commands and will look into them.

However, I would still be interested in why OLS (or linear regression in general, for that matter) is not a valid option.

I understand that heteroskedasticity is a problem, but it is inherent to almost every cross-sectional study, and robust standard errors are traditionally the first option to at least alleviate it. Scatter plots of the residuals against the fitted values also do not look too bad. The first scatter plot shows the results of OLS with the vote share used directly as the response variable. There is definitely heteroskedasticity, but there is no downward or upward slope, as no observations touch (or come too close to) the boundaries. The second scatter plot shows the residuals against the predictions from an OLS regression on the logit-transformed response variable.

With respect to OLS not enforcing the lower and/or upper bound: if one logit-transforms the response variable, this should not be a concern, as the transformation maps y onto the real number line.
The transformation is of course not possible when there are observations at the boundaries. However, in my example the response variable lies strictly between 0 and 1 (and also does not come too close to these points). In theory, OLS should be unbiased, but I understand that it might not be the most efficient solution (I'd appreciate it a lot if you could clarify this in case I am wrong).

I am only asking because my program does not focus on econometrics, and I will likely have to justify why I did not apply the solutions familiar from our curriculum.

Regarding the potential ecological fallacy: great point, but I will keep my analysis at the aggregate level and do not intend to infer anything about individual-level behavior.

Thank you for taking the time, I really appreciate it.

Stefan
[Attached files: the two scatter plots described above]
Maarten Buis

Join Date: Mar 2014

Posts: 3439
#4

12 Jun 2019, 03:37

The biggest problem with using linear regression this way is that you are no longer modeling the average proportion but the average transformed proportion. Remember that the logit transformation is a nonlinear transformation, so you can't easily recover the effects on the proportions from the effects on the logit(proportions). As a consequence it will be hard to interpret your model.

Code:

. // create a random dataset with proportions
. clear
. set obs 100
number of observations (_N) was 0, now 100
. set seed 123456
. gen prop = runiform()
.
. // compute the mean
. sum prop

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        prop |        100    .5140416    .2884996   .0165345   .9990426

.
. // transform the proportions
. gen tr_prop = logit(prop)
.
. // show that we cannot recover the mean proportion by back-transforming
. // the mean transformed proportion:
. sum tr_prop

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     tr_prop |        100    .1374499    1.833884  -4.085631   6.950363

. di invlogit(r(mean))
.53430847

fracreg and betareg solve this by modeling the transformed mean proportion rather than the mean transformed proportion. So from them you can recover the effects on the mean proportion.
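As a sketch of that last point (not part of the original log): with the same simulated data, a constant-only fractional logit models the logit of the mean, so back-transforming its constant should reproduce the sample mean of prop rather than the .534 obtained above. The constant-only specification is chosen here purely for illustration.

```stata
* Same simulated data as in the log above
clear
set obs 100
set seed 123456
generate prop = runiform()

* Constant-only fractional logit: the constant estimates logit(E[prop])
fracreg logit prop

* Compare the back-transformed constant with the sample mean
summarize prop
display invlogit(_b[_cons])
display r(mean)
```

The two displayed values should agree up to the convergence tolerance of the optimizer.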

Stefan Sliwa

Join Date: Jun 2019

Posts: 19
#5

14 Jun 2019, 09:51

Thanks again, Maarten.

If I understood it correctly, the default of betareg is a link(logit) for the location parameter and a slink(log) for the scale parameter.

After running betareg, I obtain the intuitive marginal effects with margins, dydx(*). Is this correct?
The only term I am missing is the constant. Why does the margins command omit it? My constant corresponds to the reference category of a categorical variable. More specifically, I have 4 regions with k=1 as the reference. This is why I need its value in terms of the intuitive marginal effects for a meaningful interpretation.
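One possible way to look at this, sketched with simulated data and hypothetical variable names (region, turnout, vote_share): margins, dydx(*) reports derivatives and contrasts with respect to covariates, and the constant is not a covariate, so it has no marginal effect to report. The level for the reference region can instead be requested as a predicted mean with margins region.

```stata
* Hypothetical data: 4 regions, with region 1 as the base category
clear
set seed 2019
set obs 400
generate region = runiformint(1, 4)
generate turnout = runiform()
generate vote_share = invlogit(-1 + 0.3*(region==2) - 0.2*(region==3) ///
    + 0.5*(region==4) + 0.8*turnout + rnormal(0, 0.5))

* Defaults written out explicitly: link(logit) for the mean, slink(log) for the scale
betareg vote_share i.region turnout, link(logit) slink(log)

* Average marginal effects; regions 2-4 are contrasts with region 1
margins, dydx(*)

* Predicted mean vote share in each region, including the base region
margins region
```

The dydx() results for regions 2 to 4 are then differences from the region 1 prediction on the proportion scale.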

I did not find anything in the help files.

Thanks once again.
Stefan
