
  • Modeling Proportions, correcting for Heteroskedasticity

    Dear Stata Forum,

    I have a question which is not directly related to Stata, but since the Stata Journal has been a valuable source of information, I hope I can find answers here.

    I am running a regression with the vote shares of a party in different districts as my dependent variable.

    The excellent article by Baum (2008) (https://journals.sagepub.com/doi/pdf...867X0800800212) answered most of my methodological questions. I cannot make use of OLS, as this requires the dependent variable to lie on the real number line, i.e. it needs to be unbounded (and not a proportion/fraction/share).
    However, one question remains. The article says that for response variables that lie strictly inside the interval (0,1), a logit transformation suffices to make OLS a valid estimation technique again. Stata's grouped logistic regression (glogit) is recommended if one wants to correct for heteroskedasticity in the error term.

    My question is now: given that I can safely use OLS on the logit-transformed variable, could I not simply correct for heteroskedasticity with robust standard errors instead of using the weighted least squares method glogit?

    Thanks for your help and best regards
    Stefan

  • #2
    That article was written before Stata included commands explicitly intended for dealing with proportions: fracreg and betareg. So if you want to study proportions, those are the two commands to look at. I recently wrote an encyclopedia entry on various methods of analyzing proportions: http://www.maartenbuis.nl/publications/prop.html
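
    A minimal sketch of how the two commands might be called, assuming a hypothetical dataset with a vote share variable share and hypothetical covariates x1 and x2 (none of these names come from the original question):

    ```stata
    * share, x1 and x2 are hypothetical variable names

    * fractional logit: models the mean of share with a logit link;
    * standard errors are robust by default
    fracreg logit share x1 x2

    * average marginal effects on the mean proportion
    margins, dydx(*)

    * beta regression: requires share to lie strictly between 0 and 1
    betareg share x1 x2
    margins, dydx(*)
    ```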

    However, I am slightly worried about your dependent variable. I hope you are familiar with the ecological fallacy, and that you won't try to use your results to make statements about individuals' propensity to vote for certain parties: https://en.wikipedia.org/wiki/Ecological_fallacy
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Dear Maarten,

      thanks for your reply!
      Indeed, I did not know about these two commands and will look into them.

      However, I would still be interested in why OLS (or linear regression in general, for that matter) is not a valid option.

      I understand that heteroskedasticity is a problem, but it is inherent to almost every cross-sectional study, and robust standard errors are traditionally the first option to at least alleviate it. A scatter plot of the residuals against the fitted values does not look too bad either. The first scatter plot shows the results of OLS with the vote share itself as the response variable. There is definitely heteroskedasticity, but there is no downward or upward slope, as no observations touch (or come too close to) the boundaries. The second scatter plot shows the residuals against the predictions from an OLS regression on the logit-transformed response variable.

      [Image: statalist_example.png]
      [Image: statalist_example_logit.png]

      With respect to OLS not enforcing the lower and/or upper bound: if one logit-transforms the response variable, this should not be a concern, as the transformed y can take any value on the real number line.
      This is of course not possible when there are observations at the boundaries (0 or 1). However, in my example the response variable lies strictly between 0 and 1 (and does not come too close to these points). In theory, OLS should be unbiased, but I understand that it might not be the most efficient solution (I'd appreciate it a lot if you could correct me in the event that I am wrong).
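
      A sketch of the two regressions and residual plots described above, assuming hypothetical variable names share, x1 and x2 (the actual model specification is not shown in this post):

      ```stata
      * share, x1 and x2 are hypothetical names; share must lie strictly in (0,1)

      * (1) OLS on the raw share with heteroskedasticity-robust standard errors
      regress share x1 x2, vce(robust)
      predict double fit_raw, xb
      predict double res_raw, residuals
      scatter res_raw fit_raw, yline(0)

      * (2) OLS on the logit-transformed share
      generate double lshare = logit(share)
      regress lshare x1 x2, vce(robust)
      predict double fit_logit, xb
      predict double res_logit, residuals
      scatter res_logit fit_logit, yline(0)
      ```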

      I only ask because my program does not focus on econometrics, and I will likely have to justify why I did not apply the solutions familiar from our curriculum.

      Regarding the potential ecological fallacy: great point, but my analysis stays at the aggregate level and I do not intend to infer anything about individual-level behavior.

      Thank you for taking your time, I really appreciate it.

      Stefan
      Last edited by Stefan Sliwa; 07 Jun 2019, 15:03.



      • #4
        The biggest problem with using linear regression this way is that you are no longer modeling the average proportion but the average transformed proportion. Remember that the logit transformation is a nonlinear transformation, so you can't easily recover the effects on the proportions from the effects on the logit(proportions). As a consequence, it will be hard to interpret your model.

        Code:
        . // create a random dataset with proportions
        . clear
        
        . set obs 100
        number of observations (_N) was 0, now 100
        
        . set seed 123456
        
        . gen prop = runiform()
        
        .
        . // compute the mean
        . sum prop
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                prop |        100    .5140416    .2884996   .0165345   .9990426
        
        .
        . // transform the proportions
        . gen tr_prop = logit(prop)
        
        .
        . // show that we cannot recover the mean proportion by back-transforming
        . // the mean transformed proportion:
        . sum tr_prop
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
             tr_prop |        100    .1374499    1.833884  -4.085631   6.950363
        
        . di invlogit(r(mean))
        .53430847

        Back-transforming the mean of the logits gives .534, not the mean proportion of .514. fracreg and betareg solve this by modeling the transformed mean proportion rather than the mean transformed proportion. So from them you can recover the effects on the mean proportion.
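
        Written out as a sketch, in notation that is not part of the original post ($y$ is the proportion, $x$ the covariates, $\beta$ the coefficients), the two approaches put the expectation in different places:

        ```latex
        \begin{align*}
        \text{OLS on the logit:} \quad
          & E\bigl[\operatorname{logit}(y) \mid x\bigr] = x\beta \\
        \text{fracreg / betareg with a logit link:} \quad
          & \operatorname{logit}\bigl(E[y \mid x]\bigr) = x\beta
        \end{align*}
        ```

        Because the logit is nonlinear, $E[\operatorname{logit}(y)] \neq \operatorname{logit}(E[y])$ in general, so only the second form can be inverted to give $E[y \mid x]$ directly.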
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------



        • #5
          Thanks again, Maarten.


          If I understood it correctly, the defaults of betareg are link(logit) for the location parameter and slink(log) for the scale parameter.

          Do I obtain the intuitive marginal effects after betareg with the margins, dydx(*) command?
          The only thing I am missing is the constant. Why does the margins command omit it? My constant corresponds to the reference category of a categorical variable. More specifically, I have 4 regions, with k=1 as the reference category captured by the constant. This is why I need its value in terms of the intuitive marginal effects for a meaningful interpretation.

          I did not find anything in the help files.

          Thanks once again.
          Stefan
          Last edited by Stefan Sliwa; 14 Jun 2019, 10:33.
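
          For concreteness, a sketch of the commands in question, assuming hypothetical variable names share, region and x1 (the actual model is not shown in the thread). margins, dydx(*) reports changes in the predicted mean proportion, and for a factor variable these are contrasts with the base category, so the base category gets no row of its own. One possible way to see the level for every region, base category included, is to ask margins for predicted means by region:

          ```stata
          * share, region and x1 are hypothetical variable names
          betareg share i.region x1

          * average marginal effects; for i.region these are contrasts
          * with the base category, so the base itself has no row
          margins, dydx(*)

          * predicted mean proportion in each region, base category included
          margins region
          ```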
