
  • OLS on discrete dependent variable

    I want to analyze the factors behind a person's educational attainment. I have a survey dataset. My dependent variable is years of education (preschool, 1, 2, ..., 18), while the independent variables are gender, income, etc. Can I use OLS to estimate this model, or should I use another econometric technique?

  • #2
    Muhammad.
    I would also consider -ologit.-
    Anyway, things are a bit trickier in your case, as your data come from a survey (please, see -help svy-).
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3

      Hello, Muhammad,

      Welcome to the Stata Forum.

      Carlo gave two excellent suggestions, one concerning the - ologit - model and the other concerning the - svy - prefix.

      By the way, if you choose a survey analysis, you'll need to think about (at least) three important aspects: strata, weights and the PSU.
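
      A minimal sketch of how those three design elements are declared, run on simulated data. The names stratum, psu, wt, female and edu4 are placeholders for illustration, not variables from Muhammad's dataset; the real names should come from the survey documentation.

      ```stata
      * Simulated data only: stratum, psu and wt are hypothetical design variables
      clear
      set seed 12345
      set obs 2000
      generate stratum = ceil(_n/500)      // 4 strata
      generate psu     = ceil(_n/50)       // 40 PSUs, 10 nested in each stratum
      generate wt      = 1 + runiform()    // made-up sampling weight
      generate female  = runiform() < 0.5
      generate edu4    = 1 + (runiform() > 0.4) + (runiform() > 0.6) + (runiform() > 0.8)

      * Declare the design once; every later svy: command picks it up
      svyset psu [pweight = wt], strata(stratum)
      svydescribe

      * Design-based ordered logit
      svy: ologit edu4 i.female
      ```

      Once -svyset- has been run, prefixing an estimation command with svy: is enough to obtain design-based standard errors.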

      Regardless of whether you -svyset- your data or not, I wish to add the following comments:

      a) Since the variable "years" seems to range at least from 1 to 18 (no zero?), you could categorize it into, say, 4 groups and make your -ologit- model more understandable to the audience, so to speak.

      b) On the other hand, should you wish to use years over its full range, and assuming you are dealing with count data, you could also consider a Poisson or negative binomial model. If, theoretically, the model doesn't allow zero values, you may wish to test whether a zero-truncated (Poisson or negative binomial) model fits better.

      c) It was not clear to me whether years of education will be measured only once for each individual, or whether you wish to measure "educational attainment" repeatedly. If you choose the second option, you may wish to select a panel-data design accordingly.
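
      Options (a) and (b) could be sketched as follows on simulated data. The variable names and the cut-offs used for the four groups are arbitrary placeholders, not a recommendation for any particular educational system.

      ```stata
      * Simulated data only: years and female are hypothetical variables
      clear
      set seed 12345
      set obs 2000
      generate female = runiform() < 0.5
      generate years  = min(18, rpoisson(8 + female))

      * (a) collapse years into four ordered groups (placeholder cut-offs)
      recode years (0/5 = 1 "primary or less") (6/10 = 2 "middle") ///
                   (11/12 = 3 "secondary") (13/18 = 4 "tertiary"), generate(edu4)
      ologit edu4 i.female

      * (b) keep the full range and treat years as a count
      poisson years i.female

      * zero-truncated variant, for data in which zero cannot occur
      tpoisson years i.female if years > 0, ll(0)

      * -nbreg- and -tnbreg- are the negative binomial counterparts
      ```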

      Best,

      Marcos



      • #4
        Thanks Carlo and Marcos!

        I have already tried -ologit-, dividing my dependent variable into 4 categories. Although the coefficients (after -ologit-) are significant, the average marginal effects (computed by -margins-) are not good for all outcomes. Most average marginal effects have three leading zeros (like 0.0004), which is not good and, I think, not interpretable (it means nearly zero probability on average). The results are attached below.


        ologit edu i.gender i.Me i.Hhe i.Hhn i.wlthind5 i.area

        Iteration 0: log likelihood = -4672.7716
        Iteration 1: log likelihood = -4632.8068
        Iteration 2: log likelihood = -4569.0401
        Iteration 3: log likelihood = -4566.6892
        Iteration 4: log likelihood = -4566.6827
        Iteration 5: log likelihood = -4566.6827

        Ordered logistic regression Number of obs = 120,489
        LR chi2(15) = 212.18
        Prob > chi2 = 0.0000
        Log likelihood = -4566.6827 Pseudo R2 = 0.0227

        ------------------------------------------------------------------------------
        edu | Coef. Std. Err. z P>|z| [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        gender |
        female | -.1387143 .0754821 -1.84 0.066 -.2866565 .009228
        |
        Me |
        2 | -.2520384 .105095 -2.40 0.016 -.4580208 -.046056
        3 | -.0094196 .2016714 -0.05 0.963 -.4046883 .3858491
        4 | 2.314247 .429267 5.39 0.000 1.4729 3.155595
        |
        Hhe |
        2 | -.2354058 .0839715 -2.80 0.005 -.3999869 -.0708246
        3 | -.3057352 .1619106 -1.89 0.059 -.6230741 .0116037
        4 | 1.963383 .4046434 4.85 0.000 1.170297 2.75647
        |
        Hhn |
        2 | .0549861 .096692 0.57 0.570 -.1345268 .2444989
        3 | -.1152156 .1010498 -1.14 0.254 -.3132695 .0828383
        4 | -.0301888 .1200921 -0.25 0.802 -.2655651 .2051875
        |
        wlthind5 |
        second | -.6246391 .1081967 -5.77 0.000 -.8367008 -.4125774
        middle | -.7723642 .1149702 -6.72 0.000 -.9977017 -.5470268
        fourth | -1.141412 .1386848 -8.23 0.000 -1.413229 -.8695946
        highest | -.7714078 .1599307 -4.82 0.000 -1.084866 -.4579495
        |
        area |
        all urban | .2577991 .1021139 2.52 0.012 .0576595 .4579387
        -------------+----------------------------------------------------------------
        /cut1 | 4.267976 .1004746 4.071049 4.464903
        /cut2 | 4.328564 .1008853 4.130833 4.526296
        /cut3 | 4.330017 .1008955 4.132266 4.527769
        ------------------------------------------------------------------------------

        . margins,dydx(*) predict(outcome(1))


        Average marginal effects Number of obs = 120,489
        Model VCE : OIM

        Expression : Pr(edu==1), predict(outcome(1))
        dy/dx w.r.t. : 2.gender 2.Me 3.Me 4.Me 2.Hhe 3.Hhe 4.Hhe 2.Hhn 3.Hhn 4.Hhn 2.wlthind5 3.wlthind5
        4.wlthind5 5.wlthind5 1.area

        ------------------------------------------------------------------------------
        | Delta-method
        | dy/dx Std. Err. z P>|z| [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        gender |
        female | .0008344 .0004504 1.85 0.064 -.0000484 .0017173
        |
        Me |
        2 | .0014245 .0005639 2.53 0.012 .0003192 .0025298
        3 | .0000598 .0012765 0.05 0.963 -.002442 .0025617
        4 | -.0543638 .0240623 -2.26 0.024 -.1015251 -.0072025
        |
        Hhe |
        2 | .0014315 .0005116 2.80 0.005 .0004289 .0024342
        3 | .0017986 .0008667 2.08 0.038 .0001 .0034973
        4 | -.0397149 .0176346 -2.25 0.024 -.0742781 -.0051517
        |
        Hhn |
        2 | -.0003494 .0006144 -0.57 0.570 -.0015536 .0008548
        3 | .0006738 .0005905 1.14 0.254 -.0004835 .0018311
        4 | .000184 .0007286 0.25 0.801 -.0012441 .0016121
        |
        wlthind5 |
        second | .0053468 .0010092 5.30 0.000 .0033688 .0073248
        middle | .0062004 .0010402 5.96 0.000 .0041617 .0082392
        fourth | .0078617 .0010783 7.29 0.000 .0057484 .0099751
        highest | .0061953 .0012926 4.79 0.000 .0036619 .0087287
        |
        area |
        all urban | -.0016423 .0006855 -2.40 0.017 -.0029859 -.0002986
        ------------------------------------------------------------------------------
        Note: dy/dx for factor levels is the discrete change from the base level.


        .



        • #5
          Muhammad:
          the output you posted is almost unreadable in the current format.
          Please repost using CODE delimiters (please, see FAQ #12). Thanks.
          That said, what do you mean by the margins being "not good"? Did you compare them against a reference yardstick?
          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            Hello Muhammad,

            Besides the excellent comments already made, I would like to offer some additional inputs.

            Regarding your original question: the use of OLS when you have a binary response variable (i.e., the linear probability model) is a matter of debate (http://www.mostlyharmlesseconometric...tter-than-lpm/). Overall, use it if you only have dummies in your model. That does not seem to be the case here, so I would avoid it myself. Even if probit, logit and their variants (ordered logit, etc.) are more complex, they will deliver neater results (or at least that seems to be the consensus among most researchers).

            If you want to read further, you can refer to Horrace, W. C., and Oaxaca, R. L. (2006). Results on the Bias and Inconsistency of Ordinary Least Squares for the Linear Probability Model. Economics Letters, 90, 321-327.

            I also recommend reading these papers:

            Hoetker, G. (2007). The Use of Logit and Probit Models in Strategic Management Research: Critical Issues. Strategic Management Journal, 28(4), 331-343.

            Williams, R. (2012). Using the margins command to estimate and interpret adjusted predictions and marginal effects. The Stata Journal, 12(2), 308-331.

            Karaca-Mandic, P., Norton, E. C., and Dowd, B. (2012). Interaction Terms in Nonlinear Models. Health Services Research, 47(1), 255-274.


            Even if they do not directly address ordered logit, they may offer you some insights.
            Another thing I would consider is whether the 4 categories you selected make sense.

            Hope I have helped.

            Best,

            MM



            • #7
              Thanks again!

              What I mean by "the average marginal effects are not good" is that the average marginal effects (AME) have many leading zeros. For example, the AME for gender is 0.0008. It means that females have a .0008 percent higher probability of being in outcome 1 compared to males (on average). 0.0008 means zero probability. The same holds for the other categories.

              The code is as below:

              Code:
                ologit edu i.gender i.Me i.Hhe i.Hhn i.wlthind5 i.area
              
              Iteration 0:   log likelihood = -4672.7716  
              Iteration 1:   log likelihood = -4632.8068  
              Iteration 2:   log likelihood = -4569.0401  
              Iteration 3:   log likelihood = -4566.6892  
              Iteration 4:   log likelihood = -4566.6827  
              Iteration 5:   log likelihood = -4566.6827  
              
              Ordered logistic regression                     Number of obs     =    120,489
                                                              LR chi2(15)       =     212.18
                                                              Prob > chi2       =     0.0000
              Log likelihood = -4566.6827                     Pseudo R2         =     0.0227
              
              ------------------------------------------------------------------------------
                       edu |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                    gender |
                   female  |  -.1387143   .0754821    -1.84   0.066    -.2866565     .009228
                           |
                        Me |
                        2  |  -.2520384    .105095    -2.40   0.016    -.4580208    -.046056
                        3  |  -.0094196   .2016714    -0.05   0.963    -.4046883    .3858491
                        4  |   2.314247    .429267     5.39   0.000       1.4729    3.155595
                           |
                       Hhe |
                        2  |  -.2354058   .0839715    -2.80   0.005    -.3999869   -.0708246
                        3  |  -.3057352   .1619106    -1.89   0.059    -.6230741    .0116037
                        4  |   1.963383   .4046434     4.85   0.000     1.170297     2.75647
                           |
                       Hhn |
                        2  |   .0549861    .096692     0.57   0.570    -.1345268    .2444989
                        3  |  -.1152156   .1010498    -1.14   0.254    -.3132695    .0828383
                        4  |  -.0301888   .1200921    -0.25   0.802    -.2655651    .2051875
                           |
                  wlthind5 |
                   second  |  -.6246391   .1081967    -5.77   0.000    -.8367008   -.4125774
                   middle  |  -.7723642   .1149702    -6.72   0.000    -.9977017   -.5470268
                   fourth  |  -1.141412   .1386848    -8.23   0.000    -1.413229   -.8695946
                  highest  |  -.7714078   .1599307    -4.82   0.000    -1.084866   -.4579495
                           |
                      area |
                all urban  |   .2577991   .1021139     2.52   0.012     .0576595    .4579387
              -------------+----------------------------------------------------------------
                     /cut1 |   4.267976   .1004746                      4.071049    4.464903
                     /cut2 |   4.328564   .1008853                      4.130833    4.526296
                     /cut3 |   4.330017   .1008955                      4.132266    4.527769
              ------------------------------------------------------------------------------
              
              . margins,dydx(*) predict(outcome(1))
              
              Average marginal effects                        Number of obs     =    120,489
              Model VCE    : OIM
              
              Expression   : Pr(edu==1), predict(outcome(1))
              dy/dx w.r.t. : 2.gender 2.Me 3.Me 4.Me 2.Hhe 3.Hhe 4.Hhe 2.Hhn 3.Hhn 4.Hhn 2.wlthind5 3.wlthind5
                             4.wlthind5 5.wlthind5 1.area
              
              ------------------------------------------------------------------------------
                           |            Delta-method
                           |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                    gender |
                   female  |   .0008344   .0004504     1.85   0.064    -.0000484    .0017173
                           |
                        Me |
                        2  |   .0014245   .0005639     2.53   0.012     .0003192    .0025298
                        3  |   .0000598   .0012765     0.05   0.963     -.002442    .0025617
                        4  |  -.0543638   .0240623    -2.26   0.024    -.1015251   -.0072025
                           |
                       Hhe |
                        2  |   .0014315   .0005116     2.80   0.005     .0004289    .0024342
                        3  |   .0017986   .0008667     2.08   0.038        .0001    .0034973
                        4  |  -.0397149   .0176346    -2.25   0.024    -.0742781   -.0051517
                           |
                       Hhn |
                        2  |  -.0003494   .0006144    -0.57   0.570    -.0015536    .0008548
                        3  |   .0006738   .0005905     1.14   0.254    -.0004835    .0018311
                        4  |    .000184   .0007286     0.25   0.801    -.0012441    .0016121
                           |
                  wlthind5 |
                   second  |   .0053468   .0010092     5.30   0.000     .0033688    .0073248
                   middle  |   .0062004   .0010402     5.96   0.000     .0041617    .0082392
                   fourth  |   .0078617   .0010783     7.29   0.000     .0057484    .0099751
                  highest  |   .0061953   .0012926     4.79   0.000     .0036619    .0087287
                           |
                      area |
                all urban  |  -.0016423   .0006855    -2.40   0.017    -.0029859   -.0002986
              ------------------------------------------------------------------------------
              Note: dy/dx for factor levels is the discrete change from the base level.
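
              One possible yardstick for these numbers is the set of predicted probabilities implied by the /cut estimates. The sketch below evaluates them for a respondent at the base level of every covariate (linear predictor equal to zero), using only values copied from the output above. It suggests that outcome 1 has a predicted probability of roughly 0.986 and that outcomes 2 and 3 together hold less than 0.001, so only tiny marginal effects are possible for those outcomes. Note also that dy/dx is reported on the probability scale, so 0.0008344 corresponds to about 0.08 percentage points. This is an illustrative calculation for the base-level case only, not a substitute for -tabulate edu- on the estimation sample.

              ```stata
              * Predicted probabilities at the base levels (xb = 0),
              * using the /cut1-/cut3 estimates reported above
              scalar p1 = invlogit(4.267976)
              scalar p2 = invlogit(4.328564) - invlogit(4.267976)
              scalar p3 = invlogit(4.330017) - invlogit(4.328564)
              scalar p4 = 1 - invlogit(4.330017)

              display "Pr(edu==1) = " p1
              display "Pr(edu==2) = " p2
              display "Pr(edu==3) = " p3
              display "Pr(edu==4) = " p4

              * AME for female, converted from a probability to percentage points
              scalar ame_pp = 100 * 0.0008344
              display "female AME in percentage points = " ame_pp
              ```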



              • #8
                If I understood correctly, you are interested in only one outcome, what you call outcome 1, that is, whether the person belongs to outcome 1 or not. Is that correct?
                If this is the case, you could try a binary response variable, where outcome = 1 if the person has the level of educational attainment you are interested in and zero otherwise.
                One thing you should pay attention to, however, is the distribution of outcome = 1 within your sample. If outcome = 1 is restricted to a very small share of the sample, the comparison will probably be challenging.
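
                A minimal sketch of that approach on simulated data, assuming (purely for illustration) that the category of interest is edu == 4 and that it is rare; the names edu, female and top_cat are hypothetical.

                ```stata
                * Simulated data only: edu, female and top_cat are hypothetical
                clear
                set seed 12345
                set obs 5000
                generate female = runiform() < 0.5
                generate u      = runiform()
                generate edu    = 1 + (u > 0.90) + (u > 0.94) + (u > 0.97)

                * Binary indicator for the category of interest
                generate byte top_cat = (edu == 4) if !missing(edu)

                * Check how rare the event is before modelling it
                tabulate top_cat

                logit top_cat i.female
                margins, dydx(female)
                ```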



                • #9
                  I am interested in all 4 outcomes. I presented the results for outcome 1 in my previous post only as an example. The results for the other outcomes are similar. For example, the AMEs for outcome 3 are shown below. As you can see, most of the average marginal effects are insignificant and close to zero.





                  Code:
                  Average marginal effects                        Number of obs     =    120,489
                  Model VCE    : OIM
                  
                  Expression   : Pr(edu==3), predict(outcome(3))
                  dy/dx w.r.t. : 2.gender 2.Me 3.Me 4.Me 2.Hhe 3.Hhe 4.Hhe 2.Hhn 3.Hhn 4.Hhn 2.wlthind5 3.wlthind5
                                 4.wlthind5 5.wlthind5 1.area
                  
                  ------------------------------------------------------------------------------
                               |            Delta-method
                               |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                        gender |
                       female  |  -1.12e-06   1.27e-06    -0.88   0.379    -3.62e-06    1.38e-06
                               |
                            Me |
                            2  |  -1.92e-06   2.07e-06    -0.93   0.353    -5.98e-06    2.13e-06
                            3  |  -8.07e-08   1.72e-06    -0.05   0.963    -3.46e-06    3.29e-06
                            4  |   .0000691   .0000748     0.92   0.356    -.0000775    .0002156
                               |
                           Hhe |
                            2  |  -1.93e-06   2.05e-06    -0.94   0.346    -5.95e-06    2.09e-06
                            3  |  -2.43e-06   2.70e-06    -0.90   0.369    -7.72e-06    2.86e-06
                            4  |   .0000512   .0000557     0.92   0.357    -.0000579    .0001603
                               |
                           Hhn |
                            2  |   4.69e-07   9.52e-07     0.49   0.622    -1.40e-06    2.33e-06
                            3  |  -9.05e-07   1.20e-06    -0.75   0.452    -3.26e-06    1.45e-06
                            4  |  -2.47e-07   1.01e-06    -0.24   0.807    -2.23e-06    1.74e-06
                               |
                      wlthind5 |
                       second  |  -7.15e-06   7.27e-06    -0.98   0.325    -.0000214    7.10e-06
                       middle  |  -8.30e-06   8.41e-06    -0.99   0.324    -.0000248    8.19e-06
                       fourth  |  -.0000105   .0000106    -0.99   0.321    -.0000314    .0000103
                      highest  |  -8.30e-06   8.47e-06    -0.98   0.327    -.0000249    8.30e-06
                               |
                          area |
                    all urban  |   2.20e-06   2.39e-06     0.92   0.356    -2.47e-06    6.88e-06
                  ------------------------------------------------------------------------------
                  Note: dy/dx for factor levels is the discrete change from the base level.
                  
                  .



                  • #10
                    I have a couple of comments:
                    • If you split up years of education to be a categorical/ordinal variable, then do so such that they approximately correspond to real diplomas.
                    • There are many educational systems where the ordinal nature is not self-evident. For example, is lower-level general secondary education plus vocational education more or less than only higher-level general secondary education? In many countries you can make a valid case for both.
                    • It is possible that the same number of years represents wildly different educational degrees. For example, I have seen a case where very low-level (and thus quick) general secondary plus very low-level vocational education would have the same number of years of education as pre-university general secondary.
                    • Did someone who repeated a year get more education than someone who did well and finished her or his education in one go? Whether this is a problem in your data depends on your exact research question and on the exact wording of the survey question, which is (should be) documented in the documentation that (should) come(s) with the data.
                    • Be careful to only include observations that are likely to have finished their education. If you have a sample of people who are 18 years old or older, then you probably don't want to include the youngest people. The exact age cut-off depends on the country.
                    • As you may have noticed, I am not a fan of years of education, but it can have its uses. If you are in a situation where years of education is appropriate (enough), then my first choice would be linear regression with robust standard errors. Even though that variable is discrete, we do have a good idea about what distance is represented by the distance between the years, so it is not ordinal. We do have to be careful that years of education is bounded. This can lead to non-linear effects of continuous variables, but normal regression diagnostics can tell us whether that is a problem.
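
                    The last two points could be sketched as below on simulated data. The variable names and the age cut-off of 25 are assumptions made for illustration; as noted above, the appropriate cut-off depends on the country.

                    ```stata
                    * Simulated data only: all variable names are hypothetical
                    clear
                    set seed 12345
                    set obs 3000
                    generate age    = 18 + floor(43*runiform())
                    generate female = runiform() < 0.5
                    generate wealth = 1 + floor(5*runiform())
                    generate years  = min(18, max(0, round(8 + wealth - female + rnormal(0, 3))))

                    * Keep only respondents likely to have finished their education
                    * (25 is an arbitrary placeholder), and use robust standard errors
                    regress years i.female i.wealth if age >= 25, vce(robust)
                    ```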
                    ---------------------------------
                    Maarten L. Buis
                    University of Konstanz
                    Department of history and sociology
                    box 40
                    78457 Konstanz
                    Germany
                    http://www.maartenbuis.nl
                    ---------------------------------



                    • #11
                      Thanks Maarten,

                      I also agree with using linear regression with robust standard errors. But I do not clearly understand your last point, especially:
                      1. "Even though that variable is discrete, we do have a good idea about what distance is represented by the distance between the years, so it is not ordinal."
                      2. "We do have to be careful that years of education is bounded. This can lead to non-linear effects of continuous variables, but normal regression diagnostics can tell us whether that is a problem."



                      • #12
                        With an ordinal variable you know that one category is more than another, but not by how much. Think of a question with answer categories: "fully agree", "agree", "neutral", "disagree", "fully disagree". With years of education we know more: 2 years of education is more than 1 year of education, and we know that the difference is 1 year. So years of education is more than ordinal.

                        You cannot have less than 0 years of education, so years of education is at least bounded by 0. There may also be an upper bound that bites, depending on how exactly years of education was asked. These bounds can result in non-linear effects of continuous variables. Checking for linearity of effects is standard for linear regression, and is discussed in any intro course or intro book on regression. For how it is implemented in Stata, see -help regress postestimation plots-. Maybe the word "normal" was confusing in this context; I used it to mean "standard", not the normal/Gaussian distribution.
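
                        A minimal sketch of such a check, on simulated data in which years of education is bounded at 0 and 18 and depends on one continuous covariate. The variable names and the data-generating process are assumptions made for illustration.

                        ```stata
                        * Simulated data only: income and years are hypothetical
                        clear
                        set seed 2021
                        set obs 1000
                        generate income = rnormal(0, 1)
                        generate years  = min(18, max(0, round(10 + 4*income + rnormal(0, 2))))

                        regress years income

                        * Residual-versus-fitted plot: look for curvature near the bounds
                        rvfplot, yline(0)

                        * Component-plus-residual plot with a lowess smooth for the covariate
                        cprplot income, lowess
                        ```

                        If the lowess curve bends away from the fitted line at high or low values of the covariate, the bounds are probably producing a non-linear effect.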
                        ---------------------------------
                        Maarten L. Buis
                        University of Konstanz
                        Department of history and sociology
                        box 40
                        78457 Konstanz
                        Germany
                        http://www.maartenbuis.nl
                        ---------------------------------



                        • #13
                          Thanks again Maarten!

                          In my case, all variables on the right-hand side of the equation are dummies: mother's and father's education are split into four categories (depending on the degree, as you suggested in a previous post), gender (male or female), area (rural or urban) and wealth index quintiles (1 to 5). I think that in this case non-linearity may not be a problem, since there are no continuous variables.



                          • #14
                            Dear all,
                            I want to know whether I can use an OLS regression when the dependent variable is quantitative and discrete. It takes 5 possible values: 0, 20, 65, 120 and 200. Or should I use another econometric technique?
                            Best regards
                            Last edited by Nahed Eddai; 15 Aug 2021, 04:11.



                            • #15
                              It is unusual to treat a 5 category variable as continuous, and it is especially unusual to do so when the 5 categories have values like these! If you told us more about what the variable is and why it has values like these, we might be better able to advise you.
                              -------------------------------------------
                              Richard Williams, Notre Dame Dept of Sociology
                              StataNow Version: 19.5 MP (2 processor)

                              WWW: https://www3.nd.edu/~rwilliam

