OLS on discrete dependent variable

Muhammad Hayat

Join Date: Jul 2016

Posts: 7
#1

31 Jul 2016, 05:58

I want to analyze the factors behind a person's educational attainment. I have a survey dataset. My dependent variable is years of education (preschool, 1, 2, ..., 18), while the independent variables are gender, income, etc. Can I use OLS to estimate this model, or should I use another econometric technique?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17714
#2

31 Jul 2016, 07:03

Muhammad.
I would also consider -ologit-.
Anyway, things are a bit trickier in your case, as your data come from a survey (please, see -help svy-).

Kind regards,
Carlo
(Stata 19.0)
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

31 Jul 2016, 07:46

Hello, Muhammad,

Welcome to the Stata Forum.

Carlo gave two excellent suggestions, one concerning the - ologit - model and the other concerning the - svy - prefix.

By the way, if you choose a survey analysis, you'll need to think about (at least) three important aspects: strata, weights and the PSU.
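
A minimal sketch of declaring those three design elements and then fitting the ordered logit with the -svy- prefix. The design variable names psu_id, stratum_id and sampwt are hypothetical placeholders that do not appear in this thread; the real names should come from the survey documentation. The model variables are the ones Muhammad uses in his later post.

```stata
* Hypothetical design variables:
*   psu_id     = primary sampling unit
*   stratum_id = sampling stratum
*   sampwt     = sampling (probability) weight
svyset psu_id [pweight = sampwt], strata(stratum_id)

* Summarize the declared design before estimating anything
svydescribe

* Design-based ordered logit
svy: ologit edu i.gender i.Me i.Hhe i.Hhn i.wlthind5 i.area
```

If the survey has a single-stage design or no stratification, the corresponding piece of -svyset- would simply be left out.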

Regardless of whether you "svyset" or not, I wish to add the following comments:

a) Since it seems the variable "years" ranges at least from 1 to 18 (no zero?), you could categorize it into, say, 4 groups, and make your - ologit - model more "understandable" to the audience, so to speak.

b) On the other hand, should you wish to consider the years exactly within their full range, and now assuming you are dealing with count data, maybe you should think about a Poisson or negative binomial model as well. If, theoretically, the model doesn't allow zero values, you may wish to test whether a zero-truncated (Poisson or negative binomial) model fits better.

c) It was not clear to me whether the count of years of education will be measured only once for each individual, or whether you wish to measure "educational attainment" repeatedly. If you choose the second option, you may wish to select a panel-data design accordingly as well.
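
For point b), the count-data alternatives could be sketched roughly as follows. The name yrs is a hypothetical stand-in for the raw years-of-education variable (it is not a variable shown in this thread), and the covariate list is only illustrative.

```stata
* yrs is a hypothetical name for the raw years-of-education variable
poisson yrs i.gender i.wlthind5 i.area, vce(robust)

* Negative binomial; without a robust VCE the output includes an LR test of alpha = 0
nbreg yrs i.gender i.wlthind5 i.area

* Zero-truncated versions, only meaningful if yrs cannot be 0 in the estimation sample
tpoisson yrs i.gender i.wlthind5 i.area, ll(0)
tnbreg   yrs i.gender i.wlthind5 i.area, ll(0)
```

Whether a count model is sensible at all depends on treating years of education as a count, which is itself an assumption.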

Best regards,

Marcos
Muhammad Hayat

Join Date: Jul 2016

Posts: 7
#4

31 Jul 2016, 08:47

Thanks Carlo and Marcos!

I have already tried -ologit- after dividing my dependent variable into 4 categories. Although the coefficients (after -ologit-) are significant, the average marginal effects (computed by -margins-) are not good for all outcomes. Most average marginal effects have three leading zeros (like 0.0004), which I think are not interpretable (they mean nearly zero probability on average). The results are attached below.

ologit edu i.gender i.Me i.Hhe i.Hhn i.wlthind5 i.area

Iteration 0: log likelihood = -4672.7716
Iteration 1: log likelihood = -4632.8068
Iteration 2: log likelihood = -4569.0401
Iteration 3: log likelihood = -4566.6892
Iteration 4: log likelihood = -4566.6827
Iteration 5: log likelihood = -4566.6827

Ordered logistic regression Number of obs = 120,489
LR chi2(15) = 212.18
Prob > chi2 = 0.0000
Log likelihood = -4566.6827 Pseudo R2 = 0.0227

------------------------------------------------------------------------------
edu | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
female | -.1387143 .0754821 -1.84 0.066 -.2866565 .009228
|
Me |
2 | -.2520384 .105095 -2.40 0.016 -.4580208 -.046056
3 | -.0094196 .2016714 -0.05 0.963 -.4046883 .3858491
4 | 2.314247 .429267 5.39 0.000 1.4729 3.155595
|
Hhe |
2 | -.2354058 .0839715 -2.80 0.005 -.3999869 -.0708246
3 | -.3057352 .1619106 -1.89 0.059 -.6230741 .0116037
4 | 1.963383 .4046434 4.85 0.000 1.170297 2.75647
|
Hhn |
2 | .0549861 .096692 0.57 0.570 -.1345268 .2444989
3 | -.1152156 .1010498 -1.14 0.254 -.3132695 .0828383
4 | -.0301888 .1200921 -0.25 0.802 -.2655651 .2051875
|
wlthind5 |
second | -.6246391 .1081967 -5.77 0.000 -.8367008 -.4125774
middle | -.7723642 .1149702 -6.72 0.000 -.9977017 -.5470268
fourth | -1.141412 .1386848 -8.23 0.000 -1.413229 -.8695946
highest | -.7714078 .1599307 -4.82 0.000 -1.084866 -.4579495
|
area |
all urban | .2577991 .1021139 2.52 0.012 .0576595 .4579387
-------------+----------------------------------------------------------------
/cut1 | 4.267976 .1004746 4.071049 4.464903
/cut2 | 4.328564 .1008853 4.130833 4.526296
/cut3 | 4.330017 .1008955 4.132266 4.527769
------------------------------------------------------------------------------

. margins,dydx(*) predict(outcome(1))

Average marginal effects Number of obs = 120,489
Model VCE : OIM

Expression : Pr(edu==1), predict(outcome(1))
dy/dx w.r.t. : 2.gender 2.Me 3.Me 4.Me 2.Hhe 3.Hhe 4.Hhe 2.Hhn 3.Hhn 4.Hhn 2.wlthind5 3.wlthind5
4.wlthind5 5.wlthind5 1.area

------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender |
female | .0008344 .0004504 1.85 0.064 -.0000484 .0017173
|
Me |
2 | .0014245 .0005639 2.53 0.012 .0003192 .0025298
3 | .0000598 .0012765 0.05 0.963 -.002442 .0025617
4 | -.0543638 .0240623 -2.26 0.024 -.1015251 -.0072025
|
Hhe |
2 | .0014315 .0005116 2.80 0.005 .0004289 .0024342
3 | .0017986 .0008667 2.08 0.038 .0001 .0034973
4 | -.0397149 .0176346 -2.25 0.024 -.0742781 -.0051517
|
Hhn |
2 | -.0003494 .0006144 -0.57 0.570 -.0015536 .0008548
3 | .0006738 .0005905 1.14 0.254 -.0004835 .0018311
4 | .000184 .0007286 0.25 0.801 -.0012441 .0016121
|
wlthind5 |
second | .0053468 .0010092 5.30 0.000 .0033688 .0073248
middle | .0062004 .0010402 5.96 0.000 .0041617 .0082392
fourth | .0078617 .0010783 7.29 0.000 .0057484 .0099751
highest | .0061953 .0012926 4.79 0.000 .0036619 .0087287
|
area |
all urban | -.0016423 .0006855 -2.40 0.017 -.0029859 -.0002986
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17714
#5

31 Jul 2016, 09:03

Muhammad:
the output you posted is almost unreadable in the current format.
Please repost using CODE delimiters (please, see FAQ #12). Thanks.
That said, what do you mean by the margins being not good? Did you compare them against a reference yardstick?

Kind regards,
Carlo
(Stata 19.0)
Mari Meir

Join Date: Jul 2016

Posts: 61
#6

31 Jul 2016, 09:20

Hello Muhammad,

Besides the excellent comments already made, I would like to offer some additional inputs.

Regarding your original question: the use of OLS when you have a binary response variable (i.e., the linear probability model) is a matter of debate. (http://www.mostlyharmlesseconometric...tter-than-lpm/) But overall, use it if you have only dummies in your model. That does not seem to be the case here, so I would avoid it myself. Even if probit, logit and their variants (ordered logit, etc.) are more complex, they will deliver neater results (or at least that seems to be the consensus among most researchers).

If you want to read further, you can refer to Horrace, W. C., and R. L. Oaxaca. 2006. “Results on the Bias and Inconsistency of Ordinary Least Squares for the Linear Probability Model.” Economics Letters, 90, 321-327.

I also recommend reading these papers.

Hoetker, G. (2007). The Use of Logit and Probit Models in Strategic Management Research: Critical Issues. Strategic Management Journal, 28(4), 331-343.

Williams, R. (2012). Using the margins command to estimate and interpret adjusted predictions and marginal effects. The Stata Journal, 12(2), 308-331.

Karaca-Mandic, P., Norton, E. C., and Dowd, B. (2012). Interaction Terms in Nonlinear Models. Health Services Research, 47(1), 255-274.

Even if they do not directly address ordered logit, they may offer you some insights.
Another thing I would consider is whether the 4 categories you selected make sense.

Hope I have helped.

Best,

MM

Muhammad Hayat

Join Date: Jul 2016
Posts: 7

31 Jul 2016, 09:26

Thanks again!

What I mean by "average marginal effects are not good" is that the average marginal effects (AME) have many leading zeros. For example, the AME for gender is 0.0008. It means that, on average, the probability of being in outcome 1 is .0008 (0.08 percentage points) higher for females than for males. That is practically zero probability. The same holds for the other categories.

The code is as below:

Code:

  ologit edu i.gender i.Me i.Hhe i.Hhn i.wlthind5 i.area

Iteration 0:   log likelihood = -4672.7716  
Iteration 1:   log likelihood = -4632.8068  
Iteration 2:   log likelihood = -4569.0401  
Iteration 3:   log likelihood = -4566.6892  
Iteration 4:   log likelihood = -4566.6827  
Iteration 5:   log likelihood = -4566.6827  

Ordered logistic regression                     Number of obs     =    120,489
                                                LR chi2(15)       =     212.18
                                                Prob > chi2       =     0.0000
Log likelihood = -4566.6827                     Pseudo R2         =     0.0227

------------------------------------------------------------------------------
         edu |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |
     female  |  -.1387143   .0754821    -1.84   0.066    -.2866565     .009228
             |
          Me |
          2  |  -.2520384    .105095    -2.40   0.016    -.4580208    -.046056
          3  |  -.0094196   .2016714    -0.05   0.963    -.4046883    .3858491
          4  |   2.314247    .429267     5.39   0.000       1.4729    3.155595
             |
         Hhe |
          2  |  -.2354058   .0839715    -2.80   0.005    -.3999869   -.0708246
          3  |  -.3057352   .1619106    -1.89   0.059    -.6230741    .0116037
          4  |   1.963383   .4046434     4.85   0.000     1.170297     2.75647
             |
         Hhn |
          2  |   .0549861    .096692     0.57   0.570    -.1345268    .2444989
          3  |  -.1152156   .1010498    -1.14   0.254    -.3132695    .0828383
          4  |  -.0301888   .1200921    -0.25   0.802    -.2655651    .2051875
             |
    wlthind5 |
     second  |  -.6246391   .1081967    -5.77   0.000    -.8367008   -.4125774
     middle  |  -.7723642   .1149702    -6.72   0.000    -.9977017   -.5470268
     fourth  |  -1.141412   .1386848    -8.23   0.000    -1.413229   -.8695946
    highest  |  -.7714078   .1599307    -4.82   0.000    -1.084866   -.4579495
             |
        area |
  all urban  |   .2577991   .1021139     2.52   0.012     .0576595    .4579387
-------------+----------------------------------------------------------------
       /cut1 |   4.267976   .1004746                      4.071049    4.464903
       /cut2 |   4.328564   .1008853                      4.130833    4.526296
       /cut3 |   4.330017   .1008955                      4.132266    4.527769
------------------------------------------------------------------------------

. margins,dydx(*) predict(outcome(1))

Average marginal effects                        Number of obs     =    120,489
Model VCE    : OIM

Expression   : Pr(edu==1), predict(outcome(1))
dy/dx w.r.t. : 2.gender 2.Me 3.Me 4.Me 2.Hhe 3.Hhe 4.Hhe 2.Hhn 3.Hhn 4.Hhn 2.wlthind5 3.wlthind5
               4.wlthind5 5.wlthind5 1.area

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |
     female  |   .0008344   .0004504     1.85   0.064    -.0000484    .0017173
             |
          Me |
          2  |   .0014245   .0005639     2.53   0.012     .0003192    .0025298
          3  |   .0000598   .0012765     0.05   0.963     -.002442    .0025617
          4  |  -.0543638   .0240623    -2.26   0.024    -.1015251   -.0072025
             |
         Hhe |
          2  |   .0014315   .0005116     2.80   0.005     .0004289    .0024342
          3  |   .0017986   .0008667     2.08   0.038        .0001    .0034973
          4  |  -.0397149   .0176346    -2.25   0.024    -.0742781   -.0051517
             |
         Hhn |
          2  |  -.0003494   .0006144    -0.57   0.570    -.0015536    .0008548
          3  |   .0006738   .0005905     1.14   0.254    -.0004835    .0018311
          4  |    .000184   .0007286     0.25   0.801    -.0012441    .0016121
             |
    wlthind5 |
     second  |   .0053468   .0010092     5.30   0.000     .0033688    .0073248
     middle  |   .0062004   .0010402     5.96   0.000     .0041617    .0082392
     fourth  |   .0078617   .0010783     7.29   0.000     .0057484    .0099751
    highest  |   .0061953   .0012926     4.79   0.000     .0036619    .0087287
             |
        area |
  all urban  |  -.0016423   .0006855    -2.40   0.017    -.0029859   -.0002986
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.


Mari Meir

Join Date: Jul 2016

Posts: 61
#8

31 Jul 2016, 10:06

If I understand correctly, you are interested in only one outcome, what you call outcome 1: whether or not the person belongs to outcome 1. Is that correct?
If this is the case, you could try a binary response variable where outcome = 1 if the person has the characteristics you want in terms of educational attainment, and zero otherwise.
One thing you should take care with, however, is the distribution of outcome = 1 within your sample. If outcome = 1 is restricted to a very small population, the comparison will probably be challenging.
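
A rough sketch of both steps: checking the distribution first, then fitting a binary model. The indicator name edu_top and the choice of category 4 as the outcome of interest are assumptions made purely for illustration.

```stata
* How many observations fall in each of the four categories?
tabulate edu, missing

* Hypothetical indicator: 1 if in category 4, 0 otherwise, missing if edu is missing
generate byte edu_top = (edu == 4) if !missing(edu)

* Binary logit with the same covariates, then average marginal effects
logit edu_top i.gender i.Me i.Hhe i.Hhn i.wlthind5 i.area
margins, dydx(*)
```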

Muhammad Hayat

Join Date: Jul 2016
Posts: 7

31 Jul 2016, 11:48

I am interested in all 4 outcomes. I just presented my results for outcome 1 in the previous post as an example. The results for the other outcomes are similar. For example, the AMEs for outcome 3 are below. As you can see, most of the average marginal effects are insignificant and zero.

Code:

Average marginal effects                        Number of obs     =    120,489
Model VCE    : OIM

Expression   : Pr(edu==3), predict(outcome(3))
dy/dx w.r.t. : 2.gender 2.Me 3.Me 4.Me 2.Hhe 3.Hhe 4.Hhe 2.Hhn 3.Hhn 4.Hhn 2.wlthind5 3.wlthind5
               4.wlthind5 5.wlthind5 1.area

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |
     female  |  -1.12e-06   1.27e-06    -0.88   0.379    -3.62e-06    1.38e-06
             |
          Me |
          2  |  -1.92e-06   2.07e-06    -0.93   0.353    -5.98e-06    2.13e-06
          3  |  -8.07e-08   1.72e-06    -0.05   0.963    -3.46e-06    3.29e-06
          4  |   .0000691   .0000748     0.92   0.356    -.0000775    .0002156
             |
         Hhe |
          2  |  -1.93e-06   2.05e-06    -0.94   0.346    -5.95e-06    2.09e-06
          3  |  -2.43e-06   2.70e-06    -0.90   0.369    -7.72e-06    2.86e-06
          4  |   .0000512   .0000557     0.92   0.357    -.0000579    .0001603
             |
         Hhn |
          2  |   4.69e-07   9.52e-07     0.49   0.622    -1.40e-06    2.33e-06
          3  |  -9.05e-07   1.20e-06    -0.75   0.452    -3.26e-06    1.45e-06
          4  |  -2.47e-07   1.01e-06    -0.24   0.807    -2.23e-06    1.74e-06
             |
    wlthind5 |
     second  |  -7.15e-06   7.27e-06    -0.98   0.325    -.0000214    7.10e-06
     middle  |  -8.30e-06   8.41e-06    -0.99   0.324    -.0000248    8.19e-06
     fourth  |  -.0000105   .0000106    -0.99   0.321    -.0000314    .0000103
    highest  |  -8.30e-06   8.47e-06    -0.98   0.327    -.0000249    8.30e-06
             |
        area |
  all urban  |   2.20e-06   2.39e-06     0.92   0.356    -2.47e-06    6.88e-06
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

.
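
One way to see where such tiny effects can come from, offered as a sketch rather than as part of the original exchange: in an ordered logit, the predicted probability of each outcome is the difference between the logistic function evaluated at adjacent cutpoints, each shifted by the linear predictor. The lines below plug in the cutpoints from the -ologit- output for a hypothetical person with every factor at its base level, so that the linear predictor is 0.

```stata
* Cutpoints copied from the ologit output; linear predictor xb = 0 at the base levels
display "Pr(edu==1) = " %9.6f invlogit(4.267976)
display "Pr(edu==2) = " %9.6f (invlogit(4.328564) - invlogit(4.267976))
display "Pr(edu==3) = " %9.6f (invlogit(4.330017) - invlogit(4.328564))
display "Pr(edu==4) = " %9.6f (1 - invlogit(4.330017))
```

The first probability comes out at roughly 0.986, and the two middle ones are far below 0.01 because the three cutpoints are almost identical. That suggests very few observations fall in categories 2 and 3, which -tabulate edu- would confirm directly. When an outcome is that rare, marginal effects on its probability are bound to be close to zero.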


Maarten Buis

Join Date: Mar 2014

Posts: 3460
#10

01 Aug 2016, 02:44

I have a couple of comments:
If you split up years of education to be a categorical/ordinal variable, then do so such that they approximately correspond to real diplomas.

There are many educational systems where the ordinal nature is not self-evident. For example, is lower level general secondary education plus vocational more or less than only higher level general secondary education? In many countries you can make a valid case for both.

It is possible that the same number of years represents wildly different educational degrees. For example, I have seen the case where very low level (and thus quick) general secondary plus very low level vocational would have the same number of years of education as pre-university general secondary.

Did someone who repeated a year get more education than someone who did well and finished her or his education in one go? Whether this is a problem in your data depends on your exact research question and on the exact wording of the survey question, which is (should be) documented in the documentation that (should) come(s) with the data.

Be careful to only include observations that are likely to have finished their education. If you have a sample of people who are 18 years old or older, then you probably don't want to include the youngest people. The exact age cut-off depends on the country.

As you may have noticed, I am not a fan of years of education, but it can have its uses. If you are in a situation where years of education is appropriate (enough), then my first choice would be linear regression with robust standard errors. Even though that variable is discrete, we do have a good idea about what distance is represented by the distance between the years, so it is not ordinal. We do have to be careful that years of education is bounded. This can lead to non-linear effects of continuous variables, but normal regression diagnostics can tell us whether that is a problem.
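
A minimal sketch of that first choice, assuming the raw years-of-education variable is called yrs (a hypothetical name; the thread only shows the four-category edu) and reusing the covariates from the -ologit- model above:

```stata
* yrs is a hypothetical name for the raw years-of-education variable
regress yrs i.gender i.Me i.Hhe i.Hhn i.wlthind5 i.area, vce(robust)

* Joint test of the wealth-quintile indicators
testparm i.wlthind5
```

With survey data, the -svy: regress- form would replace vce(robust) once the design has been declared with -svyset-.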

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Muhammad Hayat

Join Date: Jul 2016

Posts: 7
#11

02 Aug 2016, 16:27

Thanks Maarten,

I also agree with using linear regression with robust standard errors. But I am unable to clearly understand your last points, especially:
1. "Even though that variable is discrete, we do have a good idea about what distance is represented by the distance between the years, so it is not ordinal."
2. "We do have to be careful that years of education is bounded. This can lead to non-linear effects of continuous variables, but normal regression diagnostics can tell us whether that is a problem."
Maarten Buis

Join Date: Mar 2014

Posts: 3460
#12

03 Aug 2016, 01:44

With an ordinal variable you know that one category is more than another, but not by how much. Think of a question with answer categories: "fully agree", "agree", "neutral", "disagree", "fully disagree". With years of education we know more: 2 years of education is more than 1 year of education, and we know that the difference is 1 year. So years of education is more than ordinal.

You cannot have less than 0 years of education, so years of education is at least bounded by 0. There may also be an upper bound that bites, depending on how exactly years of education was asked. These bounds can result in non-linear effects of continuous variables. Checking for linearity of effects is standard for linear regression, and discussed in any intro course on regression or any intro book on regression. For how it is implemented in Stata see help regress postestimation plots. Maybe the word "normal" was confusing in this context; I used it to refer to standard not normal/Gaussian distribution.
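
As a rough illustration of those standard checks, assuming a model that does contain a continuous regressor (yrs and age are hypothetical variable names that do not appear in this thread):

```stata
* yrs and age are hypothetical variable names
regress yrs age i.gender i.wlthind5 i.area

* Residuals against fitted values: look for curvature or fanning near the bounds
rvfplot, yline(0)

* Residuals against the continuous predictor: a systematic pattern hints at non-linearity
rvpplot age, yline(0)
```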

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Muhammad Hayat

Join Date: Jul 2016

Posts: 7
#13

03 Aug 2016, 05:21

Thanks again Maarten!

Here, in my case, I also have dummies on the right-hand side of the equation: mother's and father's education are split into four categories (depending on the degree, as you suggested in a previous post), gender (male or female), area (rural or urban) and wealth index quintiles (1 to 5). I think in this case there may not be a problem of non-linearity (due to the absence of continuous variables).
Nahed Eddai

Join Date: Mar 2020

Posts: 10
#14

15 Aug 2021, 04:04

Dear all,
I want to know whether I can use OLS regression when the dependent variable is quantitative and discrete. It takes 5 possible values: 0, 20, 65, 120 and 200. Or should I use another econometric technique?
Best regards

Last edited by Nahed Eddai; 15 Aug 2021, 04:11.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#15

15 Aug 2021, 17:53

It is unusual to treat a 5 category variable as continuous, and it is especially unusual to do so when the 5 categories have values like these! If you told us more about what the variable is and why it has values like these, we might be better able to advise you.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam