Choosing an appropriate model when dependent variable is discrete but with larger values

David Wong

Join Date: Mar 2017

Posts: 30
#1

09 May 2017, 06:18

Dear Statisticians,

I have a question about choosing an appropriate model for a cross-sectional analysis when the dependent variable is discrete but takes fairly large values. Specifically, my dependent variable is discrete and ranges from 15 to 50.

If I use OLS, I may violate one of its assumptions, leading to inconsistent estimates. Would an ordered logistic model be an option? Thank you very much.

Best regards,
David
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#2

09 May 2017, 12:25

Which assumption of OLS are you worried about violating? If it is normality of residuals you are worried about, bear in mind that OLS regression is actually pretty robust to violations of that assumption. In particular, with large sample sizes the central limit theorem implies that the OLS coefficient estimates are approximately normally distributed even when the residuals are not, so the usual inference works out anyway.

Other alternatives to consider here are the various count-variable models such as -poisson- and -nbreg-. An ordered logistic model on an outcome running from 15 to 50, if it actually takes on all 36 of those values, sounds like a nightmare!
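
As a rough sketch of what fitting those count models might look like, here is a self-contained example on simulated data. The variable names (y, x1, x2) and all parameter values are made up for illustration and have nothing to do with the original poster's data set.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 12345
set obs 500

generate double x1 = rnormal()
generate byte x2 = runiform() < 0.5

* gamma heterogeneity with mean 1 makes the counts overdispersed
generate double nu = rgamma(5, 0.2)
generate int y = rpoisson(nu * exp(3.3 + 0.1*x1 + 0.05*x2))

* Poisson regression with robust standard errors
poisson y x1 x2, vce(robust)

* negative binomial regression; its output includes a test of alpha = 0
nbreg y x1 x2
```

Because -nbreg- adds an overdispersion parameter to the Poisson model, it is mainly of interest when the variance of the outcome exceeds its mean.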
Comment
David Wong

Join Date: Mar 2017

Posts: 30
#3

14 May 2017, 11:33

Dear Clyde,

I am sorry for the delay in replying. I agree with you that -nbreg- may be an alternative. I have one follow-up question: what about truncated regression? In my case, the dependent variable is an integer with a lower limit of 13 and an upper limit of 47. Thank you so much for your reply and time.

Best regards,
David
Comment
David Wong

Join Date: Mar 2017

Posts: 30
#4

14 May 2017, 11:41

I should add that this dependent variable is a risk-aversion indicator, a sum index constructed from 13 Likert-scale questions. Therefore, this variable has a lower limit (13) and an upper limit (47).
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17703
#5

14 May 2017, 12:12

David:
have you considered using -tobit-?

Kind regards,
Carlo
(Stata 19.0)
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

14 May 2017, 13:05

Clyde (on OLS) and Carlo (on tobit) underlined very interesting aspects.

Unfortunately, you didn't present information about how your data performed under the OLS model.

Maybe the range (15 to 50) could be taken as an issue just in terms of (undue) extrapolation, not necessarily truncation.

Indeed, concerning blood cholesterol, there is no zero value, yet we may use this variable (without qualms) as DV in a linear regression model.

That happens with a cornucopia of variables.

Best regards,

Marcos
Comment

David Wong

Join Date: Mar 2017
Posts: 30
#7

14 May 2017, 19:00

Code:

Linear regression                               Number of obs     =        660
                                                F(12, 647)        =      21.29
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2731
                                                Root MSE          =     4.3987

--------------------------------------------------------------------------------
               |               Robust
           FRT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
            FL |   .1266445   .0790341     1.60   0.110    -.0285499    .2818389
           Age |  -1.030976   .1897945    -5.43   0.000    -1.403663    -.658288
          Male |   1.884994   .3752252     5.02   0.000     1.148187      2.6218
        Martwo |  -.3720753   .4048684    -0.92   0.358     -1.16709    .4229394
        No_dep |   .2196351   .1471014     1.49   0.136    -.0692187     .508489
     Education |   .2508854   .1468261     1.71   0.088    -.0374278    .5391987
    Employment |   1.611418   .4797209     3.36   0.001     .6694205    2.553416
Annual_hincome |   .0367363   .2180807     0.17   0.866    -.3914952    .4649677
     Liq_asset |   .1591343   .0529181     3.01   0.003     .0552224    .2630463
   Fixed_asset |   .0792854   .0519164     1.53   0.127    -.0226596    .1812304
    White_race |  -.7922464   .5196254    -1.52   0.128    -1.812602    .2281094
     M_Expense |  -.1284775   .1891922    -0.68   0.497    -.4999823    .2430274
         _cons |   25.51912   2.448994    10.42   0.000     20.71018    30.32805
--------------------------------------------------------------------------------

Tobit regression                                Number of obs     =        660
                                                F(  12,    648)   =      21.68
                                                Prob > F          =     0.0000
Log pseudolikelihood = -1907.6034               Pseudo R2         =     0.0523

--------------------------------------------------------------------------------
               |               Robust
           FRT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
            FL |   .1266445   .0783112     1.62   0.106    -.0271299    .2804189
           Age |  -1.030976   .1880585    -5.48   0.000    -1.400253   -.6616979
          Male |   1.884994    .371793     5.07   0.000     1.154929    2.615058
        Martwo |  -.3720753   .4011651    -0.93   0.354    -1.159816    .4156653
        No_dep |   .2196351   .1457559     1.51   0.132    -.0665758     .505846
     Education |   .2508854   .1454831     1.72   0.085    -.0347899    .5365607
    Employment |   1.611418    .475333     3.39   0.001     .6780395    2.544797
Annual_hincome |   .0367363    .216086     0.17   0.865     -.387577    .4610495
     Liq_asset |   .1591343   .0524341     3.03   0.003     .0561732    .2620955
   Fixed_asset |   .0792854   .0514416     1.54   0.124    -.0217269    .1802977
    White_race |  -.7922464   .5148725    -1.54   0.124    -1.803266    .2187735
     M_Expense |  -.1284775   .1874617    -0.69   0.493    -.4965832    .2396283
         _cons |   25.51912   2.426593    10.52   0.000     20.75418    30.28405
---------------+----------------------------------------------------------------
        /sigma |   4.355196   .1125045                      4.134279    4.576114
--------------------------------------------------------------------------------
             0  left-censored observations
           660     uncensored observations
             0 right-censored observations

Liq_asset and Fixed_asset are my main variables of interest.

Last edited by David Wong; 14 May 2017, 19:15.

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17703
#8

14 May 2017, 22:51

David:
you reported the output Stata gave you, but we cannot see the commands you typed.
By the way, which upper and lower limits did you impose in -tobit-?
Anyway, assuming that your code is correct, the -regress- and -tobit- coefficients are identical (only the standard errors differ slightly); hence, I would go with -regress-.

Kind regards,
Carlo
(Stata 19.0)
Comment
David Wong

Join Date: Mar 2017

Posts: 30
#9

15 May 2017, 03:19

Thank you, Carlo. I apologize for not posting my commands.

The commands for the OLS and tobit regressions are:

Code:

reg FRT FL Age Male Martwo No_dep Education Employment Annual_hincome Liq_asset Fixed_asset White_race M_Expense, r
tobit FRT FL Age Male Martwo No_dep Education Employment Annual_hincome Liq_asset Fixed_asset White_race M_Expense, r ll(13)

The dependent variable, FRT, has a lower limit (13) and an upper limit (47).
Comment

Joseph Coveney

Join Date: Apr 2014
Posts: 4399

#10

15 May 2017, 04:05

If you're concerned, then do some diagnostics. For example, what do the residuals look like?

Code:

help diagnostic plots
help regress postestimation plots

Linear regression would seem to be a reasonable option when summing as many as 13 ordered-categorical items into a Likert scale, unless they're nearly perfectly correlated.

Code:

version 14.2

clear *
set more off
set seed 1391759

forvalues i = 1/13 {
    local varlist `varlist' y`i'
}
tempname Corr
matrix define `Corr' = J(14, 14, 0.5) + I(14) * 0.5

quietly drawnorm `varlist' x, corr(`Corr') n(100)
foreach var of varlist `varlist' {
    generate byte l`var' = 1
    forvalues cut = 0.25(0.25)0.75 {
        quietly replace l`var' = l`var' + 1 if `var' > invnormal(`cut')
    }
}

* ly* rather than ly?, so that the two-digit items ly10-ly13 are included in the sum
egen double total = rowtotal(ly*)

histogram total
sleep 1500

regress total c.x
* store the residuals (not the fitted values) for the normality checks below
predict double res, residuals

qnorm res
sleep 1500

pnorm res
sleep 1500

rvfplot

exit

Comment

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#11

15 May 2017, 05:35

Joseph Coveney gave great advice in terms of checking assumptions and postestimations.

That said, particularly in your case, I gather it is "predictable" that -regress- and -tobit- will provide similar results over the same range of the DV.

The variables and theme are not part of my field. But I wonder whether what is really puzzling you (instead of finding the "best model") is the fact that fixed assets were non-significant, unlike liquid assets. However, I suspect that "adjusting" for liquid assets may have turned fixed assets non-significant.

Best regards,

Marcos
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17703
#12

15 May 2017, 05:46

David:
thanks for providing further details.
As an aside to others' helpful advice: comparing your code with your -tobit- output, it does not seem that there are any left- or right-censored observations in your data. If that result is in line with what you're after, there's no point in using -tobit- and you can go with -regress- with no further concerns.
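
A minimal sketch of that check, assuming simulated data and hypothetical variable names (score, x) rather than the original poster's variables: count how many observations actually sit on a limit, then compare -regress- with a -tobit- that specifies both limits. The limits 13 and 47 are the ones quoted in this thread.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 2017
set obs 660

generate double x = rnormal()

* integer score, clipped to the 13-47 range quoted in the thread
generate int score = min(max(round(30 + 2*x + rnormal(0, 4.4)), 13), 47)

* how many observations actually sit on a limit?
count if score <= 13
count if score >= 47

* linear regression and two-limit tobit, both with robust standard errors
regress score x, vce(robust)
tobit score x, ll(13) ul(47) vce(robust)
```

If both counts are zero, no observation contributes a censored term to the tobit likelihood, so the tobit coefficients should coincide with those from -regress-, as they do in the output posted above.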

Kind regards,
Carlo
(Stata 19.0)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35651
#13

15 May 2017, 06:11

My prior prejudice is that Tobit is oversold in this context. It's hard to see, in principle, how linearity is consistent with bounded responses.

What are the bounds of the outcome in principle (not in practice)? I guess at 13 and 65.

The values seem curiously labile from post to post in this thread: I see 13, 15 and 47, 50.

If so, then I'd rescale to (scale - 13) / 52 (i.e. bounds [0, 1]) and then apply logit link and robust standard errors.
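
One way this might be coded is sketched below, again on simulated data with hypothetical variable names (score, score01, x). The bounds 13 and 65 are the guess stated above, not confirmed values, and -glm- is only one of several commands that fit a logit link to a fractional outcome.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 2017
set obs 660

generate double x = rnormal()

* integer score within the guessed in-principle bounds 13 and 65
generate int score = min(max(round(39 + 6*x + rnormal(0, 6)), 13), 65)

* rescale to [0, 1]: (scale - 13) / 52
generate double score01 = (score - 13) / 52

* logit link with robust standard errors
glm score01 x, family(binomial) link(logit) vce(robust)

* average marginal effect of x on the 0-1 scale
margins, dydx(x)
```

The marginal effect is reported on the rescaled outcome, so multiplying it by 52 expresses it in points of the original score. In Stata 14 or later, -fracreg logit- should fit the same mean model.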
Comment
David Wong

Join Date: Mar 2017

Posts: 30
#14

17 May 2017, 16:58

Thank you so much for your reply, Carlo. My next steps are now very clear.
Comment
David Wong

Join Date: Mar 2017

Posts: 30
#15

17 May 2017, 17:06

Dear Marcos,

Thank you very much for your reply. Since the coefficient on liquid assets is significant, I can still tell a story in my paper. At the same time, I can report, as a negative finding, that fixed assets do not explain the variation in financial risk tolerance. The reason I posted here is that I was hoping to find a way to do a robustness check on the OLS results. I thought that using alternative regression models might be a first choice, although I am very new to robustness checks.
Comment
