Problem with Dependent variables measured as Proportions & Interpreting coefficients

Guest

01 Apr 2020, 20:49

Hi all,

I am running an OLS regression with robust clustered standard errors (due to heteroskedasticity). I am using panel data.

My dependent variable is defined as the cash share of total transactions made in a typical month, measured between 0 and 1. Out of 6,695 observations, there are 178 observations with a response of 1 and 634 observations with a response of 0.

My explanatory variables are as follows:

Income - yearly household income in dollars, calculated as the mean of the respondent's assigned income category, e.g. 7,500 if the respondent falls within the $5-10k income category.
Age - measured in years.
Education - four categories, assigned values of 1-4.

My questions are:

(1) I have seen previous posts indicating that other models are better than OLS when the dependent variable is a proportion like mine, for example logit regression. I'm not sure how to run this type of regression or how to interpret the results. I am also not sure why OLS wouldn't work well with a dependent variable measured as a proportion between 0 and 1.

(2) How can I accurately interpret coefficients on the explanatory variables? I am a bit confused by this.

(3) Also, the coefficient on income is very small. I was thinking of dividing income by 1,000. Is this an acceptable way to re-scale?

(4) Is it appropriate to take the log of my dependent variable if it is a proportion?

Any advice would be really appreciated. Thanks!

I attach below the code for conducting the OLS regression and also the sample data using -dataex-. I did not know how to insert a table of my results.

Code:

reg cashshare income age i.educat male rating holdings credit cheque i.year if sample==1, robust

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(year cashshare) double(age income) float(educat male) double credit float rating double(holdings cheque) float sample
2015   .1334569 31 112500 4 1 1 20 108.33333333333341 1 1
2016   .3030303 32 112500 4 1 1 21                 20 1 1
2017   .1935484 34 112500 4 1 1 24  969.6428580000008 1 1
2015  .12854996 66  27500 4 0 1 21  300.0000000000001 1 1
2016   .1992903 67  17500 4 0 1 22 180.00000000000006 1 1
2017   .1682243 68  17500 4 0 1 23                280 1 1
2016 .020833334 41 112500 3 1 1 29                  0 1 1
2017          1 42 112500 3 1 1 25                 80 1 1
2015  .05921588 25  32500 2 1 0 25               82.5 1 1
2016          0 26  37500 2 1 0 26 11.666666666666664 1 1
2017          0 27  37500 2 1 1 24  973.9285715999991 1 1
2015   .3259842 53  37500 3 0 1 21 130.44642870000004 1 1
2016  .20155144 55  32500 3 0 1 23  80.00000000000001 1 1
2017   .4016003 56  22500 3 0 1 22  60.00000000000001 1 1
2015  .06490872 26  55000 4 0 1 20                 80 1 1
2016     .09375 28  55000 4 0 1 20                 20 1 1
2015  .05271691 83 112500 3 1 1 22 1304.4642869999998 1 1
2016  .05449017 84 112500 3 1 1 18                600 1 1
2017  .24793923 85 112500 3 1 1 20                300 0 1
2015          0 38 112500 3 1 1 14  83.33333333333327 1 1
2016  .13636364 40 162500 3 1 1 17  85.73735572900041 1 1
2017          0 41 162500 3 1 1 15 199.99999999999994 1 1
2015  .53912795 57  22500 2 0 1 29                 80 1 1
2016  .12972517 58  22500 2 0 1 28 200.00000000000014 1 1
2017  .54251766 59  22500 2 0 1 29 13.333333333333336 1 1
2015 .023809524 57 162500 4 1 1 22  869.6428579999998 1 1
2016  .29116118 58  55000 4 1 1 23  710.3417382269064 1 1
2017  .13953489 59  45000 4 1 1 21  869.6428579999998 1 1
2015  .52614975 44  87500 3 1 1 23 434.82142900000036 1 1
2016   .7837778 46  87500 3 1 1 23 434.82142900000036 1 1
2017   .8249276 47  87500 3 1 1 23 500.00000000000045 1 1
2015  .12204076 54 162500 3 1 1 21  150.0000000000001 1 1
2016          0 56 162500 3 1 1 22                 40 1 1
2017   .0961064 57 162500 3 1 1 24  60.00000000000001 1 1
2015  .06973366 64  17500 3 0 1 16 100.00000000000007 1 1
2016   .1854961 66  17500 3 0 1 22   708.333333333333 1 1
2017   .0924408 67  11250 3 0 1 19                300 1 1
2015  .50914204 48   6250 3 0 0 24        2174.107145 0 1
2016   .3966907 49   8750 3 0 0 30 1779.2857159999999 0 1
2017   .8424754 50   8750 3 0 0 24  521.7857148000004 1 1
2015  .24536224 54 112500 3 0 1 24  257.4107145000001 1 1
2017          . 57 112500 3 0 1 16                  . 1 0
2015  .05657994 56 162500 3 1 1 19 200.00000000000014 1 1
2016   .2283169 58 162500 3 1 1 22 100.00000000000006 1 1
2017  .14577565 58 162500 3 1 1 20 100.00000000000007 1 1
2015   .2158688 53  67500 2 1 1 25 373.92857160000017 1 1
2016  .53102005 55  67500 2 1 0 19 240.00000000000014 1 1
2015  .03986711 47 162500 3 0 1 19  33.33333333333334 1 1
2017   .0815647 50 162500 3 0 1 14 100.00000000000007 1 1
2015  .06666667 49  67500 4 1 1 21 23.333333333333336 1 1
2016  .03590127 51  67500 4 1 1 19                 40 1 1
2017  .07803112 52  67500 4 1 1 26                 80 1 1
2015   .1700716 62  87500 2 1 1 24  280.8928574000001 1 1
2016  .17845364 64 112500 2 1 1 24 180.00000000000006 1 1
2017  .28082514 64 112500 2 1 1 23 146.96428580000003 1 1
2015  .20849185 64   8750 3 0 1 16 173.92857160000003 1 1
2016  .46384865 65   8750 3 0 1 23  95.29761913333337 1 1
2017   .3653846 66   8750 3 0 1 17                120 1 1
2016   .6631991 50  32500 2 1 1 25  782.6785722000002 1 1
2017   .6666667 51  32500 2 1 1 22  360.0000000000001 1 1
2015  .04206984 46 162500 4 1 1 19                 20 1 1
2017          0 49 162500 4 1 1 20                 80 1 1
2015   .1640541 44  87500 3 1 1 20               1600 1 1
2016   .3001541 45 112500 3 1 1 20 120.00000000000001 1 1
2017        .25 46 112500 3 1 1 16 100.00000000000007 1 1
2015        .12 28   2500 4 0 1 16                260 1 1
2016          0 29  17500 4 0 1 21                 80 1 1
2017 .069695085 30  27500 4 0 1 17 46.666666666666664 1 1
2015        .18 30  45000 4 0 1 17 100.00000000000007 1 1
2016  .06896552 32  45000 4 0 1 25                 60 1 1
2017  .14180991 32  45000 4 0 1 23 100.00000000000007 1 1
2015  .23148148 52  67500 4 1 1 20 200.00000000000014 1 1
2016   .2897196 52  67500 4 1 1 21  360.0000000000003 1 1
2017  .20763187 53  67500 4 1 1 19  340.0000000000003 1 1
2015  .09425198 46 162500 3 1 1 17  554.8214290000002 1 1
2016  .06666667 47 112500 3 1 1 22 180.00000000000006 1 1
2017   .4494983 48 112500 3 1 1 24 120.00000000000001 1 1
2015          0 31  67500 3 1 1 20                 20 1 1
2016  .12244898 33  67500 3 1 1 22                160 1 1
2017       .125 34  67500 3 1 1 20                 40 1 1
2015   .3865514 56  67500 2 1 0 29 3892.8854616688204 1 1
2016  .25685653 58  67500 2 1 0 19 126.96428580000003 1 1
2017         .2 59  87500 2 1 0 19 213.92857160000003 1 1
2015   .3388633 58  32500 2 1 0 25 195.66964305000005 1 1
2016   .4044944 60  32500 2 1 0 22 200.00000000000014 1 1
2017   .4013378 61  32500 2 1 0 22 200.00000000000003 1 1
2015   .1178344 30 112500 3 1 1 21  33.33333333333334 1 1
2016 .012800976 32 112500 3 1 1 21 25.000000000000018 1 1
2017  .01222494 33 112500 3 1 1 22   41.6666666666667 1 1
2015    .522196 59  67500 2 1 0 15 1739.2857159999999 1 1
2016  .52024233 61  67500 2 1 0 23  521.7857148000002 1 1
2015          . 69   8750 1 1 .  .                  . . 0
2016          1 69   8750 1 1 0 12 130.44642870000007 0 1
2017          1 71   8750 1 1 0 17 173.92857160000003 1 1
2015          0 53  87500 3 1 1 25  53.33333333333333 1 1
2016   .1923077 54  67500 3 1 0 21                 20 1 1
2017  .06060606 55  87500 3 1 0 26  33.33333333333333 1 1
2015   .4244186 37  13750 3 0 0 23 180.00000000000003 1 1
2016  .17261343 39   8750 3 0 0 23 213.92857160000003 1 1
2017  .15873533 40  11250 3 0 0 21 173.92857160000003 1 1
end
label values educat educat_label
label def educat_label 1 "no diploma", modify
label def educat_label 2 "high school", modify
label def educat_label 3 "graduate", modify
label def educat_label 4 "post graduate", modify
label values male male_label
label def male_label 0 "female", modify
label def male_label 1 "male", modify
label values credit credit_label
label def credit_label 0 "no credit card", modify
label def credit_label 1 "credit card owner", modify

Last edited by sladmin; 11 May 2020, 07:56. Reason: anonymize original poster

Tags: OLS, panel data

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17714
#2

02 Apr 2020, 01:54

Guest:
It is strange that you are dealing with panel data but your dataset does not report any -panelid-.
1) OLS is correct as you have a continuous regressand;
2) coefficients express, other things being equal (or when adjusted for the other predictors), their contribution to variation of the regressand, expressed in percentage points (once multiplied by 100, since the regressand runs from 0 to 1);
3) my guess is that your problems in interpreting the coefficients are due to including mean data (say, for income) that, in turn, reduces possible variation in the regressand;
4) logging -income- would make the interpretation of your regression model more difficult.
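
On the rescaling raised in question (3) of the original post, a minimal sketch, assuming the -dataex- extract from #1 is in memory. The variable name income_k is introduced here for illustration only. Dividing a predictor by 1,000 multiplies its coefficient and standard error by 1,000 and leaves the t statistic, the fit, and all other coefficients unchanged.

Code:

* income_k is a hypothetical name: income in thousands of dollars
generate double income_k = income/1000
label variable income_k "Household income (thousands of dollars)"

* Same model as in #1, with the rescaled predictor
regress cashshare income_k age i.educat male rating holdings credit cheque i.year if sample==1, vce(robust)

The coefficient on income_k then reads as the change in cash share associated with an extra $1,000 of yearly income.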

Last edited by sladmin; 11 May 2020, 07:57. Reason: anonymize original poster

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35730
#3

02 Apr 2020, 02:12

There is a large literature on this, which I won't even try to epitomise, but the issues include

1. The problem for continuous proportions is in the first instance not so much with OLS -- which is an estimation method, not a model -- as with a linear form such as y = a + bx and its analogues with more predictors. With such forms predictions below 0 or above 1 are not excluded even though they may not arise in practice within the range of the data. This parallels part of the rationale for logit and probit models with binary outcomes.

2. The variance structure of continuous proportions follows from the usual rules, so that as the mean proportion approaches 0 or 1, so also the variance must approach 0. (For example, if the mean is zero, all values must be zero.) This may not matter much, but doing a better job with errors always beats out doing even an OK job. Otherwise put, heteroscedasticity is the norm, not a pathology, for such data.
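
Point 2 can be made concrete with a standard bound, sketched here for any variable $y$ confined to $[0, 1]$ with mean $\mu$ (the symbols are introduced for illustration). Since $y^2 \le y$ on that interval,

$$\operatorname{Var}(y) = E[y^2] - \mu^2 \le E[y] - \mu^2 = \mu(1 - \mu),$$

so the variance is squeezed towards 0 as $\mu$ approaches 0 or 1, and it can never exceed 0.25.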
Guest
#4

02 Apr 2020, 07:18

Thanks very much Carlo and Nick for your detailed responses.

Originally posted by Carlo Lazzaro

Guest:
It is strange that you are dealing with panel data but your dataset does not report any -panelid-.
1) OLS is correct as you have a continuous regressand;
2) coefficients express, other things being equal (or when adjusted for the other predictors), their contribution to variation of the regressand, expressed in percentage points (once multiplied by 100, since the regressand runs from 0 to 1);
3) my guess is that your problems in interpreting the coefficients are due to including mean data (say, for income) that, in turn, reduces possible variation in the regressand;
4) logging -income- would make the interpretation of your regression model more difficult.

I did miss out the ID variable (my apologies)!

With respect to (2), if the coefficients are expressed in percentage points, is there any way of expressing this as a percentage change? I am also interested in, for example, the impact of a one-year increase in age on the % change in cash share. I assume that taking the log of the dependent variable is the only other option, although log(0) is undefined.

With respect to (3), if taking the mean of the income category may reduce variation in the regressand, is it more appropriate to create a categorical income variable similar to the education variable?

Last edited by sladmin; 11 May 2020, 07:57. Reason: anonymize original poster
Guest
#5

02 Apr 2020, 07:24

Originally posted by Nick Cox

There is a large literature on this, which I won't even try to epitomise, but the issues include

1. The problem for continuous proportions is in the first instance not so much with OLS -- which is an estimation method, not a model -- as with a linear form such as y = a + bx and its analogues with more predictors. With such forms predictions below 0 or above 1 are not excluded even though they may not arise in practice within the range of the data. This parallels part of the rationale for logit and probit models with binary outcomes.

2. The variance structure of continuous proportions follows from the usual rules, so that as the mean proportion approaches 0 or 1, so also the variance must approach 0. (For example, if the mean is zero, all values must be zero.) This may not matter much, but doing a better job with errors always beats out doing even an OK job. Otherwise put, heteroscedasticity is the norm, not a pathology, for such data.

Many thanks Nick for your response. I had a few questions relating to your answer:
Is there a way of identifying whether there are predictions below 0 or above 1?

Also, how would I test for the impact of my explanatory variables on the continuous dependent variable in a logit model? My understanding is that logit models use binary dependent variables, but I have a continuous dependent variable. I think this is where my confusion lies. Would I have to create a new dependent variable? Apologies for my lack of understanding on this one!

Lastly, were you referring to logit or xtlogit? I have tested the following but am not sure it is correct.

Code:

logit cashshare income age i.educat male rating holdings credit cheque i.year if sample==1, robust

Last edited by sladmin; 11 May 2020, 07:58. Reason: anonymize original poster
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17714
#6

02 Apr 2020, 07:26

Guest:
1) I think that reporting percentage changes of a regressand expressed in percentage points can make disseminating your results a bit difficult (however, it may well depend on your target audience);
2) making categorical a continuous variable can further reduce its informative value.
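
If a percentage change is still wanted for the question in #4, one hedged option is to keep the linear model and express a coefficient relative to the mean cash share in the estimation sample. This is only a sketch, assuming the -dataex- extract from #1 is in memory. It gives the relative change at the sample mean, not a constant elasticity, and it avoids the log(0) problem because the outcome is never logged.

Code:

* Linear model as in #1
regress cashshare income age i.educat male rating holdings credit cheque i.year if sample==1, vce(robust)

* Mean of the outcome over the observations actually used
summarize cashshare if e(sample), meanonly

* Coefficient on age as a percentage of the mean cash share
display "Percent change in cash share per extra year of age, at the sample mean: " 100*_b[age]/r(mean)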

Last edited by sladmin; 11 May 2020, 07:58. Reason: anonymize original poster

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35730
#7

02 Apr 2020, 09:53

Is there a way of identifying whether there are predictions below 0 or above 1?

Naturally. Run predict after fitting your model and check its limits.
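
A minimal sketch of that check, assuming the -dataex- extract from #1 is in memory. The variable name xb_hat is introduced here for illustration only.

Code:

* Refit the linear model from #1
regress cashshare income age i.educat male rating holdings credit cheque i.year if sample==1, vce(robust)

* Fitted values for the estimation sample only
predict double xb_hat if e(sample), xb

* The reported min and max show whether any prediction leaves [0, 1]
summarize xb_hat

* Explicit counts; missing values sort high in Stata, hence the !missing() guard
count if xb_hat < 0
count if xb_hat > 1 & !missing(xb_hat)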

Also, how would I test for the impact of my explanatory variables on the continuous dependent variable in a logit model? My understanding is that logit models use binary dependent variables, but I have a continuous dependent variable. I think this is where my confusion lies. Would I have to create a new dependent variable? Apologies for my lack of understanding on this one!

Logit models for continuous proportions are perfectly justified. In fact, historically they predate logit models for binary outcomes. Just use glm or similar commands.
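
A sketch of what that could look like with glm, assuming the -dataex- extract from #1 is in memory. It fits a logit link with a binomial family directly to the proportion and uses robust standard errors. No new dependent variable is needed. Note that plain logit, as tried in #5, treats every nonzero value of the outcome as a success, so it does not model the proportion itself.

Code:

* Fractional logit: the outcome stays as the 0-1 proportion
glm cashshare income age i.educat male rating holdings credit cheque i.year if sample==1, family(binomial) link(logit) vce(robust)

* Average marginal effects on the proportion scale, comparable to the OLS coefficients
margins, dydx(income age)

* In recent Stata versions -fracreg logit- fits the same kind of model
fracreg logit cashshare income age i.educat male rating holdings credit cheque i.year if sample==1

Stata notes that the outcome has noninteger values when glm is used this way. The coefficients are on the logit scale, which is why margins is used to get effects in units of the cash share.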

Lastly, were you referring to logit or xtlogit? I have tested the following but not sure this is correct.

I wasn't making a distinction. I rather suspect xtlogit wants binary outcomes, so I guess xtgee is a better command for panel data, but now we are at the limits of my experience too. I am not an economist or econometrician or an expert on panel data.
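
A hedged sketch of the xtgee route, not a recommendation of a particular specification. The panel identifier id is hypothetical, because the extract posted in #1 does not include one, and the exchangeable working correlation is just one possible choice.

Code:

* id is a hypothetical panel identifier; it is not in the posted extract
xtset id year

* Population-averaged fractional logit with robust standard errors
xtgee cashshare income age i.educat male rating holdings credit cheque i.year if sample==1, family(binomial) link(logit) corr(exchangeable) vce(robust)

* Simpler alternative: pooled fractional logit with standard errors clustered on the panel
glm cashshare income age i.educat male rating holdings credit cheque i.year if sample==1, family(binomial) link(logit) vce(cluster id)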

Last edited by Nick Cox; 02 Apr 2020, 09:57.
Guest
#8

02 Apr 2020, 10:22

Thanks so much for your insights Nick - I really appreciate it.