Generating variables for averages

Sasha Gulabivala

Join Date: Feb 2017

Posts: 33
#1

27 Feb 2017, 04:44

Hi,

I have a panel dataset with 13 waves and my dependent variable is binary.

I am running a correlated random effects model, which requires me to generate averages of my time-varying explanatory variables (please see slide 37: http://conference.iza.org/conference...nonlin_iza.pdf).

So I would do:

Code:

egen x1bar = mean(x1), by(id)

One of my main explanatory variables is "retire", which I think is a categorical variable. It measures how important the individual rates retirement as a motive for saving money in the present, rated from 1-14 with 0 being very unimportant and 14 being very important.

To generate the average of this variable, I tried:

Code:

egen retirebar = mean(i.retire), by(id)

But this returned the error message: i: operator invalid

Could you please suggest how I can generate the average of this variable?

I think the following would be incorrect because it doesn't take into account that "retire" is categorical

Code:

egen retirebar = mean(retire), by(id)

Thank you
Nick Cox

Join Date: Mar 2014

Posts: 35667
#2

27 Feb 2017, 04:59

I can't see that the mean of a multiple category categorical variable coded arbitrarily has any meaning or use.
Comment
Sasha Gulabivala

Join Date: Feb 2017

Posts: 33
#3

27 Feb 2017, 05:14

Hi Nick,

Please also see slide 51 (http://conference.iza.org/conference...nonlin_iza.pdf), where it appears that Jeff Wooldridge has generated average variables for the time-varying RHS variables (such as kids), but not for the time-invariant RHS variables (such as black).

I thought I should do the same given that the "retire" variable varies over time

Thanks
Comment
Sasha Gulabivala

Join Date: Feb 2017

Posts: 33
#4

27 Feb 2017, 05:41

Also, Dimitriy V. Masterov posted here (http://stats.stackexchange.com/quest...ice-cre-probit), where he suggested that for the Chamberlain-Mundlak CRE model we should

fit a panel random effects probit where the RHS variables are augmented with x̄_i, the average of x_it for each panel [...] The inclusion of the mean terms should capture the correlation between the unobserved heterogeneity and the covariates that renders the random effect model inconsistent

Therefore I thought I should include the mean terms. Would this be incorrect for multiple-category categorical variables? Thanks
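For readers following along, the approach in the quote might be sketched in Stata roughly as below. This is an illustration under assumptions, not a tested specification: saving1, retire, occupation and id come from this thread, while wave (the time variable) and x1 (a generic time-varying covariate) are placeholder names. The error in #1 arises because egen's mean() takes an expression, not factor-variable notation. For a genuinely unordered variable, one option is to average indicator variables rather than the arbitrary codes. Whether averaging retire itself is sensible is the question under discussion here.

```stata
* Sketch only. id, saving1, retire and occupation appear in this thread;
* wave and x1 are hypothetical names used for illustration.
xtset id wave

* Panel means of time-varying covariates that are treated as measures
egen x1bar     = mean(x1), by(id)
egen retirebar = mean(retire), by(id)

* For an unordered categorical covariate, average indicators, not the codes.
* tabulate, generate() creates occ_1, occ_2, ... (one indicator per category).
tabulate occupation, generate(occ_)
foreach v of varlist occ_* {
    egen `v'bar = mean(`v'), by(id)
}

* Random-effects probit augmented with the panel means.
* The indicator means sum to 1 within a panel, so Stata will omit one
* of them because of collinearity.
xtprobit saving1 x1 retire i.occupation x1bar retirebar occ_*bar, re nolog
```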
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35667
#5

27 Feb 2017, 06:18

Back-tracking to #1:

I see that you "think" that retire is categorical. I suggest that you resolve this doubt.

If you think that, then I would be amazed that you're expected to calculate its mean for this procedure.

Conversely if you are treating it as a measure then in Stata terms the last egen statement in #1 is the way to calculate its mean separately for panels.

But I am not any kind of expert on these random effect models.

I imagine that experts would want to see your intended model syntax, which I can't see as yet in this thread. I have not read any of your links.
Comment
Sasha Gulabivala

Join Date: Feb 2017

Posts: 33
#6

27 Feb 2017, 06:52

Hi Nick,

Firstly, thank you for pointing this out. I was indeed unsure, and I think that I was wrong about retire being categorical. The ordering of this variable is meaningful, so I think it is an ordinal variable. As there is an intrinsic ordering of the levels, I think it is possible to calculate the mean for this procedure. Would you agree? Then, as you confirmed, I should use:

Code:

egen retirebar = mean(retire), by(id)

Secondly, for categorical variables (such as occupation) in Stata, my understanding is that it is better to attach the i. prefix (i.occupation) so that each category is displayed separately in the regression output. Similarly, the c. prefix is attached to continuous variables. For ordinal variables (such as retire), is there a need to attach any prefix when running the regression?

So for example in a basic version of the Probit RE model (not using Chamberlain-Mundlak CRE model yet):

Code:

xtprobit saving1 retire i.occupation, re nolog
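As a hedged aside on factor-variable notation: in Stata estimation commands a variable without a prefix is treated as continuous, so retire and c.retire specify the same term, while i.retire would enter a separate indicator for each level. The two treatments could be compared along these lines, using the variable names from this post. Which one is appropriate is a modelling judgement that the syntax does not decide.

```stata
* retire treated as a measure (identical to writing plain retire)
xtprobit saving1 c.retire i.occupation, re nolog

* retire treated as categorical: one indicator per level, base level omitted
xtprobit saving1 i.retire i.occupation, re nolog
```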

Thanks

Last edited by Sasha Gulabivala; 27 Feb 2017, 07:24.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35667
#7

27 Feb 2017, 08:37

I naturally agree that it's possible to calculate the mean of an ordinal variable, but necessarily I can't advise on whether it's a good idea for your purpose.

It seems to me that you need more support from a supervisor, advisor or mentor in talking this through.
Comment
Sasha Gulabivala

Join Date: Feb 2017

Posts: 33
#8

27 Feb 2017, 08:47

Thank you for your help - I will discuss this further with my tutor.
Comment
Sasha Gulabivala

Join Date: Feb 2017

Posts: 33
#9

27 Feb 2017, 10:03

Nick Cox, how can I tell whether a variable should be treated as categorical or continuous in Stata?

I understand that variables like car colour (e.g. red = 1, blue = 2) would be categorical because there is no meaning or order to the numbers they are coded with.
I also understand that variables like age would be continuous as there is an intrinsic order to them.

With a variable such as self-perceived health status, there is an order to the levels, so would this be an ordinal variable? Would I then treat it as categorical (i.health) or as continuous (c.health) in Stata?

Thank you for your time
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35667
#10

27 Feb 2017, 11:12

Ordinal scales are precisely those on which different researchers (and practitioners too) jump in different directions. For example, many universities average grades say 1 to 5 routinely while people in some departments tell their students that it is wrong to average ordinal scales.

I was brought up on texts which preached that Pearson correlation was wrong for ordinal data but Spearman correlation was fine, seemingly oblivious to what Spearman actually does.
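What Spearman correlation does can be checked directly: it is the Pearson correlation of the ranks, with tied values given their average rank. A minimal sketch, assuming two placeholder variables y and x (the rank_y and rank_x names are also hypothetical):

```stata
* Sketch only: y and x are placeholder variable names.
* Rank each variable over the observations where both are non-missing;
* egen's rank() assigns average ranks to ties by default.
egen rank_y = rank(y) if !missing(y, x)
egen rank_x = rank(x) if !missing(y, x)

* Pearson correlation of the ranks ...
correlate rank_y rank_x

* ... should match the rho reported by spearman
spearman y x
```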
Comment
Sasha Gulabivala

Join Date: Feb 2017

Posts: 33
#11

27 Feb 2017, 16:36

Hi Dr Cox,

Thanks for your reply

Indeed, as you suggested, it seems that Pearson's correlation cannot be used when there is an ordinal variable.
So I ran a Spearman's correlation on saving (my key dependent variable) and health (explanatory/control variable).
The result is as follows:

There was a weak but statistically significant positive correlation between saving and health, r_s = 0.1605, p < 0.0001.
This leads me to believe that there is a monotonic relationship between the variables, and as health is on a Likert scale, I think it is appropriate to treat health, an ordinal variable, as continuous (rather than categorical with i. prefix) in Stata.

I would be extremely grateful if you could let me know if you think that I have misinterpreted this

Last edited by Sasha Gulabivala; 27 Feb 2017, 16:39.
Comment

Sasha Gulabivala

Join Date: Feb 2017
Posts: 33

#12

27 Feb 2017, 18:23

Amended with a more readable format:

Code:

 spearman saving health

 Number of obs =    3065
Spearman's rho =       0.1605

Test of Ho: saving and health are independent
    Prob > |t| =       0.0000

Comment

Rose Simmons

Join Date: Feb 2017

Posts: 114
#13

06 Mar 2017, 07:01

Hi Nick Cox,

For the Chamberlain-Mundlak CRE model, Wooldridge generates average variables, as these should capture the correlation between the unobserved heterogeneity and the covariates that makes the random effects (RE) model inconsistent. So the CRE model attempts to act as something in between FE and RE.

Originally posted by Nick Cox

I can't see that the mean of a multiple category categorical variable coded arbitrarily has any meaning or use.

I just wondered, do you think that the mean of an indicator/dummy variable would have any meaning or use?

Thank you

Rose Simmons
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35667
#14

06 Mar 2017, 07:18

The mean of an indicator variable has as much meaning and use as is possible. It is the fraction or probability of the state coded 1. If you have 7 females and 3 males and female is coded 1 and male 0 then the mean of 0.7 naturally corresponds to, nay is the same as, the proportion 7/10 who are female.
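The arithmetic in this example can be reproduced with a small made-up dataset. This is only a sketch: the variable name female and the ten observations are constructed for illustration.

```stata
* Ten made-up observations: the first 7 coded 1 (female), the last 3 coded 0
clear
set obs 10
generate byte female = _n <= 7

* The mean of the indicator is the proportion coded 1
summarize female
display r(mean)   // displays .7, i.e. 7/10
```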
Comment
