Command collapse AND keeping the original dataset

Karel Novak

Join Date: Mar 2018

Posts: 37
#1

03 Jul 2020, 03:59

Dear all,

I was just thinking about this: when you analyze a hierarchically structured dataset (for example, students (L1) clustered in schools (L2)) and you wish to analyze the second level only (by aggregating the first-level values), you would probably use the command -collapse-, which gives you, for example, the mean, median, or SD of the L1 units for each L2 unit, but it creates a new dataset.

Now, I was trying to figure out a way around this, so you can incorporate the results of -collapse- into your original dataset. I would use the command:

- by L2_var, sort: egen new_L2_var = mean(another_variable) -

The result of this command is that each L1 unit gets the group mean of the variable of interest (which is completely correct). What I was thinking about is this: can you keep just one value for each L2 unit created this way, so that you have de facto two datasets in one? I don't know how to write code for this (but I assume it will most likely say: keep the first value of a certain variable by group).

Does it make sense to you?

Thanks for your thoughts.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

03 Jul 2020, 04:47

Apart from frames, a modern way to do this, an ancient way is tagging using egen. The code was first published in STB-50 in 1999, was a long-standing standard trick even then, and was folded into Stata 7.

Here is a silly little example.

Code:

. sysuse auto, clear
(1978 Automobile Data)

. egen mean_mpg = mean(mpg), by(rep78)

. egen tag = tag(rep78)

. sort rep78

. l rep78 mean_mpg if tag, noobs

  +------------------+
  | rep78   mean_mpg |
  |------------------|
  |     1         21 |
  |     2     19.125 |
  |     3   19.43333 |
  |     4   21.66667 |
  |     5   27.36364 |
  +------------------+

The idea is that if all the values of a certain variable are identical within a group, then we need only (should only) look at or use one observation from that group.
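A minimal sketch of how that premise could be checked, assuming the same auto data and the variable names used in the example above. The check itself is an illustration, not part of the published egen code: it verifies that mean_mpg is constant within each rep78 group and then shows which observations are tagged.

```stata
sysuse auto, clear
egen mean_mpg = mean(mpg), by(rep78)
egen tag = tag(rep78)

* within each rep78 group, sorted on mean_mpg, the first and last
* values must agree if the variable is constant within the group
bysort rep78 (mean_mpg): assert mean_mpg[1] == mean_mpg[_N]

* one observation is tagged per distinct nonmissing value of rep78
count if tag
tabulate rep78 if tag
```

Note that tag() skips observations with missing rep78 unless its missing option is specified.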

For more see the help and manual entry for egen.

Another approach, which is not so good, is to make every value but one missing within a group of observations. That gets messy very quickly when you want to relate to other groupings on other criteria.
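For comparison, here is a rough sketch of the frames route mentioned at the top of this post. It assumes Stata 16 or later and that the current frame still has its default name, default; the frame name level2 is a hypothetical choice for illustration.

```stata
* requires Stata 16 or later (frames)
sysuse auto, clear

* copy the current data into a new frame; "level2" is a made-up name
* (assumes the current frame is the one named "default")
frame copy default level2

* collapse inside that frame only; the original frame is left untouched
frame level2: collapse (mean) mean_mpg = mpg, by(rep78)

* inspect the group-level frame, then confirm the original data remain
frame level2: list, noobs
describe, short
```

The collapsed results and the original observations then sit side by side in memory, each in its own frame.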
Karel Novak

Join Date: Mar 2018

Posts: 37
#3

03 Jul 2020, 05:48

Thanks for the answer. I think it still might be a little different from what I had in mind, which is basically this: "Another approach which is not so good is make every value but one missing within a group of observations." So, when I use -sum tag- it only shows the L2 observations. But I do not know how to do this, something like: keep only the values where tag == 1 and set the others to missing.

Last edited by Karel Novak; 03 Jul 2020, 05:51.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

03 Jul 2020, 06:07

The whole point of tagging is to ignore what you don't need to see or to use. You shouldn't want to set values in non-tagged observations to missing, because it won't make any other approach easier and it will frustrate much else.
Karel Novak

Join Date: Mar 2018

Posts: 37
#5

03 Jul 2020, 13:45

I see, so basically it would be better to just:

Code:

sum mean_mpg if tag==1

etc.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

03 Jul 2020, 15:43

Specifically, no, as an unweighted summary of group summaries would not be best.
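One possible reading of this point, sketched with the auto data from #2: summarizing the tagged group means gives every group equal weight regardless of how many observations it contains. The variable n_group and the choice of frequency weights are assumptions made for illustration only.

```stata
sysuse auto, clear
egen mean_mpg = mean(mpg), by(rep78)
egen tag = tag(rep78)

* group sizes, to be used as frequency weights (hypothetical variable name)
egen n_group = count(mpg), by(rep78)

* unweighted: each rep78 group counts once, however small it is
summarize mean_mpg if tag

* weighted by group size: each group mean counts once per car in the group
summarize mean_mpg if tag [fweight = n_group]

* for comparison: mean of mpg over cars with nonmissing rep78
summarize mpg if !missing(rep78)
```

The weighted summary of the group means reproduces the mean of mpg over the same cars, while the unweighted one generally does not.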