Unexpected behavior of "_n" within egen command while using a by group.

Evgeny Sironov

Join Date: May 2017

Posts: 13
#1

04 Aug 2020, 13:27

Hello,
I've encountered a peculiar issue when using "_n==1" within an egen command together with a by group.
The data is structured this way: each observation is a job with various variables that describe it. A single person (id) can hold several jobs. The goal is to assign the occupational code of the highest-paying job to all of a person's observations.

This is the code I tried first:

Code:

gsort id -wage
by id: egen main_occ_1=max(occ/(_n==1))

It didn't work correctly: it assigned the expected values only to the first id and left every other id missing.

This code worked correctly and produced what I expected:

Code:

gsort id -wage
by id: g first=_n==1
by id: egen main_occ_2=max(occ/(first==1))

So the question is: why didn't the first version of the code work correctly, despite seemingly having the same logic as the second version?

Here's the code for the data; it includes the resulting variables from both versions of the code:

Code:

clear
input float(id wage occupation_code main_occ first main_occ_2)
1 5000 9 9 1 9
1 3000 4 9 0 9
1 2000 8 9 0 9
1 1000 5 9 0 9
2 3000 2 . 1 2
2 2000 7 . 0 2
2 2000 1 . 0 2
3 4000 5 . 1 5
3 1000 8 . 0 5
end

Thanks.
Nick Cox

Join Date: Mar 2014

Posts: 35818
#2

04 Aug 2020, 13:41

The help for egen warns against references to subscripts — because many of its functions sort temporarily and may break your intended interpretation.
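
A small sketch of why this matters, not taken from the egen source code: the result reported in #1 is the pattern one would get if _n in the expression were evaluated over the whole dataset rather than within each id. The variable names tmp_whole and occ_check are hypothetical, and the example data from #1 are assumed to be in memory.

Code:

gsort id -wage
* without by:, _n counts over the whole dataset, so only observation 1 is nonmissing
gen tmp_whole = occupation_code/(_n==1)
* the group maximum is then missing for every id except the first
egen occ_check = max(tmp_whole), by(id)
list id wage occupation_code occ_check, sepby(id)

With the example data this should give 9 for id 1 and missing for ids 2 and 3, the same pattern as main_occ in #1. Whether a particular egen function evaluates the expression this way is an implementation detail, which is why subscripts are best kept out of egen calls altogether.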
Evgeny Sironov

Join Date: May 2017

Posts: 13
#3

04 Aug 2020, 14:22

Got it. Never bothered to look at the actual command description for some reason...
Thanks Nick.

Joro Kolev

Join Date: Aug 2018
Posts: 3050
#4

04 Aug 2020, 14:41

To compound the mystery, the problem arises only with some egen functions but not with others. The max function in your egen is just a dummy function that does nothing, as what you are interested in is just picking up the only nonmissing value of the expression in parentheses. So you can replace max with many other functions. By the way, I usually accomplish this type of task, where a dummy function is needed, by using the mean. And as you can see here, the mean does not fail us. The min and max do fail us:

Code:

. gsort id -wage

. by id: egen main_occ_max=max(occ/(_n==1))
(5 missing values generated)

. by id: egen main_occ_min=min(occ/(_n==1))
(5 missing values generated)

. by id: egen main_occ_mean=mean(occ/(_n==1))

. by id: egen main_occ_total=total(occ/(_n==1))

. list, sep(0) abb(16)

     +--------------------------------------------------------------------------------------------+
     | id   wage   occupation_code   main_occ_max   main_occ_min   main_occ_mean   main_occ_total |
     |--------------------------------------------------------------------------------------------|
  1. |  1   5000                 9              9              9               9                9 |
  2. |  1   3000                 4              9              9               9                9 |
  3. |  1   2000                 8              9              9               9                9 |
  4. |  1   1000                 5              9              9               9                9 |
  5. |  2   3000                 2              .              .               2                2 |
  6. |  2   2000                 1              .              .               2                2 |
  7. |  2   2000                 7              .              .               2                2 |
  8. |  3   4000                 5              .              .               5                5 |
  9. |  3   1000                 8              .              .               5                5 |
     +--------------------------------------------------------------------------------------------+


Evgeny Sironov

Join Date: May 2017

Posts: 13
#5

04 Aug 2020, 17:11

Yeah, it's a dummy function in this case, but sometimes there's a more complicated condition after the division that would result in more than one possible observation per group for which to compute something with egen. It's a useful "trick" I learned from Nick a while ago.
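
A minimal sketch of the kind of condition meant here, using the example data from #1. The threshold of 2000 and the name top_occ_2000 are made up for illustration. The condition involves no subscripts such as _n, so it does not depend on how egen sorts internally.

Code:

* highest occupation code among jobs paying at least 2000, spread to all of a person's jobs
* jobs failing the condition divide by zero, become missing, and are ignored by max()
* !missing(wage) is needed because Stata treats missing as larger than any number
egen top_occ_2000 = max(occupation_code / (wage >= 2000 & !missing(wage))), by(id)

Here several jobs per id can satisfy the condition, so max is doing real work rather than acting as a dummy function.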

It is funny how it works with other egen functions indeed, but in any case, lesson learned: don't use subscripting with egen. It wouldn't hurt for StataCorp to implement some hardwired sanity check that would produce an error when subscripting is used with egen, given its unpredictable nature.

Joro Kolev

Join Date: Aug 2018
Posts: 3050
#6

04 Aug 2020, 23:39

Notwithstanding the usefulness of the trick Nick shows in one of his columns (that division by zero in Stata does not break the computer but simply produces a missing value), in your case what you want can easily be done with alternative methods. There is also the question of what to do in badly defined situations, e.g., when the highest salary occurs at more than one occupation, or the salary is missing for all occupations.
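
One possible way to flag those badly defined cases before choosing a method, as a sketch only. The variable names top_wage and n_top are hypothetical, and the example data from #1 are assumed to be in memory.

Code:

egen top_wage = max(wage), by(id)
* count the jobs that attain the highest nonmissing wage within each id
egen n_top = total(wage == top_wage & !missing(wage)), by(id)
* 0 means all wages are missing for that id; 2 or more means a tie at the top
list id wage occupation_code if n_top != 1, sepby(id)

In the example data every id has exactly one top-paying job, so the list is empty.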

Alternative solution one, only sorting, no egens:

Code:

. gsort id -wage

. by id: gen occsort = occupation_code[1]

Alternative solution two, no sorting, only egens:

Code:

. egen maxwage = max(wage), by(id)

. egen occup = max(occupation_code) if maxwage==wage, by(id)
(6 missing values generated)

. egen occegen = mean(occup), by(id)

. list, sep(0)

     +------------------------------------------------------------------------------------------+
     | id   wage   occupa~e   main_occ   first   main_o~2   occsort   maxwage   occup   occegen |
     |------------------------------------------------------------------------------------------|
  1. |  1   5000          9          9       1          9         9      5000       9         9 |
  2. |  1   3000          4          9       0          9         9      5000       .         9 |
  3. |  1   2000          8          9       0          9         9      5000       .         9 |
  4. |  1   1000          5          9       0          9         9      5000       .         9 |
  5. |  2   3000          2          .       1          2         2      3000       2         2 |
  6. |  2   2000          7          .       0          2         2      3000       .         2 |
  7. |  2   2000          1          .       0          2         2      3000       .         2 |
  8. |  3   4000          5          .       1          5         5      4000       5         5 |
  9. |  3   1000          8          .       0          5         5      4000       .         5 |
     +------------------------------------------------------------------------------------------+


Nick Cox

Join Date: Mar 2014

Posts: 35818
#7

05 Aug 2020, 02:38

The allusion in #6 is to https://www.stata-journal.com/articl...article=dm0055 which is freely visible.

The division by zero trick is in Section 10 of that paper, while earlier sections will be found to endorse precisely the kind of more basic techniques that Joro Kolev rightly recommends.

It's very hard in Stata, and ultimately quite unimportant, to know whether you read some trick earlier somewhere and didn't register at the time, but then later re-invented it yourself. It's just possible that the division by zero trick is a little original, but otherwise dm0055 is, may I say, one of the more useful of my columns if I can judge by the number of times it has been relevant to a question here.
