Create a new variable with all combinations of other 4

Marco Lazzeretti

Join Date: Apr 2017

Posts: 34
#1

12 Apr 2017, 04:58

Hi, I would like to ask for help with a problem.
I would like to create a new variable (gruppo) from 4 others (ftnd1 sex cat_sigdie cat_eta), which have 4, 2, 2 and 4 categories respectively. I should have 64 different levels for my new variable "gruppo", but some combinations of the variables are missing (I obtain only 60 categories, because some combinations have no observations).
I created the new variable with this command:
egen gruppo=group(ftnd1 sex cat_sigdie cat_eta)
Can I obtain all the levels of "gruppo", even though some combinations are missing?

Thanks

Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

12 Apr 2017, 05:10

Code:

* ???
help fillin

Also, for what purpose? You don't always need to add data for what doesn't exist.

Consider e.g. tabcount (SSC) or groups (SSC):

Code:

. sysuse auto, clear
(1978 Automobile Data)

. * ssc inst groups will be needed for first use 

. groups foreign rep78, fillin

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       1       0      0.00 |
  |  Foreign       2       0      0.00 |
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

Last edited by Nick Cox; 12 Apr 2017, 05:13.


Marco Lazzeretti

Join Date: Apr 2017

Posts: 34
#3

12 Apr 2017, 05:24

Thanks!!
This is only the first part of my problem: in fact, I have to divide each of two different populations into 64 parts, and the two probably contain different combinations of factors. If I ignore groups with no observations, I risk the same group number indicating different combinations of factors in the two populations.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

12 Apr 2017, 05:31

Sorry, I don't yet understand. You have a composite classification with 64 cross-combinations. In practice, some don't occur in your data. What's the Stata problem there?
Marco Lazzeretti

Join Date: Apr 2017

Posts: 34
#5

12 Apr 2017, 05:41

I have 64 possible cross-combinations (4x2x2x4), but with the command above Stata creates only 60 groups (4 don't occur in my data). I need Stata to create all 64 combinations and, where there are no observations, to leave a missing value.
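
One possible way to get a group number that always means the same combination, whether or not that combination occurs, is to compute it arithmetically instead of with egen. The sketch below is not from the thread. It builds a hypothetical toy dataset and assumes the four variables are coded as consecutive integers starting at 1 (1-4, 1-2, 1-2, 1-4); if the real codes differ (for example 0/1 for sex), the offsets would need adjusting.

```stata
* Toy data standing in for the real file (hypothetical values)
clear
set seed 12345
set obs 500
generate ftnd1      = runiformint(1, 4)
generate sex        = runiformint(1, 2)
generate cat_sigdie = runiformint(1, 2)
generate cat_eta    = runiformint(1, 4)

* Assumption: codes run 1-4, 1-2, 1-2 and 1-4 with no gaps.
* Each combination maps to one fixed number between 1 and 64.
generate gruppo = (ftnd1 - 1)*16 + (sex - 1)*8 + (cat_sigdie - 1)*4 + cat_eta

* Check that every value falls in the expected range
assert inrange(gruppo, 1, 64)
```

Running the same generate line on each population gives identical numbering in both, and combinations with no observations simply do not appear. A missing value in any of the four variables yields a missing gruppo.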

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

12 Apr 2017, 05:45

I absolutely agree with Nick. Groups could be presented without the need to create new variables.

Should you insist on creating a group variable, here is an example:

Code:

. sysuse auto
(1978 Automobile Data)

. gen lowmpg = mpg <20

. gen highweight = weight > 3500

. label define low 1 "lowmpg" 0 "highmpg"

. label define high 0 "lowweight" 1 "highweight"

. label values highweight high

. label values lowmpg low

. egen oddgroupvar = group(foreign lowmpg highweight rep78), label
(5 missing values generated)

. tab oddgroupvar, missing

         group(foreign lowmpg |
            highweight rep78) |      Freq.     Percent        Cum.
------------------------------+-----------------------------------
 Domestic highmpg lowweight 1 |          1        1.35        1.35
 Domestic highmpg lowweight 2 |          3        4.05        5.41
 Domestic highmpg lowweight 3 |          9       12.16       17.57
 Domestic highmpg lowweight 4 |          2        2.70       20.27
 Domestic highmpg lowweight 5 |          2        2.70       22.97
Domestic highmpg highweight 3 |          1        1.35       24.32
Domestic highmpg highweight 4 |          1        1.35       25.68
  Domestic lowmpg lowweight 1 |          1        1.35       27.03
  Domestic lowmpg lowweight 2 |          1        1.35       28.38
  Domestic lowmpg lowweight 3 |          9       12.16       40.54
 Domestic lowmpg highweight 2 |          4        5.41       45.95
 Domestic lowmpg highweight 3 |          8       10.81       56.76
 Domestic lowmpg highweight 4 |          6        8.11       64.86
  Foreign highmpg lowweight 3 |          3        4.05       68.92
  Foreign highmpg lowweight 4 |          9       12.16       81.08
  Foreign highmpg lowweight 5 |          5        6.76       87.84
   Foreign lowmpg lowweight 5 |          4        5.41       93.24
                            . |          5        6.76      100.00
------------------------------+-----------------------------------
                        Total |         74      100.00

As we can see, missing values are shown as well.

That said, please note that there are "just" 17 levels (plus missing data) and it already gets quite confusing. Imagine how confusing it would be if we created a group variable with 64 levels.

Best regards,

Marcos


Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

12 Apr 2017, 05:54

#5 makes the perceived problem clearer but if this from #1

Code:

egen gruppo=group(ftnd1 sex cat_sigdie cat_eta)

is "the command above" then it's no part of its purpose to add observations to the data.

To expand on the reply in #2: See the help for fillin and http://www.stata-journal.com/sjpdf.h...iclenum=dm0011 (especially closing paragraphs on p.136).
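
As a rough illustration of what fillin does, here is a minimal sketch using the auto data from #2 rather than the original poster's file. Dropping the missing values of rep78 first is a simplification made for this example only.

```stata
* Sketch: fill in unobserved combinations, then number the groups
sysuse auto, clear
drop if missing(rep78)      // simplification: avoid a missing category

* fillin adds one observation for each combination of foreign and rep78
* that does not occur, and creates the indicator _fillin (1 = added)
fillin foreign rep78

* every combination now exists, so group() numbers all of them
egen gruppo = group(foreign rep78), label
tabulate gruppo _fillin

* the added observations are missing on all other variables;
* dropping them keeps the numbering already assigned to the real ones
drop if _fillin
```

For the variables in #1 the corresponding call would presumably be fillin ftnd1 sex cat_sigdie cat_eta, which works only if every category of each variable occurs at least once in the data.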

Last edited by Nick Cox; 12 Apr 2017, 06:00.
Marco Lazzeretti

Join Date: Apr 2017

Posts: 34
#8

12 Apr 2017, 06:22

Thanks Marcos. I can see the groups without generating a new variable. But this is only the first step of my problem: after that, I have to sample from this population (divided into 64 parts). If I create a new variable, I can use the command below:
by gruppo: sample........
But if I don't create it, how can I do this?
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#9

12 Apr 2017, 06:26

Going on the (somewhat) weird side, you could use - decode - to create string variables (4 of them), - replace - the missing values with, say, "NA", then - encode - them anew. Finally, run the - egen groupingvar = group(var1 var2 var3 var4) - command. I haven't tried it out, and it seems mind-boggling, to say the least, so I'm not sure whether it will work perfectly, but you may wish to give it a try.
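
A possibly simpler alternative to the decode/encode route, not suggested in the thread itself, is the missing option of egen's group() function, which treats missing values as a category of their own. The sketch below uses the auto data, where rep78 has some missing values. Note that this handles missing values in the variables, not combinations that never occur.

```stata
sysuse auto, clear

* default: observations with a missing rep78 get a missing group
egen g_nomiss = group(foreign rep78)

* with the missing option, missing rep78 counts as its own category
egen g_miss = group(foreign rep78), missing

* compare how many observations are left without a group
count if missing(g_nomiss)
count if missing(g_miss)
```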

Best regards,

Marcos
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#10

12 Apr 2017, 06:32

But if I don't create it, how can I do this?

To start, - sample - accepts an if qualifier.

Second, provided your sample is large enough, you can expect the factor variables to be randomly distributed as well; I mean, - sample - can cope with the task of randomly selecting subsamples.

Third, it has been demonstrated that "extreme" stratifications may lead to the opposite, i.e., unbalanced data, precisely because of missing data and the non-stratified predictors.

In short, I have never envisaged a situation where this approach would excel, and that may be due to my ignorance of this particular field, but it would be nice to hear from members with expertise in this situation whether such a strategy is "de facto" ideal.
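
On the practical question of sampling within groups without first creating a group variable, a minimal sketch is given below. It relies on the by() option of - sample -, shown here on the auto data; the 50% figure and the seed are arbitrary choices for illustration.

```stata
sysuse auto, clear
set seed 2017               // arbitrary seed, for reproducibility

* keep a 50% random sample within each level of foreign
sample 50, by(foreign)
tabulate foreign

* for the variables in #1 the call would presumably look like
* (the percentage 10 is hypothetical):
*   sample 10, by(ftnd1 sex cat_sigdie cat_eta)
```

Because by() takes the stratifying variables directly, combinations with no observations need no special handling: there is simply nothing to sample from them.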

Last edited by Marcos Almeida; 12 Apr 2017, 06:36.

Best regards,

Marcos