Problem creating variable with equal discrete values

Max Baard

Join Date: Aug 2022

Posts: 7
#1

23 Aug 2022, 09:01

Hi, I am using Stata 14.2 (in case that changes anything).

I have a dataset with 300 observations and I want to create a new variable that takes 5 discrete values with equal frequency and is randomly assigned across the sample (300 obs).
That is:
var   Freq   Perc   Cum.
  1     60     20     20
  2     60     20     40
  3     60     20     60
  4     60     20     80
  5     60     20    100

Here var is an ordinal variable.
The aim is to check whether the observations (from experimental games) are statistically different from random frequencies.

I don't know if I'm going about this in the correct way or not. Thus, any help is greatly appreciated.
Tags: Generate, random frequency

William Lisowski

Join Date: Dec 2014
Posts: 10150

23 Aug 2022, 09:23

Well, since you haven't shown us your code, we also don't know if you're going about this in the correct way.

Since you want exactly 1/5 of your observations in each of your 5 groups, which is possible since your number of observations is a multiple of 5, I would approach the task in the following way.

Code:

. generate r = runiform()

. sort r

. generate var = ceil(_n/60)

. tab var

        var |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         60       20.00       20.00
          2 |         60       20.00       40.00
          3 |         60       20.00       60.00
          4 |         60       20.00       80.00
          5 |         60       20.00      100.00
------------+-----------------------------------
      Total |        300      100.00
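The same idea can be written so that it is reproducible and does not hard-code the group size of 60. This is only a sketch for a do-file: the seed value and the helper variable id are assumptions added for illustration, and the clear and set obs lines stand in for loading the real 300-observation dataset.

```stata
clear
set seed 20220823                       // arbitrary seed (assumption), so the draw can be repeated
set obs 300                             // stand-in for the real dataset

generate long id = _n                   // hypothetical helper: remembers the original row order
generate double r = runiform()          // one uniform random number per observation
sort r                                  // random order
generate byte var = ceil(5 * _n / _N)   // five equal blocks of the random order
sort id                                 // put the data back in its original order
drop r
tabulate var
```

With 300 observations, ceil(5 * _n / _N) gives the same blocks as ceil(_n/60), and it adapts if the number of observations changes to another multiple of 5.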

Max Baard

Join Date: Aug 2022

Posts: 7
#3

23 Aug 2022, 11:43

Thank you very much, William. That works perfectly for what I need.

Joro Kolev

Join Date: Aug 2018
Posts: 3050

23 Aug 2022, 12:04

This would also do it:

Code:

. clear

. set obs 300
Number of observations (_N) was 0, now 300.

. gen r = runiform()

. egen var = cut(r), group(5)

. replace var = var + 1
(300 real changes made)

. tab var

        var |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         60       20.00       20.00
          2 |         60       20.00       40.00
          3 |         60       20.00       60.00
          4 |         60       20.00       80.00
          5 |         60       20.00      100.00
------------+-----------------------------------
      Total |        300      100.00

Joro Kolev

Join Date: Aug 2018
Posts: 3050

23 Aug 2022, 12:06

And this would also do it:

Code:

. set obs 300
Number of observations (_N) was 0, now 300.

. gen r = runiform()

. sort r

. gen var = group(5)

. tab var

        var |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         60       20.00       20.00
          2 |         60       20.00       40.00
          3 |         60       20.00       60.00
          4 |         60       20.00       80.00
          5 |         60       20.00      100.00
------------+-----------------------------------
      Total |        300      100.00

William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

23 Aug 2022, 13:44

Joro Kolev

Wow. That's new to me.

I notice that the group() function does not appear in the Stata Functions Reference Manual PDF included with Stata 17.

Is there a story to this? It doesn't appear in the output of help undocumented either.

It does work on my fully updated copy of Stata 17.
Nick Cox

Join Date: Mar 2014

Posts: 36054
#7

23 Aug 2022, 14:59

@William Lisowski

For more about the group() function in Stata (not that in egen) see e.g.

https://www.statalist.org/forums/for...riable1-create

https://www.stata.com/statalist/arch.../msg00260.html

I think anyone should be much better off with xtile here.

It would be good if StataCorp made this undocumented function inaccessible except under version control. My guess is that many uses are accidents, caused by confusion between gen and egen or, equivalently, by typos.
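For reference, the xtile route for the original 300-observation problem might look like the following. This is a minimal sketch for a do-file, assuming an arbitrary seed and an empty dataset created only for the demonstration.

```stata
clear
set seed 20220823                 // arbitrary seed (assumption), so the result can be repeated
set obs 300                       // stand-in for the real dataset

generate double r = runiform()    // one uniform random number per observation
xtile var = r, nq(5)              // quintile group (1 to 5) of the random number
tabulate var
```

Because the 300 random numbers are distinct, each quintile group should contain 60 observations, and unlike the sort-based approaches this leaves the data in their original order.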
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#8

24 Aug 2022, 08:00

William Lisowski, yes, there is a fun story to this.

In finance a common exercise is to form "roughly equally sized" portfolios (that is, with a roughly equal number of firms inside) after sorting by some characteristic of the individual stocks, e.g., by firm size or firm book/market ratio.

Here comes the ancient and now deprecated function -gen, group()-; the function seems to do exactly what a finance guy would need. The function -gen, group()- existed in Stata 7, and this is how I met her (it?). Even back in Stata 7 the function was very tersely documented, something along the lines of " 'gen, group()' splits your sorted data in roughly equal sized groups".

The good thing about -gen, group()- is that it is fast as lightning. It beats -xtile- and user-written -egen, xtile- by many orders of magnitude, and it beats by a little bit the fastest user-written alternative, which is the -xtile- version in -gtools- (-gxtile- or something like that). This matters a lot if you are dealing with a big finance dataset like CRSP stock returns; it can make the difference between having to wait for days for the code to execute and having it done in minutes.

The bad thing is that apparently nobody at StataCorp knows anymore what exactly -gen, group()- does or which algorithm it uses. The function is also badly behaved in the sense that it does not know how to handle missing values, and it does something funky if you specify a variable as an argument. E.g.,

gen newvar = group(oldvar)

is legitimate code and produces some result. What this result is, I do not know, and I think nobody else knows.

In my experience -gen, group()- is OK to use if you handle your missing values manually yourself, and if you pass just a number, not a variable, as the argument to the function.

Joro Kolev

Join Date: Aug 2018
Posts: 3050

24 Aug 2022, 08:13

Here is an illustration of the speed, with missing values handled manually:

Code:

. clear

. 
. set obs 1000000
Number of observations (_N) was 0, now 1,000,000.

. 
. gen norm = rnormal() in 100000/1000000
(99,999 missing values generated)

. 
. timer clear

. 
. timeit 1: xtile xq = norm, nq(10)

. 
. timer on 2

. 
. sort norm

. 
. gen gq = group(10) if !missing(norm)
(99,999 missing values generated)

. 
. timer off 2

. 
. tab xq

         10 |
  quantiles |
    of norm |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     90,001       10.00       10.00
          2 |     90,000       10.00       20.00
          3 |     90,000       10.00       30.00
          4 |     90,000       10.00       40.00
          5 |     90,000       10.00       50.00
          6 |     90,000       10.00       60.00
          7 |     90,000       10.00       70.00
          8 |     90,000       10.00       80.00
          9 |     90,000       10.00       90.00
         10 |     90,000       10.00      100.00
------------+-----------------------------------
      Total |    900,001      100.00

. 
. tab gq

         gq |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |    100,000       11.11       11.11
          2 |    100,000       11.11       22.22
          3 |    100,000       11.11       33.33
          4 |    100,000       11.11       44.44
          5 |    100,000       11.11       55.56
          6 |    100,000       11.11       66.67
          7 |    100,000       11.11       77.78
          8 |    100,000       11.11       88.89
          9 |    100,000       11.11      100.00
         10 |          1        0.00      100.00
------------+-----------------------------------
      Total |    900,001      100.00

. 
. timer list
   1:      1.16 /        1 =       1.1560
   2:      0.17 /        1 =       0.1730

But if one does not take care of the missing values, -gen, group()- treats them like any other value, i.e., it pays no special attention to them:

Code:

. gen gqwrong = group(10)

.
. tab gqwrong

    gqwrong |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |    100,000       10.00       10.00
          2 |    100,000       10.00       20.00
          3 |    100,000       10.00       30.00
          4 |    100,000       10.00       40.00
          5 |    100,000       10.00       50.00
          6 |    100,000       10.00       60.00
          7 |    100,000       10.00       70.00
          8 |    100,000       10.00       80.00
          9 |    100,000       10.00       90.00
         10 |    100,000       10.00      100.00
------------+-----------------------------------
      Total |  1,000,000      100.00

Last edited by Joro Kolev; 24 Aug 2022, 08:20. Reason: Tabulated the same variable twice.

Joro Kolev

Join Date: Aug 2018

Posts: 3050
#10

24 Aug 2022, 08:37

And we have now learnt why StataCorp deprecated the function -gen, group()- :P . In this example it produces what we very much do not want it to produce...

I should remove this thing from my Stata repertoire, and I should never use it again.

In my defence, I have never seen this before. I have written an egen function that does more or less what -egen, xtile()- does but built on -gen, group()-. In testing it, I noticed that it does not do exactly what -xtile- does, but the discrepancies were arguably reasonable: when you have a set which cannot be split exactly into equal groups, -xtile- would put an odd member here, and -gen, group()- would put the odd member there... It is the first time I am seeing -gen, group()- produce such a crazy split into "equal groups", i.e., 9 of the groups really equal and the odd member placed in its own group, resulting in 9 groups having 100,000 observations each and the last group having the 1 odd observation.

Thankfully I never used my egen function built on -gen, group()- for actual research, it was more of a fun project.
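The lopsided split in the example is consistent with group(10) cutting on row position within all 1,000,000 rows, so that the if qualifier only blanks out results instead of changing the cut points; that reading is inferred from the output and is not documented behaviour. A documented way to get near-equal groups among the non-missing values after sorting is to use _n directly. The following is only a sketch for a do-file, with an arbitrary seed and the local macro n introduced for illustration.

```stata
clear
set seed 20220824                       // arbitrary seed (assumption)
set obs 1000000
generate norm = rnormal() in 100000/1000000

sort norm                               // missing values sort last
count if !missing(norm)
local n = r(N)                          // number of non-missing values

* position among the non-missing rows, scaled to 10 groups
generate byte gq = ceil(10 * _n / `n') if !missing(norm)
tabulate gq
```

Group sizes here should differ by at most one observation. Ties in norm are split by sort order, which is not how xtile treats them, so the two need not agree when the data contain repeated values.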