  • Problem creating variable with equal discrete values

    Hi, I am using Stata 14.2 (in case that changes anything).

    I have a dataset with 300 observations, and I want to create a new variable that takes 5 discrete values with equal frequency, assigned at random across the sample (300 obs).
    I.e.,
    var   Freq.   Perc.   Cum.
    1     60      20      20
    2     60      20      40
    3     60      20      60
    4     60      20      80
    5     60      20      100

    Where var is an ordinal variable.
    This is in order to check that the observations (from experimental games) are statistically different from random frequencies.

    I don't know if I'm going about this in the correct way or not. Thus, any help is greatly appreciated.

  • #2
    Well, since you haven't shown us your code, we also don't know if you're going about this in the correct way.

    Since you want exactly 1/5 of your observations in each of your 5 groups, which is possible since your number of observations is a multiple of 5, I would approach the task in the following way.
    Code:
    . generate r = runiform()
    
    . sort r
    
    . generate var = ceil(_n/60)
    
    . tab var
    
            var |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |         60       20.00       20.00
              2 |         60       20.00       40.00
              3 |         60       20.00       60.00
              4 |         60       20.00       80.00
              5 |         60       20.00      100.00
    ------------+-----------------------------------
          Total |        300      100.00
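
    The same idea can be written without hard-coding the group size of 60. What follows is a minimal sketch, not part of the original reply: it assumes a fresh dataset of 300 observations, and the seed value is an arbitrary placeholder added only so that the random assignment can be reproduced.

    ```stata
    * Sketch: randomly assign 5 equally sized groups (assumes _N is a multiple of 5)
    clear
    set obs 300
    * Placeholder seed, chosen arbitrarily for reproducibility
    set seed 12345
    generate double r = runiform()
    * Sorting on r puts the observations in random order
    sort r
    * Slice the sorted data into 5 blocks; equals ceil(_n/60) when _N is 300
    generate byte var = ceil(5 * _n / _N)
    tabulate var
    ```

    In a real dataset, drop the clear and set obs lines and run the remaining commands on the data in memory.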

    • #3
      Thank you very much, William. That works perfectly for what I need.

      • #4
        This would also do it:

        Code:
        . clear
        
        . set obs 300
        Number of observations (_N) was 0, now 300.
        
        . gen r = runiform()
        
        . egen var = cut(r), group(5)
        
        . replace var = var + 1
        (300 real changes made)
        
        . tab var
        
                var |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  1 |         60       20.00       20.00
                  2 |         60       20.00       40.00
                  3 |         60       20.00       60.00
                  4 |         60       20.00       80.00
                  5 |         60       20.00      100.00
        ------------+-----------------------------------
              Total |        300      100.00

        • #5
          And this would also do it:

          Code:
          . set obs 300
          Number of observations (_N) was 0, now 300.
          
          . gen r = runiform()
          
          . sort r
          
          . gen var = group(5)
          
          . tab var
          
                  var |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    1 |         60       20.00       20.00
                    2 |         60       20.00       40.00
                    3 |         60       20.00       60.00
                    4 |         60       20.00       80.00
                    5 |         60       20.00      100.00
          ------------+-----------------------------------
                Total |        300      100.00

          • #6
            Joro Kolev

            Wow. That's new to me.

            I notice that the group() function does not appear in the Stata Functions Reference Manual PDF included with Stata 17.

            Is there a story to this? It doesn't appear in the output of help undocumented either.

            It does work on my fully updated copy of Stata 17.

            • #7
              @William Lisowski

              For more about the group() function in Stata (not that in egen) see e.g.

              https://www.statalist.org/forums/for...riable1-create

              https://www.stata.com/statalist/arch.../msg00260.html

              I think anyone should be much better off with xtile here.

              It would be good if StataCorp made this undocumented function inaccessible except under version control. My guess is that many uses are accidents, caused by confusion between gen and egen or, equivalently, by typos.
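
              A minimal sketch of the xtile route suggested here, assuming a fresh dataset of 300 observations and an arbitrary placeholder seed (neither is from the thread). xtile with nquantiles(5) numbers the quintiles of r from 1 to 5, so no sort and no "+ 1" adjustment is needed.

              ```stata
              * Sketch: equal-sized random groups via the documented xtile command
              clear
              set obs 300
              * Placeholder seed, chosen arbitrarily for reproducibility
              set seed 2022
              generate double r = runiform()
              * Quintile categories of the random number, coded 1 to 5
              xtile var = r, nquantiles(5)
              tabulate var
              ```

              Unlike the undocumented group() function, xtile leaves the result missing where its argument is missing, so missing values need no manual handling.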

              • #8
                William Lisowski, yes, there is a fun story to this.

                In Finance a common exercise is to form "roughly equally sized" portfolios (that is, with roughly equal number of firms inside) after sorting by some characteristic of the individual stocks, e.g., by firm size or firm book/market ratio.

                Here comes the ancient and now deprecated function -gen, group()-; the function seems to do exactly what a finance guy would need. The function -gen, group()- existed in Stata 7, and this is how I met her (it?). Even back in Stata 7 the function was very tersely documented, something along the lines of " 'gen, group()' splits your sorted data in roughly equal sized groups".

                The good thing about -gen, group()- is that it is fast as lightning. It beats -xtile- and the user-written -egen, xtile- by many orders of magnitude, and it beats by a little bit the fastest user-written alternative, which is the -xtile- version in -gtools- (-gxtile- or something like that). This matters a lot if you are dealing with a big finance dataset like CRSP stock returns; it can make the difference between having to wait days for the code to execute and having it done in minutes.

                The bad thing is that apparently nobody at StataCorp knows anymore what exactly -gen, group()- does or what algorithm it uses. The function is also badly behaved in the sense that it does not know how to handle missing values, and it does something funky if you specify a variable as an argument. E.g.,

                gen newvar = group(oldvar)

                is legitimate code and produces some result. What this result is, I do not know, and I think nobody else knows.

                In my experience -gen, group()- is OK to use if you handle your missing values manually yourself and if you pass a number, rather than a variable, as the argument to the function.



                • #9
                  Here is an illustration of the speed, with missing values handled manually:

                  Code:
                  . clear
                  
                  . 
                  . set obs 1000000
                  Number of observations (_N) was 0, now 1,000,000.
                  
                  . 
                  . gen norm = rnormal() in 100000/1000000
                  (99,999 missing values generated)
                  
                  . 
                  . timer clear
                  
                  . 
                  . timeit 1: xtile xq = norm, nq(10)
                  
                  . 
                  . timer on 2
                  
                  . 
                  . sort norm
                  
                  . 
                  . gen gq = group(10) if !missing(norm)
                  (99,999 missing values generated)
                  
                  . 
                  . timer off 2
                  
                  . 
                  . tab xq
                  
                           10 |
                    quantiles |
                      of norm |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                            1 |     90,001       10.00       10.00
                            2 |     90,000       10.00       20.00
                            3 |     90,000       10.00       30.00
                            4 |     90,000       10.00       40.00
                            5 |     90,000       10.00       50.00
                            6 |     90,000       10.00       60.00
                            7 |     90,000       10.00       70.00
                            8 |     90,000       10.00       80.00
                            9 |     90,000       10.00       90.00
                           10 |     90,000       10.00      100.00
                  ------------+-----------------------------------
                        Total |    900,001      100.00
                  
                  . 
                  . tab gq
                  
                           gq |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                            1 |    100,000       11.11       11.11
                            2 |    100,000       11.11       22.22
                            3 |    100,000       11.11       33.33
                            4 |    100,000       11.11       44.44
                            5 |    100,000       11.11       55.56
                            6 |    100,000       11.11       66.67
                            7 |    100,000       11.11       77.78
                            8 |    100,000       11.11       88.89
                            9 |    100,000       11.11      100.00
                           10 |          1        0.00      100.00
                  ------------+-----------------------------------
                        Total |    900,001      100.00
                  
                  . 
                  . timer list
                     1:      1.16 /        1 =       1.1560
                     2:      0.17 /        1 =       0.1730
                  But if one does not take care of the missing values, -gen, group()- treats them like any other value, i.e., it does not pay special attention to them:

                  Code:
                  . gen gqwrong = group(10)
                  
                  .
                  . tab gqwrong
                  
                      gqwrong |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                            1 |    100,000       10.00       10.00
                            2 |    100,000       10.00       20.00
                            3 |    100,000       10.00       30.00
                            4 |    100,000       10.00       40.00
                            5 |    100,000       10.00       50.00
                            6 |    100,000       10.00       60.00
                            7 |    100,000       10.00       70.00
                            8 |    100,000       10.00       80.00
                            9 |    100,000       10.00       90.00
                           10 |    100,000       10.00      100.00
                  ------------+-----------------------------------
                        Total |  1,000,000      100.00
                  Last edited by Joro Kolev; 24 Aug 2022, 08:20. Reason: Tabulated the same variable twice.
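
                  The gq table above (nine groups of 100,000 and one group with a single observation) is what one would get if group(10) numbered observations by their position among all 1,000,000 rows and the if qualifier only blanked out the missing ones afterwards. That reading is an assumption: the function is undocumented, and the ceil() formula below is a guess that matches the output shown, not a statement of the actual algorithm. The variable names mimic and adj are illustrative, and the seed is an arbitrary placeholder.

                  ```stata
                  * Sketch under an assumption: group(k) acts like ceil(k * _n / _N)
                  clear
                  set obs 1000000
                  * Placeholder seed, chosen arbitrarily for reproducibility
                  set seed 2022
                  generate norm = rnormal() in 100000/1000000
                  * Missing values sort last, so rows 1 to 900,001 are nonmissing
                  sort norm
                  * Denominator is _N, which still counts the 99,999 missing rows
                  generate mimic = ceil(10 * _n / _N) if !missing(norm)
                  * Adjusted version: divide by the number of nonmissing rows instead
                  count if !missing(norm)
                  local n = r(N)
                  generate adj = ceil(10 * _n / `n') if !missing(norm)
                  tabulate mimic
                  tabulate adj
                  ```

                  Under this assumption, mimic reproduces the lopsided split, because row 900,001 is the only nonmissing row past the 900,000 cutoff for group 9. The adj variable gives ten groups of about 90,000 each, which is close to the xtile result apart from where the odd observation lands.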

                  • #10
                    And we have now learnt why StataCorp deprecated the function -gen, group()- :P . In this example it produces what we very much do not want it to produce...

                    I should remove this thing from my Stata repertoire, and I should never use it again.

                    In my defence, I have never seen this before. I have written an egen function that does more or less what -egen, xtile()- does but is built on -gen, group()-. In testing it, I noticed that it does not do exactly what -xtile- does, but the discrepancies were arguably reasonable: when a set cannot be split exactly into equal groups, -xtile- would put an odd member here, and -gen, group()- would put the odd member there... This is the first time I am seeing -gen, group()- produce such a crazy split into "equal groups", i.e., 9 groups that really are equal, with 100 000 observations each, and the 1 odd observation placed in a last group of its own.

                    Thankfully I never used my egen function built on -gen, group()- for actual research, it was more of a fun project.
