  • Fastest code for finding unique values

    I have a dataset of one million observations and around 200 variables. I am struggling to find the fastest way to count the unique values in each variable. Any suggestions? Thank you for your consideration.

  • #2
    Perhaps levelsof:
    Code:
    sysuse auto, clear
    
    * build the list of all variables
    ds
    
    * for each variable, levelsof finds its distinct values;
    * r(r) holds their count (the value list itself is suppressed by qui)
    qui foreach v of varlist `r(varlist)' {
        levelsof `v'
        noi display "`v'" _col(20) r(r)
    }



    • #3
      -search distinct-



      • #4
        distinct from the Stata Journal may not be the fastest way, but it is among the most flexible.

        Code:
        . sysuse auto, clear
        (1978 automobile data)
        
        . distinct
        
        -------------------------------------
                      |     total   distinct
        --------------+----------------------
                 make |        74         74
                price |        74         74
                  mpg |        74         21
                rep78 |        69          5
             headroom |        74          8
                trunk |        74         18
               weight |        74         64
               length |        74         47
                 turn |        74         18
         displacement |        74         31
           gear_ratio |        74         36
              foreign |        74          2
        -------------------------------------
        
        . distinct, sort(di)
        
        -------------------------------------
                      |     total   distinct
        --------------+----------------------
              foreign |        74          2
                rep78 |        69          5
             headroom |        74          8
                trunk |        74         18
                 turn |        74         18
                  mpg |        74         21
         displacement |        74         31
           gear_ratio |        74         36
               length |        74         47
               weight |        74         64
                 make |        74         74
                price |        74         74
        -------------------------------------

        The original 2008 paper remains relevant (and argues that distinct is a much better term than unique), but download the most recent version of the code, which at the time of writing is from 2020.


        Code:
        .
        SJ-20-4 dm0042_3  . . . . . . . . . . . . . . . . Software update for distinct
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q4/20   SJ 20(4):1028--1030
                sort() option has been added
        
        SJ-8-4  dm0042  . . . . . . . . . . . .  Speaking Stata: Distinct observations
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q4/08   SJ 8(4):557--568
                shows how to answer questions about distinct observations
                from first principles; provides a convenience command

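        In case it is useful, one way to pull that 2020 update down from within Stata is sketched below. The package name dm0042_3 comes from the listing above; the net sj shortcut and the install steps are the usual Stata Journal conventions rather than anything shown in this thread, so adjust as needed:

        Code:
        * sketch: install the updated distinct from the Stata Journal archive;
        * dm0042_3 is the package named in the search listing above
        net sj 20-4 dm0042_3
        net install dm0042_3, replace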


        • #5
          If you need to do this repeatedly, then look into ftools or gtools. If you only need to do it once, then writing the post on Statalist has already taken more time than those tools would save you over the alternatives mentioned above.



          • #6
            levelsof and distinct are possible candidates for solving the problem; however, the essence of this request is the speed of the code. Daniel, could you please post an example that uses ftools to make a table of unique values for all variables?
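            As a rough sketch of the kind of thing being asked for here (assuming ftools is installed, e.g. via ssc install ftools, and that fegen's group() behaves like egen's group(), numbering the distinct nonmissing values of a variable from 1 upward so that the maximum of the generated variable equals the distinct count):

            Code:
            sysuse auto, clear
            qui ds
            foreach v of varlist `r(varlist)' {
                * tag each distinct nonmissing value with an integer id
                tempvar g
                qui fegen `g' = group(`v')
                * the largest id is the number of distinct values
                qui summarize `g', meanonly
                display "`v'" _col(20) r(max)
                drop `g'
            }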



            • #7
              gdistinct from gtools is certainly faster than distinct [NB] but it lacks the sorting options of the latter, which I would find essential in analysing and reporting on 200 variables. When scanning the output of gdistinct, it is harder to see general patterns or extreme values. You could write extra code to work on its saved results. That would ... spend time to save time (@daniel klein's point).
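              For anyone weighing the speed claim, a rough timing comparison on data closer in size to #1 is sketched below. It assumes gtools is installed (for example via ssc install gtools) and that gdistinct accepts a varlist the same way distinct does; only 20 variables are generated to keep the demonstration quick:

              Code:
              * simulate roughly #1-sized data (1,000,000 obs, integer-valued variables)
              clear
              set obs 1000000
              forvalues j = 1/20 {
                  generate x`j' = floor(500*runiform()) + 1
              }
              * time distinct against gdistinct; quietly suppresses the tables
              timer clear
              timer on 1
              quietly distinct _all
              timer off 1
              timer on 2
              quietly gdistinct _all
              timer off 2
              timer list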
