  • Fastest code for finding unique values

    I have a dataset of one million observations and around 200 variables. I am struggling to find the fastest way to count the unique values in each variable. Any suggestions? Thank you for your consideration.

  • #2
    Perhaps levelsof:
    Code:
    sysuse auto, clear
    
    * build the list of all variables
    ds
    
    * for each variable, levelsof finds its distinct values;
    * r(r) holds their count (the value list itself is suppressed by qui)
    qui foreach v of varlist `r(varlist)' {
        levelsof `v'
        noi display "`v'" _col(20) r(r)
    }



    • #3
      -search distinct-



      • #4
        distinct from the Stata Journal may not be the fastest way, but it is among the most flexible.

        Code:
        . sysuse auto, clear
        (1978 automobile data)
        
        . distinct
        
        -------------------------------------
                      |     total   distinct
        --------------+----------------------
                 make |        74         74
                price |        74         74
                  mpg |        74         21
                rep78 |        69          5
             headroom |        74          8
                trunk |        74         18
               weight |        74         64
               length |        74         47
                 turn |        74         18
         displacement |        74         31
           gear_ratio |        74         36
              foreign |        74          2
        -------------------------------------
        
        . distinct, sort(di)
        
        -------------------------------------
                      |     total   distinct
        --------------+----------------------
              foreign |        74          2
                rep78 |        69          5
             headroom |        74          8
                trunk |        74         18
                 turn |        74         18
                  mpg |        74         21
         displacement |        74         31
           gear_ratio |        74         36
               length |        74         47
               weight |        74         64
                 make |        74         74
                price |        74         74
        -------------------------------------

        The original 2008 paper remains relevant (and argues that distinct is a much better term than unique), but download the most recent version of the code, which at the time of writing is from 2020.


        Code:
        .
        SJ-20-4 dm0042_3  . . . . . . . . . . . . . . . . Software update for distinct
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q4/20   SJ 20(4):1028--1030
                sort() option has been added
        
        SJ-8-4  dm0042  . . . . . . . . . . . .  Speaking Stata: Distinct observations
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q4/08   SJ 8(4):557--568
                shows how to answer questions about distinct observations
                from first principles; provides a convenience command

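        In case it is useful, one way to pull that 2020 update down from within Stata is sketched below. The package name dm0042_3 comes from the listing above; the net sj shortcut and the install steps are the usual Stata Journal conventions rather than anything shown in this thread, so adjust as needed:

        Code:
        * sketch: install the updated distinct from the Stata Journal archive;
        * dm0042_3 is the package named in the search listing above
        net sj 20-4 dm0042_3
        net install dm0042_3, replace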


        • #5
          If you need to do this repeatedly, then look into ftools or gtools. If you only need to do it once, then writing the post on Statalist has already taken more time than those tools would save you over the alternatives mentioned above.



          • #6
            levelsof and distinct are possible candidates for solving the problem; however, the essence of this request is the speed of the code. Daniel, could you please post an example that uses ftools to make a table of unique values for all variables?
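            As a rough sketch of the kind of thing being asked for here (assuming ftools is installed, e.g. via ssc install ftools, and that fegen's group() behaves like egen's group(), numbering the distinct nonmissing values of a variable from 1 upward so that the maximum of the generated variable equals the distinct count):

            Code:
            sysuse auto, clear
            qui ds
            foreach v of varlist `r(varlist)' {
                * tag each distinct nonmissing value with an integer id
                tempvar g
                qui fegen `g' = group(`v')
                * the largest id is the number of distinct values
                qui summarize `g', meanonly
                display "`v'" _col(20) r(max)
                drop `g'
            }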



            • #7
              gdistinct from gtools is certainly faster than distinct [NB] but it lacks the sorting options of the latter, which I would find essential in analysing and reporting on 200 variables. When scanning the output of gdistinct, it is harder to see general patterns or extreme values. You could write extra code to work on its saved results. That would ... spend time to save time (@daniel klein's point).
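              For anyone weighing the speed claim, a rough timing comparison on data closer in size to #1 is sketched below. It assumes gtools is installed (for example via ssc install gtools) and that gdistinct accepts a varlist the same way distinct does; only 20 variables are generated to keep the demonstration quick:

              Code:
              * simulate roughly #1-sized data (1,000,000 obs, integer-valued variables)
              clear
              set obs 1000000
              forvalues j = 1/20 {
                  generate x`j' = floor(500*runiform()) + 1
              }
              * time distinct against gdistinct; quietly suppresses the tables
              timer clear
              timer on 1
              quietly distinct _all
              timer off 1
              timer on 2
              quietly gdistinct _all
              timer off 2
              timer list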
