  • Counting the different values of a varlist

    Hello Statalist,

    I am currently writing my bachelor thesis in applied statistics and need your help with something that seems simpler than it might be.

    I have a variable personid and I'm trying to figure out how many different people there are in my data set. It has values from 1 to 5378. However, not every value between 1 and 5378 is used. Also, some values appear multiple times (but refer to the same person).

    Is there an easy way to count the number of different personids in my data set?
    I also found this thread, however the instructions didn't work for me. https://www.stata.com/statalist/arch.../msg00745.html

    Thank you in advance!

    Best,
    Markus

  • #2
    Markus:
    welcome to this forum.
    First off, I would take a look at -duplicates- to investigate whether a given -personid- is repeated in your dataset.
    Kind regards,
    Carlo
    (Stata 18.0 SE)
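
    A minimal sketch of that check, assuming the identifier is named personid as in #1. In the table that -duplicates report- prints, each row lists a number of copies with its observations and surplus; observations minus surplus, summed over the rows, is the number of distinct personid values.

    Code:
    * how often is each personid repeated?
    duplicates report personid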



    • #3
      Markus,
      You can quickly find the number of unique personids using the following code:

      Code:
      egen tag = tag(personid)
      egen unique = total(tag)
      The result in unique should give you your answer.
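
      For comparison, here is a sketch of the same idea from first principles with by:. The variable name first is only illustrative. One difference to keep in mind: egen's tag() does not tag observations with missing personid unless its missing option is given, whereas by: treats missing as a group of its own, so the sketch excludes missing values explicitly.

      Code:
      * flag the first observation within each block of identical personid
      bysort personid: gen byte first = _n == 1
      * leave missing identifiers out of the count, as tag() does by default
      replace first = 0 if missing(personid)
      * the number reported is the number of distinct personid values
      count if first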



      • #4
        The thread cited in #1 is about counting variables, which is a completely different question.

        The problem has been discussed massively: the key is knowing which keywords will get you good answers.

        Code:
        search distinct
        yields many hits, and this is a capricious selection.


        Code:
        [P]     levelsof  . . . . . . . . . . . . . . . . . . . . . Levels of variable
                (help levelsof)
        
        FAQ     . . . . . . . . . . . . . . . . . . .  Number of distinct observations
                . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox and G. Longton
                4/15    How do I compute the number of distinct observations?
                        http://www.stata.com/support/faqs/data-management/
                        number-of-distinct-observations/
        
        SJ-15-3 dm0042_2  . . . . . . . . . . . . . . . . Software update for distinct
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q3/15   SJ 15(3):899
                improved table format and display of large numbers of
                observations
        
        SJ-12-2 dm0042_1  . . . . . . . . . . . . . . . . Software update for distinct
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q2/12   SJ 12(2):352
                options added to restrict output to variables with a minimum
                or maximum of distinct values
        
        SJ-8-4  dm0042  . . . . . . . . . . . .  Speaking Stata: Distinct observations
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q4/08   SJ 8(4):557--568
                shows how to answer questions about distinct observations
                from first principles; provides a convenience command
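
        As a rough sketch of how two of these hits could be applied to the question in #1: this assumes personid is numeric and that distinct has already been installed. The local macro names ids and n are only illustrative.

        Code:
        * official Stata: collect the distinct levels, then count them
        quietly levelsof personid, local(ids)
        local n : word count `ids'
        display "distinct personid values: `n'"
        
        * community-contributed: distinct reports the count directly
        distinct personid
        display r(ndistinct)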



        • #5
          there's also -unique- (downloadable from SSC) which may duplicate and/or complement -distinct-
          ____________________________________________________
          Assistant Professor, Department of Biostatistics and Epidemiology
          School of Public Health and Health Sciences
          University of Massachusetts- Amherst



          • #6
            unique (SSC) is not unique.

            Nothing hinges on it, but when writing distinct (with Gary Longton) we (or at least I) left creation of a new variable on one side as already covered by code similar or even identical to that in #3.

            The two programs have been ignoring each other for about a decade, without there being human ill-will at all!

            On #3

            Code:
            egen tag = tag(personid)
            count if tag
            has the very small consequence of not creating a variable that only contains the same constant again and again.
            Last edited by Nick Cox; 20 Nov 2017, 13:22.
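
            If the number is needed later in a do-file rather than just on screen, one possibility is to pick it up from r(N), which count leaves behind. This sketch assumes the tag variable created above already exists; the local macro name ndistinct is only illustrative.

            Code:
            * count stores its result in r(N)
            quietly count if tag
            local ndistinct = r(N)
            display "distinct personid values: `ndistinct'"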



            • #7
              I'd vote to remind the poster about the codebook command. It gives useful information, including the number of unique values. And it's built in and should be one of the basic commands everyone knows.
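
              A short sketch, assuming the variable is personid as in #1. The default output includes a "unique values" entry; the compact option gives a one-line summary per variable with a Unique column.

              Code:
              * full description, including the number of unique values
              codebook personid
              * one-line summary with a Unique column
              codebook personid, compact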



              • #8
                While it is true that -unique- (SSC) is not unique, it is also not the same as -distinct- when more than one variable is included in the command. Here is an example:

                Code:
                . sysuse auto
                (1978 Automobile Data)
                
                . unique fore rep78
                Number of unique values of foreign rep78 is  8
                Number of records is  69
                
                . distinct fore rep78
                
                --------------------------------
                         |     total   distinct
                ---------+----------------------
                 foreign |        74          2
                   rep78 |        69          5
                --------------------------------



                • #9
                  How to get the same answer is documented:

                  Code:
                  . sysuse auto
                  (1978 Automobile Data)
                  
                  . unique fore rep78
                  Number of unique values of foreign rep78 is  8
                  Number of records is  69
                  
                  . distinct fore rep78, joint 
                  
                  ----------------------------------
                             |     total   distinct
                  -----------+----------------------
                   (jointly) |        69          8
                  ----------------------------------
                  I don't really need to remind my friend Rich to read the help!



                  • #10
                    well, obviously you do! <grin>



                    • #11
                      I am impressed. You guys were very helpful and answered much faster than I anticipated. Thank you very much.
                      The code supplied by Meg worked perfectly and really stopped me from having a bad day.

                      Also, I guess I had to live up to the "rookie can't use the search function" cliche.

                      Kind regards,
                      Markus



                      • #12
                        Originally posted by Andrew Lover:
                        there's also -unique- (downloadable from SSC) which may duplicate and/or complement -distinct-
                        I certainly didn't mean to imply anything nefarious, just two ados with different authors and similar scope (convergent evolution?)!



                        • #13
                          Nothing nefarious either implied or inferred.

                          unique came first. So far as I can recall the main rationale for distinct was a preference for showing the number of distinct values separately for each variable. As above, the joint option matches the behaviour of unique.
                          Last edited by Nick Cox; 20 Nov 2017, 18:26.
