
  • How to check if the number of distinct observations is the same for two variables?

    Hi,

    I have generated a new ID variable and want to check whether the number of distinct values of the new ID is the same as that of the original ID.

    Here is the sample data:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str13(new_hhid temp_hhid)
    "10020040009" "1    2  1   9"
    "10020040009" "1    2  1   9"
    "10020040017" "1    2  1  17"
    "10020040017" "1    2  1  17"
    "10020040017" "1    2  1  17"
    "10020040017" "1    2  1  17"
    "10020040033" "1    2  1  33"
    "10020040033" "1    2  1  33"
    "10020040033" "1    2  1  33"
    "10020040033" "1    2  1  33"
    "10020040041" "1    2  1  41"
    "10020040041" "1    2  1  41"
    "10020040041" "1    2  1  41"
    "10020040049" "1    2  1  49"
    "10020040049" "1    2  1  49"
    "10020040049" "1    2  1  49"
    "10020040057" "1    2  1  57"
    "10020040057" "1    2  1  57"
    "10020040057" "1    2  1  57"
    "10020040057" "1    2  1  57"
    "10020040057" "1    2  1  57"
    "10020040065" "1    2  1  65"
    "10020040065" "1    2  1  65"
    "10020040065" "1    2  1  65"
    "10020040073" "1    2  1  73"
    "10020040073" "1    2  1  73"
    "10020040073" "1    2  1  73"
    "10020040073" "1    2  1  73"
    "10020040089" "1    2  1  89"
    "10020040089" "1    2  1  89"
    "10020040089" "1    2  1  89"
    "10020040089" "1    2  1  89"
    "10020040089" "1    2  1  89"
    "10020040089" "1    2  1  89"
    "10020040089" "1    2  1  89"
    "10020040113" "1    2  1 113"
    "10020040113" "1    2  1 113"
    "10020040113" "1    2  1 113"
    "10020040121" "1    2  1 121"
    "10020040121" "1    2  1 121"
    "10020040121" "1    2  1 121"
    "10020040121" "1    2  1 121"
    "10020040121" "1    2  1 121"
    "10020040121" "1    2  1 121"
    "10020040121" "1    2  1 121"
    "10020040121" "1    2  1 121"
    "10020040129" "1    2  1 129"
    "10020040137" "1    2  1 137"
    "10020040145" "1    2  1 145"
    "10020040145" "1    2  1 145"
    "10020040145" "1    2  1 145"
    "10020040153" "1    2  1 153"
    "10020040153" "1    2  1 153"
    "10020040153" "1    2  1 153"
    "10020040177" "1    2  1 177"
    "10020040201" "1    2  1 201"
    "10020040209" "1    2  1 209"
    "10020040209" "1    2  1 209"
    "10020040209" "1    2  1 209"
    "10020040209" "1    2  1 209"
    "10020040209" "1    2  1 209"
    "10020040209" "1    2  1 209"
    "10020040209" "1    2  1 209"
    "10020040209" "1    2  1 209"
    "10020040209" "1    2  1 209"
    "10020040209" "1    2  1 209"
    "10020040217" "1    2  1 217"
    "10020040217" "1    2  1 217"
    "10020040233" "1    2  1 233"
    "10020040233" "1    2  1 233"
    "10020040233" "1    2  1 233"
    "10020040241" "1    2  1 241"
    "10020040249" "1    2  1 249"
    "10020040249" "1    2  1 249"
    "10020040249" "1    2  1 249"
    "10020040257" "1    2  1 257"
    "10020040257" "1    2  1 257"
    "10020040257" "1    2  1 257"
    "10020040257" "1    2  1 257"
    "10020040257" "1    2  1 257"
    "10020040257" "1    2  1 257"
    "10020040273" "1    2  1 273"
    "10020040273" "1    2  1 273"
    "10020040273" "1    2  1 273"
    "10020040273" "1    2  1 273"
    "10020040273" "1    2  1 273"
    "10020040273" "1    2  1 273"
    "10020040281" "1    2  1 281"
    "10020040281" "1    2  1 281"
    "10020040281" "1    2  1 281"
    "10020040289" "1    2  1 289"
    "10020040289" "1    2  1 289"
    "10020040289" "1    2  1 289"
    "10020040289" "1    2  1 289"
    "10020050054" "1    2  2  54"
    "10020050054" "1    2  2  54"
    "10020050054" "1    2  2  54"
    "10020050054" "1    2  2  54"
    "10020050054" "1    2  2  54"
    "10020050054" "1    2  2  54"
    end
    new_hhid is the ID I generated; temp_hhid is the ID available in the original data. To check that I generated the ID properly, I would like to verify that the number of distinct values of new_hhid is the same as that of temp_hhid.

    I would appreciate any help.

    Thanks

  • #2
    Use distinct (from SSC):
    Code:
    . distinct new_hhid temp_hhid
    
               |        Observations
               |      total   distinct
    -----------+----------------------
      new_hhid |        100         27
     temp_hhid |        100         27
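
    If distinct is not installed, a minimal sketch using official commands only should give the same counts. The variable names tag_new and tag_temp are made up for illustration.

    ```stata
    * egen's tag() function marks exactly one observation per distinct value
    egen byte tag_new  = tag(new_hhid)
    egen byte tag_temp = tag(temp_hhid)

    * each count is the number of distinct values of that variable
    count if tag_new
    count if tag_temp
    ```

    On the example data, both counts should agree with the distinct column reported above.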



    • #3
      Originally posted by Øyvind Snilsberg
      distinct (SSC),
      Code:
      . distinct new_hhid temp_hhid

                 |        Observations
                 |      total   distinct
      -----------+----------------------
        new_hhid |        100         27
       temp_hhid |        100         27

      Thanks. This worked perfectly.



      • #4
        The version of distinct at the Stata Journal site is more up to date than the one on SSC.

        Code:
        .  search distinct, sj
        
        Search of official help files, FAQs, Examples, and Stata Journals
        
        SJ-20-4 dm0042_3  . . . . . . . . . . . . . . . . Software update for distinct
                (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
                Q4/20   SJ 20(4):1028--1030
                sort() option has been added
        More important for the likely underlying question: having the same number of distinct values does not guarantee that the two variables correspond one to one. That issue is addressed at https://www.stata.com/support/faqs/d...ions-in-group/
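
        One way to check the one-to-one correspondence directly is sketched below, using the variable names from the example in #1. This is only an illustration and not necessarily the approach taken in the FAQ; bad_new and bad_temp are hypothetical names.

        ```stata
        * within each new_hhid, sort on temp_hhid; if the first and last values
        * differ, that new_hhid maps to more than one temp_hhid
        bysort new_hhid (temp_hhid): gen byte bad_new = temp_hhid[1] != temp_hhid[_N]

        * the same check in the other direction
        bysort temp_hhid (new_hhid): gen byte bad_temp = new_hhid[1] != new_hhid[_N]

        * show any offending observations, then stop with an error if there are any
        list new_hhid temp_hhid if bad_new | bad_temp
        assert !bad_new & !bad_temp
        ```

        If the assert passes silently, each value of new_hhid pairs with exactly one value of temp_hhid and vice versa.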
