  • Randomly generating an ID which is same for duplicates of an observation

    There are 4 states coded 1,2,3 and 4. Within each state, I have generated a unique id to identify mobile numbers using the following code:

    Code:
    set seed 2803
    gen random=runiform()
    bysort state (random): gen id=_n
    However, the master data set (with over 600,000 mobile numbers) contains several numbers that are duplicated. How can I ensure that every repeated mobile number is assigned the same id? For instance, if "1234567890" appears 3 times and currently has 3 different ids, I want to generate a new variable that assigns the same id to these 3 observations with the same phone number.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float state str11(mobile gender)
    4 "6000000000" "Male"  
    4 "6000000000" "Female"
    4 "6000000000" "Male"  
    4 "61111111111" "Male"  
    4 "61111111114" "Female"
    4 "6062728189" "Female"
    3 "6839299911" "Female"
    4 "1212112333" "Male"  
    3 "2312312312" "Male"  
    3 "3037655823" "Female"
    3 "3048692376" "Male"  
    4 "4056439806" "Female"
    2 "4060072276" "Male"  
    2 "5062303315" "Male"  
    2 "5562589458" "Male"  
    2 "5062653281" "Male"  
    3 "5065858911" "Female"
    3 "1066464736" "Male"  
    1 "1066659166" "Female"
    3 "1067447415" "Male"  
    4 "1068252085" "Male"  
    3 "1069373244" "Male"  
    4 "1070890654" "Male"  
    3 "1071245191" "Male"  
    4 "1076167740" "Male"  
    4 "1076543908" "Female"
    4 "1079472160" "Female"
    3 "1080984749" "Male"  
    2 "1111808637" "Female"
    4 "1183964153" "Female"
    4 "1084763982" "Female"
    4 "1084763982" "Female"
    4 "1085351526" "Male"  
    4 "1098216759" "Male"  
    4 "1098345612" "Female"
    4 "1098372939" "Male"  
    3 "1102671939" "Female"
    4 "1110099285" "Male"  
    4 "1110099285" "Male"  
    4 "1110099285" "Male"  
    4 "11111111111" "Male"  
    3 "1117037204" "Female"
    2 "1122530373" "Male"  
    3 "1123334455" "Male"  
    2 "1125303731" "Male"  
    4 "1134543654" "Male"  
    3 "1155288929" "Male"  
    3 "1156498641" "Male"  
    2 "1162462846" "Female"
    2 "1162652122" "Female"
    2 "1162652152" "Female"
    2 "1162867644" "Male"  
    4 "1172839450" "Male"  
    3 "1173431643" "Female"
    3 "1173710363" "Male"  
    3 "1179879464" "Female"
    3 "1180471007" "Male"  
    4 "1181636878" "Male"  
    3 "1194946161" "Male"  
    4 "1199519990" "Female"
    2 "1205525819" "Male"  
    2 "1206586264" "Male"  
    2 "1206586264" "Male"  
    2 "1208090188" "Male"  
    4 "1209870987" "Male"  
    4 "1211222333" "Female"
    4 "1222222222" "Male"  
    4 "1225025679" "Female"
    1 "1251640258" "Female"
    end
    Truly appreciate any help and guidance!

  • #2
    Code:
    egen ID = group(mobile)
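
    A minimal sketch of how one might check the result, assuming a small made-up extract in the shape of the example data above. The variable ndup is a hypothetical helper name, not something from the original post.

    ```stata
    * Toy data in the shape of the -dataex- example (values are illustrative)
    clear
    input float state str11 mobile
    4 "6000000000"
    4 "6000000000"
    3 "6839299911"
    4 "1110099285"
    4 "1110099285"
    end

    * One ID per distinct mobile number
    egen ID = group(mobile)

    * Every mobile number should map to exactly one ID
    bysort mobile (ID): assert ID[1] == ID[_N]

    * ndup counts the surplus copies of each number (0 = no duplicates)
    duplicates tag mobile, generate(ndup)
    list state mobile ID if ndup > 0, sepby(mobile)
    ```

    Note that group() numbers the groups in the sort order of mobile, so these IDs are not random, only consistent across duplicates.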


    • #3
      To Jorrit's ideal advice let me add a warning about what you have accomplished with your current code. Not only do you have identical phone numbers with different IDs, but the random numbers behind those IDs are almost certainly not distinct, because you stored them in a float variable rather than a double. That loses precision, so two random numbers that were different in double precision can be the same when "rounded" into a float, and the sort order among such ties (and hence the IDs) is then arbitrary. Here's an example copied from a recent thread on this topic.
      Code:
      . clear
      
      . set obs 600000
      number of observations (_N) was 0, now 600,000
      
      . set seed 666
      
      . generate double u = runiform()
      
      . generate float v = u
      
      . bysort u: assert _N==1
      
      . bysort v: assert _N==1
      6,985 contradictions in 592,948 by-groups
      assertion is false
      r(9);
      So do proceed along the path that Jorrit suggests.
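
      If the IDs should still come out in random order, one possible variation (a sketch on made-up data, not code from this thread) is to draw a single double-precision random number per distinct mobile number and then number the groups by that draw. The variable names random and id are illustrative.

      ```stata
      * Sketch only: toy data, variable names are illustrative
      clear
      input float state str11 mobile
      4 "6000000000"
      4 "6000000000"
      4 "6000000000"
      3 "6839299911"
      4 "1110099285"
      4 "1110099285"
      2 "1206586264"
      end

      set seed 2803

      * One draw per distinct number, held in a double so ties are very unlikely
      bysort mobile: generate double random = runiform() if _n == 1

      * Copy that draw to every duplicate (missing values sort last)
      bysort mobile (random): replace random = random[1]

      * Number the groups in random order; mobile breaks any tie in random
      egen long id = group(random mobile)

      * Check: duplicates share an id, and distinct numbers never do
      bysort mobile (id): assert id[1] == id[_N]
      bysort id (mobile): assert mobile[1] == mobile[_N]
      ```

      This numbers the mobile numbers across the whole data set. Numbering within state instead would need an extra by-group step, which is left out here.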
