Randomly generating an ID which is same for duplicates of an observation

Shruti Sheopurkar

Join Date: Nov 2017
Posts: 20

02 Jul 2019, 05:44

There are 4 states, coded 1, 2, 3 and 4. Within each state, I have generated a unique id to identify mobile numbers using the following code:

Code:

set seed 2803
gen random=runiform()
bysort state (random): gen id=_n

However, the master data set (with over 600,000 mobile numbers) contains several numbers that appear more than once. How can I ensure that every occurrence of a repeated mobile number is assigned the same id? For instance, if "1234567890" appears 3 times and currently has 3 different ids, I want to generate a new variable that assigns the same id to these 3 observations with the same phone number.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float state str11(mobile gender)
4 "6000000000" "Male"  
4 "6000000000" "Female"
4 "6000000000" "Male"  
4 "61111111111" "Male"  
4 "61111111114" "Female"
4 "6062728189" "Female"
3 "6839299911" "Female"
4 "1212112333" "Male"  
3 "2312312312" "Male"  
3 "3037655823" "Female"
3 "3048692376" "Male"  
4 "4056439806" "Female"
2 "4060072276" "Male"  
2 "5062303315" "Male"  
2 "5562589458" "Male"  
2 "5062653281" "Male"  
3 "5065858911" "Female"
3 "1066464736" "Male"  
1 "1066659166" "Female"
3 "1067447415" "Male"  
4 "1068252085" "Male"  
3 "1069373244" "Male"  
4 "1070890654" "Male"  
3 "1071245191" "Male"  
4 "1076167740" "Male"  
4 "1076543908" "Female"
4 "1079472160" "Female"
3 "1080984749" "Male"  
2 "1111808637" "Female"
4 "1183964153" "Female"
4 "1084763982" "Female"
4 "1084763982" "Female"
4 "1085351526" "Male"  
4 "1098216759" "Male"  
4 "1098345612" "Female"
4 "1098372939" "Male"  
3 "1102671939" "Female"
4 "1110099285" "Male"  
4 "1110099285" "Male"  
4 "1110099285" "Male"  
4 "11111111111" "Male"  
3 "1117037204" "Female"
2 "1122530373" "Male"  
3 "1123334455" "Male"  
2 "1125303731" "Male"  
4 "1134543654" "Male"  
3 "1155288929" "Male"  
3 "1156498641" "Male"  
2 "1162462846" "Female"
2 "1162652122" "Female"
2 "1162652152" "Female"
2 "1162867644" "Male"  
4 "1172839450" "Male"  
3 "1173431643" "Female"
3 "1173710363" "Male"  
3 "1179879464" "Female"
3 "1180471007" "Male"  
4 "1181636878" "Male"  
3 "1194946161" "Male"  
4 "1199519990" "Female"
2 "1205525819" "Male"  
2 "1206586264" "Male"  
2 "1206586264" "Male"  
2 "1208090188" "Male"  
4 "1209870987" "Male"  
4 "1211222333" "Female"
4 "1222222222" "Male"  
4 "1225025679" "Female"
1 "1251640258" "Female"
end

Truly appreciate any help and guidance!

Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#2

02 Jul 2019, 06:03

Code:

egen ID = group(mobile)
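
As a quick check on the line above, here is a minimal sketch that builds the same kind of ID under a different name and asserts that every copy of a number carries one ID. It assumes the example data from #1 are in memory; ID2 is a hypothetical variable name, not something from the original post. Note that group() numbers the groups in the sort order of mobile, so these IDs are not random.

```stata
* Sketch only: assumes the -dataex- example from #1 is in memory.
* ID2 is a hypothetical name chosen so it does not clash with ID.
egen long ID2 = group(mobile)                   // one integer code per distinct number
bysort mobile (ID2): assert ID2[1] == ID2[_N]   // every copy of a number shares one ID
list state mobile ID2 if mobile == "6000000000" // inspect one duplicated number
```

If the same number in two different states should count as two different IDs, group(state mobile) would be the analogous call.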
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

02 Jul 2019, 08:41

To Jorrit's ideal advice let me add a warning about what you have accomplished with your current code. Not only do you have identical phone numbers with different IDs; the random numbers behind those IDs are almost certainly not distinct either. You stored them in a float variable rather than a double, which loses precision, so two random numbers that were different in double precision become the same when "rounded" into a float. Observations tied on the random number are sorted in an arbitrary order, so the resulting IDs are not guaranteed to be reproducible. Here's an example copied from a recent thread on this topic.

Code:

. clear
. set obs 600000
number of observations (_N) was 0, now 600,000
. set seed 666
. generate double u = runiform()
. generate float v = u
. bysort u: assert _N==1
. bysort v: assert _N==1
6,985 contradictions in 592,948 by-groups
assertion is false
r(9);

So do proceed along the path that Jorrit suggests.
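
If the IDs still need to be in random order within state, as in the code in #1, one possible sketch combines both points: draw a single double-precision random number per distinct mobile number, copy it to the duplicates, and number the mobiles by that draw. This is an illustration under assumptions, not a tested recipe for the full data set: it assumes IDs are numbered within state as in the original code, and u and newid are hypothetical variable names.

```stata
* Sketch only: random IDs within state, shared by duplicate numbers.
* Assumes variables state and mobile as in the example in #1;
* u and newid are hypothetical names.
set seed 2803
bysort state mobile: gen double u = runiform() if _n == 1   // one double-precision draw per number
by state mobile: replace u = u[1]                            // copy the draw to its duplicates
* Sort numbers by their draw (mobile breaks any tie) and count distinct numbers.
bysort state (u mobile): gen long newid = sum(mobile != mobile[_n-1])
bysort state mobile (newid): assert newid[1] == newid[_N]    // duplicates share an ID
bysort state newid (mobile): assert mobile[1] == mobile[_N]  // each ID maps to one number
```

The two assert lines are there so the code stops with an error if either property fails.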