Dear Statalist,
I have a list of duplicates by firm and I would like to keep one observation by firm. Now, a survey response from a firm can either be done by a manager or by a point of contact. Here are all the relevant variable I need to use for the selection process :
- id : firm ID.
- datestamp_num : the date the person last answered the survey
- filename : The survey file the observation is from
- resp_status : The function of the person who answered. Can either be 1 (manager) or 2 (point of contact).
- resp_email : The email of the person who answered.
- lastpage : The number of pages answered. For instance if it is equal to 1, the respondent last stopped his online survey at page 1.
Here is the priority order I would like to have when dropping my duplicate observation :
(1) If only one manager answered the survey, I would like to keep his observation in priority unless the lastpage value of one or more of their point of contact is superior, in which case we should keep the latter (refer to point 4 in that case).
(2) If one manager answered several times the survey, I would like to keep the observation from the last filename using the "filename" variable.
(3) If there are at least two managers who answered the survey (they can be identified with their resp_email), I would like to keep the one with the highest lastpage value, and if they are the same, then I would like to keep the latest response with datestamp_num.
(4) Similarly, if there are no manager respondent, I would like to perform the same selection process for the points of contact than (2) and (3).
To sum up, the selection process should be by
- Relative value of lastpage
- Being a manager
- if several observations of the same person : keep the last filename
- if several observations of different persons : keep the highest number of page, then the latest answer
- Being a point of contact
- if several observations of the same person : keep the last filename
- if several observations of different persons : keep the highest number of page, then the latest answer
Here is a toy example using dataex:
I anonymized the email variable by encoding it, but this should be fine as it's intended to be used as an ID. I'm sorry because my problem may seem trivial and time-consuming (at least for me!) at the same time, therefore I thank everyone who would be willing to spend time on it.
I have a list of duplicates by firm and I would like to keep one observation by firm. Now, a survey response from a firm can either be done by a manager or by a point of contact. Here are all the relevant variable I need to use for the selection process :
- id : firm ID.
- datestamp_num : the date the person last answered the survey
- filename : The survey file the observation is from
- resp_status : The function of the person who answered. Can either be 1 (manager) or 2 (point of contact).
- resp_email : The email of the person who answered.
- lastpage : The number of pages answered. For instance if it is equal to 1, the respondent last stopped his online survey at page 1.
Here is the priority order I would like to have when dropping my duplicate observation :
(1) If only one manager answered the survey, I would like to keep his observation in priority unless the lastpage value of one or more of their point of contact is superior, in which case we should keep the latter (refer to point 4 in that case).
(2) If one manager answered several times the survey, I would like to keep the observation from the last filename using the "filename" variable.
(3) If there are at least two managers who answered the survey (they can be identified with their resp_email), I would like to keep the one with the highest lastpage value, and if they are the same, then I would like to keep the latest response with datestamp_num.
(4) Similarly, if there are no manager respondent, I would like to perform the same selection process for the points of contact than (2) and (3).
To sum up, the selection process should be by
- Relative value of lastpage
- Being a manager
- if several observations of the same person : keep the last filename
- if several observations of different persons : keep the highest number of page, then the latest answer
- Being a point of contact
- if several observations of the same person : keep the last filename
- if several observations of different persons : keep the highest number of page, then the latest answer
Here is a toy example using dataex:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input double datestamp_num str3 id long resp_status str7 filename str2 lastpage long email 1905618538000 "A13" 1 "v1" "11" 42 1905777509000 "A13" 1 "v1" "1" 50 1921766520000 "A14" 1 "v4" "11" 8 1905767698000 "A15" 1 "v1" "11" 1 1908719305000 "A17" 1 "v4" "11" 16 1905680114000 "A18" 1 "v1" "1" 34 1908722658000 "A18" 1 "v4" "11" 34 1905976802000 "A20" 1 "v1" "11" 28 1909062235000 "A22" 1 "v3" "5" 36 1921926702000 "A22" 1 "v4" "11" 36 1907772123000 "A23" 2 "v4" "11" 39 1908779281000 "A24" 2 "v3" "2" 19 1922104288000 "A24" 2 "v4" "2" 19 1914430388000 "A25" 1 "v4" "11" 22 1906017691000 "A27" 1 "v1" "11" 23 1907344484000 "A28" 1 "v4" "11" 15 1911577497000 "A29" 1 "v4" "1" 45 1908096037000 "A31" 1 "v4" "11" 25 1905626795000 "A32" 2 "v1" "11" 31 1905817696000 "A34" 1 "v1" "11" 46 1908793721000 "A35" 1 "v4" "11" 44 1921430355000 "A36" 1 "v4" "11" 30 1907005175000 "A38" 1 "v4" "11" 32 1908813590000 "A39" 1 "v4" "11" 5 1908547326000 "A40" 1 "v4" "11" 41 1914258125000 "A42" 1 "v4" "11" 12 1905614303000 "A43" 1 "v1" "11" 14 1907517473000 "A44" 1 "v4" "11" 37 1922449366000 "A46" 1 "v4" "11" 33 1908719181000 "A47" 1 "v4" "11" 40 1906050674000 "A48" 1 "v1" "11" 29 1907262579000 "A49" 1 "v4" "11" 11 1914258735000 "A5" 1 "v4" "11" 7 1.9075292e+12 "A50" 2 "v4" "11" 27 1907424046000 "A52" 1 "v4" "11" 18 1922101203000 "A54" 1 "v4" "1" 2 1908704748000 "A55" 2 "v4" "1" 38 1922126924000 "A55" 1 "v4" "11" 17 1905704649000 "A56" 1 "v1" "11" 26 1922101393000 "A6" 1 "v4" "11" 24 1922349623000 "A60" 2 "v4" "11" 3 1907530191000 "A63" 2 "v4" "11" 43 1905640828000 "A65" 1 "v1" "2" 47 1907233169000 "A65" 2 "v3" "2" 4 1921519057000 "A65" 2 "v4" "11" 4 1908810252000 "A66" 1 "v4" "11" 13 1908709681000 "A71" 1 "v4" "11" 48 1907106677000 "A72" 1 "v4" "11" 49 1911634886000 "A73" 1 "v3" "1" 6 1922201471000 "A73" 1 "v4" "11" 6 1911579583000 "A74" 1 "v4" "11" 10 1911576713000 "A76" 1 "v4" "11" 51 1906549666000 "A77" 1 "v1" "1" 9 1915736907000 "A77" 1 "v4" "11" 9 1905613915000 "A78" 2 "v1" "1" 35 1907533074000 "A79" 1 "v4" "11" 21 1911575858000 "A79" 2 "v4" "2" 20 end format %tC datestamp_num label values resp_status resp_status2 label def resp_status2 1 "manager", modify label def resp_status2 2 "point of contact", modify label values email email

Comment