Give priorities in my duplicates drop

Hugo Denis

Join Date: Apr 2022

Posts: 27
#1

Give priorities in my duplicates drop

06 Feb 2023, 02:39

Dear Statalist,

I have a list of duplicates by firm and I would like to keep one observation by firm. Now, a survey response from a firm can either be done by a manager or by a point of contact. Here are all the relevant variable I need to use for the selection process :

- id : firm ID.
- datestamp_num : the date the person last answered the survey
- filename : The survey file the observation is from
- resp_status : The function of the person who answered. Can either be 1 (manager) or 2 (point of contact).
- resp_email : The email of the person who answered.
- lastpage : The number of pages answered. For instance if it is equal to 1, the respondent last stopped his online survey at page 1.

Here is the priority order I would like to have when dropping my duplicate observation :

(1) If only one manager answered the survey, I would like to keep his observation in priority unless the lastpage value of one or more of their point of contact is superior, in which case we should keep the latter (refer to point 4 in that case).
(2) If one manager answered several times the survey, I would like to keep the observation from the last filename using the "filename" variable.
(3) If there are at least two managers who answered the survey (they can be identified with their resp_email), I would like to keep the one with the highest lastpage value, and if they are the same, then I would like to keep the latest response with datestamp_num.
(4) Similarly, if there are no manager respondent, I would like to perform the same selection process for the points of contact than (2) and (3).

To sum up, the selection process should be by

- Relative value of lastpage
- Being a manager
- if several observations of the same person : keep the last filename
- if several observations of different persons : keep the highest number of page, then the latest answer
- Being a point of contact
- if several observations of the same person : keep the last filename
- if several observations of different persons : keep the highest number of page, then the latest answer

Here is a toy example using dataex:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input double datestamp_num str3 id long resp_status str7 filename str2 lastpage long email 1905618538000 "A13" 1 "v1" "11" 42 1905777509000 "A13" 1 "v1" "1" 50 1921766520000 "A14" 1 "v4" "11" 8 1905767698000 "A15" 1 "v1" "11" 1 1908719305000 "A17" 1 "v4" "11" 16 1905680114000 "A18" 1 "v1" "1" 34 1908722658000 "A18" 1 "v4" "11" 34 1905976802000 "A20" 1 "v1" "11" 28 1909062235000 "A22" 1 "v3" "5" 36 1921926702000 "A22" 1 "v4" "11" 36 1907772123000 "A23" 2 "v4" "11" 39 1908779281000 "A24" 2 "v3" "2" 19 1922104288000 "A24" 2 "v4" "2" 19 1914430388000 "A25" 1 "v4" "11" 22 1906017691000 "A27" 1 "v1" "11" 23 1907344484000 "A28" 1 "v4" "11" 15 1911577497000 "A29" 1 "v4" "1" 45 1908096037000 "A31" 1 "v4" "11" 25 1905626795000 "A32" 2 "v1" "11" 31 1905817696000 "A34" 1 "v1" "11" 46 1908793721000 "A35" 1 "v4" "11" 44 1921430355000 "A36" 1 "v4" "11" 30 1907005175000 "A38" 1 "v4" "11" 32 1908813590000 "A39" 1 "v4" "11" 5 1908547326000 "A40" 1 "v4" "11" 41 1914258125000 "A42" 1 "v4" "11" 12 1905614303000 "A43" 1 "v1" "11" 14 1907517473000 "A44" 1 "v4" "11" 37 1922449366000 "A46" 1 "v4" "11" 33 1908719181000 "A47" 1 "v4" "11" 40 1906050674000 "A48" 1 "v1" "11" 29 1907262579000 "A49" 1 "v4" "11" 11 1914258735000 "A5" 1 "v4" "11" 7 1.9075292e+12 "A50" 2 "v4" "11" 27 1907424046000 "A52" 1 "v4" "11" 18 1922101203000 "A54" 1 "v4" "1" 2 1908704748000 "A55" 2 "v4" "1" 38 1922126924000 "A55" 1 "v4" "11" 17 1905704649000 "A56" 1 "v1" "11" 26 1922101393000 "A6" 1 "v4" "11" 24 1922349623000 "A60" 2 "v4" "11" 3 1907530191000 "A63" 2 "v4" "11" 43 1905640828000 "A65" 1 "v1" "2" 47 1907233169000 "A65" 2 "v3" "2" 4 1921519057000 "A65" 2 "v4" "11" 4 1908810252000 "A66" 1 "v4" "11" 13 1908709681000 "A71" 1 "v4" "11" 48 1907106677000 "A72" 1 "v4" "11" 49 1911634886000 "A73" 1 "v3" "1" 6 1922201471000 "A73" 1 "v4" "11" 6 1911579583000 "A74" 1 "v4" "11" 10 1911576713000 "A76" 1 "v4" "11" 51 1906549666000 "A77" 1 "v1" "1" 9 1915736907000 "A77" 1 "v4" "11" 9 1905613915000 "A78" 2 "v1" "1" 35 1907533074000 "A79" 1 "v4" "11" 21 1911575858000 "A79" 2 "v4" "2" 20 end format %tC datestamp_num label values resp_status resp_status2 label def resp_status2 1 "manager", modify label def resp_status2 2 "point of contact", modify label values email email

I anonymized the email variable by encoding it, but this should be fine as it's intended to be used as an ID. I'm sorry because my problem may seem trivial and time-consuming (at least for me!) at the same time, therefore I thank everyone who would be willing to spend time on it.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 36058
#2

06 Feb 2023, 03:36

You are not implying that but let me flag that "duplicates drop" does not here mean the Stata command duplicates drop because the goal of that is only that some groups of observations are identical in stated respects and that you want only one of each group. Your problem is too subtle for that command to work well for you (and in large part I am commenting on my own work there, but in the strong sense that no command works well at doing something it was not intended to do).

Without grasping all the details here, I think I would approach this in stages, which would be to work with an indicator variable called say tokeep which you ultimately set to 1 for observations that you want to keep. Then you work through your criteria starting like this

Code:

gen tokeep = 0 replace tokeep = 1 if ...

where the guiding principle is that you have got to where you want to be whenever the total of tokeep is 1 for every firm (which is true if and only if there is one observation with 1 and all the others are 0.

That is, the values of

Code:

egen number_to_keep = total(tokeep), by(firmid)

should always be 1 which is easy enough to check with (e.g.)

list if tokeep != 1

Although it may seem a distinction without a difference, I would approach this more as a question of what to keep than of what to drop. -- if only because dropping whatever observations don't meet an early criterion gives them no scope to qualify as meeting later criteria.
1 like
Comment
Hugo Denis

Join Date: Apr 2022

Posts: 27
#3

07 Feb 2023, 01:30

Dear Nick :

Thank you for the clarification. However, I'm not sure I understand how it is possible to tell Stata to replace tokeep = 1 if the value of lastpage for the points of contact is greater than the one for the managers by firmid group.

I'm not even sure I understand how I am supposed to end up with just one firm with 1 and all the others with 0 if I have successive steps that are not necessarily narrowing my subset of observations down. Consider these two lines in post #1 :

Code:

* Example generated by -dataex-. For more info, type help dataex clear input double datestamp_num str3 id long resp_status str7 filename str2 lastpage long email 1908704748000 "A55" 2 "v4" "1" 38 1922126924000 "A55" 1 "v4" "11" 17 end format %tC datestamp_num label values resp_status resp_status2 label def resp_status2 1 "manager", modify label def resp_status2 2 "point of contact", modify label values email email

The first line is a director because resp_status = 2. However, one can see with the lastpage variable that this person only went through the first page of the survey while the second observation, who's also in the same firm A55 but with a lower position, answered the survey fully. In this case I would like to keep the second one, but this seems exceptional as I would like to keep the director in priority, other things being equal.

I'm sorry if I sound repetitive or if I don't make myself clear.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36058
#4

07 Feb 2023, 01:54

Indeed: you're conveying that your mix of criteria implies steps backwards and forwards, which I didn't follow -- not because you were unclear, but because I didn't try to grasp all the details you were giving.

There is some caprice on Statalist about which questions get answered in detail, and here I am not confident that spending more time trying to follow your details would lead to a good answer, as your kind of data is some distance from my usual kinds of data. I am sorry if that is disappointing, and even sorrier if it seems rude, but it is no more than is standard for a forum like this. I hope you get better answers from others.
Comment
Hugo Denis

Join Date: Apr 2022

Posts: 27
#5

07 Feb 2023, 04:21

Nick : Thank you for your message. I understand your point and I definitely agree with you!
Comment

Announcement

Give priorities in my duplicates drop

Comment

Comment

Comment

Comment