Hi,
I am trying to identify and eventually drop duplicates from a data set of attendance data. I found this really helpful post and used the code from it: https://www.stata.com/support/faqs/d...rue-and-false/
The variable I am trying to sort by is a string variable, and the missing values are really messing this process up.
quietly bysort email: gen dup = cond(_N==1 | _N==.,0,_n)
This returns a new variable called dup that counts up how many times each email appears in my data set. The problem is that I have 51 observations with missing emails. Stata is treating these all as duplicates of each other, and their dup values range from 1 to 51. I eventually want to be able to either sort on or remove dup values greater than 1. If I do that right now, the observations with missing emails will also be sorted out or dropped. I would like to keep these observations.
Other code I have tried.
quietly bysort email: gen dup = cond(_N>=1,0,_n)
quietly bysort email: gen dup = cond(_N<=1,0,_n)
quietly bysort email: gen dup = cond(_N==1 | _N==.,0,_n)
quietly bysort email: gen dup = cond(_N<=.,0,_n)
quietly bysort email: gen dup = cond(_N<= "",0,_n)
None of these have returned what I would like. I want to count the missings as 0, or as true missings in the new variable. I could just be confused about the syntax of the by or cond commands. I have tried reading the help on both of these things. I am using Stata version 16.
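To make the goal concrete, here is a minimal sketch of the result I am after. It assumes the missing emails are stored as empty strings (""), so that missing(email) is true for them. It is untested on my data and only meant to illustrate the intended outcome, not to claim this is the right approach.

```stata
* Sketch only: assumes missing emails are empty strings ("")
* dup = 0 for emails that appear once and for all missing emails;
* otherwise dup numbers the repeated observations 1, 2, 3, ... within each email
quietly bysort email: gen dup = cond(_N == 1 | missing(email), 0, _n)

* Dropping dup > 1 would then keep every observation with a missing email
drop if dup > 1
```

The idea is that the test is on the email value itself rather than on _N, since _N is the group size and is never missing.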
Thank you
