  • Missing String Values and Identifying Duplicates

    Hi,
    I am trying to identify and eventually drop duplicates from a set of attendance data. I found this really helpful post and used the code from it: https://www.stata.com/support/faqs/d...rue-and-false/
    The variable I am trying to sort by is a string variable and the missing values are really messing this process up.

    quietly bysort email: gen dup = cond(_N==1 | _N==.,0,_n)

    This returns a new variable called dup that counts up how many times a unique email appears in my data set. The problem is that I have 51 observations with missing emails. Stata is treating these all as duplicates of each other, and their dup values range from 1 to 51. I eventually want to be able to either sort on or remove dup values greater than 1. If I do that right now, all but one of the observations with missing emails will also be sorted out or dropped. I would like to keep these observations.

    Other code I have tried:

    quietly bysort email: gen dup = cond(_N>=1,0,_n)

    quietly bysort email: gen dup = cond(_N<=1,0,_n)

    quietly bysort email: gen dup = cond(_N==1 | _N==.,0,_n)

    quietly bysort email: gen dup = cond(_N<=.,0,_n)

    quietly bysort email: gen dup = cond(_N<= "",0,_n)

    None of these have returned what I would like. I want to count the missings as 0, or as true missings in the new variable. I could just be confused about the syntax of the by or cond commands. I have tried reading the help on both of these things. I am using Stata version 16.

    Thank you



  • #2
    Welcome to the Stata Forum / Statalist,

    As per the FAQ, I recommend sharing data, commands, and output within code delimiters.

    With regard to the duplicates command, one can add a varlist so that duplicates are identified on those variables only.

    When comparing a string variable with a value, the value must be written within quotation marks.

    Finally, one can tag the duplicates (see help duplicates) with duplicates tag, and this may be helpful in this case.
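
    A minimal sketch of that approach, assuming the variable is called email as in the original post; the new variable name tag is only illustrative:

    ```stata
    * Flag duplicates on email only, skipping observations with a missing email.
    * tag holds the number of OTHER observations sharing the same email
    * (0 = unique) and is missing for the observations excluded by the if.
    duplicates tag email if email != "", generate(tag)

    * Optionally drop the surplus copies, keeping the first of each email.
    * force is required whenever a varlist is given to duplicates drop.
    duplicates drop email if email != "", force
    ```

    Because of the if qualifier, observations with a missing email are never counted as duplicates of each other and are never dropped.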
    Best regards,

    Marcos



    • #3
      Are you sure you didn't mean
      Code:
      quietly bysort email: generate dup = cond(_N==1 | email=="",0,_n)
      The code runs the generate command separately for each group of observations with a given value of email. _N is the number of observations in the group, so it will never be missing. You need to exclude the group where the value of email is missing, which for a string variable means email=="".
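
      A small worked example may make this concrete. The toy data below are made up for illustration; only the generate line comes from the suggestion above.

      ```stata
      * Toy data: one repeated email, one unique email, two missing emails
      clear
      input str20 email
      "a@example.com"
      "a@example.com"
      "b@example.com"
      ""
      ""
      end

      * Unique and missing emails get 0; repeated emails are numbered
      * 1, 2, ... within their group
      quietly bysort email: generate dup = cond(_N==1 | email=="", 0, _n)

      list email dup, sepby(email)

      * Keep the first copy of each repeated email and every missing-email row
      drop if dup > 1
      ```

      Here both missing-email observations get dup equal to 0, so drop if dup > 1 removes only the second copy of a@example.com.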



      • #4
        Thank you both for responding. Next time I will use the code delimiters and share data, commands, and output. William, your suggestion worked and makes so much sense. Thank you so much!
