Missing String Values and Identifying Duplicates

Gaelyn Archer

Join Date: Jun 2020

Posts: 3
#1

17 Jun 2020, 16:27

Hi,
I am trying to identify and eventually drop duplicates from a data set of attendance data. I found this really helpful post and used the code from it: https://www.stata.com/support/faqs/d...rue-and-false/
The variable I am trying to sort by is a string variable and the missing values are really messing this process up.

quietly bysort email: gen dup = cond(_N==1 | _N==.,0,_n)

This returns a new variable called dup that numbers the occurrences of each email in my data set: 0 if the email appears only once, otherwise 1, 2, 3, and so on within that email. The problem is that I have 51 observations with missing emails. Stata treats these all as duplicates of each other, so their dup values range from 1 to 51. I eventually want to be able to either sort on or remove dup values greater than 1. If I do that right now, all but one of the observations with missing emails will also be sorted out or dropped. I would like to keep these observations.

Other code I have tried.

quietly bysort email: gen dup = cond(_N>=1,0,_n)

quietly bysort email: gen dup = cond(_N<=1,0,_n)

quietly bysort email: gen dup = cond(_N==1 | _N==.,0,_n)

quietly bysort email: gen dup = cond(_N<=.,0,_n)

quietly bysort email: gen dup = cond(_N<= "",0,_n)

None of these have returned what I would like. I want to count the missings as 0, or as true missings in the new variable. I could just be confused about the syntax of the by or cond commands. I have tried reading the help on both of these things. I am using Stata version 16.

Thank you
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

17 Jun 2020, 16:58

Welcome to the Stata Forum / Statalist,

As per the FAQ, I recommend sharing data/command/output within code delimiters.

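With regard to the duplicates command, you can add a varlist so that duplicates are identified on those variables only, and an if qualifier to fine-tune which observations are considered.

When a condition refers to a string value, the value must be written within quotation marks (a missing string is "").

Finally, you can tag the duplicates (see help duplicates tag), which may be helpful in this case.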
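As a rough sketch of that suggestion, assuming the variable is named email as in the original post (the toy data and the variable name tag are made up for illustration):

```stata
* Toy data: 2 copies of one email, 1 unique email, 3 missing emails
clear
set obs 6
generate str20 email = ""
replace email = "a@example.com" in 1/2
replace email = "b@example.com" in 3

* Tag duplicates among nonmissing emails only.
* tag = number of OTHER observations with the same email;
* observations excluded by the if qualifier get tag = . (missing)
duplicates tag email if email != "", generate(tag)
list, sepby(email)

* Drop surplus copies of nonmissing emails, keeping the first of each.
* force is required because a varlist is given.
duplicates drop email if email != "", force
```

Because the if qualifier leaves out observations with a missing email, those observations are neither tagged nor dropped.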

Best regards,

Marcos
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

17 Jun 2020, 18:39

Are you sure you didn't mean

Code:

quietly bysort email: generate dup = cond(_N==1 | email=="",0,_n)

The code runs the generate command separately for each group of observations with a given value of email. _N is the number of observations in the group, so it will never be missing. You need to exclude the group where the value of email is missing, and because email is a string variable, its missing value is the empty string "".
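A small self-contained illustration of how this behaves, using made-up toy data (only the variable names email and dup come from the thread):

```stata
* Toy data: 2 copies of one email, 1 unique email, 3 missing emails
clear
set obs 6
generate str20 email = ""
replace email = "a@example.com" in 1/2
replace email = "b@example.com" in 3

* dup = 0 for emails that appear once and for missing emails;
* otherwise dup = 1, 2, ... within each repeated email
bysort email: generate dup = cond(_N==1 | email=="", 0, _n)
list, sepby(email)

* Keep the first copy of each repeated email; missing emails are untouched
drop if dup > 1
```

In this toy data only the second copy of a@example.com is dropped, and all three observations with a missing email are kept.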
Gaelyn Archer

Join Date: Jun 2020

Posts: 3
#4

18 Jun 2020, 09:36

Thank you both for responding. Next time I will use the code delimiters and share data/command/output. William, your suggestion worked and makes so much sense. Thank you so much!