generating duplicate condition not reproducible

Alyssa Beavers

Join Date: Feb 2015

Posts: 72
#1

generating duplicate condition not reproducible

06 Jul 2023, 09:28

Good morning,

I have survey data in which I am using IP address (variable name IPAddress) to identify duplicates. I have created a duplicate IP address variable (dup_IPAddress) using this code:

Code:

sort IPAddress
quietly by IPAddress: gen dup_IPAddress = cond(_N==1,0,_n)

However, I have found that this code does not produce reproducible results: for example, if observations 1 and 2 are duplicates, sometimes dup_IPAddress is 1 for observation 1 and 2 for observation 2, whereas other times I run the same code and get the opposite. This impacts the reproducibility of my downstream analysis. Is there a way to ensure reproducibility when generating a duplicate condition?

Due to confidentiality I do not want to provide a dataex; if that would be necessary to answer my question, please let me know and I will try to find a workaround to create a non-identifiable dataset that could be used as an example.

Thanks,
Alyssa Beavers
George Ford

Join Date: Aug 2014

Posts: 3187
#2

06 Jul 2023, 09:54

Does "duplicates tag" work correctly?
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#3

06 Jul 2023, 10:00

Well, your results are not only irreproducible, they are also arbitrary. So, yes, there are ways of making it all reproducible, but then it will be reproducible but still arbitrary. Selecting only one observation to retain from among all those with a duplicate IPAddress in a non-arbitrary way requires relying on some additional variable(s). For example, if there is a date variable that orders the observations, you might want to select the first, or the last, or something like that. That would go like this:

Code:

by IPAddress (date), sort: gen byte select = (_n == 1) // FOR THE FIRST; IF YOU WANT THE LAST, _n == _N

If you cannot think of another variable to de-arbitrarify the process, then you have a serious problem because you are reduced to picking an arbitrary representative for each IPAddress and, if the observations of a given IPAddress have different values on other variables you are interested in, you will get irreproducible, arbitrary results. (If all observations for each IPAddress have the same values for all variables of interest, then it doesn't matter which you select and there is no problem.)
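A minimal sketch of how one might check whether the duplicates actually disagree on a variable of interest. The toy data and the variable names id and score are hypothetical, not taken from the original dataset:

```stata
* Hypothetical toy data: score stands in for a variable of interest
clear
input str15 IPAddress long id double score
"10.0.0.1" 1 3
"10.0.0.1" 2 3
"10.0.0.2" 3 5
"10.0.0.3" 4 1
"10.0.0.3" 5 4
end

* Within each IPAddress, sort by score and compare the smallest and largest values
by IPAddress (score), sort: gen byte differs = score[1] != score[_N]
list, sepby(IPAddress)
```

Here differs is 1 only for the IPAddress whose observations carry different scores; for those groups the choice of representative matters.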

Due to confidentiality I do not want to provide a dataex

Confidentiality need not be an issue. You can overwrite the real data with fake, random data, and then use -dataex- on that. What is usually most needed is the structure and organization of the data; the actual values are usually not important. Only when the specific data values affect the computations to be coded would this be a problem.
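As an illustration of that idea, a small fake dataset with the same structure could be built along these lines. This is only a sketch: the variable names id and response, the seed, and the made-up IP values are all assumptions, and it presumes -dataex- is available (it ships with recent Stata versions and is otherwise installable from SSC):

```stata
* Hypothetical fake data: same structure, no real values
clear
set seed 12345
set obs 8
gen long id = _n
* Made-up IP addresses with deliberate duplicates (each value appears twice)
gen str15 IPAddress = "10.0.0." + string(ceil(_n/2))
gen double response = runiform()
dataex id IPAddress response
```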
2 likes
Alyssa Beavers

Join Date: Feb 2015

Posts: 72
#4

07 Jul 2023, 06:47

George Ford I actually did not know that duplicates tag existed, thank you for sharing! It appears that it works a little differently than the code I am using, but nonetheless may be helpful for me in the future.
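To make the difference concrete, here is a rough side-by-side on hypothetical toy data (the id variable and the IP values are made up for illustration). -duplicates tag- records the number of other observations sharing the same IPAddress, whereas the cond() approach numbers the observations within each group:

```stata
* Hypothetical toy data
clear
input str15 IPAddress long id
"10.0.0.1" 1
"10.0.0.1" 2
"10.0.0.2" 3
"10.0.0.3" 4
"10.0.0.3" 5
"10.0.0.3" 6
end

* Number of duplicates of each observation (0 = unique IPAddress)
duplicates tag IPAddress, generate(dup_tag)

* Running count within IPAddress; id is used as a tiebreaker so the numbering is fixed
by IPAddress (id), sort: gen dup_IPAddress = cond(_N==1, 0, _n)
list, sepby(IPAddress)
```

Note that dup_tag is the same for every member of a group, so it does not depend on the order of ties, while dup_IPAddress does.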

Clyde Schechter Thank you. I do understand that the choice of which duplicate to retain is arbitrary if no other information is taken into account. However, I am in part just curious about how and why the code is not reproducible. Additionally, I am learning R and I am replicating what I have done in Stata in R. It is difficult for me to know if I am coding correctly in R if the code in Stata does not give me the same results each time to compare against.

I did find that sorting by the unique ID variable prior to running the code I listed in post #1 does in fact give me reproducible results.
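A sketch of that approach, assuming the unique ID variable is called id (a hypothetical name) and using made-up data:

```stata
* Hypothetical toy data, deliberately out of order
clear
input str15 IPAddress long id
"10.0.0.2" 4
"10.0.0.1" 2
"10.0.0.2" 3
"10.0.0.1" 1
end

* Stops with an error if id does not uniquely identify observations
isid id

* Sorting on IPAddress and id together leaves no ties to break
sort IPAddress id
quietly by IPAddress: gen dup_IPAddress = cond(_N==1, 0, _n)
list
```

Because the sort keys jointly identify each observation, the order within each IPAddress is fully determined and dup_IPAddress comes out the same on every run.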
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#5

07 Jul 2023, 09:17

The irreproducibility I refer to arises because Stata's -sort- command is, by design, irreproducible if the sort variables do not uniquely determine the order of the data. When the sort keys do not uniquely identify observations, the ties in sort order are broken by random selection. The PDF documentation of the -sort- command that comes built in with your Stata installation has a lengthy and detailed explanation of this and its implications. The passage involved is too long for me to directly quote here. But it is definitely worth reading. Open the -sort- section of the documentation, and then scroll down to the "Sorting with ties" subsection.

I do not use R myself, so I do not know how R handles the ties in sorting. Some packages handle ties by preserving, among the ties, their original sort order. Stata does not do that, unless you specifically tell it to by specifying the -stable- option.
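For reference, a minimal sketch of the -stable- option on hypothetical toy data (the id variable and IP values are made up). With -stable-, tied observations keep the relative order they had before the sort, so the result is reproducible only insofar as that starting order is itself reproducible:

```stata
* Hypothetical toy data in a known starting order
clear
input str15 IPAddress long id
"10.0.0.2" 1
"10.0.0.1" 2
"10.0.0.2" 3
"10.0.0.1" 4
"10.0.0.3" 5
"10.0.0.1" 6
end

* Ties on IPAddress keep their pre-sort relative order
sort IPAddress, stable
quietly by IPAddress: gen dup_IPAddress = cond(_N==1, 0, _n)
list, sepby(IPAddress)
```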
1 like
Alyssa Beavers

Join Date: Feb 2015

Posts: 72
#6

10 Jul 2023, 07:56

Clyde Schechter Thank you for that explanation; that is exactly what was happening!
