Removing duplicates across variables

Allan Kelly

Join Date: Sep 2022
Posts: 5

Removing duplicates across variables

29 Sep 2022, 09:58

Hello all,

I am working with a large dataset (10,473 observations and 6,465 variables) in Stata 14.1

This dataset has multiple versions of the same variable and therefore I should be able to reduce the number of variables currently in the dataset as it is an unnecessary number. An extract of the dataset is below;

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str97(cn_proposed_use_of_building1 cn_proposed_use_of_building2 cn_proposed_use_of_building3 cn_proposed_use_of_building4 cn_proposed_use_of_building5)
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   "Office"                "Assembly And Recreation"                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   "Residential Dwellings" "Residential Dwellings" "Residential Dwellings"
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Assembly And Recreation" "Assembly And Recreation"                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   "Residential Dwellings" "Residential Dwellings" "Residential Dwellings"
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                                                          
"Industrial"              "Industrial"              ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Industrial"              "Industrial"              ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Other"                   "Other"                   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Other"                   "Other"                   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   "Residential Dwellings" "Residential Dwellings" ""                    
"Residential Dwellings"   "Residential Dwellings"   "Residential Dwellings" "Residential Dwellings" "Residential Dwellings"
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
"Shop"                    "Shop"                    ""                      ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   "Residential Dwellings" ""                      ""                    
"Residential Dwellings"   "Residential Dwellings"   ""                      ""                      ""                    
end

From the above we can see that the purpose of the building is listed, and, in most cases this is the same across the observation. However, the purpose of the building is not consistent across the variables for all observations (e.g., lines 3 and 7) also, the purpose of the building is blank across the variables for some of of the observations. For clarity the above variable extends to cn_proposed_use_of_building75 so there is a lot of blanks in this section of the dataset.

I would like, if possible, to loop over the cn_proposed_use_of_building[_n] variables and check if the non-blank variables are the same. If they are then I would like to set the duplicate values to blank, using a code like the below but on a much larger scale

Code:

replace cn_proposed_use_of_building2 = "" if cn_proposed_use_of_building1 == cn_proposed_use_of_building2

For observations where there is several different uses of the building I would like to combine the non-blank strings. I think that the following code could be used

Code:

egen cn_proposed_use_of_building = concat(cn_proposed_use_of_building1-cn_proposed_use_of_building75), punct(&)

I should also note at this stage that the cn_proposed_use_of_building[_n] variables do not appear simultaneously in the dataset which may require the above code to be changed.

I should also mention at this point that I hope to use what I learn addressing the issues presented by cn_proposed_use_of_building[_n] to address the other duplicate variables in the dataset, some of which are numeric and some of which are in date format. I am not sure if this would necessitate significant change in the command.

Open to advice on all of the above. Thanks in advance.
Allan

Last edited by Allan Kelly; 29 Sep 2022, 10:03.

Tags: duplicates

Clyde Schechter

Join Date: Apr 2014
Posts: 30354

29 Sep 2022, 11:03

No loops required. The trick is to put the data into long layout. To do that you need an identifier variable or set of variables. Perhaps you already have that and just didn't show it with your example data. If so, skip the first command, and replace obs_no by your identifier variable(s) throughout the rest of the code.

Code:

gen long obs_no = _n
reshape long cn_proposed_use_of_building, i(obs_no)
drop if missing(cn_proposed_use_of_building)
by obs_no (cn_proposed_use_of_building), sort: ///
    drop if cn_proposed_use_of_building == cn_proposed_use_of_building[_n-1]
by obs_no (_j), sort: replace cn_proposed_use_of_building ///
    = cn_proposed_use_of_building[_n-1] + " & " + cn_proposed_use_of_building ///
    if _n > 1
by obs_no (_j), sort: keep if _n == _N
drop _j

Comment

Allan Kelly

Join Date: Sep 2022

Posts: 5
#3

30 Sep 2022, 03:15

Originally posted by Clyde Schechter View Post

No loops required. The trick is to put the data into long layout. To do that you need an identifier variable or set of variables. Perhaps you already have that and just didn't show it with your example data. If so, skip the first command, and replace obs_no by your identifier variable(s) throughout the rest of the code.

Code:

gen long obs_no = _n reshape long cn_proposed_use_of_building, i(obs_no) drop if missing(cn_proposed_use_of_building) by obs_no (cn_proposed_use_of_building), sort: /// drop if cn_proposed_use_of_building == cn_proposed_use_of_building[_n-1] by obs_no (_j), sort: replace cn_proposed_use_of_building /// = cn_proposed_use_of_building[_n-1] + " & " + cn_proposed_use_of_building /// if _n > 1 by obs_no (_j), sort: keep if _n == _N drop _j

Hi Clyde,

Thanks for your answer. Regarding the above code I have a question about a specific command;

Code:

drop if missing(cn_proposed_use_of_building)

Would this command result in the entire observation being dropped if the cn_proposed_use_of_building is blank? I think this would result in data loss if there is data for the observation in the other variables.

Is there another way of doing this which would leave the data in wide form?

In my data there should be a unique commencement notice code for the construction of a site (cn_number), however, the site may contain multiple buildings which are completed at different dates. Each new completion date generates a new line in the dataset (i.e., long format) in which most of the information is duplicated, save for a few completion related variables. I would like to add the unique completion data to the existing commencement data and thus remove the duplicates in the commencement notice code (cn_number). I have already reshaped from long to wide for all variables as there is some minor differences in some of the string variables, hence the above attempt to -concat- the strings.

When I attempted your code I received the following error message for the -reshape- command:

Code:

.dta file corrupt unexpected file read error .dta file is corrupt Pieces in the file are not where they are expected to be.

Perhaps, in addressing my issues, I reshaped from long to wide prematurely? If I was to split the data (when in long format) into the variables (i) I want to reshape wide and (ii) the rest, then work on the two new datasets separately and then merge on the hopefully unique cn_number this might be a work around?

Looking forward to hearing your thoughts.
Comment

Allan Kelly

Join Date: Sep 2022
Posts: 5

30 Sep 2022, 04:14

Also I have used a version of your code with the string variable and I think that I encountered a problem;

Code:

by cn_number (cn_proposed_use_of_building), sort: ///
    replace cn_proposed_use_of_building = "" if cn_proposed_use_of_building == cn_proposed_use_of_building[_n-1]

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str97 cn_proposed_use_of_building str15 cn_number
"Residential Dwellings"   "CN0000057DC_1" 
""                        "CN0000057DC_1" 
"Residential Dwellings"   "CN0000162DR_1" 
""                        "CN0000162DR_1" 
"Residential Dwellings"   "CN0000206OY_1" 
""                        "CN0000206OY_1" 
"Assembly And Recreation" "CN0000261GC_7" 
""                        "CN0000261GC_7" 
"Industrial"              "CN0000466FL_8" 
""                        "CN0000466FL_8" 
"Industrial"              "CN0000466FL_8" 
"Residential Dwellings"   "CN0000551FL_1" 
""                        "CN0000551FL_1" 
"Residential Dwellings"   "CN0000596KY_1" 
""                        "CN0000596KY_1" 
"Storage"                 "CN0000607DL_9" 
""                        "CN0000607DL_9" 
"Office"                  "CN0000684FL_4" 
""                        "CN0000684FL_4" 
"Residential Dwellings"   "CN0000762LH_1" 
""                        "CN0000762LH_1" 
"Residential Dwellings"   "CN0000898DR_1" 
""                        "CN0000898DR_1" 
"Residential Dwellings"   "CN0000898DR_1" 
""                        "CN0000898DR_1" 
"Residential Dwellings"   "CN0000898DR_1" 
""                        "CN0000898DR_1" 
"Residential Dwellings"   "CN0000898DR_1" 
""                        "CN0000898DR_1" 
"Residential Dwellings"   "CN0000898DR_1" 
""                        "CN0000898DR_1" 
"Residential Dwellings"   "CN0000898DR_1" 
""                        "CN0000898DR_1" 
"Residential Dwellings"   "CN0000898DR_1" 
"Residential Dwellings"   "CN0000900DC_1" 
""                        "CN0000900DC_1" 
"Assembly And Recreation" "CN0000933KE_7" 
""                        "CN0000933KE_7" 
end

As we can see from the above for the cn_number variable there is multiple instances of "Residential Dwellings" for CN0000898DR_1. It seems the code only compares the observation directly before it rather than all occurrences of CN00898DR_1. A similar issue occurs for CN0000466FL_8 which is showing "Industrial" twice. Is there a way to rectify this issue?

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#5

30 Sep 2022, 11:39

Re #4: Yes, you are correct. My error. It should be:

Code:

by cn_number cn_proposed_use_of_building, sort: replace cn_proposed_use_of_building = "" if _n > 1

In light of the above, and also the mention in #3 of "Would this command result in the entire observation being dropped if the cn_proposed_use_of_building is blank? I think this would result in data loss if there is data for the observation in the other variables." I would change the code in #2 to the following, which fixes both problems.

Code:

gen long obs_no = _n reshape long cn_proposed_use_of_building, i(obs_no) by obs_no (cn_proposed_use_of_building), sort: /// drop if missing(cn_proposed_use_of_building) & _n < _N by obs_no cn_proposed_use_of_building, sort: /// drop if _n > 1 by obs_no (_j), sort: replace cn_proposed_use_of_building /// = cn_proposed_use_of_building[_n-1] + " & " + cn_proposed_use_of_building /// if _n > 1 by obs_no (_j), sort: keep if _n == _N drop _j

Is there another way of doing this which would leave the data in wide form?

Probably, but it would necessarily more complicated. This task is ideally suited to the use of long layout, as is most work in Stata. Wide layout should be reserved for the less common situations where it is actually necessary. This isn't one of those. If your data set is large enough that the -reshape- is taking hours to run, try using -tolong- available from SSC, or -greshape-, which is part of the -gtools- package, also available at SSC. Both of these are much faster than Stata's official -reshape- command.

In my data there should be a unique commencement notice code for the construction of a site (cn_number), however, the site may contain multiple buildings which are completed at different dates. Each new completion date generates a new line in the dataset (i.e., long format) in which most of the information is duplicated, save for a few completion related variables. I would like to add the unique completion data to the existing commencement data and thus remove the duplicates in the commencement notice code (cn_number). I have already reshaped from long to wide for all variables as there is some minor differences in some of the string variables, hence the above attempt to -concat- the strings.

I don't understand what you are describing, and it isn't clear to me what, if anything, you're asking in relation to it. Showing example data and illustrating what you'd like to do would be helpful.

When I attempted your code I received the following error message for the -reshape- command:
Code:

.dta file corrupt unexpected file read error .dta file is corrupt Pieces in the file are not where they are expected to be.

I have never seen this before and do not know what might be causing it. What do your data look like just before the -reshape- command? If you show an example (-dataex-) that illustrates this problem, I can try to troubleshoot. It certainly doesn't happen with the example data you have shown in this thread.

Perhaps, in addressing my issues, I reshaped from long to wide prematurely? If I was to split the data (when in long format) into the variables (i) I want to reshape wide and (ii) the rest, then work on the two new datasets separately and then merge on the hopefully unique cn_number this might be a work around?

Maybe. As I don't really understand the error you are getting with -reshape-, I can't predict.
Comment

Announcement

Removing duplicates across variables

Comment

Comment

Comment

Comment