Hello all,
I am working with a large dataset (10,473 observations and 6,465 variables) in Stata 14.1
This dataset has multiple versions of the same variable and therefore I should be able to reduce the number of variables currently in the dataset as it is an unnecessary number. An extract of the dataset is below;
From the above we can see that the purpose of the building is listed, and, in most cases this is the same across the observation. However, the purpose of the building is not consistent across the variables for all observations (e.g., lines 3 and 7) also, the purpose of the building is blank across the variables for some of of the observations. For clarity the above variable extends to cn_proposed_use_of_building75 so there is a lot of blanks in this section of the dataset.
I would like, if possible, to loop over the cn_proposed_use_of_building[_n] variables and check if the non-blank variables are the same. If they are then I would like to set the duplicate values to blank, using a code like the below but on a much larger scale
For observations where there is several different uses of the building I would like to combine the non-blank strings. I think that the following code could be used
I should also note at this stage that the cn_proposed_use_of_building[_n] variables do not appear simultaneously in the dataset which may require the above code to be changed.
I should also mention at this point that I hope to use what I learn addressing the issues presented by cn_proposed_use_of_building[_n] to address the other duplicate variables in the dataset, some of which are numeric and some of which are in date format. I am not sure if this would necessitate significant change in the command.
Open to advice on all of the above. Thanks in advance.
Allan
I am working with a large dataset (10,473 observations and 6,465 variables) in Stata 14.1
This dataset has multiple versions of the same variable and therefore I should be able to reduce the number of variables currently in the dataset as it is an unnecessary number. An extract of the dataset is below;
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input str97(cn_proposed_use_of_building1 cn_proposed_use_of_building2 cn_proposed_use_of_building3 cn_proposed_use_of_building4 cn_proposed_use_of_building5) "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "Office" "Assembly And Recreation" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Assembly And Recreation" "Assembly And Recreation" "" "" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "" "" "" "Industrial" "Industrial" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Industrial" "Industrial" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Other" "Other" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Other" "Other" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "" "" "" "Shop" "Shop" "" "" "" "Residential Dwellings" "Residential Dwellings" "Residential Dwellings" "" "" "Residential Dwellings" "Residential Dwellings" "" "" "" end
I would like, if possible, to loop over the cn_proposed_use_of_building[_n] variables and check if the non-blank variables are the same. If they are then I would like to set the duplicate values to blank, using a code like the below but on a much larger scale
Code:
replace cn_proposed_use_of_building2 = "" if cn_proposed_use_of_building1 == cn_proposed_use_of_building2
Code:
egen cn_proposed_use_of_building = concat(cn_proposed_use_of_building1-cn_proposed_use_of_building75), punct(&)
I should also mention at this point that I hope to use what I learn addressing the issues presented by cn_proposed_use_of_building[_n] to address the other duplicate variables in the dataset, some of which are numeric and some of which are in date format. I am not sure if this would necessitate significant change in the command.
Open to advice on all of the above. Thanks in advance.
Allan

Comment