Removing Non-unique duplicates with Unstructured Values Pattern

Ahmad zaky Ismail

Join Date: Mar 2023

Posts: 4
#1

Removing Non-unique duplicates with Unstructured Values Pattern

25 Mar 2023, 03:55

I want to create a LogMiscellaneousExpenditure variable of each household (denoted by hhid14) consisting of their expenditures for three item categories (denoted by ks3type). However, because there are multiple duplicates of the variable, I want to remove them but still keeping household id that has 0 expenditure values in the dataset. The example of data that I use is as follows:

ks3type hhid14 logMiscellaneousExp
E 0010600 0
F 0010600 11.69525
G 0010600 0
E 0010651 0
F 0010651 12.76569
G 0010651 0
E 0010800 0
G 0010800 0
F 0010800 0
E 0010851 0
G 0010851 0
F 0010851 0
F 0012200 0
E 0012200 22.51503
G 0012200 22.51503
G 0012241 0
E 0012241 33.0063
F 0012241 33.0063
G 0012242 14.07788
F 0012242 0
E 0012242 0
G 0012251 0
F 0012251 0
E 0012251 0
F 0012400 0
G 0012400 12.61154
E 0012400 0
E 0012451 10.12663
G 0012451 0
F 0012451 0
E 0012452 19.67914
F 0012452 0
G 0012452 19.67914

previously I have tried two commands:
i) duplicate drop hhid14, force -> but this command will delete some output variable values
ii) sort hhid14 logMiscellaneousExp
egen tag = tag(logMiscellaneousExp) -> but this command will only recognize the first row of expense variables that have a value, and remove variables from some households that have a value of 0. Thanks in advance!
Tags: data cleaning, duplicates
Nick Cox

Join Date: Mar 2014

Posts: 35664
#2

25 Mar 2023, 04:25

As the original author of the duplicates command I am squeamish about the force option. As in the result of life, force should only be used if you are certain that it is the only choice available, and even then there can be unwanted side-effects. In fact, the force option is documented as implying a loss of information.

That aside, do you want some variant on

Code:

duplicates drop hhid logMiscell

?

That in turn aside, the logged variable is in practice in most cases the natural logarithm of round numbers like 25000, as pushing it through exp() will show. But

1. exp(0) is 1 implying an expenditure of 1 unit. That looks like a fudge on getting missings out of log(0).

2. What makes you confident that repeated values are just unwanted duplicates? The same round number might occur genuinely.
Comment
Ahmad zaky Ismail

Join Date: Mar 2023

Posts: 4
#3

25 Mar 2023, 09:46

So beforehand, I created the logexp variable based on this command:

bysort hhid14: egen logMiscellaneousExp = sum(lTaxes + lOtherExpHH + lOtherExp + lNonFoodExp)
replace logMiscellaneousExp = 0 if lTaxes + lOtherExp + lOtherExpHH + lNonFoodExp == 0 | logMiscellaneousExp ==.

by that, I want to calculate the sum of miscellaneous expenditure items: Taxes, Non-Food, and Other Exp (denoted by ks3type E, F, and G, respectively). Now as I already have the logexp variable, I want to clean the variable so that it removes the duplicated value of each household id.

1. The logexp(0) value is an unwanted duplicate in the case where: if the household logexp already has a value > 0, the duplicate 0 value is just the residue of the actual value. However, for some logexp variables, there are several households that should have a value of 0.

So here I want to remove duplicate values in the logexp for households that already have a value > 0, but keep a value of 0 in households that have logexp(0)
Comment
Ahmad zaky Ismail

Join Date: Mar 2023

Posts: 4
#4

25 Mar 2023, 09:58

I have tried using your command
duplicates drop hhid logMiscell however, it will only remove some of the duplicated variables and still kept some of them as shown below:

ks3type hhid14 logMiscellaneousExp
E 0010600 0
F 0010600 11.69525
E 0010651 0
F 0010651 12.76569
F 0010800 0
G 0010851 0
E 0012200 22.51503
F 0012200 0
E 0012241 33.0063
G 0012241 0
F 0012242 0
G 0012242 14.07788
F 0012251 0
G 0012400 12.61154
F 0012400 0
E 0012451 10.12663
G 0012451 0
G 0012452 19.67914
F 0012452 0

is there any chances to remove the 0 on duplicated variables (e.g., hhid14: 0010600, 0010651, etc.) but still keep the 0 value for unduplicated variables (e.g., hhdid14: 0010800, 001851, etc.)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35664
#5

25 Mar 2023, 10:39

Sorry, but I have to conclude that I really don't understand what you have done or what you want, so I should bail out at this point.

Other than a precision detail, a number like 11.65925 is clearly the natural logarithm of some round number like 120000, so how taking logarithms and adding up are compatible here, I just don't follow.

Code:

. di exp( 11.69525) 120000.36

I hope that someone else can follow what you seek.
Comment

Announcement

Removing Non-unique duplicates with Unstructured Values Pattern

Comment

Comment

Comment

Comment