Trying to code for a loop to clean data entry errors and destring my variables.

Hania Sahn

Join Date: Sep 2018

Posts: 2
#1

Trying to code for a loop to clean data entry errors and destring my variables.

01 Dec 2018, 02:01

Dear All,

I have imported "household survey" in csv format that contains numerous data entry errors. The most common one is a double entry. When importing, stata reads all my variables as string even for the expenditure and income variables. I am trying to clean it and want to use a loop for stata to replace these data entry issues so I may destring my expenditure and income variables and other numerical variables

The data set is
obs: 84,317
vars: 246

One of the ways i tried to do it is to identify these double entries as such for one variable :

Code:

gen rent1=regexm(exp_jdtheportionre, "[0-9] [0-9]")

This is what my data looks like for certain " double entry " observations for only 3 of the variables but it is consistent through each income or expenditure related variable or numerical variable such as household size or age, etc.. .

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str32(exp_jdtheportionre exp_jdutilitiesele) str33 exp_jdfoodexcludin "0 0" "0 0" "0 0" "0 0" "0 0" "0 0" "75 75" "10 10" "50 50" "150 150" "23 23" "100" "50 50" "10 10" "50 50" "0 0" "0 0" "0 0" "150 150" "20 20" "50 50" "75 75" "20 20" "100 100" "110 110" "30 30" "150 150" "50 50" "0 0" "0 0" "0 0" "0 0" "0 0" "125 125" "10 10" "50 50" "100 100" "25 25" "35 35" "100 100" "24 24" "40 40" "150 150" "24 24" "70 70" "0 0" "27 27" "70 70" "50 50" "7 7" "50 50" "40 40" "15 15" "40 40" "100 100" "30 30" "100 100" "150 150" "17 17" "30 30" end

---------

Then I would run for my varlist the following to replace for instance the "0 0" entry by "0" etc..

Code:

foreach var of varlist `r(varlist)' { 2. replace `var'="0" if inlist(`var',"0 0") 3. }

However, it is time consuming and totally inelegant and does not identify all these double entry errors in my dataset. It was fine when it was for a "Yes Yes" or "No No" entry.

Code:

ds, has(type string) foreach var of varlist `r(varlist)' { replace `var'="Yes" if inlist(`var',"Yes Yes") } foreach var of varlist `r(varlist)' { replace `var'="No" if inlist(`var',"No No") }

I did try to look at the string functions to identify and replace but I am not finding the right way to do it. I am sure there is a way for string functions to do it but I am not able to code it or find it.

Thanks for any help

Best

H
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35637

01 Dec 2018, 03:04

Code:

foreach var in `r(varlist)' {
    replace `var' = word(`var', 1) if wordcount(`var') == 2 & word(`var', 1) == word(`var', 2)
}

Comment

Hania Sahn

Join Date: Sep 2018

Posts: 2
#3

01 Dec 2018, 03:49

Originally posted by Nick Cox View Post

Code:

foreach var in `r(varlist)' { replace `var' = word(`var', 1) if wordcount(`var') == 2 & word(`var', 1) == word(`var', 2) }

Dear Nick,
I just can not thank you enough. This works like a charm.

Cheers again

H
1 like
Comment

Announcement

Trying to code for a loop to clean data entry errors and destring my variables.

Comment

Comment