Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Trying to code for a loop to clean data entry errors and destring my variables.

    Dear All,

    I have imported "household survey" in csv format that contains numerous data entry errors. The most common one is a double entry. When importing, stata reads all my variables as string even for the expenditure and income variables. I am trying to clean it and want to use a loop for stata to replace these data entry issues so I may destring my expenditure and income variables and other numerical variables

    The data set is
    obs: 84,317
    vars: 246


    One of the ways i tried to do it is to identify these double entries as such for one variable :

    Code:
    gen rent1=regexm(exp_jdtheportionre, "[0-9] [0-9]")
    This is what my data looks like for certain " double entry " observations for only 3 of the variables but it is consistent through each income or expenditure related variable or numerical variable such as household size or age, etc.. .


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str32(exp_jdtheportionre exp_jdutilitiesele) str33 exp_jdfoodexcludin
    "0 0"     "0 0"   "0 0"    
    "0 0"     "0 0"   "0 0"    
    "75 75"   "10 10" "50 50"  
    "150 150" "23 23" "100"    
    "50 50"   "10 10" "50 50"  
    "0 0"     "0 0"   "0 0"    
    "150 150" "20 20" "50 50"  
    "75 75"   "20 20" "100 100"
    "110 110" "30 30" "150 150"
    "50 50"   "0 0"   "0 0"    
    "0 0"     "0 0"   "0 0"    
    "125 125" "10 10" "50 50"  
    "100 100" "25 25" "35 35"  
    "100 100" "24 24" "40 40"  
    "150 150" "24 24" "70 70"  
    "0 0"     "27 27" "70 70"  
    "50 50"   "7 7"   "50 50"  
    "40 40"   "15 15" "40 40"  
    "100 100" "30 30" "100 100"
    "150 150" "17 17" "30 30"  
    end
    ---------

    Then I would run for my varlist the following to replace for instance the "0 0" entry by "0" etc..

    Code:
    foreach var of varlist `r(varlist)' {
      2. replace `var'="0" if inlist(`var',"0 0")
      3. }
    However, it is time consuming and totally inelegant and does not identify all these double entry errors in my dataset. It was fine when it was for a "Yes Yes" or "No No" entry.

    Code:
    ds, has(type string)
    foreach var of varlist `r(varlist)' {
    replace `var'="Yes" if inlist(`var',"Yes Yes")
    }
    foreach var of varlist `r(varlist)' {
    replace `var'="No" if inlist(`var',"No No")
    }
    I did try to look at the string functions to identify and replace but I am not finding the right way to do it. I am sure there is a way for string functions to do it but I am not able to code it or find it.

    Thanks for any help

    Best

    H

  • #2
    Code:
    foreach var in `r(varlist)' {
        replace `var' = word(`var', 1) if wordcount(`var') == 2 & word(`var', 1) == word(`var', 2)
    }

    Comment


    • #3
      Originally posted by Nick Cox View Post
      Code:
      foreach var in `r(varlist)' {
      replace `var' = word(`var', 1) if wordcount(`var') == 2 & word(`var', 1) == word(`var', 2)
      }
      Dear Nick,
      I just can not thank you enough. This works like a charm.

      Cheers again

      H

      Comment

      Working...
      X