No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • how to search for alpha characters in a numeric (long/float) variable

    i have a huge data set (over 6 million). one of my variables is age which is a numeric var stored as float. i created this from a variable that was a string variable using the encode command. however, i realized that some of the data was mixed up in some observations in that some of the entries for age were not numbers but alpha characters and even words. how can i search the numeric variable age for these alpha characters and replace or drop them from my data list. im guessing there are around 80 of these so they are impossible to find with naked eye

  • #2
    Welcome to Statalist, Vishal.

    You made a serious mistake: converting a number stored in a string variable to a numeric variable requires the destring command, not the encode command. The output of the help encode command even tells us

    Do not use encode if varname contains numbers that merely happen to be stored as strings; instead, use generate newvar = real(varname) or destring; see real() or [D] destring.
    You should discard what you have and recreate the numeric age variable using destring. Here is some made up data demonstrating the problem with using encode in this situation, and a better approach to converting age from string to numeric.
    * Example generated by -dataex-. To install: ssc install dataex
    input float person str7 age_str
    101 "42"    
    102 "39"    
    103 "10"    
    104 "8"      
    105 "unknown"
    // wrong approach
    encode age_str, generate(age_enc)  // will have value labels
    generate age_enc2 = age_enc        // will not have value labels
    // better approach
    destring age_str, generate(age_des) force
    . describe
    Contains data
      obs:             5                          
     vars:             5                          
     size:           100                          
                  storage   display    value
    variable name   type    format     label      variable label
    person          float   %9.0g                
    age_str         str7    %9s                  
    age_des         byte    %10.0g                
    age_enc         long    %8.0g      age_enc    
    age_enc2        float   %9.0g                
    Sorted by:
         Note: Dataset has changed since last saved.
    . list
         | person   age_str   age_des   age_enc   age_enc2 |
      1. |    101        42        42        42          3 |
      2. |    102        39        39        39          2 |
      3. |    103        10        10        10          1 |
      4. |    104         8         8         8          4 |
      5. |    105   unknown         .   unknown          5 |
    You see that the age_enc created by encode looks correct, but that is a consequence of encode creating value labels that display the original value of the encoded data. You see that the values of age_enc2, which are the same as those of age_enc but displayed without value labels, have nothing to do with the original data. And age_des contains what you need - numeric values that correspond to numeric ages, and a missing value for those that did not have numeric values.
    Last edited by William Lisowski; 15 Sep 2018, 17:40.


    • #3
      See also my posts within