Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cleaning a variable using the generate command with the 'real' function

    Hi all

    I currently have a variable 'patientheight' (storage type string) in my dataset that describes heights in cm for 25,000 plus patients. While the majority of these measurements are recorded correctly, a small proportion have been entered in the wrong format. For example, a tab of the first fifteen recorded heights is as follows:

    Code:
    list patientheight in 1/15
    168
    171
    171
    156
    None
    135
    1.2
    175cm
    132 cm
    125
    136
    148.7cm
    N/A
    148.7
    None


    From the above, I would like to replace those coded as 'None' and 'N/A' as missing as they cannot be used. I generally use the following code to take care of this:

    Code:
    tab patientheight if missing(real(patientheight))
    
    gen heightnew = real(patientheight)
    tab heightnew
    However, the above will also get rid of those heights entered as 175cm, 132 cm etc. There are almost 3000 heights entered with a 'cm' suffix which I would like to retain. Is there a way to tell stata to retain heights entered with the 'cm' suffix and convert the same to numerical?

    Thanks

    /Amal

  • #2
    Code:
    clear
    input str10 height
    "168"
    "171"
    "171"
    "156"
    "None"
    "135"
    "1.2"
    "175cm"
    "132 cm"
    "125"
    "136"
    "148.7cm"
    "N/A"
    "148.7"
    "None"
    end
    
    // in the ignore option you can specify the strings you want to ignore
    destring height, gen(height2) ignore("cm" "None" "N/A")
    
    l
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------

    Comment


    • #3
      destring is great. No one could be a bigger fan.

      But other ways of doing it are of interest too. Write a regular expression for a number. It's easy: the rules here appear to be.

      1. Numeric digits.

      2. Possibly, a decimal point.

      3. Possibly, more numeric digits.

      I like moss (SSC) as a wrapper. A rolling moss gathers the stones. (British proverb, reversed.) Maarten cleaned up Amal's example, so here we go:

      Code:
      . clear
      
      . input str10 height
      
               height
        1. "168"
        2. "171"
        3. "171"
        4. "156"
        5. "None"
        6. "135"
        7. "1.2"
        8. "175cm"
        9. "132 cm"
       10. "125"
       11. "136"
       12. "148.7cm"
       13. "N/A"
       14. "148.7"
       15. "None"
       16. end
      
      . moss height, match("([0-9]+\.*[0-9]*)") regex
      
      . assert _count <= 1
      
      . drop _count _pos1
      
      . list
      
           +-------------------+
           |  height   _match1 |
           |-------------------|
        1. |     168       168 |
        2. |     171       171 |
        3. |     171       171 |
        4. |     156       156 |
        5. |    None           |
           |-------------------|
        6. |     135       135 |
        7. |     1.2       1.2 |
        8. |   175cm       175 |
        9. |  132 cm       132 |
       10. |     125       125 |
           |-------------------|
       11. |     136       136 |
       12. | 148.7cm     148.7 |
       13. |     N/A           |
       14. |   148.7     148.7 |
       15. |    None           |
           +-------------------+
      Clearly, that 1.2 needs a fix, presumably to 120. That kind of problem is easy to spot.

      A subtlety is that we check that no more than 1 number is returned. If there is no number, we are disappointed but not surprised. If there is 1 number, we are happy. 2 or more numbers: something else is happening and we should check.

      Note that this approach extends to looking for units, "cm", or whatever.
      Last edited by Nick Cox; 28 Oct 2015, 07:03.

      Comment


      • #4
        Thank you both for your replies! I did not know about 'moss'. Excited to play around with it now!

        /Amal

        Comment

        Working...
        X