  • summarize information not accurate for age variable

    hi all,

    I am new to Stata and using it (15.1) for my graduate thesis work with administrative health data.

    I have > 200,000 records, 1 for each patient visit, with numerous variables. I have 18 values for age (from 0 to 17). When I summarize to get the descriptive stats, it tells me the min is 1 and the max is 18. These are not accurate, which leads me to believe that the mean is also incorrect. Do I need to somehow recode the ages to be read correctly? Or is the fact that there are zeroes present (appropriately so) throwing it off?

    Thanks so much.


    Krystle

  • #2
    Welcome to Statalist

    I believe your data was read into Stata with age treated as a string variable, and you used encode rather than destring to create a numeric variable.
    Code:
    . input str2 age_s
    
             age_s
      1. 0
      2. 1
      3. 2
      4. 3
      5. 4
      6. 5
      7. 6
      8. 7
      9. end
    
    . destring age_s, generate(age1)
    age_s: all characters numeric; age1 generated as byte
    
    . encode age_s, generate(age2)
    
    . list, clean
    
           age_s   age1   age2  
      1.       0      0      0  
      2.       1      1      1  
      3.       2      2      2  
      4.       3      3      3  
      5.       4      4      4  
      6.       5      5      5  
      7.       6      6      6  
      8.       7      7      7  
    
    . list, clean nolabel
    
           age_s   age1   age2  
      1.       0      0      1  
      2.       1      1      2  
      3.       2      2      3  
      4.       3      3      4  
      5.       4      4      5  
      6.       5      5      6  
      7.       6      6      7  
      8.       7      7      8  
    
    . codebook
    
    ------------------------------------------------------------------------------------------------
    age_s                                                                                (unlabeled)
    ------------------------------------------------------------------------------------------------
    
                      type:  string (str2), but longest is str1
    
             unique values:  8                        missing "":  0/8
    
                tabulation:  Freq.  Value
                                 1  "0"
                                 1  "1"
                                 1  "2"
                                 1  "3"
                                 1  "4"
                                 1  "5"
                                 1  "6"
                                 1  "7"
    
    ------------------------------------------------------------------------------------------------
    age1                                                                                 (unlabeled)
    ------------------------------------------------------------------------------------------------
    
                      type:  numeric (byte)
    
                     range:  [0,7]                        units:  1
             unique values:  8                        missing .:  0/8
    
                tabulation:  Freq.  Value
                                 1  0
                                 1  1
                                 1  2
                                 1  3
                                 1  4
                                 1  5
                                 1  6
                                 1  7
    
    ------------------------------------------------------------------------------------------------
    age2                                                                                 (unlabeled)
    ------------------------------------------------------------------------------------------------
    
                      type:  numeric (long)
                     label:  age2
    
                     range:  [1,8]                        units:  1
             unique values:  8                        missing .:  0/8
    
                tabulation:  Freq.   Numeric  Label
                                 1         1  0
                                 1         2  1
                                 1         3  2
                                 1         4  3
                                 1         5  4
                                 1         6  5
                                 1         7  6
                                 1         8  7
    
    .
    You need to return to your original data and use destring, rather than encode, to convert numbers stored as strings to numeric variables - age, and any other variables for which you did that.

    Better still, you should start by trying to understand why your age values were treated by Stata as strings. If it was because some values contained text like "UNK" for unknown, then you need to handle that differently, by replacing the "UNK" values with a string representing a Stata numeric missing value (. or .a through .z) and then applying destring.
    Code:
    . input str3 age_s
    
             age_s
      1. 0
      2. 1
      3. 2
      4. 3
      5. 4
      6. UNK
      7. 5
      8. 6
      9. 7
     10. end
    
    . replace age_s = "." if age_s == "UNK"
    (1 real change made)
    
    . destring age_s, generate(age1)
    age_s: all characters numeric; age1 generated as byte
    (1 missing value generated)
    
    . list, clean
    
           age_s   age1  
      1.       0      0  
      2.       1      1  
      3.       2      2  
      4.       3      3  
      5.       4      4  
      6.       .      .  
      7.       5      5  
      8.       6      6  
      9.       7      7
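
    One way to see which values kept age_s from being read as numeric is to tabulate the entries that real() cannot convert. This is only a sketch and assumes, as in the example above, that the original string variable is named age_s; run it on the original data, before any replace.
    Code:
    * show each value of age_s that is not a valid number, with its frequency
    tabulate age_s if missing(real(age_s)), missing
    
    * how many observations are affected
    count if missing(real(age_s))
    Empty strings and "." are included as well, because real() returns missing for them too.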


    • #3
      Thank you so much for your response. It does appear there was a single nonnumeric entry, which is why encode was used rather than destring. I have used gen age1=real(age) and that appears to have solved the problem as well. Does this equate to what you proposed?


      • #4
        It does equate to what I proposed, as would
        Code:
        destring age, generate(age1) force
        They equate because you have created a new variable, preserving the original, and have taken the time to determine precisely what the problem was: a single non-numeric entry. That was the important part of my proposal, not the command itself. As a result, you've (a) preserved the original data (not so much an issue since there was just a single anomaly) and (b) ensured that you understand your data. Too many users just apply destring, replace force and lose important information.

        From this you've learned that, as the output of help encode tells us,

        Do not use encode if varname contains numbers that merely happen to be stored as strings; instead, use generate newvar = real(varname) or destring
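
        A quick check that the two approaches agree, and of which entries became missing, might look like the following. It is a sketch that assumes the string variable is named age, as in #3; age1 and age1b are illustrative names for the new variables.
        Code:
        * convert with real(); double avoids rounding if any values have decimals
        generate double age1 = real(age)
        
        * convert with destring; force turns nonnumeric entries into missing
        destring age, generate(age1b) force
        
        * stops with an error if the two results ever differ
        assert age1 == age1b
        
        * list the nonempty entries that could not be converted
        list age if missing(age1) & !missing(age)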


        • #5
          thank you so much!


          • #6
            The name force for the option really is intended to convey that its use is not preferred over caution and consideration.
