  • How to split a numeric variable

    Hi All,
    I have a file, filename.dta, with several numeric variables such as M040810:
    Code:
         +--------------------+
          | Cod_Mi~1   M040810 |
          |--------------------|
       1. |        2       288 |
       2. |        3       288 |
       3. |        4       288 |
       4. |        5       288 |
       5. |        7       180 |
          |--------------------|
       6. |       10       288 |
       7. |       12       288 |
       8. |       13       288 |
       9. |       14       288 |
      10. |       15       178 |
          |--------------------|
      11. |       17       288 |
      12. |       18       288 |
      13. |       19       288 |
      14. |       21       288 |
      15. |       22       288 |
    I'd like to split M040810. The leftmost digit of M040810 indicates whether the subject has some disease (1) or not (2), and the other two digits give the year of diagnosis, or 88 if not applicable. So I'd like to create two new variables: first, a binary variable with newvariable = 1 if the condition is present and newvariable = 2 otherwise; second, a variable holding the year of diagnosis, if applicable.
    The problem is that I don't know how to do it.
    Can you help me?
    Thank you in advance.

  • #2
    There are many ways to do this. Here are three one-line methods of getting your flag variable with values 1 or 2. Some care is needed just in case there are missing values that you have not shown us.

    Code:
    gen flag1 = real(substr(string(M040810), 1, 1))
    gen flag2 = cond(M040810 >= 200, 2, cond(M040810 >= 100, 1, .))
    gen flag3 = floor(M040810/100)
    Here is a method of getting the other variable. The code above should suggest others.

    Code:
    gen year = mod(M040810, 100) if mod(M040810, 100) != 88
    To be useful the year variable may need extra surgery, say

    Code:
    replace year = cond(year < 15, 2000 + year, 1900 + year)
    PS. The coding here is extraordinary. It seems that year of diagnosis could be 78 or 80 (presumably 1978 or 1980) but year of diagnosis could not be 1988 as 88 is a special code for not applicable! Avoiding a missing value code that could be a legitimate data value is one of the first lessons in data management.
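
    As a minimal sketch of why care is needed with missing values, the three flag methods can be compared on a small made-up dataset (the three values below are illustrative only, and flag2b is a hypothetical name). Stata treats a missing value as larger than any number, so the cond() version classifies a missing code as 2, whereas the other two methods return missing. One possible guard is an if qualifier:

    ```stata
    // illustrative data: two observed codes and one missing value
    clear
    input M040810
    288
    180
    .
    end

    // the three one-line methods from this post
    gen flag1 = real(substr(string(M040810), 1, 1))
    gen flag2 = cond(M040810 >= 200, 2, cond(M040810 >= 100, 1, .))
    gen flag3 = floor(M040810/100)

    // hypothetical guarded variant: leave the flag missing when the code is missing
    gen flag2b = cond(M040810 >= 200, 2, cond(M040810 >= 100, 1, .)) if !missing(M040810)

    // the missing observation: flag1 and flag3 are missing, flag2 is 2
    assert missing(flag1) & missing(flag3) & flag2 == 2 in 3
    assert missing(flag2b) in 3
    list
    ```

    On the two non-missing observations all four variables agree, so the choice only matters if the real data contain missing codes.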

    Last edited by Nick Cox; 24 Sep 2014, 05:26.



    • #3
      Code:
      // create some example data
      clear
      input   M040810
       288
       288
       288
       288
       180
       288
       288
       288
       288
       178
       288
       288
       288
       288
       288
       end
       
      // create the disease variable
      gen disease = floor(M040810/100)
      
      // check if it only takes the values 1 or 2
      assert inlist(disease, 1, 2)
      
      // create the year variable
      gen year = mod(M040810, 100)
      
      // check whether it takes values between 1 and 99
      assert inrange(year,1,99)
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------



      • #4
        Thank you Nick and Marteen, and sorry for the typo in the title.



        • #5
          You may still be able to edit out the typo in your original title. As I write, it is less than 1 hour since your first post. You should certainly correct "Marteen" to "Maarten". He is a very nice guy but he doesn't like his name being spelled incorrectly.



          • #6
            Nick, the coding is fine and 88 can be a valid year of diagnosis, since the interpretation of the last two digits depends on the first one and they should not be read independently. This notation is convenient for data entry, when you have to type a value for each patient (and there is no missing-value digit in real life, sigh). In that case you use the first digit as a pilot. This conveniently allows all values to be of the same width (3 in this case). An alternative and more common approach (used, for example, in DHS data) is to use a value of a different magnitude (e.g. 997) for a missing value.

            However, once the two variables are separated, this can be represented with a missing value in Stata (extending Maarten's example):
            Code:
            // create some example data
            clear
            input   M040810
             288
             288
             288
             288
             180
             288
             288
             288
             288
             178
             288
             288
             288
             288
             288
             end
             
            // create the disease variable
            gen disease = floor(M040810/100)
            
            // check if it only takes the values 1 or 2
            assert inlist(disease, 1, 2)
            
            // create the year variable
            gen year = mod(M040810, 100)
            
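            // where disease==2, the year digits must be the filler code 88
            assert year==88 if (disease==2)
            // recode the filler as the extended missing value .a and label it
            replace year=.a if (disease==2)
            label define yrlab .a "Not applicable"
            label values year yrlab
            list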
            You can probably also drop the disease variable, as it can now easily be reconstructed from year.
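
            A minimal sketch of that reconstruction, assuming .a is the only code used for "not applicable" and that any other missing value in year should leave the flag missing (disease2 is a hypothetical name, and the three data values are illustrative):

            ```stata
            // illustrative data
            clear
            input M040810
            288
            180
            178
            end

            // split as above and recode the filler year to .a
            gen disease = floor(M040810/100)
            gen year = mod(M040810, 100)
            replace year = .a if disease == 2

            // rebuild the flag from year alone: .a means no disease,
            // a real year means disease; system missing stays missing
            gen disease2 = cond(year == .a, 2, 1) if year != .

            // the rebuilt flag matches the original
            assert disease2 == disease
            ```

            Note that the comparison is with .a specifically. Testing missing(year) instead would also count a system missing value as "no disease".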

            PS: I don't see why Maarten asserts year is non-zero in his example. Nothing in the description suggests that year can't be 2000.

            Best, Sergiy Radyakin



            • #7
              Sergiy: Excellent point on coding of 88; I missed that on first reading.



              • #8
                Originally posted by Sergiy Radyakin
                PS: I don't see why Maarten asserts year is non-zero in his example. Nothing in the description suggests that year can't be 2000.
                Sergiy is right: the assert should have been:

                Code:
                assert inrange(year,0, 99)
                ---------------------------------
                Maarten L. Buis
                University of Konstanz
                Department of history and sociology
                box 40
                78457 Konstanz
                Germany
                http://www.maartenbuis.nl
                ---------------------------------
