  • For each observation, find the first change in a variable tracked by year

    I am conducting a study to track changes in people's health conditions. The data look like the example below. My goal is to find the first year in which the condition changed. For example, for person 1 the first year of change is 2001; person 2 never changed; for person 5 it is 2002; and so on. It may be a good idea to create a variable at the end, say "first_yr", that contains the year of the first change. The same analysis will be conducted for the subsample of individuals with no missing information. My real data have more years and observations. I tried to use a loop and lost track. Any help would be great! Thanks.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id str1(h_2000 h_2001 h_2002 h_2003)
    1 "a" "b" "b" "a"
    2 "c" ""  "c" "c"
    3 "d" "e" "d" "d"
    4 "f" ""  "f" "f"
    5 "g" ""  "h" "g"
    6 "b" "b" "b" "g"
    end

  • #2
    This problem is nearly the same as the one you posted yesterday, and the solution starts out similarly. The key insight is that most things in Stata are easier when the data is laid out long. So the first step is to -reshape- long. What makes this a bit more complicated is the need to identify not just whether a change has occurred, but when the first one was.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id str1(h_2000 h_2001 h_2002 h_2003)
    1 "a" "b" "b" "a"
    2 "c" ""  "c" "c"
    3 "d" "e" "d" "d"
    4 "f" ""  "f" "f"
    5 "g" ""  "h" "g"
    6 "b" "b" "b" "g"
    end
    
    reshape long h_, i(id) j(year)
    drop if missing(h_)
    //    GENERATE RUNNING COUNT OF CHANGES WITHIN ID
    by id (year), sort: gen n_changes = sum(h_ != h_[_n-1] & _n != 1)
    //    NOW IDENTIFY THE YEAR OF THE VERY FIRST CHANGE
    by id (year): egen first_yr = max(cond(n_changes == 1 & n_changes[_n-1] == 0), year, .)
    
    //    AND IF YOU NEED TO GO BACK TO WIDE LAYOUT
    drop n_changes
    reshape wide
    Notes: For people who do not have any changes, this code sets first_yr to missing value. As with yesterday's problem, I don't know where you're heading with this data, but it is likely that whatever you want to do next will be easier if you keep it in long layout and skip the -reshape wide- command shown at the end.
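
    The original question also mentions repeating the analysis on the subsample of individuals with no missing information. A minimal sketch of one way to do that, assuming "no missing information" means a non-empty value in every year: count the missing years per person before dropping anything, keep only the complete cases, and then compute the first change. The variable names n_missing and change are illustrative and do not come from the thread's posted solution.

    ```stata
    * Example data as posted in #1
    clear
    input byte id str1(h_2000 h_2001 h_2002 h_2003)
    1 "a" "b" "b" "a"
    2 "c" ""  "c" "c"
    3 "d" "e" "d" "d"
    4 "f" ""  "f" "f"
    5 "g" ""  "h" "g"
    6 "b" "b" "b" "g"
    end

    reshape long h_, i(id) j(year)

    * Count missing years per person BEFORE dropping any observations
    by id (year), sort: egen n_missing = total(missing(h_))

    * Keep only individuals observed in every year (assumed definition)
    keep if n_missing == 0

    * Flag changes within id, then take the earliest year flagged
    by id (year): gen byte change = h_ != h_[_n-1] & _n > 1
    by id: egen first_yr = min(cond(change, year, .))

    list, sepby(id)
    ```

    With the example data, only persons 1, 3 and 6 have complete information, so only they remain in this subsample.
    
    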

    Thank you for using -dataex- and for giving an example that illustrates the problem nicely.
    Last edited by Clyde Schechter; 07 Dec 2016, 13:01. Reason: Correct error in code


    • #3
      Here is the cross-reference implied by Clyde: http://www.statalist.org/forums/foru...a-new-variable


      • #4
        Thank you very much, Clyde. Your code is really clear and easy to use. I did think about your answer from yesterday and tried reshape, but I didn't find a simple way like yours. I think I need to learn more about Stata! For example, I can test your code and see that it works well, but I don't really understand what "_n != 1" does in the sum().
        Thanks again for your help!


        • #5
          Hello Dong Shen,

          There are also other ways to do this if you want to keep the data in wide format, though as Clyde noted, it might be better to use long format for your analysis.

          For example,
          Code:
          gen temp = h_2000 //first year
          gen first_yr = .
          forvalues year=2000/2003 { //Edit as needed for years
              replace first_yr = `year' if temp!=h_`year' & !mi(h_`year',temp) & first_yr==.
              replace temp = h_`year' if h_`year'!=""
          }
          drop temp

          Clyde Schechter Do you mind explaining the following code?
          Code:
          by id (year): egen first_yr = max(cond(n_changes == 1 & n_changes[_n-1] == 0), year, .)
          I am confused by the position of the first closing parenthesis. The code works, but should that closing parenthesis be at the end of the line?
          E.g.
          Code:
          by id (year): egen first_yr = max(cond(n_changes == 1 & n_changes[_n-1] == 0, year, .))
          Both versions seem to work, so I must be misunderstanding something here.
          Last edited by Roger Chu; 07 Dec 2016, 13:32.


          • #6
            but I don't really understand what "_n != 1" does in the sum()
            So the first thing to know is that in a command prefixed with -by:-, _n refers to the observation number within each group. In particular, the term _n != 1 means that the first observation in each group never meets the condition. The reason is that when _n == 1, the comparison h_[1] != h_[0] will, in general, be true, because h_[0] is always missing (an empty string, for a string variable). But h_[1] merely being non-missing doesn't count as a "change." That is why _n != 1 was needed: to prevent the first observation from being counted as a change when it really isn't.
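
            To see this in action, here is a small sketch on made-up toy data (not the data from this thread). It compares the comparison with and without the _n != 1 guard; the variable names naive and guarded are only illustrative.

            ```stata
            * Toy data: two people, two years each
            clear
            input byte id int year str1 h_
            1 2000 "a"
            1 2001 "b"
            2 2000 "c"
            2 2001 "c"
            end

            * Under -by:-, _n restarts at 1 within each id, so on the first
            * observation of a group h_[_n-1] is h_[0], the empty string
            by id (year), sort: gen byte naive   = h_ != h_[_n-1]
            by id (year):       gen byte guarded = h_ != h_[_n-1] & _n != 1

            list, sepby(id)
            ```

            Without the guard, the year-2000 row of each person is flagged as a change even though nothing changed; with the guard, only person 1's switch from "a" to "b" in 2001 is flagged.
            
            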

            I think a good way to get familiar with Stata is to start with the Getting Started [GS] section of the on-line PDF manuals. After that, read the User's Guide [U] section. These cover all of the basic commands that are part of everyday data management and analysis in Stata. The documentation is lengthy and has lots of worked examples. You won't retain everything you read. But having exposed yourself to the basic commands, when you encounter a problem, you will probably be able to think of the commands that are likely to be helpful, and then turning to the -help- files or the corresponding manual sections will refresh your memory on the details. There are also several books in the Stata bookstore (at stata.com/bookstore/) that cover the basics of data management and programming in Stata. In addition, from time to time, StataCorp offers online netcourses that range from the very beginning level through advanced programming and in-depth treatment of certain types of analysis. I took several of these courses when I was a Stata beginner and found them well worth the time and expense. Finally, you can learn an enormous amount by just following Statalist and seeing the problems that others pose and how they are solved. I daresay that the majority of what I learned about Stata early on came from Statalist (which was a listserv back then).


            • #7
              I am going to tweak Clyde's code slightly.

              I get the first year (subject to conditions) using min() rather than max().

              I show two ways to get that. See also http://www.stata-journal.com/sjpdf.h...iclenum=dm0055 Sections 9 and 10.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input byte id str1(h_2000 h_2001 h_2002 h_2003)
              1 "a" "b" "b" "a"
              2 "c" ""  "c" "c"
              3 "d" "e" "d" "d"
              4 "f" ""  "f" "f"
              5 "g" ""  "h" "g"
              6 "b" "b" "b" "g"
              end
              
              reshape long h_, i(id) j(year)
              drop if missing(h_)
              
              * change is an indicator, 0 or 1 
              by id (year), sort: gen change = h_ != h_[_n-1] & _n > 1
              by id: egen first_yr = min(cond(change, year, .)) 
              by id: egen first_yr2 = min(year / change) 
              
              list, sepby(id) 
              
                   +-----------------------------------------------+
                   | id   year   h_   change   first_yr   first_~2 |
                   |-----------------------------------------------|
                1. |  1   2000    a        0       2001       2001 |
                2. |  1   2001    b        1       2001       2001 |
                3. |  1   2002    b        0       2001       2001 |
                4. |  1   2003    a        1       2001       2001 |
                   |-----------------------------------------------|
                5. |  2   2000    c        0          .          . |
                6. |  2   2002    c        0          .          . |
                7. |  2   2003    c        0          .          . |
                   |-----------------------------------------------|
                8. |  3   2000    d        0       2001       2001 |
                9. |  3   2001    e        1       2001       2001 |
               10. |  3   2002    d        1       2001       2001 |
               11. |  3   2003    d        0       2001       2001 |
                   |-----------------------------------------------|
               12. |  4   2000    f        0          .          . |
               13. |  4   2002    f        0          .          . |
               14. |  4   2003    f        0          .          . |
                   |-----------------------------------------------|
               15. |  5   2000    g        0       2002       2002 |
               16. |  5   2002    h        1       2002       2002 |
               17. |  5   2003    g        1       2002       2002 |
                   |-----------------------------------------------|
               18. |  6   2000    b        0       2003       2003 |
               19. |  6   2001    b        0       2003       2003 |
               20. |  6   2002    b        0       2003       2003 |
               21. |  6   2003    g        1       2003       2003 |
                   +-----------------------------------------------+
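
              The second variant, min(year / change), relies on the fact that division by zero returns missing in Stata and that egen's min() ignores missing values. A small sketch on toy values (the variable names ratio and first are only illustrative):

              ```stata
              * Toy data: one person with changes flagged in 2001 and 2003
              clear
              input int year byte change
              2000 0
              2001 1
              2002 0
              2003 1
              end

              * year / 0 is missing; year / 1 is just year
              gen ratio = year / change

              * min() skips the missing values, leaving the earliest flagged year
              egen first = min(ratio)

              list
              ```

              Here ratio is missing in 2000 and 2002, so the minimum is taken over 2001 and 2003 only.
              
              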


              • #8
                Thank you all for the explanation. I learned a lot from you! This will be the place that I will visit more often.
