  • Dropping unneeded observations in panel data

    Hello everyone,

    I hope this is the right place for this question. If not, please tell me where the right place is; it is my first post, after all.

    I want to drop several observations from my dataset. Everyone in the set has a unique ID (the panel variable), and the time variable is the year variable (1984 - 2016).

    I want to drop everyone who never has the value "retired" in a variable called "labor force status". But I want to keep every observation from every year for the people who do have the value "retired" at any time.
    Basically I want to keep all the information for pensioners, and all the information for people still working should be dropped.

    It would be really helpful if somebody could help me with this, because I have rarely worked with panel data before.

    Greetings
    Florian


  • #2
    Please see FAQ Advice 12 on posting data examples. https://www.statalist.org/forums/help#stata

    Here are two guesses: that being retired is conveyed by a particular numeric code, or that it is conveyed by a particular string value. The principle is the same in either case: whether an identifier takes some value in any year (at least one year) can be identified by the maximum over the panel of a true-or-false expression being 1. (More generally, and more concisely: any <-> max and all <-> min over true-or-false expressions.)

    For lengthier discussion, see the FAQ https://www.stata.com/support/faqs/d...ble-recording/

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(id year numstatus) str8 strstatus
    1 2015 4 "employed"
    1 2016 7 "retired" 
    2 2015 4 "employed"
    2 2006 4 "employed"
    end
    
    egen wanted1 = max(numstatus == 7), by(id)
    egen wanted2 = max(strstatus == "retired"), by(id)
    
    list, sepby(id)
    
         +-----------------------------------------------------+
         | id   year   numsta~s   strsta~s   wanted1   wanted2 |
         |-----------------------------------------------------|
      1. |  1   2015          4   employed         1         1 |
      2. |  1   2016          7    retired         1         1 |
         |-----------------------------------------------------|
      3. |  2   2015          4   employed         0         0 |
      4. |  2   2006          4   employed         0         0 |
         +-----------------------------------------------------+
    Once you have such a variable, do something like

    Code:
    keep if wanted1
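
    The "any <-> max, all <-> min" principle can be sketched on a small made-up panel. This is only an illustration: the data, the variable names anyretired and allretired, and the assumption that code 7 means retired (taken from the guess above) are hypothetical.

    ```stata
    * Hypothetical data; 7 is assumed to be the code for "retired"
    clear
    input float(id year numstatus)
    1 2015 4
    1 2016 7
    2 2015 7
    2 2016 7
    3 2015 4
    3 2016 4
    end

    * 1 if the person is retired in at least one year, 0 otherwise
    egen anyretired = max(numstatus == 7), by(id)
    * 1 only if the person is retired in every observed year
    egen allretired = min(numstatus == 7), by(id)

    * keep every observation of anyone ever retired (ids 1 and 2 here)
    keep if anyretired
    list, sepby(id)
    ```

    Note that numstatus == 7 evaluates to 0 when numstatus is missing, so a year with missing status counts as "not retired" in both variables.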


    • #3
      This was really helpful. Thank you


      • #4
        I have two questions regarding a similar topic.
        Hopefully the code example is adequate.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long pid int(syear birth occupation Life_satisfaction)
         201 1983 1926  12  5
         201 1984 1926  12  6
         201 1985 1926  13  7
         201 1986 1926  13  8
         201 1987 1926  13  7
        end
        I want to generate a variable that holds the person's life satisfaction in the year he or she first retires. Here that year is 1985, so the variable should take the value 7 in every observation for this person. How do I do that for the whole dataset?

        Second question:

        How can I drop people from the dataset who were unemployed (12) for only one year before retiring (13)? In this case person 201 wouldn't be dropped, because he was unemployed for two years before retiring.

        Thanks a lot.


        • #5
          See the concurrent thread https://www.statalist.org/forums/for...30520-anywatch

          See also Sections 9 and 10 of Speaking Stata: Compared with ...

          http://www.stata-journal.com/sjpdf.h...iclenum=dm0055


          and mentions of dm0055 in the forum.

          Code:
          clear
          input long pid int(syear birth occupation Life_satisfaction)
           201 1983 1926  12  5
           201 1984 1926  12  6
           201 1985 1926  13  7
           201 1986 1926  13  8
           201 1987 1926  13  7
           end 
           
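          * retire: first year in which occupation is 13 (retired); missing if never retired
          bysort pid : egen retire = min(cond(occupation == 13, syear, .))
          * satis_retire: life satisfaction in that year, copied to all of the person's observations
          by pid : egen satis_retire = min(cond(syear == retire, Life_satisfaction, .))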
           
          list 
          
               +---------------------------------------------------------------+
               | pid   syear   birth   occupa~n   Life_s~n   retire   satis_~e |
               |---------------------------------------------------------------|
            1. | 201    1983    1926         12          5     1985          7 |
            2. | 201    1984    1926         12          6     1985          7 |
            3. | 201    1985    1926         13          7     1985          7 |
            4. | 201    1986    1926         13          8     1985          7 |
            5. | 201    1987    1926         13          7     1985          7 |
               +---------------------------------------------------------------+
          Then the second problem is a twist or two on the first. You can get close to where you want via


          Code:
          bysort pid : egen wanted = total(cond(syear < retire, occup == 12, .))
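
          As a rough sketch of how that count might then be used to drop people, assuming "only one year" means exactly one year coded 12 at any point before the first retirement year: person 202 and occupation code 11 are hypothetical additions for illustration, and the count does not check whether the unemployed years are consecutive or directly precede retirement.

          ```stata
          * 201 is from the example above; 202 and code 11 are hypothetical
          clear
          input long pid int(syear occupation)
          201 1983 12
          201 1984 12
          201 1985 13
          202 1983 11
          202 1984 12
          202 1985 13
          end

          * first year of retirement, and number of years coded 12 before it
          bysort pid : egen retire = min(cond(occupation == 13, syear, .))
          by pid : egen wanted = total(cond(syear < retire, occupation == 12, .))

          * drop people with exactly one unemployed year before retiring;
          * people who never retire (missing retire) are left untouched
          drop if wanted == 1 & !missing(retire)
          list, sepby(pid)
          ```

          Here 202 is dropped and 201, with two unemployed years, is kept. The !missing(retire) condition matters because syear < retire is true whenever retire is missing, so for people who never retire wanted counts all their unemployed years.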


          • #6
            I really appreciate the help. And I'll try to adjust to the rules of this forum. Thanks a lot, again.


            • #7
              Sadly, I have yet another question.

              I want to run a regression to find out what affects my dependent variable, happiness. Simply put, there are missing values in one of my independent variables, income. Should I ignore them and just run the regression anyway? Or should I limit my sample so that every value of happiness has a corresponding value of income? Is there another way to fix this problem?

              Thank you


              • #8
                I'd post that as a new question. But, put simply, Stata by default leaves out any observation with a missing value on a variable in the model, so your choice is no choice.

                Another way to approach the problem is multiple imputation.
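
                The default behaviour can be checked directly. The sketch below uses the auto dataset shipped with Stata purely as a stand-in, since happiness and income are not available here: rep78 plays the role of the predictor with missing values.

                ```stata
                * Stand-in example: rep78 is missing in 5 of the 74 observations
                sysuse auto, clear
                misstable summarize rep78

                * regress silently uses only observations complete on price and rep78
                regress price rep78
                display e(N)

                * e(sample) marks the observations that were actually used
                count if e(sample)
                count if missing(rep78)
                ```

                Comparing e(N) with the total number of observations shows how many were left out, and e(sample) can be used to restrict later commands to the same estimation sample.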


                • #9
                  Florian:
                  as an aside to Nick's helpful advice, you should first investigate whether the missingness that you detected in your dataset is ignorable or not.
                  Kind regards,
                  Carlo
                  (Stata 18.0 SE)
