Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Transforming dataset into a panel dataset

    Dear STATA-Users,

    I am working on a data set which currently looks like that: (The data is sorted already using: sort id date_start)


    id date_start date_end policy1 policy2 policy3 policy 4
    1 1 Apr 2020 3 Apr 2020 1 0 0 0
    1 1 Apr 2020 3 Apr 2020 0 2 0 0
    1 1 Apr 2020 2 Apr 2020 0 0 1 0
    1 2 Apr 2020 . 0 0 0 1
    1 3 Apr 2020 . 0 0 0 0
    1 3 Apr 2020 . 0 0 0 0
    2 1 Apr 2020 2 Apr 2020 0 1 0 0
    2 2 Apr 2020 3 Apr 2020 1 0 0 0
    2 2 Apr 2020 6 Apr 2020 0 0 0 1
    2 3 Apr 2020 1 May 2020 0 0 1 0


    I started coding a qualitative data set where policies are described in written form, where each observation (i.e. line) described a policy. I generated new variables (i.e. policy1, policy2, etc. which are coded as dummies or variables taking ordinal values). he data includes a large number of different policies in different countries (id), the date when the policy was introduced (date_start) and the date the policies end (date_end). When there are missings in date_end this means the policy is introduced for only one day.

    I would like to create a panel data set with id as the panel variable and date as the time variable and I would like to carry on the values of each policy when it is introduced until the date when the policy ends.

    I already managed to replace the values when each policy was introduced for each duplicate starting date. So I get all observations of policy variables in the same line. My idea was to be able to drop duplicate observations in terms of date-start later. So my data now looks like that:

    id date_start date_end policy1 policy2 policy3 policy 4
    1 1 Apr 2020 3 Apr 2020 1 2 1 0
    1 1 Apr 2020 3 Apr 2020 1 2 1 0
    1 1 Apr 2020 2 Apr 2020 1 2 1 0
    1 2 Apr 2020 . 0 0 0 1
    1 3 Apr 2020 . 0 0 0 0
    1 3 Apr 2020 1 Apr 2020 0 0 0 0
    2 1 Apr 2020 2 Apr 2020 0 1 0 0
    2 2 Apr 2020 3 Apr 2020 1 0 0 1
    2 2 Apr 2020 6 Apr 2020 1 0 0 1
    2 3 Apr 2020 1 May 2020 0 0 1 0

    But from here I do not manage to carry on the values of each policy until date_start takes the value of date_end when the policy was introduced. So e.g. policy1 should take the value 1, when at date 1 Apr 2020 and should remain 1 until the date 3 Apr 2020.

    I am happy for any help in this regard.

    Many thanks and best,
    Sophie
    Last edited by Sophie Schneider; 02 Jun 2021, 06:02.

  • #2
    For a panel, you need at most one id-date combination. Some of your values are not duplicates, e.g.,

    id date_start date_end policy1 policy2 policy3 policy 4
    1 1 Apr 2020 3 Apr 2020 1 0 0 0
    1 1 Apr 2020 3 Apr 2020 0 2 0 0
    Here, you have policy1 equal to both 0 and 1 for the same id and time-range. What is the rule here?

    Comment


    • #3
      Thanks for your comment.

      Exactly this is the problem with the current data set. It is not yet in a panel structure. Policy1 should take the value 1 from 1 April 2020 until 3 April 2020. So that in the end the data looks like this (with id as the panel variable and date as the time variable):

      id date policy1
      1 1 Apr 2020 1
      1 2 Apr 2020 1
      1 3 Apr 2020 1
      1 4 Apr 2020 0
      ...

      Comment


      • #4
        You still did not answer my question. One observation assigns a value of 1 to policy1 corresponding to id 1, while another assigns a value of 0. Is your rule that you choose the maximum of the two values if there is a conflict? If so, consider

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(id date_start date_end policy1 policy2 policy3 policy4)
        1 22006 22008 1 2 1 0
        1 22006 22008 1 2 1 0
        1 22006 22007 1 2 1 0
        1 22007     . 0 0 0 1
        1 22008     . 0 0 0 0
        1 22008 22016 0 0 0 0
        2 22006 22007 0 1 0 0
        2 22007 22008 1 0 0 1
        2 22007 22011 1 0 0 1
        2 22008 22036 0 0 1 0
        end
        format %td date_start
        format %td date_end
        
        collapse (max) policy1 policy2 policy3 policy4, by(id date_start date_end)
        gen toexpand= date_end- date_start+1
        expand toexpand, g(new)
        bys id date_start date_end (new): replace date_start= date_start[_n-1]+1 if new
        collapse (max) policy1 policy2 policy3 policy4, by(id date_start)
        rename date_start date
        Res.:

        Code:
        
        . l, sepby(id)
        
             +--------------------------------------------------------+
             | id        date   policy1   policy2   policy3   policy4 |
             |--------------------------------------------------------|
          1. |  1   01apr2020         1         2         1         0 |
          2. |  1   02apr2020         1         2         1         1 |
          3. |  1   03apr2020         1         2         1         0 |
          4. |  1   04apr2020         0         0         0         0 |
          5. |  1   05apr2020         0         0         0         0 |
          6. |  1   06apr2020         0         0         0         0 |
          7. |  1   07apr2020         0         0         0         0 |
          8. |  1   08apr2020         0         0         0         0 |
          9. |  1   09apr2020         0         0         0         0 |
         10. |  1   10apr2020         0         0         0         0 |
         11. |  1   11apr2020         0         0         0         0 |
             |--------------------------------------------------------|
         12. |  2   01apr2020         0         1         0         0 |
         13. |  2   02apr2020         1         1         0         1 |
         14. |  2   03apr2020         1         0         1         1 |
         15. |  2   04apr2020         1         0         1         1 |
         16. |  2   05apr2020         1         0         1         1 |
         17. |  2   06apr2020         1         0         1         1 |
         18. |  2   07apr2020         0         0         1         0 |
         19. |  2   08apr2020         0         0         1         0 |
         20. |  2   09apr2020         0         0         1         0 |
         21. |  2   10apr2020         0         0         1         0 |
         22. |  2   11apr2020         0         0         1         0 |
         23. |  2   12apr2020         0         0         1         0 |
         24. |  2   13apr2020         0         0         1         0 |
         25. |  2   14apr2020         0         0         1         0 |
         26. |  2   15apr2020         0         0         1         0 |
         27. |  2   16apr2020         0         0         1         0 |
         28. |  2   17apr2020         0         0         1         0 |
         29. |  2   18apr2020         0         0         1         0 |
         30. |  2   19apr2020         0         0         1         0 |
         31. |  2   20apr2020         0         0         1         0 |
         32. |  2   21apr2020         0         0         1         0 |
         33. |  2   22apr2020         0         0         1         0 |
         34. |  2   23apr2020         0         0         1         0 |
         35. |  2   24apr2020         0         0         1         0 |
         36. |  2   25apr2020         0         0         1         0 |
         37. |  2   26apr2020         0         0         1         0 |
         38. |  2   27apr2020         0         0         1         0 |
         39. |  2   28apr2020         0         0         1         0 |
         40. |  2   29apr2020         0         0         1         0 |
         41. |  2   30apr2020         0         0         1         0 |
         42. |  2   01may2020         0         0         1         0 |
             +--------------------------------------------------------+
        
        .
        Last edited by Andrew Musau; 07 Jun 2021, 15:14.

        Comment


        • #5
          I see your point. The rule if there is a conflict would be that for each id and date_start it should always take the value which is different from zero. So I think the maximum value works here.

          Using your code brings me exactly where I wanted to get. Many thanks for your help!

          Best,
          Sophie

          Comment

          Working...
          X