Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • creating duration variable for panel data

    Hi all, I have a panel data question. specifically I'm trying to create a duration variable that counts the number of years elapsed. I know about tsspell but I do not know how to modify it to the specific problem at hand.
    so:
    I have a panel country-year dataset. I want the duration to reflect the number of years a country exists in the data. this is easy, after sorting by country year this would simply be a counter equal to _n. (by country: gen counter=_n)
    However, the counter needs to be reset to 1 every time an event stops occurring.
    so say a country enters the dataset in 1950 and the event is a war that takes place 1952 to 1954 this counter would begin in 1950 as 1 but keep on counting even while the war is ongoing and restart at 1 in 1954. if i created _n this is how it would look. however I want the hypothetical counter (duration) variable to be like this: it equals _n until war ceases. Note that if another war to occur down the line, again it should be reset for the country.

    I looked at others' do files and people have had the country code A (assume numeric code) taken on a different value any time event occurs and ceases. I've tried this as well but can't get it right.
    Any help would be appreciated. otherwise, I'm going to resort to doing this manually on Excel.

    country year war _n counter
    A 1950 0 1 1
    A 1951 0 2 2
    A 1952 1 3 3
    A 1953 1 4 4
    A 1954 0 5 1
    A 1955 0 6 2
    A 1956 0 7 3




  • #2
    I think this will do it:

    Code:
    xtset country year
    by country (year): gen spell_start = (_n == 1) | (war == 0 & L.war == 1)
    by country (year): gen spell_num = sum(spell_start)
    by country spell_num (year), sort: gen counter = _n
    This will restart the counter at 1 with either the first year a country appears in the data, or in the first year immediately after the last year of a war.

    Note: -xtset country year- requires that country be numeric. So if in your real data, as in your example, it is a string, you should -encode- it.

    By the way, unless you are doing this just for fun, you really should not even think of doing data management manually in Excel. It is a spreadsheet, not a statistical package or data base, and it leaves absolutely no trace of whatever changes you make to the data. If you are doing this for any serious purpose, you need an audit trail of your work: working in Stata using do-files and keeping logs accomplishes that. In Excel, it will be anybody's guess what you did.
    Last edited by Clyde Schechter; 09 Jun 2015, 17:57.

    Comment


    • #3
      Manually? In Excel??? Nooooo!

      Code:
      clear
      input country year war n counter
      1 1950 0 1 1
      1 1951 0 2 2
      1 1952 1 3 3
      1 1953 1 4 4
      1 1954 0 5 1
      1 1955 0 6 2
      1 1956 0 7 3
      2 1950 0 1 1
      2 1951 0 2 2
      2 1952 1 3 3
      2 1953 1 4 4
      2 1954 0 5 1
      2 1955 0 6 2
      2 1956 0 7 3
      3 1961 1 1 1
      3 1962 1 2 2
      3 1963 0 3 1
      end
      
      drop n
      sort country year
      list, sepby(country)
      // --------------------------------- here is the part you need ------------------------------------------
      sort country year
      generate c=0
      by country: replace c=cond((war==0 & war[_n-1]==1)|_n==1,1,c[_n-1]+1)
      // -------------------------------- rest is presentation ---------------------------------------------------
      list, sepby(country)
      
      assert c==counter
      Output:
      Code:
           +------------------------------------+
           | country   year   war   counter   c |
           |------------------------------------|
        1. |       1   1950     0         1   1 |
        2. |       1   1951     0         2   2 |
        3. |       1   1952     1         3   3 |
        4. |       1   1953     1         4   4 |
        5. |       1   1954     0         1   1 |
        6. |       1   1955     0         2   2 |
        7. |       1   1956     0         3   3 |
           |------------------------------------|
        8. |       2   1950     0         1   1 |
        9. |       2   1951     0         2   2 |
       10. |       2   1952     1         3   3 |
       11. |       2   1953     1         4   4 |
       12. |       2   1954     0         1   1 |
       13. |       2   1955     0         2   2 |
       14. |       2   1956     0         3   3 |
           |------------------------------------|
       15. |       3   1961     1         1   1 |
       16. |       3   1962     1         2   2 |
       17. |       3   1963     0         1   1 |
           +------------------------------------+
      Best, Sergiy Radyakin

      Comment


      • #4
        Thank you so much! I tried various ways to do it , via foreach and then via tsspell, and the closest I got was resetting the counter every time the event changed from 1 to 0.
        With Excel--I have not gone the manual route in a long time; I was merely trying to express my frustration
        I will try these; I'm glad I asked!

        Comment


        • #5
          By the way, Sergiy's code and mine are equivalent in the data you showed in your post. They will work differently, however, if there are gaps in the years within any country. Sergiy's will find an end of a war if the variable war changes from one observation to the next, even if there is one or more skipped years between those observations. Mine will not. Which of those corresponds to your intention, is for you to decide.

          Comment


          • #6
            Thanks again. I tried both on my hypothetical mini data. I believe that in my full data there should not be any gaps. I've dealt with data that contains gaps before.
            and yes I have country codes but o/w I would have used encode.
            I had actually come very close to Clyde's code using tsspell to create a spell variable and then using that to sort...
            I'm so glad I asked! again much appreciated!

            Comment


            • #7
              Hello all, I have a follow up question to this. Luckily both codes above worked well with my data as I did not have any gaps.
              Nonetheless, there is something else that I need to do that you might be able to help me on.

              Basically, I have a numeric country code, call this country.
              I want this code to change every time the event comes to pass and ends. So this is when the counter would be reset as well. The code should then stay changed throughout for that country.

              so again using the above example: assume that we create another country code first equal to country (I'd like to keep the original country code in the data):
              gen ccode=country

              country year war _ncounter ccode

              1 1950 0 1 1 1
              1 1951 0 2 2 1
              1 1952 1 3 3 1
              1 1953 1 4 4 1
              1 1954 0 5 1 11
              1 1955 0 6 2 11
              1 1956 0 7 3 11
              2 1950 0 1 1 2
              2 1951 0 2 2 2
              2 1952 1 3 3 2
              2 1953 0 4 1 21

              Now I've written a code for this using the counter we generated but my code only allows the country code to change in the first year when event ends. so for example, in my code country 1 changes to 11 in 1954 but reverts back to 1. I can see why this is the case in the code but do not know how to fix it.

              Also I am having the country code change by multiplying it by 10 and adding 1 (due to the nature of the unique codes my data has, this ensures no overlap with existing country codes).

              I will provide my code and if you could point me in the direction of how I'd modify it so that the new country code would obtain until the data switches to a new country, that would be great.
              Also, while the above example does not show it, I also need the country code to change again this time incrementing the code by 1 if a 2nd event occurs and ceases. So if country 1 were to experience a 2nd war, I'd like its value to be 12. My current code also does not accommodate this.

              so after gen ccode=country

              by country: replace code=cond((c>=1 &(war==0 & war[_n-1]==1)),code*10,country)

              Note that I'm making use of the counter variable here and saying that any time war ceases change the country code by this formula.How do I then tell the program to continue applying this rule until the country code switches?


              Many thanks!

              Comment


              • #8
                The data example you posted is a little confusing because there is some extra variable in there. So I'm not 100% sure I understand what you want to do, but try this:

                Code:
                // FIND LARGEST VALUE OF COUNTRY CODE
                // SO NEW CODES WILL START FROM THERE
                summ country, meanonly
                local new_code_start = `r(max)' + 1
                
                // IDENTIFY WHEN WAR ENDS
                xtset country year
                gen byte war_ends = (war == 0 & L1.war == 1)
                by country (year), sort: gen war_over = sum(war_ends)
                gen new_country_code = country + cond(war_over, `new_code_start', 0)
                list, noobs clean
                Notes:
                1. You don't say what you want to do if the same country has more than one war. The code above simply assigns a single new code after the first war and keeps using it thereafter.

                2. As before, assuming no year gaps within country.

                3. Rather than multiplying by 10 and adding 1, my code begins by finding the largest country code actually in use and then skips a couple of numbers and the new country codes start from there. (I assume your real data set is larger than this. If there is already a country 11 or a country 21, your approach breaks.) If you don't want to do that and really want to stick with multiplying by 10 and adding 1, then delete the first two lines of code and

                Code:
                // REPLACE
                gen new_country_code = country + cond(war_over, `new_code_start', 0)
                
                // WITH
                gen new_country_code = cond(war_over, country*10 + 1, country)

                Comment


                • #9
                  You mentioned tsspell (SSC) in #1 and it seems to apply here.

                  Code:
                   
                  input country year war counter 
                  1 1950 0 1 
                  1 1951 0 2 
                  1 1952 1 3 
                  1 1953 1 4 
                  1 1954 0 1 
                  1 1955 0 2 
                  1 1956 0 3 
                  2 1950 0 1 
                  2 1951 0 2 
                  2 1952 1 3 
                  2 1953 0 1 
                  end 
                  tsset country year 
                  tsspell , fcond(counter == 1) 
                  bysort country (year) : gen ccode = sum(counter == 1)
                  In fact, the last command alone, which echoes what tsspell does, is simple enough to create distinct spell identifiers.

                  Comment


                  • #10
                    Many thanks again to both; I'll try both.

                    True, the code would break down if there were a country 21 but the way the country codes are set up, I believe if there's country 2, there is no country 21, to the best of my knowledge. I'm basically going off someone else's do file and guessing that this is what they did with their data.

                    In response to what happens with a 2nd war, apologies if my message was confusing. I followed up on it with another example in the other thread. I'd basically want the country code to take on another value if there's another war that ends.

                    There is another variable because I'd like to keep the original country code and create another called ccode that I alter.

                    I actually like the general approach of identifying the max.
                    Luckily, there are no gaps in the data.

                    I read through the code-I'm teaching myself programming in Stata so I'm able to read the language and conceptualize what it's doing but not yet ready to put different elements together.

                    so if there's a 2nd war, I'd like the country code to change again: (i think the extra variable you mention is simply _n; when the data is sorted by country year).

                    country year war _n c ccode
                    1 1950 0 1 1 1
                    1 1951 0 2 2 1
                    1 1952 1 3 3 1
                    1 1953 1 4 4 1
                    1 1954 0 5 1 11
                    1 1955 0 6 2 11
                    1 1956 0 7 3 11
                    2 1950 0 1 1 2
                    2 1951 0 2 2 2
                    2 1952 1 3 3 2
                    2 1953 0 4 1 21
                    2 1954 0 5 2 21
                    2 1955 1 6 3 21
                    2 1956 1 7 4 21
                    2 1957 0 8 1 22

                    Comment


                    • #11
                      Thanks for clarifying about the extra variable.

                      so if there's a 2nd war, I'd like the country code to change again
                      What do you want it to change to? Suppose country 1 had another war later. What would the next code be after 1 and 11?
                      Give a general rule that will work in your data and not end up assigning ccodes that clash with preexisting country numbers of with other ccodes.

                      Comment


                      • #12
                        Hello Clyde, thanks again for your help.
                        I'd like it to change by incrementing by 1. So if I went with the multiply by 10 plus 1 rule for the first war, it'd be 12 for the second, 13 for the third and so forth.

                        If I went with the maximum rule, incrementing by 1 will not work b/c it's going to clash with an existing number.
                        I thought of another possibility: incrementing by something after the decimal point. the original country codes are all integers. so for the 2nd war, say we chose the max, the code would change by .1 every time.

                        I think with the aforementioned do file, the authors came up with the multiply by 10 solution because they eyeballed the country codes (everyone in my field sort of knows the generic country codes we use) and decided that multiply by 10 and then increment by 1 would work --except that they must have gone back and seen that there are a few cases for which there was an overlap-this is why they then must have changed some of these overlapping cases by recoding say country 100 to 100.5.

                        Comment


                        • #13
                          So, here's a thought. There are only a few hundred countries in the world, right? So if we multiply the original number by 1000 and add 1, 2, ... we shouldn't have any clash problem.

                          Your decimal point idea would work in theory. The problems with it are 1) if there might be a country with more than 10 wars, then we need the increment to begin with 0.01, not 0.1. If there were a country with more than 100 wars we'd need to start with 0.001. I guess that could be determined from the data, but it's a bit of a nuisance. 2) There is no exact representation in binary (which is how Stata stores numbers internally) for 0.1, or any other negative power of 10. So at some point you might encounter some precision issues and start seeing numbers like 50.6999999998 when you were hoping for 50.7. Both of those could be overcome if we made the new code a string variable. I don't know where you are going with this from here, but I tend to reserve string variables for actual text or for many-digit numerical identifiers, things that I know I will never have to do calculations on. Anyway, if the multiply by 1000 and add doesn't look good to you, I think it is clear in the code how to change to a different system.

                          I think this will do it (tested and works with your input in #10):

                          Code:
                          // IDENTIFY WHEN WAR ENDS
                          xtset country year
                          gen byte war_ends = (war == 0 & L1.war == 1)
                          by country (year), sort: gen war_over = sum(war_ends)
                          gen new_country_code = cond(war_over, 1000*country+war_over, country)
                          list, noobs clean
                          By the way, as you have probably observed, the norm in the Stata community is to use full first and last name. It would be appreciated if you would help us maintain the professional atmosphere of the Forum by clicking on "contact us" at the bottom of the page and requesting the administrators to change your registration to your full name. Thanks.

                          Comment


                          • #14


                            Thanks so much !! Actually, it occurred to me just a few hours ago that I need to check the max number of events to see how many decimal points I'd need for precisely the problem you have identified!

                            And I hadn't thought through the precision issue but I know it all too well. It had puzzled me quite a bit back when I was working on my dissertation and I know exactly how it might be a problem. I prefer the approach here. The number of countries/territorial units never exceeds 210 so this should work. It seems the authors I mentioned before had to do a bit of trial and error when they realized the unique coding issue.

                            Thanks for the tip; I joined the forum in the fall but have not been active much so did not know the community norms. I shall contact the admin and see if they can change my name here.

                            Comment

                            Working...
                            X