Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Largest consecutive of equal numbers for variable list (explanation in text)

    Dear Statalists

    I am not entirely sure how describe my goal accurately in a few words, so sorry for the potentially misleading title. Here is what I would like to do:

    I have data over a number of years, say, 2000 to 2010 that looks like the following:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 id float(y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009 y2010)
    "001" 1 1 1 0 0 0 0 0 0 0 0
    "002" 1 0 0 1 1 1 0 0 0 0 0
    "003" 0 0 0 0 0 1 1 1 1 1 0
    "004" 0 1 1 1 0 0 0 1 1 1 0
    end
    My goal is to find out the highest number of consecutive 1s in each observation in the period. For "001" that would be 3, for "003" it would be 5. Additionally, I would like to have a way to systematically determine in which year the longest period of consecutive 1s started and in which it ended. For "001" that would be in 2000 (start) and 2002 (end) etc. Such that the resulting data would look like the following.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 id float(y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009 y2010 longest_consecutive start end multiple)
    "001" 1 1 1 0 0 0 0 0 0 0 0 3 2000 2002 0
    "002" 1 0 0 1 1 1 0 0 0 0 0 3 2003 2005 0
    "003" 0 0 0 0 0 1 1 1 1 1 0 5 2005 2009 0
    "004" 0 1 1 1 0 0 0 1 1 1 0 3 2001 2003 1
    end
    As a bonus, if there are two different periods with the same length of consecutive 1s as in "004", I would like to have the start and end years of the first and an indication that there is more than one such period (variable name here: multiple).

    I am not looking for perfect code (although I would not mind it either) but rather for some conceptual advice as to how I might approach. Everything I can think of at the moment would make things very complicated, especially considering that in the full data I have a period of 50 years and over 3000 observations.

    Does anyone know how I might approach the question? Any advice is much appreciated.

    Thank you all.

    Best wishes
    Milan

  • #2



    Here are two approaches. One keeps the present wide layout (format, structure); the other reshapes to long and then applies tsspell (SSC). These are panel data in Stata's sense and for most purposes you're better off long.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 id float(y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009 y2010)
    "001" 1 1 1 0 0 0 0 0 0 0 0
    "002" 1 0 0 1 1 1 0 0 0 0 0
    "003" 0 0 0 0 0 1 1 1 1 1 0
    "004" 0 1 1 1 0 0 0 1 1 1 0
    end
    
    egen Y = concat(y????)
    
    gen longest = 0
    
    qui forval j = 1/11 {
        replace longest = `j' if strpos(Y, `j'*"1")
    }
    
    list id Y longest
    
         +-----------------------------+
         |  id             Y   longest |
         |-----------------------------|
      1. | 001   11100000000         3 |
      2. | 002   10011100000         3 |
      3. | 003   00000111110         5 |
      4. | 004   01110001110         3 |
         +-----------------------------+
    
    
    drop longest
    reshape long y, i(id) j(year)
    egen nid = group(id), label
    tsset nid year
    tsspell, pcond(y)
    egen longest = max(_seq), by(id)
    
    tabdisp nid, c(longest)
    
    ----------------------
    group(id) |    longest
    ----------+-----------
          001 |          3
          002 |          3
          003 |          5
          004 |          3
    ----------------------
    Here is one way to get the start of the (first) longest spell and the number of spells of that length, using your original data structure.

    See http://www.stata-journal.com/sjpdf.h...iclenum=dm0056 for counting substrings within strings.

    Code:
    gen start = 1999 + strpos(Y, longest*"1") 
    
    gen count = (11 - length(subinstr(Y, longest*"1", "", .)))/longest 
    
    list id Y longest start count 
    
         +---------------------------------------------+
         |  id             Y   longest   start   count |
         |---------------------------------------------|
      1. | 001   11100000000         3    2000       1 |
      2. | 002   10011100000         3    2003       1 |
      3. | 003   00000111110         5    2005       1 |
      4. | 004   01110001110         3    2001       2 |
         +---------------------------------------------+
    Last edited by Nick Cox; 24 Apr 2017, 08:14.

    Comment


    • #3
      if you -reshape- the data to long format, you can use the user-written -tsspell- (use -search- to locate and install) to get info on all such spells

      Comment


      • #4
        Thank you, Nick Cox. This solves my problem and the way your code works me through it conceptually, is very helpful for me. Especially the strpos function (and how you use it) is very useful. I will keep it in mind for similar problems I have in the future.

        Best,
        Milan

        Comment

        Working...
        X