Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Check order of data without resorting

    Suppose the following hypothetical dataset

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long(Variable1 Variable2)
    1 3
    1 3
    1 5
    1 7
    1 3
    1 3
    1 8
    2 1
    2 1
    2 2
    2 3
    2 5
    end
    Each Variable1 has one or more values in Variable2 assigned to it. Given the current sorting, I would like to create a variable that indicates if those entries of Variable2 that occur more than once are all grouped below each other.
    For example, focusing on Variable1 = 1 we observe that Variable2 takes on the value 3 multiple, i.e. 4 times in total. As in the current sorting, the entries with the value 3 are not grouped below each other but are "interrupted" by the values 5 and 7, the new variable should take on the value 1 for Variable2 = 3.
    For Variable1 = 2, the Value 1 occurs multiple times in Variable2. However, in the current sorting, the two entries with the value 1 are sorted below each other so that the newly generated variable should not take on the value 1.

    In a nutshell: How do I indicate if the same values for Variable2 are all sorted below each other without changing the current order?

    Thank you!

  • #2
    EDITED.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long(Variable1 Variable2)
    1 3
    1 3
    1 5
    1 7
    1 3
    1 3
    1 8
    1 8
    2 5
    2 1
    2 1
    2 2
    2 2
    2 2
    2 2
    2 3
    2 3
    2 5
    end
    
    
    gen long seq=_n
    bys Variable1 (seq): gen tag1= Variable2[_n+1]!= Variable2
    bys Variable1 Variable2 (seq): gen tag2= sum(tag1)
    by Variable1 Variable2: egen wanted= max(tag2>1)
    sort seq
    drop seq tag*
    l, sepby(Variable1)
    Res.:

    Code:
    . l, sepby(Variable1)
    
         +------------------------------+
         | Variab~1   Variab~2   wanted |
         |------------------------------|
      1. |        1          3        1 |
      2. |        1          3        1 |
      3. |        1          5        0 |
      4. |        1          7        0 |
      5. |        1          3        1 |
      6. |        1          3        1 |
      7. |        1          8        0 |
      8. |        1          8        0 |
         |------------------------------|
      9. |        2          5        1 |
     10. |        2          1        0 |
     11. |        2          1        0 |
     12. |        2          2        0 |
     13. |        2          2        0 |
     14. |        2          2        0 |
     15. |        2          2        0 |
     16. |        2          3        0 |
     17. |        2          3        0 |
     18. |        2          5        1 |
         +------------------------------+
    Last edited by Andrew Musau; 02 Feb 2023, 12:21.

    Comment


    • #3
      Define spells as runs of the same value. If I understand correctly you want something like this. I use tsspell from SSC.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input long(Variable1 Variable2)
      1 3
      1 3
      1 5
      1 7
      1 3
      1 3
      1 8
      2 1
      2 1
      2 2
      2 3
      2 5
      end
      
      gen long obsno = _n 
      
      tsset Variable1 obsno 
      
      ssc install tsspell 
      
      tsspell Variable2 
      
      list, sepby(Variable1 _spell)
      
      tab Variable2 Variable1 if _end
      Results:

      Code:
       
      
      . list, sepby(Variable1 _spell)
      
           +----------------------------------------------------+
           | Variab~1   Variab~2   obsno   _spell   _seq   _end |
           |----------------------------------------------------|
        1. |        1          3       1        1      1      0 |
        2. |        1          3       2        1      2      1 |
           |----------------------------------------------------|
        3. |        1          5       3        2      1      1 |
           |----------------------------------------------------|
        4. |        1          7       4        3      1      1 |
           |----------------------------------------------------|
        5. |        1          3       5        4      1      0 |
        6. |        1          3       6        4      2      1 |
           |----------------------------------------------------|
        7. |        1          8       7        5      1      1 |
           |----------------------------------------------------|
        8. |        2          1       8        1      1      0 |
        9. |        2          1       9        1      2      1 |
           |----------------------------------------------------|
       10. |        2          2      10        2      1      1 |
           |----------------------------------------------------|
       11. |        2          3      11        3      1      1 |
           |----------------------------------------------------|
       12. |        2          5      12        4      1      1 |
           +----------------------------------------------------+
      
      . 
      . tab Variable2 Variable1 if _end 
      
                 |       Variable1
       Variable2 |         1          2 |     Total
      -----------+----------------------+----------
               1 |         0          1 |         1 
               2 |         0          1 |         1 
               3 |         2          1 |         3 
               5 |         1          1 |         2 
               7 |         1          0 |         1 
               8 |         1          0 |         1 
      -----------+----------------------+----------
           Total |         5          4 |         9
      The code catches 2 spells of 3 for the first group. You can tag duplicated spells with something like


      Code:
      bysort Variable Variable2 (_end) : gen bad = _end[_N-1] == 1
      as each combination should have at most one spell, but if there are two or more _end will take the value 1 twice or more.


      Comment


      • #4
        Thank you

        @ Andrew Musau

        I tried to replicate your code but unfortunately don't get the desired results.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input long(Variable1 Variable2)
         6  86917
         6 327069
         6 327069
        50 342693
        50 342698
        50 342697
        50 342697
        50 342697
        end

        If I run your code with this example I get the following results:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input long(Variable1 Variable2 seq) float(tag wanted)
         6  86917 1 0 0
         6 327069 2 1 1
         6 327069 3 0 1
        50 342693 4 0 0
        50 342698 5 1 0
        50 342697 6 1 1
        50 342697 7 0 1
        50 342697 8 0 1
        end
        However, for Variable1 = 6 for example, "wanted" should always be equal to 0 as Variable2 = 86917 only occurs once and Variable2 = 327069 occur below each other with and are not "interrupted" by other entries.
        But it might also be that I get something wrong.

        @Nick Cox I will try you approach as well, thank you.

        Comment


        • #5
          There were a couple of edits to #2 once I realized that the initial code worked for the data in #1 but was not generalizable. See the edited version which yields

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input long(Variable1 Variable2)
           6  86917
           6 327069
           6 327069
          50 342693
          50 342698
          50 342697
          50 342697
          50 342697
          end
          
          gen long seq=_n
          bys Variable1 (seq): gen tag1= Variable2[_n+1]!= Variable2
          bys Variable1 Variable2 (seq): gen tag2= sum(tag1)
          by Variable1 Variable2: egen wanted= max(tag2>1)
          sort seq
          drop seq tag*
          l, sepby(Variable1)
          Res.:

          Code:
          . l, sepby(Variable1)
          
               +------------------------------+
               | Variab~1   Variab~2   wanted |
               |------------------------------|
            1. |        6      86917        0 |
            2. |        6     327069        0 |
            3. |        6     327069        0 |
               |------------------------------|
            4. |       50     342693        0 |
            5. |       50     342698        0 |
            6. |       50     342697        0 |
            7. |       50     342697        0 |
            8. |       50     342697        0 |
               +------------------------------+

          Comment


          • #6
            Thank you, both solutions work well.
            Is it possible to assign a unique value to the wanted variable (or a newly created one) per Variable1-Variable2 combination when wanted != 0?

            I created an example to illustrate what I mean:
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input long(Variable1 Variable2) float(wanted groups)
            1 3 1 1
            1 3 1 1
            1 5 0 .
            1 7 0 .
            1 3 1 2
            1 3 1 2
            1 8 0 .
            1 8 0 .
            2 5 1 1
            2 1 0 .
            2 1 0 .
            2 2 0 .
            2 2 0 .
            2 2 0 .
            2 2 0 .
            2 3 0 .
            2 3 0 .
            2 5 1 2
            3 1 0 0
            3 2 0 0
            3 2 0 0
            end

            So single and continuous occurrences of the value 1 in wanted are considered to be a unique group.



            Comment


            • #7
              That in turn is another kind of spell (or run). In terms of tsspell as used in #3.


              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input long(Variable1 Variable2) float(wanted groups)
              1 3 1 1
              1 3 1 1
              1 5 0 .
              1 7 0 .
              1 3 1 2
              1 3 1 2
              1 8 0 .
              1 8 0 .
              2 5 1 1
              2 1 0 .
              2 1 0 .
              2 2 0 .
              2 2 0 .
              2 2 0 .
              2 2 0 .
              2 3 0 .
              2 3 0 .
              2 5 1 2
              3 1 0 0
              3 2 0 0
              3 2 0 0
              end
              
              gen long obsno = _n
              tsset Variable1 obsno 
              
              tsspell, cond(wanted== 1) 
              replace _spell = . if _spell == 0 
               
              list, sepby(Variable1)
              
                 +----------------------------------------------------------------------+
                   | Variab~1   Variab~2   wanted   groups   obsno   _seq   _spell   _end |
                   |----------------------------------------------------------------------|
                1. |        1          3        1        1       1      1        1      0 |
                2. |        1          3        1        1       2      2        1      1 |
                3. |        1          5        0        .       3      0        .      0 |
                4. |        1          7        0        .       4      0        .      0 |
                5. |        1          3        1        2       5      1        2      0 |
                6. |        1          3        1        2       6      2        2      1 |
                7. |        1          8        0        .       7      0        .      0 |
                8. |        1          8        0        .       8      0        .      0 |
                   |----------------------------------------------------------------------|
                9. |        2          5        1        1       9      1        1      1 |
               10. |        2          1        0        .      10      0        .      0 |
               11. |        2          1        0        .      11      0        .      0 |
               12. |        2          2        0        .      12      0        .      0 |
               13. |        2          2        0        .      13      0        .      0 |
               14. |        2          2        0        .      14      0        .      0 |
               15. |        2          2        0        .      15      0        .      0 |
               16. |        2          3        0        .      16      0        .      0 |
               17. |        2          3        0        .      17      0        .      0 |
               18. |        2          5        1        2      18      1        2      1 |
                   |----------------------------------------------------------------------|
               19. |        3          1        0        0      19      0        .      0 |
               20. |        3          2        0        0      20      0        .      0 |
               21. |        3          2        0        0      21      0        .      0 |
                   +----------------------------------------------------------------------+
              You seem to be applying a different rule in the case of Variable 1 == 3. From your earlier posts I see no problem with that block.

              There was a paper on spells in https://www.stata-journal.com/articl...article=dm0029

              Originally i was going to include a discussion of tsspell, but the paper was already longer than I expected, so I stopped before discussing the command. But the intended sequel never got written as people using it seemed to find the help file enough, or if they needed more material there was the paper already.

              The point of mentioning the paper is that it includes discussion of principles. Clearly, tsspell is not essential for spell problems and despite being its author I often approach spell problems otherwise, which is necessary if the data cannot be tsset.

              Andrew Musau here in this thread, and Clyde Schechter often, and doubtless many others have built on the main idea, which I certainly didn't invent, of identifying the start of any spell by assigning 1 to an indicator and thereafter using sum(),

              If you already have _spell _seq _end in your dataset, you need to drop them, rename them or specify new names to tsspell.


              Comment


              • #8
                While Nick Cox claims he didn't invent the technique of identifying the start of a spell by creating an indicator and summing, I learned it from him. I vividly remember reading this FAQ in my early Stata days: https://www.stata.com/support/faqs/d...t-occurrences/. The technique of identifying first occurrences is clever and simple, and has proven useful in solving many programming challenges I've faced.
                Last edited by Andrew Musau; 03 Feb 2023, 02:50.

                Comment


                • #9
                  Andrew Musau It's always good to hear such stories. My awareness of the usefulness of (0, 1) variables probably owes most to reading Kenneth Iverson and others on APL and J, but in a Stata context my indebtedness is mostly twofold, to looking at existing official code and to posts of Bill Gould on Statalist in the 1990s.

                  Comment


                  • #10
                    Nick Cox Andrew Musau Thank you so much for your assistance and valuable input.
                    Have a great weekend.

                    Comment

                    Working...
                    X