
  • Categorising consecutively

    I am trying to generate a new variable that numbers consecutive runs of an existing variable: each time the existing variable changes value, the new variable should go up by 1. I am not sure how to explain this except by an example:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(var1 var2)
    2 1
    2 1
    2 1
    2 1
    2 1
    0 2
    0 2
    0 2
    2 3
    2 3
    0 4
    0 4
    0 4
    2 5
    2 5
    0 6
    end
    I have var1, and I want to generate var2. Because I am not sure how to word or explain my problem, I do not know where to look for a solution. I may not need the complete solution; if I am pointed in the right direction (a command or function), I am sure I can figure it out. Once I find the solution, I will post it for closure.

    My question therefore is which command or function will allow me to generate var2 in the example above?

    Thank you so much.

  • #2
    Code:
    clear
    input float(var1 var2)
    2 1
    2 1
    2 1
    2 1
    2 1
    0 2
    0 2
    0 2
    2 3
    2 3
    0 4
    0 4
    0 4
    2 5
    2 5
    0 6
    end
    
    gen wanted = sum(var1 != var1[_n-1])
    
    assert wanted == var2
    
    list, sepby(var2)
    
         +----------------------+
         | var1   var2   wanted |
         |----------------------|
      1. |    2      1        1 |
      2. |    2      1        1 |
      3. |    2      1        1 |
      4. |    2      1        1 |
      5. |    2      1        1 |
         |----------------------|
      6. |    0      2        2 |
      7. |    0      2        2 |
      8. |    0      2        2 |
         |----------------------|
      9. |    2      3        3 |
     10. |    2      3        3 |
         |----------------------|
     11. |    0      4        4 |
     12. |    0      4        4 |
     13. |    0      4        4 |
         |----------------------|
     14. |    2      5        5 |
     15. |    2      5        5 |
         |----------------------|
     16. |    0      6        6 |
         +----------------------+
    Every time var1 differs from its previous value we add 1 to a running sum.

    Notes:

    1. This works here for var1[1] too because var1[0] is evaluated as missing. There is no var1[0] but Stata is not fazed by that reference; it just returns missing.

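    2. Hence if the first value might be missing, the comparison var1[1] != var1[0] is false (missing equals missing) and the first block would not be counted, so you need a twist to the code: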


    Code:
    gen wanted = sum((var1 != var1[_n-1]) | (_n == 1) )
    Parenthesising aggressively does no harm and spells out your intention to readers.

    3. See also https://www.stata-journal.com/articl...article=dm0029 and tsspell (SSC) for more general discussion and a convenience command.
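
    Notes 1 and 2 can be checked on a small made-up dataset. The following is only a sketch: the data and the variable name naive are invented for illustration and are not from the original post. It compares the one-line solution with and without the (_n == 1) twist when var1 starts with missing values.

    ```stata
    clear
    input float var1
    .
    .
    2
    2
    0
    end

    * Without the twist: var1[1] != var1[0] compares missing with missing,
    * which is false, so the first block is numbered 0
    gen naive = sum(var1 != var1[_n-1])

    * With the twist: the first observation always starts block 1
    gen wanted = sum((var1 != var1[_n-1]) | (_n == 1))

    list, sepby(wanted)
    ```

    Here naive should run 0, 0, 1, 1, 2 and wanted should run 1, 1, 2, 2, 3, so the twist only shifts the numbering when the first value is missing.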



    • #3
      Thanks a lot for your solution. I would have posted my final solution here, but it would just be a copy-paste.

      I thought the solution would be more along the lines of the following (does not work):

      Code:
      by var1: egen wanted = seq()
      Thanks again!



      • #4
        No; that doesn't take, let alone take account of, any inputs.



        • #5
          Here is another solution, with some added "reflective narrative" which, depending on the point of view of the observer, might be useful (if you want to know why Stata does stuff in the way she does) or obnoxious (if you just want to get the job done).

          Code:
          . gen flag = var1 != var1[_n-1]
          
          . gen wanted = sum(flag)
          
          . gen flag_to_see = flag
          
          . replace flag = flag + flag[_n-1] in 2/l
          (15 real changes made)
          
          . assert wanted==flag
          
          . list, sepby(var2)
          
               +----------------------------------------+
               | var1   var2   flag   wanted   flag_t~e |
               |----------------------------------------|
            1. |    2      1      1        1          1 |
            2. |    2      1      1        1          0 |
            3. |    2      1      1        1          0 |
            4. |    2      1      1        1          0 |
            5. |    2      1      1        1          0 |
               |----------------------------------------|
            6. |    0      2      2        2          1 |
            7. |    0      2      2        2          0 |
            8. |    0      2      2        2          0 |
               |----------------------------------------|
            9. |    2      3      3        3          1 |
           10. |    2      3      3        3          0 |
               |----------------------------------------|
           11. |    0      4      4        4          1 |
           12. |    0      4      4        4          0 |
           13. |    0      4      4        4          0 |
               |----------------------------------------|
           14. |    2      5      5        5          1 |
           15. |    2      5      5        5          0 |
               |----------------------------------------|
           16. |    0      6      6        6          1 |
               +----------------------------------------+
          
          .
          Reflective narrative:

          1. Nick's solution is actually two steps wrapped up in one.
          a) In the first step, a variable flag is generated, which marks with 1 the first observation in each distinct group.
          b) In the second step, Nick uses the fact that -gen sth=sum()- generates a running sum, and one can use this running sum to cascade.

          2. In my solution there are also two steps, which are not wrapped up in one:
          a) The first step is the same: flag the first observation in each distinct group.
          b) In the second step, I use the fact that -replace- works through the observations in the current sort order, so flag[_n-1] has already been updated by the time observation _n is processed. This can be used to create cascades.
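
          Both solutions depend on the current sort order of the data. As a rough sketch of how that matters, suppose the data also had a panel identifier and a time variable. The variables id and time below are hypothetical and not part of the original example. Under by:, _n, subscripts such as var1[_n-1], and sum() are all evaluated within each group, so the numbering restarts at 1 in every panel.

          ```stata
          clear
          input float(id time var1)
          1 1 2
          1 2 2
          1 3 0
          2 1 0
          2 2 2
          2 3 2
          end

          * id and time are hypothetical variables added for illustration
          * Sort by id and time, then number the blocks separately within each id
          bysort id (time): gen wanted = sum(var1 != var1[_n-1])

          list, sepby(id)
          ```

          With these made-up data, wanted should be 1, 1, 2 for id 1 and 1, 2, 2 for id 2. The caveat about a missing first value then applies to the first observation of each group.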
