Categorising consecutively

Rob Henst

Join Date: Sep 2017

Posts: 63
#1

23 Dec 2018, 02:56

I am trying to generate a new variable which consecutively categorises an existing category. I am not sure how to explain this, except by an example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(var1 var2)
2 1
2 1
2 1
2 1
2 1
0 2
0 2
0 2
2 3
2 3
0 4
0 4
0 4
2 5
2 5
0 6
end

I have var1, and I want to generate var2. Because I am not sure how to word or explain my problem, I am not quite sure where to look for a solution. I may not require the complete solution, but if I am pointed in the right direction (or command / function) I am sure I can figure it out. (once I find the solution, I will post it for closure).

My question therefore is which command or function will allow me to generate var2 in the example above?

Thank you so much.

Nick Cox

Join Date: Mar 2014
Posts: 35637

23 Dec 2018, 03:16

Code:

clear
input float(var1 var2)
2 1
2 1
2 1
2 1
2 1
0 2
0 2
0 2
2 3
2 3
0 4
0 4
0 4
2 5
2 5
0 6
end

gen wanted = sum(var1 != var1[_n-1])

assert wanted == var2

list, sepby(var2)

     +----------------------+
     | var1   var2   wanted |
     |----------------------|
  1. |    2      1        1 |
  2. |    2      1        1 |
  3. |    2      1        1 |
  4. |    2      1        1 |
  5. |    2      1        1 |
     |----------------------|
  6. |    0      2        2 |
  7. |    0      2        2 |
  8. |    0      2        2 |
     |----------------------|
  9. |    2      3        3 |
 10. |    2      3        3 |
     |----------------------|
 11. |    0      4        4 |
 12. |    0      4        4 |
 13. |    0      4        4 |
     |----------------------|
 14. |    2      5        5 |
 15. |    2      5        5 |
     |----------------------|
 16. |    0      6        6 |
     +----------------------+

Every time var1 differs from its previous value we add 1 to a running sum.

Notes:

1. This works here for var1[1] too because var1[0] is evaluated as missing. There is no var1[0] but Stata is not fazed by that reference; it just returns missing.

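2. Hence if the first value might be missing, you need a twist to the code. Missing equals missing in Stata, so the comparison of a missing var1[1] with var1[0] would be false and the count would start at 0 rather than 1: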

Code:

gen wanted = sum((var1 != var1[_n-1]) | (_n == 1) )

Parenthesising aggressively does no harm and spells out your intention to readers.

3. See also https://www.stata-journal.com/articl...article=dm0029 and tsspell (SSC) for more general discussion and a convenience command.
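
To make note 2 concrete, here is a minimal sketch on made-up data (the values and the variable name naive are assumptions for illustration, not part of the original example). It starts with missing values so that the two versions of the code disagree:

```stata
* Hypothetical data: the first two values of var1 are missing
clear
input float var1
.
.
2
2
0
end

* Without the twist: missing == missing, so observation 1 is not counted as a change
gen naive = sum(var1 != var1[_n-1])

* With the twist: the first observation always starts a new run
gen wanted = sum((var1 != var1[_n-1]) | (_n == 1))

list
```

Here naive starts at 0 for the leading run of missings, while wanted starts at 1, which is usually what is intended.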

Rob Henst

Join Date: Sep 2017

Posts: 63
#3

24 Dec 2018, 02:29

Thanks a lot for your solution. I would have posted my final solution here, but it would just be a copy-paste.

I thought the solution would be more along the lines of the following (does not work):

Code:

by var1: egen wanted = seq()

Thanks again!
Nick Cox

Join Date: Mar 2014

Posts: 35637
#4

24 Dec 2018, 02:32

No; that doesn't take, let alone take account of, any inputs.
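
As a related illustration (a sketch on made-up data, with byvalue as a hypothetical variable name), grouping by the values of var1 alone cannot produce the wanted variable either, because it ignores the order of the observations. The egen function group() is used here only as an example of such value-based grouping:

```stata
* Hypothetical data: the value 2 appears in two separate runs
clear
input float var1
2
2
0
2
0
end

* Groups by distinct value only: every 2 gets the same code, wherever it occurs
egen byvalue = group(var1)

* Counts runs: a new code starts whenever var1 changes from the previous observation
gen wanted = sum(var1 != var1[_n-1])

list
```

Observations 1 and 4 share the same byvalue code, but they belong to different runs and so get different values of wanted.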

Joro Kolev

Join Date: Aug 2018
Posts: 3050

24 Dec 2018, 06:36

Here is another solution, with some added "reflective narrative" which, depending on the point of view of the observer, might be useful (if you want to know why Stata does stuff in the way she does) or obnoxious (if you just want to get the job done).

Code:

. gen flag = var1 != var1[_n-1]

. gen wanted = sum(flag)

. gen flag_to_see = flag

. replace flag = flag + flag[_n-1] in 2/l
(15 real changes made)

. assert wanted==flag

. list, sepby(var2)

     +----------------------------------------+
     | var1   var2   flag   wanted   flag_t~e |
     |----------------------------------------|
  1. |    2      1      1        1          1 |
  2. |    2      1      1        1          0 |
  3. |    2      1      1        1          0 |
  4. |    2      1      1        1          0 |
  5. |    2      1      1        1          0 |
     |----------------------------------------|
  6. |    0      2      2        2          1 |
  7. |    0      2      2        2          0 |
  8. |    0      2      2        2          0 |
     |----------------------------------------|
  9. |    2      3      3        3          1 |
 10. |    2      3      3        3          0 |
     |----------------------------------------|
 11. |    0      4      4        4          1 |
 12. |    0      4      4        4          0 |
 13. |    0      4      4        4          0 |
     |----------------------------------------|
 14. |    2      5      5        5          1 |
 15. |    2      5      5        5          0 |
     |----------------------------------------|
 16. |    0      6      6        6          1 |
     +----------------------------------------+

.

Reflective narrative:

1. Nick's solution is actually two steps wrapped up in one.
a) In the first step, a variable flag is generated, which marks with 1 the first observation in each distinct group.
b) In the second step, Nick uses the fact that the command -gen sth=sum()- generates a running sum, and one can use this running sum to cascade: the count goes up by 1 at the start of each group and is carried forward within it.

2. In my solution there are also two steps, which are not wrapped up in one:
a) The first step is the same: flag the first observation in each distinct group.
b) In the second step, I use the fact that -replace- works through the observations one at a time in the current sort order, so flag[_n-1] already holds its updated value when observation _n is processed. This can be used to create cascades.
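
The cascading behaviour of -replace- can be seen in isolation with a small sketch (the variables x, one, and running are made up for this illustration). Every observation starts at 1, and each replacement picks up the already updated value of the previous observation:

```stata
* Hypothetical data: five observations, all equal to 1
clear
set obs 5
gen x = 1

* -replace- goes through observations 2 to last in order,
* so x[_n-1] has already been updated when observation _n is reached
replace x = x + x[_n-1] in 2/l

* The same result from a running sum, for comparison
gen one = 1
gen running = sum(one)

assert x == running
list
```

After the -replace-, x holds 1, 2, 3, 4, 5, the same as the running sum of a constant 1.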
