Check order of data without resorting

Klaus Klausen

Join Date: Mar 2021

Posts: 72
#1

Check order of data without resorting

02 Feb 2023, 10:56

Suppose the following hypothetical dataset

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long(Variable1 Variable2) 1 3 1 3 1 5 1 7 1 3 1 3 1 8 2 1 2 1 2 2 2 3 2 5 end

Each Variable1 has one or more values in Variable2 assigned to it. Given the current sorting, I would like to create a variable that indicates if those entries of Variable2 that occur more than once are all grouped below each other.
For example, focusing on Variable1 = 1 we observe that Variable2 takes on the value 3 multiple, i.e. 4 times in total. As in the current sorting, the entries with the value 3 are not grouped below each other but are "interrupted" by the values 5 and 7, the new variable should take on the value 1 for Variable2 = 3.
For Variable1 = 2, the Value 1 occurs multiple times in Variable2. However, in the current sorting, the two entries with the value 1 are sorted below each other so that the newly generated variable should not take on the value 1.

In a nutshell: How do I indicate if the same values for Variable2 are all sorted below each other without changing the current order?

Thank you!
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10482

02 Feb 2023, 11:45

EDITED.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long(Variable1 Variable2)
1 3
1 3
1 5
1 7
1 3
1 3
1 8
1 8
2 5
2 1
2 1
2 2
2 2
2 2
2 2
2 3
2 3
2 5
end


gen long seq=_n
bys Variable1 (seq): gen tag1= Variable2[_n+1]!= Variable2
bys Variable1 Variable2 (seq): gen tag2= sum(tag1)
by Variable1 Variable2: egen wanted= max(tag2>1)
sort seq
drop seq tag*
l, sepby(Variable1)

Res.:

Code:

. l, sepby(Variable1)

     +------------------------------+
     | Variab~1   Variab~2   wanted |
     |------------------------------|
  1. |        1          3        1 |
  2. |        1          3        1 |
  3. |        1          5        0 |
  4. |        1          7        0 |
  5. |        1          3        1 |
  6. |        1          3        1 |
  7. |        1          8        0 |
  8. |        1          8        0 |
     |------------------------------|
  9. |        2          5        1 |
 10. |        2          1        0 |
 11. |        2          1        0 |
 12. |        2          2        0 |
 13. |        2          2        0 |
 14. |        2          2        0 |
 15. |        2          2        0 |
 16. |        2          3        0 |
 17. |        2          3        0 |
 18. |        2          5        1 |
     +------------------------------+

Last edited by Andrew Musau; 02 Feb 2023, 12:21.

Comment

Nick Cox

Join Date: Mar 2014
Posts: 36058

02 Feb 2023, 11:47

Define spells as runs of the same value. If I understand correctly you want something like this. I use tsspell from SSC.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long(Variable1 Variable2)
1 3
1 3
1 5
1 7
1 3
1 3
1 8
2 1
2 1
2 2
2 3
2 5
end

gen long obsno = _n 

tsset Variable1 obsno 

ssc install tsspell 

tsspell Variable2 

list, sepby(Variable1 _spell)

tab Variable2 Variable1 if _end

Results:

Code:

 

. list, sepby(Variable1 _spell)

     +----------------------------------------------------+
     | Variab~1   Variab~2   obsno   _spell   _seq   _end |
     |----------------------------------------------------|
  1. |        1          3       1        1      1      0 |
  2. |        1          3       2        1      2      1 |
     |----------------------------------------------------|
  3. |        1          5       3        2      1      1 |
     |----------------------------------------------------|
  4. |        1          7       4        3      1      1 |
     |----------------------------------------------------|
  5. |        1          3       5        4      1      0 |
  6. |        1          3       6        4      2      1 |
     |----------------------------------------------------|
  7. |        1          8       7        5      1      1 |
     |----------------------------------------------------|
  8. |        2          1       8        1      1      0 |
  9. |        2          1       9        1      2      1 |
     |----------------------------------------------------|
 10. |        2          2      10        2      1      1 |
     |----------------------------------------------------|
 11. |        2          3      11        3      1      1 |
     |----------------------------------------------------|
 12. |        2          5      12        4      1      1 |
     +----------------------------------------------------+

. 
. tab Variable2 Variable1 if _end 

           |       Variable1
 Variable2 |         1          2 |     Total
-----------+----------------------+----------
         1 |         0          1 |         1 
         2 |         0          1 |         1 
         3 |         2          1 |         3 
         5 |         1          1 |         2 
         7 |         1          0 |         1 
         8 |         1          0 |         1 
-----------+----------------------+----------
     Total |         5          4 |         9

The code catches 2 spells of 3 for the first group. You can tag duplicated spells with something like

Code:

bysort Variable Variable2 (_end) : gen bad = _end[_N-1] == 1

as each combination should have at most one spell, but if there are two or more _end will take the value 1 twice or more.

Comment

Klaus Klausen

Join Date: Mar 2021

Posts: 72
#4

02 Feb 2023, 12:24

Thank you

@ Andrew Musau

I tried to replicate your code but unfortunately don't get the desired results.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long(Variable1 Variable2) 6 86917 6 327069 6 327069 50 342693 50 342698 50 342697 50 342697 50 342697 end

If I run your code with this example I get the following results:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long(Variable1 Variable2 seq) float(tag wanted) 6 86917 1 0 0 6 327069 2 1 1 6 327069 3 0 1 50 342693 4 0 0 50 342698 5 1 0 50 342697 6 1 1 50 342697 7 0 1 50 342697 8 0 1 end

However, for Variable1 = 6 for example, "wanted" should always be equal to 0 as Variable2 = 86917 only occurs once and Variable2 = 327069 occur below each other with and are not "interrupted" by other entries.
But it might also be that I get something wrong.

@Nick Cox I will try you approach as well, thank you.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10482

02 Feb 2023, 12:57

There were a couple of edits to #2 once I realized that the initial code worked for the data in #1 but was not generalizable. See the edited version which yields

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long(Variable1 Variable2)
 6  86917
 6 327069
 6 327069
50 342693
50 342698
50 342697
50 342697
50 342697
end

gen long seq=_n
bys Variable1 (seq): gen tag1= Variable2[_n+1]!= Variable2
bys Variable1 Variable2 (seq): gen tag2= sum(tag1)
by Variable1 Variable2: egen wanted= max(tag2>1)
sort seq
drop seq tag*
l, sepby(Variable1)

Res.:

Code:

. l, sepby(Variable1)

     +------------------------------+
     | Variab~1   Variab~2   wanted |
     |------------------------------|
  1. |        6      86917        0 |
  2. |        6     327069        0 |
  3. |        6     327069        0 |
     |------------------------------|
  4. |       50     342693        0 |
  5. |       50     342698        0 |
  6. |       50     342697        0 |
  7. |       50     342697        0 |
  8. |       50     342697        0 |
     +------------------------------+

Comment

Klaus Klausen

Join Date: Mar 2021

Posts: 72
#6

03 Feb 2023, 01:05

Thank you, both solutions work well.
Is it possible to assign a unique value to the wanted variable (or a newly created one) per Variable1-Variable2 combination when wanted != 0?

I created an example to illustrate what I mean:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long(Variable1 Variable2) float(wanted groups) 1 3 1 1 1 3 1 1 1 5 0 . 1 7 0 . 1 3 1 2 1 3 1 2 1 8 0 . 1 8 0 . 2 5 1 1 2 1 0 . 2 1 0 . 2 2 0 . 2 2 0 . 2 2 0 . 2 2 0 . 2 3 0 . 2 3 0 . 2 5 1 2 3 1 0 0 3 2 0 0 3 2 0 0 end

So single and continuous occurrences of the value 1 in wanted are considered to be a unique group.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 36058

03 Feb 2023, 02:01

That in turn is another kind of spell (or run). In terms of tsspell as used in #3.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long(Variable1 Variable2) float(wanted groups)
1 3 1 1
1 3 1 1
1 5 0 .
1 7 0 .
1 3 1 2
1 3 1 2
1 8 0 .
1 8 0 .
2 5 1 1
2 1 0 .
2 1 0 .
2 2 0 .
2 2 0 .
2 2 0 .
2 2 0 .
2 3 0 .
2 3 0 .
2 5 1 2
3 1 0 0
3 2 0 0
3 2 0 0
end

gen long obsno = _n
tsset Variable1 obsno 

tsspell, cond(wanted== 1) 
replace _spell = . if _spell == 0 
 
list, sepby(Variable1)

   +----------------------------------------------------------------------+
     | Variab~1   Variab~2   wanted   groups   obsno   _seq   _spell   _end |
     |----------------------------------------------------------------------|
  1. |        1          3        1        1       1      1        1      0 |
  2. |        1          3        1        1       2      2        1      1 |
  3. |        1          5        0        .       3      0        .      0 |
  4. |        1          7        0        .       4      0        .      0 |
  5. |        1          3        1        2       5      1        2      0 |
  6. |        1          3        1        2       6      2        2      1 |
  7. |        1          8        0        .       7      0        .      0 |
  8. |        1          8        0        .       8      0        .      0 |
     |----------------------------------------------------------------------|
  9. |        2          5        1        1       9      1        1      1 |
 10. |        2          1        0        .      10      0        .      0 |
 11. |        2          1        0        .      11      0        .      0 |
 12. |        2          2        0        .      12      0        .      0 |
 13. |        2          2        0        .      13      0        .      0 |
 14. |        2          2        0        .      14      0        .      0 |
 15. |        2          2        0        .      15      0        .      0 |
 16. |        2          3        0        .      16      0        .      0 |
 17. |        2          3        0        .      17      0        .      0 |
 18. |        2          5        1        2      18      1        2      1 |
     |----------------------------------------------------------------------|
 19. |        3          1        0        0      19      0        .      0 |
 20. |        3          2        0        0      20      0        .      0 |
 21. |        3          2        0        0      21      0        .      0 |
     +----------------------------------------------------------------------+

You seem to be applying a different rule in the case of Variable 1 == 3. From your earlier posts I see no problem with that block.

There was a paper on spells in https://www.stata-journal.com/articl...article=dm0029

Originally i was going to include a discussion of tsspell, but the paper was already longer than I expected, so I stopped before discussing the command. But the intended sequel never got written as people using it seemed to find the help file enough, or if they needed more material there was the paper already.

The point of mentioning the paper is that it includes discussion of principles. Clearly, tsspell is not essential for spell problems and despite being its author I often approach spell problems otherwise, which is necessary if the data cannot be tsset.

Andrew Musau here in this thread, and Clyde Schechter often, and doubtless many others have built on the main idea, which I certainly didn't invent, of identifying the start of any spell by assigning 1 to an indicator and thereafter using sum(),

If you already have _spell _seq _end in your dataset, you need to drop them, rename them or specify new names to tsspell.

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10482
#8

03 Feb 2023, 02:47

While Nick Cox claims he didn't invent the technique of identifying the start of a spell by creating an indicator and summing, I learned it from him. I vividly remember reading this FAQ in my early Stata days: https://www.stata.com/support/faqs/d...t-occurrences/. The technique of identifying first occurrences is clever and simple, and has proven useful in solving many programming challenges I've faced.

Last edited by Andrew Musau; 03 Feb 2023, 02:50.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36058
#9

03 Feb 2023, 03:00

Andrew Musau It's always good to hear such stories. My awareness of the usefulness of (0, 1) variables probably owes most to reading Kenneth Iverson and others on APL and J, but in a Stata context my indebtedness is mostly twofold, to looking at existing official code and to posts of Bill Gould on Statalist in the 1990s.
1 like
Comment
Klaus Klausen

Join Date: Mar 2021

Posts: 72
#10

03 Feb 2023, 07:27

Nick Cox Andrew Musau Thank you so much for your assistance and valuable input.
Have a great weekend.
Comment

Announcement