Clustering by the sequence pattern

Hayoung Choi

Join Date: Nov 2021

Posts: 14
#1

10 Nov 2021, 06:10

I have data in which the sequence patterns look like this:

Sequence-Pa |
ttern | Freq. Percent Cum.
------------+-----------------------------------
11111 | 209 83.27 83.27
22222 | 17 6.77 90.04
21111 | 8 3.19 93.23
22111 | 5 1.99 95.22
22211 | 4 1.59 96.81
11112 | 1 0.40 97.21
11122 | 1 0.40 97.61
11222 | 1 0.40 98.01
12111 | 1 0.40 98.41
12211 | 1 0.40 98.80
12222 | 1 0.40 99.20
21222 | 1 0.40 99.60
22212 | 1 0.40 100.00
------------+-----------------------------------
Total | 251 100.00

I tried the sqom command, but the _SQdist score did not capture the sequence pattern the way I wanted.
For example, 22111 and 11222 are completely different patterns that I want in separate clusters, but sqom gave them the same distance score.

The cluster result should look like this:
1) 11111
2) 22222
3) 21111, 22111, or 22211. The exact sequence doesn't matter, because I just want to group those who moved from 2 to 1.
4) 11112, 11122, etc. I want to group those who started at 1 and moved to 2 by the end.

Should I use a different analysis instead of cluster analysis?

If so, any recommendation would be appreciated. Please help!

Hayoung Choi

Join Date: Nov 2021
Posts: 14

10 Nov 2021, 21:09

Maybe I was not clear enough.

I have 5 years of panel data with the following sequence patterns:

Sequence-Pa |
ttern | Freq. Percent Cum.
------------+-----------------------------------
11111 | 209 83.27 83.27
22222 | 17 6.77 90.04
21111 | 8 3.19 93.23
22111 | 5 1.99 95.22
22211 | 4 1.59 96.81
11112 | 1 0.40 97.21
11122 | 1 0.40 97.61
11222 | 1 0.40 98.01
12111 | 1 0.40 98.41
12211 | 1 0.40 98.80
12222 | 1 0.40 99.20
21222 | 1 0.40 99.60
22212 | 1 0.40 100.00
------------+-----------------------------------
Total | 251 100.00

I ran "sqom" and it produced the _SQdist score.
The result was:

. ta _SQdist

sqom with |
k(0) |
indel(1) |
subcost(2) |
refseqid() | Freq. Percent Cum.
------------+-----------------------------------
0 | 85 6.77 6.77
.4 | 15 1.20 7.97
.8 | 25 1.99 9.96
1.2 | 35 2.79 12.75
1.6 | 50 3.98 16.73
2 | 1,045 83.27 100.00
------------+-----------------------------------
Total | 1,255 100.00

sqom gave the same _SQdist score to sequences such as 12211, 22111, and 11122.

How can I get different _SQdist values for different sequence patterns?
In particular, I want to distinguish those who started at 1 in the first wave and ended at 2 in the last wave
from those who started at 2 and ended at 1 in the last wave.

Should I try a different analysis, or is there another way to distinguish these groups?

Thanks!


Fei Wang

Join Date: Oct 2021

Posts: 726
#3

10 Nov 2021, 21:53

Hayoung, the code below assigns a unique group id to each sequence pattern. I'm not sure it's what you want, though, because it seems you'd like to further combine sequence patterns with common features.

Code:

reshape wide var, i(panelid) j(year)
egen gid = group(var*)
reshape long var, i(panelid) j(year)
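
If the aim is only to separate people by where they start and where they end, a rule-based classification may be simpler than distance-based clustering. The tabulated _SQdist values above look consistent with every sequence being compared with a single reference sequence (apparently 22222) at 0.4 per differing wave, so the score would reflect how many waves differ and not their order. The following is a minimal sketch, not a tested solution. It assumes long-format data with the variable names used in the reshape code above (panelid, year, var) and that var is coded 1 or 2 with no missing values. The names constant, first, last, pattern, and tag are made up for illustration.

```stata
* Assumption: long format, one row per panelid-year, var coded 1 or 2, no missing values.

* A sequence is constant if its smallest and largest values coincide.
bysort panelid (var): gen byte constant = var[1] == var[_N]

* State in the first and in the last wave of each person.
bysort panelid (year): gen byte first = var[1]
by panelid (year): gen byte last = var[_N]

* Rule-based groups; anything not matched below stays in "other".
gen byte pattern = 5
replace pattern = 1 if constant & first == 1
replace pattern = 2 if constant & first == 2
replace pattern = 3 if first == 2 & last == 1
replace pattern = 4 if first == 1 & last == 2

label define pattern 1 "always 1" 2 "always 2" 3 "2 to 1" 4 "1 to 2" 5 "other"
label values pattern pattern

* Tabulate once per person instead of once per person-year.
egen byte tag = tag(panelid)
tabulate pattern if tag
```

If the data match the frequency table posted above, this rule would put 209 people in "always 1", 17 in "always 2", 17 in "2 to 1" (21111, 22111, 22211), 4 in "1 to 2" (11112, 11122, 11222, 12222), and 4 in "other" (12111, 12211, 21222, 22212). Whether sequences that start and end in the same state but change in between belong in a separate group is a judgment call that depends on the research question.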