  • Clustering by the sequence pattern

    I have data whose sequence patterns look like this:

    Sequence-Pattern |      Freq.     Percent        Cum.
    -----------------+-----------------------------------
               11111 |        209       83.27       83.27
               22222 |         17        6.77       90.04
               21111 |          8        3.19       93.23
               22111 |          5        1.99       95.22
               22211 |          4        1.59       96.81
               11112 |          1        0.40       97.21
               11122 |          1        0.40       97.61
               11222 |          1        0.40       98.01
               12111 |          1        0.40       98.41
               12211 |          1        0.40       98.80
               12222 |          1        0.40       99.20
               21222 |          1        0.40       99.60
               22212 |          1        0.40      100.00
    -----------------+-----------------------------------
               Total |        251      100.00


    I tried the sqom command, but the _SQdist score did not capture the sequence patterns the way I wanted.
    For example, 22111 and 11222 are completely different patterns that I want in separate clusters, but they received the same distance score.

    The clusters should look like this:
    1) 11111
    2) 22222
    3) 21111, 22111, or 22211. The exact sequence doesn't matter, because I just want to group those who moved from 2 to 1.
    4) 11112, 11122, etc. I want to group those who started at 1 and moved to 2 by the end.

    Should I use a different analysis instead of cluster analysis?

    If so, which one would you recommend? Please help!



  • #2
    Maybe I was not clear enough.

    I have 5 years of panel data whose sequence patterns look like this:
    Sequence-Pattern |      Freq.     Percent        Cum.
    -----------------+-----------------------------------
               11111 |        209       83.27       83.27
               22222 |         17        6.77       90.04
               21111 |          8        3.19       93.23
               22111 |          5        1.99       95.22
               22211 |          4        1.59       96.81
               11112 |          1        0.40       97.21
               11122 |          1        0.40       97.61
               11222 |          1        0.40       98.01
               12111 |          1        0.40       98.41
               12211 |          1        0.40       98.80
               12222 |          1        0.40       99.20
               21222 |          1        0.40       99.60
               22212 |          1        0.40      100.00
    -----------------+-----------------------------------
               Total |        251      100.00

    I ran sqom, which produced the _SQdist score.
    The result was:


    . ta _SQdist

      sqom with |
           k(0) |
       indel(1) |
     subcost(2) |
     refseqid() |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         85        6.77        6.77
             .4 |         15        1.20        7.97
             .8 |         25        1.99        9.96
            1.2 |         35        2.79       12.75
            1.6 |         50        3.98       16.73
              2 |      1,045       83.27      100.00
    ------------+-----------------------------------
          Total |      1,255      100.00


    Running sqom gave the same _SQdist score to sequences such as 12211, 22111, and 11122.

    How can I get different _SQdist values for different sequence patterns?
    In particular, I want to distinguish those who started at 1 in the first wave and ended at 2 in the last wave,
    and also those who started at 2 and ended at 1 in the last wave.

    Should I try a different analysis, or is there another way to distinguish these groups?

    Thanks!



    • #3
      Hayoung, the code below assigns a unique group id to each sequence pattern. I'm not sure it's what you want, because it seems you'd like to further combine sequence patterns that share common features.

      Code:
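      * one row per panelid, with one column per year (var<year>)
      reshape wide var, i(panelid) j(year)
      * each distinct combination of values across the years gets its own id
      egen gid = group(var*)
      * return to one row per panelid-year
      reshape long var, i(panelid) j(year)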
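
If the goal is to combine patterns by where they start and end rather than by the exact sequence, one possible approach is to classify each panel unit by its first and last wave directly, without a distance measure. The following is only a minimal sketch under assumptions: it keeps the names panelid, year, and var from the code above, assumes var is coded 1 or 2, and builds a small toy dataset from a few of the patterns in the table purely for illustration. The helper variables first, last, lo, hi, and traj are hypothetical names, and the "other" category for sequences that start and end in the same state but change in between (for example 12111) is my own assumption, not something requested in the question.

```stata
* Toy data for illustration only: five patterns taken from the table above
clear
input int panelid str5 pattern
1 "11111"
2 "22222"
3 "22111"
4 "11222"
5 "12111"
end

* Expand to long format: one row per panelid-year, with var holding the state
expand 5
bysort panelid: gen int year = _n
gen byte var = real(substr(pattern, year, 1))

* State in the first and last wave of each panel unit
bysort panelid (year): gen byte first = var[1]
bysort panelid (year): gen byte last  = var[_N]

* Lowest and highest state observed, to detect constant sequences
bysort panelid: egen byte lo = min(var)
bysort panelid: egen byte hi = max(var)

* Trajectory group (5 = anything that fits none of the four target groups)
gen byte traj = 5
replace traj = 1 if lo == 1 & hi == 1
replace traj = 2 if lo == 2 & hi == 2
replace traj = 3 if first == 2 & last == 1
replace traj = 4 if first == 1 & last == 2

label define traj 1 "stable 1" 2 "stable 2" 3 "moved 2 to 1" 4 "moved 1 to 2" 5 "other"
label values traj traj

* One row per panel unit: which pattern landed in which group
tabulate pattern traj if year == 1
```

On real data only the part from the first bysort line onward would be needed. Whether first and last wave are enough, or whether the timing of the switch should also matter, is a substantive choice that this sketch does not settle.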
