Sequence analysis - optimal matching problems with SQOM

Tom Kemeny

Join Date: Apr 2016

Posts: 52
#1

Sequence analysis - optimal matching problems with SQOM

18 Apr 2016, 09:02

Hi all,

I'm doing sequence analysis, trying to understand patterns for a series of firms across time. I'm using the latest update of the sq package, in Stata 14, and my data have been declared a sqset. I am able to do descriptive stuff (ie sqdes) with no trouble. I can also run sqom successfully with no options, or as sqom, subcost(rawdistance).

But, when conducting the OM, I really want to use the inverse of transition probabilities. I have calculated these in a matrix I call Pinv, which is a symmetrical 7x7 matrix - corresponding to the 7 'states' my firms can be in at any particular time. I'm attaching a screenshot of the matrix, since it looks bad if I just copy and paste it into the text window.

The problem arises when I run the following:
sqom, subcost(Pinv)

I get the following error:
subcostmatrix invalid: subcostmatrix is not symmetric
r(198);

I am confused, since my matrix clearly is symmetric, and contains the right number of elements. Any advice?

Just for fun, I also tried the following:
sqom, subcost(meanprobdistance)

This also returns an error:
needlemanwunschexactmatrix(): 3200 conformability error
sqomref(): - function returned error
<istmt>: - function returned error

Any help figuring out what is going on would be very much appreciated.

Many thanks,
Tom

1 Photo
Tags: None
Brendan Halpin

Join Date: Mar 2014

Posts: 152
#2

20 Apr 2016, 04:30

Your matrix is not symmetric. Symmetric means that Pinv[i,j] == Pinv[j,i].

Transition based substitution costs (which are not necessarily a good idea, but that's another story) are usually calculated as 2 - Pij - Pji
Comment
Tom Kemeny

Join Date: Apr 2016

Posts: 52
#3

21 Apr 2016, 08:11

Thanks Brendan - that's quite helpful! Our of curiiosity, do you have a reference on your suggestion of a rule of 2 - Pij - Pji?

Also, this still doesn't explain the error I got using 'meanprobdistance'. Any thoughts there?

Many thanks
Tom
Comment
Brendan Halpin

Join Date: Mar 2014

Posts: 152
#4

21 Apr 2016, 10:08

2 - Pij - Pji is what meanprobdistance is doing automatically, and is described in the sqom help.

Why you get an error, I don't know. It may be that there is something odd about your data. Can you look at the pattern of transitions and show the data? For instance, if your state variable is m (i.e., you did "sqset m id t"), do the following in long format:

Code:

by id: gen lagm = m[_n-1] tab lagm m
Comment

Tom Kemeny

Join Date: Apr 2016
Posts: 52

25 Apr 2016, 08:13

Thanks - I see the relevant discussion in the sqom help file with regard to meanprobdistance.

Naively, my starting point was to use the markov command to generate transition probabilities. These gave me a matrix as follows

Code:

Transition probabilities
------------------------

P[7,7]
        1       2       3       4       5       6       7
1  0.9358  0.0120  0.0197  0.0167  0.0087  0.0068  0.0003
2  0.7320  0.0206  0.0928  0.0928  0.0309  0.0206  0.0103
3  0.4905  0.0228  0.3460  0.0304  0.0076  0.0875  0.0152
4  0.6106  0.0048  0.0240  0.2356  0.0481  0.0721  0.0048
5  0.6145  0.0843  0.0120  0.1325  0.0723  0.0602  0.0241
6  0.3981  0.0093  0.2222  0.1574  0.0185  0.1574  0.0370
7  0.4286  0.0000  0.1429  0.2143  0.1429  0.0714  0.0000

So, one thing to note is that some transitions are extremely unusual - this fits my descriptive results using the SQ descriptive commands.

Using sqom with meanprobdistance, I get

Code:

sqom, subcost(meanprobdistance) 

.         sqom, subcost(meanprobdistance) 

symmetric SQsubcost[7,7]
           c1         c2         c3         c4         c5         c6         c7
r1          0
r2  1.9814351          0
r3  1.9683218  1.9977899          0
r4  1.9674377  1.9986739  1.9980846          0
r5   1.986592  1.9985266   1.999558  1.9969058          0
r6  1.9892441   1.999558  1.9932223  1.9954324  1.9989686          0
r7  1.9989686  1.9998527   1.999116  1.9994106  1.9994106  1.9992633          0

needlemanwunschexactmatrix():  3200  conformability error
               sqomref():     -  function returned error
                 <istmt>:     -  function returned error
r(3200);

I'm not quite sure what 'odd' data might look like. Based on your suggestion, I ran the following, and got this output:

Code:

.         by cid: gen lagm = st[_n-1]
(640 missing values generated)

. tab lagm st

           |                                 State/Event
      lagm | No Suppor        CED  Venture C  SBIR/STTR  NC Biotec  2 Support  3+ Suppor |     Total
-----------+-----------------------------------------------------------------------------+----------
         1 |     5,712         61        105        104         43         37          2 |     6,064 
         2 |        65          1          9          9          3          2          1 |        90 
         3 |       110          6         91          8          2         22          4 |       243 
         4 |       117          0          5         49         10         15          1 |       197 
         5 |        48          7          1         11          6          5          2 |        80 
         6 |        36          1         24         16          2         17          4 |       100 
         7 |         5          0          2          3          2          1          0 |        13 
-----------+-----------------------------------------------------------------------------+----------
     Total |     6,093         76        237        200         68         99         14 |     6,787

Is this 'odd'?

Any suggestions very much welcome. thanks!

Comment

Brendan Halpin

Join Date: Mar 2014

Posts: 152
#6

26 Apr 2016, 03:02

With transition rates like that, your substitution costs are all nearly 2.0, i.e., nearly constant. That's not necessarily a problem, but it means you are using a fairly neutral scheme.

Looking at your transitions, it looks like sequences are mostly in state 1, with very short excursions (usually 1 time unit) into the other states.

I don't see anything wrong with your data, as such, with the exception that state 7 is very rare: try collapsing it into state 6 to see if you can run sqom successfully. If you can't get sqom to work, try SADI (my sequence analysis package, on SSC). If that doesn't work, I have a better chance of understanding the errors.
Comment
Tom Kemeny

Join Date: Apr 2016

Posts: 52
#7

26 Apr 2016, 06:00

As you suggested, I collapsed states 6 and 7 together, and re-ran sqom with meanprobdistance. This worked, in that it no longer produced an error message.

But I find the results substantively odd. Specifically, as you observed, the substitution costs that emerge from sqom all cluster right around a value of 2.

Code:

. sqom, subcost(meanprobdistance) // symmetric SQsubcost[6,6] c1 c2 c3 c4 c5 c6 r1 0 r2 1.9814351 0 r3 1.9683218 1.9977899 0 r4 1.9674377 1.9986739 1.9980846 0 r5 1.986592 1.9985266 1.999558 1.9969058 0 r6 1.9882128 1.9994106 1.9923383 1.9948431 1.9983793 0 Distance Variable saved as _SQdist

Given the exploratory nature of my work, I think it makes sense to use the actual data to generate substitution costs. Certainly theory is of no help. But how do I square the matrix above with the transition probabilities generated with markov below?

Code:

Transition probabilities ------------------------ P[6,6] 1 2 3 4 5 6 1 0.9358 0.0120 0.0197 0.0167 0.0087 0.0071 2 0.7320 0.0206 0.0928 0.0928 0.0309 0.0309 3 0.4905 0.0228 0.3460 0.0304 0.0076 0.1027 4 0.6106 0.0048 0.0240 0.2356 0.0481 0.0769 5 0.6145 0.0843 0.0120 0.1325 0.0723 0.0843 6 0.4016 0.0082 0.2131 0.1639 0.0328 0.1803

If I consider the formula 2- Pij - Pji for, say transitions between state 6 and state 1, I should get something around 1.59, not 1.98. But that is not what sqom generates, and I am not clear why. Mechanically, I'm unclear on why I am getting such flat substitution costs that do not resemble what I expect given my markov-based transition probabilities.

At a deeper level though, I'm wondering if I need some other way of dealing with the fact that order, ie i-j vs j-i really matters for my particular set of sequences. To get some conceptual help I have been reading things like:
Aisenbrey, Silke and Anette E. Fasang. 2010. ‘‘New Life for Old Ideas: The ‘Second Wave’ of Sequence Analysis Bringing the ‘Course’ Back Into the Life Course.’’ Sociological Methods & Research 38:420-462.

While this brings up the order issue, it doesn't seem to propose real solutions to the issue.

I welcome your thoughts.
Comment

Brendan Halpin

Join Date: Mar 2014
Posts: 152

26 Apr 2016, 07:56

OK, with your data (from the table) I do the following:

Code:

tab lagm st [freq=s], matcell(tab1)
tab lagm st if lagm!=st [freq=s], matcell(tab2)

mata:
tab1 = st_matrix("tab1")
tab2 = st_matrix("tab2")

tab1r = tab1 :/ rowsum(tab1)
tab2r = tab2 :/ rowsum(tab2)

trprob1 = J(6,6,2) - tab1r - tab1r'
trprob2 = J(6,6,2) - tab2r - tab2r'

trprob1 = trprob1 - diag(trprob1)
trprob2 = trprob2 - diag(trprob2)
end

which yields

Code:

. mata trprob1
                 1             2             3             4             5             6
    +-------------------------------------------------------------------------------------+
  1 |            0   1.267718411   1.530009799   1.388940975   1.392908971   1.630736743  |
  2 |  1.267718411             0   1.875308642           1.9   1.879166667   1.957817109  |
  3 |  1.530009799   1.875308642             0   1.941697479   1.979269547    1.66291562  |
  4 |  1.388940975           1.9   1.941697479             0   1.811738579   1.750640133  |
  5 |  1.392908971   1.879166667   1.979269547   1.811738579             0    1.87710177  |
  6 |  1.630736743   1.957817109    1.66291562   1.750640133    1.87710177             0  |
    +-------------------------------------------------------------------------------------+

. mata trprob2
                 1             2             3             4             5             6
    +-------------------------------------------------------------------------------------+
  1 |            0   1.096367467   .9780203349    .914004914    1.22919226   1.438655095  |
  2 |  1.096367467             0    1.85940272   1.898876404    1.87169754   1.955303124  |
  3 |  .9780203349    1.85940272             0   1.913584637   1.973328592   1.543233083  |
  4 |   .914004914   1.898876404   1.913584637             0   1.783783784   1.683100683  |
  5 |   1.22919226    1.87169754   1.973328592   1.783783784             0   1.861449361  |
  6 |  1.438655095   1.955303124   1.543233083   1.683100683   1.861449361             0  |
    +-------------------------------------------------------------------------------------+

In other words, with or without the diagonal (sqom does without), the transition probabilities don't look like what you get from sqom. Something is wrong, either with your setup or with SQ.

Re the broader question of directional substitution costs (i->j vs j->i): it's a red herring. Substitution costs have to do with similarity between states, not transitions. They are inherently symmetric. You can use transition rates to estimate similarity (because transitions between similar states can be expected to be more common, but you then need to do something to enforce symmetry), and this causes a lot of confusion, but substitutions and transitions are orthogonally different concepts. Substitutions bear on the calculation "how similar is element i in sequence A to element j in sequence B" whereas transitions are changes in time, from i to i+1 in a single sequence.

Comment

Tom Kemeny

Join Date: Apr 2016

Posts: 52
#9

28 Apr 2016, 08:49

I see what you are saying about directional substitution costs - I did not fully appreciate this point.

In terms of the odd costs that sqom is generating, I wasn't quite sure how to determine where the problem might lie. So I made my data wide, and tried using your sadi suite of commands for comparison. This produced what seems like a more reasonable set of substitution costs. Before going wide, I did the following:

Code:

. trans2subs st, id(cid) subs(smat) (640 missing values generated) Generating transition-driven substitution matrix . matrix list smat symmetric smat[6,6] c1 c2 c3 c4 c5 c6 r1 0 r2 1.096367 0 r3 .97802 1.859403 0 r4 .914005 1.898876 1.913585 0 r5 1.229192 1.871698 1.973329 1.783784 0 r6 1.438655 1.955303 1.543233 1.683101 1.861449 0

This is not identical with your estimates above, though it resembles trprob2 pretty closely. After reshaping and generating a sequence length variable, I ran oma, which did not return any errors.

I still can't say what went wrong with sqom, and I would be interested in any ideas you might have. Still, it doesn't re-occur using oma in sadi, so that is a plus.

Thank you very much for your help!
Comment

Tom Kemeny

Join Date: Apr 2016
Posts: 52

#10

29 Apr 2016, 06:16

Brendan - an additional question. After running oma, I test for whether the matrix is metric. After doing some reading, including some of your recent work relating to SADI ('three narratives' etc), I remain befuddled by what is meant by metric. I gather that it matters, but I'm unclear about why. I interpret the output below to suggest that my matrix of distances is not metric. Any advice - whether in terms of actions I can take, or some reading I can do to better understand the issue?

Code:

    
.oma st1-st28, subsmat(smat) indel(2) pwdist(oma) len(lth)
Normalising distances with respect to length
(0 observations deleted)
271 unique observations

.         metricp oma // not sure if this is a problem, but does not satisfy 'triangle inequlity'
Shorter route exists between seq   8 and seq  53 -- 0.618 > 0.616
Shorter route exists between seq   8 and seq 284 -- 0.598 > 0.596
Shorter route exists between seq   8 and seq 387 -- 0.657 > 0.655
Shorter route exists between seq   8 and seq 592 -- 0.594 > 0.592
Shorter route exists between seq  14 and seq  79 -- 0.974 > 0.971
Shorter route exists between seq  14 and seq 544 -- 0.868 > 0.865
Shorter route exists between seq  16 and seq 416 -- 0.921 > 0.920
Shorter route exists between seq  16 and seq 624 -- 0.921 > 0.920
Shorter route exists between seq  17 and seq  41 -- 1.361 > 1.352
Shorter route exists between seq  17 and seq 358 -- 0.951 > 0.946
Matrix oma is NOT consistent with a metric space

Many thanks!
Tom

Comment

Brendan Halpin

Join Date: Mar 2014

Posts: 152
#11

29 Apr 2016, 07:42

In simple terms "metric" means that a dissimilarity acts like a distance, that we can think of it in spatial terms. This is important for the logic of cluster analysis or multidimensional scaling. metricp tests one particular aspect of this, the "triangle inequality", that the distance from A to C cannot be greater than d(A,B)+d(B,C) for any B (i.e., no short-cuts). I am very surprised to see non-metric distances resulting from OMA, though. Bugs aside, the only way OMA distances can be non-metric is if the substitution costs are non-metric. That is, if the state-space dissimilarities do not imply a metric state space, the resulting OMA distances may also fail the test.

This is the source of the problem in your case. If you run metricp smat, detail on the substitution cost matrix it will show you. I've known it in theory that transition rates may indeed yield non-metric substitution costs, but this is the first time I've seen it "in the wild".

Code:

. metricp trprob2 Shorter route exists between seq 3 and seq 4 -- 1.914 > 1.892 Matrix trprob2 is NOT consistent with a metric space . metricp trprob2, detail 3 4 1: 1.914 > 1.892 = 0.978 + 0.914 Matrix trprob2 is NOT consistent with a metric space

You could fix it as follows, essentially putting a floor on the substitution costs:

Code:

. mat trprob2[1,4]=0.94 . mat trprob2[4,1]=0.94 . metricp trprob2 Matrix trprob2 is consistent with a metric space
Comment
Tom Kemeny

Join Date: Apr 2016

Posts: 52
#12

29 Apr 2016, 13:35

I have talent for producing exceptions to rules! Thanks - your fix works nicely, for oma at least.

Just so I'm clear(ish), if the matrix of distances is not metric, or at least this is indicated by failing the triangle inequality test, then is the implication that it is not viable as a measure of the distance between sequences?

Building on this, if I am using, say, omav, and my distance matrix does not emerge as consistent with a metric space, does this mean the distances estimated are functionally unusable, in terms of proceeding to cluster analysis? I see a logic for exploring omav based on my project, but of course while a matrix estimated using standard OM now passes the metricp test, one derived using your omav command fails.

Code:

.metricp omav Shorter route exists between seq 1 and seq 76 -- 0.140 > 0.129 Shorter route exists between seq 1 and seq 82 -- 0.179 > 0.144 Shorter route exists between seq 1 and seq 100 -- 0.149 > 0.148 Shorter route exists between seq 1 and seq 103 -- 0.081 > 0.068 Shorter route exists between seq 1 and seq 105 -- 0.121 > 0.113 Shorter route exists between seq 1 and seq 131 -- 0.384 > 0.225 Shorter route exists between seq 1 and seq 147 -- 0.138 > 0.129 Shorter route exists between seq 1 and seq 149 -- 0.132 > 0.104 Shorter route exists between seq 1 and seq 159 -- 0.187 > 0.149 Shorter route exists between seq 1 and seq 176 -- 0.379 > 0.194 Matrix omav is NOT consistent with a metric space

Again, many thanks!
Comment
Brendan Halpin

Join Date: Mar 2014

Posts: 152
#13

29 Apr 2016, 15:29

Try TWED instead of OMAV if you want something less sensitive to spell duration. Since OMAV and Hollister's Localised OM may produce non metric distances, I don't trust them. For a full discussion see my paper (or the preprint version).
Comment
Tom Kemeny

Join Date: Apr 2016

Posts: 52
#14

03 May 2016, 05:46

Many thanks Brendan!
Comment

Tom Kemeny

Join Date: Apr 2016
Posts: 52

#15

03 May 2016, 06:55

And, what if twed produces a non-metric space (also, I'm fuzzy on how to set lambda and nu)?

Code:

. twed st1-st28, subsmat(smat) lambda(0.5) nu(0.15) pwdist(twed) len(lth) // time warping
Normalising distances with respect to length
(0 observations deleted)
271 unique observations

. metricp twed //
Shorter route exists between seq   1 and seq 213 -- 1.339 > 1.326
Shorter route exists between seq   2 and seq   7 -- 1.641 > 1.444
Shorter route exists between seq   2 and seq  11 -- 1.903 > 1.576
Shorter route exists between seq   2 and seq  56 -- 1.475 > 1.356
Shorter route exists between seq   2 and seq  72 -- 1.511 > 1.504
Shorter route exists between seq   2 and seq 106 -- 1.638 > 1.385
Shorter route exists between seq   2 and seq 213 -- 2.048 > 1.605
Shorter route exists between seq   2 and seq 266 -- 1.364 > 1.321
Shorter route exists between seq   2 and seq 314 -- 1.823 > 1.470
Shorter route exists between seq   2 and seq 349 -- 1.320 > 1.308
Matrix twed is NOT consistent with a metric space

Announcement