Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sequence analysis - optimal matching problems with SQOM

    Hi all,

    I'm doing sequence analysis, trying to understand patterns for a series of firms across time. I'm using the latest update of the sq package, in Stata 14, and my data have been declared a sqset. I am able to do descriptive stuff (ie sqdes) with no trouble. I can also run sqom successfully with no options, or as sqom, subcost(rawdistance).

    But, when conducting the OM, I really want to use the inverse of transition probabilities. I have calculated these in a matrix I call Pinv, which is a symmetrical 7x7 matrix - corresponding to the 7 'states' my firms can be in at any particular time. I'm attaching a screenshot of the matrix, since it looks bad if I just copy and paste it into the text window.

    The problem arises when I run the following:
    sqom, subcost(Pinv)

    I get the following error:
    subcostmatrix invalid: subcostmatrix is not symmetric
    r(198);


    I am confused, since my matrix clearly is symmetric, and contains the right number of elements. Any advice?

    Just for fun, I also tried the following:
    sqom, subcost(meanprobdistance)

    This also returns an error:
    needlemanwunschexactmatrix(): 3200 conformability error
    sqomref(): - function returned error
    <istmt>: - function returned error


    Any help figuring out what is going on would be very much appreciated.

    Many thanks,
    Tom
    screengrab of the inverse of the transition probabilities

  • #2
    Your matrix is not symmetric. Symmetric means that Pinv[i,j] == Pinv[j,i].

    Transition based substitution costs (which are not necessarily a good idea, but that's another story) are usually calculated as 2 - Pij - Pji

    Comment


    • #3
      Thanks Brendan - that's quite helpful! Our of curiiosity, do you have a reference on your suggestion of a rule of 2 - Pij - Pji?

      Also, this still doesn't explain the error I got using 'meanprobdistance'. Any thoughts there?

      Many thanks
      Tom

      Comment


      • #4
        2 - Pij - Pji is what meanprobdistance is doing automatically, and is described in the sqom help.

        Why you get an error, I don't know. It may be that there is something odd about your data. Can you look at the pattern of transitions and show the data? For instance, if your state variable is m (i.e., you did "sqset m id t"), do the following in long format:
        Code:
        by id: gen lagm = m[_n-1]
        tab lagm m

        Comment


        • #5
          Thanks - I see the relevant discussion in the sqom help file with regard to meanprobdistance.

          Naively, my starting point was to use the markov command to generate transition probabilities. These gave me a matrix as follows

          Code:
          Transition probabilities
          ------------------------
          
          P[7,7]
                  1       2       3       4       5       6       7
          1  0.9358  0.0120  0.0197  0.0167  0.0087  0.0068  0.0003
          2  0.7320  0.0206  0.0928  0.0928  0.0309  0.0206  0.0103
          3  0.4905  0.0228  0.3460  0.0304  0.0076  0.0875  0.0152
          4  0.6106  0.0048  0.0240  0.2356  0.0481  0.0721  0.0048
          5  0.6145  0.0843  0.0120  0.1325  0.0723  0.0602  0.0241
          6  0.3981  0.0093  0.2222  0.1574  0.0185  0.1574  0.0370
          7  0.4286  0.0000  0.1429  0.2143  0.1429  0.0714  0.0000

          So, one thing to note is that some transitions are extremely unusual - this fits my descriptive results using the SQ descriptive commands.

          Using sqom with meanprobdistance, I get

          Code:
          sqom, subcost(meanprobdistance) 
          
          .         sqom, subcost(meanprobdistance) 
          
          symmetric SQsubcost[7,7]
                     c1         c2         c3         c4         c5         c6         c7
          r1          0
          r2  1.9814351          0
          r3  1.9683218  1.9977899          0
          r4  1.9674377  1.9986739  1.9980846          0
          r5   1.986592  1.9985266   1.999558  1.9969058          0
          r6  1.9892441   1.999558  1.9932223  1.9954324  1.9989686          0
          r7  1.9989686  1.9998527   1.999116  1.9994106  1.9994106  1.9992633          0
          
          needlemanwunschexactmatrix():  3200  conformability error
                         sqomref():     -  function returned error
                           <istmt>:     -  function returned error
          r(3200);
          I'm not quite sure what 'odd' data might look like. Based on your suggestion, I ran the following, and got this output:

          Code:
          .         by cid: gen lagm = st[_n-1]
          (640 missing values generated)
          
          . tab lagm st
          
                     |                                 State/Event
                lagm | No Suppor        CED  Venture C  SBIR/STTR  NC Biotec  2 Support  3+ Suppor |     Total
          -----------+-----------------------------------------------------------------------------+----------
                   1 |     5,712         61        105        104         43         37          2 |     6,064 
                   2 |        65          1          9          9          3          2          1 |        90 
                   3 |       110          6         91          8          2         22          4 |       243 
                   4 |       117          0          5         49         10         15          1 |       197 
                   5 |        48          7          1         11          6          5          2 |        80 
                   6 |        36          1         24         16          2         17          4 |       100 
                   7 |         5          0          2          3          2          1          0 |        13 
          -----------+-----------------------------------------------------------------------------+----------
               Total |     6,093         76        237        200         68         99         14 |     6,787
          Is this 'odd'?

          Any suggestions very much welcome. thanks!

          Comment


          • #6
            With transition rates like that, your substitution costs are all nearly 2.0, i.e., nearly constant. That's not necessarily a problem, but it means you are using a fairly neutral scheme.

            Looking at your transitions, it looks like sequences are mostly in state 1, with very short excursions (usually 1 time unit) into the other states.

            I don't see anything wrong with your data, as such, with the exception that state 7 is very rare: try collapsing it into state 6 to see if you can run sqom successfully. If you can't get sqom to work, try SADI (my sequence analysis package, on SSC). If that doesn't work, I have a better chance of understanding the errors.

            Comment


            • #7
              As you suggested, I collapsed states 6 and 7 together, and re-ran sqom with meanprobdistance. This worked, in that it no longer produced an error message.

              But I find the results substantively odd. Specifically, as you observed, the substitution costs that emerge from sqom all cluster right around a value of 2.

              Code:
              .         sqom, subcost(meanprobdistance) // 
              
              symmetric SQsubcost[6,6]
                         c1         c2         c3         c4         c5         c6
              r1          0
              r2  1.9814351          0
              r3  1.9683218  1.9977899          0
              r4  1.9674377  1.9986739  1.9980846          0
              r5   1.986592  1.9985266   1.999558  1.9969058          0
              r6  1.9882128  1.9994106  1.9923383  1.9948431  1.9983793          0
              Distance Variable saved as _SQdist
              Given the exploratory nature of my work, I think it makes sense to use the actual data to generate substitution costs. Certainly theory is of no help. But how do I square the matrix above with the transition probabilities generated with markov below?

              Code:
              Transition probabilities
              ------------------------
              
              P[6,6]
                      1       2       3       4       5       6
              1  0.9358  0.0120  0.0197  0.0167  0.0087  0.0071
              2  0.7320  0.0206  0.0928  0.0928  0.0309  0.0309
              3  0.4905  0.0228  0.3460  0.0304  0.0076  0.1027
              4  0.6106  0.0048  0.0240  0.2356  0.0481  0.0769
              5  0.6145  0.0843  0.0120  0.1325  0.0723  0.0843
              6  0.4016  0.0082  0.2131  0.1639  0.0328  0.1803
              If I consider the formula 2- Pij - Pji for, say transitions between state 6 and state 1, I should get something around 1.59, not 1.98. But that is not what sqom generates, and I am not clear why. Mechanically, I'm unclear on why I am getting such flat substitution costs that do not resemble what I expect given my markov-based transition probabilities.

              At a deeper level though, I'm wondering if I need some other way of dealing with the fact that order, ie i-j vs j-i really matters for my particular set of sequences. To get some conceptual help I have been reading things like:
              Aisenbrey, Silke and Anette E. Fasang. 2010. ‘‘New Life for Old Ideas: The ‘Second Wave’ of Sequence Analysis Bringing the ‘Course’ Back Into the Life Course.’’ Sociological Methods & Research 38:420-462.



              While this brings up the order issue, it doesn't seem to propose real solutions to the issue.

              I welcome your thoughts.

              Comment


              • #8
                OK, with your data (from the table) I do the following:
                Code:
                tab lagm st [freq=s], matcell(tab1)
                tab lagm st if lagm!=st [freq=s], matcell(tab2)
                
                mata:
                tab1 = st_matrix("tab1")
                tab2 = st_matrix("tab2")
                
                tab1r = tab1 :/ rowsum(tab1)
                tab2r = tab2 :/ rowsum(tab2)
                
                trprob1 = J(6,6,2) - tab1r - tab1r'
                trprob2 = J(6,6,2) - tab2r - tab2r'
                
                trprob1 = trprob1 - diag(trprob1)
                trprob2 = trprob2 - diag(trprob2)
                end
                which yields
                Code:
                . mata trprob1
                                 1             2             3             4             5             6
                    +-------------------------------------------------------------------------------------+
                  1 |            0   1.267718411   1.530009799   1.388940975   1.392908971   1.630736743  |
                  2 |  1.267718411             0   1.875308642           1.9   1.879166667   1.957817109  |
                  3 |  1.530009799   1.875308642             0   1.941697479   1.979269547    1.66291562  |
                  4 |  1.388940975           1.9   1.941697479             0   1.811738579   1.750640133  |
                  5 |  1.392908971   1.879166667   1.979269547   1.811738579             0    1.87710177  |
                  6 |  1.630736743   1.957817109    1.66291562   1.750640133    1.87710177             0  |
                    +-------------------------------------------------------------------------------------+
                
                . mata trprob2
                                 1             2             3             4             5             6
                    +-------------------------------------------------------------------------------------+
                  1 |            0   1.096367467   .9780203349    .914004914    1.22919226   1.438655095  |
                  2 |  1.096367467             0    1.85940272   1.898876404    1.87169754   1.955303124  |
                  3 |  .9780203349    1.85940272             0   1.913584637   1.973328592   1.543233083  |
                  4 |   .914004914   1.898876404   1.913584637             0   1.783783784   1.683100683  |
                  5 |   1.22919226    1.87169754   1.973328592   1.783783784             0   1.861449361  |
                  6 |  1.438655095   1.955303124   1.543233083   1.683100683   1.861449361             0  |
                    +-------------------------------------------------------------------------------------+
                In other words, with or without the diagonal (sqom does without), the transition probabilities don't look like what you get from sqom. Something is wrong, either with your setup or with SQ.

                Re the broader question of directional substitution costs (i->j vs j->i): it's a red herring. Substitution costs have to do with similarity between states, not transitions. They are inherently symmetric. You can use transition rates to estimate similarity (because transitions between similar states can be expected to be more common, but you then need to do something to enforce symmetry), and this causes a lot of confusion, but substitutions and transitions are orthogonally different concepts. Substitutions bear on the calculation "how similar is element i in sequence A to element j in sequence B" whereas transitions are changes in time, from i to i+1 in a single sequence.

                Comment


                • #9
                  I see what you are saying about directional substitution costs - I did not fully appreciate this point.

                  In terms of the odd costs that sqom is generating, I wasn't quite sure how to determine where the problem might lie. So I made my data wide, and tried using your sadi suite of commands for comparison. This produced what seems like a more reasonable set of substitution costs. Before going wide, I did the following:

                  Code:
                  .         trans2subs st, id(cid) subs(smat)
                  (640 missing values generated)
                  Generating transition-driven substitution matrix
                  
                  .                 matrix list smat
                  
                  symmetric smat[6,6]
                            c1        c2        c3        c4        c5        c6
                  r1         0
                  r2  1.096367         0
                  r3    .97802  1.859403         0
                  r4   .914005  1.898876  1.913585         0
                  r5  1.229192  1.871698  1.973329  1.783784         0
                  r6  1.438655  1.955303  1.543233  1.683101  1.861449         0
                  This is not identical with your estimates above, though it resembles trprob2 pretty closely. After reshaping and generating a sequence length variable, I ran oma, which did not return any errors.

                  I still can't say what went wrong with sqom, and I would be interested in any ideas you might have. Still, it doesn't re-occur using oma in sadi, so that is a plus.

                  Thank you very much for your help!

                  Comment


                  • #10
                    Brendan - an additional question. After running oma, I test for whether the matrix is metric. After doing some reading, including some of your recent work relating to SADI ('three narratives' etc), I remain befuddled by what is meant by metric. I gather that it matters, but I'm unclear about why. I interpret the output below to suggest that my matrix of distances is not metric. Any advice - whether in terms of actions I can take, or some reading I can do to better understand the issue?

                    Code:
                        
                    .oma st1-st28, subsmat(smat) indel(2) pwdist(oma) len(lth)
                    Normalising distances with respect to length
                    (0 observations deleted)
                    271 unique observations
                    
                    .         metricp oma // not sure if this is a problem, but does not satisfy 'triangle inequlity'
                    Shorter route exists between seq   8 and seq  53 -- 0.618 > 0.616
                    Shorter route exists between seq   8 and seq 284 -- 0.598 > 0.596
                    Shorter route exists between seq   8 and seq 387 -- 0.657 > 0.655
                    Shorter route exists between seq   8 and seq 592 -- 0.594 > 0.592
                    Shorter route exists between seq  14 and seq  79 -- 0.974 > 0.971
                    Shorter route exists between seq  14 and seq 544 -- 0.868 > 0.865
                    Shorter route exists between seq  16 and seq 416 -- 0.921 > 0.920
                    Shorter route exists between seq  16 and seq 624 -- 0.921 > 0.920
                    Shorter route exists between seq  17 and seq  41 -- 1.361 > 1.352
                    Shorter route exists between seq  17 and seq 358 -- 0.951 > 0.946
                    Matrix oma is NOT consistent with a metric space
                    Many thanks!
                    Tom

                    Comment


                    • #11
                      In simple terms "metric" means that a dissimilarity acts like a distance, that we can think of it in spatial terms. This is important for the logic of cluster analysis or multidimensional scaling. metricp tests one particular aspect of this, the "triangle inequality", that the distance from A to C cannot be greater than d(A,B)+d(B,C) for any B (i.e., no short-cuts). I am very surprised to see non-metric distances resulting from OMA, though. Bugs aside, the only way OMA distances can be non-metric is if the substitution costs are non-metric. That is, if the state-space dissimilarities do not imply a metric state space, the resulting OMA distances may also fail the test.

                      This is the source of the problem in your case. If you run metricp smat, detail on the substitution cost matrix it will show you. I've known it in theory that transition rates may indeed yield non-metric substitution costs, but this is the first time I've seen it "in the wild".

                      Code:
                      . metricp trprob2
                      Shorter route exists between seq   3 and seq   4 -- 1.914 > 1.892
                      Matrix trprob2 is NOT consistent with a metric space
                      
                      . metricp trprob2, detail
                        3   4   1:  1.914 >  1.892 =  0.978 +  0.914
                      Matrix trprob2 is NOT consistent with a metric space
                      You could fix it as follows, essentially putting a floor on the substitution costs:
                      Code:
                      . mat trprob2[1,4]=0.94
                      . mat trprob2[4,1]=0.94
                      . metricp trprob2
                      Matrix trprob2 is consistent with a metric space

                      Comment


                      • #12
                        I have talent for producing exceptions to rules! Thanks - your fix works nicely, for oma at least.

                        Just so I'm clear(ish), if the matrix of distances is not metric, or at least this is indicated by failing the triangle inequality test, then is the implication that it is not viable as a measure of the distance between sequences?

                        Building on this, if I am using, say, omav, and my distance matrix does not emerge as consistent with a metric space, does this mean the distances estimated are functionally unusable, in terms of proceeding to cluster analysis? I see a logic for exploring omav based on my project, but of course while a matrix estimated using standard OM now passes the metricp test, one derived using your omav command fails.

                        Code:
                        .metricp omav
                        Shorter route exists between seq   1 and seq  76 -- 0.140 > 0.129
                        Shorter route exists between seq   1 and seq  82 -- 0.179 > 0.144
                        Shorter route exists between seq   1 and seq 100 -- 0.149 > 0.148
                        Shorter route exists between seq   1 and seq 103 -- 0.081 > 0.068
                        Shorter route exists between seq   1 and seq 105 -- 0.121 > 0.113
                        Shorter route exists between seq   1 and seq 131 -- 0.384 > 0.225
                        Shorter route exists between seq   1 and seq 147 -- 0.138 > 0.129
                        Shorter route exists between seq   1 and seq 149 -- 0.132 > 0.104
                        Shorter route exists between seq   1 and seq 159 -- 0.187 > 0.149
                        Shorter route exists between seq   1 and seq 176 -- 0.379 > 0.194
                        Matrix omav is NOT consistent with a metric space
                        Again, many thanks!

                        Comment


                        • #13
                          Try TWED instead of OMAV if you want something less sensitive to spell duration. Since OMAV and Hollister's Localised OM may produce non metric distances, I don't trust them. For a full discussion see my paper (or the preprint version).

                          Comment


                          • #14
                            Many thanks Brendan!

                            Comment


                            • #15
                              And, what if twed produces a non-metric space (also, I'm fuzzy on how to set lambda and nu)?

                              Code:
                              . twed st1-st28, subsmat(smat) lambda(0.5) nu(0.15) pwdist(twed) len(lth) // time warping
                              Normalising distances with respect to length
                              (0 observations deleted)
                              271 unique observations
                              
                              . metricp twed //
                              Shorter route exists between seq   1 and seq 213 -- 1.339 > 1.326
                              Shorter route exists between seq   2 and seq   7 -- 1.641 > 1.444
                              Shorter route exists between seq   2 and seq  11 -- 1.903 > 1.576
                              Shorter route exists between seq   2 and seq  56 -- 1.475 > 1.356
                              Shorter route exists between seq   2 and seq  72 -- 1.511 > 1.504
                              Shorter route exists between seq   2 and seq 106 -- 1.638 > 1.385
                              Shorter route exists between seq   2 and seq 213 -- 2.048 > 1.605
                              Shorter route exists between seq   2 and seq 266 -- 1.364 > 1.321
                              Shorter route exists between seq   2 and seq 314 -- 1.823 > 1.470
                              Shorter route exists between seq   2 and seq 349 -- 1.320 > 1.308
                              Matrix twed is NOT consistent with a metric space

                              Comment

                              Working...
                              X