  • SQ-Ados bundle: How to perform clustermat stop after clustering

    Dear all,

    I am using Stata 13.1 on Windows and created a dissimilarity matrix using SQ-Ados, see my code below.

    Code:
    * Use specified csv as input
    import delimited using "C:\Users\04BAJ\Documents\Stata\170703_CFO_SQ_v01.csv", delimiters(";")
    
    * Prepare data for SQ analysis
    reshape long year, i(id) j(order)
    encode year, generate(value)
    drop year
    sqset value id order, trim
    
    * Input substitution cost matrix
    matrix input sub = (0.000,0.274,0.332,0.606,0.394,0.668,0.726,1.000\0.274,0.000,0.606,0.332,0.668,0.394,1.000,0.726\0.332,0.606,0.000,0.274,0.726,1.000,0.394,0.668\0.606,0.332,0.274, 0.000,1.000,0.726,0.668,0.394\0.394,0.668,0.726,1.000,0.000,0.274,0.332,0.606\0.668,0.394,1.000,0.726,0.274,0.000,0.606,0.332\0.726,1.000,0.394,0.668,0.332,0.606,0.000,0.274\ 1.000, 0.726,0.668,0.394,0.606,0.332,0.274,0.000)
    
    * Perform full SQ Analysis with specified substitution and in/del cost
    sqom, full indelcost(0.49) subcost(sub)
    
    * Save dissimilarity matrix to file and replace existing file
    sqom save SQdist, replace
    
    * Prepare data for clustering
    sqclusterdat
    
    * Perform clustering of the dissimilarity matrix using Wards
    clustermat wardslinkage SQdist, name(wards) add
    
    * Calculate Calinski stopping rules for cluster 2 to 10 as generated by Wards and name resulting matrix Calinski
    clustermat stop, variables(value) rule(calinski) groups(2/10) matrix(calinski)

    My goal is to validate the Ward's clustering with clustermat stop. How can I run clustermat stop on the Ward's clusters? Do I need to run sqclusterdat, return first and then apply clustermat stop? It is unclear to me what the correct input for the variables() option would be.

    Many thanks for your help.

  • #2
    Any advice on this please?

    Thank you!



    • #3
      A late reply is possibly no better than no reply, but anyway:

      1: clustermat stop does not work as you might reasonably expect: it calculates the Calinski-Harabasz (CH) statistic from the squared Euclidean distances between the variables listed in the variables() option, not from the SQ distances. I have written a module, calinski (available on SSC), which calculates this correctly. See http://www.ulsites.ul.ie/sociology/s...p2016-01_0.pdf


      2: SQ works on sequences in long format, that is, with multiple observations per case. This is sometimes awkward for analysis (and is what the sqclusterdat command is for). An alternative to sqclusterdat is to reshape wide. Here is an example using the youthemp.dta dataset that comes with SQ.

      Code:
      // Install calinski
      ssc install calinski
      Code:
      // Use youthemp.dta (comes with SQ, do "net get sq" to install it in current directory)
      use youthemp, clear
      
      // rather arbitrary substitution cost matrix
      matrix sm1 = (0,1,1,2,3 \ ///
                    1,0,1,2,3 \ ///
                    1,1,0,2,2 \ ///
                    2,2,2,0,1 \ ///
                    3,3,2,1,0 )
      
      
      // Set up and run SQOM: puts distances in SQdist Stata matrix
      reshape long st, i(id) j(t)
      sqset st id t
      sqom, name(td) indelcost(1.5) subcost(sm1) full
      
      // Important: return to wide format
      reshape wide
      
      // Sort into the order SQ uses internally
      sort st*
      gen id2 = _n
      sort id2
      
      
      clustermat wards SQdist, add
      cluster gen q8=groups(8)
      
      calinski, dist(SQdist) id(id2)
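
To make concrete what a distance-based version of the statistic involves, here is a minimal Mata sketch. It is not the code of the calinski module. It assumes the dissimilarities can be treated like Euclidean distances, so that a sum of squares equals the sum of squared pairwise distances within a group divided by the group size. The function name ch_from_dist is hypothetical.

```stata
mata:
// Hypothetical helper (not part of SQ or calinski): pseudo-F computed
// from a symmetric dissimilarity matrix D and a vector of cluster labels g.
// Requires Stata 13 or later for selectindex().
real scalar ch_from_dist(real matrix D, real colvector g)
{
    real scalar n, k, sst, ssw, j
    real colvector levels, idx
    real matrix Dk

    n      = rows(D)
    levels = uniqrows(g)
    k      = rows(levels)

    // Total sum of squares: each pair appears twice in the full matrix,
    // hence the division by 2*n rather than n
    sst = sum(D:^2) / (2*n)

    // Within-cluster sum of squares, accumulated cluster by cluster
    ssw = 0
    for (j = 1; j <= k; j++) {
        idx = selectindex(g :== levels[j])
        Dk  = D[idx, idx]
        ssw = ssw + sum(Dk:^2) / (2*rows(idx))
    }

    // Between-cluster mean square over within-cluster mean square
    return(((sst - ssw)/(k - 1)) / (ssw/(n - k)))
}
end
```

With the example above in memory, mata: ch_from_dist(st_matrix("SQdist"), st_data(., "q8")) would give the value for the eight-cluster solution. This assumes that the rows of SQdist line up one-to-one with the observations, as arranged by the sort on id2. Values from this sketch need not match the output of calinski exactly.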
