Sequence and cluster analysis: How do I set up my data?

Sherine Maui

Join Date: Apr 2018

Posts: 90
#1

24 Apr 2023, 10:22

Hi Statalist,

I am confused about how to set up my data for sequence and cluster analysis using the Sq-Ados package (SSC, Brzinsky-Fay & Kohler, 2006).

I have data on children in hospitals. Children do not all enter the data at the same time; they enter only when they need treatment. The data are in long format, so each child gets a new row when they need a different treatment. For example, a child who needs surgery and then a scan has two rows (a sequence of two states), and a child who needs surgery, a scan, and therapy has three rows (a sequence of three states). Once children leave the hospital, they no longer have a record in the data.

This means I have sequences of different lengths. In Sq-Ados, the sqset command has the options ltrim, rtrim, trim, and keeplongest. My understanding is that the sequences all need to be the same length, but because I don’t have any “missing” data, none of these options seems to make a difference. If I keep the sequences at their different lengths, my sqindexplots look odd, with large white gaps (which don’t represent any state).

I cannot share an example of my data as I am not allowed (sensitive admin data etc.). I am not sure if I am setting up my data incorrectly? Here is an example of my code:

Code:

sqset state ID ordervar
sqindexplot, title("treatment trajectories") ranks(1/100) color(green pink orange blue)

The plot has large white gaps and does not resemble the sqindexplot shown in the help file. I think I’m missing something here?

Eric Makela

Join Date: Aug 2022

Posts: 45
#2

25 Apr 2023, 04:59

At first glance, have you tried the order() option of sqindexplot? Stata graphics can do wonky things when the data are not sorted in the order in which they are graphed.

In terms of setting up the sequence data, I presume you've done some research on how to do sequence and cluster analysis? Are you sure about how you constructed ordervar? The sqset example reshapes the data so that id and order are on different data dimensions, but from your description it does not seem that your dataset is shaped this way (there is only one variable each for ID and ordervar). Please correct me if I am wrong.

Sherine Maui

Join Date: Apr 2018
Posts: 90

25 Apr 2023, 07:39

Thanks for your reply, Eric, and for pointing out the correct way to reshape the data.
I have generated an example dataset to show how my data are structured. It is originally in long format.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float ID long state float ordervar
12 1 1
12 3 2
12 4 3
13 4 1
13 1 2
13 3 3
13 2 4
14 3 1
end
label values state state_
label def state_ 1 "care", modify
label def state_ 2 "scan", modify
label def state_ 3 "surgery", modify
label def state_ 4 "therapy", modify

I then followed the steps in the Sq-Ados article:

Code:

reshape wide state, i(ID) j(ordervar)
 
*then reshape to long again:
reshape long state, i(ID) j(ordering)
 
*because I have unequal sequences (missing points at the end of the sequence), I set my data using sqset and rtrim:
 
sqset state ID ordering, rtrim
 
egen length1 = sqlength(), element(1)
egen length2 = sqlength(), element(2)
egen length3 = sqlength(), element(3)
egen length4 = sqlength(), element(4)
 
*Then I use optimal matching and cluster analysis to generate 5 clusters:

matrix sub_cost=(1,1,1,1\1,1,1,1\1,1,1,1\1,1,1,1)
sqom, full indelcost(0.5) subcost(sub_cost) name(dist) standard(longer)
 
*The clustermat command produces three variables: wards_ord, wards_id and wards_hgt

sqclusterdat
clustermat wardslinkage SQdist, name(wards) add
cluster generate clus_5 = groups(5)
sqclusterdat, return
 
*Now I attempt to graph my results as sequence index plots, I have tried the following variations, but they all produce graphs with large white gaps:
 
sqindexplot, by(clus_5) color(green red orange blue)
sqindexplot, order(wards_ord) by(clus_5) color(green red orange blue)
 
egen plotorder=group(wards_hgt length1 length2 length3 length4)
sqindexplot, order(plotorder) by(clus_5) color(green red orange blue)

All the sequence index plots have large white gaps (i.e. the sequences don't extend to the end of the graph). I am not sure if I am setting up my data incorrectly?
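
A minimal sketch, using only built-in Stata commands and the example data above, of what the reshape round trip does to sequences of unequal length. The variable name n_obs is introduced here purely for illustration.

```stata
* Sketch: inspect the effect of reshape wide + reshape long on unequal sequences.
clear
input float ID long state float ordervar
12 1 1
12 3 2
12 4 3
13 4 1
13 1 2
13 3 3
13 2 4
14 3 1
end

* Wide: one row per ID, one column per position (state1-state4).
reshape wide state, i(ID) j(ordervar)

* Long again: every ID now has one row per position 1-4;
* positions that were never observed are filled with missing values.
reshape long state, i(ID) j(ordering)

* n_obs (illustrative name): number of observed, non-missing positions per ID.
bysort ID: egen n_obs = count(state)
list ID ordering state n_obs, sepby(ID)
```

After the round trip the data have 12 rows (3 IDs x 4 positions), and 4 of them have a missing state, all at the end of the shorter sequences. These trailing missing values are what the rtrim option in the code above is then applied to.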

Marc Kaulisch

Join Date: Jan 2016
Posts: 182

25 Apr 2023, 08:23

I can give you no final answer, but I tested your example a bit and changed a few lines of code so that it worked with my version (Stata 17, sqset version 1.2 from 2012).

Code:

clear
input float ID long state float ordervar
12 1 1
12 3 2
12 4 3
12 . 4
13 4 1
13 1 2
13 3 3
13 2 4
14 3 1
14 . 2
14 . 3
14 . 4
end

label values state state_
label def state_ 1 "care", modify
label def state_ 2 "scan", modify
label def state_ 3 "surgery", modify
label def state_ 4 "therapy", modify


reshape wide state, i(ID) j(ordervar)
 
*then reshape to long again:
reshape long state, i(ID) j(ordering)
 
*because I have unequal sequences (missing points at the end of the sequence), I set my data using sqset and rtrim:
 
sqset state ID ordering, rtrim

sqindexplot, color(green red orange blue) name(unclustered)

egen length1 = sqlength(), element(1)
egen length2 = sqlength(), element(2)
egen length3 = sqlength(), element(3)
egen length4 = sqlength(), element(4)
 
*Then I use optimal matching and cluster analysis to generate 5 clusters:

matrix sub_cost=(1,1,1,1\1,1,1,1\1,1,1,1\1,1,1,1)
sqom, full indelcost(0.5) subcost(sub_cost) name(dist) standard(longer)
 
*The clustermat command produces three variables: wards_ord, wards_id and wards_hgt

sqclusterdat
clustermat wardslinkage SQdist, name(wards) add
*Note: only 2 groups are requested here, although the variable keeps the name clus_5
cluster generate clus_5 = groups(2)
sqclusterdat, return
 
*Now I attempt to graph my results as sequence index plots, I have tried the following variations, but they all produce graphs with large white gaps:
 
sqindexplot, by(clus_5) color(green red orange blue) name(sq1_by)
sqindexplot, order(wards_ord) by(clus_5) color(green red orange blue) name(sq2_order)
 
egen plotorder=group(wards_hgt length1 length2 length3 length4)
sqindexplot, order(plotorder) by(clus_5) color(green red orange blue) name(sq3_plotby)

My observation from this limited number of cases is that the vertical axis in sqindexplot may refer to cases that are not part of the by-group. In any case, when I used -sq- I avoided by-groups: I made a graph for each specific group and then combined them, either in Word or, nowadays, maybe with -grc1leg2- from SSC.

EDIT: You may also like to look at the -sadi- package from SSC. To me it looks to be better maintained than the -sq- package. But I only have experience in using -sq- and not -sadi-.
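
A rough sketch of the "one graph per group, then combine" approach described above, meant as a continuation of the code in this post rather than a standalone program. It assumes the data are already sqset and that clus_5 takes the values 1 and 2; the graph names sq_c1, sq_c2 and sq_combined are made up for illustration. It uses the built-in -graph combine- in place of -grc1leg2-, so the legends are not merged.

```stata
* Assumes: data already sqset, clus_5 created by -cluster generate- as above.
* One sequence index plot per cluster, selected with -if- rather than by().
sqindexplot if clus_5 == 1, color(green red orange blue) ///
    title("Cluster 1") name(sq_c1, replace)
sqindexplot if clus_5 == 2, color(green red orange blue) ///
    title("Cluster 2") name(sq_c2, replace)

* Combine the two stored graphs; xcommon requests a common x-axis scale.
graph combine sq_c1 sq_c2, xcommon name(sq_combined, replace)
```

Whether xcommon is enough to line up the position axes across the panels will depend on how sqindexplot labels its axes, so treat it as something to try and not as a guaranteed fix.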

Last edited by Marc Kaulisch; 25 Apr 2023, 08:26.

Sherine Maui

Join Date: Apr 2018

Posts: 90
#5

25 Apr 2023, 10:00

Thank you, Marc. Producing the graphs using "if" instead of "by" sorts out the issue on the y-axis, but I still seem to be getting very odd-looking graphs (I have included an example below). I tried to order the sqindexplot by the variables produced after clustering (wards_ord/wards_hgt), given the following in the Sq-Ados article:

"At the end of the process, the sequence data contain the variables produced by the cluster analysis. The variables suffixed with hgt can be used in the same fashion as the distance variable produced by OM on a reference sequence. We use it to produce yet another version of the sequence index plot."

The figures show cluster 1 and cluster 2. The x-axes differ and there are many white gaps that shouldn't be there.

Last edited by Sherine Maui; 25 Apr 2023, 10:02.