Latent class analysis - constraining categorical variables

Marla Pauli

Join Date: Nov 2020
Posts: 2

29 Nov 2020, 03:26

Hi Statalist!

I am trying to perform a latent class analysis and I am very new to this method, so I hope you will understand if my question is a bit naïve.
I would like to perform the analysis using nine categorical variables (each with five categories) and obtain three classes. Here is the code I am using:

Code:

gsem (e4_1b e4_2b e4_3b e4_4b e4_5b e4_6b e4_7b e4_8b e4_9b <-, ologit), lclass(C 3)

When I try to estimate the marginal predicted means of the outcomes within each latent class (estat lcmean), the command seems to run forever without giving any output. From what I have read on the forum, I suspect the issue is related to one coefficient exceeding 15 in absolute value. Below is part of the output; the large coefficient (originally highlighted in red) is cut1 for e4_4b, at -24.39.

Code:

Class          : 2

Response       : e4_1b
Family         : ordinal
Link           : logit

Response       : e4_2b
Family         : ordinal
Link           : logit

Response       : e4_3b
Family         : ordinal
Link           : logit

Response       : e4_4b
Family         : ordinal
Link           : logit

Response       : e4_5b
Family         : ordinal
Link           : logit

Response       : e4_6b
Family         : ordinal
Link           : logit

Response       : e4_7b
Family         : ordinal
Link           : logit

Response       : e4_8b
Family         : ordinal
Link           : logit

Response       : e4_9b
Family         : ordinal
Link           : logit

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
/e4_1b       |
        cut1 |  -6.207387   .2099544                      -6.61889   -5.795884
        cut2 |  -4.585708     .09238                     -4.766769   -4.404646
        cut3 |  -1.894731   .0285634                     -1.950714   -1.838748
        cut4 |   .7498389   .0206396                      .7093861    .7902917
-------------+----------------------------------------------------------------
/e4_2b       |
        cut1 |  -3.075159   .0378004                     -3.149246   -3.001071
        cut2 |  -.6201423   .0158303                     -.6511692   -.5891154
        cut3 |   1.625804   .0206937                      1.585246    1.666363
        cut4 |   4.059866   .0600725                      3.942126    4.177606
-------------+----------------------------------------------------------------
/e4_3b       |
        cut1 |  -4.235327   .0753749                     -4.383059   -4.087595
        cut2 |  -2.243794   .0294187                     -2.301453   -2.186134
        cut3 |   -.249513   .0220957                     -.2928197   -.2062063
        cut4 |   2.190033   .0348925                      2.121645    2.258421
-------------+----------------------------------------------------------------
/e4_4b       |
        cut1 |  -24.39119   2397.723                     -4723.841    4675.059
        cut2 |  -3.252751   .0575151                     -3.365479   -3.140024
        cut3 |  -.6396046   .0260331                     -.6906285   -.5885807
        cut4 |    2.04866   .0369879                      1.976165    2.121155
-------------+----------------------------------------------------------------
/e4_5b       |
        cut1 |  -5.785578   .1817201                     -6.141743   -5.429414
        cut2 |   -2.87042   .0414458                     -2.951652   -2.789188
        cut3 |  -.8560699   .0229513                     -.9010535   -.8110862
        cut4 |   1.127883   .0237217                       1.08139    1.174377
-------------+----------------------------------------------------------------
/e4_6b       |
        cut1 |  -.2159903   .0155038                     -.2463771   -.1856035
        cut2 |   1.242691   .0190969                      1.205261     1.28012
        cut3 |   2.667265   .0303306                      2.607818    2.726711
        cut4 |   5.438793   .1438952                      5.156764    5.720823
-------------+----------------------------------------------------------------
/e4_7b       |
        cut1 |   -1.62553   .0205138                     -1.665736   -1.585324
        cut2 |   .0296444   .0151189                      .0000119    .0592768
        cut3 |   1.490041   .0191115                      1.452583    1.527499
        cut4 |   3.487356   .0482501                      3.392788    3.581925
-------------+----------------------------------------------------------------
/e4_8b       |
        cut1 |   1.331648    .023303                      1.285975    1.377321
        cut2 |   2.582377   .0354519                      2.512892    2.651861
        cut3 |    3.56694    .043742                      3.481207    3.652673
        cut4 |   6.136812   .1994962                      5.745807    6.527817
-------------+----------------------------------------------------------------
/e4_9b       |
        cut1 |   .0988198   .0149272                      .0695631    .1280765
        cut2 |   1.434078   .0194443                      1.395968    1.472188
        cut3 |   2.952968   .0329211                      2.888444    3.017492
        cut4 |   5.747459   .1649939                      5.424077    6.070841
------------------------------------------------------------------------------

As I understand it, I should constrain the coefficient to 15 in absolute value. I tried the following code, but it is not suitable for my categorical variable.

Code:

gsem (e4_1b e4_2b e4_3b e4_4b e4_5b e4_6b e4_7b e4_8b e4_9b <- ) (2: e4_4b  <- _cons@15) , ologit lclass(C 3)

Could anyone please suggest the right specification? Any suggestion would be more than welcome.

Kind regards,
Marla

Marla Pauli

Join Date: Nov 2020

Posts: 2
#2

29 Nov 2020, 07:38

Hi Statalist,

I have another related question to ask. Unfortunately, I can no longer edit the previous post.
I tried fitting a simpler LCA, and I am wondering how long the estimation usually takes. For the marginal predicted means (estat lcmean), I let it run for almost 18 hours without getting any output, just the wheel spinning, before interrupting the process.
Does anyone know what might cause this, or is this normal?
My dataset is quite big; it includes almost 60,000 observations. Could this be the reason for the long processing time?

Kind regards,
Marla
Comment
Josephine George

Join Date: Dec 2018

Posts: 34
#3

02 Jan 2021, 05:30

Hello Marla, did you figure out how to add the appropriate constraints to your ordinal variable? I have the same issue. It seems like something that should be straightforward, but I am not having any success.
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#4

18 Jan 2021, 13:11

Marla and Josephine,

I realize this is a bit late, but I don't check the Stata forum all the time these days.

First off, I have seen at least one latent class analysis paper dichotomize items that were originally categorical. I realize that dichotomizing is not what people would normally recommend in other contexts, but I think LCA is an exception: when you fit an LCA model with ordinal items, you have a lot of parameters involved. Remember that with five-category items you now have 4 parameters for each item in each class (one for each cutpoint).

Additionally, I don't think the fact that the cutpoint in question has an absolute value over 15 is the problem in this case. Here, your coefficient has a standard error, so there's no apparent issue. I can't see the sample size, but estat lcmean will generally run really slowly on large datasets and/or complex models unless you use the nose option, which skips the calculation of standard errors. I'm not sure how to get around this. In fact, I have one dataset that's so large that even without calculating standard errors, the run time was infeasible, so I manually calculated the class-specific means. I had binary items, so this was relatively easy. You can do the same here, but it is more complex with ordinal items. Anyway, the coefficient of -24 basically means that nobody endorses the lowest level of e4_4b in latent class 2.
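To make the last point concrete, here is a minimal sketch of the arithmetic, done in Python rather than Stata. It assumes the model has no covariates (so the linear predictor is 0) and that the first cutpoint gives the cumulative probability of the lowest category; the value of cut1 is copied from the class 2 output in post #1.

```python
import math

# cut1 for e4_4b in latent class 2, copied from the output in post #1
cut1 = -24.39119

# With no covariates (XB = 0), P(lowest category) = 1 / (1 + exp(-cut1))
p_lowest = 1.0 / (1.0 + math.exp(-cut1))

print(f"P(e4_4b in lowest category | class 2) = {p_lowest:.3e}")
```

The result is a probability that is zero for all practical purposes, which is why the cutpoint drifts toward minus infinity and its standard error becomes huge.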

To the question of constraints, the constraints manual says, on page 2, that cutpoints can't be constrained using that syntax. You have to issue a separate constraint, and you have to name the specific cutpoint. I am not certain how to issue constraints for ordinal items, so I must pass on that question. However, I maintain that this should not be relevant to your issue.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Comment
Josephine George

Join Date: Dec 2018

Posts: 34
#5

19 Jan 2021, 07:31

Thanks Weiwen. For my case, I think dichotomising the ordinal variable isn't going to work - a lot of the nuance in the grouping will come from people who respond to a statement with, say "strongly agree" versus "agree somewhat".
The rule of thumb about using the presence/absence of SEs to diagnose whether there's a cutpoint issue is very helpful.
Comment

Joseph Luchman

Join Date: Mar 2014
Posts: 114

19 Jan 2021, 08:09

Marla Pauli

I believe you are looking for something like this:

Code:

. sysuse auto

. constraint 1 _b[foreign:1.C] = 2

. gsem ( foreign <- , logit) (weight <- , regress),  lclass(C 2) constraints(1)

Fitting class model:

Iteration 0:   (class) log likelihood = -51.292891  
Iteration 1:   (class) log likelihood = -51.292891  

Fitting outcome model:

Iteration 0:   (outcome) log likelihood = -588.08292  
Iteration 1:   (outcome) log likelihood = -584.96687  
Iteration 2:   (outcome) log likelihood = -584.96687  

Refining starting values:

Iteration 0:   (EM) log likelihood = -636.81012
Iteration 1:   (EM) log likelihood =  -636.7001
Iteration 2:   (EM) log likelihood = -636.69087
Iteration 3:   (EM) log likelihood = -636.68984
Iteration 4:   (EM) log likelihood = -636.68972

Fitting full model:

Iteration 0:   log likelihood = -629.82534  (not concave)
Iteration 1:   log likelihood = -627.31727  
Iteration 2:   log likelihood = -624.97096  
Iteration 3:   log likelihood = -624.73291  
Iteration 4:   log likelihood = -624.73085  
Iteration 5:   log likelihood = -624.73085  

Generalized structural equation model           Number of obs     =         74
Log likelihood = -624.73085

 ( 1)  [foreign]1bn.C = 2
 ( 2)  [/]var(e.weight)#1bn.C - [/]var(e.weight)#2.C = 0

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.C          |  (base outcome)
-------------+----------------------------------------------------------------
2.C          |
       _cons |   .5604748   .2833409     1.98   0.048     .0051369    1.115813
------------------------------------------------------------------------------

Class          : 1

Response       : foreign
Family         : Bernoulli
Link           : logit

Response       : weight
Family         : Gaussian
Link           : identity

-------------------------------------------------------------------------------
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign       |
        _cons |          2  (constrained)
--------------+----------------------------------------------------------------
weight        |
        _cons |   2297.312   109.6387    20.95   0.000     2082.425      2512.2
--------------+----------------------------------------------------------------
 var(e.weight)|   298138.1   71993.66                      185725.4    478589.9
-------------------------------------------------------------------------------

Class          : 2

Response       : foreign
Family         : Bernoulli
Link           : logit

Response       : weight
Family         : Gaussian
Link           : identity

-------------------------------------------------------------------------------
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
foreign       |
        _cons |  -17.47504   1496.679    -0.01   0.991    -2950.911    2915.961
--------------+----------------------------------------------------------------
weight        |
        _cons |   3431.802   102.6992    33.42   0.000     3230.515    3633.088
--------------+----------------------------------------------------------------
 var(e.weight)|   298138.1   71993.66                      185725.4    478589.9
-------------------------------------------------------------------------------

In this latent class analysis (2 classes) with the foreign and weight variables from the auto dataset, the value for latent class 1's logit constant was constrained to the value of 2. You can do the same but might have better luck using the constraint command and option to gsem as shown above.

- joe

Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
----
Research Fellow
Fors Marsh
----
Version 18.0 MP

Comment

Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#7

29 Sep 2021, 10:38

Originally posted by Josephine George:

Hello Marla, did you figure out how to add the appropriate constraints to your ordinal variable? I have the same issue. It seems like something that should be straightforward, but I am not having any success

...

I think dichotomising the ordinal variable isn't going to work - a lot of the nuance in the grouping will come from people who respond to a statement with, say "strongly agree" versus "agree somewhat".

I realize that it's been a while, but a worked example with simulated data and sample code is available here, in case you are still looking. Basically, I believe you need to define the constraint separately. The command looks something like this:

Code:

constraint 1 _b[/q5:3.C#cut2] = 15

Where you replace q5: with the name of whatever indicator you are using (the : indicates to Stata that it's a separate equation), 3.C references whichever latent class needs the constraint, and cut2 references the cutpoint that needs the constraint. In my simulated data, one latent class was simulated as never answering the top level of q5, which is a 3-level ordered categorical item. So, that example constrains the top cutpoint (which is 2) at 15. In Marla's example, most likely the estimation algorithm was trying to go towards an MLE where class 2 had a 0 probability of responding at the bottom category of e4_4b, so she'd have needed to constrain the bottom one at -15.

Pinging Marla Pauli in case she is also still looking.

I believe that this is how you would issue the constraint mechanically. I haven't reviewed LCA papers very widely, and I haven't seen many that use categorical indicators at all, but here is one paper that used categorical indicators with Mplus. It did not report any constraints in the full text or discuss any convergence issues. If you examine their supplemental table 2, you'll see that for latent class #1 (their healthy class), 0.1% of that class have poor self-reported health on a 5-point scale, and 0.1% have severe cognitive impairment on a 3-point scale. I don't know how Mplus handles that situation. We don't know how Stata would handle it either (although you might be able to request their data, as it seems to be public but by request), but it's possible that Stata would have convergence issues. Remember, that's now two intercepts that would be wandering off to extreme values. However, we just don't know.

Responding to Josephine's point: consider examining the marginal distribution of all your ordinal indicators (that is, just tabulate them). If you have sparse data in a lot of top or bottom categories to begin with, I have a feeling you will quickly run into trouble. This seemed to be happening to the other person I responded to. Anyway, I have seen other LCA papers that do dichotomize ordinal scales. In this paper and this one (I'm 3rd author on the second paper; both papers use the same type of survey, but the authors of each paper are independent and our samples differ), both sets of authors were working with a 4-point scale. Responses in the lowest two categories were relatively rare: not zero, but quite rare. We did dichotomize at the midpoint. Collapsing is definitely not the preferred method in most cases, but in this case we felt that not much nuance was lost. You also don't have to dichotomize. For example, if we had felt that there was a meaningful distinction between strongly and somewhat agree, but responses in the disagree categories were sparse, we could have collapsed just the bottom two categories.

Comment
Michelle Reiter

Join Date: Jun 2022

Posts: 4
#8

20 Jun 2022, 03:45

Hello everyone!

I have a similar problem to Marla's: the "estat lcmean" command runs for DAYS, not hours, without producing any result. I am analyzing a five-class LCA model with 7 ordinal variables with 5 categories each, and the dataset is quite big, with 50,000 observations.

Weiwen Ng mentioned above that he manually calculated the class-specific means. Does anyone have code or a tip on how I could do that?

Many thanks in advance!
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2426
#9

21 Jun 2022, 08:37

In my limited experience, avoiding calculation of the standard errors by using the -nose- option can reduce the computation time tremendously. In a problem I worked on with an _N of about 1,000, the standard errors were so tiny as to be uninteresting anyway.
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#10

21 Jun 2022, 12:27

Originally posted by Michelle Reiter:

Hello everyone!

I have a similar problem to Marla's: the "estat lcmean" command runs for DAYS, not hours, without producing any result. I am analyzing a five-class LCA model with 7 ordinal variables with 5 categories each, and the dataset is quite big, with 50,000 observations.

Weiwen Ng mentioned above that he manually calculated the class-specific means. Does anyone have code or a tip on how I could do that?

Many thanks in advance!

I described how to manually calculate the class-specific means for binary indicators of the latent class. That is easy: take the inverse logit of the class-specific intercept. Go ahead and fit the latent class model described in SEM examples 50 and 51, then compare the inverse logits of the intercepts to the estat lcmean output.
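As a minimal sketch of the binary case, here is the same calculation in Python rather than Stata. The intercepts are the class-specific _cons values for foreign from Joseph Luchman's auto-data example earlier in this thread; the function name invlogit is just a local helper defined here.

```python
import math

def invlogit(x: float) -> float:
    """Inverse logit, written so that large |x| does not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Class-specific intercepts for the binary item foreign (from the
# auto-data example earlier in the thread; class 1 was constrained to 2).
intercepts = {"class 1": 2.0, "class 2": -17.47504}

# The class-specific mean of a binary item is the inverse logit of its intercept.
class_means = {label: invlogit(b) for label, b in intercepts.items()}

for label, p in class_means.items():
    print(f"{label}: P(foreign = 1) = {p:.6f}")
```

The second value is essentially 0, which is the binary-item version of a parameter heading toward minus infinity.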

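For ordinal logit items, it's a bit harder, but it's a tractable problem. First, consider the Methods and formulas section of the ologit manual entry. It says that p_ij, the probability of the j-th observation responding in the i-th response category, is given by:

p_ij = 1 / [1 + exp(-kappa_i + XB)] - 1 / [1 + exp(-kappa_(i-1) + XB)]

Here kappa_i is the i-th cutpoint, kappa_0 is taken to be minus infinity, and the cutpoint for the top category is taken to be plus infinity. That is, you're using one cutpoint and the one before it to figure out the probability of responding in one category. In a latent class model with no covariates, XB is 0, so only the class-specific cutpoints matter.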
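A minimal sketch of that formula, in Python rather than Stata, might look like this. The function names are my own; the cutpoints are the class 2 values for e4_4b from post #1, and the sketch assumes no covariates (XB = 0) unless you pass one in.

```python
import math

def invlogit(x: float) -> float:
    """Inverse logit, written so that large |x| does not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def ordinal_probs(cutpoints, xb=0.0):
    """Category probabilities for an ordered logit item.

    Cumulative probability up to category i is invlogit(kappa_i - xb);
    each category probability is the difference of adjacent cumulatives.
    """
    # Pad with 0 and 1, i.e. cutpoints at minus and plus infinity.
    cum = [0.0] + [invlogit(k - xb) for k in cutpoints] + [1.0]
    return [cum[i + 1] - cum[i] for i in range(len(cum) - 1)]

# Class 2 cutpoints for e4_4b, copied from the output in post #1.
e4_4b_class2 = [-24.39119, -3.252751, -0.6396046, 2.04866]

probs = ordinal_probs(e4_4b_class2)
for k, p in enumerate(probs, start=1):
    print(f"category {k} (k-th lowest): {p:.4f}")
```

Four cutpoints give five probabilities that sum to 1, and the first one is effectively 0 for this item.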

Now, I don't have access to Stata on a public machine, but I verified this on a remote server. Try this code.

Code:

sysuse auto
gsem (rep78 <-, ologit), lclass(C 1)
estat lcmean, nose

That is, you fit a latent class model with one indicator (the repair record in 1978) and you specified that there's one latent class. cut1 is about -3.5115, which is the logit of 0.0289855. If you type

Code:

di 1 / (1 + exp(3.5115))

and you compare that to the estat lcmean output, you'll see it matches the estat lcmean output (for latent class #1, the proportion endorsing rep78 = 1, which is 0.0289855). If you type

Code:

di 1 / (1 + exp(-1.662549)) - 1 / (1 + exp(-.3215836))

which uses the values for cutpoints 4 and 3, you'll get 0.2608..., which matches estat lcmean for the prevalence of rep78 = 4 in class 1. How about rep78 = 5, the top response option? That's 1 - 1 / (1 + exp(-cut4)), with cut4 = 1.662549.

So, a bit of arithmetic, but this is something you could easily rig up in Excel after you export the coefficient table.
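If a spreadsheet is not convenient, the same arithmetic can be scripted. The following is only a sketch in Python, not Stata: the dictionary stands in for an exported coefficient table (here, a few class 2 items typed in from post #1), the function names are my own, and no covariates are assumed. The last lines repeat the rep78 = 4 check using cutpoints 3 and 4 quoted above.

```python
import math

def invlogit(x: float) -> float:
    """Inverse logit, written so that large |x| does not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def ordinal_probs(cutpoints):
    """Category probabilities from ordered-logit cutpoints (XB assumed 0)."""
    cum = [0.0] + [invlogit(k) for k in cutpoints] + [1.0]
    return [cum[i + 1] - cum[i] for i in range(len(cum) - 1)]

# Stand-in for an exported coefficient table: class 2 cutpoints from post #1.
class2_cuts = {
    "e4_1b": [-6.207387, -4.585708, -1.894731, 0.7498389],
    "e4_4b": [-24.39119, -3.252751, -0.6396046, 2.04866],
    "e4_8b": [1.331648, 2.582377, 3.56694, 6.136812],
}

# One row of category probabilities per item.
table = {item: ordinal_probs(cuts) for item, cuts in class2_cuts.items()}

print("item    " + "".join(f"  cat{k}  " for k in range(1, 6)))
for item, probs in table.items():
    print(f"{item:8s}" + "".join(f"{p:8.4f}" for p in probs))

# Check of the rep78 example: P(rep78 = 4) from cutpoints 3 and 4.
p_rep78_4 = invlogit(1.662549) - invlogit(0.3215836)
print(f"P(rep78 = 4 | class 1) = {p_rep78_4:.4f}")
```

Repeating this for each latent class gives the same quantities that estat lcmean, nose reports, without the long run time.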

Comment
