Latent Class Analysis

Strong Marbaniang

Join Date: Apr 2020
Posts: 14

02 Mar 2021, 03:10

Hi Statalist

I am performing latent class analysis in Stata 14, using the following command:

Code:

gsem (accident play insurance stock <- ), logit lclass(C 2)

But this is the error I got

option lclass() not allowed

This is the data I used

Code:

clear
input byte(accident play insurance stock)
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 1
0 0 0 1
0 0 1 0
0 0 1 0
0 0 1 0
0 0 1 0
0 0 1 0
0 0 1 0
0 0 1 0
0 0 1 0
0 0 1 0
0 0 1 1
0 0 1 1
0 1 0 0
0 1 0 0
0 1 0 0
0 1 0 0
0 1 0 0
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 0
0 1 1 0
0 1 1 0
0 1 1 1
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 0
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
end


Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#2

02 Mar 2021, 04:34

Stata 14 can’t perform latent class analysis. You will need to use the Penn State University Stata plug-in for that.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Strong Marbaniang

Join Date: Apr 2020

Posts: 14
#3

02 Mar 2021, 08:53

Thank you, Weiwen, for the information.

Strong Marbaniang

Join Date: Apr 2020
Posts: 14
#4

03 Mar 2021, 00:56

I performed Latent Class Analysis using the following code

Code:

gsem ( Ciggeret Biddi Pan Alcohol vegetable Fruit <-), logit lclass(C 2)

But this is the error I got

Code:

option lclass() not allowed;
option lclass() is not allowed with models specified with continuous latent variables

This is my data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(Ciggeret Biddi Pan Alcohol vegetable Fruit)
0 1 0 1 1 1
1 2 0 1 1 1
0 1 0 1 2 1
0 1 0 1 2 1
1 1 0 1 2 1
1 1 1 1 2 0
1 2 1 0 2 0
0 2 1 0 2 0
0 2 1 0 0 0
0 1 2 0 0 0
1 1 2 0 0 0
1 1 2 1 0 0
0 1 1 0 3 1
0 1 0 1 3 1
0 1 1 1 3 1
0 2 2 0 3 2
0 2 0 0 0 2
0 2 0 1 1 2
0 2 0 1 0 2
1 2 0 0 0 2
end


Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#5

03 Mar 2021, 04:40

The syntax you showed would be correct for latent class analysis in version 15 or later. In your original post, you said you have Stata version 14. Stata only implemented latent class analysis through the gsem command in version 15. Thus, no matter what you type or how hard you hit the return key, the command will not work, unless you upgraded your Stata license without saying so.
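If you are unsure which version is actually running, a quick check along these lines may help. This is only a sketch: it relies on the built-in c(stata_version) value, and the cutoff of 15 simply restates the point above about when lclass() was introduced.

```stata
* Report the running Stata version and whether gsem's lclass() option should exist
display "Running Stata version " c(stata_version)
if c(stata_version) < 15 {
    display as error "lclass() is not available before Stata 15"
}
else {
    display as text "lclass() should be available in this version"
}
```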

Penn State University wrote a plug-in LCA command for Stata that works on version 14. I believe it only handles binary indicators, and they need to be coded as 1s and 2s, which is a bit different from native Stata. The new (?) data example you showed seems like it might have some un-ordered or ordered categorical indicators, which you would have to decide how to treat. That plugin's syntax is different from gsem, so do go through the manual to familiarize yourself with it. You can access the dataset Stata provided for their latent class analysis example - the dataset in your original post is probably from SEM example 50.


Strong Marbaniang

Join Date: Apr 2020
Posts: 14
#6

04 Mar 2021, 01:14

Thank you Weiwen for clarifying the problems.

I have now used Stata 15 to run the LCA, with the following code:

Code:

gsem (ciggaret alcohol fruit1 vegetable1 aerated1 fried <-) if sex==0, logit lclass(C 2)

However, estimation was still not concave at iteration 16000:

Code:

Iteration 15996: log likelihood = -8282.4175  (not concave)
Iteration 15997: log likelihood = -8282.4175  (not concave)
Iteration 15998: log likelihood = -8282.4175  (not concave)
Iteration 15999: log likelihood = -8282.4175  (not concave)
Iteration 16000: log likelihood = -8282.4175  (not concave)

I received the following error

Code:

convergence not achieved

Also, the following results appear after the error

Code:

Generalized structural equation model           Number of obs     =      3,289
Log likelihood = -8282.4175

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.C          |  (base outcome)
-------------+----------------------------------------------------------------
2.C          |
       _cons |  -.7113892   .0371189   -19.17   0.000     -.784141   -.6386375
------------------------------------------------------------------------------

Class          : 1

Response       : ciggaret
Family         : Bernoulli
Link           : logit

Response       : alcohol
Family         : Bernoulli
Link           : logit

Response       : fruit1
Family         : Bernoulli
Link           : logit

Response       : vegetable1
Family         : Bernoulli
Link           : logit

Response       : aerated1
Family         : Bernoulli
Link           : logit

Response       : fried
Family         : Bernoulli
Link           : logit

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ciggaret     |
       _cons |        -18          .        .       .            .           .
-------------+----------------------------------------------------------------
alcohol      |
       _cons |  -.2497261   .0429168    -5.82   0.000    -.3338415   -.1656106
-------------+----------------------------------------------------------------
fruit1       |
       _cons |   3.942039   .1557935    25.30   0.000     3.636689    4.247388
-------------+----------------------------------------------------------------
vegetable1   |
       _cons |   6.598958   .5777432    11.42   0.000     5.466603    7.731314
-------------+----------------------------------------------------------------
aerated1     |
       _cons |   1.136443   .0496436    22.89   0.000     1.039143    1.233742
-------------+----------------------------------------------------------------
fried        |
       _cons |   1.087771   .0490374    22.18   0.000      .991659    1.183882
------------------------------------------------------------------------------

Class          : 2

Response       : ciggaret
Family         : Bernoulli
Link           : logit

Response       : alcohol
Family         : Bernoulli
Link           : logit

Response       : fruit1
Family         : Bernoulli
Link           : logit

Response       : vegetable1
Family         : Bernoulli
Link           : logit

Response       : aerated1
Family         : Bernoulli
Link           : logit

Response       : fried
Family         : Bernoulli
Link           : logit

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ciggaret     |
       _cons |   12.22408   148.4807     0.08   0.934    -278.7928     303.241
-------------+----------------------------------------------------------------
alcohol      |
       _cons |    1.08267    .069898    15.49   0.000     .9456723    1.219667
-------------+----------------------------------------------------------------
fruit1       |
       _cons |   4.265493   .2600058    16.41   0.000     3.755891    4.775095
-------------+----------------------------------------------------------------
vegetable1   |
       _cons |   5.886128   .5781578    10.18   0.000      4.75296    7.019296
-------------+----------------------------------------------------------------
aerated1     |
       _cons |   1.261688    .073273    17.22   0.000     1.118076      1.4053
-------------+----------------------------------------------------------------
fried        |
       _cons |   .6070902   .0635951     9.55   0.000      .482446    .7317343
------------------------------------------------------------------------------
convergence not achieved
r(430);


Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#7

04 Mar 2021, 06:07

Please try to be clear about what you are doing, because it makes it easier for readers to help you.

I now see what went wrong with post #4. You initially left the variable names capitalized (i.e. with a capital first letter). By default, gsem treats any variable whose name begins with a capital letter as a latent variable. Most latent variables are continuous - think random intercepts in mixed models, or the latent trait in SEM models. However, the latent variable in LCA is categorical, not continuous. gsem can't handle models with both continuous and categorical latent variables, which is why it rejected lclass(). You seem to have renamed the variables to lower case, which is fine. The other option is to use the nocapslatent option and then specify the names of the latent variables - see the gsem syntax for more detail.
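The renaming route can be sketched as follows. The two-variable dataset is a toy stand-in that borrows names from post #4; the values are illustrative and not your real data.

```stata
* Toy data with capitalized names, as in post #4 (values are illustrative only)
clear
input byte(Ciggeret Alcohol)
0 1
1 1
0 0
end

* Lowercase every variable name so gsem reads them as observed variables
rename *, lower
describe
```

After this, the variables are called ciggeret and alcohol, and a gsem call that lists them would no longer treat them as latent.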

To your current problem, in class number 1, essentially nobody is endorsing the cigarette item. With logistic items, the coefficients are the log odds of endorsing the item, so if you take the inverse logit of the coefficient, you get the probability. If you type di invlogit(-18), you'll see that the probability is essentially 0. The problem is that when the logit intercepts trend towards positive or negative infinity, the estimation algorithm will not declare convergence.
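To see this with the numbers from your output in post #6, you could run something like the lines below. The third line assumes the usual two-class parameterization, in which the intercept for 2.C converts to a class share with the same inverse logit.

```stata
* Class 1 intercept for ciggaret: probability of endorsing the item
display invlogit(-18)
* Class 2 intercept for ciggaret
display invlogit(12.22408)
* Share of respondents in class 2 (valid as written only with two classes)
display invlogit(-.7113892)
```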

I think that it's justifiable to constrain the logit intercept at -15 in this case (or conversely, if an intercept trends over +15, you can constrain it at 15). However, if you need to constrain too many logit intercepts this way, I would regard this as a bad sign. It would be a sign that you're trying to extract too many latent classes. How many is too many? Unfortunately, that seems like a subjective judgment.

How do you do this? See the syntax below. Note also that you can limit the maximum number of iterations - Stata 15 had it at 16,000, and in my experience, a latent class model will either converge or clearly be in trouble well before then. What this syntax does is that it limits the number of iterations to 100, and it asks Stata to save the parameter estimates (which control the proportions of each latent class and the means of the indicators in each class) to a matrix. Then it has Stata re-fit the model, with one parameter constrained. Here's some previous discussion of the issue. Note that by my recollection, Penn State's plugin automatically constrains the parameters when this issue occurs, so you might still want to consider switching to that.

Code:

gsem (ciggaret alcohol fruit1 vegetable1 aerated1 fried <-) if sex==0, logit lclass(C 2) iterate(100)
matrix b = e(b)
gsem (ciggaret alcohol fruit1 vegetable1 aerated1 fried <-) (1: ciggaret <- _cons@-15) if sex==0, logit lclass(C 2) iterate(100) from(b)

You have to inspect the results and constrain manually every time you do this. If cigarettes is consistently problematic (i.e. you have to constrain its intercept in several of the latent classes), you might think about removing it as an indicator entirely and reporting this. I can see that class 2 has a high proportion of respondents endorsing that indicator, so maybe it separates the classes well, but you still might have trouble.
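One way to make that inspection less error-prone is to scan the coefficient vector for intercepts beyond the 15 threshold. The sketch below uses a hand-typed matrix holding four of the intercepts from post #6, and the column names (c1_ciggaret and so on) are hypothetical labels, not what gsem actually stores. After a real fit you would start from matrix b = e(b) and the names gsem assigns.

```stata
* Four intercepts copied from the output in post #6; column names are hypothetical labels
matrix b = (-18, -.2497261, 3.942039, 12.22408)
matrix colnames b = c1_ciggaret c1_alcohol c1_fruit1 c2_ciggaret

* Collect the names of intercepts whose absolute value exceeds 15
local names : colnames b
local flagged
forvalues j = 1/`=colsof(b)' {
    if abs(b[1, `j']) > 15 {
        local name : word `j' of `names'
        local flagged `flagged' `name'
    }
}
display "Intercepts to consider constraining: `flagged'"
```

With these four values, only the class 1 cigarette intercept is flagged; the class 2 one at 12.2 is large but still under the threshold.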


Strong Marbaniang

Join Date: Apr 2020
Posts: 14
#8

04 Mar 2021, 23:54

Thank you, Weiwen. Apart from the AIC and BIC values, do I also need to check the significance of the likelihood ratio test for goodness of fit?

I used this code to compare the AIC and BIC values of the four latent class models:

Code:

estimates stats class1 class2 class3 class4

These are the AIC and BIC values:

Code:

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
-------------+---------------------------------------------------------------
      class1 |     22,249         .  -56432.68       8    112881.4   112945.5
      class2 |     22,249         .  -55050.16      17    110134.3   110270.5
      class3 |     22,249         .   -54817.6      26    109687.2   109895.5
      class4 |     22,249         .   -54740.6      33    109547.2   109811.5
-----------------------------------------------------------------------------
               Note: N=Obs used in calculating BIC; see [R] BIC note.
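For reference, the AIC and BIC columns can be reproduced by hand from the log likelihood, the number of parameters and the number of observations. The sketch below does this for the class4 row, assuming the standard definitions AIC = -2*ll + 2*k and BIC = -2*ll + k*ln(N); the scalar names are arbitrary.

```stata
* Values copied from the class4 row of the table above
scalar ll = -54740.6
scalar k  = 33
scalar N  = 22249

* Standard definitions of the two information criteria
scalar aic = -2*ll + 2*k
scalar bic = -2*ll + k*ln(N)
display "AIC = " %10.1f aic "   BIC = " %10.1f bic
```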

Code for Goodness of fit

Code:

estat lcgof

Result

Code:

----------------------------------------------------------------------------
Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
        chi2_ms(222) |    265.770   model vs. saturated
            p > chi2 |      0.024
---------------------+------------------------------------------------------
Information criteria |
                 AIC | 109547.203   Akaike's information criterion
                 BIC | 109811.535   Bayesian information criterion
----------------------------------------------------------------------------


Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#9

05 Mar 2021, 06:05

Just use the BIC. I am not familiar with the likelihood ratio test from estat lcgof. (The alternative name is the G^2 statistic.) I am actually not 100% sure what this is testing, but I think it's an overall test of model fit and you actually would not want the test to reject, i.e. p > 0.05. It's not a comparison between models.

The problem with tests based on chi-square distributions in general is that when you have a large sample size, they are very sensitive and will often reject on differences that are not substantively significant. You'll note that the SEM example for latent class analysis has only 216 obs. You have over 3k. In the papers in my field that use latent class analysis, I don't believe I've seen anyone rely on the G^2 statistic. Sample sizes similar to or larger (sometimes much larger) than yours are common.
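If you want to see where the p-value in your estat lcgof output comes from, it appears to be just the upper tail of a chi-squared distribution with 222 degrees of freedom, evaluated at the reported statistic. A one-line check, using only the numbers you posted:

```stata
* Upper-tail probability of a chi-squared(222) variable at the reported statistic
display chi2tail(222, 265.770)
```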

In the LCA field, there is a test called the bootstrap likelihood ratio test that's used to compare different models, e.g. compare the 4-class to the 3-class model. This test isn't available in Stata and there doesn't seem to be any easy way to implement it. If you had it, you should report the results from that test. I think the PSU LCA plugin can do that test.

Going back to an earlier point I made about the data sample you showed in post #4, I think your current code treats any non-zero response as equivalent to a 1, because you told Stata you had logit items. Make sure this is what you actually want to do. You could elect to treat the items as ordered logit instead - but you are now fitting many more parameters per latent class and you may run into identification problems, especially if some of the categories are rare. It may be acceptable to dichotomize the responses, just make sure that this is actually what you want to do.
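If you do decide to dichotomize, it is safer to do it explicitly than to rely on how the logit family reads the raw codes. A minimal sketch with a toy variable follows; vegetable_any is a made-up name, and the cutoff at zero is an assumption about what you want.

```stata
* Toy variable with the 0-3 coding seen in post #4 (values are illustrative only)
clear
input byte(vegetable)
0
1
2
3
end

* 1 for any non-zero response, 0 for zero; missing values stay missing
generate byte vegetable_any = (vegetable > 0) if !missing(vegetable)
tabulate vegetable vegetable_any
```

If you would rather keep the ordering, my understanding is that the ologit family can be given in place of logit in the gsem call, but check the gsem documentation before relying on that.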

Maryam Ghasemi

Join Date: Jul 2022

Posts: 17
#10

11 Feb 2023, 20:22

Hi
I am wondering whether there are any drawbacks to using the plugin from Penn State University (even though I have access to Stata 15). I need to implement an unbiased 3-step LCA distal outcome analysis (all variables binary). It seems that the only way of doing this in Stata is through the LCA Distal BCH function, which I believe can be used after conducting the LCA with the Penn State plugin.
Javiera Cartagena-Farias

Join Date: Jun 2023

Posts: 1
#11

02 Jun 2023, 10:57

Hello, I am running a latent class model using gsem in Stata 17. The model runs incredibly slowly when I try to obtain the marginal means, and even more slowly when I try to predict individual class memberships. I have tried different options, but nothing seems to help. Any suggestions would be more than welcome. Thank you very much in advance!

Code:

gsem (LoS7_Occ2* LoS14_Occ2* LoS21_Occ2* PNMCR* Bed_Occ* <- ), family(gaussian) link(identity) lclass(C 3) iterate(10)
** Latent class marginal probabilities
estat lcprob
** Latent class marginal means
estat lcmean

**** Identifying classes *****
predict classpost*, classposteriorpr