  • Latent Class Analysis

    Hi Statalist

    I am performing latent class analysis in Stata 14.
    I am using the following command:
    Code:
    gsem (accident play insurance stock <- ), logit lclass(C 2)
    But this is the error I got
    option lclass() not allowed
    This is the data I used
    Code:
    clear
    input byte(accident play insurance stock)
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 1
    0 0 0 1
    0 0 1 0
    0 0 1 0
    0 0 1 0
    0 0 1 0
    0 0 1 0
    0 0 1 0
    0 0 1 0
    0 0 1 0
    0 0 1 0
    0 0 1 1
    0 0 1 1
    0 1 0 0
    0 1 0 0
    0 1 0 0
    0 1 0 0
    0 1 0 0
    0 1 0 0
    0 1 0 1
    0 1 1 0
    0 1 1 0
    0 1 1 0
    0 1 1 0
    0 1 1 1
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 0
    1 0 0 1
    1 0 0 1
    1 0 0 1
    1 0 0 1
    1 0 0 1
    1 0 0 1
    1 0 0 1
    1 0 1 0
    1 0 1 0
    1 0 1 0
    1 0 1 0
    1 0 1 0
    1 0 1 0
    1 0 1 0
    1 0 1 0
    1 0 1 0
    1 0 1 0
    end

  • #2
    Stata 14 can’t perform latent class analysis. You will need to use the Penn State University Stata plug-in for that.
    Please use the code delimiters to show code and results - use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

    Please use the command -dataex- to show a representative sample of data; it is installed already if you have Stata 14.2 or 15.1, else you can install it by typing

    Code:
    ssc install dataex



    • #3
      Thank you, Weiwen, for the information.



      • #4
        I performed latent class analysis using the following code:

        Code:
        gsem ( Ciggeret Biddi Pan Alcohol vegetable Fruit <-), logit lclass(C 2)
        But this is the error I got

        Code:
        option lclass() not allowed;
        option lclass() is not allowed with models specified with continuous latent variables
        This is my data

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int(Ciggeret Biddi Pan Alcohol vegetable Fruit)
        0 1 0 1 1 1
        1 2 0 1 1 1
        0 1 0 1 2 1
        0 1 0 1 2 1
        1 1 0 1 2 1
        1 1 1 1 2 0
        1 2 1 0 2 0
        0 2 1 0 2 0
        0 2 1 0 0 0
        0 1 2 0 0 0
        1 1 2 0 0 0
        1 1 2 1 0 0
        0 1 1 0 3 1
        0 1 0 1 3 1
        0 1 1 1 3 1
        0 2 2 0 3 2
        0 2 0 0 0 2
        0 2 0 1 1 2
        0 2 0 1 0 2
        1 2 0 0 0 2
        end



        • #5
          The syntax you showed would be correct for latent class analysis in version 15 or later. In your original post, you said you have Stata version 14. Stata only implemented latent class analysis through the gsem command in version 15. Thus, no matter what you type or how hard you hit the return key, the command will not work, unless you upgraded your Stata license without saying so.
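A quick way to confirm which Stata is actually running before trying lclass(). This is a minimal sketch; c(stata_version) is a built-in c-class value, and the threshold of 15 simply restates the explanation above.

```stata
* Show the version of the Stata executable that is running
display c(stata_version)

* lclass() needs Stata 15 or later, per the explanation above
if c(stata_version) < 15 {
    display as error "gsem, lclass() is not available before Stata 15"
}
```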

          Penn State University wrote a plug-in LCA command for Stata that works on version 14. I believe it only handles binary indicators, and they need to be coded as 1s and 2s, which is a bit different from native Stata. The new (?) data example you showed seems to have some unordered or ordered categorical indicators, and you would have to decide how to treat them. The plug-in's syntax is different from gsem, so do go through its manual to familiarize yourself with it. You can access the dataset Stata provided for their latent class analysis example - the dataset in your original post is probably from SEM example 50.


          • #6
            Thank you, Weiwen, for clarifying the problems.

            I have now used Stata 15 to run the LCA.

            I used the following code in Stata 15:
            Code:
            gsem (ciggaret alcohol fruit1 vegetable1 aerated1 fried <-) if sex==0, logit lclass(C 2)
            However, the model was still not concave at iteration 16000:
            Code:
            Iteration 15996: log likelihood = -8282.4175  (not concave)
            Iteration 15997: log likelihood = -8282.4175  (not concave)
            Iteration 15998: log likelihood = -8282.4175  (not concave)
            Iteration 15999: log likelihood = -8282.4175  (not concave)
            Iteration 16000: log likelihood = -8282.4175  (not concave)
            I received the following error

            Code:
            convergence not achieved
            Also, the following results appear after the error
            Code:
            Generalized structural equation model           Number of obs     =      3,289
            Log likelihood = -8282.4175
            
            ------------------------------------------------------------------------------
                         |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            1.C          |  (base outcome)
            -------------+----------------------------------------------------------------
            2.C          |
                   _cons |  -.7113892   .0371189   -19.17   0.000     -.784141   -.6386375
            ------------------------------------------------------------------------------
            
            Class          : 1
            
            Response       : ciggaret
            Family         : Bernoulli
            Link           : logit
            
            Response       : alcohol
            Family         : Bernoulli
            Link           : logit
            
            Response       : fruit1
            Family         : Bernoulli
            Link           : logit
            
            Response       : vegetable1
            Family         : Bernoulli
            Link           : logit
            
            Response       : aerated1
            Family         : Bernoulli
            Link           : logit
            
            Response       : fried
            Family         : Bernoulli
            Link           : logit
            
            ------------------------------------------------------------------------------
                         |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            ciggaret     |
                   _cons |        -18          .        .       .            .           .
            -------------+----------------------------------------------------------------
            alcohol      |
                   _cons |  -.2497261   .0429168    -5.82   0.000    -.3338415   -.1656106
            -------------+----------------------------------------------------------------
            fruit1       |
                   _cons |   3.942039   .1557935    25.30   0.000     3.636689    4.247388
            -------------+----------------------------------------------------------------
            vegetable1   |
                   _cons |   6.598958   .5777432    11.42   0.000     5.466603    7.731314
            -------------+----------------------------------------------------------------
            aerated1     |
                   _cons |   1.136443   .0496436    22.89   0.000     1.039143    1.233742
            -------------+----------------------------------------------------------------
            fried        |
                   _cons |   1.087771   .0490374    22.18   0.000      .991659    1.183882
            ------------------------------------------------------------------------------
            
            Class          : 2
            
            Response       : ciggaret
            Family         : Bernoulli
            Link           : logit
            
            Response       : alcohol
            Family         : Bernoulli
            Link           : logit
            
            Response       : fruit1
            Family         : Bernoulli
            Link           : logit
            
            Response       : vegetable1
            Family         : Bernoulli
            Link           : logit
            
            Response       : aerated1
            Family         : Bernoulli
            Link           : logit
            
            Response       : fried
            Family         : Bernoulli
            Link           : logit
            
            ------------------------------------------------------------------------------
                         |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            ciggaret     |
                   _cons |   12.22408   148.4807     0.08   0.934    -278.7928     303.241
            -------------+----------------------------------------------------------------
            alcohol      |
                   _cons |    1.08267    .069898    15.49   0.000     .9456723    1.219667
            -------------+----------------------------------------------------------------
            fruit1       |
                   _cons |   4.265493   .2600058    16.41   0.000     3.755891    4.775095
            -------------+----------------------------------------------------------------
            vegetable1   |
                   _cons |   5.886128   .5781578    10.18   0.000      4.75296    7.019296
            -------------+----------------------------------------------------------------
            aerated1     |
                   _cons |   1.261688    .073273    17.22   0.000     1.118076      1.4053
            -------------+----------------------------------------------------------------
            fried        |
                   _cons |   .6070902   .0635951     9.55   0.000      .482446    .7317343
            ------------------------------------------------------------------------------
            convergence not achieved
            r(430);



            • #7
              Please try to be clear about what you are doing, because it makes it easier for readers to help you.

              I now see what went wrong with post #4. You initially left the variable names capitalized (i.e. with a capital first letter). By default, gsem treats any name with a capitalized first letter as a latent variable. Most latent variables are continuous - think random intercepts in mixed models, or the latent trait in SEM models. However, the latent variable in LCA is categorical, not continuous, and gsem can't handle models with both continuous and categorical latent variables. That is why it rejected lclass(). You seem to have renamed the variables to lower case, which is fine. The other option is to use the nocapslatent option and then specify which names the latent variables have - see the gsem syntax for more detail.
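The capitalization issue can be sketched on Stata's own LCA example data. Assumptions: Stata 15 or later, internet access for webuse, and that the example dataset is named gsem_lca1 (the one used in the SEM latent class example). Renaming accident to Accident is only there to mimic the situation in #4; it is not something you would normally do.

```stata
* Load Stata's LCA example data (dataset name is an assumption)
webuse gsem_lca1, clear

* Mimic #4: give one indicator a capitalized name
rename accident Accident

* gsem now reads Accident as a continuous latent variable, so lclass()
* should be rejected; capture noisily shows the error and carries on
capture noisily gsem (Accident play insurance stock <- ), logit lclass(C 2)

* Fix used in the thread: lower-case name, so it is an observed indicator
rename Accident accident
gsem (accident play insurance stock <- ), logit lclass(C 2)
```

The nocapslatent route mentioned above would leave the names as they are; it is not shown here because its exact interaction with lclass() should be checked against the gsem documentation.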

              To your current problem, in class number 1, essentially nobody is endorsing the cigarette item. With logistic items, the coefficients are the log odds of endorsing the item, so if you take the inverse logit of the coefficient, you get the probability. If you type di invlogit(-18), you'll see that the probability is essentially 0. The problem is that when the logit intercepts trend towards positive or negative infinity, the estimation algorithm will not declare convergence.
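To see what those intercepts mean as probabilities, the inverse logit can be applied directly to the values reported in #6 and to the -15 bound discussed in this post (a small sketch using only the built-in invlogit() function):

```stata
* Class 1 cigarette intercept from #6: endorsement probability is essentially 0
display invlogit(-18)

* Class 2 cigarette intercept from #6: endorsement probability is essentially 1
display invlogit(12.22408)

* An intercept of -15 still implies a probability that is practically 0
display invlogit(-15)
```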

              I think that it's justifiable to constrain the logit intercept at -15 in this case (or conversely, if an intercept trends over +15, you can constrain it at 15). However, if you need to constrain too many logit intercepts this way, I would regard this as a bad sign. It would be a sign that you're trying to extract too many latent classes. How many is too many? Unfortunately, that seems like a subjective judgment.

              How do you do this? See the syntax below. Note also that you can limit the maximum number of iterations - Stata 15 had it at 16,000, and in my experience, a latent class model will either converge or clearly be in trouble well before then. What this syntax does is that it limits the number of iterations to 100, and it asks Stata to save the parameter estimates (which control the proportions of each latent class and the means of the indicators in each class) to a matrix. Then it has Stata re-fit the model, with one parameter constrained. Here's some previous discussion of the issue. Note that by my recollection, Penn State's plugin automatically constrains the parameters when this issue occurs, so you might still want to consider switching to that.

              Code:
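              * First pass: stop after 100 iterations and save the parameter estimates
              gsem (ciggaret alcohol fruit1 vegetable1 aerated1 fried <-) if sex==0, logit lclass(C 2) iterate(100)
              matrix b = e(b)
              * Second pass: fix the class-1 ciggaret intercept at -15, starting from the saved estimates
              gsem (ciggaret alcohol fruit1 vegetable1 aerated1 fried <-) (1: ciggaret <- _cons@-15) if sex==0, logit lclass(C 2) iterate(100) from(b)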
              You have to inspect the results and constrain manually every time you do this. If the cigarette item is consistently problematic (i.e. you have to constrain its intercept in multiple classes), you might think about removing it as an indicator entirely and reporting this. I can see that class 2 has a high proportion of respondents endorsing that indicator, so maybe it separates the classes well, but you still might have trouble.
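One way to do that inspection after a fit is to check the convergence flag and then look at the class proportions and the class-specific endorsement probabilities. The sketch below runs on Stata's LCA example data rather than the poster's data (assumptions: Stata 15 or later, internet access for webuse, dataset name gsem_lca1); the same postestimation commands would follow the constrained model above.

```stata
* Fit a 2-class model on Stata's example data (dataset name is an assumption)
webuse gsem_lca1, clear
gsem (accident play insurance stock <- ), logit lclass(C 2)

* 1 if the optimizer declared convergence, 0 otherwise
display e(converged)

* Estimated proportion of the population in each class
estat lcprob

* Class-specific probability of endorsing each item;
* values pinned near 0 or 1 flag intercepts drifting toward infinity
estat lcmean
```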


              • #8
                Thank you, Weiwen. Apart from the AIC and BIC values, do I need to check the significance of the likelihood ratio test for goodness of fit?

                I used this code to compare the AIC and BIC values of the four latent class models:
                Code:
                estimates stats class1 class2 class3 class4
                These are the AIC and BIC values:

                Code:
                Akaike's information criterion and Bayesian information criterion
                
                -----------------------------------------------------------------------------
                       Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
                -------------+---------------------------------------------------------------
                      class1 |     22,249         .  -56432.68       8    112881.4   112945.5
                      class2 |     22,249         .  -55050.16      17    110134.3   110270.5
                      class3 |     22,249         .   -54817.6      26    109687.2   109895.5
                      class4 |     22,249         .   -54740.6      33    109547.2   109811.5
                -----------------------------------------------------------------------------
                               Note: N=Obs used in calculating BIC; see [R] BIC note.

                Code for Goodness of fit

                Code:
                estat lcgof
                Result
                Code:
                ----------------------------------------------------------------------------
                Fit statistic        |      Value   Description
                ---------------------+------------------------------------------------------
                Likelihood ratio     |
                        chi2_ms(222) |    265.770   model vs. saturated
                            p > chi2 |      0.024
                ---------------------+------------------------------------------------------
                Information criteria |
                                 AIC | 109547.203   Akaike's information criterion
                                 BIC | 109811.535   Bayesian information criterion
                ----------------------------------------------------------------------------



                • #9
                  Just use the BIC. I am not familiar with the likelihood ratio test from estat lcgof. (The alternative name is the G^2 statistic.) I am actually not 100% sure what this is testing, but I think it's an overall test of model fit and you actually would not want the test to reject, i.e. p > 0.05. It's not a comparison between models.

                  The problem with tests based on chi-square distributions in general is that when you have a large sample size, they are very sensitive and will often reject on differences that are not substantively significant. You'll note that the SEM example for latent class analysis has only 216 obs. You have over 3k. In the papers in my field that use latent class analysis, I don't believe I've seen anyone rely on the G^2 statistic. Sample sizes similar to or larger (sometimes much larger) than yours are common.
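For reference, the BIC column in #8 can be reproduced by hand from the log likelihood, the number of parameters (df) and N, using the standard definition BIC = -2*ll + df*ln(N). A quick check with the 3- and 4-class rows; the values are copied from that table and the scalar names are only for illustration.

```stata
* N, log likelihoods and df copied from the estimates stats table in #8
scalar nobs = 22249
scalar bic3 = -2*(-54817.6) + 26*ln(nobs)
scalar bic4 = -2*(-54740.6) + 33*ln(nobs)

* These should agree with the BIC column up to rounding of ll
display %10.1f bic3
display %10.1f bic4

* Negative difference: the 4-class model has the lower (better) BIC
display %10.1f (bic4 - bic3)
```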

                  In the LCA field, there is a test called the bootstrap likelihood ratio test that's used to compare different models, e.g. compare the 4-class to the 3-class model. This test isn't available in Stata and there doesn't seem to be any easy way to implement it. If you had it, you should report the results from that test. I think the PSU LCA plugin can do that test.

                  Going back to an earlier point I made about the data sample you showed in post #4, I think your current code treats any non-zero response as equivalent to a 1, because you told Stata you had logit items. You could elect to treat the items as ordered logit instead - but you would then be fitting many more parameters per latent class and you may run into identification problems, especially if some of the categories are rare. It may be acceptable to dichotomize the responses; just make sure that this is actually what you want to do.
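If dichotomizing is the intent, it is safer to do it explicitly and check the result than to rely on how logit treats non-zero values. A minimal sketch using the Biddi and Pan columns from the first ten rows of the -dataex- listing in #4; the new variable name pan_any is hypothetical.

```stata
* First ten rows of Biddi and Pan from the listing in #4
clear
input byte(Biddi Pan)
1 0
2 0
1 0
1 0
1 0
1 1
2 1
2 1
2 1
1 2
end

* Look at the raw coding first: Pan takes the values 0, 1 and 2 here
tabulate Pan, missing

* Explicit 0/1 indicator: any non-zero response counts as endorsement
generate byte pan_any = (Pan > 0) if !missing(Pan)

* Cross-check the recode against the original categories
tabulate Pan pan_any, missing
```

The recoded variable would then go into the gsem indicator list in place of the original one. Whether "any non-zero" is the right cut point depends on what the categories mean, which the listing alone does not show.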