
  • Latent class analysis - constraining categorical variables

    Hi Statalist!

    I am trying to perform a latent class analysis and I am very new to this method, so I hope you will understand if my question is a bit naïve.
    I would like to perform the analysis using nine categorical variables (each with five categories) and obtain three classes. Here is the code I am using:

    Code:
    gsem (e4_1b e4_2b e4_3b e4_4b e4_5b e4_6b e4_7b e4_8b e4_9b <-, ologit), lclass(C 3)
    When I try to estimate the marginal predicted means of the outcome within each latent class (estat lcmean), the command seems to run forever without giving any output. From what I have read on the forum, I suspect the issue is that one coefficient is above 15 in absolute value. Below is part of the output; the large coefficient is cut1 of e4_4b (-24.39).

    Code:
    Class          : 2
    
    Response       : e4_1b
    Family         : ordinal
    Link           : logit
    
    Response       : e4_2b
    Family         : ordinal
    Link           : logit
    
    Response       : e4_3b
    Family         : ordinal
    Link           : logit
    
    Response       : e4_4b
    Family         : ordinal
    Link           : logit
    
    Response       : e4_5b
    Family         : ordinal
    Link           : logit
    
    Response       : e4_6b
    Family         : ordinal
    Link           : logit
    
    Response       : e4_7b
    Family         : ordinal
    Link           : logit
    
    Response       : e4_8b
    Family         : ordinal
    Link           : logit
    
    Response       : e4_9b
    Family         : ordinal
    Link           : logit
    
    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    /e4_1b       |
            cut1 |  -6.207387   .2099544                      -6.61889   -5.795884
            cut2 |  -4.585708     .09238                     -4.766769   -4.404646
            cut3 |  -1.894731   .0285634                     -1.950714   -1.838748
            cut4 |   .7498389   .0206396                      .7093861    .7902917
    -------------+----------------------------------------------------------------
    /e4_2b       |
            cut1 |  -3.075159   .0378004                     -3.149246   -3.001071
            cut2 |  -.6201423   .0158303                     -.6511692   -.5891154
            cut3 |   1.625804   .0206937                      1.585246    1.666363
            cut4 |   4.059866   .0600725                      3.942126    4.177606
    -------------+----------------------------------------------------------------
    /e4_3b       |
            cut1 |  -4.235327   .0753749                     -4.383059   -4.087595
            cut2 |  -2.243794   .0294187                     -2.301453   -2.186134
            cut3 |   -.249513   .0220957                     -.2928197   -.2062063
            cut4 |   2.190033   .0348925                      2.121645    2.258421
    -------------+----------------------------------------------------------------
    /e4_4b       |
            cut1 |  -24.39119   2397.723                     -4723.841    4675.059
            cut2 |  -3.252751   .0575151                     -3.365479   -3.140024
            cut3 |  -.6396046   .0260331                     -.6906285   -.5885807
            cut4 |    2.04866   .0369879                      1.976165    2.121155
    -------------+----------------------------------------------------------------
    /e4_5b       |
            cut1 |  -5.785578   .1817201                     -6.141743   -5.429414
            cut2 |   -2.87042   .0414458                     -2.951652   -2.789188
            cut3 |  -.8560699   .0229513                     -.9010535   -.8110862
            cut4 |   1.127883   .0237217                       1.08139    1.174377
    -------------+----------------------------------------------------------------
    /e4_6b       |
            cut1 |  -.2159903   .0155038                     -.2463771   -.1856035
            cut2 |   1.242691   .0190969                      1.205261     1.28012
            cut3 |   2.667265   .0303306                      2.607818    2.726711
            cut4 |   5.438793   .1438952                      5.156764    5.720823
    -------------+----------------------------------------------------------------
    /e4_7b       |
            cut1 |   -1.62553   .0205138                     -1.665736   -1.585324
            cut2 |   .0296444   .0151189                      .0000119    .0592768
            cut3 |   1.490041   .0191115                      1.452583    1.527499
            cut4 |   3.487356   .0482501                      3.392788    3.581925
    -------------+----------------------------------------------------------------
    /e4_8b       |
            cut1 |   1.331648    .023303                      1.285975    1.377321
            cut2 |   2.582377   .0354519                      2.512892    2.651861
            cut3 |    3.56694    .043742                      3.481207    3.652673
            cut4 |   6.136812   .1994962                      5.745807    6.527817
    -------------+----------------------------------------------------------------
    /e4_9b       |
            cut1 |   .0988198   .0149272                      .0695631    .1280765
            cut2 |   1.434078   .0194443                      1.395968    1.472188
            cut3 |   2.952968   .0329211                      2.888444    3.017492
            cut4 |   5.747459   .1649939                      5.424077    6.070841
    ------------------------------------------------------------------------------
    As I understand it, I should constrain the coefficient to 15. I tried the following code, but it does not work for my categorical variable.

    Code:
    gsem (e4_1b e4_2b e4_3b e4_4b e4_5b e4_6b e4_7b e4_8b e4_9b <- ) (2: e4_4b  <- _cons@15) , ologit lclass(C 3)
    Could anyone please suggest the right specification? Any suggestion is more than welcome.

    Kind regards,
    Marla


  • #2
    Hi Statalist,

    I have another related question. Unfortunately, I can no longer edit my previous post.
    I tried to fit a simpler LCA, and I am wondering how long the estimation usually takes. For the marginal predicted means (estat lcmean), I let it run for almost 18 hours without getting any output, just the spinning wheel, before interrupting it.
    Does anyone know what could cause this, or is this "normal"?
    My dataset is fairly large, with almost 60,000 observations. Could this be the reason for the long processing time?

    Kind regards,
    Marla



    • #3
      Hello Marla, did you figure out how to add the appropriate constraints to your ordinal variable? I have the same issue. It seems like something that should be straightforward, but I am not having any success.



      • #4
        Marla and Josephine,

        I realize this is a bit late, but I don't check the Stata forum all the time these days.

        First off, I have seen at least one latent class analysis paper dichotomize items that were originally categorical. When you fit an LCA model with ordinal items, you have a lot of parameters involved - remember you now have 4 parameters for each item in each class (one for each cutpoint). I realize that dichotomizing is not normally what people would recommend in other contexts, but I think LCA is an exception.

        Additionally, I don't think the fact that the specified cutpoint has an absolute value over 15 is the problem in this case. Here, your coefficient has a standard error, so there's no apparent issue. I can't see the sample size, but estat lcmean will generally run really slowly on large datasets and/or complex models unless you use the nose option, which skips the calculation of the standard errors. I'm not sure how to get around this otherwise. In fact, I have one dataset that's so large that even without calculating standard errors, the estimation time was infeasible, so I went and manually calculated the class-specific means. I had binary items, so this was relatively easy. You can do the same here, but it is more complex. Anyway, the coefficient of -24 basically means that nobody endorses the lowest level of e4_4b in latent class 2.
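
To make the last point concrete, here is a minimal sketch that plugs the posted cut1 for e4_4b in class 2 into the inverse logit. Nothing here is new data: the only input is the -24.39119 from the output above, and invlogit() is Stata's built-in inverse logit function.

```stata
* Probability of the lowest category of e4_4b in class 2, implied by cut1
display invlogit(-24.39119)
```

The result is on the order of 1e-11, which is effectively zero.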

        To the question of constraints, the constraints manual says, on page 2, that cutpoints can't be constrained using that syntax. You have to issue a separate constraint, and you have to name the specific cutpoint. I am not certain how to issue constraints for ordinal items, so I must pass on that question. However, I maintain that this should not be relevant to your issue.
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Thanks Weiwen. For my case, I think dichotomising the ordinal variable isn't going to work - a lot of the nuance in the grouping will come from people who respond to a statement with, say "strongly agree" versus "agree somewhat".
          The rule of thumb about using the presence/absence of SEs to diagnose whether there's a cutpoint issue is very helpful.



          • #6
            Marla Pauli

            I believe you are looking for something like this:

            Code:
            . sysuse auto
            
            . constraint 1 _b[foreign:1.C] = 2
            
            . gsem ( foreign <- , logit) (weight <- , regress),  lclass(C 2) constraints(1)
            
            Fitting class model:
            
            Iteration 0:   (class) log likelihood = -51.292891  
            Iteration 1:   (class) log likelihood = -51.292891  
            
            Fitting outcome model:
            
            Iteration 0:   (outcome) log likelihood = -588.08292  
            Iteration 1:   (outcome) log likelihood = -584.96687  
            Iteration 2:   (outcome) log likelihood = -584.96687  
            
            Refining starting values:
            
            Iteration 0:   (EM) log likelihood = -636.81012
            Iteration 1:   (EM) log likelihood =  -636.7001
            Iteration 2:   (EM) log likelihood = -636.69087
            Iteration 3:   (EM) log likelihood = -636.68984
            Iteration 4:   (EM) log likelihood = -636.68972
            
            Fitting full model:
            
            Iteration 0:   log likelihood = -629.82534  (not concave)
            Iteration 1:   log likelihood = -627.31727  
            Iteration 2:   log likelihood = -624.97096  
            Iteration 3:   log likelihood = -624.73291  
            Iteration 4:   log likelihood = -624.73085  
            Iteration 5:   log likelihood = -624.73085  
            
            Generalized structural equation model           Number of obs     =         74
            Log likelihood = -624.73085
            
             ( 1)  [foreign]1bn.C = 2
             ( 2)  [/]var(e.weight)#1bn.C - [/]var(e.weight)#2.C = 0
            
            ------------------------------------------------------------------------------
                         |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            1.C          |  (base outcome)
            -------------+----------------------------------------------------------------
            2.C          |
                   _cons |   .5604748   .2833409     1.98   0.048     .0051369    1.115813
            ------------------------------------------------------------------------------
            
            Class          : 1
            
            Response       : foreign
            Family         : Bernoulli
            Link           : logit
            
            Response       : weight
            Family         : Gaussian
            Link           : identity
            
            -------------------------------------------------------------------------------
                          |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            --------------+----------------------------------------------------------------
            foreign       |
                    _cons |          2  (constrained)
            --------------+----------------------------------------------------------------
            weight        |
                    _cons |   2297.312   109.6387    20.95   0.000     2082.425      2512.2
            --------------+----------------------------------------------------------------
             var(e.weight)|   298138.1   71993.66                      185725.4    478589.9
            -------------------------------------------------------------------------------
            
            Class          : 2
            
            Response       : foreign
            Family         : Bernoulli
            Link           : logit
            
            Response       : weight
            Family         : Gaussian
            Link           : identity
            
            -------------------------------------------------------------------------------
                          |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            --------------+----------------------------------------------------------------
            foreign       |
                    _cons |  -17.47504   1496.679    -0.01   0.991    -2950.911    2915.961
            --------------+----------------------------------------------------------------
            weight        |
                    _cons |   3431.802   102.6992    33.42   0.000     3230.515    3633.088
            --------------+----------------------------------------------------------------
             var(e.weight)|   298138.1   71993.66                      185725.4    478589.9
            -------------------------------------------------------------------------------
            In this latent class analysis (2 classes) with the foreign and weight variables from the auto dataset, the value for latent class 1's logit constant was constrained to the value of 2. You can do the same but might have better luck using the constraint command and option to gsem as shown above.

            - joe
            Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
            ----
            Research Fellow
            Fors Marsh

            ----
            Version 18.0 MP



            • #7
              Originally posted by Josephine George
              Hello Marla, did you figure out how to add the appropriate constraints to your ordinal variable? I have the same issue. It seems like something that should be straightforward, but I am not having any success

              ...

              I think dichotomising the ordinal variable isn't going to work - a lot of the nuance in the grouping will come from people who respond to a statement with, say "strongly agree" versus "agree somewhat".
              I realize that it's been a while, but a worked example with simulated data and sample code are available here, in case you are still looking. Basically, I believe you need to separately define the constraint. The command looks something like this:

              Code:
              constraint 1 _b[/q5:3.C#cut2] = 15
              Where you replace q5: with the name of whatever indicator you are using (the : indicates to Stata that it's a separate equation), 3.C references whichever latent class needs the constraint, and cut2 references the cutpoint that needs the constraint. In my simulated data, one latent class was simulated as never answering the top level of q5, which is a 3-level ordered categorical item. So, that example constrains the top cutpoint (which is 2) at 15. In Marla's example, most likely the estimation algorithm was trying to go towards an MLE where class 2 had a 0 probability of responding at the bottom category of e4_4b, so she'd have needed to constrain the bottom one at -15.
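
Because the exact parameter name matters for the constraint, it may help to check how Stata labels the cutpoints before writing it. The sketch below is only an illustration on the auto data (a stand-in, not the simulated data from the linked example): it fits a one-class model with a single ordinal indicator and replays it with the coeflegend option, which lists the _b[] name of every parameter.

```stata
sysuse auto, clear
* One ordinal indicator, one latent class, just to have a fitted model to inspect
gsem (rep78 <-, ologit), lclass(C 1)
* Replay the results with the coefficient legend to see how each cutpoint is named
gsem, coeflegend
```

The name shown in the legend for the cutpoint and class of interest is what goes on the left-hand side of the constraint command.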

              Pinging Marla Pauli in case she is also still looking.

              I believe that this is how you would issue the constraint mechanically. I haven't very widely reviewed LCA papers. I haven't seen many that use categorical indicators at all, but here is one paper that used categorical indicators with MPlus. It did not report any constraints in full text or discuss any convergence issues. If you examine their supplemental table 2, you'll see that for latent class #1 (their healthy class), 0.1% of that class have poor self-reported health on a 5-point scale, and 0.1% have severe cognitive impairment on a 3-point scale. I don't know how MPlus handles that situation. We don't know how Stata would handle it (although you might be able to request their data, as it seems to be public but by request), but it's possible that Stata would have convergence issues. Remember, that's now two intercepts that should be wandering off to high values. However, we just don't know.

              Responding to Josephine's point: consider examining the marginal distribution of all your ordinal indicators (that is, just tabulate them). If you have sparse data in a lot of top or bottom categories to begin with, I have a feeling you will quickly run into trouble. This seemed to be happening to the other person I responded to. Anyway, I have seen other LCA papers that do dichotomize ordinal scales. In this paper and this one, (I'm 3rd author on the second paper, both papers use the same type of survey but the authors for each paper are independent and our samples differ) both sets of authors were working with a 4-point scale. Responses to the lowest two categories are relatively rare - not like zero, but quite rare. We did dichotomize at the midpoint. Collapsing is definitely not the preferred method in most cases. In this case, we did feel that there wasn't that much nuance lost. You also don't have to dichotomize things. For example, if we felt like there was a meaningful distinction between strongly vs somewhat agree, but responses to the disagree categories were sparse, we could have collapsed just the bottom two categories.
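
A minimal sketch of the two steps described above, tabulating an ordinal item and collapsing only its bottom two categories. It uses rep78 from the auto data purely as a stand-in for a survey item; the new variable name rep78_c is made up for the illustration.

```stata
sysuse auto, clear
* Marginal distribution of the item; look for sparse top or bottom categories
tabulate rep78, missing
* Merge the two lowest categories and leave the others distinct
recode rep78 (1 2 = 2), generate(rep78_c)
tabulate rep78_c
```

The same pattern could be looped over all indicators before fitting the model.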



              • #8

                Hello everyone!

                I have a similar problem to Marla's: the "estat lcmean" command runs for DAYS, not hours, without producing any result. I am analyzing a five-class LCA model with 7 ordinal variables with 5 categories each. The dataset is quite big, with 50,000 observations.

                Weiwen Ng mentioned above that he manually calculated the class-specific means. Does anyone have code or a tip for how I could do that?

                Many thanks in advance!




                • #9
                  In my limited experience, avoiding calculation of the standard errors by using the -nose- option can reduce the computation time tremendously. In a problem I worked on with an _N of about 1,000, the standard errors were so tiny as to be uninteresting anyway.



                  • #10
                    Originally posted by Michelle Reiter
                    Hello everyone!

                    I have a similar problem to Marla's: the "estat lcmean" command runs for DAYS, not hours, without producing any result. I am analyzing a five-class LCA model with 7 ordinal variables with 5 categories each. The dataset is quite big, with 50,000 observations.

                    Weiwen Ng mentioned above that he manually calculated the class-specific means. Does anyone have code or a tip for how I could do that?

                    Many thanks in advance!

                    I described how to manually calculate the class-specific means for binary indicators of the latent class. That is easy: take the inverse logit of the intercept. Go ahead and fit the latent class model described in SEM examples 50 and 51, then compare the inverse logits you compute by hand with the estat lcmean output.
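
As a small illustration, the class-specific logit intercepts for foreign shown in #6 (2 in class 1, where it was constrained, and -17.47504 in class 2) can be converted to class-specific proportions as below. This is a sketch of the arithmetic only, not a replacement for estat lcmean.

```stata
* Class 1: intercept constrained to 2
display invlogit(2)
* Class 2: intercept estimated at -17.47504, so the proportion is essentially 0
display invlogit(-17.47504)
```

The first line gives roughly 0.88, the implied share of foreign cars in class 1.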

                    For ordinal logit items, it's a bit harder, but it's a tractable problem. First, consider the Methods and formulas section of the ologit manual entry. It says that p_ij, the probability of the j-th observation endorsing the i-th response category, is given by:

                    1 / [1 + exp(-kappa_i + XB)] - 1 / [1 + exp(-kappa_(i-1) + XB)]

                    Here kappa_i is the i-th cutpoint, with kappa_0 taken as minus infinity and the cutpoint above the top category as plus infinity. In a latent class model without covariates, XB is 0. That is, you're using one cutpoint and the one before it to figure out the probability of responding in one category.

                    Now, I don't have access to Stata on a public machine, but I verified this on a remote server. Try this code.

                    Code:
                    sysuse auto
                    gsem (rep78 <-, ologit), lclass(C 1)
                    estat lcmean, nose
                    That is, you fit a latent class model with one indicator (the 1978 repair record) and specified a single latent class. estat lcmean gives the proportion endorsing rep78 = 1 as 0.0289855, so cut1 should come out near -3.5115, the logit of that proportion. If you type

                    Code:
                    di 1 / (1 + exp(3.5115))
                    and compare the result (about 0.029) to the estat lcmean output, you'll see it matches (for latent class #1, the proportion endorsing rep78 = 1). If you type

                    Code:
                    di 1 / (1 + exp(-1.662549)) - 1 / (1 + exp(-.3215836))
                    which uses the values for cutpoints 4 and 3, you'll get 0.2608..., which matches estat lcmean for the prevalence of rep78 = 4 in class 1. How about rep78 = 5, the top response option? That's 1 - 1 / (1 + exp(-cut4)).

                    So, a bit of arithmetic, but this is something you could easily rig up in Excel after you export the coefficient table.
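
The same arithmetic can also be done inside Stata rather than Excel. The sketch below stores the two cutpoints quoted above as scalars (the scalar names cut3 and cut4 are just labels chosen here) and uses the built-in invlogit() function. With several classes, the calculation would be repeated with each class's own cutpoints.

```stata
* Cutpoints 3 and 4 as reported above for the one-class rep78 model
scalar cut3 = .3215836
scalar cut4 = 1.662549
* P(rep78 = 4): cumulative probability at cut4 minus cumulative probability at cut3
display invlogit(scalar(cut4)) - invlogit(scalar(cut3))
* P(rep78 = 5): everything above the top cutpoint
display 1 - invlogit(scalar(cut4))
```

The first display should reproduce the 0.2608... mentioned above.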

