Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Latent Class Analysis with Stata 15 -gsem- Problem

    I'm trying out the new Latent Class Analysis feature of Stata 15's -gsem- command with Stata/IC 15, but I have been unable to get any model to converge except the very simple example in Example 50b in the Stata 15 sem.pdf manual.

    The Methodology Center at Penn State provides a Latent Class Analysis Plugin for Stata (version 1.21), which I have been using for a couple of years with Stata 14 without difficulty, but I was hoping the new Stata 15 latent class feature would allow me to avoid using an additional plugin. (The plugin is only available for Windows and can be downloaded at https://methodology.psu.edu/downloads/lcastata .)

    In my testing, I'm using a data set included with Penn State's LCA plugin and consisting of LcaSampleDataset.dct and LcaSampleDataset.txt. That data set is a subsample of N = 1,000 cases drawn from a larger sample (N = 13,840). Example results from the larger data set are shown at https://methodology.psu.edu/ra/lca/example , and the results of my testing using the plugin with the subset are consistent with the web site results from the larger sample.

    The LCA plugin produces results in 1.9 seconds.

    I have also used the poLCA package with R 3.4.0 and reproduced the results with the same data in about the same amount of elapsed time.

    My attempts to reproduce these results in Stata/IC 15 (after recoding the indicator variables from 1/2 to 0/1 values) have all failed to converge and consistently have shown the "not concave" issue after about 50 iterations.

    I've tried many options with -gsem-, but the following is a typical example of the code I have tried. I've reduced the integration points to 5 (from the default of 7), have added a startvalues() option with 30 draws, have set the integration method to intmethod(ghermite), have added the -difficult- option, and have run 10,000 iterations.

    The Stata 15 sem.pdf manual in "intro 12" indicates that intmethod(ghermite) "is less accurate but quicker, and the calculation it makes converges more readily than either of the above methods." I used the intpoint(5) and intmethod(ghermite) options only to get an initial model to converge so I can obtain better starting values.

    Code:
    gsem /// 
    (SmokedBefore13 DailySmoke DroveDrunk DrankBefore13 BingeDrink MarijuanaBefore13 CocaineEver GlueEver MethEver EcstasyEver SexBefore13 ManyPartners <-, nocapslatent), /// 
    logit /// 
    lclass(C 5) /// 
    startvalues(randomid, draws(30) seed(123321)) ///
    intpoints(5) /// 
    iter(10000)
    This command produces an EM log likelihood that does not change (with four significant digits) after the 50th iteration. The model is also consistently reported to be not concave after that iteration, and the -difficult- option does not help. (I've tried the command with and without the -difficult- option.)

    I would appreciate any advice about other options to try or option values to tweak with the new Stata 15 -gsem- latent class analysis feature. I was surprised that Stata 15 does not seem to be able to converge to results that can be obtained in about 2 seconds with both the LCA Plugin and the poLCA package with R 3.4.0. I have not been able to try this with MPlus yet, but I've not had problems with Latent Class Analysis with Mplus before, and MPlus is generally very fast.

    Thanks for any help and advice.

    Red Owl
    (yes, my real name)

  • #2
    Thanks to Red Owl for bringing this to our attention, and for providing an example we can use to investigate the problem.

    We are looking into the problem, and hope to have something more to report later on today.

    Comment


    • #3
      A little later than I intended, but ...

      gsem is having trouble fitting the model because the data require some of the latent classes to have completely determined values for some of the logit outcome variables. This situation can be a challenge to detect because the model, as specified, is not identified without some additional constraints.

      It appears that Mplus and the Penn State Latent Class Analysis Plugin for Stata provide an automatic mechanism for adding these identifying constraints. For example, when Mplus encounters a logit intercept too close to -15, it sets/constrains the intercept to -15; and similarly for intercepts too close to 15.

      gsem does not provide automatic identifying constraints in this case, but Red Owl can still use gsem to fit this model with this data by adding the nonrtolerance option. With this option, gsem will iterate to a solution, even if that solution is in a non-concave region of the parameter space.

      For those interested in comparing the gsem fitted parameters with those from Mplus, be aware that the latent classes may not be labeled with the same values. A mapping between the two might be established by comparing the predicted latent class probabilities from Mplus with those reported by estat lcprob.

      Comment


      • #4
        More details:

        Here is a list of my modifications to Red Owl's original model specification:
        • moved option nocapslatent to the global options list
        • changed the number of random draws to 15, to speed things up
        • changed the number of EM iterations to 5, to speed things up
        • removed option intpoint(5) since gsem does not use quadarture to fit models with latent classes
        • removed option iter(10000) since it is not necessary
        • added option nodvheader to cut down on the redundant output
        • added option nonrtolerance to get gsem to iterate to a solution even if it's in a non-concave region of the paramter space

        Code:
        gsem   (SmokedBefore13                                          ///
                DailySmoke                                              ///
                DroveDrunk                                              ///
                DrankBefore13                                           ///
                BingeDrink                                              ///
                MarijuanaBefore13                                       ///
                CocaineEver                                             ///
                GlueEver                                                ///
                MethEver                                                ///
                EcstasyEver                                             ///
                SexBefore13                                             ///
                ManyPartners <- )                                       ///
                ,                                                       ///
                        nocapslatent                                    ///
                        logit                                           ///
                        lclass(C 5)                                     ///
                        startvalues(randomid, draws(15) seed(123321))   ///
                        em(iter(5))                                     ///
                        nodvheader                                      ///
                        nonrtolerance
        The above command produces lots of output, so here is a partial replay
        of the estimation results:

        Code:
        . gsem, nodvheader
        
        (some output omitted)
        
        Class          : 5
        
        ------------------------------------------------------------------------------
                     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        SmokedBef~13 |
               _cons |  -3.348531   .2816733   -11.89   0.000    -3.900601   -2.796462
        -------------+----------------------------------------------------------------
        DailySmoke   |
               _cons |  -3.947107   .3396679   -11.62   0.000    -4.612843    -3.28137
        -------------+----------------------------------------------------------------
        DroveDrunk   |
               _cons |   -4.25491   .5429224    -7.84   0.000    -5.319019   -3.190802
        -------------+----------------------------------------------------------------
        DrankBefo~13 |
               _cons |  -1.909907   .1230908   -15.52   0.000    -2.151161   -1.668654
        -------------+----------------------------------------------------------------
        BingeDrink   |
               _cons |  -3.045484   .3394023    -8.97   0.000      -3.7107   -2.380267
        -------------+----------------------------------------------------------------
        Marijuana~13 |
               _cons |  -4.781259   .4751201   -10.06   0.000    -5.712477    -3.85004
        -------------+----------------------------------------------------------------
        CocaineEver  |
               _cons |  -18.45057          .        .       .            .           .
        -------------+----------------------------------------------------------------
        GlueEver     |
               _cons |  -2.790898    .176965   -15.77   0.000    -3.137743   -2.444053
        -------------+----------------------------------------------------------------
        MethEver     |
               _cons |  -19.65086          .        .       .            .           .
        -------------+----------------------------------------------------------------
        EcstasyEver  |
               _cons |   -19.1691          .        .       .            .           .
        -------------+----------------------------------------------------------------
        SexBefore13  |
               _cons |  -4.929153   .4963824    -9.93   0.000    -5.902045   -3.956262
        -------------+----------------------------------------------------------------
        ManyPartners |
               _cons |  -3.469307    .499461    -6.95   0.000    -4.448233   -2.490382
        ------------------------------------------------------------------------------
        Notice that for latent class labeled "5", the intercepts for CocaineEver, MethEver, and EcstasyEver all need to be constrained.

        Here are the predicted latent class probabilities:
        Code:
        . estat lcprob, nose
        
        Latent class marginal probabilities             Number of obs     =      1,000
        
        --------------------------------------------------------------
                     |     Margin
        -------------+------------------------------------------------
                   C |
                  1  |    .065663
                  2  |   .1521513
                  3  |   .0508343
                  4  |    .059735
                  5  |   .6716164
        --------------------------------------------------------------

        Comment


        • #5
          The thorough researcher might use these results as starting values in a refit with additional Mplus style model identifying constraints.

          Note that in the following I use estimates store to store the results from both model fits for direct comparison using estimates table.

          Code:
          estimates store nonrtol
          matrix b0 = e(b)
          
          gsem   (SmokedBefore13                                          ///
                  DailySmoke                                              ///
                  DroveDrunk                                              ///
                  DrankBefore13                                           ///
                  BingeDrink                                              ///
                  MarijuanaBefore13                                       ///
                  CocaineEver                                             ///
                  GlueEver                                                ///
                  MethEver                                                ///
                  EcstasyEver                                             ///
                  SexBefore13                                             ///
                  ManyPartners <- )                                       ///
                  (4: SmokedBefore <- _cons@15)                           ///
                  (4: CocaineEver <- _cons@-15)                           ///
                  (5: CocaineEver <- _cons@-15)                           ///
                  (3: GlueEver <- _cons@-15)                              ///
                  (3: MethEver <- _cons@-15)                              ///
                  (4: MethEver <- _cons@-15)                              ///
                  (5: MethEver <- _cons@-15)                              ///
                  (5: EcstasyEver <- _cons@-15)                           ///
                  (3: ManyPartners <- _cons@15)                           ///
                  ,                                                       ///
                          nocapslatent                                    ///
                          logit                                           ///
                          lclass(C 5)                                     ///
                          em(iter(0))                                     ///
                          nodvheader                                      ///
                          from(b0)
          estimates store withCns
          
          estimates table nonrtol withCns, stat(ll)

          Comment


          • #6
            @JeffPitblado

            Thanks so much for very thorough and informative reply. That is very, very helpful.

            I now see the -nonrtolerance- option mentioned on p. 520 of the Stata 15 sem.pdf manual, but I had not known its purpose before your reply.

            May I suggest that a short comment related to this issue be added to the Remarks for Example 50g in the Stata 15 sem.pdf manual? Even though this issue doesn't apply to the example data, a caution for other data sets might be helpful.

            Again, thank you for the obvious amount of time you invested and your thoroughness in helping me understand this issue.

            Red Owl

            Comment


            • #7
              @JeffPitblado

              As I mentioned in my support request email a couple of days ago but which I did not include in my forum post, I am also having a problem in obtaining goodness of fit statistics for this model.

              Even after running your code, including manually setting the constraints, this is what happens:
              Code:
              . estat lcgof
              
              too many terms in interaction
              I expect this is related to the need to specify constraints, but that's just a guess.

              Any advice on how I can obtain the AIC, BIC, and G-sq for this model?

              (I'm using Stata/IC 15 on Windows 10.)

              Thanks for any more help you can offer.

              Red Owl

              Comment


              • #8
                Hi Red Owl,

                Thanks for pointing this out. You found a bug in estat lcgof.

                We have determined the cause and will have this fixed in a future Stata 15 update (hopefully the one that is about to happen very soon).

                In the mean time, AIC and BIC can be obtained from estat ic.

                Comment


                • #9
                  Jeff Pitblado (StataCorp)

                  Thanks so much.

                  That's very helpful, and I appreciate all the effort you invested in this request.

                  Red Owl

                  Comment


                  • #10
                    Hi @JeffPitblado,

                    I have tried using your code from the 12 Jun 2017, 23:01 using the LcaSampleDataset from Penn State's LCA plugin (1000 observations) , but get very different results.

                    gsem (SmokedBefore13 /// DailySmoke /// DroveDrunk /// DrankBefore13 /// BingeDrink /// MarijuanaBefore13 /// CocaineEver /// GlueEver /// MethEver /// EcstasyEver /// SexBefore13 /// ManyPartners <- ) /// , /// nocapslatent /// logit /// lclass(C 5) /// startvalues(randomid, draws(15) seed(123321)) /// em(iter(5)) /// nodvheader /// nonrtolerance
                    results:

                    Click image for larger version

Name:	1.PNG
Views:	1
Size:	37.6 KB
ID:	1451452


                    Click image for larger version

Name:	2.PNG
Views:	1
Size:	10.6 KB
ID:	1451453

                    When using the Penn State plugin and the same sample dataset my results are more similar to your latent class marginal probabilities.
                    Click image for larger version

Name:	3.PNG
Views:	1
Size:	6.9 KB
ID:	1451454
                    .


                    Thanks for any help and advice!

                    Hannah

                    Comment


                    • #11
                      Dear Hannah,
                      What you need to do is to deduct 1 from the dummy variables so that their coding changes from 1/2 into 0/1, like:
                      Code:
                      clear all
                      import delimited "LcaSampleDataset.txt" , delim(" ")
                      replace v1 = v1-1
                      forvalue i = 4/15 {
                          replace v`i' = v`i'-1
                      }
                      Next, to get the appropriate variable names (although I do not know what the second variable name is):
                      Code:
                      tokenize "Sex Unknown Age SmokedBefore13 DailySmoke DroveDrunk DrankBefore13 BingeDrink MarijuanaBefore13 CocaineEver GlueEver MethEver EcstasyEver SexBefore13 ManyPartners ID"
                      forvalue i = 1/16 {
                          rename v`i' `1'
                          macro shift
                      }
                      Next, I ran the code of Jeff Pitblado, i.e. that in comment #4, and that produces on my system (running Stata 15.1 18 April 2018), these Latent class marginal probabilities, which look about right:
                      Code:
                                1  |     .06568
                                2  |   .1520551
                                3  |   .0508999
                                4  |   .0597778
                                5  |   .6715872
                      However, note that the model results are different from what was reported by Jeff in #4:
                      Code:
                      Class           : 5
                      ---------------------------------------------------------------------------------
                                      |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                      ----------------+----------------------------------------------------------------
                      SmokedBefore13  |
                                _cons |   3.349003   .2818249    11.88   0.000     2.796636    3.901369
                      ----------------+----------------------------------------------------------------
                      DailySmoke      |
                                _cons |   3.947388   .3397552    11.62   0.000      3.28148    4.613296
                      ----------------+----------------------------------------------------------------
                      DroveDrunk      |
                                _cons |   4.255221   .5431407     7.83   0.000     3.190685    5.319757
                      ----------------+----------------------------------------------------------------
                      DrankBefore13   |
                                _cons |   1.909951   .1230945    15.52   0.000      1.66869    2.151212
                      ----------------+----------------------------------------------------------------
                      BingeDrink      |
                                _cons |   3.045215    .339345     8.97   0.000     2.380111    3.710319
                      ----------------+----------------------------------------------------------------
                      MarijuanaBef~13 |
                                _cons |   4.780941   .4750635    10.06   0.000     3.849834    5.712049
                      ----------------+----------------------------------------------------------------
                      CocaineEver     |
                                _cons |   17.36323   245.0288     0.07   0.944    -462.8844    497.6109
                      ----------------+----------------------------------------------------------------
                      GlueEver        |
                                _cons |   2.790897   .1769679    15.77   0.000     2.444047    3.137748
                      ----------------+----------------------------------------------------------------
                      MethEver        |
                                _cons |   16.28136   138.1936     0.12   0.906    -254.5731    287.1358
                      ----------------+----------------------------------------------------------------
                      EcstasyEver     |
                                _cons |   15.89278   157.6046     0.10   0.920    -293.0066    324.7921
                      ----------------+----------------------------------------------------------------
                      SexBefore13     |
                                _cons |   4.928787    .496215     9.93   0.000     3.956224    5.901351
                      ----------------+----------------------------------------------------------------
                      ManyPartners    |
                                _cons |   3.470972   .5008762     6.93   0.000     2.489273    4.452672
                      ---------------------------------------------------------------------------------
                      My assumption is that in a later release of Stata some improvements were implemented that no longer require the model parameters as explained by Jeff in #5 (actually, that code fails to converge on my system).

                      I hope that this is helpful.

                      PS you do not need to snapshot code or output from Stata's result window but copy what you need from there and wrap that with the CODE indicators (available through the Reply menu button #). You best first trigger the CODE indicators and next simply paste your text in between. Using the button Preview gives comfort as you can check if all is OK!
                      http://publicationslist.org/eric.melse

                      Comment


                      • #12
                        Dear ericmelse

                        Thank you very much for the fast and informative reply!

                        I have one more question. I am running a LCA with continuous observed variables, e.g. nearest distance to the pool from participant's house or the count of care facilities within a 400m radius of participant's home, but all my standard errors, p-values and confidence intervals are not showing, all variables are constrained.

                        Jeff Pitblado had a similar problem and used the constraint -15/+15 as a solution. However he had categorical observed variables and used a logit model so the same solution would not work for me. Is there a similar solution for continuous variables?


                        results:
                        Code:
                        -------------------------------------------------------------------------------------------
                                                  |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                        --------------------------+----------------------------------------------------------------
                        nrdist_pool             |
                                            _cons |   192.9849   .        .       .            .           .
                        --------------------------+----------------------------------------------------------------
                        cnt400_care             |
                                            _cons |   3.34          .        .       .            .           .
                        --------------------------+----------------------------------------------------------------
                        nrdist_market             |
                                            _cons |   115.1486          .        .       .            .           .
                        --------------------------+----------------------------------------------------------------
                        nrdist_toilet            |
                                            _cons |   134.29          .        .       .            .           .

                        Thanks for any help and advice!
                        Hannah

                        Attached Files

                        Comment


                        • #13
                          Originally posted by Hannah Schmidt View Post
                          Dear ericmelse

                          Thank you very much for the fast and informative reply!

                          I have one more question. I am running a LCA with continuous observed variables, e.g. nearest distance to the pool from participant's house or the count of care facilities within a 400m radius of participant's home, but all my standard errors, p-values and confidence intervals are not showing, all variables are constrained.

                          Jeff Pitblado had a similar problem and used the constraint -15/+15 as a solution. However he had categorical observed variables and used a logit model so the same solution would not work for me. Is there a similar solution for continuous variables?


                          results:
                          Code:
                          -------------------------------------------------------------------------------------------
                          | Coef. Std. Err. z P&gt;|z| [95% Conf. Interval]
                          --------------------------+----------------------------------------------------------------
                          nrdist_pool |
                          _cons | 192.9849 . . . . .
                          --------------------------+----------------------------------------------------------------
                          cnt400_care |
                          _cons | 3.34 . . . . .
                          --------------------------+----------------------------------------------------------------
                          nrdist_market |
                          _cons | 115.1486 . . . . .
                          --------------------------+----------------------------------------------------------------
                          nrdist_toilet |
                          _cons | 134.29 . . . . .

                          Thanks for any help and advice!
                          Hannah
                          Hannah, please don't attach screenshots, particularly not if you've already posted the output in code delimiters.

                          That said, it's a lot less clear what to do in this situation. An initial thought is that it's clear that you have one variable that may not be on the same scale as the others - cnt500_market. I don't know what the actual scale is, it could merely be low in the class you posted. I believe I've read elsewhere that in some cases, vastly different scales can mess up maximization processes.

                          Some more context would be helpful to answer:

                          1) What's the exact command you typed?

                          2) Can you post descriptive statistics for the sample? This should work:

                          Code:
                          tabstat nrstdist_care cnt500_market nrdist_market nrdist_pool, statistics(mean sd)
                          3) Can you post the full output from this estimation run? I am curious what the means are for the remaining classes that you identified. If it's not too much trouble, can you also run -estat lcprob- and post that output?

                          From my limited understanding, the issue with logit intercepts above is that as the probability of endorsing a binary item approaches zero or one in any particular class, the logit intercept goes off to + or - infinity. I believe that Stata's algorithm may have issues declaring convergence in those cases, because the likelihood is not concave*. I've certainly seen it happen when I've had several such cases of very high or very low logit intercepts in one run. In that case, you frequently will see missing standard errors associated with the coefficients in question.

                          In the case of continuous variables, I am less sure what a missing SE means. I am wondering if one class is very, very small, and the class-specific means are wandering off into some very high level. Or ... I'm not sure what else I'd look for, or even that I would know if I saw it, but output would help!

                          * This is why Jeff recommends that you can use the -nonrtolerance- option for initial runs. Stata won't check if the second derivative of the log likelihood is zero at the current point. My understanding is that this enables convergence in non-concave sections of the likelihood. However, my understanding is that you increase the risk of declaring victory in a local but not global maxima.
                          Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

                          When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

                          Comment


                          • #14
                            Dear @Weiwen Ng,

                            Thank you very much for the helpful tip. The variables do indeed have different scales. The values of a few of the count variables are close to zero while the distance variables have larger values. If I only use one type of variable, either count or distance, I do get standard errors.

                            However, in case I do not get the standard errors etc. as in post #12, would I still be able to use the results?

                            Thanks for any help and advice!
                            Hannah

                            Comment


                            • #15
                              Originally posted by Hannah Schmidt View Post
                              Dear @Weiwen Ng,

                              Thank you very much for the helpful tip. The variables do indeed have different scales. The values of a few of the count variables are close to zero while the distance variables have larger values. If I only use one type of variable, either count or distance, I do get standard errors.

                              However, in case I do not get the standard errors etc. as in post #12, would I still be able to use the results?

                              Thanks for any help and advice!
                              Hannah
                              If you don't get standard errors, that means the model has not converged adequately - basically, you aren't sure (because Stata is not sure) if you are at a global maximum of the likelihood function. You do not have a valid maximum likelihood estimate. I would not present the affected result. I think most people would counsel the same.

                              Honestly, I am not sure that the different scaling is necessarily a cause of your problem. However, it seems like you can easily rescale distance, e.g. just divide by 10 such that you are presenting results in tens of kilometers/miles. I'd give that a shot.

                              You are treating all variables as normal random variables. Did you know you can fit different types of models to different variables? For example, you could treat the count variable as a Poisson variable, e.g. (I am assuming your variable names and syntax). Maybe this will help also:

                              Code:
                              gsem (nrdist_pool nrdist_market nrdist_toilet <- _cons,) (cnt400_care <- _cons, poisson), lclass(C 4)
                              Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

                              When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

                              Comment

                              Working...
                              X