Latent Class Analysis with Stata 15 -gsem- Problem

Red Owl

Join Date: Nov 2016

Posts: 127
#1

Latent Class Analysis with Stata 15 -gsem- Problem

10 Jun 2017, 17:38

I'm trying out the new Latent Class Analysis feature of Stata 15's -gsem- command with Stata/IC 15, but I have been unable to get any model to converge except the very simple example in Example 50b in the Stata 15 sem.pdf manual.

The Methodology Center at Penn State provides a Latent Class Analysis Plugin for Stata (version 1.21), which I have been using for a couple of years with Stata 14 without difficulty, but I was hoping the new Stata 15 latent class feature would allow me to avoid using an additional plugin. (The plugin is only available for Windows and can be downloaded at https://methodology.psu.edu/downloads/lcastata .)

In my testing, I'm using a data set included with Penn State's LCA plugin and consisting of LcaSampleDataset.dct and LcaSampleDataset.txt. That data set is a subsample of N = 1,000 cases drawn from a larger sample (N = 13,840). Example results from the larger data set are shown at https://methodology.psu.edu/ra/lca/example , and the results of my testing using the plugin with the subset are consistent with the web site results from the larger sample.

The LCA plugin produces results in 1.9 seconds.

I have also used the poLCA package with R 3.4.0 and reproduced the results with the same data in about the same amount of elapsed time.

My attempts to reproduce these results in Stata/IC 15 (after recoding the indicator variables from 1/2 to 0/1 values) have all failed to converge and consistently have shown the "not concave" issue after about 50 iterations.

I've tried many options with -gsem-, but the following is a typical example of the code I have tried. I've reduced the integration points to 5 (from the default of 7), have added a startvalues() option with 30 draws, have set the integration method to intmethod(ghermite), have added the -difficult- option, and have run 10,000 iterations.

The Stata 15 sem.pdf manual in "intro 12" indicates that intmethod(ghermite) "is less accurate but quicker, and the calculation it makes converges more readily than either of the above methods." I used the intpoint(5) and intmethod(ghermite) options only to get an initial model to converge so I can obtain better starting values.

Code:

gsem /// (SmokedBefore13 DailySmoke DroveDrunk DrankBefore13 BingeDrink MarijuanaBefore13 CocaineEver GlueEver MethEver EcstasyEver SexBefore13 ManyPartners <-, nocapslatent), /// logit /// lclass(C 5) /// startvalues(randomid, draws(30) seed(123321)) /// intpoints(5) /// iter(10000)

This command produces an EM log likelihood that does not change (with four significant digits) after the 50th iteration. The model is also consistently reported to be not concave after that iteration, and the -difficult- option does not help. (I've tried the command with and without the -difficult- option.)

I would appreciate any advice about other options to try or option values to tweak with the new Stata 15 -gsem- latent class analysis feature. I was surprised that Stata 15 does not seem to be able to converge to results that can be obtained in about 2 seconds with both the LCA Plugin and the poLCA package with R 3.4.0. I have not been able to try this with MPlus yet, but I've not had problems with Latent Class Analysis with Mplus before, and MPlus is generally very fast.

Thanks for any help and advice.

Red Owl
(yes, my real name)
Tags: None
Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 697
#2

12 Jun 2017, 08:43

Thanks to Red Owl for bringing this to our attention, and for providing an example we can use to investigate the problem.

We are looking into the problem, and hope to have something more to report later on today.
Comment
Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 697
#3

12 Jun 2017, 22:54

A little later than I intended, but ...

gsem is having trouble fitting the model because the data require some of the latent classes to have completely determined values for some of the logit outcome variables. This situation can be a challenge to detect because the model, as specified, is not identified without some additional constraints.

It appears that Mplus and the Penn State Latent Class Analysis Plugin for Stata provide an automatic mechanism for adding these identifying constraints. For example, when Mplus encounters a logit intercept too close to -15, it sets/constrains the intercept to -15; and similarly for intercepts too close to 15.

gsem does not provide automatic identifying constraints in this case, but Red Owl can still use gsem to fit this model with this data by adding the nonrtolerance option. With this option, gsem will iterate to a solution, even if that solution is in a non-concave region of the parameter space.

For those interested in comparing the gsem fitted parameters with those from Mplus, be aware that the latent classes may not be labeled with the same values. A mapping between the two might be established by comparing the predicted latent class probabilities from Mplus with those reported by estat lcprob.
1 like
Comment

Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014
Posts: 697

12 Jun 2017, 23:01

More details:

Here is a list of my modifications to Red Owl's original model specification:

moved option nocapslatent to the global options list
changed the number of random draws to 15, to speed things up
changed the number of EM iterations to 5, to speed things up
removed option intpoint(5) since gsem does not use quadarture to fit models with latent classes
removed option iter(10000) since it is not necessary
added option nodvheader to cut down on the redundant output
added option nonrtolerance to get gsem to iterate to a solution even if it's in a non-concave region of the paramter space

Code:

gsem   (SmokedBefore13                                          ///
        DailySmoke                                              ///
        DroveDrunk                                              ///
        DrankBefore13                                           ///
        BingeDrink                                              ///
        MarijuanaBefore13                                       ///
        CocaineEver                                             ///
        GlueEver                                                ///
        MethEver                                                ///
        EcstasyEver                                             ///
        SexBefore13                                             ///
        ManyPartners <- )                                       ///
        ,                                                       ///
                nocapslatent                                    ///
                logit                                           ///
                lclass(C 5)                                     ///
                startvalues(randomid, draws(15) seed(123321))   ///
                em(iter(5))                                     ///
                nodvheader                                      ///
                nonrtolerance

The above command produces lots of output, so here is a partial replay
of the estimation results:

Code:

. gsem, nodvheader

(some output omitted)

Class          : 5

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
SmokedBef~13 |
       _cons |  -3.348531   .2816733   -11.89   0.000    -3.900601   -2.796462
-------------+----------------------------------------------------------------
DailySmoke   |
       _cons |  -3.947107   .3396679   -11.62   0.000    -4.612843    -3.28137
-------------+----------------------------------------------------------------
DroveDrunk   |
       _cons |   -4.25491   .5429224    -7.84   0.000    -5.319019   -3.190802
-------------+----------------------------------------------------------------
DrankBefo~13 |
       _cons |  -1.909907   .1230908   -15.52   0.000    -2.151161   -1.668654
-------------+----------------------------------------------------------------
BingeDrink   |
       _cons |  -3.045484   .3394023    -8.97   0.000      -3.7107   -2.380267
-------------+----------------------------------------------------------------
Marijuana~13 |
       _cons |  -4.781259   .4751201   -10.06   0.000    -5.712477    -3.85004
-------------+----------------------------------------------------------------
CocaineEver  |
       _cons |  -18.45057          .        .       .            .           .
-------------+----------------------------------------------------------------
GlueEver     |
       _cons |  -2.790898    .176965   -15.77   0.000    -3.137743   -2.444053
-------------+----------------------------------------------------------------
MethEver     |
       _cons |  -19.65086          .        .       .            .           .
-------------+----------------------------------------------------------------
EcstasyEver  |
       _cons |   -19.1691          .        .       .            .           .
-------------+----------------------------------------------------------------
SexBefore13  |
       _cons |  -4.929153   .4963824    -9.93   0.000    -5.902045   -3.956262
-------------+----------------------------------------------------------------
ManyPartners |
       _cons |  -3.469307    .499461    -6.95   0.000    -4.448233   -2.490382
------------------------------------------------------------------------------

Notice that for latent class labeled "5", the intercepts for CocaineEver, MethEver, and EcstasyEver all need to be constrained.

Here are the predicted latent class probabilities:

Code:

. estat lcprob, nose

Latent class marginal probabilities             Number of obs     =      1,000

--------------------------------------------------------------
             |     Margin
-------------+------------------------------------------------
           C |
          1  |    .065663
          2  |   .1521513
          3  |   .0508343
          4  |    .059735
          5  |   .6716164
--------------------------------------------------------------

Comment

Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014
Posts: 697

12 Jun 2017, 23:02

The thorough researcher might use these results as starting values in a refit with additional Mplus style model identifying constraints.

Note that in the following I use estimates store to store the results from both model fits for direct comparison using estimates table.

Code:

estimates store nonrtol
matrix b0 = e(b)

gsem   (SmokedBefore13                                          ///
        DailySmoke                                              ///
        DroveDrunk                                              ///
        DrankBefore13                                           ///
        BingeDrink                                              ///
        MarijuanaBefore13                                       ///
        CocaineEver                                             ///
        GlueEver                                                ///
        MethEver                                                ///
        EcstasyEver                                             ///
        SexBefore13                                             ///
        ManyPartners <- )                                       ///
        (4: SmokedBefore <- _cons@15)                           ///
        (4: CocaineEver <- _cons@-15)                           ///
        (5: CocaineEver <- _cons@-15)                           ///
        (3: GlueEver <- _cons@-15)                              ///
        (3: MethEver <- _cons@-15)                              ///
        (4: MethEver <- _cons@-15)                              ///
        (5: MethEver <- _cons@-15)                              ///
        (5: EcstasyEver <- _cons@-15)                           ///
        (3: ManyPartners <- _cons@15)                           ///
        ,                                                       ///
                nocapslatent                                    ///
                logit                                           ///
                lclass(C 5)                                     ///
                em(iter(0))                                     ///
                nodvheader                                      ///
                from(b0)
estimates store withCns

estimates table nonrtol withCns, stat(ll)

Comment

Red Owl

Join Date: Nov 2016

Posts: 127
#6

13 Jun 2017, 09:26

@JeffPitblado

Thanks so much for very thorough and informative reply. That is very, very helpful.

I now see the -nonrtolerance- option mentioned on p. 520 of the Stata 15 sem.pdf manual, but I had not known its purpose before your reply.

May I suggest that a short comment related to this issue be added to the Remarks for Example 50g in the Stata 15 sem.pdf manual? Even though this issue doesn't apply to the example data, a caution for other data sets might be helpful.

Again, thank you for the obvious amount of time you invested and your thoroughness in helping me understand this issue.

Red Owl
Comment
Red Owl

Join Date: Nov 2016

Posts: 127
#7

13 Jun 2017, 10:37

@JeffPitblado

As I mentioned in my support request email a couple of days ago but which I did not include in my forum post, I am also having a problem in obtaining goodness of fit statistics for this model.

Even after running your code, including manually setting the constraints, this is what happens:

Code:

. estat lcgof too many terms in interaction

I expect this is related to the need to specify constraints, but that's just a guess.

Any advice on how I can obtain the AIC, BIC, and G-sq for this model?

(I'm using Stata/IC 15 on Windows 10.)

Thanks for any more help you can offer.

Red Owl
Comment
Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 697
#8

13 Jun 2017, 23:16

Hi Red Owl,

Thanks for pointing this out. You found a bug in estat lcgof.

We have determined the cause and will have this fixed in a future Stata 15 update (hopefully the one that is about to happen very soon).

In the mean time, AIC and BIC can be obtained from estat ic.
Comment
Red Owl

Join Date: Nov 2016

Posts: 127
#9

14 Jun 2017, 00:19

Jeff Pitblado (StataCorp)

Thanks so much.

That's very helpful, and I appreciate all the effort you invested in this request.

Red Owl
Comment
Hannah Schmidt

Join Date: Jul 2018

Posts: 3
#10

01 Jul 2018, 21:36

Hi @JeffPitblado,

I have tried using your code from the 12 Jun 2017, 23:01 using the LcaSampleDataset from Penn State's LCA plugin (1000 observations) , but get very different results.

gsem (SmokedBefore13 /// DailySmoke /// DroveDrunk /// DrankBefore13 /// BingeDrink /// MarijuanaBefore13 /// CocaineEver /// GlueEver /// MethEver /// EcstasyEver /// SexBefore13 /// ManyPartners <- ) /// , /// nocapslatent /// logit /// lclass(C 5) /// startvalues(randomid, draws(15) seed(123321)) /// em(iter(5)) /// nodvheader /// nonrtolerance
results:

When using the Penn State plugin and the same sample dataset my results are more similar to your latent class marginal probabilities.
.

Thanks for any help and advice!

Hannah
Comment

ericmelse

Join Date: May 2014
Posts: 434

#11

01 Jul 2018, 23:50

Dear Hannah,
What you need to do is to deduct 1 from the dummy variables so that their coding changes from 1/2 into 0/1, like:

Code:

clear all
import delimited "LcaSampleDataset.txt" , delim(" ")
replace v1 = v1-1
forvalue i = 4/15 {
    replace v`i' = v`i'-1
}

Next, to get the appropriate variable names (although I do not know what the second variable name is):

Code:

tokenize "Sex Unknown Age SmokedBefore13 DailySmoke DroveDrunk DrankBefore13 BingeDrink MarijuanaBefore13 CocaineEver GlueEver MethEver EcstasyEver SexBefore13 ManyPartners ID"
forvalue i = 1/16 {
    rename v`i' `1'
    macro shift
}

Next, I ran the code of Jeff Pitblado, i.e. that in comment #4, and that produces on my system (running Stata 15.1 18 April 2018), these Latent class marginal probabilities, which look about right:

Code:

          1  |     .06568
          2  |   .1520551
          3  |   .0508999
          4  |   .0597778
          5  |   .6715872

However, note that the model results are different from what was reported by Jeff in #4:

Code:

Class           : 5
---------------------------------------------------------------------------------
                |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
SmokedBefore13  |
          _cons |   3.349003   .2818249    11.88   0.000     2.796636    3.901369
----------------+----------------------------------------------------------------
DailySmoke      |
          _cons |   3.947388   .3397552    11.62   0.000      3.28148    4.613296
----------------+----------------------------------------------------------------
DroveDrunk      |
          _cons |   4.255221   .5431407     7.83   0.000     3.190685    5.319757
----------------+----------------------------------------------------------------
DrankBefore13   |
          _cons |   1.909951   .1230945    15.52   0.000      1.66869    2.151212
----------------+----------------------------------------------------------------
BingeDrink      |
          _cons |   3.045215    .339345     8.97   0.000     2.380111    3.710319
----------------+----------------------------------------------------------------
MarijuanaBef~13 |
          _cons |   4.780941   .4750635    10.06   0.000     3.849834    5.712049
----------------+----------------------------------------------------------------
CocaineEver     |
          _cons |   17.36323   245.0288     0.07   0.944    -462.8844    497.6109
----------------+----------------------------------------------------------------
GlueEver        |
          _cons |   2.790897   .1769679    15.77   0.000     2.444047    3.137748
----------------+----------------------------------------------------------------
MethEver        |
          _cons |   16.28136   138.1936     0.12   0.906    -254.5731    287.1358
----------------+----------------------------------------------------------------
EcstasyEver     |
          _cons |   15.89278   157.6046     0.10   0.920    -293.0066    324.7921
----------------+----------------------------------------------------------------
SexBefore13     |
          _cons |   4.928787    .496215     9.93   0.000     3.956224    5.901351
----------------+----------------------------------------------------------------
ManyPartners    |
          _cons |   3.470972   .5008762     6.93   0.000     2.489273    4.452672
---------------------------------------------------------------------------------

My assumption is that in a later release of Stata some improvements were implemented that no longer require the model parameters as explained by Jeff in #5 (actually, that code fails to converge on my system).

I hope that this is helpful.

PS you do not need to snapshot code or output from Stata's result window but copy what you need from there and wrap that with the CODE indicators (available through the Reply menu button #). You best first trigger the CODE indicators and next simply paste your text in between. Using the button Preview gives comfort as you can check if all is OK!

http://publicationslist.org/eric.melse

Comment

Hannah Schmidt

Join Date: Jul 2018
Posts: 3

#12

02 Jul 2018, 04:03

Dear ericmelse

Thank you very much for the fast and informative reply!

I have one more question. I am running a LCA with continuous observed variables, e.g. nearest distance to the pool from participant's house or the count of care facilities within a 400m radius of participant's home, but all my standard errors, p-values and confidence intervals are not showing, all variables are constrained.

Jeff Pitblado had a similar problem and used the constraint -15/+15 as a solution. However he had categorical observed variables and used a logit model so the same solution would not work for me. Is there a similar solution for continuous variables?

results:

Code:

-------------------------------------------------------------------------------------------
                          |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------------------+----------------------------------------------------------------
nrdist_pool             |
                    _cons |   192.9849   .        .       .            .           .
--------------------------+----------------------------------------------------------------
cnt400_care             |
                    _cons |   3.34          .        .       .            .           .
--------------------------+----------------------------------------------------------------
nrdist_market             |
                    _cons |   115.1486          .        .       .            .           .
--------------------------+----------------------------------------------------------------
nrdist_toilet            |
                    _cons |   134.29          .        .       .            .           .

Thanks for any help and advice!
Hannah

Attached Files

Comment

Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#13

02 Jul 2018, 08:27

Originally posted by Hannah Schmidt View Post

Dear ericmelse

Thank you very much for the fast and informative reply!

I have one more question. I am running a LCA with continuous observed variables, e.g. nearest distance to the pool from participant's house or the count of care facilities within a 400m radius of participant's home, but all my standard errors, p-values and confidence intervals are not showing, all variables are constrained.

Jeff Pitblado had a similar problem and used the constraint -15/+15 as a solution. However he had categorical observed variables and used a logit model so the same solution would not work for me. Is there a similar solution for continuous variables?

results:

Code:

------------------------------------------------------------------------------------------- | Coef. Std. Err. z P>|z| [95% Conf. Interval] --------------------------+---------------------------------------------------------------- nrdist_pool | _cons | 192.9849 . . . . . --------------------------+---------------------------------------------------------------- cnt400_care | _cons | 3.34 . . . . . --------------------------+---------------------------------------------------------------- nrdist_market | _cons | 115.1486 . . . . . --------------------------+---------------------------------------------------------------- nrdist_toilet | _cons | 134.29 . . . . .

Thanks for any help and advice!
Hannah

Hannah, please don't attach screenshots, particularly not if you've already posted the output in code delimiters.

That said, it's a lot less clear what to do in this situation. An initial thought is that it's clear that you have one variable that may not be on the same scale as the others - cnt500_market. I don't know what the actual scale is, it could merely be low in the class you posted. I believe I've read elsewhere that in some cases, vastly different scales can mess up maximization processes.

Some more context would be helpful to answer:

1) What's the exact command you typed?

2) Can you post descriptive statistics for the sample? This should work:

Code:

tabstat nrstdist_care cnt500_market nrdist_market nrdist_pool, statistics(mean sd)

3) Can you post the full output from this estimation run? I am curious what the means are for the remaining classes that you identified. If it's not too much trouble, can you also run -estat lcprob- and post that output?

From my limited understanding, the issue with logit intercepts above is that as the probability of endorsing a binary item approaches zero or one in any particular class, the logit intercept goes off to + or - infinity. I believe that Stata's algorithm may have issues declaring convergence in those cases, because the likelihood is not concave*. I've certainly seen it happen when I've had several such cases of very high or very low logit intercepts in one run. In that case, you frequently will see missing standard errors associated with the coefficients in question.

In the case of continuous variables, I am less sure what a missing SE means. I am wondering if one class is very, very small, and the class-specific means are wandering off into some very high level. Or ... I'm not sure what else I'd look for, or even that I would know if I saw it, but output would help!

* This is why Jeff recommends that you can use the -nonrtolerance- option for initial runs. Stata won't check if the second derivative of the log likelihood is zero at the current point. My understanding is that this enables convergence in non-concave sections of the likelihood. However, my understanding is that you increase the risk of declaring victory in a local but not global maxima.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
1 like
Comment
Hannah Schmidt

Join Date: Jul 2018

Posts: 3
#14

06 Jul 2018, 08:39

Dear @Weiwen Ng,

Thank you very much for the helpful tip. The variables do indeed have different scales. The values of a few of the count variables are close to zero while the distance variables have larger values. If I only use one type of variable, either count or distance, I do get standard errors.

However, in case I do not get the standard errors etc. as in post #12, would I still be able to use the results?

Thanks for any help and advice!
Hannah
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#15

06 Jul 2018, 09:15

Originally posted by Hannah Schmidt View Post

Dear @Weiwen Ng,

Thank you very much for the helpful tip. The variables do indeed have different scales. The values of a few of the count variables are close to zero while the distance variables have larger values. If I only use one type of variable, either count or distance, I do get standard errors.

However, in case I do not get the standard errors etc. as in post #12, would I still be able to use the results?

Thanks for any help and advice!
Hannah

If you don't get standard errors, that means the model has not converged adequately - basically, you aren't sure (because Stata is not sure) if you are at a global maximum of the likelihood function. You do not have a valid maximum likelihood estimate. I would not present the affected result. I think most people would counsel the same.

Honestly, I am not sure that the different scaling is necessarily a cause of your problem. However, it seems like you can easily rescale distance, e.g. just divide by 10 such that you are presenting results in tens of kilometers/miles. I'd give that a shot.

You are treating all variables as normal random variables. Did you know you can fit different types of models to different variables? For example, you could treat the count variable as a Poisson variable, e.g. (I am assuming your variable names and syntax). Maybe this will help also:

Code:

gsem (nrdist_pool nrdist_market nrdist_toilet <- _cons,) (cnt400_care <- _cons, poisson), lclass(C 4)

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Comment

Announcement