
  • Replicating Masyn's Latent Profile Models Stata Example 52g.

    Hi,

    I am currently working on a latent profile model in Stata. I have read Masyn's book chapter (which has been used as a reference in Stata Example 52g - Latent profile model): Masyn, K. E. 2013. Latent class analysis and finite mixture modeling. In The Oxford Handbook of Quantitative Methods, ed. T. D. Little, vol. 2, 551–610. New York: Oxford University Press.

    Masyn states that the conditional independence assumption is not necessary for the within-class covariance structure. Unlike LCA, latent profile models do not require partial conditional independence for model identification: all indicators can covary with all other indicators within class. With this increased flexibility in the within-class model specification comes additional complexity in the model-building process. She uses 4 main within-class variance-covariance structures, and I am trying to model these in Stata; however, I have some issues with this (described below).

    Masyn compares 4 different LPA models: class-invariant, diagonal; class-varying, diagonal; class-invariant, unrestricted; and class-varying, unrestricted.
    She recommends comparing these options for the 1-class model, 2-class model, 3-class model, etc. (depending on what is relevant for your study) during class enumeration.

    I am interested in 7 observable variables, so I was going to do this for models ranging from 1 class to 6 classes. First, however, I have been trying to replicate Masyn's models using the covstructure and lcinvariant options of the gsem command (as described in the latent profile example), so that I can then do the same with my own models afterwards.

    As an example, I have provided my 1-class models below, using the example dataset Masyn used in her book. They differ only in the covstructure and lcinvariant options (based on Stata example 52g). Models 1, 2, and 4 seem to give the same numbers as reported by Masyn in her chapter; however, I can't seem to get model 3 right.

    Is there anyone who has modelled these before and wants to share their commands/provide me with feedback?

    Much appreciated,

    Simone

    * Masyn uses the stata-press data (see below).

    clear
    use http://www.stata-press.com/data/r15/gsem_lca2

    * WITH 1 class
    *1 CIDIA
    * CIDIA. Class-invariant, diagonal (conditional independence is imposed and covariances between indicators are fixed at zero within class, covariances are constrained to be equal across the latent classes).
    gsem (glucose insulin sspg <- _cons), lclass(C 1) covstructure(e._OEn, diagonal) nolog nodvheader

    *2 CVDIA
    * CVDIA. Class-varying, diagonal (conditional independence is imposed and covariances between indicators are fixed at zero within class while the variances are freely estimated and allowed to be different across the latent classes).
    gsem (glucose insulin sspg <- _cons), lclass(C 1) lcinvariant(none) covstructure(e._OEn, diagonal) nolog nodvheader

    *3 CIUNR
    * CIUNR. Class-invariant, unrestricted (all indicator variables are allowed to covary within class, covariances are constrained to be equal across the latent classes).
    gsem (glucose insulin sspg <- _cons), lclass(C 2) covstructure(e._OEn, unstructured)

    // definitely no lcinvariant

    *4 CVUNR
    * CVUNR. Least restrictive of variance-covariance structures: class-varying, unrestricted class-specific estimated means (all indicator variables are allowed to covary within class, and the variances are allowed to be different across the latent classes).
    gsem (glucose insulin sspg <- _cons), lclass(C 1) lcinvariant(none) covstructure(e._OEn, unstructured) nolog nodvheader


  • #2
    For reference, this is a link to Masyn's chapter; it's hosted on the MPlus website. Below is your code pasted in code delimiters (as recommended by the FAQ). I corrected a typo in the third structure (class-invariant, unrestricted sigma-k), where your post specified 2 classes (vs. 1 class for all other structures):

    Code:
    clear
    use http://www.stata-press.com/data/r15/gsem_lca2
    
    * WITH 1 class
    *1 CIDIA
    * CIDIA. Class-invariant, diagonal (conditional independence is imposed and covariances between indicators are fixed at zero within class, covariances are constrained to be equal across the latent classes).
    gsem (glucose insulin sspg <- _cons), lclass(C 1) covstructure(e._OEn, diagonal) nolog nodvheader
    estimates store c1
    
    *2 CVDIA
    * CVDIA. Class-varying, diagonal (conditional independence is imposed and covariances between indicators are fixed at zero within class while the variances are freely estimated and allowed to be different across the latent classes).
    gsem (glucose insulin sspg <- _cons), lclass(C 1) lcinvariant(none) covstructure(e._OEn, diagonal) nolog nodvheader
    estimates store c2
    
    *3 CIUNR
    * CIUNR. Class-invariant, unrestricted (all indicator variables are allowed to covary within class, covariances are constrained to be equal across the latent classes).
    gsem (glucose insulin sspg <- _cons), lclass(C 1) covstructure(e._OEn, unstructured)
    estimates store c3
    
    // definitely no lcinvariant
    
    *4 CVUNR
    * CVUNR. Least restrictive of variance-covariance structures: class-varying, unrestricted class-specific estimated means (all indicator variables are allowed to covary within class, and the variances are allowed to be different across the latent classes).
    gsem (glucose insulin sspg <- _cons), lclass(C 1) lcinvariant(none) covstructure(e._OEn, unstructured) nolog nodvheader
    estimates store c4
    
    estimates stats c?
    
    -----------------------------------------------------------------------------
           Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
    -------------+---------------------------------------------------------------
              c1 |        145         .  -1820.678       6    3653.357   3671.217
              c2 |        145         .  -1820.678       6    3653.357   3671.217
              c3 |        145         .  -1730.403       9    3478.806   3505.596
              c4 |        145         .  -1730.403       9    3478.806   3505.596
    -----------------------------------------------------------------------------
    I've shown the model statistics, which include the number of parameters, the log likelihood, and the BIC. The results that are comparable all correspond to what Masyn showed in Table 25.12 (df is the number of parameters estimated, and note that she doesn't report AIC itself). These don't show the estimated class parameters, but Masyn didn't show those in any table I can find in the chapter. Anyway, for a 1-class model, you should expect the 3rd model structure to produce a solution identical to the 4th. The only difference between them is class-varying versus class-invariant parameters (here, that affects the variances of the indicators). They both have an unrestricted sigma-k, which means the error covariances are freely estimated, whereas structures 1 and 2 constrain them to 0 and estimate only the variances of the indicators. With only 1 latent class, they will, again, produce identical results.
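
    If you do want to look at the estimated class parameters after any of these fits, something along these lines should work (a sketch only; I'm assuming the standard gsem latent class postestimation commands here):

    Code:
    * Restore one of the stored fits and inspect its class-specific parameters.
    estimates restore c3
    estat lcmean    // class-specific marginal means of glucose, insulin, and sspg
    estat lcprob    // marginal predicted latent class probabilities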

    I am running Stata 15.1. I know Stata has made some updates that affect latent class estimation since 15.0 was first released, so I'd definitely urge you to update to 15.1 if the above didn't answer your question. Alternatively, you'll need to say exactly what results differ from Masyn's in your structure 3 (preferably present the Stata results in code delimiters; as you can see above, this makes them clearly readable on the forum).
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Hi Weiwen Ng,

      Thanks for your extensive response! Also, thank you for correcting the typo in the third structure; that was indeed incorrect, but only because I copied the code over wrongly, not because of the issue I am experiencing. I have explained that issue further below.

      Firstly, I have rerun the same models as before (1 class-invariant, diagonal; 2 class-varying, diagonal; 3 class-invariant, unrestricted; 4 class-varying, unrestricted), but now as both 1-class and 2-class models. As you suggested, I have also added the model statistics. See below:

      Code:
      * Test for the actual numbers of Masyn
      
      clear
      use http://www.stata-press.com/data/r15/gsem_lca2
      gsem (glucose insulin sspg <- _cons), lclass(C 1)  nolog nodvheader 
      
      * WITH 1 class
      *1 CIDIA
      * CIDIA.    Class-invariant, diagonal (conditional independence is imposed and covariances between indicators are fixed at zero within class, covariances are constrained to be equal across the latent classes).
      gsem (glucose insulin sspg <- _cons), lclass(C 1) covstructure(e._OEn, diagonal) nolog nodvheader 
      estimates store c1
      
      *2 CVDIA
      * CVDIA.    Class-varying, diagonal (conditional independence is imposed and covariances between indicators are fixed at zero within class while the variances are freely estimated and allowed to be different across the latent classes).
      gsem (glucose insulin sspg <- _cons), lclass(C 1) lcinvariant(none)  covstructure(e._OEn, diagonal) nolog nodvheader 
      estimates store c2
      
      *3 CIUNR
      * CIUNR.    Class-invariant, unrestricted (all indicator variables are allowed to covary within class, covariances are constrained to be equal across the latent classes).
      gsem (glucose insulin sspg <- _cons), lclass(C 1) covstructure(e._OEn, unstructured) nolog nodvheader
      estimates store c3
      
      *4 CVUNR
      * CVUNR.    Least restrictive of variance-covariance structures: class-varying, unrestricted class-specific estimated means (all indicator variables are allowed to covary within class, and the variances are allowed to be different across the latent classes).
      gsem (glucose insulin sspg <- _cons), lclass(C 1) lcinvariant(none) covstructure(e._OEn, unstructured) nolog nodvheader 
      estimates store c4
      
      *============================================================*
      
      * WITH 2 classes
      *1 CIDIA
      * CIDIA.    Class-invariant, diagonal (conditional independence is imposed and covariances between indicators are fixed at zero within class, covariances are constrained to be equal across the latent classes).
      gsem (glucose insulin sspg <- _cons), lclass(C 2) covstructure(e._OEn, diagonal) nolog nodvheader 
      estimates store c5
      
      *2 CVDIA
      * CVDIA.    Class-varying, diagonal (conditional independence is imposed and covariances between indicators are fixed at zero within class while the variances are freely estimated and allowed to be different across the latent classes).
      gsem (glucose insulin sspg <- _cons), lclass(C 2) lcinvariant(none)  covstructure(e._OEn, diagonal) nolog nodvheader 
      estimates store c6
      
      *3 CIUNR
      * CIUNR.    Class-invariant, unrestricted (all indicator variables are allowed to covary within class, covariances are constrained to be equal across the latent classes).
      gsem (glucose insulin sspg <- _cons), lclass(C 2) covstructure(e._OEn, unstructured) nolog nodvheader     // WRONG
      estimates store c7
      
      *4 CVUNR
      * CVUNR.    Least restrictive of variance-covariance structures: class-varying, unrestricted class-specific estimated means (all indicator variables are allowed to covary within class, ///
      * and the variances are allowed to be different across the latent classes).
      gsem (glucose insulin sspg <- _cons), lclass(C 2) lcinvariant(none) covstructure(e._OEn, unstructured) nolog nodvheader 
      estimates store c8
      
      estimates stats c?
      
      *That gives me the following results:
      
      Akaike's information criterion and Bayesian information criterion
      
      -----------------------------------------------------------------------------
             Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
      -------------+---------------------------------------------------------------
                c1 |        145         .  -1820.678       6    3653.357   3671.217
                c2 |        145         .  -1820.678       6    3653.357   3671.217
                c3 |        145         .  -1730.403       9    3478.806   3505.596
                c4 |        145         .  -1730.403       9    3478.806   3505.596
                c5 |        145         .  -1702.554      10    3425.108   3454.876
                c6 |        145         .  -1641.952      13    3309.905   3348.602
                c7 |        145         .  -1707.443      13    3440.885   3479.583
                c8 |        145         .  -1590.567      19    3219.134   3275.692
      -----------------------------------------------------------------------------
                     Note: N=Obs used in calculating BIC; see [R] BIC note.
      You are correct: the statistics obtained with my code above appear to be correct for the one-class models. All LL and BIC numbers are identical to Masyn's (and yes, models 1 and 2 are identical, as are models 3 and 4). However, this is not the case for all of the 2-class models.

      To be precise, models 1 (class-invariant, diagonal), 2 (class-varying, diagonal) and 4 (class-varying, unrestricted) are the same as Masyn's. However, this is not the case for the 3rd model (class-invariant, unrestricted).

      Just to be clear, I have compared the LLs obtained from my commands with the LLs reported by Masyn. Masyn reports the following log likelihoods for the 2-class models: 1. -1702.55 (same as my result); 2. -1641.95 (same as my result); 3. -1666.63 (DIFFERENT); 4. -1590.57 (same as my result).
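
      (For reference, the sketch below is just how I lined the stored 2-class fits up side by side; it only uses results that gsem saves in e(), and the display format is arbitrary.)

      Code:
      * Show the stored 2-class fits next to each other; N and ll come from e().
      estimates table c5 c6 c7 c8, stats(N ll) b(%9.3f)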

      This makes me think that I have to change some of the settings in covstructure and/or lcinvariant in order to get exactly the same results as Masyn; however, I can't figure out what the correct settings are.

      Interestingly, when I do the same thing for the 3-class models, the numbers all match again. Then, with the 4-class models, I have the same issue again, but only for the third model (class-invariant, unrestricted).

      I hope this makes the description of my issue a little clearer. Thank you so much for your help!

      Looking forward to hearing from you.

      Kind regards,
      Simone



      • #4
        Simone, I've verified that just for the 2-class, class-invariant, unrestricted sigma-k model (i.e. the variances and covariances of the indicators are freely estimated but constrained to be identical across classes), the final LL differs from Masyn's (whereas the LLs for the 1-class and 3-class versions of that model structure are identical to hers). This is puzzling. I reviewed the documentation for the lcinvariant option, and I can see nothing obvious to change. In particular, the default constrains the error covariances and the scaling parameters to be equal across classes. You definitely want the error covariances constrained to be equal in this structure, as far as I can tell from my reading of Masyn.

        If you changed the lcinvariant option to constrain the constants (the cons option) across classes, you would get classes whose indicator means are all identical, which is definitely not what you want. I am not sure what the scaling or coefficient options do, but either one seems to produce results identical to a solution with unconstrained latent class parameters (i.e. lcinvariant(none)).

        At this point, I am not sure what to do. You should feel free to send this thread to Stata's technical support. However, I am not sure this difference is a material one. We are seeing discordant results for only one model structure. We don't have Masyn's parameter estimates for her other model structures, so we can't compare those. I don't have access to MPlus, so I can't really compare results that way either.

        One thought is that Stata very likely uses different likelihood maximization criteria from MPlus. Those differing criteria could contribute to a different final log likelihood (which implies different parameter estimates). I'm not clear why this would be a problem in a simpler model (i.e. 2 latent classes) but not in a more complex one (i.e. 3 latent classes). I would expect divergence to be more likely if you tried to fit a more complex model (e.g. 6 or 7 classes) in Stata and MPlus. I have not tried 4 latent classes.
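
        One thing you could try before contacting them, purely as a sketch (the number of draws and the seed below are arbitrary, and I'm assuming your earlier c7 fit is still stored), is to refit that one structure from several random starting-value draws and see whether the final log likelihood improves; if it does, the default start values were landing on a local maximum:

        Code:
        * Refit the 2-class, class-invariant, unrestricted model from several
        * random starting-value draws (the draw count and seed are arbitrary).
        gsem (glucose insulin sspg <- _cons), lclass(C 2) ///
            startvalues(randompr, draws(10) seed(1)) ///
            covstructure(e._OEn, unstructured) nolog nodvheader
        estimates store c7alt
        estimates stats c7 c7alt    // compare against the earlier default-start fit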
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Weiwen,

          Thank you so much again for your extensive reply. I have now emailed [email protected]. Hopefully they can help me out!
          If I hear anything from them I will make sure to update you also.

          Kind regards,
          Simone



          • #6
            Weiwen,

            Just following up on our discussion above. This is what I received from the Stata Technical Support:

            Dear Simone,

            It seems that the model you specified in c7 might not converge to a global maximum when using our default starting values. This is a known issue with fitting latent class models and the reason we provide so many starting-value options. You could instead use the following starting-value specification to obtain an LL and BIC matching those reported in Masyn's book.

            **
            gsem (glucose insulin sspg <- _cons), lclass(C 2) ///
            startvalues(randompr, draws(5) seed(15)) ///
            covstructure(e._OEn, unstructured) nolog nodvheader

            estimates store c7

            estimates stats c7
            **

            I hope this helps. Thank you.



            • #7
              Originally posted by Simone Verswijveren
              Weiwen,

              Just following up on our discussion above. This is what I received from the Stata Technical Support:

              Dear Simone,

              It seems that the model you specified in c7 might not converge to a global maximum when using our default starting values. This is a known issue with fitting latent class models and the reason we provide so many starting-value options. You could instead use the following starting-value specification to obtain an LL and BIC matching those reported in Masyn's book.

              **
              gsem (glucose insulin sspg <- _cons), lclass(C 2) ///
              startvalues(randompr, draws(5) seed(15)) ///
              covstructure(e._OEn, unstructured) nolog nodvheader

              estimates store c7

              estimates stats c7
              **

              I hope this helps. Thank you.
              Simone, this makes sense. In my work, I had used the random draws options when fitting 4 or more latent classes. It seems like one should start using that option much sooner. The syntax you quoted will ask Stata to randomly assign each observation to a particular latent class. I believe that Stata calculates initial probabilities from the assigned classes, then it will run its expectation maximization algorithm (for the default 20 iterations). It will repeat that 5 times (that's what -draws(5)- does), then it will take the highest log-likelihood, and run its usual quasi-Newton method from there until that converges. -seed(15)- applies a random number generator seed of 15 to the randomly-assigned probabilities. If you specify -seed(15)- in the future, you can replicate your results.

              As you increase the number of latent classes, you will need to increase the number of draws. As you hit 5 classes, I'd recommend 50 or more draws. The EM algorithm is relatively slow. You can ask Stata to use fewer EM iterations, for example using the syntax

              Code:
              gsem (glucose insulin sspg <- _cons), lclass(C 5) ///
              startvalues(randompr, draws(50) seed(15)) emopts(iterate(10)) ///
              covstructure(e._OEn, unstructured) nolog nodvheader
              By the time you get to a high number of latent classes, you may be better off executing a big block of code and letting the model run overnight.
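
              For example, something along these lines could fit and store one structure across a range of class counts in one run (just a sketch; the structure shown, the number of draws, the seed, the EM iteration limit, and the stored names are all arbitrary choices):

              Code:
              * Fit the class-varying, unrestricted structure for 2 through 6 classes
              * and store each result (draws, seed, and EM iterations are arbitrary).
              forvalues k = 2/6 {
                  gsem (glucose insulin sspg <- _cons), lclass(C `k') ///
                      lcinvariant(none) covstructure(e._OEn, unstructured) ///
                      startvalues(randompr, draws(50) seed(15)) emopts(iterate(10)) ///
                      nolog nodvheader
                  estimates store cvunr`k'
              }
              estimates stats cvunr*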
              Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

              When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

