
  • Use Output from LCA with Penn State Plugin

    Dear Statalisters,

    I am fairly inexperienced in Stata and am currently trying to run some latent class analyses with the plugin from Penn State University (https://www.methodology.psu.edu/research-and-rigor/). While I can recreate the example models provided by the creators in the attached do-file, I now wonder how to use the generated output (i.e. the identified latent classes) for further analysis. First and foremost, I cannot figure out how to generate a variable that assigns each observation to its respective class based on the estimated probabilities. I assume I have to store the generated predictive class probabilities in a matrix using "mat" and "svmat", but I do not know where to go from there. For example, in the proposed example 1, five classes are identified and variables are generated for the estimated posterior probabilities (these are named _post_prob1, _post_prob2, etc.). When I compute the matrix for post_prob and save it through svmat, I just get a variable with missing values.
    I use Stata 15, and before installing the Penn State plugin I also used Stata's gsem command to run LCA. Unfortunately, the models failed to converge despite my trying the various methods described in Stata's introduction to SEM (section 12). Resorting to the plugin is fine with me, because it offers more comprehensive information criteria and runs much faster. I'm afraid I might just lack the basic understanding of Stata syntax needed to handle the plugin's output correctly.

    I am sure this is not a very complex problem, and it would be much appreciated if someone could provide some general information or coding examples on how to generate variables from doLCA's output.

  • #2
    The link given in post #1 does not make it immediately obvious where the Stata plugin is to be obtained. Can you give a link to a page on which the plugin is described and from which it can be found for downloading?

    Added in edit: On further reflection, your question generically seems to be

    how to generate a variable that assigns all observations into their respective classes based on their estimated probabilities
    How would you expect an observation to be assigned to a class? To the class with the highest probability for that observation? Or something else? Is there a specific technique described by Penn State?
    Last edited by William Lisowski; 25 Nov 2021, 07:49.



    • #3
      Originally posted by Dominik Harder View Post
      ...While i can recreate the example models provided by the creators in the attached do-file, i now wonder how to use the generated output (i.e. the identified latent classes) for further analysis. First and foremost, i cannot figure out how to generate a variable that assigns all observations into their respective classes based on their estimated probabilities...
      First, a clarification. When you run a latent class model, you don't get the latent class that each observation belongs to. You get the probability that each observation is in each of the latent classes you assumed (if you are familiar with vectors, you get a vector of class membership probabilities).

      You can assume that each observation belongs to the latent class where its membership probability is highest, i.e. you can assign it to its modal latent class, a.k.a. modal class assignment. Presumably, you're trying to tabulate some characteristics by latent class membership. Now, depending on how good your indicators are (i.e. the variables you fed into the LCA), you may be more or less certain about the membership probabilities. This is quantified by the normalized entropy of the model, which is scaled from 0 to 1, where 1 is better. I would guess that above 0.8 is considered high, and 0.6 or so is low. The Penn State plugin should return this as the scalar r(EntropyRsqd).

      As a worked example, say that after a 3-class model, most observations have a membership probability vector looking something like (0.9, 0.05, 0.05), i.e. you're pretty sure which class they're in. That's high entropy. If they all tend to look something like (0.45, 0.28, 0.27), you're a lot less certain. If everyone somehow came out with a vector of (1/3, 1/3, 1/3), that would be an entropy of 0, and it would mean that the indicators tell you absolutely nothing about which latent class each person is in. I don't think the model would even converge in that case, so I'm offering this only as an extreme example.
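      As a minimal sketch of modal class assignment, assuming a 3-class model whose posterior probabilities are stored as _post_prob1 through _post_prob3 (the plugin's apparent naming convention; verify against your own output):

      ```stata
      * Modal class assignment from posterior probability variables.
      * Ties are resolved in favor of the highest-numbered class.
      egen double pmax = rowmax(_post_prob1 _post_prob2 _post_prob3)
      generate byte modal_class = .
      forvalues k = 1/3 {
          replace modal_class = `k' if float(_post_prob`k') == float(pmax)
      }
      tabulate modal_class
      ```

      The float() wrapper guards against precision mismatches when comparing the stored probabilities to the row maximum.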

      Anyway, in a high-entropy situation, modal class assignment makes an assumption that is technically wrong but not terribly wrong. People who are real technical experts on LCA might still object: it has been shown that modal class assignment will bias any relationships you estimate when you tabulate things by class membership. It has also, I believe, been shown that probabilistic or random assignment (meaning you take multiple random draws and then work multiple-imputation style to do your tabulations) is still biased. There has been work done to overcome this, which base Stata and, I believe, the PSU plugin don't implement. If you're interested in this, you could try searching for work by Jeroen Vermunt, but I find the math hard to understand.

      After that major caveat, let's get back to your question. I don't have the plugin installed. However, the documentation for version 1.2.1 seems to indicate that after your model converges, the plugin should automatically add variables to your dataset for each posterior class probability and for the modal class assignment, called Best_Index. In base Stata, this is something you'd do with the predict command post-estimation. Are those variables present? If not, which version of the plugin are you using?
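      A quick way to check whether that variable exists (the _Best_Index name is taken from the plugin's documentation; adjust if your version names it differently):

      ```stata
      * Test whether the modal-assignment variable was created by doLCA.
      capture confirm variable _Best_Index
      if _rc == 0 {
          tabulate _Best_Index
      }
      else {
          display as error "_Best_Index was not created"
      }
      ```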

      In the plugin, you could also add 20 pseudo-class draws with the plugin's seeddraws() option (you just specify a random number seed in there). If you wanted to use them, I believe you would need to use the multiple imputation commands to manually declare the variables as mi data.
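      If you went that route, the declaration might look roughly like this. This is a sketch under assumptions: I'm guessing the draws land in variables I'll call _draw1 through _draw20 (check the plugin's documentation for the real names), and the outcome variable y is purely illustrative:

      ```stata
      * Declare the pseudo-class draws as 20 imputations of one variable.
      generate byte class_draw = .    // m = 0 placeholder, all missing
      mi import wide, imputed(class_draw = _draw1-_draw20)
      * Analyze across draws, multiple-imputation style, e.g.:
      mi estimate: regress y i.class_draw
      ```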
      Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

      When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



      • #4
        Dear William, dear Weiwen,

        thank you for your replies. As you both rightly assumed, I intend to assign every observation to its modal latent class. Sorry for my unclear wording. I am aware that this comes with possible biases and errors, and I am familiar with the concept of normalized entropy.
        I checked the models again and I'm a bit embarrassed to admit that I did not see the Best_Index variable. I just assumed it was necessary to compute the modal class assignment manually, as when using the gsem command. Anyway, this should solve my problem.
        Thanks again for your help. I can also say that Weiwen's responses in other threads in this forum have helped me quite a bit in learning to use LCA with Stata.

        William, you were right about my link. In case you are still interested in the plugin, here is the right link: https://www.latentclassanalysis.com/...-stata-plugin/



        • #5
          Originally posted by Dominik Harder View Post
          ...I checked the models again and I'm a bit embarrassed to admit that I did not see the Best_Index variable. I just assumed it was necessary to compute the modal class assignment manually, as when using the gsem command. Anyway, this should solve my problem...
          Thanks for the kind words. Don't worry about missing the Best_Index variable. This is a complex method, the PSU documentation is quite long, and its options are more numerous than base Stata's. Plus, adding these variables is something that base Stata users would normally think of as a post-estimation step you have to do manually.



          • #6
            Hi,
            I am wondering whether there are any drawbacks to using the plugin from Penn State University (even though I have access to Stata 15). I need to implement an unbiased 3-step LCA distal-outcome analysis (all variables binary). It seems that the only way of doing this in Stata is through the LCA Distal BCH function, which I guess can be used after conducting the LCA with the plugin from Penn State University.
            Last edited by Maryam Ghasemi; 11 Feb 2023, 20:19.



            • #7
              Dear STATALIST User,

              I am using a secondary dataset with a sample size of around 60k, so it is essential to use survey weights.

              For the same reason, I was interested in using the LCA Stata Plugin < https://bpb-us-e1.wpmucdn.com/sites....2c-2e00dl9.pdf>. I have followed the steps below:
              1. Set up the LCA Stata Plugin.
              2. Prepared the dataset in the required format: there are no missing values and the manifest variables are coded as “1” and “2”.
              3. Ran the LCA command.
              4. Initially, _Best_Index was not being generated after running the code. The plugin was not working properly, so I had to drop it and reinstall it. After numerous tries, I was able to generate _Best_Index once. However, it is a recurrent issue that _Best_Index is not generated on my system.
              To the best of my knowledge, _Best_Index is supposed to divide the entire sample into latent classes. I have tried to troubleshoot by writing a loop myself, but that did not help either. In my case, _Best_Index is generated for only part of the sample (around 15k observations). For the rest, no assignment is made, primarily because the posterior probabilities are tied. I am stuck on how to clearly assign the entire sample to classes.

              // Class-2
              doLCA ds_1_r ds_2 ds_3 ds_4 ds_5 ds_6 ds_7 ds_8 ds_9 ds_10 ds_11 ds_12 ds_13 ds_14 ds_15 ds_16 ds_17 ds_18 ds_19 ds_20 ds_21 ds_22 ds_23 ds_24 ds_25 ds_26, ///
              nclass(2) ///
              seed(100000) ///
              seeddraws(100000) ///
              categories(2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2) ///
              weight(wt)

              return list
              matrix list r(gamma)
              matrix list r(gammaSTD)
              matrix list r(rho)

              I would be obliged if you could help me out with this issue. Many thanks.

              Regards
              Parul



              • #8
                Hi Parul,

                I ran into the exact same issue, working with a dataset of around 18,000 cases. I don't think the issue stems from ties in the posterior probabilities, but rather from a bug when the program writes the posterior probabilities to the data file. I've noticed that whenever my file has more than 2,000 observations, posterior probabilities are only added to half the file.

                My guess is that it's a compatibility issue, since I am running the plug-in with Stata 17. Maybe using an older version of Stata would work for you. Since I don't have access to one, the workaround I used is to run the program once, save the posterior probabilities, re-order the cases in descending order, then run the program again. That should generate posterior probabilities for all your cases. I then created something similar to _Best_Index by identifying the column with the highest posterior probability.

                I also got in touch with the Methodology Center at Penn State, and apparently the LCA plugin is no longer being maintained.
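                For what it's worth, that two-pass workaround could be sketched roughly as follows. The names caseid and _post_prob1-_post_prob3 are illustrative assumptions (a 3-class example), not guaranteed plugin names, and the doLCA calls themselves are omitted:

                ```stata
                * Pass 1: after fitting the model, set aside the rows that
                * actually received posterior probabilities.
                preserve
                keep if !missing(_post_prob1)
                keep caseid _post_prob*
                save probs_pass1, replace
                restore

                * Pass 2: reverse the sort order, clear the old results, refit.
                gsort -caseid
                drop _post_prob*
                * ... rerun the same doLCA command here ...

                * Combine: fill rows still missing probabilities from pass 1.
                merge 1:1 caseid using probs_pass1, update

                * Rebuild a best-class indicator from the row maximum.
                egen double pmax = rowmax(_post_prob1 _post_prob2 _post_prob3)
                generate byte best_class = .
                forvalues k = 1/3 {
                    replace best_class = `k' if float(_post_prob`k') == float(pmax)
                }
                ```

                The update option on merge only replaces missing values in the master data, so probabilities written in pass 2 are kept and the gaps are filled from pass 1.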
