
  • GSEM identify latent classes

    I am trying to learn latent class models. I don't know much about them yet, and I'm trying to figure out how to use gsem in Stata. In particular, I can estimate the first-stage analyses.

    I understand these classes are based on latent (unobservable) variables. Let's say I am estimating the following simple model:

    Y = b1x1 + b2x2 + e

    Let's say that I want to estimate b1 and b2 for 10 different classes. So that means that I will get 10 b1 coefficients and 10 b2 coefficients (one for each class). Is there a way to identify the observations in each class? Is that question silly/unanswerable? I understand the classes are unobservable, but I am wondering if I could somehow use GSEM to identify the classes.

    Ultimately, what I want to do is the following. Let's say I have 10 coefficients for b1. Let's say that there are 100 observations with each b1. I want to identify which observations in the sample have which b1 coefficients. Then, I want to estimate regressions of these b1 coefficients against other variables to see if I can find in which way other firm characteristics are related to the differences in the 10 b1 coefficients. Does that make sense?

    Thanks!

  • #2
    Let's say I am estimating the following simple model:

    Y = b1x1 + b2x2 + e
    Actually, this isn't quite right. That is the equation for a linear regression. In latent class analysis, let's assume that x refers to the indicators of the latent class, so X is the vector of all your indicators. You really are just estimating E(X | latent class = k) for k = 2 or more classes. That's it. No error term, and no betas in the sense used in the linear regression context. For each class, you get the mean of x1, x2, etc.
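    A minimal sketch of such a model in Stata (x1, x2, x3 stand in for your continuous indicators, and two classes is just an example; adjust to your data):

    Code:
    gsem (x1 x2 x3 <- _cons), lclass(C 2)
    estat lcmean
    estat lcprob

    Here estat lcmean reports the class-specific means of the indicators, and estat lcprob reports the estimated class proportions.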

    Is there a way to identify the observations in each class? Is that question silly/unanswerable? I understand the classes are unobservable, but I am wondering if I could somehow use GSEM to identify the classes.
    This is not a silly question. However, the answer is complicated.

    In one sense, you don't have each observation's latent class. You have a vector of probabilities that it belongs to each latent class. Latent class membership is probabilistic: you don't know for sure which class each observation belongs to. If you fit an LCA and then you type

    Code:
    predict pr*, classposteriorpr
    and you go examine your data, you'll see k variables, one for each latent class.

    In another sense, you can assume that each observation belongs to its modal class, i.e. the latent class with the highest probability of membership. You should always remember that this is an approximation. The code to do modal class assignment is given in SEM example 50 or 51. One issue is that if you are relatively uncertain about which classes people belong to, this may be too imprecise an approximation of reality to be very useful. You can use entropy as a one-number summary of classification uncertainty. In one forum comment, Bengt Muthén (one of the Mplus principals) said that he considered entropy > 0.8 to indicate fairly certain classification; I don't think there's any formal guidance here. Stata doesn't provide entropy directly, but you can calculate it; just search for latent class entropy. Some people, including myself, have written code.
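    To make that concrete, here is a sketch of modal class assignment and normalized entropy for a three-class model, assuming posterior probabilities pr1, pr2, pr3 are already in your data (the variable names are placeholders, and this is only an approximation of the code in the SEM examples):

    Code:
    * modal class = class with the highest posterior probability
    egen maxpr = rowmax(pr1 pr2 pr3)
    generate modalclass = .
    forvalues k = 1/3 {
        replace modalclass = `k' if pr`k' == maxpr
    }
    * normalized entropy (closer to 1 = clearer classification)
    generate double eterm = 0
    forvalues k = 1/3 {
        replace eterm = eterm - pr`k'*ln(pr`k') if pr`k' > 0
    }
    quietly summarize eterm
    display "entropy = " 1 - r(sum)/(r(N)*ln(3))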

    Ultimately, what I want to do is the following. Let's say I have 10 coefficients for b1. Let's say that there are 100 observations with each b1. I want to identify which observations in the sample have which b1 coefficients. Then, I want to estimate regressions of these b1 coefficients against other variables to see if I can find in which way other firm characteristics are related to the differences in the 10 b1 coefficients. Does that make sense?
    This doesn't quite make sense.

    Do you want to estimate the regression in my first quote, but you think that there are unobserved groups with heterogeneous responses to the independent variables? That is, say there are k = 2 groups with different sets of b1s, b2s, etc. That is finite mixture modeling, which is implemented in Stata.
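    For example, a two-group finite mixture of the linear regression above can be fit with the fmm prefix (available in Stata 15 and later; y, x1, and x2 are placeholders):

    Code:
    fmm 2: regress y x1 x2
    estat lcprob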

    Another possible interpretation is that you want to see if some external variables are related to membership in each latent class. If that's the case, then unfortunately the answer is also complicated! You can conduct a latent class regression: your latent class is an unordered categorical variable, and we use multinomial logit regression for that type of variable. The mlogit command won't work with a vector of fractional membership probabilities, but you can do the equivalent of fitting a multinomial regression to the latent class variable within gsem. However, those models can be tricky to fit, especially with multiple latent classes.
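    In gsem, the multinomial model for class membership is specified as a separate equation for the latent class variable. A sketch, with z1 and z2 standing in for your external covariates:

    Code:
    gsem (x1 x2 x3 <- _cons) (C <- z1 z2), lclass(C 2)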

    If you Google, you will find a good deal of literature on "three-step" latent class analysis, or latent class with distal outcomes. There are some downsides to latent class regression. The theoretically correct three-step procedures have you fit a latent class model in one step, then basically tabulate whatever variables you're interested in by latent class membership while also correcting for classification uncertainty. I can't follow the algebra, and thus I haven't found a way to implement this in Stata. If you don't understand what I am talking about, that's OK; it is quite a complex topic.
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Weiwen explained things very clearly.

      I take up from where the previous post ends: the effect of class membership on a dependent variable Y.

      One derives the latent classes, say there are 3 of them, then takes the modal class, and finally the modal class is used in a regression in which Y is the dependent variable.

      The variable containing the modal class is measured with error (there is a (1 - modal probability) chance that the observation belongs to another class). This is a non-classical measurement error, which is difficult to handle. The measurement error gets smaller as the modal posterior probability gets larger.

      To avoid the measurement error problem, the three class membership probabilities could be included in the regression at the same time. This is a problem because they sum to 1; in fact, they are compositional data. Leaving out the probability of belonging to a given (reference) class (as one would do with a dummy) is not an option, because the coefficients on the retained variables are hard to interpret.

      If you have three classes, the 3 probabilities lie on a simplex (a triangle). To use these variables in a regression, they need to undergo an isometric log-ratio transformation (to map them from the simplex to Cartesian coordinates).
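      For three classes, the isometric log-ratio coordinates used by Hron et al. can be computed directly. A sketch, assuming pr1, pr2, pr3 hold the (strictly positive) membership probabilities and y is the outcome:

      Code:
      generate double z1 = sqrt(1/2)*ln(pr1/pr2)
      generate double z2 = sqrt(2/3)*ln(sqrt(pr1*pr2)/pr3)
      regress y z1 z2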

      I have found a procedure to transform and include this type of variable within a regression framework:

      Hron, K., et al. (2012). "Linear regression with compositional explanatory variables." Journal of Applied Statistics 39(5): 1115-1128.

      Thoughts and suggestions on the soundness of the approach are very much appreciated.

      Comment


      • #4
        Originally posted by Giovanni Russo View Post
        Giovanni,

        I may be wrong about this. However, I don't think that this approach will correct for the bias due to measurement uncertainty. You're just including the class proportions as an additional explanatory variable in the regression. You would want to do something closer to weighting the observations. Consider the article below (gated, unfortunately):

        Vermunt, Jeroen K. (2010). Latent Class Modeling with Covariates: Two Improved Three-Step Approaches. Political Analysis 18(4): 450-469.

        On pg 451, Vermunt describes what I think I can call a naive 3-step approach (in contrast to his recommended 3-step approach, or to the 1-step approach, which I believe is latent class regression). Paraphrasing, he says the steps are: build a standard LCA, then assign each observation to a class based on its posterior class membership probabilities. Finally, you'd conduct whatever subsequent analysis you're interested in, be it tabulation or a multinomial regression. About step 3, he says:

        ...Possible classification methods are modal, random, and proportional assignment...
        Here, modal is what I've described (and is the most wrong); random is something like multiple imputation (or plausible values in IRT: you do multiple random draws in which you assign a class based on the probability vector of class memberships, then combine results using Rubin's rules); and then there is proportional assignment. I believe that in this last one, you'd take the class membership probabilities as weights (probably iweights in Stata), then run your analysis once for each class using its membership probability as the weight.
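        As a sketch of what I mean by that last option (this is my reading, not code from Vermunt's paper; y, x1, and pr1-pr3 are placeholders for your outcome, covariate, and posterior probabilities):

        Code:
        forvalues k = 1/3 {
            regress y x1 [iweight=pr`k']
        }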

        And then, immediately after that, he says:

        Bolck, Croon, and Hagenaars (2004) demonstrated that irrespective of whether one uses modal, random, or proportional assignment, three-step approaches underestimate the relationships between covariates and class membership. More specifically, they showed that the larger the amount of classification error introduced in the second step, the larger the downward bias in the parameter estimates. Based on the same derivations, Bolck, Croon, and Hagenaars (2004) and Croon (2002) developed a method for correcting the three-step approach, which I will call the BCH method. Similar approaches were proposed by Croon (2002), Lu and Thomas (2008), and Skrondal and Laake (2001) for continuous latent variables.
        Hence, it seems to me that even using a more advanced method like multiple imputation/probabilistic assignment or weighting is considered to produce biased results (at least by him). I don't think that (a) including the membership probabilities as regressors is correct, and (b) even if it is, I don't see how it's superior to weighting, so I'd assume it's at least as wrong as the naive weighting approaches I described.

        Vermunt then goes on to argue that you can estimate the amount of classification error in your LCA model based on the posterior membership probabilities you have. Those are denoted by W in his paper. After the second half of pg 454, I ceased to be able to follow him. If anyone else can do better, have at it.

        The Bolck et al citation is: Bolck, Annabel, Marcel A. Croon, and Jacques A. Hagenaars. 2004. Estimating latent structure models with categorical variables: One-step versus three-step estimators. Political Analysis 12:3–27.



        • #5
          Hi Weiwen, thank you for your suggestions. I admit I am out of my depth here, so it is entirely possible that you are right in your considerations.

          I too went through the article you mention (plus some others). I went so far as to calculate the weights as they suggest, but then I did not know what to do with them: are they pweights, aweights, iweights, ...?

          The need for weights arises because Vermunt has to solve a measurement error problem. In my understanding, in all the approaches described, one observation is matched to just one class (done in different ways, but the output is always the same: one observation is assigned to one class only). This generates the measurement error.

          In the approach I follow, each observation is matched to the full set of probabilities. In a sense, there is no measurement error (except that arising from the fact that the probabilities are the outcome of a first-step estimation). So maybe there is no need to correct for measurement error.

          On the other hand, there is an estimation problem: the set of probabilities is perfectly collinear with the constant. This is solved using the routine in the article I cited in my post.

          It all hinges on the origin of the problem. If the problem originates in the matching of observations to just one class, then maybe the option of including the full set of probabilities would circumvent the need for weights. I might be wrong, though.

          Thank you for your thoughts (actually, everybody is invited to chip in).
