
  • Distal LCA: latent variable not found error

    I recently did a latent class analysis with distal regression on a set of data, call it A. The process was interesting and the results useful, so I decided to take the same approach with some new, unrelated data, B. Being new to this, I modified the code I'd used for A for my analysis of B. I was able to find a reasonably well-fitting latent class model, and so attempted distal regression using this code:

    Code:
     gsem (var1 var2 var3 var4 <- , ologit) ///
          (var5 var6 var7 var8 <- , mlogit) ///
          (C -> var9, mlogit)               ///
          [pweight=weight], iterate(100)    ///
          from(final) difficult lclass(c 4) lcinvariant(none)
    This produced an r(111) error "variable C not found; Perhaps you meant 'C' to specify a latent variable. For 'C' to be a valid latent variable specification, 'C' must appear in the latent() option."

    The reason I'm puzzled, rather than just figuring out how to specify the latent() option, is that this is almost exactly how I coded the analysis of A, and it worked perfectly.

    Differences in the code for the distal regression models between A and B are:
    • The original analysis of A used Stata 15.1; I'm now on 16.1. The original code runs fine on A using 16.1.
    • A contains a manifest linear variable, and so has a line (varX <- , reg).
    • I had to constrain two parameters in A using the code
      Code:
      (3:var1 <- _cons@-15) ///
      (3:var2 <- _cons@-15) ///
      while no such constraints were needed for B.
    • The solution for A was three classes instead of four, so the code for A has lclass(c 3).
    So, I need to figure out why I get these two different results. My immediate goal is to get the distal regression to work for B. But it would also be good to know if A shouldn't have worked and there's a problem I need to go back and fix.

  • #2
    Code:
    gsem (var1 var2 var3 var4 <- , ologit) (var5 var6 var7 var8 <- , mlogit) (C -> var9, mlogit) [pweight=weight], iterate(100) from(final) lclass(c 4)
    I think your first model should not have worked or it should have produced unexpected results. You specified the latent class option as lclass(c 4). I think that would have told Stata that there is a latent variable with the name c, in lower case.

    In the latent class regression bit of the model, you are saying that there is some variable C, that C is a predictor of var9, and that the family is multinomial logit. That would be fine, but Stata assumes that variables whose names start in upper case are latent variables, and you did nothing else in the gsem statement to tell Stata how C came about. Remember that Stata is case sensitive, so it interpreted C and c as two separate latent variables. Hence, you need to name the latent class C, in upper case.
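    To make that concrete, here is your command with only the capitalization of the latent class changed (untested; everything else, including your variable names, left as you had it):

    Code:
    gsem (var1 var2 var3 var4 <- , ologit) ///
         (var5 var6 var7 var8 <- , mlogit) ///
         (C -> var9, mlogit)               ///
         [pweight=weight], iterate(100)    ///
         from(final) difficult lclass(C 4) lcinvariant(none)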

    As an aside, the difficult option does not do anything in models with categorical latent variables, i.e. latent class/profile or finite mixture models. I recall one of Stata's statisticians mentioning this on the forum. Also, because you have no Gaussian indicators, the lcinvariant option also does nothing. It may be a bit counterintuitive, but if you have Gaussian indicators, Stata's default is to constrain the variance of each indicator's error terms in the model to be equal across classes. There's no equivalent with categorical indicators.
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Thank you for your reply, Weiwen. I've removed the difficult option, and I recognise that lcinvariant shouldn't have been in the code for B.

      Making the class indicator C uppercase, both in the line that asks whether C is a predictor of another variable and in the lclass option, creates a new problem: the path is not allowed as specified. Reversing the path doesn't quite tell me what I want to know -- I want to be able to run margins on var9 to say that 56% of class 1 are i1.var9, 33% are i2.var9, etc.

      I have done a workaround: running predict, classposteriorpr, assigning observations to the class with the highest probability, and then simply tabulating class against the variable of interest. But this isn't ideal, as I have between 5% and 15% of probabilities for each class that I would describe as middling, meaning I don't have a lot of confidence in class assignment this way.
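      In code, the workaround was roughly this (a sketch from memory, assuming the four-class model; not my exact variable names):

      Code:
      predict pr*, classposteriorpr
      egen modalpr = rowmax(pr?)
      gen modalclass = .
      forvalues k = 1/4 {
          replace modalclass = `k' if pr`k' == modalpr
      }
      tab modalclass var9, row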

      Just to add, since the analysis in A shouldn't have worked, I also compared the results I had already obtained to results using the workaround. Everything was really close, within 1 percentage point. So I am at least not worried that my original results using data A are the result of something going very wrong.
      Last edited by Josephine George; 22 Mar 2021, 05:19.



      • #4
        Originally posted by Josephine George
        ...

        Making the class indicator C uppercase, both in the line that asks whether C is a predictor of another variable and in the lclass option, creates a new problem: the path is not allowed as specified. Reversing the path doesn't quite tell me what I want to know -- I want to be able to run margins on var9 to say that 56% of class 1 are i1.var9, 33% are i2.var9, etc.

        I have done a workaround: running predict, classposteriorpr, assigning observations to the class with the highest probability, and then simply tabulating class against the variable of interest. But this isn't ideal, as I have between 5% and 15% of probabilities for each class that I would describe as middling, meaning I don't have a lot of confidence in class assignment this way.
        ...
        From your initial post, I thought you were referring to latent class regression. Basically, you want to treat the latent class as the dependent variable in a regression, to see what predicts membership in that latent class (e.g. var9). That is, P(C = c | X), where X is a matrix of covariates and C is the latent class. That's what latent class regression gets you. If you wanted that, it turns out that your arrow is pointing in the wrong direction (and also, you don't need to specify the multinomial family in that case; Stata understands that C is a categorical latent variable and applies the multinomial family automatically). Also, you would have wanted to type
        Code:
        (C <- i.var9)
        The heuristic I use is that your latent class, C, causes responses to your indicators, so the arrowhead points toward the indicators; var9 causes the latent class, so the arrowhead points to C.

        What you described above is more like asking for E(X | C = c). I think I've heard people call this latent class analysis with distal outcomes. The problem is that we don't directly get this from latent class regression. Now, I'm not familiar with this topic, but I don't think Stata directly does this.

        Example of a latent class regression using Stata's stock dataset:

        Code:
        use https://www.stata-press.com/data/r16/gsem_lca2
        gsem (glucose insulin sspg <- _cons) (C <- relwgt, mlogit), lclass(C 3) lcinvariant(none) covstructure(e._OEn, unstructured)
        *Most output omitted; the multinomial regression table below is basically 'proof of life', i.e. we did actually fit a multinomial regression model
        Generalized structural equation model           Number of obs     =        145
        Log likelihood = -1519.7738
        
        ------------------------------------------------------------------------------
                     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        1.C          |  (base outcome)
        -------------+----------------------------------------------------------------
        2.C          |
              relwgt |   14.03413   2.819101     4.98   0.000     8.508794    19.55947
               _cons |  -14.50264   2.864154    -5.06   0.000    -20.11628   -8.889005
        -------------+----------------------------------------------------------------
        3.C          |
              relwgt |   5.186345   2.045551     2.54   0.011     1.177138    9.195552
               _cons |  -5.329615   1.930139    -2.76   0.006    -9.112617   -1.546613
        ------------------------------------------------------------------------------
        I've previously criticized using modal class assignment to do things because it doesn't consider the uncertainty in the value of C. A number of solutions, some supremely complex (see a lot of Jeroen Vermunt's work), have been proposed. To my knowledge, Stata doesn't implement them. I have a possible improvement on what you did, however: calculate the vector of class membership probabilities, then calculate a probability weight from each of them. A probability weight is, in other contexts, the inverse of the probability of selection. I haven't seen any academic work on this method, but I have to think it would be an improvement on just tabulating by modal class (NB: if your model entropy is high, e.g. over 0.8, you could just ignore this and tabulate by modal class, since you should get pretty similar results). A worked example is below. Note that you can't compare the SD from the summarize command to the SE from the mean command; the SE is the standard error of the mean.

        Code:
        predict pr*, classposteriorpr
        foreach v of varlist pr? {
            gen ipw_`v' = 1 / `v'
        }
        *Generate modal class
        egen modalpr = rowmax(pr?)
        gen modalclass = .
        forvalues i = 1/3 {
            replace modalclass = `i' if pr`i' == modalpr
        }
        
        *Compare mean of relative weight for latent class 1:
        sum relwgt if modalclass ==1
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
              relwgt |         76    .9367105    .1264108        .71        1.2
        
        mean relwgt [pw = ipw_pr1]
        
        Mean estimation                   Number of obs   =        126
        
        --------------------------------------------------------------
                     |       Mean   Std. Err.     [95% Conf. Interval]
        -------------+------------------------------------------------
              relwgt |   .9490946   .0468285      .8564153    1.041774
        --------------------------------------------------------------
        Now, how do you produce summary statistics under inverse probability weighting? I'm not 100% sure; sum doesn't take probability weights. I believe that importance weights would produce the desired mean. The thing is, I don't think the minimum and maximum have the correct meaning here - we don't know for certain who's in class 1, and yet this summarize command reproduces the minimum and maximum of the full data, because it uses all observations: every observation has a positive probability of being in class 1, even if that probability is minuscule. Also, the SD differs from the one in the original summarize command.

        Code:
        sum relwgt [iw = ipw_pr1]
        
            Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
        -------------+-----------------------------------------------------------------
              relwgt |     126  5.0276e+37    .9490946   .0908034        .71        1.2
        Hence, I'm pretty sure that the minimum and maximum are ontologically invalid in this case (sorry for the big Greek word, but I think it applies here). I would also question whether the SD has a real meaning. That said, we can see that the mean is equivalent to the one from the mean command.

        Now, that said, Lanza and Rhoades (they work at Penn State and they are authors of the PSU LCA plugin for Stata) do argue that if you have binary or categorical distal outcomes, you can use Bayes' theorem to flip from P(C = c | X) to P(X = x | C = c). They have an Excel sheet to assist here. This might actually be the most theoretically sound approach available to you with the tools at hand. Now, if you read their paper (it's free), you'll see that they can also apply Bayes' theorem with a continuous covariate ... but I'm not really sure how they did it. It does involve using the kdensity command to estimate the density function, but honestly I'm a bit fuzzy on density functions. Anyway, you said your covariate of interest was categorical.
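        For intuition, the flip is just Bayes' theorem: P(X = x | C = c) = P(C = c | X = x) × P(X = x) / P(C = c). With made-up numbers: if the latent class regression implies P(C = 1 | X = 1) = 0.60, the sample gives P(X = 1) = 0.40, and estat lcprob gives P(C = 1) = 0.30, then P(X = 1 | C = 1) = 0.60 × 0.40 / 0.30 = 0.80.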
        Last edited by Weiwen Ng; 22 Mar 2021, 09:36.



        • #5
          It looks like the PSU LCA plugin has a companion add-on especially for distal outcomes. I will try playing around with that.

          I still have two questions, although I realize Weiwen has already gone above and beyond and may not be inclined to spend any more time on this!

          1. For any approach that starts from the posterior probabilities of class membership -- whether assigning each person to the modal class, as I have already done, or using the posterior probabilities as weights -- is it even necessary to have a path from C to the variable of interest? The probabilities of class membership are generated from the measurement model and ideally (or so I think I've read elsewhere) don't change when adding a structural model.
          2. How was I able to get an LCA with distal outcomes to work for data set A, given that I mixed upper and lower case in referring to the latent classes, and distal outcomes are apparently not something Stata does directly? The results of margins, predict are consistent with tabulating after calculating the class probabilities, so I think it really did model the effect of class on the outcome of interest.



          • #6
            Originally posted by Weiwen Ng

            Code:
            predict pr*, classposteriorpr
            foreach v of varlist pr? {
                gen ipw_`v' = 1 / `v'
            }
            *Generate modal class
            egen modalpr = rowmax(pr?)
            gen modalclass = .
            forvalues i = 1/3 {
                replace modalclass = `i' if pr`i' == modalpr
            }
            
            *Compare mean of relative weight for latent class 1:
            sum relwgt if modalclass ==1
                Variable |        Obs        Mean    Std. Dev.       Min        Max
            -------------+---------------------------------------------------------
                  relwgt |         76    .9367105    .1264108        .71        1.2
            
            mean relwgt [pw = ipw_pr1]
            
            Mean estimation                   Number of obs   =        126
            
            --------------------------------------------------------------
                         |       Mean   Std. Err.     [95% Conf. Interval]
            -------------+------------------------------------------------
                  relwgt |   .9490946   .0468285      .8564153    1.041774
            --------------------------------------------------------------
            I am playing around with this approach. I'm not sure I see the logic of using the inverse of the class probabilities as weights, rather than the probabilities themselves. In the usual context of inverse probability weighting, we think that those with low selection probabilities are the most valuable data points, and so we upweight them. I don't think the same is true for people with low probabilities of being in a particular class. My thought process is simply that I don't want someone with a probability of, say, 0.3 of being in class 1 to contribute more to the estimation of statistics for class 1 than someone with a probability of 0.7 of being in that class. This seems consistent with the idea that if we had more certainty about class assignment, in the form of very high probabilities, we'd be more confident in using modal assignment outright.
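            Concretely, reusing the pr* variables from your example, I am comparing something like this (a sketch only; the choice between weight types follows your discussion above):

            Code:
            * weight by the class-1 probability itself rather than its inverse
            sum relwgt [iw = pr1]
            mean relwgt [pw = pr1]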

            Playing with my data seems to support my instincts. The results seem plausible using the probabilities themselves, rather than their inverse, as weights. For example, one of the summary statistics I want is gender distribution. A priori, on the basis of the results of estat lcmean, I would expect my first class to be predominantly female, my second and third classes to be close to evenly split, and my fourth class to be predominantly male. If I use modal class, this is exactly what I see. It is also what I see using the class probabilities as weights, but the exact estimates differ by a percentage point or so compared to the modal class estimates. If I instead weight using the inverse class probabilities, I get the reverse: class 1 predominantly male and class 4 predominantly female.

            Of course it is very possible that I am missing something obvious, and am happy to have that pointed out!



            • #7
              Originally posted by Josephine George

              I am playing around with this approach. I'm not sure I see the logic of using the inverse of the class probabilities as weights, rather than the probabilities themselves. In the usual context of inverse probability weighting, we think that those with low selection probabilities are the most valuable data points, and so we upweight them. I don't think the same is true for people with low probabilities of being in a particular class. My thought process is simply that I don't want someone with a probability of, say, 0.3 of being in class 1 to contribute more to the estimation of statistics for class 1 than someone with a probability of 0.7 of being in that class. This seems consistent with the idea that if we had more certainty about class assignment, in the form of very high probabilities, we'd be more confident in using modal assignment outright.

              Playing with my data seems to support my instincts. The results seem plausible using the probabilities themselves, rather than their inverse, as weights. For example, one of the summary statistics I want is gender distribution. A priori, on the basis of the results of estat lcmean, I would expect my first class to be predominantly female, my second and third classes to be close to evenly split, and my fourth class to be predominantly male. If I use modal class, this is exactly what I see. It is also what I see using the class probabilities as weights, but the exact estimates differ by a percentage point or so compared to the modal class estimates. If I instead weight using the inverse class probabilities, I get the reverse: class 1 predominantly male and class 4 predominantly female.

              Of course it is very possible that I am missing something obvious, and am happy to have that pointed out!
              I think you are correct and I was too hasty about pweights.

              One thing I can say for sure is that pweights, probability weights, are defined as the inverse of the probability of being sampled. If you download a big public dataset that's weighted to be nationally representative, like the US National Health Interview Survey, and you inspect its pweights, you'll see that the raw values are all in the thousands. If you calculate the total of all the pweights, or you svyset the data and run tabulations, the total population you get should correspond to the sampling frame - which I think is the estimated US civilian population age 18 or up in each year, which you can confirm against Census Bureau figures. (NB: that population is estimated using demographic methods I'm not familiar with; the actual population is only measured every 10 years.)
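              As a toy illustration of that bookkeeping (the weight variable name here is hypothetical):

              Code:
              * if each person was sampled with probability 1/perswt, then the
              * pweights sum to an estimate of the sampling frame's population
              gen byte one = 1
              total one [pw = perswt]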

              So, I should have thought of that, and that rules out pweights. I am leaning towards iweights, or importance weights, but I'll need to think about this a bit more. Good catch for sure.



              • #8
                The complication with my data is that I need to incorporate my actual pweights, which obviously tab doesn't take. To find a workaround for an earlier problem (trying to get svy: tab to give me matrices I could use with putexcel), I experimented with supplying my weights, which are actually pweights, under the weight types that tab does take. Luckily my data don't have PSUs or strata.

                Using aweights gave me the same percentages for oneway and twoway tables, and the same cell counts for oneway tables, while twoway tables had vastly different cell counts. iweights gave identical percentages and cell counts for both oneway and twoway tables.

                So, my starting point is tab with aweights or iweights. I generated a new set of weights by multiplying the class probabilities by my pweights. Using tab with these weights again produces identical percentages for aweights and iweights. This time, though, the cell counts/total numbers for iweights are much smaller -- presumably because the class probabilities average out to the proportion of each class in the sample. The percentages are also comparable to what I get if I just use the class with the highest probability for each individual.
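                For concreteness, the comparison I ran looks roughly like this (names are placeholders for my real variables):

                Code:
                * combine the class-1 posterior probability with the survey pweight
                gen combwt1 = pr1 * sampwgt
                tab var9 [aw = combwt1]
                tab var9 [iw = combwt1]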

                Going forward, then, if my goal is to produce descriptive statistics for my groups, it seems like iweights or aweights produce reasonable results. In principle this is an improvement on modal assignment, although substantively it hasn't changed anything so far.



                • #9
                  I wasn't able to assist Josephine further with this, unfortunately, but I wanted to cross-post this with a previous question I answered. I confess I haven't done a ton more research since Josephine posed the initial question; it's just that some information has percolated.

                  To recap: many people fit an LCA model, and afterward they wonder, "What influences latent class membership? I'm going to tabulate some other variables (sometimes called distal variables) by latent class membership and see what I find." It is not easy to appreciate that you can't do this directly! I realize that this sounds absurd; if this were a setting where I were trying to sell you something, you would be justified in wondering whether I were trying to swindle you. However, remember that we don't actually know an observation's class membership - it's latent, we can't observe it directly, and we can only estimate the probability that each observation belongs to each class. That is the main stumbling block.

                  Basically, researchers smarter than I am (e.g. Bolck and colleagues; Vermunt and colleagues; citations at the end) have shown that both modal and probabilistic class assignment produce biased results. The other thread shows this with some quotes from the early part of Vermunt's article. I am not at Vermunt's level, but I am pretty sure I understood that part correctly. The key these researchers realized is that you can correct for the measurement uncertainty: you estimate some sort of correction factor based on that vector of class membership probabilities (it's more complex than that, and this is the part I definitely do not understand). Then you can go and tabulate your variables by latent class, or you can go and fit a multinomial logit model if you want.

                  Actually, you can also fit a latent class regression (a synonym is the one-step approach), where you simultaneously fit an LCA model and regress the class on covariates (parallel to a MIMIC model in SEM, or an explanatory IRT model). However, this has its own potential problems: multinomial logit and LCA are each fussy by themselves, so you're stacking two fussy models. Your multinomial model may influence the classes you identify - I can cite one instance in my own work where it did not, but that's not guaranteed. In my dissertation, I attempted to fit a latent class regression, but it didn't converge, likely due to sparseness - the prevalence of the distal variables of interest was probably zero in some of the latent classes, and I assume that prevented convergence.

                  So, what are we Stata folks to do in the meantime? I do not know for sure. One thing I can suggest is to calculate your model's entropy. Recall that (normalized) entropy, which runs on a 0-1 scale, is a one-number summary of how certain we are that each case was classified correctly. If your classes are well separated on many indicators, that should generally produce high entropy. If your entropy is high, you could consider tabulating with probabilistic assignment, as I suggested off the cuff in my earlier post here. Here is a post on the MPlus forum where Bengt Muthén (one of the MPlus principals) said informally that

                  The use of "most likely class membership" as a variable for further analysis, however, is problematic when the entropy goes much lower than 0.8.
                  Remember, that's an informal statement. It's not empirical; it's probably like expert opinion, except maybe less well considered than that. We have no good guides here. As a reviewer, I think I would accept a paper that calculated entropy, found it was high, and noted in the limitations that this approach has been shown to produce biased associations and that there's no real guide on what counts as good entropy. I would probably prefer probabilistic assignment over modal assignment. Again, I believe you would tabulate while applying importance weights (i.e. [iweight = p_class_k]). If it matters, I did note on the Stata 18 wishlist that I would like Stata to implement one of the three-step estimators for the relationship between latent classes and distal variables (i.e. variables that weren't treated as indicators of the latent class); this is clearly a popular request. I would like to be able to say something more than just "try latent class regression."
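                  If you want to compute normalized entropy yourself, here is a minimal sketch from the posterior probabilities (assuming three classes; I'm not aware of a built-in command for this after gsem, and skip the predict line if you already created pr1-pr3):

                  Code:
                  predict pr*, classposteriorpr
                  gen double plogp = 0
                  forvalues k = 1/3 {
                      replace plogp = plogp + pr`k' * ln(pr`k') if pr`k' > 0 & !missing(pr`k')
                  }
                  quietly summarize plogp, meanonly
                  * normalized entropy = 1 - (-sum of p*ln(p)) / (N * ln(K))
                  display "Normalized entropy = " 1 + r(sum) / (r(N) * ln(3))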

                  What do you do if your model entropy is low? I honestly don't know for sure. If you have one or two classes of interest and you can qualitatively judge that they're well separated from the other classes, you might be able to get away with doing your tabulations just for those classes. The Kathryn Masyn chapter cited in the Stata SEM examples on LCA does suggest (see page 570) that you calculate the average posterior class probability for each latent class: among the observations you'd assign to class k by modal assignment, what is the average probability of membership in class k? (These are the probabilities you got when you typed predict pr*, classposteriorpr.) She cites a source suggesting that a class is well separated from the others if that average probability is > 0.70. (But again, this is probably expert opinion-ish, although at least it was peer reviewed.)
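                  A sketch of that calculation, reusing the pr* and modalclass variables from my earlier post (three classes assumed):

                  Code:
                  * average posterior probability (AvePP) for each class among its modal members
                  forvalues k = 1/3 {
                      quietly summarize pr`k' if modalclass == `k'
                      display "AvePP, class `k' = " %5.3f r(mean)
                  }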

                  What if you have no classes that are well separated from the others? Then don't attempt this kind of distal analysis. It is what it is: your data are limited, and we can't always have good indicators. Again, this is a complex method, and there are many pitfalls for applied statisticians.


                  Sources:
                  Bolck, Annabel, Marcel A. Croon, and Jacques A. Hagenaars. 2004. "Estimating Latent Structure Models with Categorical Variables: One-Step versus Three-Step Estimators." Political Analysis 12(1): 3–27.
                  Vermunt, Jeroen K. 2010. "Latent Class Modeling with Covariates: Two Improved Three-Step Approaches." Political Analysis 18(4): 450–469.

