  • Latent Class Analysis - Is it possible to run ANOVA or Chi-test in STATA after LCA?

    Hi everyone,

    I have been trying to conduct an ANOVA or chi-squared test to examine whether certain variables differ significantly among classes after running the LCA estimation. From published papers, Stata seems to be able to do so (e.g., https://www.sciencedirect.com/scienc...67070X18306395).

    However, I found no related examples or syntax in the manual. Am I missing something here, or is there just no such function in Stata?

    Thanks in advance for responding.



  • #2
    Hello,
    Just to confirm: you wish to compare additional variables not included in the initial LCA between the classes, not one or more of the indicators themselves? I'll follow up once I know your answer.
    Brian



    • #3
      Hi Brian, thanks for your attention.
      Can both be done? I intend to examine each indicator's differences among classes in the initial LCA, and also to conduct ANOVA tests on predicted variables.
      Sorry if this sounds dumb, but I was wondering whether LCA could generate a variable that labels each observation's class, as cluster analysis does, so that ANOVA tests can be conducted the way I know.



      • #4
        Seeing as Brian hasn't answered yet, I'll raise a question.

        You said this:

        I have been trying to conduct an ANOVA or Chi-test to examine whether certain variables are significantly different among classes after running the LCA estimation.
        For this example, let X be a vector of indicators of the latent class. Let Y be the latent class. Let Z be a vector of other variables that aren't indicators. Your quote makes it sound like you want to know the mean of Z given Y (or P(Z = z | Y)). I believe that's latent class analysis with distal outcomes; more discussion on that later.

        The paper you cited says:

        A broad range of indicator and predictor variables was used to generate class profiles that structure the Australian AV market...

        Class 2 (20% of the sample) was differentiated from the other classes by very low reported intentions to be early adopters of AVs while demonstrating among the highest intentions to use shared AVs. They had relatively high expectations of positive outcomes from AVs and relatively low levels of concern. This class was titled “Ride-sharing preference”. There were no predictor variables for which members of this group were significantly different to those in most other classes...

        Class 5 was the smallest class, accounting for 14% of the total sample. Individuals in this group reported the highest scores on all indicator variables except concerns, for which they were the lowest (albeit the difference from the total sample mean for concerns was non-significant). The highly favourable attitudes towards both personal and shared AVs resulted in this group being classified as “First movers”. Relative to those allocated to most other classes, members of this group tended to be more educated, have shorter driving histories, enjoy driving, and report regularly transporting the elderly/disabled.
        The way the authors phrased this is a bit unclear. In isolation, when someone says "predictors of the latent class," this makes me think they fit a latent class regression: in addition to the usual latent class indicators, you add covariates Z to the multinomial model for class membership. If you had actually observed the latent classes, you would just fit a multinomial logistic model with Z as covariates. You don't observe the latent class, but you can indeed simultaneously fit a model for P(Y = y | Z). Search the site for latent class regression; I have written some answers that show how to fit this model.
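        For what it's worth, a minimal sketch of a latent class regression in gsem might look like the following, where y1-y3 are placeholder binary indicators and z1 and z2 are placeholder covariates for class membership (these variable names are not from the thread):

        Code:
        gsem (y1 y2 y3 <- _cons, logit) (C <- z1 z2), lclass(C 2)
        The (C <- z1 z2) equation makes class membership a multinomial logit function of the covariates, estimated simultaneously with the measurement model.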

        For latent class with distal outcomes, you could assign each observation to the class they're most likely to be in, i.e. the modal class. For example, if Mrs. Chen has probabilities 0.9, 0.06, and 0.04, we're pretty sure she's in class 1. It's a bit wrong to just act as if we're certain she's in class 1, but it's not too bad a distortion of reality. However, what happens if you have a lot of people whose probability vectors look like 0.4, 0.3, 0.3? Then your model is more wrong.

        At that point, you might want to Google for latent class analysis with distal outcomes. The issue is that some of the techniques proposed to handle this properly are not straightforward to implement, and Stata has no specific implementation of any of the ones I'm aware of (e.g. pseudo-class draws, which is basically multiple imputation). Modal class assignment, however, is relatively straightforward.
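        As a rough illustration, modal class assignment for a 3-class model already fit with gsem might look like this, where z (continuous) and w (categorical) stand in for hypothetical distal variables:

        Code:
        predict cpr*, classposteriorpr
        egen double maxpr = rowmax(cpr1 cpr2 cpr3)
        generate modalclass = .
        forvalues c = 1/3 {
            replace modalclass = `c' if cpr`c' == maxpr
        }
        oneway z modalclass, tabulate
        tabulate w modalclass, chi2
        Keep in mind this treats the modal assignment as if it were certain, which understates the classification uncertainty discussed above.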

        Whatever you do, please explain clearly what you're doing. In the article you cited, I'm about 60% sure they are presenting latent class analysis with distal outcomes (i.e. P(Z = z | Y = y), or the mean of Z given Y = y), but they don't explicitly say so. I'm left assuming they did modal class assignment and then ran a series of ANOVA models. The entropy for their final model was fairly high, so modal class assignment should not be too bad.

        A paper by some Mplus researchers that was linked in the article you cited says a bit more about this.
        Be aware that it can be very hard to answer a question without sample data. You can provide some with the dataex command; type help dataex at the command line.

        When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Hello,
          I hadn't forgotten to reply; I've just had trouble finding the time!
          For testing differences among the indicators of the latent classes, I'd suggest two approaches. First, since Stata's gsem LCA provides standard errors, you could calculate confidence intervals around the loglinear parameters, or around the conditional response probabilities output by
          Code:
          estat lcmean
          (using the transformed standard errors). Second, you could do a nested model comparison between a model with the response probabilities in question freely estimated and another with the appropriate response probabilities constrained to be equal. (They are intercepts in the loglinear parameterization; I just always think in probabilities.)

          Let's say I had a four class model with no restrictions. Here's a model I ran earlier today:
          Code:
          . gsem (Q46r Q47r Q48r Q49r Q50r Q51r Q52r Q53r Q54r Q55r <- _cons), family(bernoulli) link(logit) lclass(A 4) lcinvariant(none) startvalues(randomid, draws(16))
          . estimates store model1
          Next, let's say I want to constrain the probabilities of a 1 for Q46r and Q47r to be equal in classes 1 and 2. That is
          P(Q46r = 1 | A = 1) = P(Q47r = 1 | A = 1) = P(Q46r = 1 | A = 2) = P(Q47r = 1 | A = 2), where A is my latent class variable. To do that, I have to make sure my classes come out in the right order. The way I've done that so far is to save the posterior class probabilities from one analysis, reorder them as I wished, and then supply the saved values as start values of type classpr.

          Code:
          . predict tmp*, classposteriorpr
          . generate lc4post1 = tmp1
          . generate lc4post2 = tmp3
          . generate lc4post3 = tmp2
          . generate lc4post4 = tmp4
          . drop tmp*
          . gsem (Q46r Q47r Q48r Q49r Q50r Q51r Q52r Q53r Q54r Q55r <- _cons), family(bernoulli) link(logit) lclass(A 4) startvalues(classpr lc4post1 lc4post2 lc4post3 lc4post4 )
          The gsem run at the bottom of the code block above has no constraints; it simply uses the posterior probabilities as start values. The problem I ran into with my data was that the start values were too good: the algorithm couldn't find a better fit, so it kept iterating without declaring convergence. To fix that, I had to add an iteration limit to the startvalues() option.

          Code:
          . gsem (Q46r Q47r Q48r Q49r Q50r Q51r Q52r Q53r Q54r Q55r <- _cons), family(bernoulli) link(logit) lclass(A 4) startvalues(classpr lc4post1 lc4post2 lc4post3 lc4post4 , iterate(20))
          . estimates store model2
          Once I can get my classes to come out in the right order, I would next impose the constraints.

          Code:
          constraint 1    _b[Q46r:1.A] = _b[Q47r:1.A]
          constraint 2    _b[Q46r:1.A] = _b[Q46r:2.A]
          constraint 3    _b[Q46r:1.A] = _b[Q47r:2.A]
          The first constraint says that the intercept (_b) for item Q46r in class A = 1 equals the intercept (_b) for item Q47r in class A = 1. The following two lines set the corresponding class 2 parameters equal as well. (Let me add: I just figured this out over the past couple of weeks, and it appears to work, but there may well be better ways to do this. If there are, please share!)

          Once the constraints are given, run gsem with the constraints.

          Code:
          constraint drop _all
          constraint 1    _b[Q46r:1.A] = _b[Q47r:1.A]
          constraint 2    _b[Q46r:1.A] = _b[Q46r:2.A]
          constraint 3    _b[Q46r:1.A] = _b[Q47r:2.A]
          
          gsem (Q46r Q47r Q48r Q49r Q50r Q51r Q52r Q53r Q54r Q55r <- _cons), family(bernoulli) link(logit) lclass(A 4) startvalues(classpr lc4post1 lc4post2 lc4post3 lc4post4 , iterate(20)) constraints( 1   2   3 )
          Now you have the two models, and you can compute a likelihood-ratio test.
          Code:
          lrtest (model1) (model2)
          Hope this helps. If I'm doing something wrong or inefficient, please let me know. I know LCA, but I'm new to Stata.

          Brian
