Latent class analysis: - gsem- & pseudo R-squared

Lena Garnik

Join Date: Oct 2019

Posts: 8
#1


08 Dec 2019, 09:09

Dear users,

This may be a dumb question, but I am trying to familiarize myself with latent class cluster analysis. Most published papers that use a latent class analysis approach (regardless of the software chosen) report a pseudo-R2 alongside the log-likelihood value and BIC.

However, I do not see any R-squared in the output when -gsem- is used to conduct LCA. Is it possible to get one?

Thanks in advance,
Best,
Lena
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#2

08 Dec 2019, 10:07

Code:

estat lcgof

McFadden's pseudo-R2 would just be 1 minus the ratio of log-likelihoods (the fitted model's log-likelihood divided by that of the comparison model).
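As a rough illustration of the arithmetic only, here is a minimal Python sketch of that formula. The function name and the two log-likelihood values are made up for the example; they do not come from any model in this thread.

```python
def mcfadden_pseudo_r2(ll_model: float, ll_null: float) -> float:
    """McFadden's pseudo-R2: 1 - ll_model / ll_null."""
    if ll_null == 0:
        # The ratio is undefined when the comparison log-likelihood is 0.
        raise ValueError("comparison log-likelihood is 0; pseudo-R2 is undefined")
    return 1.0 - ll_model / ll_null


# Hypothetical log-likelihoods, for illustration only.
ll_null = -250.0   # comparison (e.g. intercept-only) model
ll_model = -200.0  # fitted model
print(mcfadden_pseudo_r2(ll_model, ll_null))
```

The statistic depends entirely on which comparison model supplies `ll_null`, so the two models should be fitted to the same observations.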
Lena Garnik

Join Date: Oct 2019

Posts: 8
#3

08 Dec 2019, 10:20

Thank you very much Andrew.
Does this mean that when the observed variables are NOT all categorical, in which case Stata does not report the likelihood-ratio tests, we cannot calculate the pseudo-R2?

Best,
Lena
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#4

08 Dec 2019, 10:36

The continuous variables do not contribute to the log-likelihood, so just run the command excluding these to obtain the fit statistics.

Edit: To be more specific, in logit, for example, if you include a continuous dependent variable, it does not vary. Can you show your command?

Last edited by Andrew Musau; 08 Dec 2019, 10:58.
Lena Garnik

Join Date: Oct 2019

Posts: 8
#5

08 Dec 2019, 10:55

Dear Andrew, thanks again for your response.
But I cannot exclude continuous variables, because all of the variables are continuous. My data looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double age float(gender educ inc x1 x2 x3 x4 x5 x6 x7)
49 0 1 0 3 -2 2 -2 -1 1 2
33 1 0 0 -3 3 0 4 1 0 -4
45 1 1 1 -2 2 1 4 1 0 -4
19 1 0 1 -2 2 1 4 0 1 -4
33 1 1 0 2 2 -3 3 -1 3 -2
29 0 1 0 2 0 0 -1 -2 0 -4
54 1 0 1 -1 3 2 4 1 1 -2
61 1 0 0 1 -1 3 1 0 3 0
47 1 0 0 2 1 -1 3 -2 4 -4
32 1 0 1 -3 0 1 4 -1 3 -2
end

And I am specifying the model as:

Code:

local x "x1 x2 x3 x4 x5 x6 x7"
gsem (`x' <- ) (C <- age gender educ inc), lclass(C 3)
estat lcgof

Therefore, if I exclude the continuous variables, there will be no endogenous variables at all.
I am not sure whether I can specify another distribution family for these observed variables. They range from -4 to 4 and represent the number of times an attribute is picked as best minus the number of times it is picked as worst.

Is there any other way to get an R-squared, or to specify the model differently?
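To make the construction of these scores concrete, here is a small Python sketch of a best-minus-worst count. The attribute names and the choice tasks are invented for illustration; they are not the poster's data, and the function name is hypothetical.

```python
from collections import Counter


def best_minus_worst(choices, attributes):
    """Score each attribute as (# times picked best) - (# times picked worst).

    choices: list of (best, worst) pairs, one pair per choice task.
    """
    best = Counter(b for b, _ in choices)
    worst = Counter(w for _, w in choices)
    return {a: best[a] - worst[a] for a in attributes}


# Hypothetical respondent with four choice tasks over three attributes.
tasks = [("price", "taste"), ("price", "brand"),
         ("taste", "brand"), ("price", "brand")]
scores = best_minus_worst(tasks, ["price", "taste", "brand"])
print(scores)
```

Because every task contributes one "best" and one "worst" pick, a respondent's scores sum to zero, and each score is an integer bounded by the number of times the attribute is shown.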
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#6

08 Dec 2019, 12:00

This is OK, you can include continuous variables here. I thought your model was logit. So in this case, Stata cannot compute the likelihood ratio. If I get time, I will check if there is a workaround.
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#7

09 Dec 2019, 02:55

For linear models, it is possible to calculate McFadden's pseudo-R2, but it does not make much sense to do so, because you already have the R2 statistic = Model Sum of Squares/Total Sum of Squares. Recall that McFadden's pseudo-R2 is a substitute for R2 in nonlinear models such as logit and probit. Here is an example of how to calculate it in a linear model.

Code:

sysuse auto
qui glm mpg displacement weight gear
local ll1= e(ll)
qui glm mpg
local ll0= e(ll)
di "McFaddens Pseudo R2 is `= 1-(`ll1'/`ll0')'"

Result:

Code:

. di "McFaddens Pseudo R2 is `= 1-(`ll1'/`ll0')'"
McFaddens Pseudo R2 is .1674128332010786

So in the case of latent class models in gsem, the issue is how to define the comparison model. If you define it as one with log-likelihood = 0, then the pseudo-R2 is not defined, because the ratio of log-likelihoods would involve division by zero. I do not see an easy way of doing this, so I would just stick to the AIC and BIC. As I said, even if we were able to calculate the statistic, it is not useful for linear models.
Lena Garnik

Join Date: Oct 2019

Posts: 8
#8

09 Dec 2019, 06:50

Dear Andrew,
Thank you so much for your response.

I was wondering whether treating the variables as continuous is correct in this specification. My Xs range from -4 to 4 and take only integer values. They are basically count data, but effects coded: the number of times an attribute is chosen as best minus the number of times it is chosen as worst. Is it correct to treat them as continuous?

The reason for asking is that a similar paper, which used a similar best-worst ranking of attributes, reports an R2 (without mentioning which software was used). So I was wondering whether the problem is the continuity assumption that I am making here.

I really appreciate your help,
Best,
Lena
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#9

09 Dec 2019, 07:31

I was wondering is assuming continuous variable is correct in this specification, and my Xs are ranging from -4 till max 4

Yes, a linear model works here since the support of the normal distribution is $(-\infty, \infty)$, so negative values are allowed. Because you have differences in counts and not counts, you cannot use count models such as Poisson: there is no such thing as a negative count.

The reason for asking is that, in a similar paper which used similar best and worst ranking of the attribute, the author is reporting R2

I cannot tell what the author did or what model he/she used. Do you have a link to the paper? The method of analysis should have been discussed in the paper.

Last edited by Andrew Musau; 09 Dec 2019, 07:35.
Lena Garnik

Join Date: Oct 2019

Posts: 8
#10

09 Dec 2019, 09:08

The method of analysis is not discussed in great detail, because the latent class analysis is provided in the appendix as an alternative way of analyzing such data.
Here is the paper, and here is the appendix.

Thank you for the time you take,
Best,
Lena
Lena Garnik

Join Date: Oct 2019

Posts: 8
#11

09 Dec 2019, 09:40

Just wanted to point out: in the appendix, under the table "A5: Latent Class Cluster Analysis Based on Effects-Coded Count Estimates", an R-squared is reported.
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#12

09 Dec 2019, 10:45

Thanks for posting the link. I believe what the authors call R2 is entropy R2, which is an indication of the quality of classification. Here is the code to implement it in R.
https://gist.github.com/daob/c2b6d83...3cebfdc2c267b3

This is a supplementary statistic, and I would not worry about leaving it unreported. For assessing the fit of the estimated model, AIC/BIC are more important.
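For readers who want to see the arithmetic, here is a minimal Python sketch of one common definition of entropy R2 (relative entropy): 1 minus the total classification entropy divided by its maximum, N ln K. Whether the paper's authors used exactly this definition is an assumption. The posterior probabilities below are made up; after -gsem- they would have to come from the fitted model's posterior class probabilities.

```python
import math


def entropy_r2(posteriors):
    """Relative entropy: 1 - sum_i sum_k(-p_ik * ln p_ik) / (N * ln K).

    posteriors: one row per observation, one posterior probability per class.
    """
    n = len(posteriors)
    k = len(posteriors[0])
    if k < 2:
        raise ValueError("need at least two classes")
    total_entropy = 0.0
    for row in posteriors:
        for p in row:
            if p > 0:  # p * ln(p) tends to 0 as p tends to 0
                total_entropy -= p * math.log(p)
    return 1.0 - total_entropy / (n * math.log(k))


# Hypothetical posterior class probabilities for three observations, K = 3.
posteriors = [[0.98, 0.01, 0.01],
              [0.05, 0.90, 0.05],
              [0.10, 0.20, 0.70]]
print(round(entropy_r2(posteriors), 3))
```

Values near 1 mean observations are assigned to classes with little ambiguity; values near 0 mean the posterior probabilities are close to uniform.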
Lena Garnik

Join Date: Oct 2019

Posts: 8
#13

09 Dec 2019, 11:30

Thanks a lot Andrew,
Best,
Lena

Last edited by Lena Garnik; 09 Dec 2019, 11:36.
Brian Flaherty

Join Date: Jun 2017

Posts: 11
#14

09 Dec 2019, 17:40

Hello,
In a latent class cluster analysis (aka latent profile analysis, mixture of normals?), where the classes are nominal categorical and the indicators are continuous and assumed conditionally normally distributed, an R^2 for each item is reported as a measure of the variance in that item accounted for by the latent class/profile variable. It is akin to the item R^2 reported in a confirmatory factor analysis. It tells you about measurement quality. AIC/BIC are relative model selection criteria. You could have a set of models, all with very poor measurement, and AIC/BIC will still say one is the best of the set. Thus item R^2 gives a different view of the quality of the model and estimates.
Hope this helps.
Brian
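One way to make the item R^2 idea concrete is the law of total variance: the share of an item's variance that lies between class means. The Python sketch below assumes that decomposition; specific software may compute the statistic differently, and the class proportions, means, and variances are invented for illustration.

```python
def item_r2(class_probs, class_means, class_vars):
    """Share of an item's variance explained by class membership.

    between = sum_k pi_k * (mu_k - mu)^2, within = sum_k pi_k * sigma_k^2,
    R^2 = between / (between + within).
    """
    grand_mean = sum(p * m for p, m in zip(class_probs, class_means))
    between = sum(p * (m - grand_mean) ** 2
                  for p, m in zip(class_probs, class_means))
    within = sum(p * v for p, v in zip(class_probs, class_vars))
    total = between + within
    if total == 0:
        raise ValueError("item has zero variance; R^2 is undefined")
    return between / total


# Hypothetical three-class solution for one indicator.
r2 = item_r2(class_probs=[0.5, 0.3, 0.2],
             class_means=[-2.0, 0.5, 3.0],
             class_vars=[1.0, 1.2, 0.8])
print(round(r2, 3))
```

Well-separated class means relative to the within-class spread push the value toward 1, which is the sense in which the statistic reflects measurement quality rather than relative model fit.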