
  • #16
    Hi all,

    This has been a very helpful thread for me but I have a quick question RE plausible entropy values. Clyde's code in post #7 works but generates a negative entropy value of -.93447178 after my LCA model. I thought entropy values had to be between 0 and 1! Anyone know what may be causing this, and whether I can just take the absolute value?


    Many thanks,

    Laura



    • #17
      That code should not be generating negative values. Nor do I see anything in the code that is clearly the problem here. Please post an example data set that reproduces this result and I will try to troubleshoot it. Be sure to use the -dataex- command to post the example data.

      If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      When asking for help with code, always show example data. When showing example data, always use -dataex-.



      • #18
        Originally posted by Laura Brown:
        Hi all,

        This has been a very helpful thread for me but I have a quick question RE plausible entropy values. Clyde's code in post #7 works but generates a negative entropy value of -.93447178 after my LCA model. I thought entropy values had to be between 0 and 1! Anyone know what may be causing this, and whether I can just take the absolute value?


        Many thanks,

        Laura
        Laura, entropy definitely has to be between 0 and 1, so the first thing I would check is the code for typos. The likeliest culprit is this: the denominator of the entropy calculation involves the natural log of k, the number of latent classes. The example code was written for 2 latent classes, so you have to update k (and hence the ln(2) in the denominator) every time you calculate entropy for a model with a different number of classes. I don't think this was explicitly stated earlier in the thread, so I'm stating it now to remove ambiguity.

        Also, I think the code in post #7 might not have been robust to situations where some people had predicted class membership probabilities of 0 (or at least 0 within floating point precision), which I addressed in a later post in the thread. This should be a rare scenario, but it did happen to me with real data. In that situation, I considered a class membership probability of 0 to be plausible.

        If neither of these situations applies to you, can you post your code?
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #19
          Thanks Clyde and Weiwen for your speedy replies.

          This is the code I used:

          Code:
          *1) 2 class latent class model with all indicators included:
              set more off
              gsem (bfd menarche afb mwtkg <- ) ///
              (everbf    activities affection bwgst parcat rels ghq_75 regsmk alco <-, logit) ///
              (read vaxcat <-, ologit) ///
              /*if ethnic==1*/, ///
              lclass (C 2) startvalues(randompr, draws(5) seed(10))
              estat lcgof
              estimates store bib_c2_w
              
              *Entropy:
              quietly predict classpost*, classposteriorpr
              forvalues k = 1/2 {        
              gen sum_p_lnp_`k' = classpost`k'*ln(classpost`k')
              }
              egen sum_p_lnp = rowtotal(sum_p_lnp_*)
              summ sum_p_lnp, meanonly
              scalar E = 1+`r(sum)'/(e(N)*ln(2))
              drop classpost?    sum_p_lnp*
              di E
          
              *Sample size adjusted BIC:
              scalar SSBIC_bib_c2_w = -2 * e(ll) + e(rank) * ln((e(N)+2) / 24)
              di SSBIC_bib_c2_w
          It is a 2 class model, so k is correct. I have already adjusted the code as per Weiwen's suggestion for the 0 probability scenario, since preliminary analyses suggested that the sample splits with more than 95% in one class.

          I post a data example below, but -dataex- only let me create it from 100 observations. My actual dataset has 12,801 observations, and the analysis above focuses on a subset of 3,938 White British mothers. I tried running the LCA and entropy code on several dataex datasets, but 100 cases is too small, so either some categories of variables aren't present or the model just doesn't converge. Unfortunately I cannot share the full dataset due to access restrictions.


          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(bfd menarche afb mwtkg everbf activities affection bwgst parcat rels ghq_75 regsmk alco read vaxcat)
            .4602735        15 30  58 1 . 1 0 1 0 0 0 1 0 .
            .9205469        13 36  77 1 . 1 1 1 0 0 0 1 0 .
                   .        15  .  65 0 . 1 1 1 0 0 0 1 0 .
            1.841095        15 24  65 1 . 1 1 1 0 0 0 0 0 .
           4.6027346        15  . 100 1 . 1 1 1 0 1 1 1 0 .
          .032876678 14.666667  .  54 1 0 1 1 1 1 1 1 1 0 1
                   .        15  .  80 0 . 1 1 1 1 1 1 1 0 .
                  12     15.25 28  62 1 . 1 1 1 0 0 0 1 0 .
                   .        12 21  52 0 . 1 1 1 0 0 0 1 0 .
           20.021896        12 36  58 1 . 1 1 1 0 0 1 1 0 .
                   .      12.5  .  65 0 . 1 0 1 1 0 1 0 0 .
                   .        13  . 110 0 . 1 1 1 0 0 1 1 . .
          .032876678        12 31  64 1 . 1 1 1 1 0 1 1 . .
          .032876678        16  .   . 1 . 1 1 1 0 0 0 0 0 .
           18.016418        10 39  95 1 . 1 1 1 0 0 0 1 0 .
                   .        13  .  75 0 . 1 1 1 0 0 1 1 1 .
            .1643834        13 25  52 1 . 1 1 1 0 0 0 0 0 .
                   .        11 28  98 0 . 1 1 1 0 0 1 1 . .
                   .        13  .  64 0 1 1 1 1 0 0 0 1 0 1
            .2301369        13  .  89 1 0 1 1 1 0 0 0 1 0 1
                  24         .  .  60 1 . 1 1 1 1 1 1 1 2 .
                  24      11.5  .  54 1 . 1 1 1 0 0 1 0 0 .
           .13150671 12.916667  .  55 1 1 0 1 1 0 1 0 1 0 1
           4.6027346        13 30  73 1 . 1 1 1 0 0 0 0 0 .
          .032876678        11 26  67 1 . 1 1 1 0 1 1 1 0 .
            7.002732        14 18  52 1 0 1 1 1 1 1 1 1 0 1
            .4602735     12.75 20   . 1 . 1 1 1 0 0 1 0 0 .
           11.013686        13  .  52 1 0 1 1 1 0 1 0 0 0 1
                   .        12 18  86 0 . 1 1 1 0 1 1 1 0 .
            .4602735        15 22  57 1 . . 1 1 0 0 1 1 0 .
                   .        13 19  50 0 . . 1 1 0 1 1 1 . .
                   .        14  .  69 0 . 1 1 1 0 1 1 1 2 .
                   .        14  .  60 0 . 0 1 1 1 0 0 1 0 .
                   . 12.583333  .  65 0 . 1 1 1 0 0 1 0 . .
                   .        13  .  82 0 . 0 1 1 0 0 1 0 . .
           1.1506836        13  .  60 1 . 1 0 1 0 0 0 0 1 .
                   6        11  .  75 1 . 1 1 1 0 0 1 1 . .
                   6        14 28  56 1 . . 1 1 0 0 0 1 . .
            .9205475        13 30  75 1 1 1 1 1 1 1 0 1 0 1
                   .        13  .  59 0 . 1 0 1 1 0 1 0 2 .
                   6        15  .  95 1 . 1 1 0 0 0 1 1 . .
            .6904102 13.083333 24  55 1 . 1 1 1 0 1 1 0 0 .
                   .        14 21  58 0 . 1 1 1 0 0 1 1 1 .
                   .        11  .  66 0 . . 1 1 0 0 0 1 . .
            6.016432        10 20 107 1 . 1 1 1 0 1 0 1 0 .
                   .        15  .  55 0 1 1 1 1 0 0 0 0 0 1
            15.02464        14  .  65 1 1 1 0 1 0 1 0 1 0 1
          .065753356        12  .  55 1 . 1 1 0 0 0 1 1 0 .
                   .        13 15  46 0 . . 1 1 1 1 0 0 0 .
                   .        12 15  69 0 . 1 1 1 1 0 1 1 . .
                   .        14  .  48 0 . 1 1 1 0 0 0 1 1 .
           .23013674        13  .  88 1 1 1 1 1 1 1 1 1 0 1
           1.3808213        17 25  82 1 . 1 1 1 0 1 0 1 . .
                   . 14.916667  .  53 0 . 1 1 1 1 0 1 1 . .
                  24        13 24  96 1 . 1 1 1 1 0 1 1 2 .
                   .        14 16  60 0 0 . 1 1 0 0 1 1 . 1
                   .        12  .  75 0 . 1 1 1 0 1 1 0 0 .
                   .         9 17  57 0 . . 1 1 0 0 1 1 0 .
          .032876678        13  .  55 1 1 1 1 1 1 0 1 0 0 1
            .9205469        13  .  63 1 . 1 1 0 0 1 1 1 0 .
            1.841095 15.416667 29  85 1 0 1 1 1 0 0 0 1 0 1
          .032876678        12  .  50 1 . 1 1 1 0 1 0 0 0 .
                   .        13 21  90 0 . 1 1 1 0 0 0 1 0 .
            .9205469        12 17  39 1 . 1 1 1 0 0 0 0 . .
            1.841094     14.25 23  99 1 . 1 1 1 0 1 1 1 0 .
            9.008209        12 18  56 1 1 1 1 1 1 1 1 1 0 2
                   .        13 22  69 0 . 1 1 1 0 0 1 1 . .
            9.008209        13 20 110 1 1 1 1 1 0 0 1 1 0 1
            8.021909        11  .  81 1 . 1 1 1 0 0 1 1 0 .
          .065753356 15.666667 28  97 1 . 1 1 1 0 0 1 1 0 .
            .4602735 13.833333 21  55 1 . 1 1 1 1 0 0 0 0 .
           1.3808204        12 23 126 1 . 1 1 1 0 0 1 1 0 .
                   .        14  . 107 0 . 1 1 1 0 1 1 1 0 .
          .065753356      10.5 23  95 1 0 1 1 1 0 1 0 0 0 1
                   .        11  .  75 0 . 1 1 1 1 1 1 1 0 .
                  24        14  .  65 1 . 0 1 1 1 0 0 0 2 .
                   .        14 20  46 0 . 1 1 1 1 1 1 0 2 .
          .032876678        11 19  47 1 . 1 1 1 1 0 1 1 0 .
                   .        12 18  67 0 1 0 1 1 0 0 0 0 0 2
                   .        12  . 106 0 . . 1 0 1 1 1 0 . .
                   .        16  .  78 0 . 1 1 1 0 1 1 0 0 .
           1.6109582        13 38 101 1 0 1 1 1 1 0 1 1 0 1
            .9205469        14  .   . 1 . 1 1 1 0 0 1 1 0 .
            15.02464        10  .  76 1 . 1 1 1 1 1 0 1 0 .
                   .      12.5  . 120 0 1 1 1 1 0 1 1 1 0 1
           4.0109544        10  .  78 1 . 1 1 0 0 0 0 0 0 .
           .23013674        13 17  63 1 . 1 1 1 0 0 1 1 1 .
           1.3808213        11  . 108 1 1 1 1 1 0 0 0 1 0 1
                   .         9  .  75 0 1 1 1 0 0 0 0 1 0 1
                   .        15  .  82 0 0 1 1 1 0 0 1 1 0 0
                   .     12.25 20  53 0 . 1 1 1 0 0 0 1 0 .
                   .        13  .  45 0 . 1 1 1 1 0 1 1 . .
          .032876678        12  .  55 1 1 1 0 1 1 0 1 0 . 1
            .3287668        13 22  62 1 0 1 1 1 1 1 0 1 0 1
            8.021909        13  .  68 1 . 1 1 1 0 1 1 0 0 .
                   .        13  .  57 0 . 1 1 1 0 0 0 0 0 .
           2.0054772        11  .  78 1 1 1 1 1 1 1 1 0 0 1
                   .        14  . 106 0 . 1 1 1 0 1 0 0 0 .
           1.1506836         9 20 104 1 . 1 1 1 1 0 1 0 0 .
                   . 11.583333 17  74 0 . 1 1 1 1 1 1 1 2 .
          end
          label values everbf everbf
          label def everbf 0 "No", modify
          label def everbf 1 "Yes", modify
          label values activities activities
          label def activities 0 "No", modify
          label def activities 1 "Yes", modify
          label values affection affection
          label def affection 0 "No", modify
          label def affection 1 "Yes", modify
          label values bwgst bwgst
          label def bwgst 0 "LBW and/or premature", modify
          label def bwgst 1 "Normal weight and term", modify
          label values parcat parcat
          label def parcat 0 "3+ other children", modify
          label def parcat 1 "1 or 2 other children", modify
          label values rels rels
          label def rels 0 "Living with baby's father", modify
          label def rels 1 "Not living with baby's father", modify
          label values ghq_75 ghq_75
          label def ghq_75 0 "<75th centile", modify
          label def ghq_75 1 ">=75th centile", modify
          label values regsmk regsmk
          label def regsmk 0 "No", modify
          label def regsmk 1 "Yes", modify
          label values alco alco
          label def alco 0 "No", modify
          label def alco 1 "Yes", modify
          label values read read
          label def read 0 "Once a week or less", modify
          label def read 1 "2-4 days per week", modify
          label def read 2 "5-7 days per week", modify
          label values vaxcat vaxcat
          label def vaxcat 0 "None", modify
          label def vaxcat 1 "1-9", modify
          label def vaxcat 2 "All 10", modify


          I'm baffled! Could it have anything to do with some variables having high levels of missingness? Several of my variables have more than 50% missingness because the questions were only asked of some women. I had understood that missingness on some variables wasn't a problem for LCA/LPA, but maybe it has an influence on the entropy calculation somehow?

          Thanks for your help,

          Laura

          Edit: So the entropy code seems to work fine when I run the same 2 class model with Pakistani origin mothers, giving me an entropy value of 0.5481156. However, a 3 class model with White British mothers gives another bizarre entropy value of -1.6307782.
          Last edited by Laura Brown; 02 Sep 2018, 13:56. Reason: Trying entropy code on different models



          • #20
            Originally posted by Laura Brown:
            ...Could it have anything to do with some variables having high levels of missingness? Several of my variables have more than 50% missingness because the questions were only asked of some women. I had understood that missingness on some variables wasn't a problem for LCA/LPA, but maybe it has an influence on the entropy calculation somehow?

            Thanks for your help,

            Laura

            Edit: So the entropy code seems to work fine when I run the same 2 class model with Pakistani origin mothers, giving me an entropy value of 0.5481156. However, a 3 class model with White British mothers gives another bizarre entropy value of -1.6307782.
            When you estimate an LCA model, item-level missingness (i.e. some people don't respond on some indicators; contrast with casewise missingness, where some people are missing on everything) is OK. Per my understanding, -gsem- will use all information present in each case.

            However, I'm not sure what happens when you use -predict- post-LCA estimation and one of the indicators is missing. If it produces missing predicted probabilities, then consistent with your intuition, I think this would explain your unusual results. You can try predicting class membership probabilities and checking if they're present or not in the cases with any missing indicator.

            If it's true that this is the problem, then the solution I'd suggest is to modify the denominator to reflect the number of cases with non-missing predicted probabilities. You can use the -count- command to count those observations; the result is returned in r(N). Some suggested modified code:

            Code:
            *Entropy:
            quietly predict classpost*, classposteriorpr
            forvalues k = 1/2 {        
            gen sum_p_lnp_`k' = classpost`k'*ln(classpost`k')
            }
            egen sum_p_lnp = rowtotal(sum_p_lnp_*)
            summ sum_p_lnp, meanonly
            local sum = r(sum)
            quietly count sum_p_lnp
            scalar E = 1+`sum'/(r(N)*ln(2))
            drop classpost? sum_p_lnp*
            di E
            Does that resolve the issue?
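
            To see the arithmetic behind this fix, here is a toy calculation (in Python, with made-up numbers, not part of the thread's Stata code): if the summed p*ln(p) terms cover more rows than the denominator's N accounts for, E can fall below 0, which is exactly the symptom reported above.

            ```python
            import math

            # Illustrative only: per-row sum of p*ln(p) for a 2-class model where
            # each observation is classified with probabilities 0.8 / 0.2.
            row = 0.8 * math.log(0.8) + 0.2 * math.log(0.2)   # about -0.5004

            # Suppose probabilities were predicted (and summed) for 200 rows,
            # but the denominator mistakenly uses N = 50.
            total = 200 * row

            E_right = 1 + total / (200 * math.log(2))   # denominator matches rows summed
            E_wrong = 1 + total / (50 * math.log(2))    # denominator too small

            print(E_right)   # about 0.28 -- a plausible entropy
            print(E_wrong)   # about -1.89 -- negative, like the values reported above
            ```

            The magnitude of each row's p*ln(p) sum is at most ln(2) for a 2-class model, so E can only go negative when the denominator's N is smaller than the number of rows actually summed.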



            • #21
              I think Weiwen has the key to the problem. It is probably the missing values causing the e(N) in the original code to be the wrong number.

              But there is an error in the code in #20. -quietly count sum_p_lnp- will produce a syntax error ("varlist not allowed"). I think he means
              Code:
              quietly count if !missing(sum_p_lnp)



              • #22
                Entropy is not necessarily bounded by 0 and 1. It's not a probability. See e.g. https://stats.stackexchange.com/ques.../207093#207093

                For k categories with equal probability (= 1/k), the entropy is maximal: k (1/k) ln[1/(1/k)] = ln k. For k > 2 that is more than 1. The details differ for other bases.

                FWIW, I find it most congenial to define entropy as the (weighted) average of ln (1/p) i.e. the sum of p ln(1/p). Most texts rewrite that even before you see it as first the sum of p (-ln p) and then second - the sum of p ln p. It's immediate from the definition of logarithms that the two are the same, but that doesn't make me like the usual expression. (And, occasionally, the minus sign gets lost, to some minor bewilderment.)
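
                A quick numeric check of the ln k bound (a Python sketch, for readers without Stata; not part of the thread's code):

                ```python
                import math

                def shannon_entropy(p):
                    """Entropy in natural-log units: the weighted average of ln(1/p)."""
                    return sum(pi * math.log(1 / pi) for pi in p if pi > 0)

                for k in (2, 3, 4):
                    uniform = [1 / k] * k
                    # Equal probabilities maximize entropy at ln(k); above 1 once k > 2.
                    print(k, shannon_entropy(uniform), math.log(k))
                ```

                For k = 2 the maximum is ln 2 (about 0.69), but for k = 3 it is ln 3 (about 1.10), so unnormalized entropy is not confined to [0, 1].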



                • #23
                  Originally posted by Weiwen Ng:

                  You can try predicting class membership probabilities and checking if they're present or not in the cases with any missing indicator.
                  I checked and the estimated probabilities are present for all cases, regardless of missingness, so it looks like missingness might not be the issue.

                  I tried Weiwen's and Clyde's combined code modification anyway as:

                  Code:
                  *Entropy:
                              drop classpost* sum_p*
                              quietly predict classpost*, classposteriorpr
                              forvalues k = 1/2 {        
                              gen sum_p_lnp_`k' = classpost`k'*ln(classpost`k')
                              }
                              egen sum_p_lnp = rowtotal(sum_p_lnp_*)
                              summ sum_p_lnp, meanonly
                              local sum = r(sum)
                              quietly count if !missing(sum_p_lnp)
                              scalar E = 1+`sum'/(r(N)*ln(2))
                              drop classpost? sum_p_lnp*
                              di E
                  This gave me an entropy value of 0.36942721 (in contrast to the previous estimate of -0.93447178).

                  Running the three class model with the modified code above (also changing k to 3) yields an entropy value of 0.14245471 (in contrast to the previous estimate of -1.6307782).

                  I am trying to wrap my head around Nick's post and what that means for my results. I am now wondering which of the entropy values is to be trusted. Or Nick, are you suggesting another edit to the code above is needed?

                  Even if the values of entropy don't have to range from 0 to 1, do scores closer to 1 still indicate clearer classifications (as suggested by Silverwood et al., 2011, p. 1409, for example)? That is, how does one interpret entropy values outside the 0 to 1 bound? Is it just a case of ignoring the negative sign and assessing distance from 1? In my case, -0.93447178 is 1.93447178 away from 1 and -1.6307782 is 2.6307782 away from 1, so on that reading would the 2 class model be considered to indicate clearer classifications than the 3 class model? The missingness-adjusted entropy estimates (0.369 vs 0.142) would point the same way on this closeness-to-1 interpretation. I apologise if I have completely missed the mark.

                  My brain hurts and it's 11pm here so I will check back tomorrow morning. Thanks for all of your input thus far, greatly appreciated.

                  Laura

                  Reference:
                  Silverwood, R. J., Nitsch, D., Pierce, M., Kuh, D., & Mishra, G. D. (2011). Characterizing longitudinal patterns of physical activity in mid-adulthood using latent class analysis: results from a prospective cohort study. American journal of epidemiology, 174(12), 1406-1415.



                  • #24
                    I think Nick's post refers to the entropy of a probability distribution, which is a different animal from the entropy of the classification system. The former is calculated as Nick indicates (and as I coded in #2 of this thread).

                    But that is not the statistic that is referred to as the entropy of the classification system (or model). The latter is, indeed, normalized to be between 0 and 1. It is calculated by the code Laura Brown gives in #23*, which incorporates my correction to Weiwen's correction of the code in #7. When the code in #7 was written, neither Weiwen nor I gave any thought to the possibility that there would be missing values, so we incorrectly based the normalization on e(N) instead of on r(N), leading to the problems Laura Brown pointed out in #16.

                    *One correction to the code in #23. As Weiwen pointed out earlier, for the general k-class model, the ln(2) factor needs to be replaced by ln(`k').
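
                    For anyone following along without Stata, the normalized entropy described above (with both corrections: count only rows contributing to the sum, and use ln(k) for a k-class model) can be sketched in Python; the function and variable names here are mine, for illustration only:

                    ```python
                    import math

                    def relative_entropy(posteriors, k):
                        """1 + (sum over used rows of sum_c p*ln(p)) / (N * ln(k)),
                        where N counts only the rows actually contributing to the
                        sum -- the missing-value fix discussed in this thread."""
                        total, n = 0.0, 0
                        for row in posteriors:
                            if any(p is None for p in row):   # skip rows with missing probabilities
                                continue
                            total += sum(p * math.log(p) for p in row if p > 0)
                            n += 1
                        return 1 + total / (n * math.log(k))

                    # Perfect class separation gives 1; complete uncertainty gives 0.
                    print(relative_entropy([[1.0, 0.0], [0.0, 1.0]], k=2))   # 1.0
                    print(relative_entropy([[0.5, 0.5], [0.5, 0.5]], k=2))   # 0.0
                    ```

                    With both the row count and the ln(k) factor matched to the model, the statistic stays in [0, 1] as intended.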



                    • #25
                      Originally posted by Clyde Schechter:
                      I think Nick's post refers to the entropy of a probability distribution, which is a different animal from the entropy of the classification system. The former is calculated as Nick indicates (and as I coded in #2 of this thread).

                      But that is not the statistic that is referred to as the entropy of the classification system (or model). The latter is, indeed, normalized to be between 0 and 1. It is calculated by the code Laura Brown gives in #23*, which incorporates my correction to Weiwen's correction of the code in #7. When the code in #7 was written, neither Weiwen nor I gave any thought to the possibility that there would be missing values, so we incorrectly based the normalization on e(N) instead of on r(N), leading to the problems Laura Brown pointed out in #16.

                      *One correction to the code in #23. As Weiwen pointed out earlier, for the general k-class model, the ln(2) factor needs to be replaced by ln(`k').
                      Folks, thanks for catching the issues. Nick is correct about entropy as defined in his linked post. As Clyde pointed out, we were discussing normalized entropy above, which is bounded by 0 and 1 as we defined it.

                      Laura, if your entropy calculations are correct, then 0.37 and 0.14 are very low values of entropy, which means a very low degree of class separation. Imagine you had a perfect set of indicators, such that your LCA model was able to say that each person had a 0% probability of being in class 1 and a 100% probability of being in class 2 and vice versa. That's a (normalized) entropy of 1. How do you get there? Say you have 6 indicators. If class 1 was very high on indicators 1-3 and very low on indicators 4-6, and class 2 was exactly the reverse, then this would result in very high classification certainty and very high entropy. In real life, I don't think we have a lot of situations like that.

                      The code you typed appears correct (but I shall review when I can get to a computer with Stata 15). If you're right that missingness on an indicator doesn't preclude predicted class membership probabilities, then it could be that missingness still reduces classification certainty. If item missingness were really pervasive, then that could result in low entropy values. Can you give us a sense of what percent of each indicator is missing? You may already know this, but you can use

                      Code:
                      misstable summarize bfd menarche afb mwtkg everbf activities affection bwgst parcat rels ghq_75 regsmk alco read vaxcat



                      • #26
                        Thanks for clarifying Weiwen and Clyde, this is starting to make more sense.

                        Originally posted by Weiwen Ng:

                        Laura, if your entropy calculations are correct, then 0.37 and 0.14 are very low values of entropy, which means a very low degree of class separation
                        A low degree of class separation is not necessarily a bad thing in this instance as I am trying to show that people do not split neatly into two reproductive strategies (as is often assumed in some evolutionary psychology applications of life history theory).

                        Originally posted by Clyde Schechter:
                        *One correction to the code in #23. As Weiwen pointed out earlier, for the general k-class model, the ln(2) factor needs to be replaced by ln(`k').
                        I re-ran the 3 class model replacing ln(2) with ln(3) as per Clyde's suggestion above (NB: ln(`k') did not work, resulting in "invalid syntax"). This gave me 0.45894916 as my new entropy value.

                        Originally posted by Weiwen Ng:
                        If you're right that missingness on an indicator doesn't preclude predicted class membership probabilities, then it could be that missingness still reduces classification certainty. If item missingness were really pervasive, then that could result in low entropy values. Can you give us a sense of what percent of each indicator is missing?
                        Missingness doesn't preclude class membership probabilities. I ran the following code to check this and class membership is predicted for all 3,938 White British women in my sample:
                        Code:
                            
                                predict cpost* if ethnic==1, classposteriorpr
                                egen max = rowmax(cpost*) if ethnic==1
                                gen predclass_w=1 if cpost1==max & ethnic==1
                                replace predclass_w=2 if cpost2==max & ethnic==1
                                tab predclass_w
                                summ cpost1
                                summ cpost2
                        Item missingness is, however, very high, as shown by the results of -mdesc- below:
                        Code:
                            Variable   |  Missing    Total   Percent Missing
                            -----------+--------------------------------
                            bfd        |    3,072    3,938        78.01
                            menarche   |      187    3,938         4.75
                            afb        |    1,945    3,938        49.39
                            mwtkg      |      146    3,938         3.71
                            everbf     |       80    3,938         2.03
                            activities |    3,156    3,938        80.14
                            affection  |    3,438    3,938        87.30
                            bwgst      |       10    3,938         0.25
                            parcat     |        0    3,938         0.00
                            rels       |        5    3,938         0.13
                            ghq_75     |      590    3,938        14.98
                            regsmk     |        3    3,938         0.08
                            alco       |        7    3,938         0.18
                            read       |    3,498    3,938        88.83
                            vaxcat     |    3,167    3,938        80.42
                        With such high item missingness, does that then mean that the entropy values are less accurate, reflecting a missingness issue rather than a criticism of the tested classification?

                        Thanks,

                        Laura

                        Last edited by Laura Brown; 03 Sep 2018, 04:20. Reason: Reran 3 class model with Clyde's ln(`k') correction



                        • #27
                          Originally posted by Laura Brown:
                          Thanks for clarifying Weiwen and Clyde, this is starting to make more sense.


                          A low degree of class separation is not necessarily a bad thing in this instance as I am trying to show that people do not split neatly into two reproductive strategies (as is often assumed in some evolutionary psychology applications of life history theory).

                          ...

                          With such high item missingness, does that then mean that the entropy values are less accurate, reflecting a missingness issue rather than a criticism of the tested classification?

                          Thanks,

                          Laura
                          Kathryn Masyn's chapter in the Oxford Handbook of Quant Methods, which is quoted in Stata's latent class manual, does not recommend that entropy be used for model selection. It's a descriptive measure of how well-separated the classes are. So, you're correct: if that low entropy is correct, it's not a criticism of the model. You would still want to select the number of classes with the lowest BIC. That said,

                          My suspicion is that with that high a rate of missingness, the classification certainty for observations with missing values will be lower: their predicted probabilities are derived from less information. I'll need to simulate some data to confirm this. I'm going to take Stata's stock dataset for LCA, run a model and predict probabilities, then randomly knock out 20% of the responses on each indicator and predict probabilities again. I'll report back, but if anybody wants to beat me to the punch, please feel free to do so.

                          Side note: You said that some of the missingness may be by design rather than at random. I hope you've considered what implications that may have for your class enumeration. I'm not sure I am qualified to advise on that!



                          • #28
                            Some code. I'm going to fit a 2-class LCA on the full data. Then, I'll create a dataset where the same indicators become missing completely at random (MCAR): each indicator has a 20% chance of being converted to missing.

                            Code:
                             use http://www.stata-press.com/data/r15/gsem_lca1
                             set seed 1000
                             
                             * 2-class LCA on the complete data
                             gsem (accident play insurance stock <- ), logit lclass(C 2)
                             predict probfull*, classposteriorpr
                             
                             * knock out roughly 20% of each indicator, completely at random
                             foreach v in accident play insurance stock {
                                 gen `v'_miss = `v'
                                 replace `v'_miss = . if runiform() > .8
                             }
                             
                             * refit the same model on the indicators with missing values
                             gsem (*_miss <- ), logit lclass(C 2)
                             predict probmiss*, classposteriorpr
                             
                             sum probfull* probmiss*
                                Variable |        Obs        Mean    Std. Dev.       Min        Max
                            -------------+---------------------------------------------------------
                               probfull1 |        216    .7207539    .3778755   .0410183    .999975
                               probfull2 |        216    .2792461    .3778755    .000025   .9589816
                               probmiss1 |        216    .7503538    .3818407   .0350281          1
                               probmiss2 |        216    .2496462    .3818407   7.41e-14    .964972
                             Indeed, when some indicator values were set to missing at random, the standard deviation of the predicted probabilities increased. Now, let's check how entropy changes:

                            Code:
                             * per-observation contributions p*ln(p), summed across classes
                             forvalues k = 1/2 {
                                 gen sum_p_lnp_full_`k' = probfull`k'*ln(probfull`k')
                                 gen sum_p_lnp_miss_`k' = probmiss`k'*ln(probmiss`k')
                             }
                             egen sum_p_lnp_full = rowtotal(sum_p_lnp_full_*)
                             egen sum_p_lnp_miss = rowtotal(sum_p_lnp_miss_*)
                             
                             * relative entropy: E = 1 + sum(p*ln p) / (N*ln(K)), here K = 2
                             quietly summarize sum_p_lnp_full, meanonly
                             local sum = r(sum)
                             quietly count if sum_p_lnp_full != .
                             scalar E_full = 1 + `sum'/(r(N)*ln(2))
                             
                             quietly summarize sum_p_lnp_miss, meanonly
                             local sum = r(sum)
                             quietly count if sum_p_lnp_miss != .
                             scalar E_miss = 1 + `sum'/(r(N)*ln(2))
                            
                            . display E_full
                            .71929768
                            
                            . display E_miss
                            .79915984
                             Well, it looks like my intuition was wrong for this example! The LCA model fit with some indicator variables MCAR actually has higher entropy. Looking at observations 2 and 5 may give some insight into why. (Note: because the code sets the random-number seed, you should get exactly the same results as I did despite the random knockout of indicator values, so your observations 2 and 5 will be the same as mine.)

                            Observation 5 had all 4 indicators changed to missing. That observation's predicted membership probabilities in complete data for class 1 and 2 were 0.999975 and 0.000025 respectively. In the LCA on the indicator variables that contain missing, that observation's predicted membership probabilities change to 0.9679506 and 0.0320494 respectively. That's more classification uncertainty. My intuition was correct for this observation.

                             Observation 2, however, had no indicators changed to missing. In the complete data, its predicted membership probabilities are the same as observation 5's (it has the same response pattern). In the LCA on the MCAR data, its predicted membership probabilities become more certain, not less: they change to 1 and 7.41 * 10^-14. There are far more cases like observation 2 than like observation 5.

                            Where does that leave Laura? I'm not sure! Judging from entropy scores I had seen in real life (my own data plus papers I've read), my initial intuition was that entropy scores of 0.3 and below were very, very low. I've typically seen scores of 0.6 or higher in my work where most of the latent classes in the model were fairly close. I haven't seen papers reporting entropy scores of 0.3 or below. So, either there was an error stemming from some programming issue that we hadn't anticipated, or there's really that much classification uncertainty in the model! Laura, if you look at the -estat lcmeans- output from your models, I think you should see that the class-specific means for all your indicators are very close together. I would also fit a 1-class LCA model, and compare its BIC to the other models you fit.

                            Heuristically, in LCA, you propose that your data can best be explained by k classes that are homogeneous with respect to the indicators you specified. If the best fitting model has k = 1, then basically you have a homogeneous sample. If k is 2, then you've got a heterogeneous sample, and here are the characteristics of the two classes you think are present, and so on.
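                             To sketch that comparison on Stata's stock LCA dataset (an untested sketch; fitting the indicators without -lclass()- is equivalent to a one-class model, since each indicator is then fit independently):

                             Code:
                             use http://www.stata-press.com/data/r15/gsem_lca1, clear
                             * one-class equivalent: indicators fit independently, no latent class
                             gsem (accident play insurance stock <- ), logit
                             estimates store c1
                             * two-class LCA
                             gsem (accident play insurance stock <- ), logit lclass(C 2)
                             estimates store c2
                             * AIC and BIC side by side
                             estimates stats c1 c2
                             If the one-class model had the lowest BIC, that would be evidence that the sample is homogeneous with respect to these indicators.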

                            Comment


                            • #29
                              Thanks for looking into this some more, Weiwen.

                              Originally posted by Weiwen Ng View Post

                              Side note: You said that some of the missingness may be by design rather than at random. I hope you've considered what implications that may have for your class enumeration. I'm not sure I am qualified to advise on that!
                              My missingness comes from some questions only being asked in follow-up sub-cohorts rather than in the main cohort. I will try running models restricted to women in these follow-up sub-cohorts as a sensitivity analysis to see how that affects my results.

                              Originally posted by Weiwen Ng View Post
                              Laura, if you look at the -estat lcmeans- output from your models, I think you should see that the class-specific means for all your indicators are very close together. I would also fit a 1-class LCA model, and compare its BIC to the other models you fit.
                              Whilst the means for continuous vars (and probabilities of being in different categories for categorical vars) are relatively close together, the confidence intervals for the two classes do not overlap for the majority of indicators (everbf, activities, rels, ghq_75, regsml, alco, one category of read, and one category of vaxcat, bfd, afb and mwtkg) suggesting that there are clear differences in these traits between the two groups and that distinctive profiles are identifiable.

                              In terms of comparing BICs, I have been using the SS-BIC code you proposed in #43 in another thread:

                              Code:
                              scalar SSBIC_class_2 = -2 * e(ll) + e(rank) * ln((e(N)+2) / 24)
                              di SSBIC_class_2
                              Perhaps now is a good time to check whether there are any adjustments that need to be made to the code for different numbers of classes? Or is it always the same? Where does the 24 come from?

                              Assuming the above SS-BIC code is correct, these are the values I get for the different models:

                              # classes      AIC         BIC        SS-BIC     Entropy
                                  1       97738.21    97870.05    97803.33        .
                                  2       96614.32    96859.18    96735.26    0.395577*
                                  3       95999.71    96357.58    96176.46    0.563029*
                                  4**     95847.83    96312.43    96077.30    0.548274
                                  5**     95855.37    96432.98    96140.65    0.538238
                              *These entropy values are slightly different from what I reported in earlier posts as one of the indicator variables, afb, has been modified slightly to be more accurate.
                              **Would only converge using the nonrtolerance option



                              Based on the statistics above, and given the convergence issues with the 4- and 5-class models, the 3-class model looks like the best fit: among the models that converged normally, it has the lowest AIC, BIC, and SS-BIC values.

                              I am also going to see if I can calculate the Lo-Mendell-Rubin Likelihood Ratio Test (LMR-LRT) of goodness of fit (as per your posts here) to compare the models with different classes further…and then see if a similar story plays out for all fit statistics in the other ethnic group and in the sub-cohort restricted sensitivity analyses.


                              Thanks for all your input!

                              Laura





                              Comment


                              • #30
                                Originally posted by Laura Brown View Post
                                ...
                                My missingness comes from some questions only being asked in follow-up sub-cohorts rather than in the main cohort. I will try running models restricted to women in these follow-up sub-cohorts as a sensitivity analysis to see how that affects my results.



                                Whilst the means for continuous vars (and probabilities of being in different categories for categorical vars) are relatively close together, the confidence intervals for the two classes do not overlap for the majority of indicators (everbf, activities, rels, ghq_75, regsml, alco, one category of read, and one category of vaxcat, bfd, afb and mwtkg) suggesting that there are clear differences in these traits between the two groups and that distinctive profiles are identifiable.

                                In terms of comparing BICs, I have been using the SS-BIC code you proposed in #43 in another thread:

                                Code:
                                scalar SSBIC_class_2 = -2 * e(ll) + e(rank) * ln((e(N)+2) / 24)
                                di SSBIC_class_2
                                Perhaps now is a good time to check whether there are any adjustments that need to be made to the code for different numbers of classes? Or is it always the same? Where does the 24 come from?

                                Assuming the above SS-BIC code is correct, these are the values I get for the different models:

                                # classes      AIC         BIC        SS-BIC     Entropy
                                    1       97738.21    97870.05    97803.33        .
                                    2       96614.32    96859.18    96735.26    0.395577*
                                    3       95999.71    96357.58    96176.46    0.563029*
                                    4**     95847.83    96312.43    96077.30    0.548274
                                    5**     95855.37    96432.98    96140.65    0.538238
                                *These entropy values are slightly different from what I reported in earlier posts as one of the indicator variables, afb, has been modified slightly to be more accurate.
                                **Would only converge using the nonrtolerance option



                                ...
                                So, for sample size adjusted BIC, I think the adjustment may be more applicable in smaller samples. I think relying on BIC alone is fine. I've cited Nylund et al elsewhere, but their simulation study showed that the LMR LR test had a high false positive rate in simulated data with a complex structure (very unequal class sizes, some indicators don't distinguish classes well, some classes close together). I have a feeling that your proposed class structure is complex by at least some criteria. Hence, I'd recommend omitting the LMR test entirely.
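                                On the specific question of where the 24 comes from: the sample-size-adjusted BIC is just the ordinary BIC with N replaced by Sclove's (1987) adjusted sample size n* = (N+2)/24, that is,

                                SSBIC = -2*ln(L) + k * ln((N+2)/24)

                                where k is the number of free parameters, which e(rank) returns. The 24 is part of Sclove's adjustment and does not depend on the number of classes; only e(ll), e(rank), and e(N) change from model to model, so the same line of code works unchanged for any number of classes.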

                                I generally recommend that if you're unable to get a model to converge without -nonrtolerance-, you treat it as not having converged at all. Also examine whether any of the binary indicators have logit intercepts near +/-15; if so, it is justifiable to constrain them at +15 or -15 and refit (an intercept of +/-15 corresponds to a predicted probability of essentially 1 or 0 for that indicator within the class).

                                Entropy scores around 0.5 are more reasonable. I was worried that an entropy of 0.3 or lower was almost implausibly low.

                                Last, I hate to make your life more complicated, but properly, you do need to explore different variance-covariance structures for the continuous indicators, as Masyn recommends in the chapter quoted in Stata's SEM examples. (I think I've cited this chapter elsewhere in this thread.) The relevant options are -lcinvariant(none)- and -covstructure(e._OEn, unstructured)-. (Note, check the spellings against the manual; I'm going from memory on the covariance structure option.) The former allows the error variances to differ across latent classes, and the latter allows the error terms of the continuous indicators to be correlated within class (the default is uncorrelated). Heuristically, for the former option: the default is as if you're making k stamps with the same cookie cutter, while -lcinvariant(none)- lets you make the k stamps with cookie cutters of different sizes. The SEM example for LPA does allude to this, although it could be more explicit that varying these structures is recommended practice.
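                                As a syntax sketch only (y1, y2, and y3 are placeholder continuous indicators; please verify the option spellings against -help gsem-, since I'm going from memory):

                                Code:
                                * class-varying error variances, correlated errors within class
                                gsem (y1 y2 y3 <- _cons), lclass(C 3) ///
                                    lcinvariant(none) covstructure(e._OEn, unstructured)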

                                Comment
