Let's get to entropy. This is the economist's version of LCA, and it has one feature I hadn't appreciated: you have multiple observations per person, corresponding to their choice set, whereas normally we have only one observation per entity. Hence, the links I gave would have been wrong.
Basically, for each person, you have their predicted (posterior) probability of membership in each latent class. You then need to calculate -1*p*ln(p) for each latent class. You then take the sum of those values over the full dataset (i.e. each person has k values to sum over, one for each of the k latent classes, and you have n persons). The forum doesn't allow me to write algebra, but you basically divide that sum by n*ln(k), then take 1 minus the result. See the answer here for the proper algebraic representation.
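Written out as LaTeX source, the calculation described above would look something like the following. This is a sketch of the usual relative entropy statistic rather than a quotation from any particular reference. The symbols are my own labels: \hat{p}_{ik} is person i's posterior probability of being in class k, n is the number of persons, and K is the number of latent classes (the k in ln(k) above).

```latex
% Relative entropy for a K-class model fitted to n persons.
% Convention: a term with \hat{p}_{ik} = 0 contributes 0 to the sum.
E = 1 - \frac{\sum_{i=1}^{n} \sum_{k=1}^{K} \left( -\hat{p}_{ik} \ln \hat{p}_{ik} \right)}{n \ln K}
```

Two limiting cases make the scale easier to read: if every person is assigned to one class with probability 1, the numerator is 0 and E = 1; if every person has probability 1/K for every class, the numerator equals n*ln(K) and E = 0.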
I'm assuming you have Stata 16 or later, because the code creates a new frame. We fit an lclogit model, collapse the data to keep just one line per person, then calculate entropy. I believe the option cp is the correct one; I hope Hong Il will correct me if I'm wrong. Basically, we just want each person's posterior probability of being in each latent class. Also, it looks like the variable pid in the stock data identifies a unique person; again, please correct me if that's wrong.
The entropy value from the code below appears to be high. If you scroll down the list in the frame entropy4, you'll see that each person's highest class probability is generally very high.
The commentary by Bengt Muthen that I referenced earlier is here.
Code:
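* Fit a 4-class latent class conditional logit model on the stock data
use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
lclogit2 y, rand(price contract local wknown tod seasonal) id(pid) group(gid) nclasses(4)
estimates store class4
scalar N = e(N_i)

* Posterior class membership probabilities (p1-p4)
lclogitpr2 p, cp

* Work on a copy of the data in a new frame, keeping one line per person
frame copy default entropy4
frame change entropy4
collapse (max) p?, by(pid)

* -p*ln(p) for each class, summed within person and then over persons
foreach k of numlist 1/4 {
    gen plnp_`k' = -1 * p`k' * ln(p`k')
}
egen sum = rowtotal(plnp_?)
total sum
mat b = e(b)

* Entropy: 1 minus the total divided by n*ln(k); the parentheses matter
di 1 - b[1,1] / (e(N) * ln(4))
* Note: without the parentheses, i.e. b[1,1] / e(N)*ln(4), Stata evaluates left to right and displays .85134748
* Note: with a different number of latent classes, replace the 4 in ln(4) with that number

frame change default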