Latent class logit proportions

Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#16

21 Jan 2021, 15:50

Let's get to entropy. This is the economist version of LCA, so there's one feature I hadn't appreciated: you actually have multiple observations per person, corresponding to their choice set. Hence, the links I gave would have been wrong. Normally, we have only one observation per entity.

Basically, for each observation, you have their predicted probability of class membership. You then need to calculate -1*p*ln(p) for each latent class. You then take the sum of those values over the full dataset (i.e. each person has k values, one for each of the k latent classes, to sum over, and you have n persons). The forum doesn't allow me to write algebra, but you basically divide that by n*ln(k), then you take 1 minus that. See the answer here for the proper algebraic representation.
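For reference, the verbal recipe above can be written out roughly as follows. The notation is an assumption added here, not taken from the linked answer: \hat{p}_{ic} is person i's posterior probability of belonging to class c, n is the number of persons, and k is the number of latent classes.

```latex
E = 1 - \frac{\sum_{i=1}^{n} \sum_{c=1}^{k} \left( -\hat{p}_{ic} \ln \hat{p}_{ic} \right)}{n \ln k}
```

Values close to 1 mean that most persons are assigned to a single class with high probability. With k = 4 classes, the denominator is n*ln(4).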

I'm assuming you have Stata 16, because the code works in a new frame. We're going to fit an lclogit2 model, collapse the data to keep just one line per person, and then calculate entropy. Now, I believe that the option cp is the correct one; I hope Hong Il will correct me if I'm wrong. Basically, we just want each person's posterior probability of being in each latent class. Also, it looks like the variable pid in the stock data identifies a unique person. Again, this needs to be corrected if it's wrong.

Code:

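use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
lclogit2 y, rand(price contract local wknown tod seasonal) id(pid) group(gid) nclasses(4)
estimates store class4
scalar N = e(N_i)
* posterior class membership probabilities, one variable per class
lclogitpr2 p, cp
* copy the data into a new frame (frame create alone would give an empty frame)
frame copy default entropy4
frame change entropy4
* keep one line per person
collapse (max) p?, by(pid)
foreach k of numlist 1/4 {
    gen plnp_`k' = -1 * p`k' * ln(p`k')
}
* rowtotal() treats missing as 0, which covers the p = 0 case
egen sum = rowtotal(plnp_?)
total sum
mat b = e(b)
* e(N) after -total- is the number of persons; divide by n*ln(k), so the parentheses matter
di 1 - b[1,1] / (e(N)*ln(4))
* The original post displayed 1 - b[1,1] / e(N)*ln(4), without the parentheses, and reported .85134748
*Note: if different number of latent classes, replace ln(4) with the correct number
frame change default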

So, that appears to be a high value of entropy. If you scroll down the list in the frame entropy4, you'll see that the highest class probability is generally very high.

The commentary by Bengt Muthen that I referenced earlier is here.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Robert Asiimwe

Join Date: Mar 2021

Posts: 4
#17

25 Mar 2021, 04:37

Thank you all, this thread has been really useful to me.

I am working with a DCE (1022 households, each answering 7 choice sets out of 84 choice sets). I am using the lclogit2 command.

1. I have managed to find a seed that works for 2 and 3 classes, but the model fails to converge with 4 or more classes. Is there any way around this other than changing the seed and simplifying the model? The model will not converge even when I remove the membership option altogether. I don't want to work with fewer attribute levels, but if that could work I would need literature to defend it.

2. Unfortunately, the lclogitml2 command has failed to produce results. It returns the error message "Hessian is not positive semidefinite". I was wondering if anyone has encountered this and knows a way around it.

3. The total number of rows in my data is 21,462. Could this be the reason the model does not work well?

Thank you for your support.
Hong Il Yoo

Join Date: Jan 2015

Posts: 292
#18

26 Mar 2021, 05:27

Robert Asiimwe: Ultimately you cannot estimate more parameters than what your data allow you to estimate. If your 4-class model fails to converge even after attempting multiple seeds, you should accept that you cannot estimate as many as 4 classes given your data.
Gabin Morillon

Join Date: Jan 2021

Posts: 23
#19

16 Aug 2021, 12:46

Originally posted by Hong Il Yoo View Post

You're estimating two different model specifications with -lclogit- and -lclogit2-. Your -lclogit- specification specifies a random coefficient on the status quo. Your -lclogit2- specification specifies a fixed coefficient on the status quo: To make the two procedures comparable, you should move -statusquo- into -rand(.)- and you'll see that -lclogit2- runs faster.

As explained in the first paragraph of p.418 of the background paper, the EM algorithm slows down when you include a fixed coefficient in the model specification. As advised in the same paragraph, I'd like to suggest that you estimate the unrestricted model using -lclogit2- (i.e., -lclogit2 choice, rand(statusquo A1L1 ...-) and then use the results as starting values for the constrained model that you estimate using -lclogitml2 choice statusquo, rand(A1L1 ...-.

Hi,

I am running a new DCE, and I wanted to know how I can use the estimates from the -lclogit2- command as starting values for the -lclogitml2- command. Your answers were very helpful, thanks a lot!

Gabin.
Hong Il Yoo

Join Date: Jan 2015

Posts: 292
#20

16 Aug 2021, 14:15

Originally posted by Gabin Morillon View Post

Hi,

I am running a new DCE, and I wanted to know how I can use the estimates from the -lclogit2- command as starting values for the -lclogitml2- command. Your answers were very helpful, thanks a lot!

Gabin.

Thanks for your kind words! You can refer to the example on pp. 414-415 in the background paper for -lclogit2- (https://journals.sagepub.com/doi/pdf...36867X20931003). I use -matrix start = e(b)- to save the -lclogit2- coefficient estimates in a row vector, and then specify the -from(start)- option in -lclogitml2- to start the numerical optimisation from that vector.
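A minimal sketch of that workflow, assuming the traindata example and the 4-class specification used earlier in this thread rather than your own DCE data. The variable names, and the assumption that -lclogitml2- takes the same id(), group() and nclasses() options as -lclogit2-, should be checked against the help files.

```stata
* Sketch only: data and variable names are taken from the earlier traindata example
use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear

* Step 1: fit the model with the EM algorithm
lclogit2 y, rand(price contract local wknown tod seasonal) id(pid) group(gid) nclasses(4)

* Step 2: save the coefficient estimates in a row vector
matrix start = e(b)

* Step 3: start the numerical optimisation from that vector
lclogitml2 y, rand(price contract local wknown tod seasonal) id(pid) group(gid) nclasses(4) from(start)
```

Here both commands use the same specification, so every element of start has a matching parameter in the second model.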
Gabin Morillon

Join Date: Jan 2021

Posts: 23
#21

16 Aug 2021, 15:59

Thanks for your answer !

Originally posted by Hong Il Yoo View Post

You're estimating two different model specifications with -lclogit- and -lclogit2-. Your -lclogit- specification specifies a random coefficient on the status quo. Your -lclogit2- specification specifies a fixed coefficient on the status quo: To make the two procedures comparable, you should move -statusquo- into -rand(.)- and you'll see that -lclogit2- runs faster.

As explained in the first paragraph of p.418 of the background paper, the EM algorithm slows down when you include a fixed coefficient in the model specification. As advised in the same paragraph, I'd like to suggest that you estimate the unrestricted model using -lclogit2- (i.e., -lclogit2 choice, rand(statusquo A1L1 ...-) and then use the results as starting values for the constrained model that you estimate using -lclogitml2 choice statusquo, rand(A1L1 ...-.

As you recommended a few months ago, I used your specification. However, I get this error message:

"initial vector: extra parameter Class1:statusquo found
specify skip option if necessary"
Hong Il Yoo

Join Date: Jan 2015

Posts: 292
#22

16 Aug 2021, 16:58

Gabin Morillon: You can try -from(start, skip)-.
Gabin Morillon

Join Date: Jan 2021

Posts: 23
#23

19 Aug 2021, 14:02

Hi,

Thanks for the answer! It works, but I have some difficulties with the model. If I used a seed option and an iterate option with the -lclogit2- command, should I use the same specifications when I run the -lclogitml2- command with the starting values from the -lclogit2- command?
Hong Il Yoo

Join Date: Jan 2015

Posts: 292
#24

20 Aug 2021, 03:45

Gabin Morillon: The -seed(#)- option is irrelevant to -lclogitml2- when you use the -from()- option, because your own starting values are used instead of randomly generated starting values. Whether you specify the -iterate(#)- option or not depends on your own preferences. If you don't specify -iterate(#)- explicitly, -lclogitml2- will use the same default number of iterations as other MLE commands in Stata.