
  • #16
    Let's get to entropy. This is the economist version of LCA, so there's one feature I hadn't appreciated: you actually have multiple observations per person, corresponding to their choice set. Hence, the links I gave would have been wrong. Normally, we have only one observation per entity.

    Basically, for each observation, you have their predicted probability of class membership. You then need to calculate -1*p*ln(p) for each latent class. You then take the sum of those values over the full dataset (i.e. each person has k values, one for each of the k latent classes, to sum over, and you have n persons). The forum doesn't allow me to write algebra, but you basically divide that by n*ln(k), then you take 1 minus that. See the answer here for the proper algebraic representation.
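
    In LaTeX notation, the calculation described above can be sketched as follows. The symbols are assumptions introduced here for illustration: p_{ic} is person i's posterior probability of belonging to latent class c, n is the number of persons, and k is the number of latent classes.

    ```latex
    E = 1 - \frac{\sum_{i=1}^{n} \sum_{c=1}^{k} \left( -p_{ic} \ln p_{ic} \right)}{n \ln k}
    ```

    Values near 1 mean most people are assigned to one class with high probability; values near 0 mean the class probabilities are close to uniform.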

    Assuming you have Stata 16 (needed because we create a new frame), we fit an lclogit2 model, collapse the data to one line per person, and then calculate entropy. I believe that the option cp is the correct one; I hope Hong Il will correct me if I'm wrong. Basically, we just want each person's posterior probability of being in each latent class. Also, it looks like the variable pid in the stock data identifies a unique person; again, this needs to be corrected if it's wrong.

    Code:
    use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
    lclogit2 y, rand(price contract local wknown tod seasonal) id(pid) group(gid) nclasses(4)
    estimates store class4
    scalar N = e(N_i)
    lclogitpr2 p, cp
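    * copy the data (including p1-p4) into the new frame; frame create alone would make an empty frame
    frame copy default entropy4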
    frame change entropy4
    collapse (max) p?, by(pid)
    foreach k of numlist 1/4 {
        gen plnp_`k' = -1 * p`k' * ln(p`k')
    }
    egen sum = rowtotal(plnp_?)
    total sum
    mat b = e(b)
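    * divide by n*ln(k); the parentheses keep ln(4) in the denominator
    di 1 - b[1,1] / (e(N) * ln(4))
    * Value reported in the original post, from the expression without those parentheses: .85134748
    * Note: with a different number of latent classes, replace ln(4) with ln(number of classes)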
    frame change default
    So, that appears to be a high value of entropy. If you scroll down the list in the frame entropy4, you'll see that the highest class probability is generally very high.
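
    Instead of scrolling through the frame, the same check can be sketched in a few lines. This assumes the frame entropy4 from the code above still exists, with one row per pid and the variables p1-p4; the name pmax and the 0.7 cutoff are arbitrary choices for illustration.

    ```stata
    frame entropy4 {
        * highest posterior class probability for each person
        egen pmax = rowmax(p?)
        * distribution of that highest probability across persons
        summarize pmax, detail
        * persons whose best class is not clearly dominant (0.7 is an arbitrary cutoff)
        count if pmax < 0.7
    }
    ```

    If most values of pmax are close to 1, that is consistent with a high entropy statistic.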

    The commentary by Bengt Muthen that I referenced earlier is here.
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #17
      Thank you all, this thread has been really useful to me.

      I am working with a DCE (1,022 households, each answering 7 choice sets out of 84). I am using the lclogit2 command.

      1. I have managed to find a seed that works for 2 and 3 classes, but the model fails to converge from 4 classes onwards. Is there any way around this besides changing the seed and simplifying the model? The model will not converge even when I remove the membership option altogether. I don't want to work with fewer attribute levels, but if that could work I will need literature to defend it.

      2. Unfortunately, the lclogitml2 command has also failed to produce results. It returns the error message "Hessian is not positive semidefinite". I was wondering if anyone has encountered this and knows a way around it.

      3. The total number of rows in my data is 21,462. Could this be the reason the model does not work well?


      Thank you for your support



      • #18
        Robert Asiimwe: Ultimately you cannot estimate more parameters than what your data allow you to estimate. If your 4-class model fails to converge even after attempting multiple seeds, you should accept that you cannot estimate as many as 4 classes given your data.



        • #19
          Originally posted by Hong Il Yoo

          You're estimating two different model specifications with -lclogit- and -lclogit2-. Your -lclogit- specification specifies a random coefficient on the status quo. Your -lclogit2- specification specifies a fixed coefficient on the status quo: To make the two procedures comparable, you should move -statusquo- into -rand(.)- and you'll see that -lclogit2- runs faster.

          As explained in the first paragraph of p.418 of the background paper, the EM algorithm slows down when you include a fixed coefficient in the model specification. As advised in the same paragraph, I'd like to suggest that you estimate the unrestricted model using -lclogit2- (i.e., -lclogit2 choice, rand(statusquo A1L1 ...-) and then use the results as starting values for the constrained model that you estimate using -lclogitml2 choice statusquo, rand(A1L1 ...-.
          Hi,

          I am running a new DCE, and I would like to know how I can use the estimates from the -lclogit2- command as starting values for the -lclogitml2- command. Your answers were very helpful, thanks a lot!

          Gabin.



          • #20
            Originally posted by Gabin Morillon

            Hi,

            I am running a new DCE, and I would like to know how I can use the estimates from the -lclogit2- command as starting values for the -lclogitml2- command. Your answers were very helpful, thanks a lot!

            Gabin.
            Thanks for your kind words! You can refer to the example on pp. 414-415 in the background paper for -lclogit2- (https://journals.sagepub.com/doi/pdf...36867X20931003). I use -matrix start = e(b)- to save the -lclogit2- coefficient estimates in a row vector, and then specify the -from(start)- option in -lclogitml2- to start the numerical optimisation from that vector.
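
            A minimal sketch of that workflow, using the traindata example from post #16 rather than the poster's own model. The two-class specification is an assumption chosen for illustration, not the example from the paper.

            ```stata
            * Step 1: fit the model by EM with -lclogit2- (variable names from traindata.dta)
            use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
            lclogit2 y, rand(price contract local wknown tod seasonal) id(pid) group(gid) nclasses(2)

            * Step 2: save the coefficient estimates in a row vector
            matrix start = e(b)

            * Step 3: start the ML optimisation from that vector
            lclogitml2 y, rand(price contract local wknown tod seasonal) id(pid) group(gid) nclasses(2) from(start)
            ```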



            • #21
              Thanks for your answer !

              Originally posted by Hong Il Yoo

              You're estimating two different model specifications with -lclogit- and -lclogit2-. Your -lclogit- specification specifies a random coefficient on the status quo. Your -lclogit2- specification specifies a fixed coefficient on the status quo: To make the two procedures comparable, you should move -statusquo- into -rand(.)- and you'll see that -lclogit2- runs faster.

              As explained in the first paragraph of p.418 of the background paper, the EM algorithm slows down when you include a fixed coefficient in the model specification. As advised in the same paragraph, I'd like to suggest that you estimate the unrestricted model using -lclogit2- (i.e., -lclogit2 choice, rand(statusquo A1L1 ...-) and then use the results as starting values for the constrained model that you estimate using -lclogitml2 choice statusquo, rand(A1L1 ...-.
              As you recommended a few months ago, I used your specification. However, I get this error message:

              "initial vector: extra parameter Class1:statusquo found
              specify skip option if necessary"



              • #22
                Gabin Morillon: You can try -from(start, skip)-.
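
                A sketch of where the skip suboption goes, again on the traindata example from post #16. Treating price as the fixed-coefficient variable is an illustrative assumption that mirrors the role of statusquo in the quoted advice.

                ```stata
                * unrestricted model: price is inside rand(), so e(b) contains Class#:price entries
                use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
                lclogit2 y, rand(price contract local wknown tod seasonal) id(pid) group(gid) nclasses(2)
                matrix start = e(b)

                * constrained model: price now has a fixed coefficient, so the Class#:price
                * entries in start have no counterpart; skip ignores them instead of erroring out
                lclogitml2 y price, rand(contract local wknown tod seasonal) id(pid) group(gid) nclasses(2) from(start, skip)
                ```

                With skip, the extra entries are simply dropped, so the fixed coefficient is not initialised from start.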



                • #23
                  Hi,

                  Thanks for the answer! It works, but I am having some difficulties with the model. If I used the seed and iterate options with the -lclogit2- command, should I use the same options when I run the -lclogitml2- command with the starting values from -lclogit2-?



                  • #24
                    The -seed(#)- option is irrelevant to -lclogitml2- when you use the -from()- option, because your own starting values are used instead of randomly generated ones. Whether you specify -iterate(#)- is up to you. If you don't specify it explicitly, -lclogitml2- will use the same default maximum number of iterations as other MLE commands in Stata.
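
                    To see what that default is in a given session, one option is to query Stata's system setting; a minimal sketch:

                    ```stata
                    * default iteration cap used by Stata's ML-based commands when iterate(#) is omitted
                    display c(maxiter)
                    ```

                    Adding -iterate(#)- to the -lclogitml2- call overrides this value for that call only.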
