
  • Latent class logit proportions

    Hi everyone,

    I am working on a DCE, and I ran a latent class analysis with four classes.
    The Stata output gives "shares" for each class. I need to assign each observation to a class. To do so, I predict the posterior probabilities using the command lclogitpr with the cp option, then take the max of the four values to decide whether each individual belongs to class 1, 2, 3, or 4.
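
    In code, the assignment step looks roughly like this (a sketch of what I do; the loop itself is just illustrative):
    Code:
    lclogitpr cp, cp                             // posterior membership probabilities cp1-cp4
    egen double cpmax = rowmax(cp1 cp2 cp3 cp4)  // highest of the four probabilities
    gen byte class = .
    forvalues c = 1/4 {
        replace class = `c' if cp`c' == cpmax
    }
    tab class                                    // proportions from this modal assignment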
    My problem is this: doing so, the proportions of individuals assigned to each class differ from those given by the lclogit command...
    I would really appreciate it if someone could help me.

    Best regards, Gabin.

    PS: Sorry, I am new on the forum and I don't know if my post is formatted correctly; also, pardon my English, I am actually French.


  • #2
    Here is the issue: after a latent class model, you don't know which latent class someone belongs to. You know their probability of being in each class, e.g.

    Mr Smith: 0.88, 0.07, 0.05
    Mrs Johnson: 0.50, 0.30, 0.20

    You used modal class assignment: you decided to treat people as definitely belonging to one class, the one with the highest probability of membership. So, Mr Smith and Mrs Johnson are in class 1.

    First, you will never get the same proportions as the model. The model is reporting class proportions based on the probability vectors it calculated for each person. You are ignoring the 12% chance that Mr Smith is in class 2 or 3.
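
    To make this concrete with just these two people: the model-implied share of class 1 is (0.88 + 0.50)/2 = 0.69, whereas modal assignment puts both of them in class 1, a share of 1.00. The two numbers answer different questions.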

    Second, you can see that in Mr Smith's case, the assumption is close enough to reality that this may be OK. If most of the sample has a very high probability of belonging to one class or another, then the assumption is not too wrong. However, if most people look like Mrs Johnson, then it is a poor approximation.

    Why are you using modal class assignment?
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Thank you for your answer!
      That was my guess... My aim is to describe the four classes with their socioeconomic characteristics and some other discrete variables from a survey. I do not know if I am able to show the data...
      The DCE is composed as follows. Each respondent had to choose between a choice A and a choice B; this is my variable "choice". They could also choose neither A nor B; this is my status quo variable. There are 12 choice cards with 7 attributes.
      I already ran a conditional logit and a mixed logit.



      • #4
        Originally posted by Gabin Morillon View Post
        Thank you for your answer!
        That was my guess... My aim is to describe the four classes with their socioeconomic characteristics and some other discrete variables from a survey. I do not know if I am able to show the data...
        This is one of the fundamental issues with latent class analysis. I am not deeply familiar with the literature, but people have proposed a large number of very complex schemes to deal with the issue of "latent class with covariates" - basically, whether you can just present descriptive statistics on the latent classes. For an example, see this paper by Jeroen Vermunt. And no, the solution he proposes is not implemented in Stata, and I do not know how to implement it.

        What is the entropy of your final model? Entropy is a concept that's a bit like the Herfindahl index. In economics or antitrust law, if you consider the market shares of the companies in a market, a high HHI indicates something close to a monopoly, which is bad for consumers. The relative entropy statistic in latent class analysis plays an analogous role: it tells you how certain the model is about everyone's classification, and here you would ideally want it to be high, not low. Basically, if most respondents' vectors of membership probabilities look like Mr Smith's, you have high entropy; if everyone looks more like Mrs Johnson, you have low entropy. If your entropy is high, around 0.8 or above, then modal class assignment is close enough to reality - just explain what you are doing in the paper. 0.80 is a guideline that I think I saw on the Mplus forum at one time, but it was an off-the-cuff statement by Bengt Muthen, so treat it as an expert rule of thumb rather than a hard threshold - like p-value cutoffs of 0.05 or Cohen's guidelines for effect sizes.
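
        For reference, the relative entropy statistic usually cited in this context is E = 1 - [sum over respondents i and classes k of -p_ik*ln(p_ik)] / (N*ln(K)), where p_ik is respondent i's posterior probability of membership in class k, N is the number of respondents, and K is the number of classes; E near 1 means the model classifies nearly everyone with certainty.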
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Gabin Morillon: given #3, see also the article below, which builds on the Vermunt (2010) paper cited by Weiwen Ng. The authors provide R code (in the Supplementary Material, I think), but I recall that Jouni may also have Stata code (though it cannot do all that the R code can).


          Zsuzsa Bakk (Leiden University) and Jouni Kuha (London School of Economics and Political Science),
          "Two-Step Estimation of Models Between Latent Classes and External Variables",
          Psychometrika, Vol. 83, No. 4, pp. 871-892, December 2018.
          https://doi.org/10.1007/s11336-017-9592-7

          We consider models which combine latent class measurement models for categorical latent variables with structural regression models for the relationships between the latent classes and observed explanatory and response variables. We propose a two-step method of estimating such models. In its first step, the measurement model is estimated alone, and in the second step the parameters of this measurement model are held fixed when the structural model is estimated. Simulation studies and applied examples suggest that the two-step method is an attractive alternative to existing one-step and three-step methods. We derive estimated standard errors for the two-step estimates of the structural model which account for the uncertainty from both steps of the estimation, and show how the method can be implemented in existing software for latent variable modelling.



          • #6
            Originally posted by Weiwen Ng View Post

            This is one of the fundamental issues with latent class analysis. I am not deeply familiar with the literature, but people have proposed a large number of very complex schemes to deal with the issue of "latent class with covariates" - basically, whether you can just present descriptive statistics on the latent classes. For an example, see this paper by Jeroen Vermunt. And no, the solution he proposes is not implemented in Stata, and I do not know how to implement it.

            What is the entropy of your final model? Entropy is a concept that's a bit like the Herfindahl index. In economics or antitrust law, if you consider the market shares of the companies in a market, a high HHI indicates something close to a monopoly, which is bad for consumers. The relative entropy statistic in latent class analysis plays an analogous role: it tells you how certain the model is about everyone's classification, and here you would ideally want it to be high, not low. Basically, if most respondents' vectors of membership probabilities look like Mr Smith's, you have high entropy; if everyone looks more like Mrs Johnson, you have low entropy. If your entropy is high, around 0.8 or above, then modal class assignment is close enough to reality - just explain what you are doing in the paper. 0.80 is a guideline that I think I saw on the Mplus forum at one time, but it was an off-the-cuff statement by Bengt Muthen, so treat it as an expert rule of thumb rather than a hard threshold - like p-value cutoffs of 0.05 or Cohen's guidelines for effect sizes.

            Actually, I am not using the gsem command as in the post you cited, but the lclogit command, and the code described there does not work.
            Code:
            local ent = 0
            forvalues i = 1/4 {
                gen temp`i' = log(Pe`i') * (Pe`i' * -1)
                sum temp`i', meanonly
                local ent = `ent' + r(sum)
            }
            scalar ent = 1 - (`ent'/(e(N)*ln(e(k))))
            scalar list ent
            The error comes from the "r(sum)".
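
            For reference, here is a version adapted to -lclogit- output that may avoid the error (a sketch: it assumes the membership probabilities are cp1-cp4 from -lclogitpr cp, cp-, that they repeat across each individual's rows, and it hard-codes K = 4 instead of relying on e(k)):
            Code:
            egen byte onerow = tag(ID)                       // one row per individual
            local ent = 0
            forvalues i = 1/4 {
                gen double temp`i' = -cp`i'*ln(cp`i') if onerow
                replace temp`i' = 0 if onerow & cp`i' == 0   // treat 0*ln(0) as 0
                quietly sum temp`i' if onerow, meanonly
                local ent = `ent' + r(sum)
            }
            quietly count if onerow
            scalar ent = 1 - `ent'/(r(N)*ln(4))              // relative entropy, K = 4
            scalar list ent
            drop temp1-temp4 onerow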

            Also, is it possible to have the model directly give each respondent's class?

            I also have another problem... When I use the gllamm-based command
            Code:
            lclogitml, iterate(0)
            it does not give the p-value of the status quo variable...

            Tell me if I should show the Stata results!
            Last edited by Gabin Morillon; 21 Jan 2021, 03:51.



            • #7
              Gabin Morillon:

              If you type -lclogitml, iterate(0)- and see a missing standard error, I suspect that the EM algorithm satisfied the stopping rule before reaching a local maximum. You can let the gradient-based optimiser (i.e., -lclogitml-) run for a few more iterations, say -lclogitml, iterate(25)-, and see if the problem persists. In hindsight, the default stopping rule that Daniele and I built into -lclogit- was not stringent enough; unless you used options to tighten the tolerance criterion and increase the maximum number of iterations, your -lclogit- is likely to have stopped prematurely.

              I also wish to point out that some time ago I released an enhanced version of -lclogit-, namely -lclogit2-. Please see -ssc install lclogit2- and the background paper in this [external link] to the Stata Journal. One of the enhancements I introduced is -lclogitml2-, which is now a standalone -ml- program rather than a wrapper for -gllamm- and runs much faster than -lclogitml-.

              Others have shared many useful suggestions already. But I thought I'd add a more direct (though also less intelligent!) response to your original post "My problem is this: doing so, the proportions of individuals assigned to each class differ from those given by the lclogit command..." There is no reason why the sample mean of your assigned class membership indicator should be identical to the corresponding population share estimate. For example, suppose that you estimate a 2-class model for a sample of 58 individuals and the share of Class 1 is estimated to be 0.73. Algebraically it is impossible for the sample mean of the assigned Class 1 indicator to equal 0.73: if you assign 42 out of 58 to Class 1, your mean is 0.72; if you assign 43 out of 58, it is 0.74.
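
              You can check the arithmetic in one line:
              Code:
              display 42/58, 43/58
              which returns .72413793 and .74137931 - neither equals 0.73.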



              • #8
                Originally posted by Hong Il Yoo View Post
                Gabin Morillon:

                If you type -lclogitml, iterate(0)- and see a missing standard error, I suspect that the EM algorithm satisfied the stopping rule before reaching a local maximum. You can let the gradient-based optimiser (i.e., -lclogitml-) run for a few more iterations, say -lclogitml, iterate(25)-, and see if the problem persists. In hindsight, the default stopping rule that Daniele and I built into -lclogit- was not stringent enough; unless you used options to tighten the tolerance criterion and increase the maximum number of iterations, your -lclogit- is likely to have stopped prematurely.

                I also wish to point out that some time ago I released an enhanced version of -lclogit-, namely -lclogit2-. Please see -ssc install lclogit2- and the background paper in this [external link] to the Stata Journal. One of the enhancements I introduced is -lclogitml2-, which is now a standalone -ml- program rather than a wrapper for -gllamm- and runs much faster than -lclogitml-.

                Others have shared many useful suggestions already. But I thought I'd add a more direct (though also less intelligent!) response to your original post "My problem is this: doing so, the proportions of individuals assigned to each class differ from those given by the lclogit command..." There is no reason why the sample mean of your assigned class membership indicator should be identical to the corresponding population share estimate. For example, suppose that you estimate a 2-class model for a sample of 58 individuals and the share of Class 1 is estimated to be 0.73. Algebraically it is impossible for the sample mean of the assigned Class 1 indicator to equal 0.73: if you assign 42 out of 58 to Class 1, your mean is 0.72; if you assign 43 out of 58, it is 0.74.
                I already tried more iterations and I get this error:
                Code:
                lclogitml, iterate(25)
                -gllamm- is initializing. This process may take a few minutes.
                (error occurred in ML computation)
                (use trace option and check correctness of initial model)
                equation p2_1 not found
                r(303);
                
                end of do-file
                
                r(303);
                I also tried lclogit2; however, it took so long that I just gave up... I guess it is because I have 20 variables, of which 18 are random coefficients.



                • #9
                  Originally posted by Gabin Morillon View Post

                  I already tried more iterations and I get this error:
                  Code:
                  lclogitml, iterate(25)
                  -gllamm- is initializing. This process may take a few minutes.
                  (error occurred in ML computation)
                  (use trace option and check correctness of initial model)
                  equation p2_1 not found
                  r(303);
                  
                  end of do-file
                  
                  r(303);
                  I also tried lclogit2; however, it took so long that I just gave up... I guess it is because I have 20 variables, of which 18 are random coefficients.
                  You see the "equation p2_1 not found" message because -gllamm- is not capable of handling a large number of random coefficients. It's one of the issues that motivated the development of the -lclogit2- package. Please see point c in Section 1 of the background paper.

                  I'm afraid that I don't get the "I also tried lclogit2; however, it took so long that I just gave up... I guess it is because I have 20 variables, of which 18 are random coefficients" part. Whatever -lclogit- can do, -lclogit2- can do faster. If you're estimating exactly the same model specification, there is no reason why -lclogit2- should run intolerably slower when -lclogit- runs fine. Perhaps you can share with us the exact command lines that you have executed?
                  Last edited by Hong Il Yoo; 21 Jan 2021, 05:58.



                  • #10
                    Here are both the lclogit and lclogit2 commands.

                    Code:
                    lclogit choice statuquo A1L1 A1L2 A1L3 A1L4 A1L5 A2L1 A2L2 A2L3 A3L1 A3L2 A4L1 A4L2 A5L1 A5L2 A6L1 A6L2 A7L1 A7L2, id(ID) group(csid) nclasses(4)
                    lclogit2 choice statuquo, rand(A1L1 A1L2 A1L3 A1L4 A1L5 A2L1 A2L2 A2L3 A3L1 A3L2 A4L1 A4L2 A5L1 A5L2 A6L1 A6L2 A7L1 A7L2) id(ID) group(csid) nclasses(4)
                    Where A# stands for the attribute and L# stands for the level of the attribute. Each variable is binary.

                    I should also specify that the variable csid is given by
                    Code:
                    egen csid = group(ID scenario)
                    where "scenario" corresponds to the choice set. One individual had to choose between 2 options or a statuquo 12 times.

                    I wanted to add that I used the command
                    Code:
                    lclogitpr cp, cp
                    sum cp*
                    in order to predict the posterior probabilities of belonging to each class, and the means of these probabilities are equal to the "shares" given by the lclogit command.
                    Last edited by Gabin Morillon; 21 Jan 2021, 06:23.



                    • #11
                      Originally posted by Gabin Morillon View Post
                      Here are both the lclogit and lclogit2 commands.

                      Code:
                      lclogit choice statuquo A1L1 A1L2 A1L3 A1L4 A1L5 A2L1 A2L2 A2L3 A3L1 A3L2 A4L1 A4L2 A5L1 A5L2 A6L1 A6L2 A7L1 A7L2, id(ID) group(csid) nclasses(4)
                      lclogit2 choice statuquo, rand(A1L1 A1L2 A1L3 A1L4 A1L5 A2L1 A2L2 A2L3 A3L1 A3L2 A4L1 A4L2 A5L1 A5L2 A6L1 A6L2 A7L1 A7L2) id(ID) group(csid) nclasses(4)
                      Where A# stands for the attribute and L# stands for the level of the attribute. Each variable is binary.

                      I should also specify that the variable csid is given by
                      Code:
                      egen csid = group(ID scenario)
                      where "scenario" corresponds to the choice set. One individual had to choose between 2 options or a statuquo 12 times.

                      I wanted to add that I used the command
                      Code:
                      lclogitpr cp, cp
                      sum cp*
                      in order to predict the posterior probabilities of belonging to each class, and the means of these probabilities are equal to the "shares" given by the lclogit command.
                      You're estimating two different model specifications with -lclogit- and -lclogit2-. Your -lclogit- specification places a random coefficient on the status quo, whereas your -lclogit2- specification places a fixed coefficient on it. To make the two procedures comparable, you should move -statuquo- into -rand(.)-, and you'll see that -lclogit2- runs faster.

                      As explained in the first paragraph of p. 418 of the background paper, the EM algorithm slows down when you include a fixed coefficient in the model specification. As advised in the same paragraph, I'd suggest that you estimate the unrestricted model using -lclogit2- (i.e., -lclogit2 choice, rand(statuquo A1L1 ...)-) and then use the results as starting values for the constrained model that you estimate using -lclogitml2 choice statuquo, rand(A1L1 ...)-.
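
                      In command form, with the full attribute list from #10, the two steps might look like this (a sketch; exactly how the step-1 results are passed to step 2 should be checked in -help lclogitml2-, so treat the from() option below as an assumption):
                      Code:
                      * step 1: unrestricted model, every coefficient random
                      lclogit2 choice, rand(statuquo A1L1 A1L2 A1L3 A1L4 A1L5 A2L1 A2L2 A2L3 A3L1 A3L2 A4L1 A4L2 A5L1 A5L2 A6L1 A6L2 A7L1 A7L2) id(ID) group(csid) nclasses(4)
                      matrix b0 = e(b)
                      * step 2: constrained model with a fixed statuquo coefficient,
                      * started from the step-1 estimates
                      lclogitml2 choice statuquo, rand(A1L1 A1L2 A1L3 A1L4 A1L5 A2L1 A2L2 A2L3 A3L1 A3L2 A4L1 A4L2 A5L1 A5L2 A6L1 A6L2 A7L1 A7L2) id(ID) group(csid) nclasses(4) from(b0)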



                      • #12
                        Originally posted by Hong Il Yoo View Post

                        You're estimating two different model specifications with -lclogit- and -lclogit2-. Your -lclogit- specification places a random coefficient on the status quo, whereas your -lclogit2- specification places a fixed coefficient on it. To make the two procedures comparable, you should move -statuquo- into -rand(.)-, and you'll see that -lclogit2- runs faster.

                        As explained in the first paragraph of p. 418 of the background paper, the EM algorithm slows down when you include a fixed coefficient in the model specification. As advised in the same paragraph, I'd suggest that you estimate the unrestricted model using -lclogit2- (i.e., -lclogit2 choice, rand(statuquo A1L1 ...)-) and then use the results as starting values for the constrained model that you estimate using -lclogitml2 choice statuquo, rand(A1L1 ...)-.
                        lclogit2 was faster until it reached a certain iteration (29 in my case), and then it did not stop running...



                        • #13
                          Originally posted by Gabin Morillon View Post

                          lclogit2 was faster until it reached a certain iteration (29 in my case), and then it did not stop running...
                          You can play around with the -seed(.)- option to check if using alternative sets of starting values helps.

                          Also, please note that if you're trying to estimate an unidentified model, it is possible for -lclogit- and -lclogit2- to diverge; the EM algorithms used by the two commands are not good at picking up a failure of identification. If you pass the divergent results from the two commands to -lclogitml- (when it works) and -lclogitml2-, you will be able to confirm the usual symptom of identification failure: the gradient-based optimiser will fail to find a solution. I understand that your -lclogit- model is too large for -lclogitml-, but you can still mix-and-match here to check whether you have an identified model; you can use -e(b)- from the older -lclogit- command as starting values for the newer -lclogitml2- command.
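
                          In command form, that mix-and-match check might look like this (same caveat as above: the from() syntax is an assumption, so verify it in -help lclogitml2-):
                          Code:
                          lclogit choice statuquo A1L1 A1L2 A1L3 A1L4 A1L5 A2L1 A2L2 A2L3 A3L1 A3L2 A4L1 A4L2 A5L1 A5L2 A6L1 A6L2 A7L1 A7L2, id(ID) group(csid) nclasses(4)
                          matrix b0 = e(b)
                          lclogitml2 choice, rand(statuquo A1L1 A1L2 A1L3 A1L4 A1L5 A2L1 A2L2 A2L3 A3L1 A3L2 A4L1 A4L2 A5L1 A5L2 A6L1 A6L2 A7L1 A7L2) id(ID) group(csid) nclasses(4) from(b0)
                          * if the gradient-based optimiser then fails to find a solution,
                          * suspect identification failure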



                          • #14
                            Does the seed number have an impact on the coefficients? I tried a large number (1234567) and it works.



                            • #15
                              Originally posted by Gabin Morillon View Post
                              Does the seed number have an impact on the coefficients? I tried a large number (1234567) and it works.
                              As usual with non-linear estimation, your coefficient estimates are sensitive to starting values. When you specify different numbers in -seed(.)-, you're effectively using different starting values. See my earlier paper with Daniele ([external link]) for how -lclogit- selects starting values; -lclogit2- selects starting values in the same way. A priori, -seed(1234567)- is as valid as any other seed. Ideally, you should experiment with a wide range of different seeds and make sure that you cannot locate a higher maximum by changing your current seed.
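
                              For example, a crude search over seeds could look like this (a sketch: the seed values are arbitrary, and I'm assuming e(ll) holds the maximised log likelihood):
                              Code:
                              local best = .
                              foreach s in 101 2021 1234567 7654321 {
                                  lclogit2 choice, rand(statuquo A1L1 A1L2 A1L3 A1L4 A1L5 A2L1 A2L2 A2L3 A3L1 A3L2 A4L1 A4L2 A5L1 A5L2 A6L1 A6L2 A7L1 A7L2) id(ID) group(csid) nclasses(4) seed(`s')
                                  display "seed `s': ll = " e(ll)
                                  if missing(`best') | e(ll) > `best' {
                                      local best = e(ll)
                                      estimates store best_run   // keep the best run so far
                                  }
                              }
                              estimates restore best_run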

