  • How can I obtain the class membership of each individual when using a finite mixture model?

    I am doing research on a Cobb-Douglas (C-D) production function, where I need a finite mixture model (FMM) to examine some heterogeneity. I have panel data for a number of states covering 2005 to 2020. When I type:
    Code:
    fmm 2, vce(r) : regress lnY lnK lnE lnL
    estat lcprob, nose
    I get the following result:
    --------------------------------------------------------------
                 |     Margin
    -------------+------------------------------------------------
           Class |
               1 |   .6682124
               2 |   .3317876
    --------------------------------------------------------------
    I can only get the marginal probabilities for the whole sample, no matter which predict or estat options I try. But in the paper I'm referring to, the author not only gave the probability for each class, but also listed which states fell into each group. For example, out of 21 states, states 1, 5, 6... (fourteen in total) belong to class 1, and the other states belong to class 2, with the probabilities shown like this:
    Group 1    probability     Group 2    probability
    state 1    0.5902          state 2    0.6416
    state 5    0.5563          state 3    0.5386
    with 14 states in the left column and 7 in the right.

    How can I obtain such a result? Or is it not available using Stata commands?




    Thanks very much,

    Carlos

  • #2
    Forgive me if you already know this, but in finite mixture and latent class models, you get the marginal probability of being in each of the latent classes. An individual observation does not belong to any single latent class; instead it has a vector of membership probabilities, e.g. Mrs. Chen's vector might be (0.90, 0.10) while Mrs. Jensen's is (0.45, 0.55). This is a fairly common misconception, so I just want to clear that up first. I don't mean to insult your statistical skills; FMMs are complex models with many parts, some of which are hard to grasp. That said, you can make a simplifying assumption like modal class assignment, i.e. you assign Mrs. Chen to class 1 and Mrs. Jensen to class 2. However, the more people who look like Mrs. Jensen, the more inaccurate this assignment is.
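
    For concreteness, modal class assignment after a 2-class fmm might look like the sketch below (variable names are illustrative):

    Code:
    * sketch of modal class assignment: assign each observation to its most
    * likely class; for 2 classes, whichever posterior probability is larger
    predict pr*, classposteriorpr
    gen byte modalclass = cond(pr1 >= pr2, 1, 2)
    tab modalclass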

    So, good question. This is pretty complex, but my current opinion is that what you want can't be done in your current setup. That said, I'm not an expert on FMM, and I would defer to someone who is. My reasoning is as follows. In an FMM, you posit a relationship Y = XB + e, and you posit that there are two or more latent classes in which the betas are systematically different. Your target paper gives something like the probability of each latent class conditional on state. Now, you think, surely margins can give me the conditional probability of latent class membership, right? Well,

    First, if you check the FMM postestimation manual, margins after fmm doesn't allow the posterior class probability to be used as a statistic. (Posterior class probability = conditional probability.)

    Second, even if margins did allow this, think about how you would get to a conditional probability in your model. Conditional on what? Remember, latent class membership derives from a multinomial logit model with intercepts only (unless you inserted predictors of class membership, which you did not). If you tried predicting the conditional latent class probabilities, i.e. predict classprob*, classposteriorpr, I have a feeling those predicted probabilities would be constant over all the observations. Try it.
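
    One quick way to try it is to summarize the predictions and see whether they vary at all (a sketch; prefix names are illustrative):

    Code:
    * check whether the predicted class probabilities vary across observations
    predict cpr*, classposteriorpr
    summarize cpr1 cpr2   // a nonzero Std. dev. means they vary by observation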

    Third, your regression variables are all continuous. So you couldn't really do something like that paper does; you'd have to specify particular values of those independent variables.

    What paper are you referencing? I'll take a look at it and see if my opinion changes.
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Mr. Weiwen, thanks very much for your comments! Forgive me for not checking this post more often.

      I am indeed new to this complex statistical method, and I appreciate your kind introduction to what FMM is about. After doing some research on the FMM model and method itself, I fully understand your point with the example of Mrs. Chen and Mrs. Jensen.

      I did try predict classpr, and all the observations gave the same probability for the first class, so that is not how the target paper achieved its result.

      But the target paper basically runs the same regression as mine, where Y and E refer to different variables but are still C-D function components. So I am really confused about how the author produced that table. I wondered whether he used the expected value of lnY, which serves as the classification standard in the FMM (the regression equations differ between the high- and low-lnY groups), and counted how often each state's expected lnY values fell in the range of each class, producing that table. E.g. if 9 of State A's 16 expected values fall in the low group, then the probability of State A falling in the low group would be 0.5625.

      However, that paper is in Chinese; if by any chance you can take a look, I would really appreciate it.
      The title of the paper is 《城市化进程中的“资源尾效”和“资源诅咒”——基于中国27个煤炭城市的面板数据分析》 ("Resource Drag" and "Resource Curse" in the Urbanization Process: A Panel Data Analysis of 27 Coal Cities in China), and the link is https://kns.cnki.net/kcms/detail/det...bName=CMFD2018

      Again, thanks a lot for your kind reply!

      Carlos




      • #4
        Unfortunately, I'm not able to decipher the article, so I can't see anything related to their model. You stated that the target paper ran an equation of a similar form to yours, which you gave as:

        Code:
        fmm 2, vce(r) : regress lnY lnK lnE lnL
        If you introduced predictors of latent class membership, e.g.

        Code:
        fmm 2, vce(r) lcprob(z1 z2) : regress lnY lnK lnE lnL
        Then, while margins might or might not work, we could still manually calculate the probabilities of class membership at representative values of z1 and z2, be they continuous or categorical. If the other paper entered i.state into that part of the model, it makes sense that they would be able to produce such a table. It sounds more intimidating than it actually is: you're just applying the algebra of a multinomial logit model.
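
        As a sketch of that algebra (the coefficient names below are my guess at the e(b) labels; refit with the coeflegend option to confirm them before using this):

        Code:
        * after: fmm 2, vce(r) lcprob(z1 z2) : regress lnY lnK lnE lnL
        * class 1 is the base, so logit Pr(class 2) = _cons + b1*z1 + b2*z2;
        * evaluated at, say, z1 = 1 and z2 = 0:
        scalar xb = _b[2.Class:_cons] + _b[2.Class:z1]*1 + _b[2.Class:z2]*0
        display "Pr(class 2) = " exp(xb)/(1 + exp(xb))
        display "Pr(class 1) = " 1/(1 + exp(xb))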

        That said, if you ask Stata to predict the class posterior probabilities in your dataset, you do actually get predictions that vary by obs, i.e. what I stated above wasn't correct. You should have typed

        Code:
        predict class*, classposteriorpr
        Whereas I have a feeling you typed

        Code:
        predict class*, classpr
        Remember, posterior probability is synonymous with the conditional probability of class membership (i.e. conditional on whatever variables went into the model). That said, if your target paper didn't enter any predictors of latent class membership, I still don't understand how they got their table.



        • #5
          Dr. Weiwen, thanks so much for such useful information! I just tried
          Code:
          predict pr*, classposteriorpr
          and got the probability for each observation. But what I'm now confused about is this: as far as I understand, if I don't specify any predictors of latent class membership, Stata will form the classes based on differences in the dependent variable, which is also what I want. However, the class posterior probabilities are of course different when I use:
          Code:
          fmm 2, vce(r): regress lnY
          and
          Code:
          fmm 2, vce(r): regress lnY lnK lnE lnL
          May I ask, if I want to draw a graph showing the kernel density of lnY in each class (like the one below), which of the above equations should I choose, and how should I identify the observations belonging to each class? Is it by whether the class posterior probability is larger than 0.5 for the corresponding group?
          [Attached image: kernel density curves for modes A, B, and C]

          where modes A, B, and C are analogous to the classes in FMM.

          Thank you again, Dr.Weiwen!

          Carlos



          • #6
            Originally posted by Carlos Wang View Post
            But what I'm now confused about is this: as far as I understand, if I don't specify any predictors of latent class membership, Stata will form the classes based on differences in the dependent variable, which is also what I want. However, the class posterior probabilities are of course different when I use:
            Code:
            fmm 2, vce(r): regress lnY
            and

            Code:
            fmm 2, vce(r): regress lnY lnK lnE lnL
            We operate on a first-name basis here.

            First issue: the quoted pieces of code fit two different models, so you will naturally get different sets of class posterior probabilities. Does this make sense? Think of it this way: with an FMM, you are positing that the relationship y = XB + e holds over two (or k) different latent classes. If you change the vector of Xs, of course you will get different latent classes. (Side note: I believe your first bit of code reduces to a latent profile model, because you're basically just asking for two latent classes with different means of lnY.)

            Second: That graph you want would be a pretty intuitive way of understanding how the distribution of lnY varies by latent class. However, remember that after an FMM, each observation has a vector of class membership probabilities. That's the key problem!

            2a: Do you just want to show the unconditional mean (and maybe the standard error) of lnY by latent class? We already have estat lcmean. That might suffice! Note that SEs can take a long time to calculate, especially if you go above 4 or so latent classes.

            2b: There is a quantity called normalized entropy. This post shows how to calculate it. Normalized entropy ranges from 0 to 1, is frequently reported for LCA models, and shows how certain we are about class membership. 0 means we have essentially no information on class membership; basically, everyone's vector looks like (0.5, 0.5). 1 means complete certainty, e.g. everyone looks like (1, 0) or (0, 1). If you have relatively high entropy, you could just do modal class assignment, i.e. assign people to the latent class they are most likely to belong to, then plot the graph, and note this simplification in the limitations or a footnote. What counts as "relatively high"? There's no clear guideline; I'd say probably over 0.8, based on a comment someone more knowledgeable than I once made on the MPlus forum, but again there's no clear guideline.
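
            A sketch of that calculation for a 2-class model (my own hand-rolled version, not official Stata output; note ln(0) is missing in Stata, so posterior probabilities of exactly 0 or 1 would need special handling):

            Code:
            * normalized entropy = 1 - sum over obs and classes of -p*ln(p),
            * divided by N*ln(K); here K = 2 classes
            predict pr*, classposteriorpr
            gen plogp = pr1*ln(pr1) + pr2*ln(pr2)
            quietly summarize plogp
            display "Normalized entropy = " 1 + r(sum)/(r(N)*ln(2))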

            I just want to repeat one point: strictly speaking, there is no way to be certain about who is in which latent class. The technique is fundamentally probabilistic. You can do modal class assignment, which is a simplifying assumption that is a bit wrong, but we make simplifying assumptions with every statistical model that we fit.



            • #7
              Weiwen,

              Thanks for your patient reply!

              Based on my own exploration and my supervisor's requirements, the second specification,
              Code:
              fmm 2, vce(r): regress lnY lnK lnE lnL
              will be more suitable, because we want to see the distribution of lnY under different growth modes. (That is, what the C-D production function looks like when lnY falls in different groups.) The conditional mean is not enough, so we still decided to use predict, classposteriorpr to classify each observation into a group and draw such a figure.

              And we've gotten a satisfying figure that looks quite good when we simply split the groups at probability 0.5. I've seen similar guidelines of 0.6 or 0.8, and the author also emphasizes that this is only a statistical device for deciding that each observation is in one particular group, so he basically supports your point, with which I couldn't agree more. Next we may try 0.6 or 0.8 to make sure the classification is robust and meaningful (to some extent, at least).
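
              For anyone replicating this, the figure could be drawn along these lines (a sketch using the thread's variable names; the 0.5 cutoff is modal class assignment):

              Code:
              * modal assignment at 0.5, then overlaid kernel densities by class
              predict pr*, classposteriorpr
              gen byte class = cond(pr1 >= 0.5, 1, 2)
              twoway (kdensity lnY if class == 1) (kdensity lnY if class == 2), ///
                  legend(order(1 "Class 1" 2 "Class 2")) xtitle("lnY")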

              Thank you again for the discussion and guidance for these days!

              Carlos



              • #8
                Code:
                qui fmm 2 : regress Y X
                predict classpost*, classposteriorpr
                list classpost* in 1/10, abbrev(10)

                gen exclass = 1 if classpost1 >= 0.5
                replace exclass = 2 if classpost2 >= 0.5
                tab exclass
