  • Latent Class Analysis with Stata 15


    Dear Statalist,

    When I try to conduct a latent class analysis (LCA) with some multi-category
    categorical variables in Stata 15, I run into some problems. I have four variables
    (income5q, edu, occ, housource) that each have three or five categories. Because
    they are neither binary nor continuous, when I type:

    . gsem (income5q edu occ housource <- ), lclass(C 3)
    . estimates store threeclass
    . estat lcprob
    . estat lcmean
    . estat lcgof

    I get the following output:


    . estat lcprob

    Latent class marginal probabilities               Number of obs   =      6,338

    --------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
               C |
              1  |   .4156174   .0066401      .4026647    .4286877
              2  |   .5217023    .009338      .5033792    .5399671
              3  |   .0626803   .0072906       .049822    .0785828
    --------------------------------------------------------------

    . estat lcmean

    Latent class marginal means                       Number of obs   =      6,338

    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    1            |
        income5q |   .6024641   .0147057    40.97   0.000     .5736414    .6312867
             edu |   2.182525   .0133417   163.59   0.000     2.156376    2.208675
             occ |   3.199904   .0142397   224.72   0.000     3.171995    3.227814
       housource |   3.295172   .0233982   140.83   0.000     3.249312    3.341032
    -------------+----------------------------------------------------------------
    2            |
        income5q |   3.281452    .013222   248.18   0.000     3.255538    3.307367
             edu |    2.11298   .0165082   128.00   0.000     2.080625    2.145336
             occ |   3.210702   .0146888   218.58   0.000     3.181913    3.239492
       housource |   3.664078   .0220479   166.19   0.000     3.620865    3.707291
    -------------+----------------------------------------------------------------
    3            |
        income5q |   3.442805   .0471244    73.06   0.000     3.350443    3.535167
             edu |   3.328121   .0636223    52.31   0.000     3.203423    3.452818
             occ |   2.302029   .0729742    31.55   0.000     2.159002    2.445055
       housource |   3.593194   .0847707    42.39   0.000     3.427046    3.759341
    ------------------------------------------------------------------------------

    . estat lcgof

    ----------------------------------------------------------------------------
               Fit statistic |      Value   Description
    -------------------------+--------------------------------------------------
        Information criteria |
                         AIC |  67769.182   Akaike's information criterion
                         BIC |  67890.759   Bayesian information criterion
    ----------------------------------------------------------------------------

    My questions are:

    1. How can I get the latent class marginal means for each category of these
    multi-category variables? Should I recode the four variables as binary (0 and 1)?

    2. When I typed "estat lcgof" I only got AIC and BIC. How can I get the G² statistic,
    its p-value, the number of parameters, and other indices that show goodness of fit?


    Thanks,



  • #2
    Bruce, if you present code and results using code delimiters, they are more readable. Use the # button on the formatting toolbar that appears when you compose a post.

    At first glance, it looks like you did not tell Stata that the variables are ordinal. I think that in this case, Stata probably treats the variables as Gaussian by default (as if modeled via OLS regression). If not, then Stata probably assumes that 0 is 0 and anything else is 1, thus treating them as binary, which is arguably sub-optimal because it throws away data on the intensity with which the item was endorsed. You can, in fact, ask Stata to treat all the variables as ordinal by typing:

    Code:
    gsem (income5q edu occ housource <- , ologit), lclass(C 3)
    estat lcmean

    I have tried this on my own data, and I believe you will see the marginal probability of each category in the latent class marginal means.

    The G² test is a) not available if any variable has missing data, and b) not something I have seen a lot of papers report. In my opinion, you can simply report the BIC and select models on BIC alone. If you want to, you can also calculate the sample-size adjusted BIC (SSBIC). However, Bengt and Linda Muthén have both said that despite simulation studies favoring the SSBIC over the regular BIC, they can't strongly recommend one over the other, because there isn't strong statistical theory behind the SSBIC. (I made the first post before reading the Muthéns' posts, so my tone then differs from my tone now.)
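
    If it helps, here is a minimal sketch of how you might compute the SSBIC by hand after the fit, assuming Sclove's usual adjustment of replacing N with (N + 2)/24 in the BIC penalty, and assuming e(rank) is an acceptable count of the estimated parameters (check ereturn list for your model):

    Code:
    * minimal sketch: sample-size adjusted BIC, computed from the gsem fit already in memory
    * (Sclove's adjustment; using e(rank) as the parameter count is an assumption -- verify with ereturn list)
    scalar loglik = e(ll)                        // final log likelihood
    scalar nparam = e(rank)                      // number of estimated parameters
    scalar nobs   = e(N)                         // number of observations
    scalar ssbic  = -2*loglik + nparam*ln((nobs + 2)/24)
    display "SSBIC = " ssbic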
    Last edited by Weiwen Ng; 06 Apr 2018, 13:24.
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Thanks for your suggestions! But then I encountered another problem. When I typed:

      Code:
      gsem (income5q edu occ housource <- , ologit), lclass(C 3)
      estat lcmean

      the model would not converge. It kept iterating like this:

      Code:
      Iteration 279: log likelihood = -29277.083  (not concave)
      Iteration 280: log likelihood = -29277.083  (not concave)
      Iteration 281: log likelihood = -29277.083  (not concave)
      Iteration 282: log likelihood = -29277.083  (not concave)
      Iteration 283: log likelihood = -29277.083  (not concave)
      I cannot find where the problem is. Maybe my data are not suitable for LCA?



      • #4
        Originally posted by Bruce Wong View Post
        Thanks for your suggestions! But then I encountered another problem. When I typed:

        Code:
        gsem (income5q edu occ housource <- , ologit), lclass(C 3)
        estat lcmean

        the model would not converge. It kept iterating like this:

        Code:
        Iteration 279: log likelihood = -29277.083 (not concave)
        Iteration 280: log likelihood = -29277.083 (not concave)
        Iteration 281: log likelihood = -29277.083 (not concave)
        Iteration 282: log likelihood = -29277.083 (not concave)
        Iteration 283: log likelihood = -29277.083 (not concave)
        I cannot find where the problem is. Maybe my data are not suitable for LCA?
        Ah, this is a very common problem. I don't know how much you know about maximum likelihood theory, and I am certainly not an expert, but here is the issue. All statistical software uses some sort of algorithm to iteratively find the parameters that maximize the likelihood. The maximizer declares victory when its convergence criteria have been met, usually that the log likelihood (LL) changes by less than a certain amount from iteration to iteration and that the gradient (the first derivative of the log likelihood) is close to 0. Stata also applies a criterion based on the second derivative (the Hessian) of the LL function, which effectively requires the LL to be concave at the solution. This prevents the algorithm from declaring convergence in non-concave regions of the LL. It appears that a number of other programs commonly used for latent class analysis may not do that.

        Importantly, you can turn off Stata's default criterion involving the second derivative. If you do, I believe you substantially increase the chance of converging on a local maximum rather than the global maximum. The likelihood functions for latent class models are known to be prone to local maxima (OK, I don't know this firsthand, but people smarter than I am have said so). Other programs get around this by running many starts from randomly selected starting parameters and taking the highest LL they consistently converge upon; you can replicate this process in Stata to some extent, although it appears to be slower than the Penn State University Stata plugin. Stata's staff have recommended that for LCA you can turn off the second derivative criterion (the software will run faster), but if you do, you are strongly advised to save the parameters and then re-estimate the model with the default criterion restored, just to be sure. The thread below describes how.

        https://www.statalist.org/forums/for...5-gsem-problem
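
        To make that concrete, here is a rough sketch of the workflow described above, under the assumption that nonrtolerance is the option that relaxes the Hessian-based check and that startvalues(randomid, ...) is the random-starts syntax (please verify both against help gsem estimation options and the linked thread):

        Code:
        * sketch: relax the Hessian-based check and use random starts for an exploratory fit
        gsem (income5q edu occ housource <- , ologit), lclass(C 3) ///
            startvalues(randomid, draws(20) seed(12345)) nonrtolerance
        * save the parameters, then re-estimate with the default convergence criteria
        matrix b = e(b)
        gsem (income5q edu occ housource <- , ologit), lclass(C 3) from(b)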

        It could also be that you have a logit parameter at or over +/- 15, which means that one or more classes was nearly certain to endorse or not endorse an item. If you have several such instances, your model may also run into that convergence problem. It appears acceptable to constrain the parameters in that case, and the thread above also shows how to do that. I am not sure this applies to ordered logit parameters, but it could if some categories have 0 or 1 endorsement probabilities. In addition, if you run into convergence trouble, try using Stata's own options for multiple random starts (covered in one of the LCA or LPA examples in the SEM manual, I believe). That said, in general, this is a symptom of a model that is not identified.

        I know this may sound dense, and it is a bit dense, so if you don't know the basic outlines of likelihood theory, I would suggest you consult with someone who can explain it in person. I think many economists cover it in their econometrics courses, statistics PhDs obviously know it, and some applied statisticians do as well. Speaking from my own experience, latent class models are a lot trickier for many applied statisticians than regular regression models because of all these potential convergence issues.

        You can't conclude that your data are not suitable for LCA just based on some initial trouble like this. I would ask if there's some substantive reason to suspect your population and/or sample are genuinely heterogeneous. If so, would it be helpful knowledge for other researchers to consider? There's certainly some rationale for proceeding if so. But, do note that if the classes are very closely spaced, it gets harder for the algorithms to separate them in LCA. Also, this is an educated guess as opposed to something I remember reading or hearing, but if some of your categories are very sparsely populated (e.g. the top income category is very rarely endorsed), that might also lead to convergence trouble. Collapsing some categories might be acceptable there.
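
        For what it's worth, a quick way to check for sparse categories and, if needed, collapse them is something like the following (the variable chosen, the cutoff, and the new variable name are purely illustrative):

        Code:
        * sketch: inspect category frequencies, then collapse a sparsely populated top category
        tab1 income5q edu occ housource
        recode income5q (4 5 = 4), generate(income4q)   // illustrative: merge the top two income groups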

        In any case, at some point, with some number of classes, your model will genuinely cease to be well identified. As I mentioned, an iteration log stuck endlessly on "not concave" is one symptom of that. If you have run multiple random starts and constrained any appropriate parameters, and you still have this issue, then that number of classes is probably not identifiable.
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Weiwen Ng

          Thanks so much for your help. In this area I'm just a beginner, but I will try my best to understand the informative reply you provided. I thought it was simple when I saw the LCA examples for Stata 15; now I realize I was wrong. There is still a lot for me to figure out first.

          In any case, thank you again. I appreciate all the effort you invested.



          • #6
            Bruce Wong
            Another strategy you can try is the difficult option available with ML estimators. It tells the maximization algorithm, behind the scenes, to switch to a different stepping method when it encounters non-concave regions. Sometimes it works, sometimes it doesn't. You can also specify different convergence criteria and tolerances to try to force a solution. At that point you'd want to follow the previous guidance: use those parameter estimates as starting values and re-estimate with the more stringent default convergence criteria as a robustness check on the solution.
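
            A hedged sketch of what that might look like (the iteration limit is illustrative, not a recommendation):

            Code:
            * sketch: exploratory run with 'difficult' and a relaxed check, then a stricter re-check
            gsem (income5q edu occ housource <- , ologit), lclass(C 3) difficult iterate(500) nonrtolerance
            matrix b = e(b)
            gsem (income5q edu occ housource <- , ologit), lclass(C 3) difficult from(b)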

