
  • Latent class - gsem command

    Dear Stata users,
    I am using the Latent Class Analysis feature available in Stata 15 and I would have some questions for the expert users:

    1) For the membership functions (covariates) I have a dummy variable for some classes which doesn't report the standard errors and confidence interval. What is the reason and how should I interpret that variable?

    2) I get some errors in postestimation. In particular, -estat lcprob- returns r(825438256) and -estat lcmean- returns r(909). What do they mean?

    Thank you very much for your attention

  • #2
    Andrea,

    To your first question: in my experience in latent class analysis, when I use the -nonrtolerance- option and I get a model with missing SEs for some parameters, then the model won't converge at all if I try to fit it using the normal convergence criteria (i.e. it will iterate to the default 16,000 maximum iterations and then return an error; you can and usually should change this option to a lower number when doing LCA). Basically, you have convergence trouble. This can happen with logit intercepts that are close to + or - 15 (these correspond to class-specific probabilities of near 1 or near 0 for that item).
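    To put numbers on the remark about intercepts near + or - 15, here is a minimal sketch using Stata's built-in invlogit() function. The value 15 is just the rule of thumb mentioned above, not something taken from Andrea's output.

```stata
* Convert a logit intercept to the implied class-specific probability.
* An intercept of +15 or -15 is, for practical purposes, a probability of 1 or 0.
display %12.9f invlogit(15)
display %12.9f invlogit(-15)

* For comparison, a more ordinary intercept of 2
display %12.9f invlogit(2)
```

    An item pinned that close to 0 or 1 within a class carries almost no information about its own intercept, which is one way the SEs end up missing.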

    Can you show your command and your output in code delimiters? Omit the iteration log (or use the -nolog- option).

    To your second point, you can usually click on the error in the command window, and it will explain the error. The explanation may not be informative. From Googling, 909 is something about matrix size too small. Stata specifies a maximum size for any matrices involved in estimation. If you accidentally treat a continuous variable as categorical, then you need one matrix row and column for each unique value, and this can certainly exceed the maximum matrix size. I have no idea why you'd get that error in -estat lcmean- alone. This may be something to escalate to technical support.
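    One quick way to rule out the "continuous variable treated as categorical" scenario is to count distinct values before fitting. This is only a sketch on Stata's shipped auto dataset; rep78 and price stand in for your own variables.

```stata
* Count the distinct values of a variable: r(r) is the number of rows
* that -tabulate- would display.
sysuse auto, clear

quietly tabulate rep78
display "rep78 has " r(r) " distinct values"

quietly tabulate price
display "price has " r(r) " distinct values"
```

    A variable with only a handful of distinct values is safe to treat as categorical; one with dozens or hundreds would need that many rows and columns if it were entered as a factor variable.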

    I've Googled the other error code and I haven't found it. Can you explain what you saw when you clicked the code in the command window? Again, seeing your code might help us spot if there's something clearly amiss that would cause the error.
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Thank you for your reply Weiwen.
      Unfortunately I can't show the output as I am working inside the system of the National Statistics and I can't extract the output. I show you the code used:

      Code:
       gsem(company fulltime dependent <-, logit) (incometotal pct_income_art pct_gov time_art time_related <- _cons) (C <- age male retired educated member), lclass(C 5) startvalues(randomid, draws(15) seed(123321)) em(iter(5)) iterate(5000) nodvheader nonrtolerance
      Focusing on the first point of my post, I have reduced the maximum iterations to 10,000. After the 10,000th iteration I see "convergence not achieved" (but not in red, while with the default 16,000 iterations I have this message at the end, after the output tables, and in red. Does this make some difference?), and I have all the parameters estimated. Only for the dummies of the covariate "retired" I don't have the standard errors (and only for 2 out of 5 classes).

      Thank you again for your attention



      • #4
        Hello, Andrea,

        That looks like a correctly specified latent profile regression model. If Stata did not declare convergence, you can't use your results. It does not matter where the "convergence not achieved" message appears.
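        If you want to check convergence programmatically rather than by reading the log, the stored result e(converged) is what I would look at. A minimal sketch follows; the dataset and two-class model are Stata's own LCA example (fetched with -webuse-, so it needs internet access), not Andrea's data.

```stata
* Fit a small latent class model, then test whether Stata declared convergence.
webuse gsem_lca1, clear
quietly gsem (accident play insurance stock <- ), logit lclass(C 2) iterate(1000)

* e(converged) is 1 if convergence was declared and 0 otherwise
if e(converged) == 1 {
    display "converged: results can be interpreted"
}
else {
    display as error "convergence not achieved: do not use these estimates"
}
```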

        You said that you have a missing SE only for the dummy variable retired, in 2 of the 5 latent classes. Clearly, something about that variable is preventing your model from converging, but unfortunately I can't say what. Maybe you have complete or quasi-complete separation, or maybe very few people report being retired. This sort of problem can be exacerbated if the latent class is small (e.g. say you have a class that's 5% of the sample). Perhaps you can try this: refit this as a latent profile model (i.e. remove the predictors of the latent class), then calculate the modal class, then tabulate your dummy variables by class, e.g.

        Code:
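        * Latent profile model only: no predictors of class membership
        gsem(company fulltime dependent <-, logit) (incometotal pct_income_art pct_gov time_art time_related <- _cons), lclass(C 5) startvalues(randomid, draws(50) seed(123321)) em(iter(5)) iterate(1000) nodvheader nonrtolerance
        * Posterior probability of membership in each class
        predict classpr*, classposteriorpr
        * Modal class = the class with the highest posterior probability
        egen modalpr = rowmax(classpr?)
        gen modalclass = .
        forvalues k = 1/5 {
            replace modalclass = `k' if classpr`k' == modalpr
        }
        drop modalpr
        * Means of the covariates within each modal class
        tabstat age male retired educated member, by(modalclass)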
        This should show you roughly how sparse the data are in each class. Some side notes: in my experience, you can reduce the maximum iterations to as few as 1,000. Models that haven't clearly converged by then will probably not converge by Stata's standard criteria. Also, I would typically recommend using more random draws, often as many as 100, to make sure you have found the highest log likelihood. Using -nonrtolerance- markedly increases the chance that Stata declares convergence at a local maximum rather than the global maximum, and the likelihood function for latent class models typically has multiple local maxima.
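        To see the local maxima issue for yourself, you can refit the same model from a few different seeds and compare the final log likelihoods. This is a rough sketch, again on Stata's example LCA dataset (-webuse- needs internet access) rather than Andrea's data; the three seeds and 5 draws are arbitrary values chosen to keep it quick.

```stata
* Refit a 2-class model from different random starts and compare e(ll).
* Different final log likelihoods across seeds indicate local maxima.
webuse gsem_lca1, clear
forvalues s = 1/3 {
    quietly gsem (accident play insurance stock <- ), logit lclass(C 2) ///
        startvalues(randomid, draws(5) seed(`s')) emopts(iterate(5)) iterate(1000)
    display "seed `s': log likelihood = " %12.4f e(ll) "   converged = " e(converged)
}
```

        If the seeds disagree, keep the solution with the highest log likelihood, and consider increasing draws() until the best value is reached repeatedly.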

        Anyway, you may know this, but after that exercise, you'll usually speed up estimation for the latent profile regression version if you add on this code:

        Code:
        matrix b = e(b)
        gsem(company fulltime dependent <-, logit) (incometotal pct_income_art pct_gov time_art time_related <- _cons) (C <- age male retired educated member), lclass(C 5) em(iter(5)) iterate(1000) nodvheader nonrtolerance from(b)
        Note that I removed the -startvalues- option; the -from(b)- option has Stata start estimation from the parameters of the latent profile model you fit last, which supersedes the start values. Last, I highly recommend that you save the final parameter estimates the same way, then refit the model from those parameters without -nonrtolerance-, i.e.

        Code:
        matrix b = e(b)
        gsem(company fulltime dependent <-, logit) (incometotal pct_income_art pct_gov time_art time_related <- _cons) (C <- age male retired educated member), lclass(C 5) em(iter(5)) iterate(1000) nodvheader from(b)


        • #5
          For others not familiar with latent profile regression, it's worth noting the bit of code that reads

          Code:
          gsem ... (C <- age male retired educated member), ...
          This essentially fits a multinomial regression model to the predicted latent class. Basically, it asks whether the specified covariates predict membership in a latent class. (In my experience, if you first run a latent profile/class model, then tack on the regression part, you usually get nearly identical class-specific probabilities or means for the indicator variables.) If you later ran -margins-, you should be able to get output as if you had run -margins- on a multinomial model (note: -margins-, not -estat lcmean-, in this context).

          In general, you should be cautious, because the multinomial logistic estimator can be "brittle", as our own Clyde Schechter puts it in a different post on multinomial logit. Basically, the complete or quasi-complete separation problem in regular logistic regression, where some combinations of the independent variables perfectly predict failure or success, is amplified in multinomial regression (and here, Andrea has a 5-value multinomial regression).
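          A cross-tabulation is usually enough to spot this kind of separation. Here is a self-contained sketch on simulated data; modalclass and retired are made-up stand-ins for Andrea's variables, and class 4 is deliberately built to contain no retirees.

```stata
* Simulate 500 people in 5 classes, with an empty cell forced into class 4.
clear
set seed 12345
set obs 500
gen byte modalclass = 1 + mod(_n, 5)
gen byte retired = runiform() < 0.30
replace retired = 0 if modalclass == 4

* A zero (or near-zero) cell here is the signature of separation
tabulate modalclass retired

* The same check without reading the table by eye
count if modalclass == 4 & retired == 1
display "retirees in class 4: " r(N)
```

          With a real model, a zero cell like this means the log odds for that class heads toward minus infinity, so the coefficient has no finite estimate and no usable standard error.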

          From that post, I'll repeat his advice to Andrea to look at the parameter estimates of the dummy retired where its standard errors are missing. If those values look implausible, that can indicate what's happening. I am not very familiar with multinomial regression, but the coefficients should have a log odds interpretation, and a very high or very low log odds can perhaps be an indicator of what I was talking about.

          Unfortunately, in this situation, you may only be able to proceed by sacrificing the least substantively important predictor.

          Last, a side note: I am assuming that the dummy for educated is really a dummy. Just be careful: if it is actually a categorical variable, you want to use the i. prefix to denote that, otherwise it will be treated as continuous. I generally code any binary or categorical variable with the i. factor-variable syntax, even though it's probably not necessary for binary variables.
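          To show what the i. prefix changes, here is a small sketch on Stata's shipped auto dataset. I use -regress- and the variable rep78 purely for illustration; the same factor-variable syntax applies inside -gsem-.

```stata
sysuse auto, clear

* Without i., rep78 enters as one continuous term with a single slope
quietly regress price rep78
display "model df without i.: " e(df_m)

* With i., each level except the base gets its own coefficient
quietly regress price i.rep78
display "model df with i.:    " e(df_m)
```

          The jump in model degrees of freedom is the extra parameters from treating the variable as categorical, which is also why a wrongly categorical continuous variable can blow up the size of the model.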


          • #6
            Weiwen,
            you have been so kind in helping me. I will try your suggestions.
