  • gsem lclass() option, trying to predict class membership from an auxiliary variable.

    Hi all,

    I am using the gsem lclass() option in Stata 15.1. I have run several latent profile models and chosen the best one based on fit statistics (AIC, BIC, and calculated entropy) and interpretability. I am using four continuous variables to create classes from 51 states.

    I fitted a five-class model and now want to use an auxiliary variable (unemployment rate) to predict class membership. I wish to follow a three-step procedure as described in Vermunt (2010) and Asparouhov and Muthén (2013), in which I establish the LPA model, obtain posterior probabilities, and use these in a multinomial regression that accounts for class assignment uncertainty. I do not use Mplus, but it seems that this is implemented in the R3STEP procedure in that program (if this helps frame my question).

    I am having difficulty figuring out how to account for class uncertainty in a multinomial regression predicting class from unemployment rate. Is there some way to incorporate the posterior probabilities into the regression, perhaps using the gsem suite and/or weights? Maybe multiple imputation? Any suggestions on predicting class membership while accounting for class uncertainty are greatly appreciated. I am fitting a model and predicting posterior probabilities using the code below. This all works without problems.

    Code:
    gsem (receivedSVS secondarystudent timetoIPE employment <- ), regress lclass(c 5)
    predict cpost*, classposteriorpr
    The Stata page that discusses the new features of gsem lclass() mentions an extension to include covariates that determine the probability of class membership. They give example code on that page, which I have pasted below. I imagine this is akin to the one-step procedure mentioned in the papers cited above, but I am unsure. I cannot find examples or any mention of this extension in the Stata 15 gsem manual. Can anyone help me understand what the (C <- income) part is doing?

    Code:
     gsem (alcohol truant weapon theft vandalism <-, logit) (C <- income), lclass(C 3)
    I would prefer to run a three-step procedure, but I am also interested in understanding the extension offered by Stata. However, I will note that I had to use the nonrtolerance option to reach convergence; this was not the case before adding the (c <- unemployment) equation.

    Code:
    gsem (receivedSVS secondarystudent timetoIPE employment <- , regress) (c <- unemployment), lclass(c 5)

    Thank you for any advice.

    Jessica


    For reference:

    Vermunt, J. K. (2010). Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18(4), 450-469.

    Asparouhov, T., & Muthén, B. O. (2013). Auxiliary variables in mixture modeling: A 3-step approach using Mplus (Mplus web notes: No. 15, Version 6). Accessed from http://www.statmodel.com/download/3stepOct28.pdf

  • #2
    Code:
    gsem (alcohol truant weapon theft vandalism <-, logit) (C <- income), lclass(C 3)
    Addressing this bit first: this is a latent class regression, and I believe it corresponds to what Vermunt calls the one-step approach, i.e. your description is correct. Here, you are saying that there are k classes of people, and each class has its own probabilities of reporting alcohol use, truancy, weapon use, theft, and vandalism. Without the (C <- income) bit, Stata fits the latent class analysis that many people are now familiar with. In that LCA, Stata treats the latent class as a multinomial latent variable, and it estimates multinomial logit intercepts for each class (apart from the first, which serves as the base).

    When you add (C <- income), you are telling Stata that income is now a covariate on the multinomial side. Income does not influence any of the indicators directly; it is just a predictor of which latent class you end up in. So, if you think you have an LCA model that fits well and you want to see the effect of unemployment rate on class membership, you are already doing that. If you continue, you will get odds ratios that you would interpret just as you would in a normal multinomial logit model.
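
    For example, after fitting the latent class regression you can inspect the class-membership equation and the marginal class probabilities. Here is a minimal sketch using the variable names from the Stata example above; check the exact parameter names in your own output with coeflegend:

    Code:
    * latent class regression with income predicting class membership
    gsem (alcohol truant weapon theft vandalism <-, logit) (C <- income), lclass(C 3)
    * redisplay results with the legend of parameter names (handy for constraints later)
    gsem, coeflegend
    * marginal (model-implied) probabilities of membership in each class
    estat lcprob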

    However, while adding income as a covariate should not change the class proportions or the item response probabilities or means of your indicator variables by much, it might do so in practice. And clearly, you seem to have run into some sort of convergence issue in the process. In a latent class regression that I ran, I saw some very minor changes when I added my covariates of interest, but that was mainly because some people had missing values on the covariates.

    The rationale behind the three-step approach, as I understand it, is this. Often, we want to give our readers some descriptive statistics for each latent class, e.g. what proportion are female, mean age, etc. In doing so, we don't want the distal variables to influence class assignment at all. You can simply use modal class assignment, then run -tabstat- or -tabulate-. That approach doesn't account for classification uncertainty and is not recommended by experts (although I will say: the higher the entropy, the less error modal class assignment followed by tabulation will introduce, and I would guess that if your entropy is over 0.9 you can just do that if you only want descriptives).

    Vermunt's standard three-step method is another way. To be honest, I don't quite understand his rationale, but it sounds like you would just do modal class assignment, then run a multinomial logit model with the modal class as the dependent variable. I have no idea what his improved three-step routine does, so I can't comment.
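
    If it helps, here is a rough sketch of modal class assignment followed by simple descriptives, using the posterior probabilities you already predicted (cpost1-cpost5); maxpr and modalclass are just names I am making up for illustration:

    Code:
    * posterior probabilities were obtained earlier with: predict cpost*, classposteriorpr
    * assign each observation to the class with the highest posterior probability
    egen maxpr = rowmax(cpost*)
    gen modalclass = .
    forvalues k = 1/5 {
        replace modalclass = `k' if cpost`k' == maxpr
    }
    * descriptives by modal class (note: no adjustment for classification error)
    tabstat unemployment, by(modalclass) statistics(mean sd n)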

    Back to your convergence problem. If you had to invoke -nonrtolerance- before your model would converge, then clearly there is an issue with the likelihood function not quite being concave. One thing to check is this: are any logit intercepts at +/- 15? In a logit model, that corresponds to an item endorsement probability of nearly 0 or nearly 1. Mplus will automatically constrain logit intercepts that hit that level; Stata does not, but you can go back and manually impose the constraint. I have had several cases where I did so and was then able to obtain convergence without -nonrtolerance-. I suspect there is a multinomial equivalent, e.g. one class having a prevalence of nearly 0 or 1, and you should check for that as well.
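
    For instance, here is a sketch of how you might constrain an extreme logit intercept and refit. The parameter name below (_b[alcohol:2.C]) is only a placeholder; take the actual name for the offending intercept from the coeflegend output of your own model:

    Code:
    * find the exact parameter names first
    gsem, coeflegend
    * placeholder name: substitute whatever name coeflegend reports for the extreme intercept
    constraint 1 _b[alcohol:2.C] = -15
    * refit with the constraint imposed
    gsem (alcohol truant weapon theft vandalism <-, logit) (C <- income), lclass(C 3) constraints(1)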

    If not, then save the parameter estimates from your model with -nonrtolerance-, then see if gsem can converge when you supply those as start values, e.g.

    Code:
    gsem (receivedSVS secondarystudent timetoIPE employment <- , regress) (c <- unemployment), lclass(c 5) nonrtolerance
    * save the parameter estimates from the nonrtolerance run
    matrix b = e(b)
    * refit from those estimates as starting values, without nonrtolerance
    gsem (receivedSVS secondarystudent timetoIPE employment <- , regress) (c <- unemployment), lclass(c 5) from(b)
    It might also be that the unemployment rate is too strongly correlated with your indicator variables, and that is somehow causing convergence problems. I experienced this when I was estimating my LCR model and added metropolitan location: I already had race as a covariate, and in my data most of the non-White respondents are in metro regions. I don't know how to fix that. However, my current thought is that you really want to make sure your model converges without -nonrtolerance- enabled. In my own work, my policy is that if I can't get convergence without -nonrtolerance-, I don't treat that model as identified.
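
    If you want a quick check on that, looking at the raw correlations between the covariate and your indicators is easy enough (using your variable names):

    Code:
    * crude check: how strongly is the covariate related to the indicators?
    correlate unemployment receivedSVS secondarystudent timetoIPE employment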

    Side note: I'm not sure if you literally meant that you want to predict class membership from unemployment rate. Strictly speaking, I would say you predict someone's class membership from their indicator variables. That said, if you want to show the model-estimated proportion falling into each class at a given unemployment rate, then yes, I am pretty sure you want to run a latent class regression like the one you already did. Then you run margins, as described in this post, along with code to plot the margins (thanks to Clyde Schechter and Red Owl).
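
    If I remember the syntax correctly, something along these lines should give you the model-estimated class proportions at selected unemployment rates. I believe classpr is the relevant predict statistic for class membership probabilities after a gsem latent class model, but do check the gsem postestimation entry; the at() values here are made up:

    Code:
    * after: gsem (receivedSVS secondarystudent timetoIPE employment <- , regress) (c <- unemployment), lclass(c 5)
    * estimated probability of membership in class 1 at selected unemployment rates
    margins, predict(classpr class(1)) at(unemployment = (3 5 7 9))
    marginsplot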
    Last edited by Weiwen Ng; 26 Jan 2018, 13:58.
