  • Latent Profile Analysis: best practices in "starting values" selection?

    Dear Statalist users,

    I initially created a post here, where I was having difficulties understanding the basics of LPA syntax in Stata. Following the post here, I was able to replicate Masyn's (2013) LPA using startvalues(randompr, draws(5) seed(15)). I applied the same starting values uniformly across my 6 classes with 4 different model restrictions. My BIC results are as follows:

    class | BIC (class-invariant, diagonal) | BIC (class-varying, diagonal) | BIC (class-invariant, unrestricted) | BIC (class-varying, unrestricted)
    1 | 6536.391 | 6536.391 | 5726.355 | 5726.355
    2 | 6044.384 | 5982.513 | DRE | 5648.8
    3 | 5923.718 | 5917.452 | 5563.118 | 5620.018
    4 | 5915.317 | 5820.818 | 5587.027 | 5741.81
    5 | 5898.285 | 5838.543 | 5731.829 | 5756.148
    6 | 5843.54 | 5817.436 | 5259.08 | 5927.461
    (DRE = discontinuous region encountered)
    I noticed what looks like erratic behaviour in my BIC values for the class-invariant, unrestricted model (column four), which is probably due to the stringent class-invariance assumption. I've also noticed that as I increase the number of classes, Stata struggles quite a bit to provide results for the same model specification.

    Now, comment #7 in the same post recommends using startvalues(randompr, draws(50) seed(15)) emopts(iterate(10)) once one hits 5+ latent classes. I applied this criterion uniformly across my 6 class models with 4 different restrictions. My results look as follows:

    class | BIC (class-invariant, diagonal) | BIC (class-varying, diagonal) | BIC (class-invariant, unrestricted) | BIC (class-varying, unrestricted)
    1 | 6536.391 | 6536.391 | 5726.355 | 5726.355
    2 | 6044.384 | 5982.513 | 5677.851 | 5648.8
    3 | 5923.718 | 5932.882 | 5698.263 | 5620.018
    4 | 5915.317 | 5820.818 | 5715.92 | 5773.844
    5 | 5898.285 | 5846.176 | 5750.381 | 5780.88
    6 | 5843.54 | 5818.873 | 5772.309 | 5924.519
    At this point, I am very confused about when to use one set of starting values rather than another. Every time I use a different set my class profiles change markedly, and I am trying to avoid the trap of choosing the starting values that best fit my research expectations.

    I'm leaning towards using Masyn's starting values, as in my first table. It just seems like a standard I can follow. But if anyone has some insights on this topic, I would be very grateful to discuss. Many thanks.

    P.S. I am aware of the "gsem estimation options" document from the Stata manual. Unfortunately, I could not solve my problem after reading it.
    Last edited by Jesus Pulido; 26 Jan 2021, 23:55.

  • #2
    Stata has a default way to calculate start values for LCA and other generalized SEMs. The manual description is very brief, and I don't know enough statistical theory to comment.

    [Attached figure: screenshot from Masyn's chapter showing conceptual diagrams of likelihood functions, panels (a) through (e).]


    To my knowledge, the best practice for selecting start values in LCA is to select them randomly, and to do so multiple times. Here's why. The figure above is from Masyn's chapter. It's a simplified conceptual diagram of what likelihood functions look like - in reality there's one dimension for every parameter, so we are really maximizing a multidimensional function, but the figure is perfect for teaching purposes. Most likelihood functions are relatively simple, with a clear global maximum, i.e. they look much like fig a. You could change the starting values and the maximizer would still converge to the global maximum, even if it took a few more iterations.

    The likelihood function for latent class models is definitely more complex; the possible scenarios are figures b through e. You want a lot of randomly selected start values, so that at least a few will converge to the global maximum. It seems safe to assume that the more latent classes you have, the more complex the likelihood function gets.
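    To see why random starts work, here is a toy sketch in Python (not Stata, and not an actual LCA - log_like below is just a hypothetical one-dimensional stand-in for a multimodal log likelihood): hill climbing from a single start can stall at a local maximum, but the best of many random starts reliably finds the global one.

```python
import math
import random

def log_like(x):
    # Toy one-dimensional "log likelihood" with several local maxima,
    # loosely in the spirit of panels (b)-(e) in Masyn's figure.
    return math.sin(3 * x) - 0.1 * x ** 2

def climb(x, step=0.05, iters=200):
    # Crude gradient ascent: from one starting value, walk uphill
    # until we settle at whichever maximum is nearest.
    for _ in range(iters):
        grad = (log_like(x + 1e-6) - log_like(x - 1e-6)) / 2e-6
        x += step * grad
    return x

rng = random.Random(15)                            # like seed(15)
starts = [rng.uniform(-5, 5) for _ in range(50)]   # 50 random starts
solutions = [climb(s) for s in starts]
best = max(solutions, key=log_like)                # keep the best converged value
```

    Any single start may end up on a lesser peak; taking the maximum over all converged solutions is what makes the procedure reliable.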

    [For Bayesians: you know how you don't simply search for one global maximum, but instead (or rather, you ask Stata or R or whatever program to) explore the entire likelihood space with a semi-random walk? There seems to be a parallel with what we're trying to do above, although I acknowledge the obvious differences in estimation methods. Might it not be better to just use Bayesian estimation for LCA to begin with? I think so, but I can't really conceive of what the syntax would look like.]

    Now, just to clear something up:

    I applied the same starting values uniformly across my 6 classes with 4 different model restrictions.
    I'm not trying to be an English pedant here, but I do want to make clear that the code below doesn't apply the same set of start values:

    Code:
    startvalues(randompr, draws(50) seed(15)) emopts(iterate(10))
    What this actually does is, for each model, partially estimate results using 50 different sets of random start values. How are those chosen? For each observation, you end up with a vector of probabilities of latent class membership. That vector is part of what's being maximized every time you hit run! I believe this option creates a random starting vector for each observation, e.g.

    Mrs. Perez's start values, try 1: 0.24, 0.21, 0.10, 0.46
    Mrs. Perez's start values, try 2: 0.11, 0.35, 0.40, 0.04
    Mrs. Perez's start values, try 3: 0.89, 0.01, 0.05, 0.05
    ...
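    To illustrate what one such draw might look like, here is a sketch in Python rather than Stata (and note this is an assumption about the mechanism - Stata's actual randompr draw may well differ, e.g. it may not be a simple normalised-uniform draw): generate positive numbers per class and normalise them to sum to one.

```python
import random

def random_class_probs(n_classes, rng):
    # One random start vector of class-membership probabilities for a
    # single observation: draw positive numbers, normalise to sum to 1.
    raw = [rng.random() + 1e-9 for _ in range(n_classes)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(15)                                  # like seed(15)
tries = [random_class_probs(4, rng) for _ in range(3)]   # 3 tries, 4 classes
```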

    You can actually see this process if you add the options

    Code:
    startvalues(randompr, draws(1) seed(15)) emopts(iterate(0)) noestimate
    (Stata will spit out a result, but declare that convergence was not achieved.) Then run:

    Code:
    predict pr*, classposterior
    The predicted posterior probabilities should be your start values. (I don't know what happens if you ask for multiple draws before running that code.) Stata should just generate random start values and then try to fit the model. Because you asked for zero EM iterations, it skips the EM algorithm; because you specified the noestimate option, it skips Newton-Raphson as well, and simply spits out whatever results it has at that point.

    TL;DR: what you need is a method for selecting start values, not one particular set of start values. As far as I can tell, the LCA literature regards the random-starts method as reasonable enough, so just use it. The particular random-number seed should ultimately be irrelevant: if you use enough random starts to identify the global maximum, then even if you re-estimate with a different seed, the program will still find the same global maximum.
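    The emopts(iterate(10)) part of the earlier recommendation reflects a common two-stage strategy, sketched below in Python with the same kind of hypothetical one-dimensional objective (not Stata, not a real likelihood): run only a few cheap iterations from every random start, then optimise to full convergence only from the most promising candidate.

```python
import math
import random

def log_like(x):
    # Hypothetical multimodal "log likelihood": several local maxima,
    # one global maximum.
    return math.sin(3 * x) - 0.1 * x ** 2

def ascend(x, iters, step=0.1):
    # Crude gradient ascent for a fixed number of iterations.
    for _ in range(iters):
        grad = (log_like(x + 1e-6) - log_like(x - 1e-6)) / 2e-6
        x += step * grad
    return x

rng = random.Random(15)
starts = [rng.uniform(-5, 5) for _ in range(50)]

# Stage 1: a few cheap iterations from every start,
# analogous in spirit to emopts(iterate(10))
screened = [ascend(s, iters=10) for s in starts]

# Stage 2: optimise to full convergence only from the best screened value
best_start = max(screened, key=log_like)
solution = ascend(best_start, iters=500)
```

    Screening discards hopeless starts early, so the expensive full optimisation runs only once instead of fifty times.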

    I hope this makes sense. I'm trying to convey information about a pretty complex method over an internet forum, which can sometimes result in things not being clear.
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      As Weiwen has aptly pointed out, "the best practice for selecting start values is to select them randomly, and to do so multiple times," not only in LCA estimation but in non-linear optimisation in general. See, for example, Figure 1 and related discussion in Christensen, Hurn and Lindsay (2008): https://doi.org/10.1016/S0313-5926(08)50025-7.

      Arne Risa Hole and I have a paper that squarely focuses on the selection of starting values. Our application is a continuous mixture model (G-MNL) but I think our results and related discussion are relevant to finite mixture models too: https://doi.org/10.1111/rssc.12209.



      • #4
        Thanks very much Weiwen and Hong for your valuable insights.

        It is now clear to me that each draw constitutes a separate set of random start values. I read the papers Hong provided, and my takeaway is that "repeated starting values chosen at random offer a better chance of avoiding convergence to a local minimum" (Christensen et al., 2008). My understanding is that increasing the number of random draws should eventually lead me to a global solution.

        From our discussion so far, I gather that it is encouraged to use a large number of draws in my starting values syntax. It makes sense now why Weiwen recommends the use of 50 draws.

        I've run my LPA model using 1 draw (the default), 5 draws (as in Masyn's book chapter), 10 draws, 20 draws, 50 draws and even 60 draws. For example, the syntax for my 3-class, class-varying, unrestricted model (with 50 random draws) looks as follows:

        Code:
        gsem (norms trust farming lfunction informal engagement advisory <- _cons), lclass(C 3) startvalues(randompr, draws(50) seed(15)) lcinvariant(none) covstructure(e._OEn, unstructured) nolog nodvheader
        Unfortunately, as I increase the number of random draws my class profiles keep changing (!). Even as I increase the draws (50 and 60), results differ at the BIC level as well as in the class profiles.

        Is there something I am misinterpreting from our discussion above? Perhaps I should try increasing the number of draws even more? Thanks.

        P.S. Something else to consider is whether I should focus on which of these random-draw settings (i.e. 1, 5, 10, 20, 50, 60) yields the models with the best log likelihoods - just a thought.
        Last edited by Jesus Pulido; 27 Jan 2021, 17:11.



        • #5
          Unfortunately, as I increase the number of random draws my class profiles keep changing (!). Even as I increase the draws (50 and 60), results differ at the BIC level as well as in the class profiles.
          If I understand you correctly, with 3 latent classes, you keep getting different results with 50 vs. 60 random draws. If that's right, I'd try 100.

          I have to admit, this isn't consistent with my experience. In the analyses I've done, I have had about 6-8 indicators (in my case a mix of binary and Gaussian, but no more than 2 of the latter). For 2-3 latent classes, I have been fine with relatively few random starts (30-40, I believe), adding more random starts as the number of latent classes grows. That said, a model with multiple Gaussian indicators is inherently more complex: in each class, each indicator has one parameter for the mean and one for the (error) variance, and then there is a variance-covariance matrix of the (errors of the) indicators. I wonder if that's causing the issue, and if this model is a lot more complex than it might seem at first.

          I think that Masyn's chapter recommends 50-100 random starts. She did not, I believe, make a clear statement about how many of those starts should converge to the same maximum likelihood estimate before you trust it; she just said that the solution should replicate multiple times, e.g. this sentence on pg. 565:

          ...as a high frequency of replication of the apparent global solution across the sets of random starting values increases confidence that the "best" solution found is the true maximum likelihood solution...
          Now, as I said before, there's no preprogrammed way in Stata to estimate an LCA model to completion, save the result, start again with a new random draw, estimate, save that result, and so on. If you are sufficiently determined, you can do it with a loop and putexcel. I have done this myself, but honestly I am not willing to post the code: it's complex to explain and I don't want to be responsible when it fails to work on someone else's machine - no offense to all of you, but I do have my own research to do.

          If you are an R user and you only have binary or ordinal items, you could consider the R package poLCA, which does have a mechanism for random starts and saving the results. I haven't used the R package flexmix, but I know it handles more types of indicators, and I believe it has the same mechanism regarding random starts and saved results.
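          For readers who want to see the shape of that loop-and-save workflow, here is a toy sketch in Python rather than Stata (a hypothetical one-dimensional two-class normal mixture fitted by EM - every name here is made up, and a real LPA would of course use gsem or one of the R packages above): estimate to completion from each random start, save every log likelihood, and count how often the best solution replicates, in the spirit of Masyn's advice.

```python
import math
import random

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_mixture(data, rng, em_iters=100):
    # One complete EM estimation of a two-class normal mixture,
    # started from random values; returns (log likelihood, sorted means).
    p = rng.uniform(0.2, 0.8)                              # mixing proportion
    mu = [rng.uniform(min(data), max(data)) for _ in range(2)]
    var = [1.0, 1.0]
    n = len(data)
    for _ in range(em_iters):
        # E-step: posterior probability of class 0 for each observation
        resp = [p * norm_pdf(x, mu[0], var[0])
                / (p * norm_pdf(x, mu[0], var[0])
                   + (1 - p) * norm_pdf(x, mu[1], var[1]))
                for x in data]
        # M-step: update proportion, means and variances
        n0 = min(max(sum(resp), 1e-6), n - 1e-6)   # guard: no empty class
        n1 = n - n0
        p = n0 / n
        mu[0] = sum(r * x for r, x in zip(resp, data)) / n0
        mu[1] = sum((1 - r) * x for r, x in zip(resp, data)) / n1
        # variance floor keeps this toy example away from degenerate spikes
        var[0] = max(sum(r * (x - mu[0]) ** 2 for r, x in zip(resp, data)) / n0, 0.2)
        var[1] = max(sum((1 - r) * (x - mu[1]) ** 2 for r, x in zip(resp, data)) / n1, 0.2)
    ll = sum(math.log(p * norm_pdf(x, mu[0], var[0])
                      + (1 - p) * norm_pdf(x, mu[1], var[1])) for x in data)
    return ll, sorted(mu)

# Simulated sample: two well-separated latent classes
data_rng = random.Random(0)
data = ([data_rng.gauss(-2, 1) for _ in range(100)]
        + [data_rng.gauss(3, 1) for _ in range(100)])

# The loop: estimate to completion from 20 random starts, saving every result
results = [fit_mixture(data, random.Random(seed)) for seed in range(20)]
best_ll = max(ll for ll, _ in results)
best_means = next(m for ll, m in results if ll == best_ll)
# Masyn-style check: how often did the best solution replicate across starts?
replications = sum(1 for ll, _ in results if abs(ll - best_ll) < 0.5)
```

          The replication count is the key output: a best log likelihood that shows up only once out of many starts deserves far less trust than one that replicates repeatedly.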
