Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Missing values in Latent class analysis with gsem in Stata 15 - different results in comparison with R's package poLCA

    I am using Stata Version 15.0 "gsem" command to get a latent class model fromm a sample with 6860 (incomplete) participants. I have 11 manifest dichotomous variables. I repeated the same analysis with R's poLCA and get the same log-likelihoods and class probabilities, but the sample sizes differ and consequently the information criteria (BIC, AIC). I can't figure out why different sample sizes are produced, ca anyone give me a hint how to find out?

    In Stata (version 15) I used:
    Code:
    gsem (a b c d e f g h i j k <- ,logit) ///
    , lclass(C 2) nocapslatent  ///
    startvalues(randomid, draws(100) seed(15)) emopts(iter(20))
    Code:
    /*
    . di e(ll) -26958.644
    . di BIC 54119.589
    number of obersvations 6606

    In R (version 3.5.1) I used first the option where missing values are retained
    Code:
    library(poLCA)
    f<- cbind(a,b,c,d,e,f,g,h,i,,j,k)
    lc2<-poLCA(f, data=datsubset, nclass=2, maxiter=20, tol=1e-5, na.rm=FALSE, nrep=100, verbose=TRUE, calc.se=TRUE)
    Code:
    /*
    . di e(ll) -26958.644
    . di BIC 54120.46
    number of observations 6860
    number of fully observed cases 5872
    and the option with complete-cases

    Code:
    library(poLCA)
    f<- cbind(a,b,c,d,e,f,g,h,i,,j,k)
    lc2<-poLCA(f, data=datsubset, nclass=2, maxiter=20, tol=1e-5, na.rm=TRUE, nrep=100, verbose=TRUE, calc.se=TRUE)




    Code:
    .di e(ll) -24396.64  .di BIC 48992.87 .number of fully observations 5872
    Actually there are 5872 full observed cases in Stata and R dataset. To me it is unclear what Stata's gsem does with missing values. I only found an information in Statas manual:
    gsem’s method ML is sometimes able to use more observations in the presence of missing values than can sem’s method ML. Meanwhile,gsem does not provide the MLMV methodprovided by sem for explicitly handling missing values.
    How does Stata's gsem handle missing values?
    I checked the overall sample and the missings values in each variable and they don't differ. There must be something in the code or algorithm to exclude a different number of cases. Any suggestions how to figure out what is happening?

    Thanks in advance,
    Susanne







  • #2
    I believe that elsewhere in the sem/gsem manual, it said that gsem is an equationwise deleter. That means that for each individual equation, gsem uses all the observations with information. To illustrate, let's take SEM example 50 and randomly delete some data. Then, let's show the output for latent class #1.

    Code:
    use https://www.stata-press.com/data/r16/gsem_lca1
    set seed 4142
    replace accident = . if runiform() < .2
    replace play = . if runiform() < .1
    gsem (accident play insurance stock <- ), logit lclass(C 2)
    
    Class          : 1
    
    Response       : accident                       Number of obs     =        173
    Family         : Bernoulli
    Link           : logit
    
    Response       : play                           Number of obs     =        202
    Family         : Bernoulli
    Link           : logit
    
    Response       : insurance                      Number of obs     =        216
    Family         : Bernoulli
    Link           : logit
    
    Response       : stock                          Number of obs     =        216
    Family         : Bernoulli
    Link           : logit
    
    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    accident     |
           _cons |   .8019937   .2335897     3.43   0.001     .3441662    1.259821
    -------------+----------------------------------------------------------------
    play         |
           _cons |  -.7553152   .2445245    -3.09   0.002    -1.234574   -.2760561
    -------------+----------------------------------------------------------------
    insurance    |
           _cons |  -.6845751   .2215777    -3.09   0.002    -1.118859   -.2502909
    -------------+----------------------------------------------------------------
    stock        |
           _cons |  -2.311925   .4756805    -4.86   0.000    -3.244242   -1.379609
    ------------------------------------------------------------------------------
    So, this is a simple latent class analysis with 4 variables. I randomly deleted some answers for only two oof the variables (accident and play). gsem reports the number of obs for each equation, and it is correct.

    When you check the scalar e(N), it reports 216 total obs. There's a matrix called e(_N), which reports the total obs for each equation.

    Code:
    di e(N)
    216
    
    mat list e(_N)
    
    e(_N)[1,8]
              1.C:       1.C:       1.C:       1.C:       2.C:       2.C:       2.C:
         accident       play  insurance      stock   accident       play  insurance
    r1        173        202        216        216        173        202        216
    
              2.C:
            stock
    r1        216
    I exported the data to the R environment and attempted to fit the model with the poLCA package. Other Stata users who want to try this: you will need to add 1 to each variable before doing so, because poLCA expects the responses to be positive integers starting at 1.

    With the na.rm = FALSE option (Stata users: this tells the package to use casewise deletion, basically don't remove missing; the default seems to be listwise delete), R reports 164 fully observed cases out of 216 total. This is the correct number; I checked in Stata:

    Code:
     egen a = nmiss(accident play insurance stock)
    unknown egen function nmiss()
    tab a
    
              a |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        164       75.93       75.93
              1 |         47       21.76       97.69
              2 |          5        2.31      100.00
    ------------+-----------------------------------
          Total |        216      100.00
    I think that in this case, R and Stata do the same thing if you tell R to use casewise deletion. As with you, I have the same LL. However, unlike you, both softwares report the same BIC. I can't explain why your total N and BIC under this condition differ between R and Stata, but I suspect you have some observations that are missing on all the variables. I deliberately did not create any such cases in my example. Perhaps you could check how many such cases you have? If you use my egen command on your data, then if I'm right, you'll have 254 cases where a = 11.
    Last edited by Weiwen Ng; 23 Mar 2020, 15:58.
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

    Comment


    • #3
      Originally posted by Weiwen Ng View Post
      ...

      With the na.rm = FALSE option (Stata users: this tells the package to use casewise deletion, basically don't remove missing; the default seems to be listwise delete), R reports 164 fully observed cases out of 216 total. This is the correct number; I checked in Stata:

      ...
      First, the sentence above came out garbled. Casewise and listwise deletion are actually synonymous - these mean that if a respondent has any missing items, drop them entirely. Sorry about that, I was in a hurry to go do something else. Casewise deletion is the default behavior in most estimation commands, e.g. OLS regression. What I meant was pairwise deletion, aka complete case analysis, available case analysis - like the pw option in the correlation commands, you analyze all pairs of observations for which there's complete data, or in the gsem case, you discard cases that have missing data for one equation, then you move to the next equation and see who has missing data for that one. The default in the R package poLCA seems to be casewise/listwise deletion, but there's an option for pairwise deletion.

      Second, my hunch was this. Say some respondents have all the items missing. They contribute absolutely no information to an LCA model. Logically, it seems to me that they should not figure into the total N, which you might want to use to calculate the BIC. If I had a bunch of subjects with partial data, they can still contribute information to an LCA model. They should arguably figure into the total N somehow - BIC doesn't adjust for cases where some observations only contribute partial information, and maybe it should, but this is now way above my pay grade. Anyway, how do you figure out the total N?

      I went back to my modification of the stock Stata dataset. For some reason, on my Mac with this random seed, Stata had given the last 5 observations 2 missing variables (remember that I randomly set some observations to missing for the first two variables). I went and made the last two observations missing on all variables.

      . replace insurance = . in 215/216
      (2 real changes made, 2 to missing)

      . replace stock = . in 215/216
      (2 real changes made, 2 to missing)

      . drop a

      . egen a = rowmiss(accident play insurance stock)

      . tab a

      a | Freq. Percent Cum.
      ------------+-----------------------------------
      0 | 164 75.93 75.93
      1 | 47 21.76 97.69
      2 | 3 1.39 99.07
      4 | 2 0.93 100.00
      ------------+-----------------------------------
      Total | 216 100.00
      I then exported the data to R (incrementing all variables by 1, as discussed earlier, due to poLCA's requirements). Then I fit 2-class LCA models in both programs - note that in Stata, I had to invoke the nonrtolerance option (long story, but one latent class has a logit intercept around 15, which translates to a prevalence of basically 1; either search my previous posts or just trust me).

      In Stata:

      Code:
      di e(N)
      214
      estat lcgof
      
      ----------------------------------------------------------------------------
      Fit statistic        |      Value   Description
      ---------------------+------------------------------------------------------
      Information criteria |
                       AIC |    956.989   Akaike's information criterion
                       BIC |    987.283   Bayesian information criterion
      ----------------------------------------------------------------------------
      And in R:

      Code:
      number of observations: 216
      number of fully observed cases: 164
      number of estimated parameters: 9
      residual degrees of freedom: 6
      maximum log-likelihood: -469.4946
       
      AIC(2): 956.9892
      BIC(2): 987.3667
      Hence, poLCA is (arguably) not calculating its N correctly in the (presumably rare) case where some observations are missing on all the indicators. This is no biggie (I think), and it shouldn't really have a substantive impact. You don't care (I believe) about the absolute value of the BIC, you just use it to select between different models. In each program, the behavior is consistent, so the change in BIC between different numbers of latent classes should be unaffected.

      My usual caveats apply: latent class analysis is a pretty advanced technique, and it's not easy for most applied statisticians, and I'm an applied statistician, and furthermore I could certainly be wrong.
      Last edited by Weiwen Ng; 23 Mar 2020, 18:41.
      Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

      When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

      Comment


      • #4
        I hate to bump the thread for something minor, but I need to mention something for completeness.

        I believe I've showed that poLCA doesn't quite correctly calculate its N for cases that are missing on all the response variables. This affects the calculation of BIC, which is ln(N) * K - 2(log likelihood). K is the number of parameters. Here, there are 11 intercepts per response variable per latent class (so, 22 parameters), plus one more multinomial intercept (basically number of classes minus 1), so K = 23.

        In my example, the BIC doesn't seem to substantially differ between poLCA and Stata. However, consider that there are 216 total observations (so, poLCA thinks that N = 216), but I created 2 observations where all the data was missing (so N really = 214, which Stata detected). ln(216) = 5.375. ln(214) = 5.366. That's a tiny difference, so the packages' BICs only differ in the decimal places.

        Susanne's BICs differ a bit more. However, she has 6,860 total observations, but 6,606 observations with at least some data. The difference in ln(K) is greater for her dataset. That's why her BICs differ more substantially.

        For example, by hand calculation, I'd expect her copy of poLCA to report (under pairwise deletion, i.e. na.rm = FALSE):
        BIC_poLCA = ln(6860) * - 2(-26958.644) = 54120.458

        Whereas Stata should report:
        BIC_Stata = ln(6606) * 23 - 2(-26958.644) = 54119.590

        Both are consistent (allowing for some rounding error) with the numbers she posted. Again, it's my contention that this shouldn't be a material error on poLCA's part, because we just use BIC to compare models. I would rest assured that in this case, Stata and poLCA are in fact posting the same results. I would perhaps refrain from fitting k = x latent classes in Stata and then jumping to poLCA for k = x + 1 latent classes and comparing BICs. I don't suppose many people will need to do this.

        Last housekeeping: I was using Stata 16.0. If you want to make gsem do listwise deletion, you can add the option listwise. However, the Stata manual (intro 4) says this, emphasis mine:

        1. gsem has an option, listwise, that duplicates the sem rules. This is used in testing of Stata. There is no reason you would want to specify the option.
        Last edited by Weiwen Ng; 24 Mar 2020, 12:27.
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

        Comment


        • #5
          To finally close this thread:
          thank you so much!
          Actually, I had some cases with all variables missing - caution if using "heaven" package for imorting Stata datafile in R. Your suggestion that I get the same results using listwise deletion in R was right.

          Comment

          Working...
          X