  • FMM with GLM yields strange results on simulated data

    Hello Statalist Community,

    I am trying to test the ability of Stata 15's FMM procedure to estimate the parameters of a zero-inflated distribution of a proportion. To that end, I simulate the underlying data and then estimate the parameters. FMM seems to do a good job recovering the latent class marginal probabilities, yielding estimates of 38.8% for class 1 (true probability = 39%) and 61.2% for class 2 (true probability = 61%). It also captures the magnitude of the pointmass at 0 well, but it seems to have a hard time recovering the parameters of the logistically distributed proportion within the (0, 1) interval. In contrast, a simple GLM, using only the class 2 data within the (0, 1) interval, successfully recovers the slope parameter b (estimate = 0.3903188 against a true value of 0.4).

    In particular, the code I am running is:

    Code:
    clear all
    set more off
    set obs 2000
    set seed 12345
    
    // generate class indicator
    gen class = (_n > 780)                             // 39% in class = 0, 61% in class = 1
    
    // set parameters
    scalar mu = -0.1
    scalar sx = 0.3
    scalar se = 0.1
    scalar b = 0.4
    
    // generate random Normal variables
    gen x = rnormal(0, sx)
    gen e = rnormal(0, se)
    
    // generate simulated series for Y
    gen y = 0 if class == 0
    replace y = 1/(1 + exp(-(mu + b*x + e))) if class == 1
    
    // plot the ys versus the x
    twoway scatter y x, by(class) name(y_by_x, replace)
    histogram y, frequency by(class) width(0.03) fcolor(forest_green%50) name(y_by_class, replace)
    histogram y, frequency width(0.03) fcolor(navy%50) name(y_hist, replace)
    
    // estimate parameters using known class and single GLM
    glm y x if class == 1, family(binomial) link(logit)
    
    sort y
    // use FMM with a pointmass at 0 and a GLM to estimate the parameters
    fmm, difficult :    (pointmass y, value(0)) ///
                        (glm y x, family(binomial) link(logit))
    predict exp_y*
    predict pr*, classposteriorpr
    format %4.3f pr*
    estat lcprob
    estat lcmean
    
    // compute the predicted values
    gen y_hat = pr1*0 + pr2*exp_y1
    // summarize the dependent variable and its fitted values
    su y y_hat exp_y1
    The distributions of the simulated dependent variable (y) and its FMM-fitted values (y_hat) are quite different in the (0, 1) interval. Do you know what might be going on and why the FMM has such a difficult time estimating the parameters of a GLM with a family(binomial) and link(logit) in this example?
    Last edited by Stefan Platikanov; 20 Mar 2019, 21:49.

  • #2
    Stefan,

    To sum up, you are creating a mixture distribution: 61% of the observations have a fractional logit outcome in which x is related to y, and 39% are structural zeroes. You can recover the fractional logit parameters in the non-zero class if you can identify it. When you run an FMM, you can recover the class proportions - in fact, when you tab the modal class against the simulated class, you'll see that all but one observation are correctly classified, which is unsurprising because the non-zero class contains no values at or near 0. In the non-zero class, y ranges from about 0.36 to just under 0.6, with a mean of 0.475.
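    For reference, a sketch of the check I have in mind (assuming your fmm model has just been run; the variable names cpr* and modal are mine):

    Code:
    // posterior class probabilities, then the modal class assignment
    predict cpr*, classposteriorpr
    gen modal = cond(cpr2 > cpr1, 1, 0)    // 1 = assigned to the non-zero class
    // cross-tabulate the modal class against the simulated class indicator
    tab modal class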

    First, when you predict the mean of y (i.e. predict exp_y), I think that prediction already accounts for the probability that the observation is in the structural zero class. So, your further calculation of predicted values may be unnecessary. You can read up on predictions on pages 20 and 21 of the FMM PDF manual, which I think corroborates my point, unless I read it wrong.

    Second, I inspected the data after you predicted the expected y, and it does look a bit odd. The model predicts proportions in the 0.9s for all observations, which is much higher than any actual proportion you generated. This is pretty counterintuitive to me. I should note that because you generated your DV as a proportion rather than an actual binary variable, and because your proportions all sit very comfortably inside the (0, 1) interval, you could easily have fit the second class with the beta family. That family isn't supported by the regular glm command, but it is available in its own command, betareg, and you can easily verify that it recovers the parameters correctly on class 1. Unfortunately, the fmm command gets hung up on these data when you tell it to use the beta family, because the data do contain 0s.
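    A sketch of that check, fit only to the known non-zero class (betareg is an official command in recent Stata releases):

    Code:
    // beta regression on the known non-zero class only
    betareg y x if class == 1
    // the equivalent fmm specification chokes because y contains exact 0s:
    // fmm : (pointmass y, value(0)) (betareg y x)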

    Now, I will also point out that you could fit a linear probability model to the data. We all know this isn't the preferred model given the existence of the fractional logit model, but in FMM, that model isn't implemented. And what do you know, the linear model does correctly recover the mean of y in the non-zero class. Not shown, but if you predict the expected proportion, the values are pretty good predictions of the actual value of y in class 1 (the non-zero class).

    Code:
    fmm : (pointmass y, value(0)) (regress y x)
    *some output omitted
    Class          : 2
    Response       : y
    Model          : regress
    
    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    y            |
               x |   .0972886    .002319    41.95   0.000     .0927434    .1018337
           _cons |    .474429   .0006968   680.84   0.000     .4730633    .4757948
    -------------+----------------------------------------------------------------
         var(e.y)|   .0005919    .000024                      .0005468    .0006408
    ------------------------------------------------------------------------------
    
    estat lcprob
    Latent class marginal probabilities             Number of obs     =      2,000
    
    --------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
           Class |
              1  |        .39   .0109064      .3688471    .4115749
              2  |        .61   .0109064      .5884251    .6311529
    --------------------------------------------------------------
    
    
    estat lcmean
    Latent class marginal means                     Number of obs     =      2,000
    
    Expression   : Predicted mean (y in class 2.Class), predict(outcome(y) class(2))
    
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    2            |
               y |   .4754535   .0006966   682.57   0.000     .4740882    .4768187
    ------------------------------------------------------------------------------
    So, I can't precisely explain why fmm and glm fail in this instance, nor why the predicted values of the proportion in the non-zero class are so close to 1. I do wonder why the fractional logit model isn't implemented in fmm: it won't explode if there are proportions of exactly 0 or 1, whereas the beta model will.
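    As an aside, outside of fmm the fractional logit is available through the official fracreg command (Stata 14+), so you can at least confirm it recovers your parameters on the known non-zero class:

    Code:
    // fractional logit fit directly to the non-zero class
    fracreg logit y x if class == 1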
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Weiwen,

      Thank you for looking into the issue. Your summary is correct.

      I am not sure about your point on the predicted values: "First, when you predict the mean y (i.e. predict exp_y), I think that already accounts for the probability that the observation is in the structural zero class. So, your further calculation for predicted value may be unnecessary. You can read up on predictions on pages 20 and 21 of the FMM PDF manual, which I think should corroborate my point, unless I read them wrong."

      As far as I understand the manual, predict exp_y will store in exp_y the predicted mean of y per latent class. In this case, we have class 1, which is degenerate (y = 0), and class 2, whose predicted values are stored in exp_y. I added a line to the code (see below) plotting the histogram of exp_y, and Stata automatically gives it the name "Predicted mean (y in class 2.Class)".

      If you would like to generate predictions from the model conditional on the observed x, i.e. E[y|x], I think the way to achieve this is predict y_post, mu pmarginal, which uses the posterior latent class probabilities. I added this line to the code, along with a line plotting the histogram of y_post, and Stata names y_post "Predicted mean (y), using posterior probabilities". As far as I understand the manual (which is quite general and could benefit from more examples and detail), this is the mean predicted value of y given an observation of x, i.e. E[y|x]. In my calculation of y_hat, I use the posterior probabilities (pr1 and pr2) and achieve the same thing - y_post and y_hat are identical. Please let me know if I am wrong.


      Code:
      clear all
      set more off
      set obs 2000
      set seed 12345
      
      // generate class indicator
      gen class = (_n > 780)                           // 39% in class = 0, 61% in class = 1
      
      // set parameters
      scalar mu = -0.1
      scalar sx = 0.3
      scalar se = 0.1
      scalar b = 0.4
      
      // generate random Normal variables
      gen x = rnormal(0, sx)
      gen e = rnormal(0, se)
      
      // generate simulated series for Y
      gen y = 0 if class == 0
      replace y = 1/(1 + exp(-(mu + b*x + e))) if class == 1
      
      // plot the ys versus the x
      twoway scatter y x, by(class) name(y_by_x, replace)
      histogram y, frequency by(class) width(0.03) fcolor(forest_green%50) name(y_by_class, replace)
      histogram y, frequency width(0.03) fcolor(navy%50) name(y_hist, replace)
      
      // estimate parameters using known class and single GLM
      glm y x if class == 1, family(binomial) link(logit)
      
      sort y
      // use FMM with a pointmass at 0 and a GLM to estimate the parameters
      fmm, difficult :    (pointmass y, value(0)) ///
                          (glm y x, family(binomial) link(logit))
      predict exp_y*
      predict pr*, classposteriorpr
      format %4.3f pr*
      estat lcprob
      estat lcmean
      
      // predicted values with posterior probabilities
      predict y_post, mu pmarginal
      
      // compute the predicted values
      gen y_hat = pr1*0 + pr2*exp_y1
      // summarize the dependent variable and its fitted values
      su y y_hat exp_y1 y_post
      
      histogram y_hat,  frequency width(0.03) color(green)  fcolor(%50) name(yhat, replace)
      histogram y_post, frequency width(0.03) color(orange) fcolor(%50) name(ypost, replace)
      histogram exp_y1, frequency width(0.03) color(maroon) fcolor(%50) name(expy, replace)
      The GLM procedure (around the middle of the code) fits the fractional portion of the data pretty well. So I think it is its application within the FMM that produces strange results.

      I do not have much experience fitting FMM models, and therefore I started by generating well-behaved simulated data and testing if the procedure can estimate the underlying parameters. If FMM has a difficult time estimating the parameters of clean, well-behaved series, then it is pointless to try to fit a model of real-world data with it.

      In general, I am looking for a procedure that can fit bi- and tri-modal distributions bounded on the [0, 1] interval, with pointmass at 0 and/or 1. Theoretically, FMM is a promising approach, but it seems to have difficulty with very simple data. I am not sure whether I am doing something wrong or the problem lies elsewhere, which is why I posted in this community.
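      For what it's worth, the kind of model I ultimately have in mind is a sketch like the following (untested, and as you note the beta component may choke if 0s or 1s leak into it):

      Code:
      // hypothetical three-component mixture: pointmasses at 0 and 1,
      // plus a beta regression on the interior of (0, 1)
      fmm : (pointmass y, value(0)) ///
            (pointmass y, value(1)) ///
            (betareg y x)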
