
  • Asymptotic Distribution Free SEM and error variances of binary variables

    I have been puzzled by the "sem" command with method(adf) and its handling of binary variables. In every case I can construct, the sem command does not produce standard errors for the error variances of the binary variables when using method(adf), whereas method(ml) does produce them. However, the advice for handling a CFA with binary variables is to use ADF, not ML!

    I have reconstructed the problem with the auto dataset for demonstration’s sake (code below), because the dataset I’m working with is very large. I had to expand the auto dataset, adding some random noise. I know the constructs make no sense and the groups I constructed do not come from an EFA, but my goal was to show the behavior of the command on an artificial dataset and keep things simple.

    [Attached screenshot: sem.png]

    A few comments:
    1. It is not a convergence problem.
    2. It is not an identification problem. The model is identified. The matrices are admissible according to the program.
    3. However, the program complains that “the fitted model is not full rank” whenever it fails to produce the error variances of the dummy variables. I’m not sure this matters for my purposes, because the statistics Stata would display based on the fitted variance matrix would not be relevant anyway if the factors are not MVN distributed. The statistics that I find logical to display under violations of normality are indeed produced.
    My context: I am doing a standard EFA followed by CFA, and half of the variables are binary.

    My questions are:
    1. Why is this occurring, and are the estimated error variances still trustworthy despite the fact that they do not have standard errors?
    2. The jackknife, surprisingly, can produce standard errors for the error variances of the binary variables (and a full-rank e(V) matrix). Can I trust them? (The jackknife on my actual dataset has no invalid replications, unlike with the auto dataset.) A sketch of the jackknife call follows the code below.

    Code:
    clear all
    sysuse auto, clear
    set seed 123456
    
    ********************************************************************************
    *ADF REQUIRES REALLY LARGE SAMPLE SIZE.
    *EXPAND THE SAMPLE AND ADD SOME NOISE TO THE EXPANDED SAMPLE.
    ********************************************************************************
    local rowstoadd=50
    
    local orig_obs=_N
    
    expand `rowstoadd'
    
    gen expandedpart=(_n>`orig_obs')
    
    local std_price=500
    local std_gear_ratio=.25
    
    foreach v in price gear_ratio {
        replace `v' = `v' + rnormal(0,`std_`v'') if expandedpart==1
    }
    
    local spread_mpg=3
    local spread_weight=200
    local spread_displacement=3
    local spread_turn=5
    local spread_rep78 = 1
    
    foreach v in mpg weight displacement turn rep78  {
        replace `v'=`v' + runiformint(-`spread_`v'',`spread_`v'') if expandedpart==1
        }
        
    local spread_headroom=.1
    
    foreach v in headroom {
        replace `v'=`v' + runiform(-`spread_`v'',`spread_`v'')  if expandedpart==1
        }
        
    *RANDOMLY FLIP -foreign- FOR A SMALL SHARE OF THE EXPANDED OBSERVATIONS,
    *SO THE BINARY VARIABLE ALSO GETS SOME NOISE.
    gen switch_foreign=0
    replace switch_foreign=rbinomial(4,.03)
    
    gen new_foreign=0 if foreign==1 & switch_foreign==1 & expandedpart==1
    replace new_foreign=1 if foreign==0 & switch_foreign==1 & expandedpart==1
    replace new_foreign=foreign if missing(new_foreign)
    
    tab foreign new_foreign
    replace foreign=new_foreign if expandedpart==1
    drop new_foreign switch_foreign
        
    
    ********************************************************************************
    *SHOW THE SEM RESULTS
    ********************************************************************************    
        
    *SHOW THAT ADF HAS MISSING STANDARD ERROR OF FOREIGN'S ERROR VARIANCE:
    sem (turn <- Engine@1 )(gear_ratio displacement rep78 <- Engine) ///
        (headroom <- Luxury@1 )(foreign trunk <- Luxury) ///
        (mpg <-Basic@1 ) (weight length <- Basic), method(adf)
         
    di "Converged=`e(conv)'"
    di "Admissible Matrix:"
    mat li e(admissible)
             
    *SHOW THAT MLE WORKS:
    sem (turn <- Engine@1 )(gear_ratio displacement rep78 <- Engine) ///
        (headroom <- Luxury@1 )(foreign trunk <- Luxury) ///
        (mpg <-Basic@1 ) (weight length <- Basic), method(ml)  ///
         nolog
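
    For reference, a minimal sketch of the jackknife run from question 2: it is the same specification with vce(jackknife) added (this is simply how I ran it, and it is slow on the expanded data).

    Code:
    *JACKKNIFE VARIANT REFERENCED IN QUESTION 2 (same model, vce(jackknife) added):
    sem (turn <- Engine@1 )(gear_ratio displacement rep78 <- Engine) ///
        (headroom <- Luxury@1 )(foreign trunk <- Luxury) ///
        (mpg <-Basic@1 ) (weight length <- Basic), method(adf) vce(jackknife)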

  • #2
    For those of you who have used other programs for asymptotic distribution-free SEM, do you encounter the exact same problem when including binary variables, or is this problem specific to Stata?



    • #3
      I have verified that this has nothing to do with the latent nature of my CFA. I even see the problem when fitting a basic linear regression with sem, method(adf).

      In the below example, I used the original auto dataset.

      A quick calculation indicates that, despite the missing standard error, the estimated error variance is itself correct: if we take the error variance estimated by sem, multiply it by N/(N-k-1) (here 74/71), and take the square root, we recover the RMSE reported by the reg command, as expected. (A sketch of this check follows the code below.)

      But does anyone know why the adf method omits the standard errors of the error variances while the ml method reports them?


      Code:
      sysuse auto, clear
      sem (foreign <- weight mpg),  method(adf)
      reg foreign weight mpg
      [Attached screenshot: adf_bug.png]
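
      If anyone wants to reproduce that back-of-the-envelope check without copying numbers by hand, here is a minimal sketch using only the stored results of -regress- (the 74/71 factor is e(N)/e(df_r)):

      Code:
      sysuse auto, clear
      sem (foreign <- weight mpg), method(adf)
      reg foreign weight mpg
      * sem's error variance is SSR/N in this example, while regress reports
      * RMSE = sqrt(SSR/(N-k-1)), so e(rmse)^2*e(df_r)/e(N) (that is, RMSE^2*71/74)
      * should reproduce the var(e.foreign) shown by sem above
      di "implied var(e.foreign) = " e(rmse)^2*e(df_r)/e(N)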



      • #4
        You have a clear and well-documented question. Given that no one has answered it over the last few days, I think it's perfectly appropriate to send it to Stata Tech Support. My experience is that they are prompt, competent, and friendly. However, I don't think they typically comb Statalist looking for unanswered technical questions; rather, they rely on users to contact them.



        • #5
          Thanks for your suggestion, Mike Lacy. I had been reluctant to do that because I'm not sure if it's a bug or there's something fundamental I'm missing (after all, I'm an economist trying to do a CFA!). Anyways, I think you are right and I will contact them. I will report what they say here in case anyone else is puzzled by the same behavior of the command.



          • #6
            Following up in case anyone else has the same question: I have been in touch with tech support. Long story, but the error variances for binary variables with method(adf) are incorrectly calculated in general and cannot be trusted. They are copied over from the start value vector without being estimated. The developers are working on a fix.

            The -nomeans- option calculates them correctly, but this isn't a fix because there are cases where the program converges without -nomeans- but does not converge with -nomeans-.
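
            To be concrete, -nomeans- is just an extra option on the same call; a minimal sketch using the simple regression example from #3:

            Code:
            sysuse auto, clear
            * same adf fit as in #3, but with the -nomeans- option tech support mentioned;
            * per their explanation this should calculate the error variance of foreign correctly,
            * though it is not a general fix because convergence can differ with and without it
            sem (foreign <- weight mpg), method(adf) nomeans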



            • #7
              Hi Alecia, did you get any feedback regarding your questions?
              Thanks



              • #8
                Radhouene DOGGUI, I understand what's going on but I don't think the problem has been fixed yet. If you want the details, email me at [email protected] and I can forward you the emails I had with tech support.



                • #9
                  Alecia Cassidy I will do that now, thanks!



                  • #10
                    Following up: Stata tech support says there's no issue with the error variances of binary variables if you use the jackknife or bootstrap methods.



                    • #11
                      Hi Alecia. I'll ask a novice's question, as I don't use SEMs (at least as defined by the Stata sem command). How come the ADF method is recommended for binary variables when a binary variable can only have one distribution -- Bernoulli with the chosen link function? I would've thought that ADF would be recommended for continuous variables, where the distribution need not be normal even though that's the nominal assumption. With a Bernoulli outcome we really don't have a choice. I'm sure I'm missing something.

                      Regards to Traviss!



                      • #12
                        Hi Jeff Wooldridge!

                        I guess saying that ADF is "recommended" for binary variables is a strong statement; it depends on what you are trying to do and what you see as the alternatives for handling binary variables. The sem command has four estimation approaches: maximum likelihood (ML), maximum likelihood with missing values (MLMV), quasi-maximum likelihood (QML), and ADF.

                        ML and MLMV: joint normality of all variables, latent and observed (and missing at random for MLMV).

                        QML: relaxes joint normality, but only when calculating standard errors.

                        For all three ML methods, the estimates of the loadings and constants are still asymptotically valid under the standard set of assumptions (measurement errors uncorrelated across the measures and uncorrelated with the factors, model correctly specified). But many people using SEM also care about the error variances associated with the measures, and for all three ML methods the error variances of the binary measures will be incorrect (QML does not adjust the point estimates of the error variances, only their standard errors).

                        ADF: a weighted-least-squares-based approach with its own limitations (the sample size needs to be very large; I couldn't simply use the raw auto dataset to illustrate my problem above). And you are correct that we don't explicitly model the Bernoulli outcomes as binary. But it relaxes MVN for all variables, including the latent ones, so it could at least be "better" than the options above when you observe lots of dummy variables and suspect the latent factors are non-normal.
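
                        In sem syntax, as I understand it, those four map roughly onto the calls below (I believe QML is just method(ml) with vce(robust), but check the manual):

                        Code:
                        sysuse auto, clear
                        sem (foreign <- weight mpg), method(ml)              // ML
                        sem (foreign <- weight mpg), method(mlmv)            // MLMV
                        sem (foreign <- weight mpg), method(ml) vce(robust)  // QML (robust SEs only)
                        sem (foreign <- weight mpg), method(adf)             // ADF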

                        The gsem command allows you to explicitly model more; for example, you can specify a logit link between a factor and a binary variable. gsem also relaxes the MVN assumption for the categorical latent variables as well as for all of the observed variables. (A sketch of such a specification is below.)
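
                        For readers who have not used gsem, the kind of specification I mean looks roughly like this, reusing the artificial Luxury block from #1 on the raw auto data (illustration only; the logit link is exactly the explicit modeling choice I am referring to):

                        Code:
                        sysuse auto, clear
                        * one-factor gsem: headroom and trunk are treated as gaussian,
                        * while foreign is modeled as Bernoulli with a logit link
                        gsem (headroom <- Luxury@1) (trunk <- Luxury) (foreign <- Luxury, logit)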

                        I have not used gsem, and the reason is that the point estimates of the error variances are sensitive to the link function! In my case, I use these estimates to correct for measurement error in latent factors in a second stage that is much more "econ-ey," and I wanted an approach where I didn't have to make modeling choices that would affect the measurement-error correction. ADF is that.

                        Traviss says hi!

                        Alecia



                        • #13
                          Thanks Alecia -- I think I get it. And, thanks to you, I was able to be lazy.

                          I hope you're all doing well!



                          • #14
                            Thanks! It was a good exercise to write out my thoughts on this, and maybe it will be helpful to someone else deciding between methods.



                            • #15
                              Never mind about the jackknife and bootstrap. Although tech support claimed there was no problem with them, my own tests indicate otherwise. It appears that jackknife and bootstrap (without the nomeans option) also just copy the start vector's values as the point estimates of the error variances of binary variables, but then treat those start values as if they were solutions and put standard errors around them. (A sketch of the comparison I mean is below.)
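
                              Roughly, the kind of comparison I mean, on the model from #1 (this assumes the expanded dataset from #1 is still in memory; reps() is kept small only to make the bootstrap tolerable):

                              Code:
                              *PLAIN ADF FIT (its var(e.foreign) is the copied start value, per tech support):
                              sem (turn <- Engine@1 )(gear_ratio displacement rep78 <- Engine) ///
                                  (headroom <- Luxury@1 )(foreign trunk <- Luxury) ///
                                  (mpg <-Basic@1 ) (weight length <- Basic), method(adf)
                              estimates store adf_plain

                              *BOOTSTRAP RUN (the reported point estimates come from the full-sample fit):
                              sem (turn <- Engine@1 )(gear_ratio displacement rep78 <- Engine) ///
                                  (headroom <- Luxury@1 )(foreign trunk <- Luxury) ///
                                  (mpg <-Basic@1 ) (weight length <- Basic), method(adf) ///
                                  vce(bootstrap, reps(50) seed(123))
                              estimates store adf_boot

                              *IF var(e.foreign) IS IDENTICAL IN BOTH COLUMNS, THE BOOTSTRAP IS JUST
                              *PUTTING A STANDARD ERROR AROUND THE SAME UNESTIMATED VALUE:
                              estimates table adf_plain adf_boot, se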
