  • Is PCA appropriate with binary data?

    I have been working on a multilevel logistic regression with stunting as the dependent variable, individual predictors at level 1, health clinic predictors at level 2, and district at level 3. After accounting for the level 1 predictors, a number of level 2 variables were significant when added to the model individually. However, when they are all entered together in a full model, none of them remains significant.
    This led my school statistician and me to consider whether PCA might be a useful way to summarize the level 2 variables, with the components then entered as predictors in the regression model.

    My level 2 data are all binary, 0 (No) or 1 (Yes), and I've been a little unsure whether it is methodologically sound to use PCA with such data. I have given it a go, and my code and output are below. I retained only the first 5 components, as these all have eigenvalues >1.

    First question - should I be doing a PCA with binary data?
    Second question - what is the lowest cut-off for eigenvectors to meaningfully interpret my components? I've read that it should be 0.4 but nothing in my 1st component exceeds that.

    So, assuming that PCA is appropriate for the data I have, would the results I'm getting even make a meaningful contribution to my analysis?

    Many thanks!!

    Code:
    pca form_b1 form_b2 form_b3 form_c1 form_c2 form_c3 form_d1 form_d2 form_d3 form_e1 form_e3 form_f1 form_f2 form_f3, comp(5)
    Code:
    Principal components/correlation                 Number of obs    =      1,135
                                                     Number of comp.  =          5
                                                     Trace            =         14
        Rotation: (unrotated = principal)            Rho              =     0.6257
    
        --------------------------------------------------------------------------
           Component |   Eigenvalue   Difference         Proportion   Cumulative
        -------------+------------------------------------------------------------
               Comp1 |      3.50087      2.01768             0.2501       0.2501
               Comp2 |      1.48319      .084154             0.1059       0.3560
               Comp3 |      1.39904      .156531             0.0999       0.4559
               Comp4 |      1.24251      .107735             0.0888       0.5447
               Comp5 |      1.13477      .347163             0.0811       0.6257
               Comp6 |      .787612     .0950305             0.0563       0.6820
               Comp7 |      .692581    .00607674             0.0495       0.7315
               Comp8 |      .686504     .0411919             0.0490       0.7805
               Comp9 |      .645312     .0614412             0.0461       0.8266
              Comp10 |      .583871     .0308468             0.0417       0.8683
              Comp11 |      .553024     .0660023             0.0395       0.9078
              Comp12 |      .487022     .0411471             0.0348       0.9426
              Comp13 |      .445875     .0880659             0.0318       0.9744
              Comp14 |      .357809            .             0.0256       1.0000
        --------------------------------------------------------------------------
    
    Principal components (eigenvectors) 
    
        ------------------------------------------------------------------------------
            Variable |    Comp1     Comp2     Comp3     Comp4     Comp5 | Unexplained 
        -------------+--------------------------------------------------+-------------
             form_b1 |   0.2131    0.2332   -0.0044   -0.6302    0.0905 |       .2576 
             form_b2 |   0.0620    0.3261   -0.1045    0.1312    0.7055 |       .2274 
             form_b3 |   0.2687   -0.1544   -0.2096   -0.0530    0.3434 |       .5132 
             form_c1 |   0.3268    0.2693   -0.2007    0.1626   -0.1420 |       .4064 
             form_c2 |   0.3726    0.1053    0.0576    0.0959   -0.1454 |       .4574 
             form_c3 |   0.3104   -0.1792    0.0396   -0.1310   -0.3348 |       .4645 
             form_d1 |   0.1625    0.3111    0.3685    0.2940   -0.2593 |       .3904 
             form_d2 |   0.2091    0.1813    0.4271    0.3283    0.1697 |       .3762 
             form_d3 |   0.1524   -0.2816   -0.4357    0.4589   -0.0123 |       .2737 
             form_e1 |   0.2759   -0.4481    0.1188    0.1738    0.2267 |       .3201 
             form_e3 |   0.1924   -0.2250    0.5397   -0.1565    0.2336 |       .2955 
             form_f1 |   0.3271    0.3670   -0.1767    0.0206   -0.0982 |       .3706 
             form_f2 |   0.3728   -0.0133   -0.2265   -0.2494    0.0309 |        .363 
             form_f3 |   0.2901   -0.3232    0.0632   -0.0887   -0.1005 |       .5236 
        ------------------------------------------------------------------------------
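
    Two hedged follow-ups to the questions above, as a sketch rather than a recommendation. First, the table above reports eigenvectors scaled to unit length, whereas rules of thumb such as 0.4 are often quoted for loadings on the correlation scale; estat loadings with cnorm(eigen) rescales each column so that its sum of squares equals the eigenvalue. Second, one commonly suggested alternative for binary items is to run the PCA on tetrachoric rather than Pearson correlations. The n(1135) value is copied from the output above and assumes no further observations are lost.

    ```stata
    * Sketch only; the item list is the one used in the pca call above.
    local items form_b1 form_b2 form_b3 form_c1 form_c2 form_c3 ///
        form_d1 form_d2 form_d3 form_e1 form_e3 form_f1 form_f2 form_f3

    * (1) Redisplay the loadings rescaled by the square root of each eigenvalue.
    pca `items', comp(5)
    estat loadings, cnorm(eigen)

    * (2) PCA on the tetrachoric correlation matrix.
    *     posdef adjusts the matrix if it is not positive semidefinite.
    tetrachoric `items', posdef
    matrix R = r(Rho)
    pcamat R, n(1135) components(5)
    ```

    Whether either version yields components that are interpretable enough to use as level 2 predictors is a separate judgment call.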

  • #2
    Seems like a good example for lasso (Stata 16).
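
    As a starting point, a minimal sketch of a lasso for a binary outcome in Stata 16 might look like the following. The names stunted, child_age and child_sex are hypothetical stand-ins for the outcome and the level 1 predictors, which are not named in #1. Variables in parentheses are always kept, and the lasso selects among the remaining form_* items. The sketch treats observations as independent, so it does not reflect the clinic and district levels.

    ```stata
    * Hypothetical names: stunted (0/1 outcome), child_age child_sex (level 1 predictors).
    * rseed() only makes the cross-validation folds reproducible.
    lasso logit stunted (child_age child_sex) ///
        form_b1 form_b2 form_b3 form_c1 form_c2 form_c3 ///
        form_d1 form_d2 form_d3 form_e1 form_e3 form_f1 form_f2 form_f3, ///
        selection(cv) rseed(12345)

    * List which variables the lasso retained.
    lassocoef
    ```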


    • #3
      Thanks Nick, will look into it!


      • #4
        Just been looking into lasso a bit further. When you suggest it as an option, which functionality of lasso do you think is most appropriate? For prediction, model selection or inference? Thanks again!


        • #5
          What most impressed me about #1 was the need to select predictors. Sorry, but I don't have deeper comments. I have Stata 16 but haven't even tried lasso yet.
