
  • Performing factor analysis with panel data

    Hello everyone,

    I have panel data on six variables concerning governance quality for 59 countries over the years 2005 to 2015. All six variables are measured on the same scale (from -2.5 to +2.5), and I would like to aggregate them into a single index. Given the nature of these variables, I chose factor analysis over PCA. My question is whether it is correct to perform factor analysis on panel data. This topic has already been discussed in several posts (here and here). Although other methods were suggested (e.g. sem and gsem), nothing was said about the legitimacy of performing, or not performing, factor analysis with panel data.
    I also attach a small sample of the data.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str18 cou int year double(indg_accountability indg_control_corruption indg_govt_effect indg_political_stability indg_regulatory_quality indg_rule_law)
    "ARG" 2005 .2692317 -.3881924 -.1242145 -.0417681 -.5483672 -.5546548
    "ARG" 2006 .4034365 -.3412537 -.0455042  .0030557 -.6406317  -.568944
    "ARG" 2007 .4486213 -.3402916 -.0159589  .0988585 -.6682156 -.5916041
    "ARG" 2008 .3588959 -.4355765 -.1468119 -.0852222 -.7375205 -.6766081
    "ARG" 2009 .2800475 -.4449961 -.3182648 -.2324056 -.8450356 -.6754778
    "ARG" 2010 .3616894 -.3614692   -.16279 -.0847979 -.7623686  -.590509
    "ARG" 2011 .3449418 -.3661579 -.1200561  .1589937 -.7222099 -.5608934
    "ARG" 2012 .2952099 -.4431399 -.2385752  .1030405 -.9292172 -.6796814
    "ARG" 2013 .2769879  -.432282 -.2775489  .0653143  -.957261 -.7076761
    "ARG" 2014 .3453205 -.5416201 -.1591353 -.0051219 -1.074257 -.8860345
    "ARG" 2015 .4117875 -.5470577 -.0750081  .0147854 -.9114419  -.770812
    "AUS" 2005 1.507056  1.952358  1.751213  .8935112  1.600643  1.724451
    "AUS" 2006 1.382795  1.960568  1.711956  .9351878  1.623903   1.77004
    "AUS" 2007   1.3692  2.010918  1.825559  .9287898  1.683095  1.761237
    "AUS" 2008 1.368076  2.042482   1.79397  .9556448  1.765919  1.770851
    "AUS" 2009 1.383792  2.051661  1.705787   .855689  1.819984  1.740398
    "AUS" 2010 1.419766  2.031455  1.768756  .8888599  1.698415  1.764966
    "AUS" 2011 1.453721  2.044637   1.69595  .9357101  1.858609   1.74258
    "AUS" 2012  1.49919  1.985774   1.62144  .9979972  1.786468  1.766946
    "AUS" 2013 1.436414  1.785322  1.639869  1.031073  1.800868  1.778639
    "AUS" 2014 1.361716  1.853449  1.607115  1.032192  1.863708  1.923105
    "AUS" 2015 1.355591  1.882113  1.564534  .8849798  1.788684  1.825212
    "AUT" 2005 1.378657   1.92206  1.684595  1.105386  1.606079  1.859191
    "AUT" 2006 1.372685  1.914531  1.831036  1.075933  1.644535  1.913557
    "AUT" 2007 1.369262  2.013397  1.870187  1.283752  1.690031  1.960128
    "AUT" 2008 1.358464  1.843035   1.78084  1.339206  1.606344  1.922995
    "AUT" 2009 1.392846  1.703025   1.66658  1.190602  1.452587   1.78489
    "AUT" 2010 1.430291  1.585462  1.841763  1.152648  1.452815    1.8003
    "AUT" 2011 1.402444  1.431896  1.617761  1.193522  1.382634  1.801555
    "AUT" 2012 1.448773  1.389731  1.575873  1.340567  1.524189  1.858179
    end
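
    For concreteness, here is a minimal sketch of the kind of command I have in mind (a pooled one-factor analysis of the six indicators above; gov_index is just an illustrative name for the resulting score, and the panel structure is ignored here):

    Code:
    factor indg_accountability indg_control_corruption indg_govt_effect ///
           indg_political_stability indg_regulatory_quality indg_rule_law, factors(1)
    predict gov_index    // factor score to be used as the single governance index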

    Thank you to anyone who is willing to shed light on this topic.
    Best



    Last edited by alessio lombini; 14 May 2021, 02:17.

  • #2
    Correct or incorrect are not words I would want to use here, but as the question implies there are drastically varying views on whether this is a good idea, and here is one view. My comments don't distinguish between PCA and factor analysis (FA).

    Context here would help. Are these variables responses or outcomes or to be used as predictors in some wider analysis?

    Some specific points:

    1. As you're interested in aggregation, the simplest composite is the mean across the quality measures. If your reaction is that that is too crude, then the interesting question is whether PCA or FA will attend to the nuances you care about (see the sketch after this list).

    2. FA or PCA works best if there are clusters of variables that are very highly correlated, in which case you can also spot that from the correlation structure and just choose some of the variables.

    3. With quality measures on the same scale, how far variables agree (have the same values) is not the same question as how well variables are correlated.

    4. Are these quality measures measures in any strong sense or just ordinal?

    5. You can interpret FA or PCA results only by referring back to the original variables, so why go from A to B via Z?
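
    To make point 1 concrete, here is a minimal sketch, assuming the indg_* variables from the data example in #1 (the composite names gov_mean and gov_f1 are illustrative):

    Code:
    egen gov_mean = rowmean(indg_*)    // simplest composite: the row mean
    factor indg_*, factors(1)          // one-factor solution
    predict gov_f1                     // score on the first factor
    correlate gov_mean gov_f1          // how different are the two composites?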

    Comment


    • #3
      Dear Nick,

      thank you very much for your detailed and quick answer. I will try to provide insights into some of the points you raised:

      - I would use them as predictors in a Poisson pseudo-maximum likelihood (PPML) gravity model. In case their construction is relevant: these six variables are composite governance indicators based on over 30 data sources, which are rescaled and combined using an unobserved components model.

      2. These variables are highly correlated with each other:

      Code:
      . corr $xlist
      (obs=649)
      
                   | indg_a~y indg_c~n indg_g~t indg_p~y indg_r~y indg_r~w
      -------------+------------------------------------------------------
      indg_accou~y |   1.0000
      indg_contr~n |   0.7562   1.0000
      indg_govt_~t |   0.7101   0.9515   1.0000
      indg_polit~y |   0.5967   0.7338   0.7175   1.0000
      indg_regul~y |   0.7648   0.9210   0.9204   0.7295   1.0000
      indg_rule_~w |   0.7886   0.9643   0.9577   0.7622   0.9412   1.0000

      4. For all these variables, a more positive value means higher quality (of the rule of law, of the control of corruption, and so on) and a more negative value means lower quality. Given the way these six variables are constructed, if I am not mistaken, I assume they are not ordinal.

      5. This is true; however, in my case (I want to test the impact of non-tariff measures on global value chains), I think it may be redundant to include all six variables separately, since they are highly correlated and measure aspects that are strongly linked to each other.

      Have these comments helped to give you a clearer context?
      Thank you again and best regards
      Last edited by alessio lombini; 14 May 2021, 03:50.

      Comment


      • #4
        Thanks for that. Again, styles differ. With 6 predictors people go all the way from "keep the model simple; leave out any predictor that appears not to be helping" to "include all the predictors you think are relevant, because it's part of the picture that many predictors have small effects, and we shouldn't make a fetish of (e.g.) P < 0.05".

        I come from a geography or environmental science background, which in turn bears the imprint of a physics or engineering stress on the ideal of simple models that can be summarized in one-line equations and on one diagram. In contrast, people from a social science or medical background regard it as standard that there is always a very long list of predictors in good datasets and that the key is to let them all speak, even if the result shows a small effect one way or the other. In my fields the peak of interest in factor analysis was around 1968, since when it has faded into obscurity!

        Again, correlation measures linearity, not agreement. The difference is starkest when y = bx exactly and b is far from 1 but positive. Then correlation is identically 1 but agreement minimal. With values on the same support, the scope for big differences between linearity and agreement is restricted, but (e.g.) concordance correlations are in practice nearer zero than Pearson correlation.
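
        As a hedged illustration with two of the indicators: concord is a community-contributed command (Steichen and Cox) for Lin's concordance correlation, so you would need to find and install it first, e.g. via search concord.

        Code:
        correlate indg_control_corruption indg_rule_law    // Pearson correlation: linearity
        concord indg_control_corruption indg_rule_law      // Lin's rho_c: agreement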

        I'm just summarizing the case for "don't do this" and am optimistic that good arguments on the other side may be voiced too.

        Comment


        • #5
          Thank you very much Nick for all this useful information. I guess that a well-explained justification will be needed for using this method rather than including all the variables separately.

          Also, as a last suggestion for me and for others who are willing to use this method: is there a need to adapt the factor analysis to account for panel data? In previous posts, I noticed several different points of view, for instance computing the factor analysis separately for each year (here) or employing sem or gsem instead, and, in one case, a user was even discouraged from doing FA with panel data at all. Therefore, it is not very clear to me whether or not this technique can also be extended to a dataset that takes time into account (such as panel data).

          Thank you in advance for any comments you may provide on this last topic.

          Comment


          • #6
            I deliberately avoided that part of the question, as I know next to nothing specifically about variants of FA for panel data or about SEMs. A separate analysis for each year might be anything from a major mess to a magnificent success, e.g. if coefficients change in intelligible and interesting ways.

            The context here of gravity models is one in which Joao Santos Silva has major expertise, so you might hope to attract his attention.

            Comment


            • #7
              Entering the terrain of factor variables and latent variables in panel data leads to the question of measurement invariance.

              I would use factor analysis (start with or end with confirmatory factor analysis) and test for measurement invariance across measurement occasions. I would not use PCA.
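
              For example, a minimal sketch with Stata's sem, assuming a single governance factor behind the six indicators and treating country-years as independent (so this ignores the panel dependence and should be read as a rough check, not a full longitudinal CFA):

              Code:
              sem (Gov -> indg_accountability indg_control_corruption indg_govt_effect ///
                          indg_political_stability indg_regulatory_quality indg_rule_law), ///
                  group(year) ginvariant(mcoef)    // loadings constrained equal across years
              estat ginvariant                     // tests of the invariance constraints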

              Comment


              • #8
                Thanks for the nod, Nick.

                I am not keen on PCA or FA, so I would try something else. Without knowing more about the variables, it is difficult to give concrete advice, but things like including all the variables, just their mean, or just variables 1, 2, and 4 are potentially interesting solutions.

                Comment


                • #9
                  Normally I would test my model.

                  But I believe the data that the OP uses may do fine without factor analysis; it seems they do not build on an assumption of a latent variable.

                  Some measurements do not need a test (like counting alcohol units), but if we assume there is an unobserved variable causing answers to several items in a questionnaire, we should test that assumption. Confirmatory factor analysis is one method for that, which also allows for tests of measurement invariance across time (or across groups).

                  A test of measurement invariance is crucial if we want to justify our assumption that we are measuring the same variable across measurement occasions (again assuming that simple counting is not justified, as it would be with alcohol units).
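
                  As a sketch of such a test, assuming one factor behind the six indicators and, for simplicity, ignoring the within-country dependence over time ($xlist is the global from #3):

                  Code:
                  sem (Gov -> $xlist), group(year) ginvariant(none)     // configural: loadings free by year
                  estimates store configural
                  sem (Gov -> $xlist), group(year) ginvariant(mcoef)    // metric: loadings equal across years
                  estimates store metric
                  lrtest configural metric                              // likelihood-ratio test of metric invariance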

                  The reason why I became engaged in this now: I think I sometimes see suggestions on Statalist to adapt data to the model (like transposing data when the model might be wrong). I would recommend using models that fit the data and, for instance, not dropping items too quickly unless we have good reason to do so. But again, this might not be a problem in this case.

                  I think "I am not keen on PC or FA" is not good advice, unless it is restricted to specific research questions or research areas.
                  Last edited by Christopher Bratt; 14 May 2021, 07:16.

                  Comment


                  • #10
                    Just a quick add-on to Christopher Bratt's excellent points. If you take the mean of the items, that is an implicit factor model in which each of the variables contributes equally to the latent factor. In other words, this common data reduction technique, exemplified by
                    Code:
                    egen avg = rowmean(var1 var2 var3)
                    actually implies a latent variable model with very restrictive assumptions. Better in my opinion is to test it against a factor model that allows the loadings to vary freely across items (the default behavior in EFA or CFA).
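
                    A hedged sketch of that comparison with sem, using the six indg_* indicators from #1 (pooled, ignoring the panel structure for simplicity):

                    Code:
                    * equal (unit) loadings: the implicit model behind a unit-weighted mean
                    sem (Gov -> indg_accountability@1 indg_control_corruption@1 indg_govt_effect@1 ///
                                indg_political_stability@1 indg_regulatory_quality@1 indg_rule_law@1)
                    estimates store equal
                    * loadings free to vary (first loading fixed at 1 by default, for identification)
                    sem (Gov -> indg_accountability indg_control_corruption indg_govt_effect ///
                                indg_political_stability indg_regulatory_quality indg_rule_law)
                    estimates store free
                    lrtest equal free    // does freeing the loadings improve fit?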
                    Last edited by Erik Ruzek; 14 May 2021, 07:34. Reason: Changed egen syntax to rowmean

                    Comment


                    • #11
                      "I am not keen on PCA or FA" (@Joao Santos Silva). Me neither. I like to think that my view is based on experience -- notably trying techniques on data and deciding that there were simpler ways to do what was needed and also reading many papers which didn't seem to benefit at all from the techniques they used. But it can be hard to express such a view without seeming opinionated or dogmatic.

                      None of us is writing a paper for publication and fighting sceptical or antagonistic reviewers when we post here.

                      Specifically, Christopher Bratt in #9 wrote about suggestions

                      to adapt data to the model (like transposing data when the model might be wrong). I would recommend using models that fit the data
                      where, first, I take it that transposing is a typo for transforming. This is wrenched out of context, but I think the antithesis is false. We should all be in favour of models that fit the data (other criteria being put on one side for the moment, although good fit has to be matched against simplicity, scope, and much else).

                      I consider that, e.g., Y = exp(Xb) is just a different model from, say, Y = Xb; the data are what they were and are just being modelled with a different functional form. That applies regardless of whether we transform data explicitly or fit using what in generalized linear model jargon is called a link function.
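
                      As a hedged sketch with hypothetical variables y and x, the same response can be modelled through a log link or through an explicit log transformation:

                      Code:
                      glm y x, family(gaussian) link(log)    // models E(y) = exp(xb); y left untransformed
                      generate lny = ln(y)
                      regress lny x                          // models E(ln y) = xb on the transformed scale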


                      Comment


                      • #12
                        Nick, I disagree in some contexts, not in others. I would have no problem transforming data on the spread of a virus, possibly also income (it depends). But generally forcing variables to resemble normal distributions or whatever to achieve nice residual distributions is not what I would prefer. Sometimes (or often) the very skewed data reflect something important. Heterogeneous groups in the data are just one example.

                        I think these two discussions are linked: Do we really test our model against the data, or do we make the data fit the model?

                        Example: You observe pedestrians at a crossing, apparently waiting for green light (or "Walk"). You count how many cross the street for each second over... let's say one minute. Since the light is red (or "Wait"), you may see a few crossing the street now and then. Then, suddenly more, and then all the rest when they have realised that the light has turned green ("Walk").

                        That distribution tells us something. In this case, we know what; in many other cases we don't know why we have an odd distribution. I would usually keep the distribution and avoid recoding. There are plenty of methods out there to handle such data. (Of course, my example is a count variable, but I assume I managed to make my point clear nevertheless.) And if I drop outliers, I would also run the analysis with them to check how much they affect the results.
                        Last edited by Christopher Bratt; 14 May 2021, 08:40.

                        Comment


                        • #13
                          generally forcing variables to resemble normal distributions or whatever to achieve nice residual distributions is not what I would prefer
                          If a variable is (say) closer to lognormal than normal, then nobody's forcing the data. They are that shape already and it's just a case of how you view or work with them, as a lognormal or as a normal on a transformed scale.

                          You might as well criticise map projections for mapping (literally) large fractions of the earth onto flat paper or computer monitors as denying the sphericity (spheroidicity) of the planet. The map projection is justified by convenience and also by the fact that we can learn how it works, and we use the projection that helps most. (And globes are fascinating and often helpful too.)

                          I have to guess you have been bitten by authors who used transformations as a species of trickery. The underlying problem as I think Tukey says somewhere is that transformations are one of the worst explained (or unexplained) areas of statistics.

                          Comment


                          • #14
                            You might as well criticise map projections for mapping (literally) large fractions of the earth onto flat paper
                            I do! The size of Greenland is 2.166 million km²; the size of Australia is 7.692 million km². Australia is more than three times the size of Greenland.

                            Now, take a look at a map of 'the world'... I prefer a globe.

                            I have to guess you have been bitten by authors who used transformations as a species of trickery. The underlying problem as I think Tukey says somewhere is that transformations are one of the worst explained (or unexplained) areas of statistics.
                            Maybe you've got a point. I still prefer the original data, though.

                            Comment


                            • #15
                              Erik Ruzek Sorry for being difficult. But...

                              egen avg = rowmean(var1 var2 var3)
                              would not really let us test the model with CFA. The model would be as complex as the data: with three indicators there are six observed variances and covariances and six free parameters, so df = 0. We would need to use four or more indicators, fix parameters prior to the analysis, or, for instance, impose invariant parameters in longitudinal data to get df > 0.

                              Comment
