  • Principal component analysis in panel data setting

    Hello to everyone,

    I have a panel of 190 industries over the 2000-2018 period. My dataset contains 4 variables (x1-x4) that are correlated and convey similar information. I would like to run a principal component analysis and extract one variable that captures the common variation in the 4 variables. I type the following:

    Code:
    bysort industry: pca x1 x2 x3 x4
    The principal component analysis is run separately for each industry (which takes some time, as I have 190). Then I try to predict a single component, since on average it seems to explain the variation in x1-x4. I type the following:

    Code:
    bysort industry: predict p1, score
    Of course, I get the message

    Code:
    predict may not be combined with by
    r(190);
    I read in a previous thread that principal component analysis "pays no attention to panel structure":

    https://www.statalist.org/forums/for...-in-panel-data

    Should I give up on PCA in a panel data setting? One option is to split my dataset by industry and run PCA 190 times, which is nonsense.


    Any suggestions?

  • #2
    Let’s turn this around: Why are you applying PCA separately? Why are you applying it at all? If you can get the logic straight there you should be able to answer your own question about what to do.

    In Stata terms the only way I know to get separate PCAs for each panel is to loop over panels, including repetition of predict. I can’t recall a problem where that was what I wanted to do, but that may reflect limited imagination and experience on my part.

    With four predictors it really isn’t a problem that they are correlated. If one or two or three add little or nothing to what the most successful predictor tells you, then that will be evident in the modeling results.

    PCA doesn’t usually make any problem like yours easier to think about. At most your predictors are slightly different versions of each other, in which case the question is which to use.


    • #3
      Thank you for the response and the useful questions.

      I do not want to have x1-x4 in my model as they are very similar. I need a variable that extracts the information from all four. Something like a simple average, but possibly more sophisticated. Hence I proceed with factor analysis/principal component analysis.
      My logic is: the industries are very different (different levels of technological intensity, different production processes, and different policies affecting x1-x4). In my understanding, I should treat the industries separately when running PCA, as I do not want to mix the variability across x1-x4 with the industry-level variability (two different heterogeneities).

      But maybe pca actually accounts for all these heterogeneities when making predictions? If this is the case, my problem is solved!


      • #4
        I think you need advice from people in your field, which would depend on your telling them the definitions of the four variables in question. Why not use your own economic judgment or the results of preliminary analyses? PCA isn’t more than blending machinery.


        • #5
          Mina:
          if you go with panel data regression and your predictors are actually highly correlated, Stata will omit by default the ones that create extreme multicollinearity.
          If a quasi-extreme multicollinearity issue creeps up, it is usually mirrored in "weird" standard errors.
          I would also add a categorical predictor that represents the industry heterogeneity that you mention in your post.
          Kind regards,
          Carlo
          (Stata 18.0 SE)
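
          A rough sketch of this suggestion on simulated data, since the thread does not show the actual dataset. The outcome y, the variable year, the seed, and the whole data-generating process are invented for illustration; only industry and x1-x4 come from the original question. The fixed-effects estimator absorbs industry-level heterogeneity and gives the same slope estimates as adding industry as a categorical predictor.

          ```stata
          clear
          set seed 12345
          * hypothetical panel: 190 industries observed over 2000-2018
          set obs 190
          generate industry = _n
          generate u = rnormal()              // industry-level heterogeneity
          expand 19
          bysort industry: generate year = 1999 + _n
          * four correlated predictors built from one common factor z
          generate z = rnormal()
          forvalues j = 1/4 {
              generate x`j' = z + 0.3*rnormal()
          }
          generate y = 1 + 0.5*z + u + rnormal()
          * industry fixed effects; inspect the standard errors on x1-x4
          xtset industry year
          xtreg y x1 x2 x3 x4, fe
          ```

          If the predictors were too collinear to separate, that would be expected to show up here as omitted variables or inflated standard errors rather than as an error.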


          • #6
            Pedantic note for anyone who needs it: PCA means principal component analysis, so “PCA analysis” is unnecessarily repetitive.


            • #7

              Carlo, thanks for the advice. The variables are not dropped when added to the model. Still, I do not want to have 4 similar variables in my model, and taking the average of the 4 variables sounds too simplistic. I had the idea to use just one variable as a good proxy for all four. However, PCA is a superior approach.


              Nick, thank you. I know in advance that I want PCA applied, also based on the empirical literature. The crux of the problem is that PCA needs to be applied to each industry separately. Can anyone help with applying PCA in a loop over all panels (industries)? I have four variables that are the subject of the PCA. My dataset has 190 industries over the 1990-2018 period.

              I would also be grateful for any link that helps with the "loop" function, as it is a very useful one.


              • #8
                Sorry, but I won’t offer code for an approach that strikes me as misguided.


                • #9
                  I agree with the advice given in posts 3-5. I find it hard to believe that x1-x4 represent a common underlying concept which I will call Z, but that the relation of Z to x1-x4 differs in important ways from panel to panel. It further appears that you plan on creating a single variable containing the Z-hat prediction from PCA across all the panels, making that variable conceptually difficult to explain. It is implicitly an interaction between panel and the underlying Z, but you don't seem to plan on taking account of that interaction. Or do you plan on interacting panel and Z-hat in your model, so that each panel will have a different estimated coefficient on Z-hat?

                  To my way of thinking, you are replacing four parameters in the basic model (on x1-x4) with 4*190 - the loadings on x1-x4 estimated for each of 190 panels - and perhaps another 190 coefficients, if panel is indeed interacted with Z-hat. And much of this estimation is hidden out of sight of the model you are ultimately running, on 19*190 observations, and the uncertainty in the 19*190 Z-hat values is not reflected in the estimates of the accuracy of your model. You are fitting many parameters in this model that are not accounted for by any reduction in the degrees of freedom, and thus your model will present itself as more certain than it actually is.

                  With that said, below is an example of the sort easily found on Statalist showing how to loop over panels to fit a model separately for each panel and create predicted values for each panel from its model. This uses regress but the principle is the same for other estimation commands.
                  Code:
                  sysuse auto, clear
                  drop if missing(rep78)
                  generate prd_price = .
                  levelsof foreign, local(type)
                  foreach f of local type {
                      regress price weight if foreign==`f'
                      predict temp if foreign==`f'
                      replace prd_price = temp if foreign==`f'
                      drop temp
                  }
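
                  A minimal sketch of the same pattern with pca in place of regress, under stated assumptions: it uses the auto data, with foreign standing in for industry and price, mpg, weight, and length standing in for x1-x4, and the variable name pc1 is made up for illustration. Only the first component is kept. The sign of a principal component is arbitrary, so scores from separately fitted PCAs need not be oriented the same way across groups.

                  ```stata
                  sysuse auto, clear
                  generate double pc1 = .
                  levelsof foreign, local(groups)
                  foreach g of local groups {
                      * fit a separate PCA within this group, keeping one component
                      quietly pca price mpg weight length if foreign == `g', components(1)
                      * score the first component for this group's observations only
                      predict double temp if foreign == `g', score
                      replace pc1 = temp if foreign == `g'
                      drop temp
                  }
                  ```

                  After the loop, pc1 holds each observation's score from its own group's PCA, which is exactly the kind of pooled variable whose interpretation is questioned above.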


                  • #10
                    William, thanks for the extensive and indeed useful explanation. The variables differ and convey different, but similar, sorts of information.

                    Thank you for providing the example of the code. Best wishes!
