  • Principal component analysis (PCA) using panel data

    I am a PhD student working with panel data, and I am confused about whether I should use PCA to construct my CEO greed measure. Greed is an independent variable when I look at its effect on firm performance, and a moderator when I look at its effect on the relationship between entrepreneurial orientation and firm performance. I am following the paper "When More Is Not Enough: Executive Greed and Its Influence on Shareholder Wealth" by Haynes et al. (2014), published in the Journal of Management. They also have panel data, and they measured CEO greed as the result of a PCA of three proxies.

    I ran PCA and got factors as well as eigenvalues for each of the three proxies. Should I multiply these factors by each original standardised variable value and sum them up to get the final CEO greed measure for the regression (a fixed-effects panel regression)? Or do I multiply the eigenvalues by the original standardised variables instead? I do not want to lose the original variable measures, so I cannot use just the PCA score in the regression.

    I discussed this with my professor, and I noticed that in one of your replies you raised the same issue my supervisor did: by using an index produced by PCA, you lose the variation that each proxy might show on its own. However, what if the correlation matrix shows that the proxies are highly multicollinear, so that I cannot put them in the final fixed-effects regression as separate variables? Do I use PCA then?

    Also, how can I perform a PCA with panel data? Do I get separate PCA values for each firm in each year, or one value to use for all firms in all years? Could you please help me? Thank you.

  • #2
    You have posed several questions. I see them as falling into two major, and rather separate issues. One is whether you can/should/must use PCA in the first place. The other is how to use PCA with panel data.

    So, first, if your assignment is to replicate the methodology in the Haynes paper, then you have to do what they did, for better or for worse. Assuming that you are not assigned to do that, then we need to consider the pros and cons in the context of your specific research question. If greed is indeed the focus of your research, as opposed to something that you need to adjust the analysis for because of possible confounding effects, then the decision about how to represent it in the model is important. It appears that you have three measures that relate to greed.

    Fact 1. If you perform a PCA and use only the first (largest eigenvalue) component in your model, then you are throwing away information. Just how much information you lose by doing this is quantified by the eigenvalues of the omitted components: the bigger they are, the more information you are losing.

    Fact 2. If you perform a PCA on three variables and then use all three principal components in your analysis, then all you are doing is organizing the representation of greed in a different way. No information is gained or lost. Nothing else in your analysis will change, and, indeed, the omnibus test of the joint significance of all three greed variables will come out the same regardless of whether you use the original variables or the three components.
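
    Facts 1 and 2 can be checked numerically. The following is only a minimal sketch in Python with NumPy on simulated data: the three "proxies", the outcome `y`, and the helper `r_squared` are hypothetical stand-ins, not the Haynes et al. variables, and the regression is plain pooled OLS rather than a fixed-effects model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: three correlated proxies driven by one common factor.
common = rng.normal(size=n)
X = np.column_stack([common + 0.5 * rng.normal(size=n) for _ in range(3)])
y = 1.0 + X @ np.array([0.4, -0.2, 0.3]) + rng.normal(size=n)

# Standardize, then do PCA as an eigen decomposition of the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Z @ eigvecs                       # all three component scores


def r_squared(design, outcome):
    """R-squared from an OLS fit of outcome on design plus an intercept."""
    A = np.column_stack([np.ones(len(outcome)), design])
    beta, *_ = np.linalg.lstsq(A, outcome, rcond=None)
    resid = outcome - A @ beta
    return 1.0 - resid @ resid / np.sum((outcome - outcome.mean()) ** 2)


# Fact 1: share of total variance carried by the omitted components.
share_omitted = eigvals[1:].sum() / eigvals.sum()

# Fact 2: original variables and all three components give the same fit.
r2_original = r_squared(Z, y)
r2_all_pcs = r_squared(scores, y)
r2_first_pc = r_squared(scores[:, :1], y)  # can only be equal or lower
```

    Because the three component scores are an invertible linear transformation of the three standardized variables, `r2_original` and `r2_all_pcs` agree up to rounding, and so does the joint F-test built from them. Keeping only the first component can never raise the fit.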

    You express concern about the high level of multi-collinearity of the original three greed variables. Multi-collinearity is, in my opinion, the most over-rated problem in all of statistics. It is of no importance at all except in one circumstance. If it is an important part of your research goals to precisely quantify the separate contribution of each of the three measures to variation in your outcome, then multi-collinearity will prevent you from doing that. But PCA does not solve that problem for you. PCA will enable you to precisely quantify the independent contribution of the three principal components--but these are each blends of the original variables, and you cannot recover any precise estimation of the original variables' effects from that. The other thing to consider is that all that multi-collinearity does is increase the standard errors of the coefficients for the variables involved in the multicollinear relationship. It does not alter the results for the other variables in the model. So the way to decide if you have a multicollinearity problem (as opposed to just having some multicollinearity) is not to "test" for it, but to just look at the output of the model of interest using those variables. If the standard errors for those variables are so large that your estimation of the variables' effects is too imprecise to be useful, then you have a problem; otherwise you don't.
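
    One way to see why only the collinear variables are affected is the textbook expression for the sampling variance of an OLS coefficient. This is a sketch under the usual homoskedastic linear-model assumptions; the symbols are introduced here for illustration and do not come from the thread. Here $\sigma^2$ is the error variance and $R_j^2$ is the R-squared from regressing regressor $j$ on all the other regressors.

```latex
\operatorname{Var}(\hat\beta_j)
  = \frac{\sigma^2}{\left(1 - R_j^2\right)\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}
```

    For the three greed proxies, $R_j^2$ is close to 1, so the factor $1/(1 - R_j^2)$ inflates their variances. A regressor that is not part of the collinear relationship has a small $R_j^2$ of its own, so its variance is essentially unchanged.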

    Remember, too, that if you do have a substantive multicollinearity problem (in the sense of the preceding paragraph), you cannot overcome it with PCA. In fact, there is no way at all to solve it with the data at hand. The only solutions are to get a much larger sample, or to do an entirely different study design that breaks the linear relationship among the variables (e.g. matched pairs, stratified sampling, or something like that.)

    So the bottom line on multicollinearity is: it is rarely a problem. It is easy to tell whether it is a problem or not. And if it is a problem, there is nothing you can do about it with your existing data.

    So, multicollinearity is definitely not a reason to use PCA. There can be circumstances where using PCA is helpful. If you have a large number of measures, too many to include in your model, you can reduce that by doing PCA and then using the first, or first few, components instead (bearing in mind that, as noted earlier, this discards information). But you do not appear to be in that situation. If your data set is so small that using three independent variables instead of one is stretching its limits, then you really don't have a usable data set anyway. I'm guessing your data set is sufficiently large that it can handle the three variables. If it isn't, you need more data, not PCA.

    Now, one might also have concerns about the extent to which the three variables you have are good representations of greed, and to what extent they are influenced by other unrelated constructs and by measurement error. Opinions about handling this differ, but if this is the dilemma you face, my approach would not be to use PCA but rather to go to structural equations modeling including a latent variable indicated by the three measures. This takes us in a wholly different direction, with lots of issues about how to do it, and I won't go down this path (no pun intended) here.

    So, you have not stated a case that persuades me that PCA would be helpful for you.

    Let me comment briefly on PCA with panel data. If you can safely assume that the covariance among the three items is constant over time, then the loadings, coefficients, and eigenvalues you get from applying PCA to the panel data will be fine. What you cannot rely on, however, would be standard errors of any of these quantities, nor standard errors of estimated component scores. This is because in panel data, the assumption of independent observations is violated. I am not aware of any software that does PCA and adjusts for nesting of observations within panels. If the covariance structure of the three items does vary over time, then a PCA done on the entire data set will produce results that are difficult if not impossible to interpret and use. And there is an additional consideration. Remember that PCA does not involve the outcome variable in your model, only the three greed proxy measures. If the relationships between each of the three measures and the outcome vary differently over time, using PCA will completely obscure that. By contrast, using the original three measures, you could determine whether their relationships to the outcome evolve over time, and whether they evolve the same way or differently over time, by including interaction terms.
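
    To make the mechanics concrete, here is a minimal sketch in Python with NumPy of a pooled PCA on simulated firm-year data. Everything in it is an assumption for illustration: the panel dimensions, the variable names (`firm`, `year`, `proxies`, `greed_score`), and the simulated proxies are hypothetical. It applies one set of component weights (the eigenvector entries) to the standardized proxies, which yields one score per firm-year, and it does nothing to correct standard errors for nesting within firms.

```python
import numpy as np

rng = np.random.default_rng(1)
n_firms, n_years = 50, 6

# Hypothetical balanced panel: one row per firm-year.
firm = np.repeat(np.arange(n_firms), n_years)
year = np.tile(np.arange(2010, 2010 + n_years), n_firms)
latent = rng.normal(size=n_firms * n_years)
proxies = np.column_stack(
    [latent + 0.6 * rng.normal(size=latent.size) for _ in range(3)]
)

# Pooled PCA: standardize over all firm-years, decompose the correlation matrix.
Z = (proxies - proxies.mean(axis=0)) / proxies.std(axis=0, ddof=1)
pooled_corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(pooled_corr)

# Weights of the first component (eigenvector of the largest eigenvalue).
weights = eigvecs[:, np.argmax(eigvals)]
if weights.sum() < 0:          # the sign of an eigenvector is arbitrary
    weights = -weights

# Score = weighted sum of the standardized proxies: one value per firm-year.
# The eigenvalue is not a multiplier; it equals the variance of this score.
greed_score = Z @ weights

# Informal look at whether the correlation structure is stable over time:
# largest absolute gap between any single-year correlation matrix and the pooled one.
max_gap = max(
    np.abs(np.corrcoef(proxies[year == t], rowvar=False) - pooled_corr).max()
    for t in np.unique(year)
)
```

    The weights are the same for all firms and years, while `greed_score` varies by firm-year. A large `max_gap` would be a warning sign that the constant-covariance assumption is doubtful, although this comparison is only descriptive and not a formal test.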

    So, I've gone on for a very long time here. The really short version is that I see no benefit, and some potential difficulties, from using PCA in the circumstances you have described.



    • #3
      Clyde Schechter I've also not heard of any PCA applications in a panel setting, but I do know there is a growing body of work on longitudinal structural equation models. If nazha gali considers "greed" to be an underlying latent variable causing the responses observed in the variables of interest, it would be a pretty good candidate for this type of approach. Jonathan Templin has posted fairly decent slides that explain the process of testing for measurement invariance across groups; in the application to longitudinal models, the groups would be defined by points in time. van der Schoot, P, & Joop Hox, P. L. also have some useful information about testing for invariance.
