
  • Can I use PCA to create an index and use that as one of my independent variables?

    Dear Statalist Members,

    It will be much appreciated if you could give me some advice on my idea.

    I am trying to run an OLS regression with price as my dependent variable. My initial results show that some of my main independent variables (three different variables that capture the seller's effort) have only a marginal effect in explaining the auction price. However, I have reason to believe that these variables matter for the seller's overall effort, but are significant only when considered as a combined effect. I have therefore created an index for effort using PCA. The first two factors explained 79% of the total variance.

    My question is: can I simply predict these two factors, name them Effort1 and Effort2, include them in my original OLS regression, and interpret their coefficients as the effect of effort on price?

    Is there anything I should beware of when interpreting the result? Has anyone read any research that uses a similar method?

    Thank you so much. I look forward to any ideas!

    Kind Regards,
    Stefanie

  • #2
    There are plenty of people who do exactly what you are doing. For pure predictive purposes, it's fine.

    However, for *explanatory* purposes, iffy. You have two arbitrary constructs, forced to be orthogonal, which are difficult to interpret. What do the factors *mean*?

    An alternative with a cleaner interpretation is an EFA with rotated factors, so you can see how each item loads on each factor, and another is CFA, where you have even more confidence in your interpretation. However, with three items and two constructs, a CFA would be under-identified unless you had other variables in the model to help with identification.

    Two constructs from three items feels a bit iffy to me, even apart from identification problems -- a third option would be to simply combine the three items into a single effort scale.
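The third option can be sketched in a few lines. This is only an illustration in Python, not Stata output: the three item lists are made-up numbers, the names `item1` to `item3` and `effort_scale` are hypothetical, and it assumes all three items are oriented so that higher values mean more effort.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize a list to mean 0 and sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def effort_scale(*items):
    """Average the z-scores of the items, observation by observation."""
    standardized = [zscores(item) for item in items]
    return [mean(obs) for obs in zip(*standardized)]

# Hypothetical data: three effort measures for five sellers.
item1 = [1, 3, 2, 5, 4]
item2 = [50, 120, 80, 200, 150]
item3 = [0.2, 0.5, 0.1, 0.9, 0.6]

# One effort score per seller, centred on zero by construction.
effort = effort_scale(item1, item2, item3)
print([round(e, 2) for e in effort])
```

Standardizing first keeps an item measured in large units from dominating the scale; whether equal weights are defensible is a substantive judgment, not something the code decides.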



    • #3
      A short answer is that you don't tell us enough about your data and your precise objectives to allow really good advice.

      What's often key here is to imagine who you are presenting results to, what they expect, and what they need to be told.

      I am puzzled at the description "OLS" which specifies an estimator and not a model. (See for example Statalist member Jeff Wooldridge's introductory text http://www.amazon.com/Introductory-E.../dp/1111531048 for firm advice against this usage.) If you mean just linear regression on the variables as they arrive then it's entirely possible to me that your mixed success with regression arises because price or the predictors or both might be better analysed on a transformed scale.

      With just three predictors you should have enough scope to choose a simplified combination of the predictors by looking at the output of the PCA and/or the patterns shown by a scatter plot and correlation matrix. The PCA gives you linear combinations of the variables fed to it; it may be the case that you can identify much simpler combinations that work just as well. Conversely, PCA won't do a good job at identifying nonlinear combinations if those are really needed.
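As a rough illustration of that comparison, the sketch below computes the first principal component of a correlation matrix and checks how closely its scores track a plain average of the standardized variables. It is a minimal Python sketch under stated assumptions: the data are invented, the helper names are hypothetical, and the leading eigenvector is found by simple power iteration, which is adequate here only because the three made-up variables are all positively correlated.

```python
from math import sqrt
from statistics import mean, stdev

def standardize(x):
    """Mean 0, sample standard deviation 1."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    za, zb = standardize(a), standardize(b)
    return sum(p * q for p, q in zip(za, zb)) / (len(a) - 1)

def first_pc_weights(columns, iterations=500):
    """Leading eigenvector of the correlation matrix, by power iteration."""
    k = len(columns)
    R = [[corr(columns[i], columns[j]) for j in range(k)] for i in range(k)]
    w = [1.0] * k
    for _ in range(iterations):
        w = [sum(R[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w

def scores(columns, weights):
    """Weighted sum of the standardized variables for each observation."""
    z = [standardize(c) for c in columns]
    n = len(columns[0])
    return [sum(wt * zc[i] for wt, zc in zip(weights, z)) for i in range(n)]

# Hypothetical data: three predictors observed on six units.
x1 = [2, 4, 3, 7, 5, 8]
x2 = [1, 3, 4, 6, 5, 9]
x3 = [3, 2, 5, 6, 7, 7]
cols = [x1, x2, x3]

weights = first_pc_weights(cols)          # PCA loadings on component 1
pc1 = scores(cols, weights)               # component scores
simple = scores(cols, [1 / 3] * 3)        # plain average of z-scores
r = corr(pc1, simple)
print([round(w, 3) for w in weights], round(r, 4))
```

If the correlation between the two sets of scores is close to 1, as it tends to be when the loadings are nearly equal, the simple average carries essentially the same information and is far easier to explain to readers.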

      I am a moderate fan of PCA on its home ground. But resorting to PCA to choose predictors in such a situation introduces complexity and looks like a gesture of surrender.

      EFA and CFA have their advocates too, as Ben's post exemplifies.

      Either way, it is really best not to mix PCA and FA terminology. If you are using PCA, don't call the predicted variables "factors".
      Last edited by Nick Cox; 30 Mar 2015, 06:24.



      • #4
        I did mean a linear regression model with OLS estimators. I have come up with a reasonable interpretation of my PCA component variables.

        Thanks a lot for clarifying the issue Ben and Nick!



        • #5
          Dear all,

          I'm facing a similar issue: I would like to use PCA to create a kind of index and use it as an additional independent variable. More generally, I'm investigating the influence of independent directors (as members of a company's board of directors) on earnings management, measured as a continuous variable, in a panel-data setting.

          Specifically, I want to investigate whether independent boards are able to reduce earnings management. In a second step, I want to investigate a moderation effect: whether the presumed earnings-management-reducing effect of independence is weaker when firms face a higher degree of "complexity", because monitoring becomes harder for independent (outside) directors.

          This approach to complexity and firm performance using PCA was previously implemented by Olubunmi Faleye in his paper "The costs of (nearly) fully independent boards" in the Journal of Empirical Finance.

          The latent construct of complexity is built from a combination of variables such as firm size, number of business segments, the Herfindahl-Hirschman Index, etc.
          Even though I know how to run the PCA and predict the first component (pc1), I wonder whether this approach would be correct here.

          As I understand it, PCA is primarily meant to reduce the number of variables. However, that is not my main goal here (or only partly), since I still need some of the input variables, such as firm size, as independent variables later on. This leads to high collinearity with the newly generated "complexity" variable.

          1. Would you consider this approach correct or feasible? That is, predicting the score from these variables and then continuing to work with both the original independent variables and the newly constructed "complexity" variable.

          2. If so: could the sample of companies then be divided into "highly complex" and "less complex" companies in order to investigate the effect of independence in each group, or should the variable be treated as what it is, namely continuous?

          I would be grateful for any hints.
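To make the two options in question 2 concrete, here is a minimal Python sketch of how each version of the moderator could be constructed. All numbers and variable names are hypothetical (nothing here comes from real firm data), and the sketch only builds the variables; it does not fit the panel model.

```python
from statistics import mean, median

# Hypothetical firm-level values, invented for illustration only.
independence = [0.2, 0.5, 0.6, 0.8, 0.4, 0.9]    # share of independent directors
complexity = [-1.3, 0.4, 1.1, -0.2, 0.7, -0.7]   # e.g. a predicted pc1 score

def center(x):
    """Subtract the mean so that the variable has mean zero."""
    m = mean(x)
    return [v - m for v in x]

# Option A: keep complexity continuous and interact it with independence.
# Centering both variables first makes the main effects easier to read.
ind_c = center(independence)
cmp_c = center(complexity)
interaction = [a * b for a, b in zip(ind_c, cmp_c)]

# Option B: median split into "highly complex" (1) and "less complex" (0).
cut = median(complexity)
high_complexity = [1 if v > cut else 0 for v in complexity]

print(interaction)
print(high_complexity)
```

Option A uses all the variation in the index, whereas option B treats every firm on the same side of the cut-off as identical, so it discards the within-group differences in complexity.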

