
  • Building an index with PCA and MI

    Hello everyone. Thanks in advance to the forum users, who are very often a great help. I am a beginner in Stata.

    I am having trouble constructing a regional index of the quality of the rule of law.

    I have 5 variables that I want to summarize in a single variable, the index. To aggregate them I first need a weight for each variable. From the literature, principal components seem to be a good source of weights: estimate the principal component weight of each variable, multiply each observation by its weight, and finally sum the weighted variables.

    The first problem: I have missing values. To handle them I am using multiple imputation and then running the principal component analysis on the imputed data. I am trying to fix the "invalid file" error shown in the screenshot below (invalid '"C:\Users\........\index_results'), but I can't figure out whether it is a syntax problem or something else. Could you kindly help me? More generally, I would welcome your opinion on the methodology I am using and any suggestions.

    This is the code I am currently using:

    Code:
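    clear

    capture log close

    * Load the panel from Excel; the first row holds the variable names
    import excel "C:\Users\.......\PANEL.xlsx", sheet("panel") firstrow

    drop if year==.

    * Declare the mi style and clear any panel settings
    mi set mlong

    mi xtset, clear

    * Register the five indicators as variables to be imputed
    mi register imputed reportedcaught endofthesentence clearance_rate_civils collection_capacity tax_gap

    * Add 20 imputations under a multivariate normal model
    mi impute mvn reportedcaught endofthesentence clearance_rate_civils collection_capacity tax_gap, add(20) replace

    log using "C:\Users\......\log.log", replace

    * saving() written in its conventional form, with no space before the parenthesis
    mi estimate, cmdok saving("C:\Users\.......\index_results", replace): pca reportedcaught endofthesentence clearance_rate_civils collection_capacity tax_gap, components(1) covariance vce(normal)

    * mi predict expects a new variable name and a file name without parentheses;
    * the parentheses are a likely source of the "invalid" error.
    * index_pc1 is a hypothetical name, and mi predict may not support pca results.
    mi predict index_pc1 using "C:\Users\........\index_results"

    save "C:\Users\........\dataset_imputed", replace

    log close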

    [Screenshot: Stata_res.png]
    Last edited by Lorenzo Fabiani; 24 Jun 2022, 06:05.

  • #2
    I can't easily follow this.

    If you have five highly correlated variables then the first principal component calculated from those variables is a candidate for a summary. Whether that is a good idea depends, as already hinted, on the strength of correlations. If the variables really are highly correlated, then it would be a lot simpler just to choose one of them on substantive grounds.

    What you propose seems to be some kind of mishmash of the variables and the components, but there is no information in the components that was not in the original variables. The principal components already are weighted linear combinations of the variables; there isn't an obvious further step to use the principal component results to weight the variables.
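    To illustrate that point, here is a minimal Stata sketch. It assumes simulated stand-in variables x1 to x5 (hypothetical names, not the poster's data). It rebuilds the first component score by hand from the loadings in e(L) and the standardized variables, which suggests that weighting the variables by the PCA results simply reproduces PC1.

```stata
* Sketch only: simulated x1-x5 are hypothetical stand-ins for the five indicators.
clear
set seed 2022
set obs 200
generate double common = rnormal()
forvalues j = 1/5 {
    generate double x`j' = common + rnormal()
}

* PCA on the correlation matrix (Stata's default), keeping one component
pca x1-x5, components(1)
predict double pc1, score

* Rebuild the score by hand: loadings times standardized variables, summed
matrix L = e(L)
generate double pc1_manual = 0
forvalues j = 1/5 {
    egen double z`j' = std(x`j')
    replace pc1_manual = pc1_manual + L[`j',1] * z`j'
}

* The two versions should be perfectly correlated, up to rounding
correlate pc1 pc1_manual
```

    If the correlation is 1, the hand-weighted sum carries nothing beyond what predict already returns.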

    If your variables are on the same scale, a mean across variables might make as much or more sense. If the variables are on quite different scales, standardize first.
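    A minimal sketch of that alternative, assuming simulated stand-in variables x1 to x5 (hypothetical names and values, not the poster's data): standardize each variable with egen's std() function, then take the row mean as an equal-weight index.

```stata
* Sketch only: x1-x5 are hypothetical stand-ins for the five indicators.
clear
set seed 12345
set obs 100
forvalues j = 1/5 {
    generate double x`j' = runiform() * 1.4
}

* Standardize each variable to mean 0 and standard deviation 1
forvalues j = 1/5 {
    egen double z`j' = std(x`j')
}

* Equal-weight index: row mean of the standardized variables
* (rowmean() averages over the nonmissing values in each row)
egen double index_eq = rowmean(z1 z2 z3 z4 z5)
summarize index_eq
```

    If the variables are already on a common scale, the standardization loop can be skipped and rowmean() applied to the raw variables.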

    I could see much point in an index if there were 50 or 500 variables and fitting a model with that many predictors is unattractive or impossible. With just 5, I would use them all as predictors, and see what fell out.

    Having missing values and wanting to do multiple imputation is orthogonal to all that.



    • #3
      Thank you very much for the reply. Let me give you more information: the variables are basically on the same scale. They are ratios ranging from 0% to about 140% (0 to 1.40). The screenshot below may give you an idea.
      [Screenshot: Dataset.png]

      They don't seem to be correlated. I have tested the correlation on both the data with missing values and the imputed values, and the results are similar. The results are in the screenshot Correlazione.png.
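      A comparison of this kind could be run roughly as follows. This is only a sketch on simulated data: x1 to x3, the missingness pattern, and the seeds are all hypothetical, and mi xeq is used to run correlate on the original data (m=0) and on the first imputation (m=1).

```stata
* Sketch only: simulated data with hypothetical variable names.
clear
set seed 1
set obs 200
forvalues j = 1/3 {
    generate double x`j' = rnormal()
}
* Make the first 20 values of x1 missing
replace x1 = . in 1/20

* Declare the mi data, register the incomplete variable, and impute
mi set mlong
mi register imputed x1
mi impute mvn x1 = x2 x3, add(2) rseed(1)

* Correlations in the original data (m=0) and in the first imputation (m=1)
mi xeq 0 1: correlate x1 x2 x3
```

      In m=0, correlate drops the observations with missing x1; in m=1 it uses the imputed values.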



      • #4
        If your 5 variables have such poor correlations, there is nothing much for PC1 to capture.
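        A small simulation may make this concrete. The sketch below assumes five independent simulated variables x1 to x5 (hypothetical stand-ins) and reads the eigenvalues from e(Ev) after pca to compute the share of variance captured by PC1.

```stata
* Sketch only: five independent simulated variables (hypothetical stand-ins).
clear
set seed 2022
set obs 500
forvalues j = 1/5 {
    generate double x`j' = rnormal()
}

* Pairwise correlations are close to zero by construction
correlate x1-x5

* PCA on the correlation matrix (Stata's default)
pca x1-x5

* Eigenvalues are stored in e(Ev); for a correlation matrix they sum to 5
matrix Ev = e(Ev)
scalar share_pc1 = Ev[1,1] / 5
display "Share of variance explained by PC1: " share_pc1
```

        With exactly uncorrelated variables every eigenvalue of the correlation matrix equals 1, so PC1 would explain 1/5 = 20% of the variance, no more than any single variable does.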



        • #5
          Originally posted by Nick Cox
          If your 5 variables have such poor correlations, there is nothing much for PC1 to capture.
          Thank you. Now I know I shouldn't be using PCA. I will look in the literature for other solutions. If you have any recommendations, they would be appreciated. Greetings, Lorenzo
