Building an index with PCA and MI

Lorenzo Fabiani

Join Date: Jun 2022

Posts: 12
#1

24 Jun 2022, 06:03

Hello everyone. Thanks in advance to the forum users, who are very often a great help. I am a beginner in using Stata.

I am writing because I am having some problems in the construction of an index on the quality of the rule of law by regions.

Basically, I have 5 variables and I want to summarize them in a single variable, which would be the index. To aggregate them I first need a weight for each variable. From the literature, a good method seems to be to use the principal components as weights. So the idea is to estimate the principal component weight for each variable, multiply each observation by that weight, and finally aggregate (sum) the results.

The first problem is that I have missing values. To deal with them I am using multiple imputation before proceeding to the principal component analysis. I'm trying to fix the "invalid file" error shown in the screenshot below (invalid '"C:\Users\........\index_results') but can't figure out whether it's a syntax problem or something else. Could you kindly help me? More generally, I would welcome your opinion on the methodology I am using, and any suggestions.

This is the code I am currently using:

Code:

clear
capture log close
import excel "C:\Users\.......\PANEL.xlsx", sheet("panel") firstrow
drop if year==.
mi set mlong
mi xtset, clear
mi register imputed reportedcaught endofthesentence clearance_rate_civils collection_capacity tax_gap
mi impute mvn reportedcaught endofthesentence clearance_rate_civils collection_capacity tax_gap, add(20) replace
log using "C:\Users\......\log.log", replace
mi estimate, cmdok saving ("C:\Users\.......\index_results", replace): pca reportedcaught endofthesentence clearance_rate_civils collection_capacity tax_gap, components(1) covariance vce(normal)
mi predict using ("C:\Users\........\index_results")
save "C:\Users\........\dataset_imputed", replace
log close
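
The thread never pins down the cause of the "invalid" message, so the following is only a sketch of the documented forms of the two commands involved, not a diagnosis. In the manuals, `saving()` is written with no space before the parenthesis, `using` takes a bare file name without parentheses, and `mi predict` needs a new variable name before `using`. The example swaps in `regress` on the `mheart5` example dataset instead of `pca` on the original data; the file name `miest_demo` and the variable name `xb_mi` are made up for illustration. Whether `mi predict` gives anything meaningful after `pca` is not established here.

```stata
* Sketch only: mheart5 is a Stata example dataset (needs internet access).
webuse mheart5, clear
mi set mlong
mi register imputed age bmi
mi impute mvn age bmi = attack smokes hsgrad female, add(5) rseed(12345)

* saving() directly after the option name; results go to miest_demo.ster
* in the current working directory (hypothetical file name).
mi estimate, saving(miest_demo, replace): regress bmi age smokes female

* mi predict: new variable name first, then using and the file name,
* with no parentheses around the file name.
mi predict xb_mi using miest_demo
```

If the same shape is applied to the original code, the truncated paths would go inside `saving()` and after `using` in quotes, exactly as they appear in the poster's do-file.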

Last edited by Lorenzo Fabiani; 24 Jun 2022, 06:05.
Nick Cox

Join Date: Mar 2014

Posts: 35652
#2

24 Jun 2022, 07:12

I can't easily follow this.

If you have five highly correlated variables then the first principal component calculated from those variables is a candidate for a summary. Whether that is a good idea depends, as already hinted, on the strength of correlations. If the variables really are highly correlated, then it would be a lot simpler just to choose one of them on substantive grounds.

What you propose seems to be some kind of mishmash of the variables and the components, but there is no information in the components that was not in the original variables. The principal components already are weighted linear combinations of the variables; there isn't an obvious further step to use the principal component results to weight the variables.
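
To illustrate the point that the component scores are themselves weighted sums of the variables, here is a minimal sketch on the `auto` dataset shipped with Stata (not the poster's data). It assumes the default correlation-based `pca`, where the scores are built from standardized variables and the eigenvector stored in `e(L)`; the names `pc1`, `pc1_manual` and `z_*` are made up for the example.

```stata
sysuse auto, clear

* First principal component of four variables (correlation matrix by default)
pca price mpg weight length, components(1)
predict pc1, score

* Rebuild the score by hand: standardize each variable and weight it
* by the corresponding element of the first eigenvector.
matrix L = e(L)
generate double pc1_manual = 0
local i = 0
foreach v of varlist price mpg weight length {
    local ++i
    egen double z_`v' = std(`v')
    replace pc1_manual = pc1_manual + L[`i',1]*z_`v'
}

* The two versions should agree up to rounding.
generate double diff = pc1 - pc1_manual
summarize diff
correlate pc1 pc1_manual
```

So the weights are already inside PC1; multiplying the variables by the loadings and summing simply reproduces the score.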

If your variables are on the same scale, a mean across variables might make as much or more sense. If the variables are on quite different scales, standardize first.
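
A rough sketch of both options, again on the `auto` dataset rather than the poster's data, with made-up names `index_raw` and `index_std`. The four variables used here are on different scales, so the standardized version is the one that would make sense for them; the raw mean is shown only for the syntax.

```stata
sysuse auto, clear

* Same-scale case: a plain mean across variables
* (rowmean() ignores missing values within each row).
egen index_raw = rowmean(weight length trunk headroom)

* Different-scale case: standardize each variable first, then average.
foreach v of varlist weight length trunk headroom {
    egen z_`v' = std(`v')
}
egen index_std = rowmean(z_weight z_length z_trunk z_headroom)

summarize index_raw index_std
```

With the poster's five ratios, which are described as being on the same scale, the first form applied to those variables would be the analogue.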

I could see much point in an index if there were 50 or 500 variables and fitting a model with that many predictors is unattractive or impossible. With just 5, I would use them all as predictors, and see what fell out.

Having missing values and wanting to do multiple imputation is orthogonal to all that.
Lorenzo Fabiani

Join Date: Jun 2022

Posts: 12
#3

24 Jun 2022, 09:10

Thank you very much for the reply. Let me give you more information: the variables are basically on the same scale. They are ratios (%) ranging from 0% to about 140% (0 to 1.40). The screenshot below may give you an idea.

The variables don't seem to be correlated. I checked the correlations both on the data with missing values and on the imputed data, and the results are similar. The results are in the screenshot.
Nick Cox

Join Date: Mar 2014

Posts: 35652
#4

24 Jun 2022, 11:16

If your 5 variables have such poor correlations, there is nothing much for PC1 to capture.
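
A small simulation sketch of this, using five independent random variables (made-up names `x1` to `x5`, arbitrary seed and sample size) rather than the actual data:

```stata
clear
set seed 12345
set obs 500

* Five variables generated independently of each other
forvalues j = 1/5 {
    generate x`j' = rnormal()
}

correlate x1-x5
pca x1-x5

* e(Ev) holds the eigenvalues; with a correlation-based PCA of five
* variables they sum to 5, so dividing by 5 gives PC1's share of variance.
matrix Ev = e(Ev)
display "Share of total variance on PC1: " Ev[1,1]/5
```

With uncorrelated inputs each eigenvalue is close to 1, so PC1 carries only about a fifth of the total variance, little more than any single variable.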
Lorenzo Fabiani

Join Date: Jun 2022

Posts: 12
#5

26 Jun 2022, 19:53

Thank you. Now I know I shouldn't be using PCA. I will look in the literature for other solutions. If you already have recommendations, they are appreciated. Greetings, Lorenzo