Building Index based on dummy and ordered multinomial variables

Michelle Ordonez

Join Date: Jun 2020

Posts: 4
#1

16 Sep 2020, 08:06

Dear Stata users,

I am working on the replication of a poverty index based on 34 socioeconomic variables from a survey, such as internet access (a dummy variable) and housing conditions (a categorical variable with more than 2 categories). As the aim of this index is to summarize a large number of variables in a single "common theme" variable, the approach would be to run a PCA. However, I have read on other forums that the -pca- command is not recommended for categorical variables, and I did not find any guidance on how to proceed when dealing not only with binary but also with ordered multinomial variables. What would be the right approach, i.e. the right command and procedure, for this specific case?

As a reference (in case it helps to address my question), the authors who already performed this exercise built the index from 34 similar variables obtained from an older, different survey, using the CATPCA 2.0 algorithm available in SPSS 23. I would like to rebuild this index with up-to-date information, but in Stata, staying as close as possible to the method they used in SPSS.

I am not a Stata expert, so I apologize if this question is not appropriate, but I have been looking for an answer for a while without success.

Thank you in advance for your insights and further ideas!

Best,
Michelle
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#2

16 Sep 2020, 11:26

I am not entirely sure how CATPCA works in SPSS, but I believe what it does is convert the ordinal variables into a series of indicator ("dummy") variables, one for each response option. It then calculates a polychoric correlation matrix for those indicator variables and does principal components analysis on that. If I am correct about that, you can emulate that process in Stata by creating the indicator variables yourself (-tab- with the -gen- option is probably the easiest way to do that) and then using Stas Kolenikov's polychoricpca.ado. (You can get that by running -findit polychoricpca- in Stata and clicking on the blue link in the middle of the window that opens in the Viewer. When the next page opens, click on the blue link that says "click here to install".) If your task is to replicate the approach used in that reference, I believe this will accomplish it.

That said, I don't think you should take that approach if you are not compelled to. The problem with it is that it completely ignores the ordinal properties of the ordered multinomial variables. So, what I would do is just use -polychoricpca- directly on the variables themselves, without creating indicators. That will respect the ordinal nature of the multinomial variables. (If some of the multinomial variables are just at the nominal level, you should create indicators for those and use the indicators rather than the original variables with -polychoricpca-. -polychoricpca- should be used only with ordinal or dichotomous variables.)
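
A minimal sketch of the two routes described above, assuming the survey data are already in memory and the polychoric package is installed. All variable names (internet, housing_cond, roof_kind) and the score stubs are hypothetical placeholders, not names from the actual survey. Route 1 is wrapped in -preserve-/-restore- so that it leaves the data untouched before Route 2 runs.

```stata
* Assumes the survey data are in memory and -polychoricpca- is installed
* (e.g. via -findit polychoricpca-). All variable names are hypothetical.

* Route 1 (emulating the indicator-based approach): one 0/1 indicator per
* response option of the ordinal variable, created with -tabulate, generate()-
preserve
tabulate housing_cond, generate(hc_)      // creates hc_1, hc_2, ...
polychoricpca internet hc_*, score(ind_pc) nscore(1)
summarize ind_pc1
restore

* Route 2 (keeping the ordering): ordinal and dichotomous variables go in
* as they are; only a purely nominal variable is expanded into indicators
tabulate roof_kind, generate(roof_)       // creates roof_1, roof_2, ...
polychoricpca internet housing_cond roof_*, score(ord_pc) nscore(1)
summarize ord_pc1
```

The two routes are alternatives; in practice one would run only one of them on the full set of 34 variables.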

Michelle Ordonez

Join Date: Jun 2020
Posts: 4

21 Sep 2020, 10:20

Dear Clyde,

Thank you for the recommendation. I have already run the -polychoricpca- command to build a single common variable that summarizes the 34 input variables. For the purposes of my further questions, I chose 5 of the 34 indicators and include the results below:

Code:

. global Input_var flooring_type shower_acc_excl WC_type walls_material water_acc

. summ $Input_var


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
flooring_t~e |     43,311    2.247812    .7574951          1          3
shower_acc~l |     43,311     1.79197    .4059033          1          2
     WC_type |     43,311    3.403639    .8660932          1          4
walls_mate~l |     43,311    3.535361    .8448794          1          4
   water_acc |     43,311    2.607675     .726762          1          3

. 
. polychoricpca $Input_var, score(pca) nscore(1)

Polychoric correlation matrix

                   flooring_type  shower_acc_excl          WC_type   walls_material        water_acc
  flooring_type                1
shower_acc_excl        .62676209                1
        WC_type        .60100976        .64801277                1
 walls_material        .68810737        .51606265        .52305441                1
      water_acc        .57677818        .57841788         .7540365        .51149064                1

Principal component analysis

 k  |  Eigenvalues  |  Proportion explained  |  Cum. explained
----+---------------+------------------------+------------------
  1 |    3.412907   |    0.682581            |   0.682581
  2 |    0.626438   |    0.125288            |   0.807869
  3 |    0.436557   |    0.087311            |   0.895180
  4 |    0.287684   |    0.057537            |   0.952717
  5 |    0.236414   |    0.047283            |   1.000000

               Scoring coefficients

    Variable    |  Coeff. 1  |  Coeff. 2  |  Coeff. 3 
------------------------------------------------------
 flooring_type 
             1  | -0.649223  | -0.575753  |  0.123137 
             2  | -0.150742  | -0.133683  |  0.028591 
             3  |  0.409346  |  0.363021  | -0.077640 
 shower_acc_excl
             1  | -0.608344  |  0.164541  |  1.130971 
             2  |  0.159784  | -0.043217  | -0.297054 
 WC_type       
             1  | -0.912097  |  0.851732  | -0.296820 
             2  | -0.610421  |  0.570022  | -0.198647 
             3  | -0.298711  |  0.278941  | -0.097208 
             4  |  0.302362  | -0.282351  |  0.098397 
 walls_material
             1  | -0.914946  | -1.413795  | -0.685942 
             2  | -0.558547  | -0.863079  | -0.418747 
             3  | -0.337076  | -0.520857  | -0.252708 
             4  |  0.190888  |  0.294964  |  0.143110 
 water_acc     
             1  | -0.707103  |  0.719367  | -0.696202 
             2  | -0.386968  |  0.393679  | -0.381002 
             3  |  0.188853  | -0.192129  |  0.185942

As background before continuing: these 34 variables are indicators built from questions in the survey. For example, for the variable walls_material, households were categorized from worse to better (1 to 4) depending on their answer regarding the materials of their dwelling: 1 "Cane or other", 2 "Wood", 3 "Asbestos/Cement/Adobe", 4 "Concrete/Block/Brick". The same was done with the other variables.

What I would like to know is how to interpret the results of this analysis, i.e. the scoring coefficients for each of the variables. Why are there 3 coefficients? Do they refer to the first three components?

Furthermore, I would like to understand how the final score variable pca1 was built, and how the variables created from each of the input variables, which are named __tt plus the name of the respective variable (e.g. __ttwalls_material), contribute to it. To follow the methodology of the original index, I will have to identify the category with the lowest quantification in each variable, assign it a value of 0, shift the remaining categories by the same amount (the distance of the lowest value from 0), rescale the modified category quantifications from 0 to 100 and, finally, translate this into the final score.
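
To make the zero-anchoring step concrete, here is a rough sketch of the arithmetic using the first-component coefficients printed above for walls_material. It assumes, without confirmation, that these scoring coefficients play the role of the category quantifications in the original methodology. The names category, coef1 and shifted are made up for the illustration.

```stata
* Illustration only: run in a fresh session, since -clear- drops the data in memory.
* Values are the "Coeff. 1" entries for walls_material from the output above.
clear
input byte category double coef1
1 -0.914946
2 -0.558547
3 -0.337076
4  0.190888
end

* Anchor the lowest category at 0 by subtracting the minimum from every category
summarize coef1, meanonly
generate double shifted = coef1 - r(min)
list category coef1 shifted, noobs
```

The same shift would be repeated for each input variable before the 0 to 100 rescaling.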


Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#4

21 Sep 2020, 18:03

The whole polychoric concept is that ordinal variables are observed manifestations of underlying latent continuous variables, where the observed manifestation arises by imposing a series of cutpoints on a latent continuous, normally distributed variable. The polychoric correlation between two ordinal variables is calculated by fitting a model with two latent continuous variables having a bivariate normal distribution, and unknown cutpoints imposed on those latent variables to give rise to the observed ordinal variables. The correlation parameter of the bivariate normal distribution of the latent variables is the polychoric correlation coefficient. Polychoric PCA starts by calculating the polychoric correlation coefficient matrix among the variables, and then applies ordinary principal components analysis to that.
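
The latent-variable idea can be illustrated with a small simulation. The sketch below assumes the polychoric package (which provides -polychoric-) is installed. The latent correlation of 0.6, the cutpoints, the sample size and all variable names are arbitrary choices made for the demonstration.

```stata
* Simulate two latent normal variables, cut them into ordinal categories,
* and compare the Pearson and polychoric correlations of the observed codes.
clear
set seed 12345
matrix C = (1, 0.6 \ 0.6, 1)            // latent correlation chosen for the demo
drawnorm z1 z2, n(5000) corr(C)         // latent bivariate normal variables

* Impose arbitrary cutpoints to obtain the observed ordinal variables
generate byte x1 = 1 + (z1 > -0.5) + (z1 > 0.7)    // three ordered categories
generate byte x2 = 1 + (z2 > 0)                    // two categories

correlate x1 x2       // Pearson correlation of the category codes
polychoric x1 x2      // estimate of the correlation between the latent variables
```

With a sample of this size, the polychoric estimate should land close to the latent correlation used in the simulation, whereas the Pearson correlation of the category codes is typically attenuated.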

The three coefficients do, indeed, correspond to the first three components: the help file for polychoricpca tells you that.

I have not worked with scoring coefficients from -polychoricpca- in a long time, and I don't remember how they are calculated or used in scoring. And, at least today, I don't have the time to read up on it. So I'm going to pass on those questions. In terms of just having an index, I don't see any reason to do anything complicated. The first component, whose score is in the variable pca1 that the command created, accounts for a little over 68% of the variance in the set of variables (or, rather, in their underlying latent continuous variables) and you can just use it, either as is, or re-scaled from 0 to 100 if you like.
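
A minimal sketch of the 0 to 100 rescaling, assuming pca1 is the score variable created by the -polychoricpca- call shown earlier in the thread. The name pov_index is a hypothetical choice.

```stata
* Min-max rescaling of the first-component score to the 0-100 range.
* Assumes pca1 exists in memory; pov_index is a hypothetical variable name.
summarize pca1, meanonly
generate double pov_index = 100 * (pca1 - r(min)) / (r(max) - r(min))
summarize pov_index
```

Because the categories are coded from worse to better and the first-component coefficients rise with the category, higher values of this index would correspond to better conditions. If a higher value should mean more poverty, 100 - pov_index reverses the direction.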