
  • May I use PCA to create an Index and use that as one of my independent variables?

    Dear Members,

    I would like to seek help with an issue regarding the use of PCA/FA to construct an independent variable for a logit model. This question is similar to the one posted by Stefanie in 2015 (http://www.statalist.org/forums/foru...ndent-variable), but I am still unclear on certain points.

    I am interested in studying consumers' likelihood to consume Artificially Sweetened Beverages (ASB), and I am currently using 2012 and 2014 data from the same source. My dependent variable is 'ASBup', where ASBup=1 if there is an increase in ASB consumption in 2014 versus 2012, and 0 otherwise.

    My other independent variables include, for example:
    schoolgrade2 (education level as of 2014)
    sleephrs2 (average hours of sleep per week in 2014)
    dayshift2 (=1 if working day shift, =0 otherwise)

    There are several other independent variables of interest, such as 'exercise', 'readnutritionlabel', 'readingredientlabel', 'weeklyfruitconsumption' and 'weeklyvegetableconsumption'. These 5 variables appear to be insignificant as separate independent variables in my logit model, but I believe factor analysis could be helpful in this scenario. However, I am a little worried about the interpretation of the factor analysis index with regard to the probability of increasing ASB consumption. I am currently using 2 factors, which I plan to call 'dietawareness1' and 'dietawareness2'.

    Currently, my logit model looks something like this: logit(Pr(ASBup=1)) = a + b1(schoolgrade2) + b2(sleephrs2) + b3(dayshift2) + b4(dietawareness1) + b5(dietawareness2).
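
    In Stata, I would fit this model along the following lines (a sketch; dietawareness1 and dietawareness2 would be the predicted factor scores):

    Code:
    logit ASBup schoolgrade2 sleephrs2 i.dayshift2 dietawareness1 dietawareness2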

    Any help with interpreting a factor used as an independent variable in a logit model would be appreciated. Please do not hesitate to ask for further clarification, which I would be happy to provide!

    Many thanks,
    Kai


  • #2
    Welcome to Statalist. Your post has been up, and unanswered, for more than 10 hours now. Your explanation of what you have done is pretty clear. For my part, I'm not offering an answer because I don't see a question. What exactly do you want to know/learn/do at this point?



    • #3
      I'll try an answer. Naturally you may do this. Whether it's a good idea is harder to say. Personally, the role I see for PCA here is to guide your choice of predictors by looking at the loadings. What drives that advice is principally this:

      1. Your work is likely to be much more interesting to others with named predictors that people can think about, rather than nameless PCs that may not themselves be easy to interpret. When I read papers with PCs used as predictors, my impression of what the regressions mean is much fuzzier.

      2. If individual predictors are not helping in your model, then mishmashes of them as PCs are unlikely to help more. That's advice from experience rather than a solid fact for your data (which naturally we can't see).

      3. You haven't got an enormous number of predictors, so careful selection of them isn't likely to be especially difficult.

      I have focused on PCA here. If you choose some as yet unnamed flavour of factor analysis, the advice from enthusiastic factor analysts (I am not one such) might be more optimistic.
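
      In Stata terms, a minimal sketch of what I mean by inspecting loadings, using your five variable names (any varlist would do):

      Code:
      pca exercise readnutritionlabel readingredientlabel weeklyfruitconsumption weeklyvegetableconsumption
      estat loadings   // which variables drive each component
      screeplot        // eigenvalues, as a guide to how many components matter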



      • #4
        Nick Cox
        Thank you for your response. I understand what you are trying to tell me, and yes, I do have a few more categorical independent variables, which will be included in the model as dummy variables. My total independent variable count would exceed 10 without the factors/PCs, hence my urge to use them. Then again, I agree with your 1st point, but I am not exactly sure what to name the factor.

        I've obtained 1 factor from the factor analysis of those 5 variables (using the command "factor $xlist, mineigen(1)"). That factor shows positive loadings for 'exercise', 'weeklyfruitconsumption' and 'weeklyvegetableconsumption', but negative loadings for 'readnutritionlabel' and 'readingredientlabel'. May I know your opinion on an appropriate name for this factor?
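
        For reference, my workflow looks something like this (a sketch; $xlist is a global macro holding the five variables):

        Code:
        * put the five variables into a global macro
        global xlist exercise readnutritionlabel readingredientlabel weeklyfruitconsumption weeklyvegetableconsumption
        factor $xlist, mineigen(1)   // retain factors with eigenvalue > 1
        predict dietawareness1       // factor score, for use in the logit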

        @Clyde Schechter Hi Clyde, thank you for noticing my post. I would like to know whether it is necessary to use factor analysis in my logit model. Without the factor, all 5 of my independent variables were very insignificant; when they are replaced with the factor, it is nearly significant at the 10% level.

        2nd question (the same as two paragraphs above): may I know your opinion on an appropriate name for this factor?


        Thank you guys for your inputs!

        Sincerely,
        Kai



        • #5
          Kai: My comments included a mild (although utterly standard) warning that PCs (and factors too) can be difficult to interpret. Not knowing your data or your problem, I can't offer a better interpretation of your results.



          • #6
            Without the factor, all 5 of my independent variables were very insignificant; when they are replaced with the factor, it is nearly significant at the 10% level.
            The use of PCA or factor analysis in these contexts is usually for the purpose of reducing an excessively large number of variables to a single "common theme" variable. (Sometimes it's done to deal with multicollinearity, but you haven't mentioned any such issue here.) It isn't clear to me that there's a good reason to try to do that here. It doesn't sound like you have some unwieldy regression model with more variables than you can handle. It sounds like you are on a quest for "significance." But if the five variables can contribute significance through a linear combination (which is what a PC or factor is), then a joint test of their significance in the original model will do that just as well, perhaps better. I'm not seeing a compelling case for this approach.
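
            To make that concrete, a sketch of the joint test in Stata, using the variable names from earlier in this thread (your actual model may differ):

            Code:
            logit ASBup schoolgrade2 sleephrs2 i.dayshift2 exercise readnutritionlabel readingredientlabel weeklyfruitconsumption weeklyvegetableconsumption
            * Wald test that all five coefficients are simultaneously zero
            test exercise readnutritionlabel readingredientlabel weeklyfruitconsumption weeklyvegetableconsumption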

            May I know your opinion on an appropriate name for this factor?
            That's a tough one. There is no obvious concept that unites higher exercise and consumption of produce, which are healthy lifestyle behaviors, with less tendency to read nutrition and ingredient information. Even if the signs of all of the loadings had been the same, this is still a somewhat heterogeneous group of attributes: activities on the one hand and information-seeking on the other. One also has to be cautious in interpreting factors or principal components: sometimes the "common theme" is just the use of some word or some style of syntax. (For example, it is not uncommon for all of the questions that are worded negatively to load together on the same principal component or factor.) Really, I can't think of a good descriptor for this. This is the kind of thing that Nick is warning you about when he says that PCs and factors can be difficult to interpret.

            Having not seen the full results of the analyses, I'm reluctant to venture a firm opinion, but it's looking like your factor analysis just isn't helpful here and you should probably just stick with the original 5 variables.



            • #7
              Okay, that was really helpful. The only worrying sign of multicollinearity is between 'weeklyfruitconsumption' and 'weeklyvegetableconsumption' (+0.54). I will take your advice on performing a joint significance test. May I know on what basis the joint significance test is 'better' than looking at individual t-tests? And how do I justify using the joint test (which gives a significant outcome) as opposed to individual t-tests (which leave all of the predictors insignificant)?

              Oh alright, that makes sense. I guess principal components will be taken out of my model then!



              • #8
                The joint test of significance tests the null hypothesis that all five coefficients are simultaneously zero. Joint hypothesis tests are used when you want to treat a group of variables as a single predictive unit.

                An example of this is in a classical analysis of variance context. When you do something like -anova outcome categorical_variable- you are doing the equivalent of -regress outcome i.categorical_variable-. The overall F-test for the ANOVA is exactly the same as the joint test of all the levels of the categorical variable in the regression. The different levels of a categorical variable form a conceptual unit: they must all be included in the model together (or all be omitted together). You have made a judgment that these five variables are like that: they represent different aspects of some underlying construct.
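
                For illustration, with hypothetical variables outcome and catvar:

                Code:
                anova outcome catvar       // overall F-test
                regress outcome i.catvar
                testparm i.catvar          // same F-statistic as the ANOVA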

                When you are inclined to take a group of variables and make an index or score out of them, you are saying that there is something about these variables that goes together and that they should be combined. When you do make an index (whether impressionistically or through PCA or factor analysis), you are selecting a particular linear combination of those variables and using that as a predictor. Notably, that particular linear combination is defined without regard to the outcome variable you are trying to predict. If instead you use the variables as they are in the regression and do a joint significance test, you are testing the significance of the particular linear combination that is the best possible predictor of the outcome (at least given all the other variables in the model). That is why the joint test is more powerful than using a factor or principal component.

                So why does anyone ever use indices or principal components or factors instead of the underlying variables? The usual reason is that there are too many separate variables and the model is unwieldy (or even can't be estimated, because the number of variables exceeds the available degrees of freedom in the data). Psychological measurement scales are often like this: the standard scale for assessing post-traumatic stress disorder, for example, contains 17 questions. So it is usually more convenient to just use the sum of all 17 responses as a scale score. Or, where the same scale has been factor-analyzed previously into four or five factors, the factor scores are more convenient to use than the 17 separate items, and perhaps provide a more detailed and nuanced picture than the grand total. It's also worth noting that any one of the 17 items usually does a poor job of predicting anything else: it is the combined signal that comes through when all 17 are put together. The single items are not very useful separately.
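
                For example, a simple sum score for a scale like that might be computed as follows (hypothetical item names, assumed to be stored consecutively in the dataset):

                Code:
                * total score across the 17 items; rowtotal treats missing items as 0
                egen ptsd_total = rowtotal(item1-item17)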



                • #9
                  Thank you for the clarification above.

                  Alright, I believe the joint significance test fits my situation better than any data reduction method, because I do not have tons of variables of interest. Sorry, I am new to the 'anova' command. I tested -anova outcome categorical_variable- and the F-test result is identical to the joint significance test in a normal regression. What about the case of multiple categorical variables? I attempted -anova outcome categorical_variable1 categorical_variable2 categorical_variable3 categorical_variable4 categorical_variable5- and obtained a different F-test score for each variable. Is there something I could do with this result?



                  • #10
                    I think I confused you. I was not recommending that you use the -anova- command for your data analysis. In my day, people were taught ANOVA before they learned regression. (I just rechecked the first two semesters of the statistics curriculum at my own institution: regression is heavily covered, but ANOVA is not taught at all. I haven't taught that course in a long time, so I guess I'm out of touch.) So I (mistakenly, it appears) thought that you would be more familiar and comfortable with ANOVA than with regression. I was trying to show how the joint significance test in regression corresponds to the overall F-test of a one-way ANOVA, to help you understand what a joint significance test following regression is and why one might use it, by relating it to something I thought you knew in greater depth.

                    Having gone that route, I'll now just point out that the separate F-test scores for each of those variables equal the results of the joint tests that you could carry out after a regression, as follows:

                    Code:
                    * ANOVA with five categorical predictors
                    anova outcome cat_var1 cat_var2 cat_var3 cat_var4 cat_var5
                    
                    * equivalent regression; each testparm F matches the ANOVA F for that variable
                    regress outcome i.(cat_var1 cat_var2 cat_var3 cat_var4 cat_var5)
                    testparm i.cat_var1
                    testparm i.cat_var2
                    testparm i.cat_var3
                    testparm i.cat_var4
                    testparm i.cat_var5
                    But, again, I was not recommending that you switch to the -anova- command to analyze your data. It was intended just as a teaching point--but probably not an appropriate one for you.



                    • #11
                      Ah okay, sorry about that. I have been using regression all along, and ANOVA was new to me just earlier today. I tested the code provided, and the F-test scores correspond. Thank you so much, Clyde, for your help; I really appreciate that you are helping out on a Sunday!

                      I will be improving on my model for the next few days and hopefully everything will be under control.

                      Best wishes,
                      Kai



                      • #12
                        Sorry Clyde, I do have another question. Does it make sense to test the joint significance of multiple categorical variables, i.e. "testparm i.cat_var1 i.cat_var2 i.cat_var3" rather than the usual "testparm i.cat_var1" "testparm i.cat_var2" "testparm i.cat_var3"?



                        • #13
                          That depends on what the variables are and what they mean. It was sensible to jointly test the five variables that were the subject of this thread in the beginning because you thought of them as being closely related, effectively different indicators of some common construct. In fact, you thought that strongly enough that your original plan was to combine them into a single variable using factor analysis or PCA.

                          So you have to ask yourself what the relationships among cat_var1 cat_var2 and cat_var3 are. If they are also closely related variables that one might even think about combining into a single scale score, then it would also make sense to test their joint significance. If they are just three essentially unconnected categorical variables like, for example, sex, whether there is a family history of psoriasis, and history of post-traumatic stress disorder, then it would not be sensible to do a joint test of their significance.

                          Remember, when you do a joint significance test, you are testing the null hypothesis that all of the variables' regression coefficients are zero. Unless those variables have something in common, that's usually not a sensible question to ask. Of course, it depends on your research goals. You wouldn't ordinarily have reason to jointly test the significance of, say, age group, sex, race, ethnicity, marital status, and education. But if your research hypothesis is that your outcome variable is associated with "demographic factors" as a group, then these variables now share a common trait: they are all demographic factors. If you aren't interested in such a non-specific hypothesis about "demographic factors", then there would be no reason to do a joint test.
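
                          If you did want such a test, it would be a single -testparm- call over all of the demographic variables (hypothetical names):

                          Code:
                          * joint Wald test of all demographic coefficients after the regression
                          testparm i.agegroup i.sex i.race i.ethnicity i.maritalstatus i.education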

                          So, bottom line: formulate the null hypothesis. Then decide whether asking that question is relevant to your research goals or not.



                          • #14
                            Originally posted by Clyde Schechter:
                            If they are also closely related variables that one might even think about combining into a single scale score, then it would also make sense to test their joint significance.
                            Regarding this point, I would like to ask another question (hopefully it doesn't sound silly). I have 2 categorical variables, regexer2 and strenuousexer2, which take values 0 to 4, with 0 being '0 exercises per week', 1 being '2 to 3 times per week', and so on for both variables (refer to attached picture). Given that categories [2,3,4,5] represent ranges of values, is it still possible to combine these 2 variables into a single scale score?

                            This is my only idea for generating a single scale score, though I believe the method is flawed. These are the commands:

                            Code:
                            generate totalexercise=0 if strenuousexer2 + regexer2==0
                            replace totalexercise=1 if strenuousexer2 + regexer2==1
                            replace totalexercise=2 if strenuousexer2 + regexer2==2
                            replace totalexercise=3 if strenuousexer2 + regexer2==3
                            replace totalexercise=4 if strenuousexer2 + regexer2==4
                            replace totalexercise=5 if strenuousexer2 + regexer2==5
                            replace totalexercise=6 if strenuousexer2 + regexer2==6
                            replace totalexercise=7 if strenuousexer2 + regexer2==7
                            replace totalexercise=8 if strenuousexer2 + regexer2==8

                            Sadly I realise this method is terribly flawed, because there are problems with the definition of 'totalexercise'. Any suggestions on generating a single scale score?
                            Attached Files: [tables of regexer2 and strenuousexer2]



                            • #15
                              Before getting to your substantive question, let me comment on the code. It is not necessary to write out all nine possible sums like that. You can do this in a single line of code:

                              Code:
                              * single-line equivalent of the nine generate/replace commands
                              gen total_exercise = regexer2 + strenuousexer2
                              Your options for generating a scale score here are the same as with the variables earlier in the thread. A simple sum of scores is a reasonable approach. You could also do a principal components analysis, although with two variables, all a PCA will get you beyond a simple sum is some rescaling of each variable due to differences in their variances. But "eyeballing" the tables you show, the distributions aren't all that different, so the result you would get from a PCA would probably be very close to just adding them up--for practical purposes, the differences would probably be negligible. Similar considerations apply to factor analysis here. (One difference is that with only two variables, confirmatory factor analysis is not possible, only exploratory.) So unless you have some reason to get into truly exotic scoring rules, for which I don't see any rationale, you're probably best off just summing the two variables.
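
                              If you want to check how little difference it makes, here is a sketch (reusing total_exercise from the line above):

                              Code:
                              pca regexer2 strenuousexer2
                              predict pc1                // score on the first principal component
                              corr total_exercise pc1    // a correlation near 1 says the simple sum suffices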

                              I would also ask, again, why you want to combine these. I'm not saying it's unreasonable to do so, but I would expect that for many outcomes the effects of regular exercise might differ from those of strenuous exercise--perhaps even go in opposite directions. So folding them together into a single score might well obscure those differences and lead you to misleading conclusions. I don't know what outcomes you're planning to use these in connection with, nor what your research goals are, but most of the contexts I can imagine for variables like this would be better handled by keeping them as separate variables.

