  • How do I divide PCA-results into tertiles?

    Hi, I'm completely new to Stata and PCA.
    Today I ran a PCA with 29 variables, which gave me 9 components.

    I tried the following commands, but I don't think it worked, and I did not understand the output I got:
    pca (29 variables)
    predict score1
    xtile score1_cat = score1, nq(3)

    From this I got a table where the values were quintile 1 = 4, quintile 2 = 3, quintile 3 = 3.

    What do I type as commands from here to 1) get a scatterplot, and 2) get the results divided into tertiles?

    I would be very grateful if someone could help with the commands.
    Thank you.

  • #2
    Latin is no longer lingua franca for anybody much outside the Vatican, so let's first clear up some terminology here.

    Quintiles are values that divide a range into 5 bins or classes, the resulting bins themselves also being called quintiles.

    Tertiles ... 3 bins or classes ...

    For a fuller list, see e.g. https://stats.stackexchange.com/ques.../235334#235334 and the Stata Journal article it references. The author tells me that he would be delighted to receive further terms or earlier references.

    Now to the point:

    You appear surprised that you have frequencies 4, 3 and 3 for tertiles. Assuming that you have just 10 observations, you can't do better than that!
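
    To see why 4, 3 and 3 is the best possible split of 10 observations, here is a minimal sketch on made-up data. The values 1 to 10 in score1 are an assumption for illustration only, not your PC scores.

    ```stata
    * toy data: 10 observations with distinct values 1, ..., 10
    clear
    set obs 10
    generate score1 = _n

    * three bins cannot hold 10 observations in equal numbers
    xtile score1_cat = score1, nq(3)
    tabulate score1_cat
    ```

    With distinct values the three bins should hold 4, 3 and 3 observations, the same frequencies you reported.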

    If I am missing your question, you may need to give more detail. But in any case, why throw away information by binning? Why not use PC scores directly?


    • #3
      Hi again, and thank you for your reply.
      I'm sorry, I meant to write quantiles and not quintiles in my post. (English is not my mother tongue.)

      My data has many more than 10 observations. It's a questionnaire, and I used 29 of the questions in my PCA (hence 29 variables, of which the PCA dropped 8 due to zero variance), but the PCA came out with 9 components with eigenvalues.

      Further, as I wrote above, I don't understand the values in my tertiles...
      I wish to set up a socioeconomic score in tertiles, so I don't understand where to go from here.

      I'm adding the results of my PCA.

      Thank you for your help so far.

      Attached Files


      • #4
        Stata is telling you quite a lot. Eight variables are being dropped from the PCA because they are constant in practice. If your variables contain missing values then those observations can't be included either.

        As it stands, only 10 observations were included in the PCA. If that is only a small fraction of your data, then the PCA results are probably useless and there is no point to following up with tertile bins either.

        Showing us the results of a plain

        Code:
        summarize 
        might allow further advice. What needs explanation is why only 10 observations were used, and missings are the likely problem.
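
        A minimal sketch of one way to check that, assuming hypothetical variable names v1, v2 and v3 in place of the variables you actually gave to pca:

        ```stata
        * v1 v2 v3 are placeholder names; substitute your own variable list
        egen nmiss = rowmiss(v1 v2 v3)

        * observations with no missing values on any listed variable;
        * pca uses only these
        count if nmiss == 0
        ```

        If the count is 10, missing values explain the small sample.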


        • #5
          10 observations is only a small fraction of my data, which has 3659 observations. I ran the PCA by typing the command given to me by another statistician. He has done a similar PCA on the dataset with fewer variables, but even though I used the same command he did (I only have it in writing, so it's possible that this is where things go wrong), he got "Number of observations = 3659" while I get only 10...
          Can we go back to basics on how to perform the PCA?
          The command I used to perform it was this:
          "pca M13 M14 M15 M16 M17_PCA M19 M20 M21 M22 M23 M24 H1 H2 H3 H4 H5 H6 H7Camilla H8 H9 H10 H11 H12 H13 H14 H15 H16 H17 H18 H19 H20 H21 H22 H23 H24 H25"

          I have run summarize and got a really long table. I had to put it in a PDF to make it readable, and I am adding the PDF here.
          Attached Files


          • #6
            As Nick already told you, you have missing values and they add up to only 10 valid observations. That is obviously unacceptable. So there is no use talking about pca before fixing that. Basically you have two options: either reduce the number of variables or use some technique that can deal with missing values. Given the level of skill apparent in your question, I strongly suggest the former. This means: do not include variables with too many missing values.
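
            One way to find the variables to leave out is a table of missing values per variable. This is only a sketch: the names are the first few from the command in #5, the reduced list passed to pca is purely hypothetical, and both should be adapted to the variables you are considering.

            ```stata
            * count missing values per variable (names copied from #5)
            misstable summarize M13 M14 M15 M16 M17_PCA M19 M20

            * rerun pca on a reduced list (hypothetical choice of variables)
            pca M13 M14 M15 M16

            * number of observations actually used by pca
            display e(N)
            ```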
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------


            • #7
              Hi again,
              Thank you for your support and recommendations. I have now spent some time going through the data and found that several of the variables I had used in my PCA had too many missing values. I have therefore done it all over again with a new set of variables, and now I think this is correct.
              Next, I used these commands to divide the PCA scores into tertiles:
              predict score1
              xtile score1_cat = score1, nq(3)

              From there I got the tertiles. However, the more I read, the more I think I should have some form of weight variable with the information on observations to create proper quintiles. Is this correct? I have a variable in my dataset that's called mater, which gave me 3658 when I listed it.

              How should I type the weight variable in the xtile command so that it comes out right?

              I'm adding the new PCA here.
              Attached Files


              • #8
                You lost me earlier on

                1. I need PCA (rather than choosing a variable that is interesting and intelligible)

                2. I need tertile bins (rather than using all the quantitative information you have).

                but I'll not push those further.

                As for using weights, why? In general, I can't see why weights need enter the picture at all, so where does that idea come from?

                (At the risk of seeming tedious, I'll underline again that quintiles is not an acceptable alternative spelling for quantiles. This is not a matter of English not being your first language, it's a matter of Latin being almost no-one's first language.)

                What would "proper" quantiles be? I've used quantiles one way or another for some while, and I don't know what that means at all.

                In general, I have to say that you seem way out of your depth here. Depending on your status -- undergraduate, graduate student, researcher, whatever -- you seem to need much more support from teachers, supervisors or mentors than you will get capriciously from an internet forum.


                • #9
                  Hi everyone!
                  I am new to Stata and I am working on designing choice sets for my research project. My design includes two product alternatives and one opt-out alternative.
                  I have three attributes with 6, 6 and 2 levels. The first and last are categorical and enter as dummy variables, while the second is a continuous variable, so the total number of coefficients will be 7. However, the following commands do not generate the choice sets. Could you please help me fix the issue?

                  matrix levmat = 6,6,2
                  genfact, levels(levmat)
                  matrix optout = J(1,3,1)
                  matrix b = J(1,7,0)
                  dcreate i.x1 c.x2 i.x3,nalt(2) nset(24) fixedalt(optout) asc(3) bmat(b)



                  • #10
                    Hello everyone
                    The outcome variable of this study will be knowledge of MTCT (mother-to-child transmission). It will be a composite score of five different questions:
                    ever heard of HIV; HIV transmitted by breastfeeding; HIV transmitted during delivery; HIV transmitted during pregnancy; drugs to avoid transmission of HIV to baby.

                    Responses will be coded as 1 = Yes and 0 = No. The outcome of interest for this analysis will be “women’s correct knowledge of MTCT” and will be defined as “yes” if the respondent correctly answers more than four out of five questions and “no” if the respondent fails to answer more than three out of five.

                    Could you please help me with a command to generate the index variable? Thanks.




                    • #11
                      #10 is unrelated to tertiles, or any other kind of quantiles, or principal component analysis. So, you might be better off posting this in a new thread.

                      But here goes an attempt at an answer.

                      If you have five different indicators, then the summary you seek is, if I understand this correctly,

                      Code:
                      gen HIV_sum = HIV1 + HIV2 + HIV3 + HIV4 + HIV5 
                      
                      gen HIV_good = HIV_sum == 5  if HIV_sum < .
                      as "more than 4" can only mean 5. However, "4 or more" is different in English and would imply

                      Code:
                      gen HIV_good = HIV_sum >= 4  if HIV_sum < .
                      See

                      https://www.stata.com/support/faqs/d...mmy-variables/

                      https://www.stata.com/support/faqs/d...rue-and-false/


                      Here I am imagining that your variables have names like HIV1 but naturally you should use the names you have for real in your dataset.
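
                      A quick check on the result may be worthwhile. This sketch still assumes the hypothetical names HIV1 to HIV5 and the variables HIV_sum and HIV_good created above.

                      ```stata
                      * each indicator should be 0, 1 or system missing
                      foreach v in HIV1 HIV2 HIV3 HIV4 HIV5 {
                          assert inlist(`v', 0, 1, .)
                      }

                      * cross-tabulate total score against the indicator, showing missing values too
                      tabulate HIV_sum HIV_good, missing
                      ```

                      Any observation with a missing answer has missing HIV_sum, and hence missing HIV_good, because + returns missing if any term is missing.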


                      • #12
                        Hello Sir

                        First of all, thank you very much. You understood it well and the commands worked perfectly.
                        Could you tell me the name of this procedure for creating an index variable from those five questions? Thank you.


                        • #13
                          It doesn't really need a name. Your procedure defines an indicator variable with value 1 if the total score is 4 or 5 and 0 otherwise. I don't know what advantages that has over using the total score directly. Some other names for "indicator" are binary, dichotomous, dummy, Boolean, and yet others.


                          • #14
                            Thanks a lot.


                            • #15
                              Hello
                              I am here again
                              I have also tried to run principal component analysis (PCA) to generate the outcome from the five questions.

                              Below are the commands I used.

                              pca HIV1 HIV2 HIV3 HIV4 HIV5, means
                              predict knowledge
                              xtile quintile = knowledge, nq(5)

                              Then I recoded using the command below:
                              recode quintile (1/4 = 0 "no") (5 = 1 "yes"), gen(knowledgecat)

                              Any help would be appreciated if I have got this wrong.

                              Thank you.
