
  • Losing observations

    Hi everyone,

    I am trying to perform PCA. Before anything else, I've been told to check the correlations among the variables I will be using (the higher the correlation, the better). The problem is that even though I have 38 observations, correlate is only using 10 of them. Do you know why?

    I attach the .dta file. Any other comments would be very helpful. I know it is not a big dataset for PCA.

    Thank you

    joan

  • #2
    -corr- drops any observation where any of the variables specified in the command is missing. Your data file, in fact, has only 10 cases with no variable having a missing value. If you want to see correlations where you include, for each correlation, any observation where that particular pair of variables has non-missing values, you can use -pwcorr- instead. However, you should be aware that -pca- will treat your data the same way that -corr- does: it will only look at the 10 cases where the data are complete. Furthermore, you cannot simply substitute the correlations from -pwcorr-, because pairwise correlation matrices like that are sometimes singular (not to mention that they are often biased as well).

    By the way, with only 10 complete cases and 12 variables, you will not be able to run -pca- at all: the covariance matrix will be singular.

    Bottom line: you need to fill in the missing values if that is possible, or else get a larger data set.
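
    The difference between the two commands can be sketched on Stata's shipped auto dataset, which is used here only as a stand-in for the attached file (an assumption made so that the example runs anywhere). In that dataset rep78 has some missing values, so the two commands use different numbers of observations.

    ```stata
    * Assumption: the shipped auto data stand in for the poster's file
    sysuse auto, clear

    * Listwise deletion: every correlation uses only the cars
    * that are complete on all three variables
    correlate price mpg rep78

    * Pairwise deletion: each correlation uses all cars non-missing on that pair;
    * the obs option prints the number of observations behind each entry
    pwcorr price mpg rep78, obs
    ```

    The (obs=...) line in the -correlate- output and the per-cell counts from -pwcorr, obs- show directly how many cases each approach keeps.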



    • #3
      By the way, rather than posting a data set as an attachment, it is better to get the -dataex- command from -ssc install dataex-, run it on your data, and copy its output directly into your post. Some forum members are reluctant to download attachments from strangers, given that they can contain malware.



      • #4
        Missing values. Stata drops every observation that has a missing value on at least one of the variables. You should first examine the number of missing values in each variable and consider dropping variables with a large number of missing values. In your dataset only 10 observations have no missing values on all 13 variables, 16 observations have 1 missing value, seven have 2, etc. Alternatively, you can use pwcorr instead of corr to get a correlation matrix with pairwise deletion (each correlation is computed from the observations that are non-missing on that pair of variables, regardless of the other variables) and use this matrix with the pcamat command.

        Code:
        . egen nomis=rowmiss( SbF- composta251bc)
        
        . ta nomis
        
              nomis |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  0 |         10       26.32       26.32
                  1 |         16       42.11       68.42
                  2 |          7       18.42       86.84
                  3 |          2        5.26       92.11
                  4 |          3        7.89      100.00
        ------------+-----------------------------------
              Total |         38      100.00
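
        The table above counts missing values per observation. A complementary per-variable view, which is what is needed to decide which variables to drop, could look like the sketch below. It runs on Stata's shipped auto dataset, an assumption made only so that the example is self-contained; on the poster's data the same two commands would be run after loading that file.

        ```stata
        * Assumption: the shipped auto data stand in for the poster's file
        sysuse auto, clear

        * Number of missing values per variable
        * (only variables that have missing values are listed)
        misstable summarize

        * Which variables tend to be missing together
        misstable patterns
        ```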



        • #5
          "Alternatively, you can use pwcorr instead of corr to get a correlation matrix with pairwise deletion (which is a correlation between pair of variable regardless the other variables) and use this matrix with pcamat command"

          Not necessarily. Pairwise correlation matrices often turn out to be singular. If that happens, -pcamat- will complain and produce no output.
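
          One way to check a pairwise matrix before handing it to -pcamat- is to look at its eigenvalues. The sketch below is only an illustration on Stata's shipped auto dataset, and it assumes a recent Stata version in which -pwcorr- stores the full pairwise matrix in r(C). The value passed to n() is also an assumption: with pairwise deletion there is no single correct sample size.

          ```stata
          * Assumption: the shipped auto data stand in for the poster's file
          sysuse auto, clear

          * Assumption: pwcorr leaves the pairwise correlation matrix in r(C)
          pwcorr price mpg rep78 headroom
          matrix C = r(C)

          * Eigenvalues of C: a zero or negative value signals a matrix
          * that is singular or not positive semidefinite
          matrix symeigen V lambda = C
          matrix list lambda

          * pcamat requires n(); 69 is the number of cars with rep78 observed,
          * chosen here purely for illustration
          pcamat C, n(69)
          ```

          A clean eigenvalue check says nothing about bias: the correlations can still be distorted if the values are not missing completely at random.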

          Even if the matrix isn't singular, unless the missing values are known to be missing completely at random, the pairwise correlations could very well be biased estimates of the actual correlations, and I don't know what effect that has on the principal components.

          I think she needs to get more complete data.



          • #6
            Thanks a lot, Clyde and Oded!

            These data come from a poll conducted several years ago, so I have no chance of filling in the missing values from primary data.

            Could I do it "artificially"? Are there methods to systematically fill in those missing values with something reasonable? I am aware this is second best and that any such method will reduce variability and therefore erode the quality of the PCA, but as far as I understand, this is needed to make PCA possible at all.

            Joan



            • #7
              You should take a look at programs that run PCA with missing values. Stata's built-in sem command can fit a confirmatory factor analysis (CFA) with missing values (option method(mlmv)), but there is no equivalent for exploratory factor analysis (EFA). You can, however, follow this website and use its approach. Another option is to look for a program that performs single imputation, or to use the mi suite in Stata to run multiple imputation, then run pca on each imputed dataset separately and compare the results. Unfortunately (again), the mi suite does not support factor analysis.
              In your case, I would begin by dropping the SBF and LBF variables and then replacing the missing values in the two other variables (OBF and PBF) by single mean imputation, to discover the latent structure of the data. From that point you can check whether the results under single imputation differ from the results under multiple imputation.
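
              A minimal sketch of single mean imputation followed by pca, again on Stata's shipped auto dataset rather than the poster's variables (an assumption, as is the new variable name rep78_imp):

              ```stata
              * Assumption: the shipped auto data stand in for the poster's file
              sysuse auto, clear

              * Single mean imputation: copy rep78 and fill its missing values
              * with the mean of the observed values
              quietly summarize rep78
              generate rep78_imp = cond(missing(rep78), r(mean), rep78)

              * pca now keeps every observation, at the cost of
              * understating the variance of the imputed variable
              pca price mpg rep78_imp headroom
              ```

              Keeping the original variable untouched and imputing into a copy makes it easy to compare the results with and without imputation.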



              • #8
                Thanks Oded, Questions derived:
                1) From now on I am assuming that confirmatory factor analysis is in practice the same as PCA (together with FA, SEM, EFA... I am a bit confused). When using SEM as in the link provided, which are the latent variables, and which part of the output should I use to build the index? This is their code:

                Code:
                sem (F1 -> y1 y2 y3 y4 y5@0 y6) ///
                    (F2 -> y1 y2@0 y3 y4 y5 y6), ///
                    variance(F1@1 F2@1) standardized

                2) I would be very grateful for further guidance on single imputation.



                • #9
                  Joan, I'm not an expert in SEM, and you would do best to start by reading the manual entry for sem in Stata. As for your question, F1 and F2 are the latent (unobserved) factors. You can specify that explicitly with the latent(F1 F2) option.
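
                  As a rough illustration of the latent() option together with method(mlmv), here is a one-factor sketch on simulated data. Everything in it is an assumption made for the example: the variable names (truef, y1 to y4), the loadings, the seed, and the share of values set to missing.

                  ```stata
                  * Simulated data: four indicators driven by one common factor
                  clear
                  set seed 12345
                  set obs 200
                  generate truef = rnormal()
                  forvalues i = 1/4 {
                      generate y`i' = 0.7*truef + rnormal()
                  }

                  * Set roughly 20% of y1 to missing
                  replace y1 = . if runiform() < 0.2

                  * F is declared latent explicitly; method(mlmv) uses every observation,
                  * including those where y1 is missing
                  sem (F -> y1 y2 y3 y4), method(mlmv) latent(F)
                  ```

                  Without latent(), sem would still treat F as latent here because the name is capitalized and no variable called F exists in the data; the option only makes that explicit.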

