  • Variable selection - Exploratory Factor Analysis

    Hi everyone,

    Context:
    I want to perform an Exploratory Factor Analysis (EFA) on a survey. The number of participants is 152, but the final number of observations is 124 because there are some missing values. The number of questions in the survey, which corresponds to the variables of the analysis, is 124.

    Problem:
    I want to perform the EFA considering all the variables, but I am having 2 major problems:
    • When I try to create a global macro that includes all the variables, with global varlist var1-var124, Stata gives me the following message: “too many variables specified” r(103);
    • I tried the corr command with an increasing number of variables, always starting from the first (var1). As the number of variables that I correlate increases, the number of observations that Stata uses decreases (obs = 114, 95, ..., 4). This becomes extreme when I include a very large number of variables, up to var122, where I have obs = 4. If I try to include either of the last two variables, Stata returns the following error: “insufficient observations” r(2001).

    Question:
    Do you know how to solve these issues? Is there a solution or should I work with a lower number of variables?

    Thanks in advance for your support,
    Filippo

  • #2
    sounds like missing data


    • #3
      Filippo:
      welcome to this forum.
      As an aside to George's very likely diagnosis, it may well be that you have missing values scattered across the remaining observations, too.
      In addition, the (152-124) observations are probably unit nonresponses, in that they skipped all the questionnaire items.
      The fix depends on the missingness mechanism underlying the unobserved data, which has a relevant bearing on the methodological approach (e.g., multiple imputation if the unobserved data are presumed missing at random).
      That said, if the survey was based on a questionnaire, the questionnaire instructions usually cover how to deal with missingness in a given questionnaire item, whereas dealing with unit nonresponse is much more difficult (if feasible at all).
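      A minimal sketch of how unit nonresponse could be told apart from item nonresponse, assuming the survey items are stored contiguously as var1-var124 as in the original post (nmiss and unit_nonresp are illustrative names, not variables from the dataset):

      ```stata
      * count missing items per respondent (system and extended missing both count)
      egen nmiss = rowmiss(var1-var124)
      tabulate nmiss
      * respondents who skipped every item are candidate unit nonresponses
      generate byte unit_nonresp = (nmiss == 124)
      tabulate unit_nonresp
      * among the remaining respondents, see which items are missing together
      misstable patterns var1-var124 if !unit_nonresp, frequency
      ```

      This only describes where the missing values are; it does not by itself tell you which missingness mechanism is at work.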
      Kind regards,
      Carlo
      (Stata 19.0)


      • #4
        as a start, try
        Code:
        mdesc variablenames
        and, as Carlo suggested, look for weird values that might be placeholders for missing values.
        hist or summ might be helpful in that regard. In yesteryear, some datasets used 99999 or some such to indicate missing values, and that can really muck up the results.
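        A rough sketch of those checks, assuming the items are named var1-var124 as in the original post and that 99999 is the placeholder code (it may well be something else, or nothing at all, in your data):

        ```stata
        * mdesc is user-written; install it once from SSC
        ssc install mdesc
        mdesc var1-var124
        * min and max reveal placeholder codes such as 99999
        summarize var1-var124
        * convert the assumed placeholder code to system missing
        mvdecode var1-var124, mv(99999)
        ```

        Only run the mvdecode line once summarize has confirmed that such a code is actually present.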



        • #5

          Thanks for your reply!
          Sorry but I am a newbie in this forum and I have never performed an EFA.
          I provide you with further information, to have a better understanding of what I am doing:

          Context:
          The survey is made of 4 main blocks, which measure digital skills through different methods (self-assessment, engagement, ...), plus an introductory part where I gather general information about participants (gender, age, ...).
          The objective of the survey is to measure the digital skills of civil servants, so I want to consider only the variables related to the 4 blocks.
          The questions of the survey (my variables on Stata) are:
          - Multiple Choice Questions
          - "Yes", "No", "I don't know what you mean"
          - 5-point survey scale (-2, -1, 0, +1, +2) + "I don't know what you mean"

          What I have done:
          I coded the variables:
          - MCQ: I assigned a numerical value (from 1 to X).
          - Yes/No questions: Yes=1, No=0, "I don't know what you mean"=2
          - 5-point scale: "I don't know what you mean"=.a

          After your messages, I checked and I noticed that there are some missing values in my dataset. I plotted the results and I can see patterns that explain the reasons for these missing values, which are:
          - some people simply did not complete the survey;
          - some questions were left unanswered by more than one person;
          - some questions received only a few replies.
          I manually eliminated the observations for those people who did not respond to most of the questions, and I deleted those variables where most people did not reply. Thus, I applied a listwise deletion.

          Result:
          My dataset is now composed of 115 observations (starting from 152, 75.66% of the original obs) and 110 variables (starting from 148). I do not have any missing data anymore, but I still have some "I don't know what you mean".

          Questions:
          1) Should I perform the EFA analysis considering all the 110 variables together or should I perform 4 different EFA analyses considering each block of variables each time?
          The blocks are made of 19, 27, 39, and 20 variables (the sum is 105 because 5/110 variables still belong to the introductory part of the survey).
          2) How should I interpret "I don't know what you mean"? Stata recognizes it as a missing value because I coded it as ".a".
          3) Do you think that the listwise deletion of variables could be a good solution here, or should I apply imputation methods to cope with missing values? If the answer is the second, could you suggest some methods?

          Thanks in advance for your help,
          Filippo
          Last edited by Filippo Merigo; 21 Oct 2022, 04:36.


          • #6
            Filippo:
            some comments about your query:
            1) you cannot reasonably expect to retrieve any informative results dealing with 110 variables.
            I think you should try to combine/group variables first.
            2) I'm not clear about the way you numbered the "I don't know what you mean" response: was it 2 or .a? Please make this consistent.
            3) listwise deletion, especially if, as it seems, you diagnosed mechanisms/patterns of missingness, is like making up your original sample. This should be clearly reported in your paper/manuscript, as listwise deletion is acceptable in a very limited number of instances (see: https://statisticalhorizons.com/list...n-its-not-evil). Methods to deal with missing values depend on the missingness mechanisms (diagnosis should be made before therapy).
            Kind regards,
            Carlo
            (Stata 19.0)


            • #7
              As Carlo suggests, you probably have too many variables.

              Also, if you have ordered/categorical variables, you need to account for that.

              Code:
              help polychoric
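
              A hedged sketch of how this could look for one block, assuming the first block is var1-var19, that 115 complete observations are used (the figure from post #5), and that two factors are requested purely for illustration:

              ```stata
              * polychoric is user-written; locate and install it first
              search polychoric
              * polychoric correlation matrix for one block of ordinal items
              polychoric var1-var19
              matrix R = r(R)
              * factor-analyze the matrix; n() should equal the observations used
              factormat R, n(115) factors(2)
              rotate
              ```

              Note that a "don't know" response coded as 2 on a Yes/No item is not an ordered category, so such codes would need handling before this step.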


              • #8
                No way are you going to get something with 115 observations and 110 variables. I'd try to reduce it to 5-10 variables for starters, maybe culling variables with very high correlations.
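
                A small sketch for looking at those correlations on a first handful of items, assuming they are named var1-var10. It also shows why the observation count in the original post kept shrinking: corr deletes listwise, while pwcorr works pair by pair.

                ```stata
                * corr drops any observation with a missing value in any listed variable
                corr var1-var10
                * pwcorr uses all observations available for each pair; obs shows each n
                pwcorr var1-var10, obs
                ```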


                • #9
                  Thanks a lot for your replies!

                  1) My idea would be to run 4 different EFAs, using the variables of the 4 blocks/areas of my survey, which are made of 19, 27, 39, and 20 variables each. By doing so, I would reduce the number of variables used in each analysis. Does that sound reasonable to you? Is this reduction enough, or should I do something more? Would it be better to group variables of the same block? If so, do you have any clue about how to do that? I have not been able to find anything about this procedure. Unfortunately, I am struggling to find the rationale behind the choice of variables when running an EFA, which is why I am not sure about this point.

                  2) You are right. In the questions where the allowed answers are Yes, No, I don't know what you mean, I assigned it the value of "2"; in the case of the 5-point scale survey I assigned it the value of ".a".
                  What I see is that Stata treats ".a" as a missing value. The consequence is that, e.g., when computing the correlation between 2 variables that include the ".a" value, Stata drops the observations that contain that value.
                  Do you have any suggestions about how to cope with this problem? Would it be better to assign "I don't know what you mean" a numerical value, such as "0.5" or "3", in order not to modify too much the statistics of my observations (mean, median, ...)?

                  3) I just realized that 24 people did not reply to the survey at all, so I eliminated them. Then, 5 people did not complete the survey, in particular:
                  - 2 people only replied to the introductory questions --> I would eliminate them
                  - 2 people replied also to the first 6 questions of the first block on digital skills
                  - 1 person replied also to the entire first block on digital skills.
                  Regarding these 5 observations, I have 3 options for how to deal with them:
                  a) I drop all of them;
                  b) I keep the observation that replied to the entire block for the analysis related to it;
                  c) I adopt some imputation /ML methods to complete those observations.

                  Thanks again for your precious time and suggestions,

                  Filippo


                  • #10
                    Filippo:
                    1) the best advice is to discuss this issue with your supervisor/professor/more experienced colleagues. Each research field abides by its own tribal rules, and what is acceptable in some clans is rejected by others;
                    2) I fear there's a misunderstanding here. Numbers in categorical variables have no computational meaning. Therefore, you can safely assign "I don't know what you mean" level 2. In addition, no matter the numbers assigned, when dealing with categorical variables Stata renumbers them starting from 0 onwards.
                    3) again, it would be interesting to diagnose, given the observed values (e.g., age, educational level, employment status), why those individuals skipped some of the questionnaire items (or the questionnaire altogether).
                    That said, if you decided to get rid of those 24 persons (a strong methodological choice that needs a strong justification, especially if you're going to submit your paper to a technical journal in your research field) the other 5 unit nonresponses should follow the same destiny. How this choice influences your sample size and related statistics is another topic that deserves some thoughts.
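                    On point 2, a quick sketch for checking how the "I don't know what you mean" responses are currently stored, where var20 is only a stand-in for one of your 5-point items:

                    ```stata
                    * the missing option lists .a separately from system missing
                    tabulate var20, missing
                    * count responses stored as the extended missing value .a
                    count if var20 == .a
                    * count all missing values, system and extended
                    count if missing(var20)
                    ```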
                    Kind regards,
                    Carlo
                    (Stata 19.0)
