  • Regression - Mixture of single and multi item variables

    Hello everyone,

    I have a problem with a regression analysis. This is the first time I have worked with Stata, so this may be quite an easy question.

    My research question is which variables influence students' preferences for small or large companies after graduation ("Which size of company would you prefer after graduation?"). I have a panel dataset; the survey was conducted five times.

    I identified several characteristics that describe small (UG_W = 0) and large (UG_W = 1) companies. These attributes are career possibilities, social relationships, and compensation. The students had to rate how important each attribute is to them in a work-related context.

    For my work I identified the single items that best reflect the attributes: opportunity for advancement, social relationships at the workplace, and high wage.

    I also pooled some of the single items to cover more relevant aspects:
    1) career possibilities: career_W = (opportunity for career advancement + managerial responsibility + opportunity for professional training)/3
    2) social relationships: social_W = (social relationships at the workplace + teamwork + work life balance)/3
    3) compensation: payment_W = (high wage + high social benefits + additional benefits)/3

    The reliability coefficient for the pooled job attributes is around 0.7, which is a good value.
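
    For readers who want to reproduce the pooling step, here is a minimal Stata sketch for the career scale. It assumes that career_pos_W (the single item used in the regressions in this post) is the "opportunity for career advancement" item, and that the reliability coefficient is Cronbach's alpha. The names mgmt_resp_W and training_W are hypothetical placeholders for the other two items and should be replaced with the actual survey variables.

    ```stata
    * mgmt_resp_W and training_W are hypothetical names, not from the post.
    * Cronbach's alpha for the three career items, with item-level diagnostics
    alpha career_pos_W mgmt_resp_W training_W, item

    * Scale score as the plain average of the three items, as defined above.
    * With generate, the score is missing whenever any one item is missing.
    generate career_W = (career_pos_W + mgmt_resp_W + training_W) / 3

    * Alternative (an assumption, not the poster's method): rowmean()
    * averages over whichever items are non-missing for each respondent
    egen career_W_rm = rowmean(career_pos_W mgmt_resp_W training_W)
    ```

    The two versions differ only in how they handle missing item responses, which can change the estimation sample of the later regressions.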

    If I run a regression first with the single items, not every coefficient is significant.

    Single Items (SI) : xtlogit UG_W career_pos_W social_rel_W high_wage_W, re


    If I run a regression with the pooled variables, not every coefficient is significant.

    Multi Items (MI): xtlogit UG_W career_W social_W payment_W, re


    But if I mix the variables I get a good fit.

    Mixed: xtlogit UG_W career_pos_W social_rel_W payment_W, re


    My question is whether there are reasons why it is not allowed to mix pooled with single item variables in a regression.

    Thank you very much!
    Best Katharina

  • #2
    There is no reason that a regression can't be done using a mixture of scale scores and single item responses. It does not make sense to combine a scale with individual items that are part of that same scale: in that case you will get a better fit from putting in all of the items of that scale separately and letting the data and model determine their relative contributions. If you are doing exploratory analyses to try to find a "best" model you probably shouldn't rely on the p-values of the individual predictors. The overall R2, or better still, a penalized version such as AIC or BIC, is a better guide to model selection.

    One caution: it sounds to me like you do not have a pre-specified plan for your analysis, nor a scientific theory to guide it, and you are trying out combinations of variables in search of low p-values. If that's so, your p-values in the end will not mean what they appear to mean. The effects of the variables you do end up identifying in that way are likely to be overestimated, and could even have the wrong sign. This is fine if your purpose is simply to explore the data and generate hypotheses to be tested in a replication of your study. But this approach does not lead to conclusions that stand on their own.
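
    As an illustration of the AIC/BIC comparison suggested above, the three specifications from the first post could be fitted and compared roughly as follows. This is only a sketch: the panel identifier id and the survey-wave variable wave are hypothetical names, and the stored-estimate names si, mi, and mixed are arbitrary labels.

    ```stata
    * Declare the panel structure; id and wave are hypothetical variable names
    xtset id wave

    * Fit each candidate specification from the first post and store the results
    xtlogit UG_W career_pos_W social_rel_W high_wage_W, re
    estimates store si

    xtlogit UG_W career_W social_W payment_W, re
    estimates store mi

    xtlogit UG_W career_pos_W social_rel_W payment_W, re
    estimates store mixed

    * One table with N, log likelihood, df, AIC, and BIC for each stored model
    estimates stats si mi mixed
    ```

    The N and df columns of that table make it easy to see whether the models were fitted on the same number of observations and with the same number of parameters before their AIC and BIC values are compared.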


    • #3
      Katharina:
      following the lines of Clyde's sound advice, before any search for a good (over)fitting of your variables, I would spend some time reading what others did in your research field (i.e., in terms of predictors and statistical models) when faced with the same research goal. This is highly advisable if, instead of simply exploring your data, you're planning to submit a manuscript on this topic to a prominent journal in your research field.
      Kind regards,
      Carlo
      (Stata 19.0)


      • #4
        Actually, I did spend time on research. The attributes are linked to specific company sizes in the literature. Compensation, for example, is an attribute of large companies, as one can assume that large companies are financially less restricted than small companies. According to the literature, the dominant item for compensation is high wage. Things like additional benefits (e.g. car, pensions...) are mentioned as well.
        So I did a regression with the dominant core items, a regression with the pooled items and a mixed regression.

        I looked at the AIC/BIC values and they are almost the same across the different regressions: AIC ranges from about 270 to 278 and BIC from about 293 to 301.

        SI: AIC=278.2332 BIC=301.064
        Pooled: AIC=270.4645 BIC=293.2953
        Mixed: AIC=277.1535 BIC=296.1792


        • #5
          If I have to choose a model, should I rely solely on the AIC/BIC values, or should I also take the p-values into account?

          I have one model with better AIC values, but not all of its p-values are significant. The other regression shows highly significant coefficients, but its AIC is not as good.


          • #6
            Does the rule of choosing the model with the smaller AIC still hold if the number of observations isn't the same?


            • #7
              Katharina:
              yes, it holds.
              Kind regards,
              Carlo
              (Stata 19.0)


              • #8
                Thank you
