  • Changing reference category in categorical variable gives more significant results

    Hi everyone!

    I am new to statistics so I guess there is an obvious answer to this question, but I found it surprising:

    I am researching the effect of creativity on submitted innovation ideas in an organization. My dependent variable is a binary variable (dummy) that is 0 if the respondent has not submitted any ideas and 1 if they have submitted one or several ideas.

    One of the control variables is position. It is a categorical variable with 9 response items. In short: managers have submitted most ideas and the responding health workers have not submitted any ideas.

    Here's the thing:

    I have usually just told Stata to treat it as a categorical variable by including it as i.position. This uses the first category (managers) as the reference category. When I do this, (almost) none of the categories come out significant.

    [Image: Screenshot 2021-04-14 at 12.02.49.png]

    However, when changing the reference category to health workers (who submitted 0 ideas) all positions came out significant.
    [Image: image_22081.png]


    Shouldn't the relationship between the different positions be the same regardless of which category is used as reference?


    Another question:
    When calculating the percentage of respondents from each position who have submitted ideas I find 89% of managers to have submitted ideas and 64% of physical therapists. However, in the regression results the physical therapists have a higher coefficient than managers (.6363 vs .5357). How can this be?
    Last edited by Aleksander Erichsen; 14 Apr 2021, 04:27.

  • #2
    Only with two categories does switching the base give a mirror image of the result under the other base: the coefficient changes sign and nothing else changes.

    With more than two categories this no longer holds.

    In every case we are measuring something relative to something else, so the measurement depends on what we are measuring it relative to.
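
    To make this concrete, here is a minimal sketch. It assumes the outcome is called submittedIde and that position is coded 1 to 9; categories 2 and 3 are picked purely for illustration. A difference between two non-base categories can be recovered under any base, and it equals the coefficient you get once one of them is made the base:

    Code:
    * assumed names: submittedIde (0/1 outcome), position (coded 1-9)
    regress submittedIde i.position
    * category 3 versus category 2, while category 1 is the base
    lincom 3.position - 2.position
    * same model, category 2 as the base
    regress submittedIde ib2.position
    * the coefficient on 3.position now matches the lincom estimate above

    So the relationships between positions are the same under either base; the output table simply reports a different subset of the pairwise comparisons.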


    • #3
      The p-values for each of those 8 indicator variables will often change when the reference group is changed. Each one is essentially a t-test of the mean difference between that indicator's group and the reference group. So when a group at the extreme low or high end is chosen as the reference, the differences are larger, which can make more of those t-tests statistically significant. This does not change the performance of the model as a whole: as you can see, the overall F-statistic at the top, the sums of squares, etc. are identical. The key point is this: never judge whether a categorical variable is significant based on its indicators' p-values. Use -help testparm- to learn more on how to test the whole set of indicators; that test tells us if the whole categorical variable is predicting well or not.
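
      A quick way to check that the model itself is unchanged is to fit it under both bases and compare the stored fit statistics. This is only a sketch: I am assuming position is coded 1 to 9 with managers as 1, and the code 9 for health workers is a guess that should be replaced with the actual value.

      Code:
      * base = category 1 (managers)
      regress submittedIde ib1.position
      display "F = " e(F) "   R2 = " e(r2)
      * base = category 9 (assumed code for health workers)
      regress submittedIde ib9.position
      display "F = " e(F) "   R2 = " e(r2)

      Both display lines should print the same numbers; only the individual coefficients and their p-values differ between the two fits.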

      As for the second question, without the data I can't really tell. But if you just want to recover the mean response, then consider taking the constant out of the regression. Or perform a one-way ANOVA. It'd also be helpful to see the actual prevalence by showing the table produced through:

      Code:
      tabstat submittedIde, stat(n mean) by(position)
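
      If the goal is to get the group means straight out of the regression, either of the following should work. This is a sketch that assumes the model contains only position; with other covariates in the model, the coefficients are adjusted differences and need not follow the order of the raw percentages.

      Code:
      * no base category and no constant: each coefficient is a group mean
      regress submittedIde ibn.position, noconstant
      * or keep the constant and ask for the predicted mean of each group
      regress submittedIde i.position
      margins position

      With position as the only predictor, the means from both approaches should agree with the tabstat table.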


      • #4
        Thanks so much for your quick responses!

        Is there a best practice for which category should be used as the reference group in a categorical variable?

        Also, if one can't judge a categorical variable based on the p-values of the different categories, is there a way to know whether the variable overall is significant when used in a multivariate regression? Or is the only way to test the categorical variable as a whole to run a separate bivariate regression and look at the model's F statistic?


        • #5
          Is there a best practice for which category should be used as the reference group in a categorical variable?
          Because the reference is the group that everything else is compared against, it's usually not a good idea to use a "vague" group. For example, "Don't know/Refused" makes a challenging reference group. Similarly, unbounded top and bottom categories (e.g. "1,000,000 or above per year" in an income variable) can be difficult. Read the interpretation out loud and check whether it sounds natural.
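
          Once you have settled on a reference group, there are a few ways to tell Stata about it. A sketch, assuming position is coded 1 to 9; the value 4 is just a placeholder for whichever category you pick:

          Code:
          * choose the base for a single command
          regress submittedIde ib4.position
          * use the most frequent category as the base
          regress submittedIde ib(freq).position
          * or set a default base that later i.position terms will use
          fvset base 4 position
          regress submittedIde i.position
          * restore the default behaviour
          fvset clear position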

          For the second question, please see what I mentioned previously:

          Use -help testparm- to learn more on how to test the whole set of indicators; that test tells us if the whole categorical variable is predicting well or not.
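
          This works in the full model too, so no separate bivariate regression is needed. A minimal sketch, where creativity is a made-up name for your main predictor:

          Code:
          * creativity is a hypothetical variable name
          regress submittedIde creativity i.position
          * joint test that all position coefficients are zero
          testparm i.position

          The result of this joint test does not depend on which category is the base.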
