
  • Differing results with stepwise regression (& categorical predictors)

    Hi,

    I am having trouble interpreting the results of my logistic regression analysis.

    I am running an exploratory analysis using stepwise regression (I understand this method has limitations) including categorical variables as follows:

    Code:
    xi: stepwise, pr(0.2): logistic outcome (i.nationality) gender age (i.medications) (i.quality)
    I then reran the final model without stepwise (i.e., entering only the variables that were retained in the model), using either of the following:

    Code:
    logistic outcome i.nationality i.quality
    logistic outcome _Inationali_2 _Inationali_3 _Inationali_4 _Iquality_2 _Iquality_3
    The results of these final two regressions are identical, but they differ slightly from the stepwise regression results. The number of observations retained in the stepwise analysis is also smaller.

    I am having trouble understanding the mathematics/reasons behind these models giving different results. Could anyone help with this? I am using Stata 14.2 and have included my data below.

    Thank you in advance for any advice,
    Bryony


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(outcome nationality) byte gender float(age age_cat medications    quality)
    0 1 1   43 2 6 2
    1 2 1   32 1 4 2
    0 2 0 39.5 1 6 1
    0 1 0   41 2 4 1
    0 1 0 32.5 1 3 1
    0 1 1 24.5 1 2 1
    0 3 0   35 1 3 1
    0 2 0 36.5 1 4 2
    0 1 1   39 1 3 2
    0 2 0   44 2 1 2
    1 2 0 26.5 1 2 2
    . 1 0 47.5 2 3 1
    0 1 0   37 1 6 1
    . 1 0 52.5 3 1 1
    . 1 0 36.5 1 1 2
    . 2 0   42 2 3 3
    0 1 0   53 3 5 1
    0 4 0   32 1 2 1
    1 1 0   52 3 6 3
    . 1 0 35.5 1 1 1
    . 1 0   27 1 4 1
    0 2 0   42 2 1 1
    0 2 0   39 1 1 1
    0 2 0 47.5 2 4 3
    . 1 0 40.5 2 1 3
    . 2 0   42 2 6 .
    0 2 0   40 1 2 2
    0 1 0   40 1 4 1
    0 1 1   28 1 3 3
    0 2 0   37 1 6 1
    1 1 0    . . 1 3
    0 1 0 34.5 1 4 2
    0 2 0   28 1 3 1
    0 1 1 28.5 1 4 1
    0 1 0   56 3 4 1
    0 4 0   43 2 3 2
    0 1 0 38.5 1 4 1
    0 1 0 60.5 3 3 2
    0 1 0   46 2 1 1
    0 1 0   44 2 2 2
    . 1 0 55.5 3 2 1
    0 1 0   35 1 5 1
    0 1 1   32 1 3 3
    0 4 0   39 1 3 1
    . 2 0   46 2 5 1
    0 2 0   36 1 4 1
    1 4 0 55.5 3 5 .
    0 4 0 28.5 1 3 1
    1 1 1   55 3 1 1
    1 1 0   34 1 4 2
    1 1 0 65.5 3 1 .
    1 1 0   55 3 4 3
    . 2 0 33.5 1 1 2
    0 1 0 42.5 2 1 1
    0 2 0   40 2 4 2
    0 2 0 27.5 1 1 1
    0 2 0 22.5 1 6 3
    . 1 0 45.5 2 4 1
    1 1 0   50 2 1 1
    0 1 0 48.5 2 5 2
    0 1 0 42.5 2 6 2
    0 2 0 51.5 3 3 1
    . 3 0   43 2 1 1
    0 2 0 30.5 1 4 1
    0 4 0   47 2 3 2
    0 2 0 36.5 1 1 1
    0 2 0   52 3 4 2
    0 2 0   35 1 1 1
    0 2 0 35.5 1 5 3
    1 1 1   48 2 2 2
    0 1 0 25.5 1 2 1
    . 1 0   45 2 4 2
    . 1 0 47.5 2 5 1
    . 1 0 49.5 2 3 1
    1 4 0 35.5 1 2 3
    0 1 0 30.5 1 4 3
    0 1 0   47 2 2 1
    . 2 0 36.5 1 4 1
    . 1 0 55.5 3 3 1
    . 2 0 40.5 2 2 2
    . 1 0   38 1 4 2
    0 2 1 25.5 1 4 3
    . 1 0 60.5 3 2 1
    . 1 0 29.5 1 3 1
    0 2 0 50.5 3 5 1
    . 1 0   32 1 2 2
    . 1 0 30.5 1 4 1
    0 1 0    . . 1 1
    . 1 0 28.5 1 3 2
    . 1 0 26.5 1 3 1
    0 2 0 44.5 2 2 1
    . 2 0 38.5 1 4 3
    0 2 0    . . 2 1
    1 3 0 42.5 2 4 2
    1 1 1   32 1 4 3
    . 2 0   30 1 6 2
    . 2 0   34 1 5 2
    1 2 1 30.5 1 1 3
    . 2 0   32 1 3 1
    . 2 0 40.5 2 2 1
    end

  • #2
    It is likely because of missing data in the variables that were not selected. From the manual:

    Whether you use backward or forward estimation, stepwise forms an estimation sample by taking observations with nonmissing values of all the variables specified (except for depvar1 and depvar2 for intreg). The estimation sample is held constant throughout the stepping. Thus if you type

    . stepwise, pr(.2) hierarchical: regress amount sk edul sval

    and variable sval is missing in half the data, that half of the data will not be used in the reported model, even if sval is not included in the final model.
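
    A minimal sketch of how this could be checked on the data in #1, assuming the -dataex- example there is in memory. The variable name complete is made up for illustration; the idea is to flag the observations with no missing values in any variable named in the stepwise call and refit the retained terms on those observations only.

    Code:
    * Flag observations that are nonmissing on every variable in the stepwise call
    generate byte complete = !missing(outcome, nationality, gender, age, medications, quality)
    count if complete

    * Refit the retained terms on that sample only
    logistic outcome i.nationality i.quality if complete
    If missing data is indeed the explanation, the number of observations and the estimates from this refit should line up with the stepwise output.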
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      That's really helpful, thank you!



      • #4
        Incidentally in Stata 16 lasso is being suggested as an alternative to stepwise. I told StataCorp people that lasso sounded like high-tech stepwise regression to me and was therefore the work of the devil, but they didn't agree. Here are two presentations on lasso from the 2019 Chicago Stata Conference:

        https://www.stata.com/meeting/chicag...cago19_Liu.pdf

        https://www.stata.com/meeting/chicag...19_Drukker.pdf
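
        For orientation only, a rough sketch of what the lasso equivalent of the model in #1 might look like. It assumes Stata 16 or later (the lasso commands are not available in Stata 14.2), and the seed value is arbitrary. With a sample this small and an outcome this rare, the cross-validation may well struggle, so treat it as a syntax illustration rather than a recommended analysis.

        Code:
        * Requires Stata 16 or later
        * Lasso for a binary outcome; rseed() makes the cross-validation folds reproducible
        lasso logit outcome i.nationality gender age i.medications i.quality, rseed(12345)

        * List the predictors with nonzero coefficients at the selected lambda
        lassocoef
        Like stepwise, this fits on observations that are nonmissing on all listed variables.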
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam



        • #5
          Thank you for the lasso comment, Richard. I thought I understood the stepwise process, but there is another thing I do not quite understand.

          In the dummy dataset below, I run regressions using two different outcomes (outcome1 and outcome2) and keep the covariates the same. Neither outcome has any missing values.

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(outcome1 outcome2 cat1 cont1 cat2 bin1 nonmiss)
          0 1 1 12 . 0 1
          1 1 1 13 2 0 0
          1 1 1 14 3 0 0
          0 1 . 11 1 1 1
          0 0 2 12 2 1 0
          0 0 2 13 3 1 0
          1 0 2 14 1 0 0
          1 0 2  . 2 0 1
          0 0 2 12 3 0 0
          0 0 3 14 1 1 0
          1 0 3 10 2 1 0
          1 1 3 12 3 1 0
          0 1 3  . 1 0 1
          0 1 3 14 2 0 0
          1 0 4 15 3 0 0
          1 0 4 16 1 . 1
          1 1 4 12 2 1 0
          0 1 4 13 3 1 0
          0 1 4 15 1 0 0
          end
          Code:
          xi: stepwise, pr(0.2): logistic outcome1 (i.cat1) cont1 (i.cat2) bin1
          xi: stepwise, pr(0.2): logistic outcome2 (i.cat1) cont1 (i.cat2) bin1
          I understood that the estimation sample would be formed by taking all observations with non-missing values (ie, n=14). However, the number of observations included in the model differs for outcome1 (n=14) and outcome2 (n=10). Is it possible to explain the reason for this?

          My apologies for my misunderstanding & thank you for your patience.



          • #6
            The Stata output says observations are being dropped because of estimation problems. The sample is very small. It is a problem with the model and data, not with stepwise.
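
            One plausible reading of the dummy data in #5 (an assumption, since the output itself was not posted): among the 14 complete cases, all four observations with cat1 == 2 have outcome2 == 0, so the dummy _Icat1_2 predicts failure perfectly. logistic normally drops such a variable together with its observations, which would leave 14 - 4 = 10. A minimal sketch to check this, assuming the -dataex- example from #5 is in memory; the variable name complete is made up for illustration.

            Code:
            * Flag the complete cases that stepwise would start from
            generate byte complete = !missing(cat1, cont1, cat2, bin1)
            count if complete

            * Look for categories in which the outcome takes only one value
            tabulate cat1 outcome2 if complete
            tabulate cat1 outcome1 if complete
            In the second table, cat1 == 2 contains both values of outcome1, which would fit with all 14 observations being kept for that outcome.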
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            StataNow Version: 19.5 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam
