
  • Where is the problem with stepwise procedures?

    My understanding has always been that any form of "stepwise elimination of insignificant variables" invalidates further inference (especially SE estimates and p-values), because stepwise procedures pick up random patterns in the data and understate the number of estimated parameters. I am trying to write a simulation in Stata to illustrate that point, but so far I have failed.

    My generated data consist of a random treatment indicator, a zero treatment effect, and 10 potential control variables, of which 4 actually enter the outcome. I use "stepwise" to pick the control variables for an ATE regression with controls. My expectation would be that if I repeat the process 2,000 times and store the p-values for the average treatment effect, I would end up with a grossly non-uniformly distributed set of p-values.
    Code:
    set seed 1
    cap mat drop p
    set matsize 2000
    forvalues x = 1/2000 {
        clear
        //generate data with no treatment effect
        qui set obs 100
        gen treatment = runiform()>.5
        forvalues i = 1/10 {
            gen x`i'=rnormal()
        }
        gen y = x1 + 1/3*x2 + 1/9*x3 + 1/27*x4 + rnormal()
        //use stepwise to pick controls to be used in the ATE regression
        qui stepwise, pr(0.1) lockterm1: reg y treatment x*
        qui testparm treatment
        mat p=nullmat(p)\r(p)
        di "." _cont
    }
    clear
    svmat p
    hist p1, bin(5)
    The resulting histogram looks as if the p-values are nicely and uniformly distributed between 0 and 1 (maybe with a tiny kink in the higher ranges). This is in spite of the rather small sample of only 100 observations.
    [Attached image "Unbenannt.png": histogram of the stored p-values]

    Is there something in my data-generating process that is responsible for this? Does 'real' experimental data come with features that make these issues more pronounced? How could I adapt my simulation to illustrate the issues further, while retaining random treatment assignment?
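
    For readers without Stata, here is a rough Python translation of the same exercise. This is only a sketch: `backward_select` is my hand-rolled stand-in for `stepwise, pr(0.1) lockterm1:`, the replication count is reduced for speed, and all names are my own.

```python
import numpy as np
from scipy import stats

def ols_pvalues(y, X):
    """Two-sided t-test p-values for each column of X (X includes a constant)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return 2 * stats.t.sf(np.abs(beta / se), df=n - k)

def backward_select(y, D, locked, pr=0.10):
    """Drop the least significant non-locked column until all survivors have p < pr."""
    cols = list(range(D.shape[1]))
    while True:
        X = np.column_stack([np.ones(len(y)), D[:, cols]])
        p = ols_pvalues(y, X)
        cand = [(p[i + 1], c) for i, c in enumerate(cols) if c not in locked]
        if not cand:
            return cols, p
        worst_p, worst_c = max(cand)
        if worst_p < pr:
            return cols, p
        cols.remove(worst_c)

rng = np.random.default_rng(1)
n, reps, pvals = 100, 200, []
for _ in range(reps):
    treat = (rng.uniform(size=n) > 0.5).astype(float)
    X = rng.standard_normal((n, 10))
    # no treatment effect; only x1-x4 enter the outcome, as in the Stata code
    y = X[:, 0] + X[:, 1] / 3 + X[:, 2] / 9 + X[:, 3] / 27 + rng.standard_normal(n)
    D = np.column_stack([treat, X])  # treatment is column 0, locked into the model
    cols, p = backward_select(y, D, locked={0})
    pvals.append(p[cols.index(0) + 1])  # p-value on the (null) treatment effect

print(round(float(np.mean(np.array(pvals) < 0.05)), 3))  # empirical size at the 5% level
```

    Plotting `pvals` as a histogram should reproduce the roughly uniform picture described above.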

  • #2
    Generally, the outcome of step-wise procedures reflects both the pattern of associations with the outcome variable and the associations among the right side variables. For the data you have constructed, if I understand your code, the right side X variables will be uncorrelated in expectation and X4-X10 are constructed to be uncorrelated with the outcome. This is not the kind of data that one usually sees when step-wise methods are applied.

    Here is an example of a more common case: the outcome is days of hospitalization due to a specific illness, say asthma, and you have data on a set of comorbidities. The comorbidities are correlated, e.g. diabetes and high blood pressure, and you want to find a best fitting model. What you would find is that the p values on the various comorbidities vary depending on what else is in the model and small changes in the correlation structure lead to large changes in the pattern of results. Your simulated data are rather different.
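
    That dependence on the correlation structure can be illustrated with a small sketch (the variable names, the 0.9 correlation, and the coefficients are my own invented choices): when two regressors are highly correlated and both matter, the p-value on the first changes dramatically depending on whether the second is in the model.

```python
import numpy as np
from scipy import stats

def ols_pvalues(y, X):
    """Two-sided t-test p-values for each column of X (X includes a constant)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return 2 * stats.t.sf(np.abs(beta / se), df=n - k)

rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.standard_normal(n)  # corr(x1, x2) ~ 0.9
y = 0.3 * x1 + 0.3 * x2 + rng.standard_normal(n)              # both regressors matter

p_alone = ols_pvalues(y, np.column_stack([np.ones(n), x1]))[1]       # x2 omitted
p_joint = ols_pvalues(y, np.column_stack([np.ones(n), x1, x2]))[1]   # x2 included
print(f"p(x1) alone: {p_alone:.2g}, with x2 in the model: {p_joint:.2g}")
```

    Alone, x1 picks up x2's effect and looks overwhelmingly significant; with x2 included, its standard error is inflated by the collinearity and the evidence is far weaker.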

    Aside from that, the usual textbook discussion of this problem says that significance tests "bounce around" depending on what is in a particular model. In your case, the p-values for a treatment effect that you constructed to be zero are all over the place depending on what else is in the model, which is what I would expect to see.
    Richard T. Campbell
    Emeritus Professor of Biostatistics and Sociology
    University of Illinois at Chicago



    • #3
      Thanks Dick.

      Yes, you interpreted my code correctly. I also rewrote the process to introduce some correlation structure among the Xs, à la:
      Code:
      // draw x's to be correlated with some arbitrary structure in blocks of vars x1-x4, x5-x8, x9/x11-x13, x14-x17 (x10 is skipped)
          matrix C = ( 1, 0.3, 0.1, 0.006, 1, 0.6, -0.7, 1, -0.1, 1 )
          drawnorm x1 x2 x3 x4, corr(C) cstorage(upper)
          drawnorm x5 x6 x7 x8, corr(C) cstorage(upper)
          drawnorm x9 x11 x12 x13, corr(C) cstorage(upper)
          drawnorm x14 x15 x16 x17, corr(C) cstorage(upper)
          //from each block, one variable matters for y
          gen y = x1 + 1/3*x6 + 1/9*x12 + 1/27*x17 + rnormal()
      This resulted in a very similar picture.

      Re your last point:
      I constructed the treatment effect to be zero because I want to assess the size of the ATE test. Judging by my results so far, ATE tests based on stepwise procedures (eliminating only controls) seem to be (almost) correctly sized. This would imply that a researcher with very limited knowledge of which variables to control for might be better off using a stepwise procedure than relying on his/her imprecise priors.



      • #4
        Dear Simon,

        If I understand you correctly, you are doing backward stepwise. This is closely related to the general-to-specific modeling approach often used in Econometrics (AKA the LSE or Hendry's approach).

        The main difference is that when using the LSE approach one looks not only at p-values but also at the substantive meaning of the variables being dropped. Moreover, at each step one should check that the model assumptions remain valid (for example, dropping a variable can create serial correlation and invalidate the usual standard errors). At the end of the process one should also test that the final model is a valid simplification of the starting model by performing a joint test of the significance of the variables that were excluded.
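
        That final check is the standard F-test comparing the residual sums of squares of the restricted (final) and full (starting) models; in Stata one would refit the full model and apply testparm to the excluded variables. A minimal sketch of the arithmetic, with invented data and a hypothetical `keep` set standing in for the columns a selection step retained:

```python
import numpy as np
from scipy import stats

def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

rng = np.random.default_rng(2)
n = 150
X_full = np.column_stack([np.ones(n), rng.standard_normal((n, 6))])
y = X_full[:, 1] + rng.standard_normal(n)  # only the first regressor matters

keep = [0, 1, 2]                    # constant plus what the search retained
q = X_full.shape[1] - len(keep)     # number of exclusions being tested
rss_r, rss_f = rss(y, X_full[:, keep]), rss(y, X_full)
F = ((rss_r - rss_f) / q) / (rss_f / (n - X_full.shape[1]))
p = stats.f.sf(F, q, n - X_full.shape[1])
print(f"F = {F:.2f}, p = {p:.2f}")  # a large p-value would support the exclusion
```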

        A standard stepwise procedure does not account for these things and therefore may lead to invalid or uninteresting results. In your case these problems do not exist, so all is fine; this does not imply that backward stepwise is generally fine.

        Finally, notice that forward stepwise is a very different approach and is not generally recommended.

        All the best,

        Joao

