  • How many independent variables is too many?

    Just a general question, as my understanding of multiple comparisons is in overdrive right now. Why would a person ever want to run an analysis with, say, 300, 400, or 500 independent variables? Isn't this fishing? How would one interpret such results, and how would one adjust the p-values?

  • #2
    May:
    Setting aside the cases in which tons of independent variables are filtered via -stepwise- procedures (which, in turn, have relevant drawbacks; see https://www.stata.com/support/faqs/s...sion-problems/), I'm not aware of data-generating processes that need hundreds of variables to be investigated.
    In addition, the higher the number of predictors, the lower the chances of disseminating your results successfully.
    That said, my guess is that 20 predictors is already a substantial number to consider, provided that your sample size is large enough to support an informative regression (which does not imply that all the coefficients will be statistically significant).
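    As a rough sketch of the kind of -stepwise- filtering I have in mind (backward selection on the toy auto dataset; illustrative only, and subject to the drawbacks discussed in the FAQ linked above):
    * backward selection: drop predictors whose p-value is >= 0.20 (illustration only)
    sysuse auto, clear
    stepwise, pr(0.20): regress price mpg weight length turn displacement headroom trunk
    Whatever survives such a filter should not be mistaken for the data-generating process.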
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Well, this could depend on the situation; for example, in an adversarial situation (e.g., litigation), one (or both) sides may think they need to cover all possibly relevant variables within the main analysis. In fact, when I was doing litigation support work, I once came upon an econometrician who regularly used more than 100 independent variables, and it was for just the reason I gave above (personally, I think that this is not the best way to handle this, but ...).

      • #4
        Uhhh, if you're doing weird SEM models this can happen. Also, if you're doing synthetic controls, under the hood the data are reshaped so that your units of analysis become the columns and time becomes the rows. After a little regularization you'll ideally have a sparse solution, since the regression weights are positive and sum to 1. In those instances you can have hundreds or thousands of predictors, yet the LASSO or PCA/convex procedure would never permit all of them to be used.

        In short, it happens, but they're never all used; a quick sketch of what I mean is below.
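        Here is the sketch of the sparsity point, assuming Stata 16+ and purely simulated data (the variable names and coefficients are made up for illustration):
        * 500 candidate predictors, only 2 of which truly matter; lasso should keep a sparse subset
        clear
        set obs 200
        set seed 12345
        forvalues j = 1/500 {
            generate x`j' = rnormal()
        }
        generate y = 2*x1 - 1.5*x2 + rnormal()
        lasso linear y x1-x500
        lassocoef        // lists only the handful of predictors actually selected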

        • #5
          Thank you all for your input; this is very useful. I am in basic science research and do a lot of benchwork, and once in a while a colleague will look through a whole database and run an analysis using one outcome and every predictor in the database. I understand that handling very large numbers of independent variables may be essential in some fields, but I wonder whether that will ever be the case in basic research. Also, shouldn't the analysis be driven by the study aims? I don't understand the need to analyze everything just to see what is associated with your outcome. This seems like an example of how not to conduct evidence-based research. Even if you apply a multiple-comparisons correction such as Bonferroni or Sidak (which control the family-wise error rate based on the number of comparisons), your corrected significance threshold will likely be very small.
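          To put rough numbers on that last point, here is a small sketch assuming independent tests at a nominal alpha of 0.05:
          * per-test significance thresholds after Bonferroni and Sidak corrections
          local alpha 0.05
          foreach k of numlist 10 50 100 300 500 {
              display "k = `k':  Bonferroni threshold = " %9.7f (`alpha'/`k') ///
                  ",  Sidak threshold = " %9.7f (1 - (1 - `alpha')^(1/`k'))
          }
          With several hundred comparisons the per-test threshold drops well below 0.001, so only very strong effects would survive.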

          • #6
            Also, shouldn't the analysis be driven by the study aims? ... This seems like an example of how not to conduct evidence-based research....
            If you are trying to draw conclusions, analysis should always be driven by study aims, which should be set out prospectively. In basic research, you are often doing randomized controlled experiments, so there is usually no need to introduce additional variables into your analyses. In clinical and population-based epidemiologic research, however, we are often working only with observational data and need to adjust our models for extraneous variation due to other variables. That said, it should not be done indiscriminately. Judea Pearl's framework for understanding causal inference makes it quite straightforward to determine which variables to include in the analysis: always condition on confounders, never condition on colliders. Occasionally one can justify adding a variable which is neither a confounder nor a collider but which accounts for appreciable outcome variance--this may improve the precision of estimates. But in these situations, you are not supposed to be doing hypothesis tests or effect estimation on every variable in the model: the tests and estimations are still to be confined to those variables that are prospectively considered potential causes of the outcome(s).
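            As a minimal simulated sketch of the collider point (x and y are generated independently, and c is caused by both):
            * conditioning on a collider manufactures an association that is not really there
            clear
            set obs 10000
            set seed 2022
            generate x = rnormal()
            generate y = rnormal()
            generate c = x + y + rnormal()
            regress y x      // coefficient on x is near zero, as it should be
            regress y x c    // "adjusting" for the collider c biases the coefficient on x away from zero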

            Now, a completely different matter is when you are entering unexplored domains and have little or nothing to guide you in drawing a causal model (directed acyclic graph). When you are doing that, it may be appropriate to try many different variables and combinations of variables in many different models. But when doing that, you must always present your findings as exploratory analysis, and no conclusions are justified from such analysis until they are independently examined in a different study prospectively designed to replicate just those findings. To present exploratory results as conclusions is misleading, at least, and probably should be considered scientific misconduct.

            Why do people engage in this kind of misconduct? Well, the incentives for doing it are strong. Everyone is under pressure to publish--to keep their jobs and to get funding for their research. Many journals shun studies with "negative" findings. And some journals have a preference for findings that are "surprising." Well, the easiest way to get a "significant" and surprising "finding" is to p-hack your way through some data set that has enough variables and observations to enable you to run a large number of haphazardly selected analyses with different variable combinations and subsets of observations. Of course, such findings are preponderantly Type I errors. But the system does little to tamp down this kind of thing.
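            To see how easy this is, here is a toy sketch: the outcome and all 200 candidate predictors are independent random draws, so every nominally "significant" result below is a Type I error.
            * haphazardly testing 200 pure-noise predictors, one at a time, at the 0.05 level
            clear
            set obs 100
            set seed 42
            generate y = rnormal()
            local hits 0
            forvalues j = 1/200 {
                generate x = rnormal()
                quietly regress y x
                if 2*ttail(e(df_r), abs(_b[x]/_se[x])) < 0.05 local ++hits
                drop x
            }
            display "Nominally significant predictors out of 200: `hits'"
            On average, about 5 percent of such tests come out "significant" by chance alone.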
            Last edited by Clyde Schechter; 29 Nov 2022, 13:26.

            • #7
              Hi Clyde, you bring up a great point about research misconduct and also about the systems that perpetuate it. I hope I never fall into this hole. I am very fortunate to have access to the wisdom of the many statisticians on this platform, who are the best at what they do. I have brought these concerns up to a professor before and was told that everyone does it. It made me feel as though I was taught how to conduct research properly, yet it is not even practiced by those who teach it.
