  • Way to drop all variables that are highly correlated from a dataset?

    Dear All,

    I have a huge dataset with a few hundred variables, each with 10000 observations. A few of these variables are highly correlated, which causes problems when I try to analyse them with a logistic regression. So far I have not been able to figure out how to loop over all the elements of the returned matrix and drop the affected variables.

    With
    Code:
    ds x1 x2 x3, not
    local all `r(varlist)'
    I can get all the variables that need to be tested.
    From the returned results it should be possible to drop the affected variables, structured somehow like this:
    Code:
    correlate `all'
    if r(rho)>0.2 **drop variable**
    I would be very thankful if any of you could point me to a solution to this problem. I know that this approach does not have a lot of friends in the statistical community, but my professor insists that I use it.

    Maxwell
    (a new and really confused Stata user)


  • #2
    If several variables are highly correlated, you may not need to use them all. But there is absolutely no need to drop even some of those variables. You ignore what you don't want to use.

    In any case the criteria here are highly vague. Suppose two variables are correlated at 0.9. Which should be dropped? That's just the simplest of several objections to this idea.

    I won't suggest code for misguided approaches.

    Last edited by Nick Cox; 12 Jun 2016, 11:01.


    • #3
      Well, the plan was to specify a really low cutoff value for the correlation, say 0.45, and only use the variables that meet this criterion. But I see no way to just "ignore" the variables I won't look at, so dropping them seems to be the only solution. From looking at the correlation table, it would probably drop around 15 out of 350 variables. I could go through it manually, but I would have to do it all over again whenever I get new data.

      In case you wonder, the data in question are sensor readings that really should not be correlated; at least I know of no effect that would affect a pair of them at the same time. Thanks for your reply anyway.

      Maxwell
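
      For reference, the rule described in this post could be sketched in Stata roughly as follows. This is only an illustration under assumptions that are not stated in the thread: the outcome is called y (a hypothetical name), the excluded variables x1 x2 x3 are taken from #1, the cutoff of 0.45 is applied to the absolute correlation, and of each offending pair the variable that comes later in the list is set aside. That last choice is arbitrary, which is exactly the ambiguity raised in #2. Nothing is dropped; the variables are simply left out of the model's varlist.

      ```stata
      * Sketch only. Assumes the outcome variable is named y (hypothetical)
      * and that all remaining variables are numeric.
      ds y x1 x2 x3, not
      local all `r(varlist)'
      local cutoff 0.45

      local keep
      local skip
      foreach v of local all {
          local bad 0
          * compare v only with the variables kept so far
          foreach k of local keep {
              quietly correlate `v' `k'
              * note: a missing r(rho) (e.g. a constant variable) also
              * counts as above the cutoff, because missing > any number
              if abs(r(rho)) > `cutoff' {
                  local bad 1
                  continue, break
              }
          }
          if `bad' {
              local skip `skip' `v'
          }
          else {
              local keep `keep' `v'
          }
      }

      display "left out: `skip'"
      logit y `keep'
      ```

      Because the result depends on the order of the variables in the list, rerunning this on new data with a different variable order may set aside a different member of each pair.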


      • #4
        Sorry, but you don't answer my question. Given a criterion [NB] of a correlation of 0.45, what precisely would you do?

        Puzzlingly high correlations are unusual but typically good news. Just wanting to ignore what doesn't make obvious sense is poor practice in data analysis and science generally.
        Last edited by Nick Cox; 12 Jun 2016, 11:24.
