  • Choose independent variables based on their correlation with the dependent variable

    Hi

    I am trying to regress a y variable on a large number of x variables, but I only want to include those x variables whose correlation with y is above or below a certain value.
    The data is reported weekly.
    I have 474 x variables in total and have computed each variable's rolling correlation with the y variable over the previous 25 weeks. However, I cannot find a way to link the correlations to the variables when I run the regression. I even tried using dummy variables.
    Is this somehow possible?

    Code:
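    use "Google.dta"
    gen fakedate=_n
    tsset fakedate
    merge m:m  fakedate using "Index.dta"
    
    // For each word, compute the rolling correlation with OMXC20 and flag it
    forvalues i=1/439{
        gen co`i'=.
        gen cohej`i'=.
        forvalues j=27/747{
            // window covers weeks j-26 through j-2 (25 observations)
            corr OMXC20 Word`i' if fakedate<`j'-1 & fakedate>=`j'-26
            replace co`i'=r(rho) in `j'
            // flag = 1 when the correlation lies outside [-0.2, 0.2]
            replace cohej`i'=1 if co`i' < -0.2 | co`i' > 0.2 in `j'
            replace cohej`i'=. if co`i'==.
        }
        *drop co`i'
    }
    
    // Attempted regression on the flagged variables only
    forvalues j=27/747{
        reg OMXC20 Word1-Word439 if fakedate<`j'-1 & fakedate>=`j'-26 & co`i'==1
    }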
    Last edited by Matilde Biil; 22 May 2018, 08:57.

  • #2
    Matilde:
    welcome to this forum.
    Some comments about your query:
    - do not use -m:m- option to merge your datasets, as the results could be catastrophic (see -merge- entry in Stata .pdf manual);
    - choosing predictors on the grounds of a given correlation cut-off/threshold with the dependent variable can be seriously misleading, because what is true in your sample might be different in repeated samples drawn from the same population;
    - the usual recommendation is to give the truest and fairest representation of the data generating process.
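
As a rough illustration of the first point, a one-to-one merge on the week identifier might look like the sketch below. It assumes that fakedate already exists in "Index.dta" (as the -merge- call in post #1 implies) and that it uniquely identifies a week in both files; neither assumption is confirmed in the thread.

```stata
* Sketch only: assumes fakedate is a unique weekly key in both datasets
use "Google.dta", clear
gen fakedate = _n
isid fakedate                          // errors out if fakedate is not unique

* 1:1 refuses to run if the key is duplicated in either file,
* whereas m:m would silently pair rows by their order within the key
merge 1:1 fakedate using "Index.dta"
tab _merge                             // inspect weeks that failed to match

tsset fakedate
```

If -merge 1:1- stops with an error about the key not uniquely identifying observations, that is a sign the two files do not line up week by week and need to be inspected before any regression is run.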
    Kind regards,
    Carlo
    (Stata 18.0 SE)



    • #3
      Thank you Carlo.
      It does not really matter in my case that it can be misleading.
      I am just trying to drop some of the variables before running my regressions, since 439 variables is far too many to include.



      • #4
        You don't seem to have taken Carlo's comment seriously enough. By dropping variables based on their association with the dependent variable, you're essentially data mining (in the old, bad sense). You might as well move to stepwise regression. If you're going to use this strictly for prediction and can check the predictive accuracy on a hold-out sample, this might be a good strategy, but for most work that tries to understand what influences y, it is generally seen as a very poor strategy.
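
For the prediction-only case, one possible way to tie the correlations to the regression is to build the variable list in a local macro for each window, fit on that window, and predict the week that was left out. The sketch below reuses the names from post #1 (OMXC20, Word1-Word439, fakedate) together with the poster's 25-week window and 0.2 cut-off; yhat and the macro names keep, lo, and hi are hypothetical. It is an untested outline, not a recommended modelling strategy.

```stata
* Sketch only: names from post #1; yhat, keep, lo, hi are illustrative
gen double yhat = .
forvalues j = 27/747 {
    local lo = `j' - 26
    local hi = `j' - 2
    local keep                          // empty the list for this window

    * collect the words whose window correlation with y exceeds the cut-off
    forvalues i = 1/439 {
        capture quietly corr OMXC20 Word`i' if inrange(fakedate, `lo', `hi')
        if _rc == 0 & r(rho) < . {
            if abs(r(rho)) > 0.2 local keep `keep' Word`i'
        }
    }

    * fit on the window, then predict week j, which was not used in the fit
    if "`keep'" != "" {
        capture quietly regress OMXC20 `keep' if inrange(fakedate, `lo', `hi')
        if _rc == 0 {
            tempvar p
            quietly predict double `p' if fakedate == `j'
            quietly replace yhat = `p' if fakedate == `j'
            drop `p'
        }
    }
}
```

Comparing yhat with OMXC20 over weeks 27 to 747 then gives a hold-out check of predictive accuracy. Note that with only 25 observations per window, a long list of selected words can exhaust the degrees of freedom, in which case -regress- will omit some of them.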

