  • Exclude variables with all-missing values

    Dear Statalist community,

    I have an unbalanced panel dataset at the month-day-hour level that spans around 10 years. Among my 20 independent variables, some are missing for an entire year (e.g., in 2011, one independent variable is missing for every month-day-hour), and some are missing for only a couple of month-day-hours within a year (e.g., in 2011, one independent variable is missing only on Jan 1st at 10am).

    I want to run a separate regression for each year, using only the independent variables that are not missing for the entire year. That is, I will exclude an independent variable if it is missing for the entire year. However, if it is missing for only a couple of month-day-hours within the year, I will replace those missing values with 0 and include it.

    My questions are:
    1. How can I find the variables that are not entirely missing in a given year?
    2. How can I save these variables to a list so that the regression uses only these variables?


    What I have in mind:
    Code:
    levelsof year, local(myyear)

    * loop through each year and run one regression
    foreach i in `myyear' {

        * select variables that are not entirely missing in year `i'
        DO NOT KNOW WHAT TO DO HERE

        * for the variables that have only some missing values, replace missing with 0
        foreach var of local varlist {
            replace `var' = 0 if missing(`var') & year == `i'
        }

        * run regression
        reg dep independents if year == `i'
    }
    Thank you very much for your help!
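
    The step marked as unknown above could be sketched as follows. This is a minimal sketch on made-up data, not a recommended analysis: the variable names dep, x1, and x2 and the simulated values are assumptions for illustration only. For each year it counts the nonmissing observations of each candidate regressor and collects the usable ones in a local macro (here called keep), which is then passed to the regression.

```stata
* Toy data; dep, x1, x2 are hypothetical names
clear
set seed 12345
set obs 200
generate year = cond(_n <= 100, 2011, 2012)
generate dep = rnormal()
generate x1 = rnormal()
generate x2 = rnormal()
* x1 is missing for all of 2011; x2 has a single missing value in 2011
replace x1 = . if year == 2011
replace x2 = . in 5

levelsof year, local(myyear)
foreach y of local myyear {
    * start each year with an empty list of usable regressors
    local keep
    foreach v of varlist x1 x2 {
        * keep a variable only if it has at least one nonmissing value this year
        quietly count if !missing(`v') & year == `y'
        if r(N) > 0 local keep `keep' `v'
    }
    * zero-fill the remaining gaps, as described in the post
    foreach v of local keep {
        quietly replace `v' = 0 if missing(`v') & year == `y'
    }
    display "year `y': regressors = `keep'"
    regress dep `keep' if year == `y'
}
```

    With real data, the inner varlist would be replaced by the 20 independent variables. Note that the zero-fill changes the data in memory, so it may be safer to work on a copy of the dataset.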

  • #2
    Yifei:
    I would not recommend this approach: you would be cherry-picking your data and ending up with results that do not represent your original sample. Replacing missing values with zero is highly questionable too, since the missing value could have been any value within the bounds of your variable, and zero-filling would make its variance collapse.
    Moreover, running year-specific OLS means ignoring the panel structure of your dataset.
    In addition, Stata uses listwise deletion: observations with a missing value in any of the variables are simply dropped from the estimation. I would therefore first run my panel data regression without doing anything about the missing values.
    Then I would consider -ipolate-, provided that I understand the mechanism underlying the missingness of my data.
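
    As a rough illustration of the two steps above, here is a minimal sketch on simulated data. The names id, t, dep, x1, and x2 are assumptions, the integer time index t stands in for the hourly time variable, and the fixed-effects estimator is just one possible choice of panel model.

```stata
* Toy panel: 3 units, 100 periods each; all names are hypothetical
clear
set seed 2024
set obs 300
generate id = ceil(_n/100)
bysort id: generate t = _n
generate x1 = rnormal()
generate x2 = rnormal()
generate dep = 1 + 0.5*x1 - 0.3*x2 + rnormal()
* punch a few interior holes into x2 (4 per unit)
replace x2 = . if mod(t, 25) == 10

xtset id t

* Step 1: listwise deletion; rows with missing x2 are dropped automatically
xtreg dep x1 x2, fe
display "observations used: " e(N)

* Step 2: linear interpolation within each unit, then re-estimate
bysort id (t): ipolate x2 t, generate(x2_ip)
xtreg dep x1 x2_ip, fe
display "observations used: " e(N)
```

    -ipolate- interpolates linearly between neighbouring nonmissing values and, without the epolate option, leaves leading and trailing gaps missing. Whether linear interpolation is sensible depends on why the values are missing.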

    Kind regards,
    Carlo
    (Stata 19.0)
