Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • replacing missing values with mean in time series data

    I have time series data containing 4 years (2019-2022) as shown below. Each observation represents a specific month. I do not have data for the year 2022 for the variable and I would like to replace this missing data with the mean of this variable for the previous years for which data are available. Please help. (I a aware about imputation, but in my case, the replacement with the mean is ok)

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str8 id int(t_workload_month year_data_po2)
    "C001F001" 1107 2019
    "C001F001"  991 2019
    "C001F001"  825 2019
    "C001F001"  707 2019
    "C001F001" 1111 2019
    "C001F001" 1421 2019
    "C001F001" 1411 2019
    "C001F001"    0 2019
    "C001F001" 1006 2019
    "C001F001" 1034 2019
    "C001F001" 1289 2019
    "C001F001" 1317 2019
    "C001F001" 2065 2020
    "C001F001" 1594 2020
    "C001F001" 1453 2020
    "C001F001"  592 2020
    "C001F001" 1039 2020
    "C001F001" 1679 2020
    "C001F001" 1853 2020
    "C001F001" 1898 2020
    "C001F001" 1700 2020
    "C001F001" 1597 2020
    "C001F001" 1276 2020
    "C001F001"  600 2020
    "C001F001"   26 2021
    "C001F001"  203 2021
    "C001F001" 1501 2021
    "C001F001"  791 2021
    "C001F001" 1087 2021
    "C001F001" 1258 2021
    "C001F001" 1241 2021
    "C001F001" 1261 2021
    "C001F001" 1345 2021
    "C001F001" 1665 2021
    "C001F001" 1988 2021
    "C001F001" 1727 2021
    "C001F001" 1320 2022
    "C001F001"  978 2022
    "C001F001"  675 2022
    "C001F001"    . 2022
    "C001F001"    . 2022
    "C001F001"    . 2022
    end

  • #2
    This is a dangerously organized data set. The variable t_workload_month clearly varies from one month to the next. Most reasonable imputation schemes here would impute the mean value for the same month in previous years. But there is no variable showing which month is which. I'm going to assume that the data are currently arranged in chronological order, so the first observation for a given year is January, and the second observation is February, etc. But this is a risky assumption to be making, as in the course of the data management creating this data set, the sort order may have been disturbed. Anyway, to assure no further disturbance occurs, the first step is to create a month variable. After that it's straightforward.

    Code:
    gen `c(obs_t)' obs_no = _n // RECORD CURRENET SORT ORDER
    
    //    CREATE A MONTH VARIABLE
    by id year (obs_no), sort: gen month = _n
    gen mdate = ym(year, month)
    format mdate %tm
    assert !missing(mdate)
    
    //    DO THE IMPUTATION
    by id month, sort: egen mean_t_workload = mean(t_workload_month)
    gen imputed_t_workload_month = cond(missing(t_workload_month), mean_t_workload, ///
        t_workload_month)
    isid id mdate, sort

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      This is a dangerously organized data set. The variable t_workload_month clearly varies from one month to the next. Most reasonable imputation schemes here would impute the mean value for the same month in previous years. But there is no variable showing which month is which. I'm going to assume that the data are currently arranged in chronological order, so the first observation for a given year is January, and the second observation is February, etc. But this is a risky assumption to be making, as in the course of the data management creating this data set, the sort order may have been disturbed. Anyway, to assure no further disturbance occurs, the first step is to create a month variable. After that it's straightforward.

      Code:
      gen `c(obs_t)' obs_no = _n // RECORD CURRENET SORT ORDER
      
      // CREATE A MONTH VARIABLE
      by id year (obs_no), sort: gen month = _n
      gen mdate = ym(year, month)
      format mdate %tm
      assert !missing(mdate)
      
      // DO THE IMPUTATION
      by id month, sort: egen mean_t_workload = mean(t_workload_month)
      gen imputed_t_workload_month = cond(missing(t_workload_month), mean_t_workload, ///
      t_workload_month)
      isid id mdate, sort
      Thank you Clyde, this works really well. Is there a way to round off the imputed values to the nearest whole number?

      Comment


      • #4
        Code:
        replace imputed_t_workload_month = round(imputed_t_workload_month, 1)

        Comment

        Working...
        X