replacing missing values with mean in time series data

Mary Atieno

Join Date: Jul 2021
Posts: 38

replacing missing values with mean in time series data

23 Jun 2023, 05:21

I have time series data containing 4 years (2019-2022) as shown below. Each observation represents a specific month. I do not have data for the year 2022 for the variable and I would like to replace this missing data with the mean of this variable for the previous years for which data are available. Please help. (I a aware about imputation, but in my case, the replacement with the mean is ok)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 id int(t_workload_month year_data_po2)
"C001F001" 1107 2019
"C001F001"  991 2019
"C001F001"  825 2019
"C001F001"  707 2019
"C001F001" 1111 2019
"C001F001" 1421 2019
"C001F001" 1411 2019
"C001F001"    0 2019
"C001F001" 1006 2019
"C001F001" 1034 2019
"C001F001" 1289 2019
"C001F001" 1317 2019
"C001F001" 2065 2020
"C001F001" 1594 2020
"C001F001" 1453 2020
"C001F001"  592 2020
"C001F001" 1039 2020
"C001F001" 1679 2020
"C001F001" 1853 2020
"C001F001" 1898 2020
"C001F001" 1700 2020
"C001F001" 1597 2020
"C001F001" 1276 2020
"C001F001"  600 2020
"C001F001"   26 2021
"C001F001"  203 2021
"C001F001" 1501 2021
"C001F001"  791 2021
"C001F001" 1087 2021
"C001F001" 1258 2021
"C001F001" 1241 2021
"C001F001" 1261 2021
"C001F001" 1345 2021
"C001F001" 1665 2021
"C001F001" 1988 2021
"C001F001" 1727 2021
"C001F001" 1320 2022
"C001F001"  978 2022
"C001F001"  675 2022
"C001F001"    . 2022
"C001F001"    . 2022
"C001F001"    . 2022
end

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#2

23 Jun 2023, 08:59

This is a dangerously organized data set. The variable t_workload_month clearly varies from one month to the next. Most reasonable imputation schemes here would impute the mean value for the same month in previous years. But there is no variable showing which month is which. I'm going to assume that the data are currently arranged in chronological order, so the first observation for a given year is January, and the second observation is February, etc. But this is a risky assumption to be making, as in the course of the data management creating this data set, the sort order may have been disturbed. Anyway, to assure no further disturbance occurs, the first step is to create a month variable. After that it's straightforward.

Code:

gen `c(obs_t)' obs_no = _n // RECORD CURRENET SORT ORDER // CREATE A MONTH VARIABLE by id year (obs_no), sort: gen month = _n gen mdate = ym(year, month) format mdate %tm assert !missing(mdate) // DO THE IMPUTATION by id month, sort: egen mean_t_workload = mean(t_workload_month) gen imputed_t_workload_month = cond(missing(t_workload_month), mean_t_workload, /// t_workload_month) isid id mdate, sort
Comment
Mary Atieno

Join Date: Jul 2021

Posts: 38
#3

23 Jun 2023, 11:31

Originally posted by Clyde Schechter View Post

This is a dangerously organized data set. The variable t_workload_month clearly varies from one month to the next. Most reasonable imputation schemes here would impute the mean value for the same month in previous years. But there is no variable showing which month is which. I'm going to assume that the data are currently arranged in chronological order, so the first observation for a given year is January, and the second observation is February, etc. But this is a risky assumption to be making, as in the course of the data management creating this data set, the sort order may have been disturbed. Anyway, to assure no further disturbance occurs, the first step is to create a month variable. After that it's straightforward.

Code:

gen `c(obs_t)' obs_no = _n // RECORD CURRENET SORT ORDER // CREATE A MONTH VARIABLE by id year (obs_no), sort: gen month = _n gen mdate = ym(year, month) format mdate %tm assert !missing(mdate) // DO THE IMPUTATION by id month, sort: egen mean_t_workload = mean(t_workload_month) gen imputed_t_workload_month = cond(missing(t_workload_month), mean_t_workload, /// t_workload_month) isid id mdate, sort

Thank you Clyde, this works really well. Is there a way to round off the imputed values to the nearest whole number?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#4

23 Jun 2023, 11:33

Code:

replace imputed_t_workload_month = round(imputed_t_workload_month, 1)
Comment

Announcement

replacing missing values with mean in time series data

Comment

Comment

Comment