Filling-in variable's missing values with the average of few variable's observations

Mina Wu

Join Date: Jul 2015

Posts: 79
#1

Filling-in variable's missing values with the average of few variable's observations

28 Jul 2015, 05:06

HI

I have the dataset counting 161 observations. The dataset looks as following:

ID_2 NAME_2 innovation
31462 Region 1 .3636364
31463 Region 2 .25
31464 Region 3 .4
31465 Region 4 .1428571
31466 Region 5 NA
31467 Region 6 0
31468 Region 7 0
31469 Region 8 .3333333
31470 Region 9 .4
31471 Region 10 1
31472 Region 11 NA
.....

I would like to replace missing values of varible innovation (NA) with the average value of few observations of variable innovation. Hypothetically, Region 5 has missing value of innovaiton. I would like to replace this missing value with the average innovation of Region 3, 7 and 10 (which are neigbouring regions of region 5).

How can I do that?

Thank you!

Last edited by Mina Wu; 28 Jul 2015, 05:10.
Tags: None
ben earnhart

Join Date: May 2014

Posts: 1027
#2

28 Jul 2015, 07:43

Given your circumstances, such that there is no way to tell from the ID variables or the region variables which cases are neighbors, probably the best you can do is to:

Code:

replace innovation=(.4+0+1)/3

which will be somewhat tedious,but with only 161 regions, not too bad.
1 like
Comment

Jorge Eduardo Perez Perez

Join Date: Mar 2014
Posts: 429

28 Jul 2015, 08:32

If you had a variable in your dataset to identify the neighbours of a given region, you could automate this. For example:

Code:

clear
input ID_2 str10 NAME_2 str3 innovation
31462 "Region 1" .3636364
31463 "Region 2" .25
31464 "Region 3" .4
31465 "Region 4" .1428571
31466 "Region 5" NA
31467 "Region 6" 0
31468 "Region 7" 0
31469 "Region 8" .3333333
31470 "Region 9" .4
31471 "Region 10" 1
31472 "Region 11" NA
end

* Get the region numbers
destring NAME_2, ignore("Region ") gen(region)
* Destring innovation
destring innovation, ignore("NA") replace
* Generate a neighbour variable
gen neighbour5=inlist(region,3,7,10)
* Hypothetically, generate some neighbours for region 11 as well
gen neighbour11=inlist(region,2,4,6)

* Get the regions
levelsof region, loc(regions)
* Loop over the regions
foreach r in `regions' {
    * Check if there are missing values of innovation for given region
    capture assert innovation!=. if region==`r'
    * Replace missings with neighbour average
    if _rc {
        sum innovation if neighbour`r', meanonly
        replace innovation=r(mean) if mi(innovation)
    }
}

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com

Comment

Mina Wu

Join Date: Jul 2015

Posts: 79
#4

29 Jul 2015, 02:22

ben earnhart and Jorge Eduardo Perez Perez thank you for your responses!

Jorge Eduardo Perez Perez , I tried applying your suggestions. My innovation variable was destringed already. I create neigbours with no difficulty. Here are the commands for generating all the neigbours.

gen neighbour4=inlist(region_sp,9,5,2)
gen neighbour10=inlist(region_sp,8,6,7,110,104)
gen neighbour11=inlist(region_sp,7,5,2,0,106)
gen neighbour40=inlist(region_sp,25,26,43,42,44,24)
gen neighbour41=inlist(region_sp,45,98,94,96,43)
gen neighbour44=inlist(region_sp,24,25,40,42,43)
gen neighbour55=inlist(region_sp,51,57,135,136)
gen neighbour62=inlist(region_sp,136,60,63,19)
gen neighbour74=inlist(region_sp,77,76,78,73,69,75)
gen neighbour91=inlist(region_sp,95,97,94)
gen neighbour109=inlist(region_sp,146,107,116,115,117)
gen neighbour126=inlist(region_sp,130,129,128,131,125)

I get the regions, as suggested. However, the loop does not work-I still have missing values(12 missing observations).. I do not get any reported error in Stata 13, when I run the loop- just this..

foreach r in `regions' {
2. capture assert innovation!=. if region==`r'
3. if _rc {
4. sum innovation if neighbour`r', meanonly
5. replace innovation=r(mean) if mi( innovation)
6. }
7. }

Do you know if there is any solution?

Thank you so much!
Comment

Jorge Eduardo Perez Perez

Join Date: Mar 2014
Posts: 429

29 Jul 2015, 02:28

In the loop, "region" should be "region_sp"

Code:

levelsof region_sp, loc(regions)
* Loop over the regions
foreach r in `regions' {
    * Check if there are missing values of innovation for given region
    capture assert innovation!=. if region_sp==`r'
    * Replace missings with neighbour average
    if _rc {
        sum innovation if neighbour`r', meanonly
        replace innovation=r(mean) if mi(innovation)
    }
}

Let me know if you have any further problems after this.

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com

Comment

Mina Wu

Join Date: Jul 2015

Posts: 79
#6

29 Jul 2015, 03:43

Jorge Eduardo Perez Perez thanks again! it still does not work..

Here are the details of what I do:

*** neighbours

gen neighbour4=inlist(region_sp,9,5,2)
gen neighbour10=inlist(region_sp,8,6,7,110,104)
gen neighbour11=inlist(region_sp,7,5,2,0,106)
gen neighbour40=inlist(region_sp,25,26,43,42,44,24)
gen neighbour41=inlist(region_sp,45,98,94,96,43)
gen neighbour44=inlist(region_sp,24,25,40,42,43)
gen neighbour55=inlist(region_sp,51,57,135,136)
gen neighbour62=inlist(region_sp,136,60,63,19)
gen neighbour74=inlist(region_sp,77,76,78,73,69,75)
gen neighbour91=inlist(region_sp,95,97,94)
gen neighbour109=inlist(region_sp,146,107,116,115,117)
gen neighbour126=inlist(region_sp,130,129,128,131,125)

** get the regions

levelsof region_sp, loc(regions)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 5
> 9 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
> 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110
> 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
> 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 15
> 2 153 154 155 156 157 158 159 160

region_sp takes values 0-160..

*loop
foreach r in `regions' {
* Check if there are missing values of innovation for given region
capture assert innovation!=. if region_sp==`r'
* Replace missings with neighbour average
if _rc {
sum innovation if neighbour`r', meanonly
replace innovation=r(mean) if mi(innovation)
}
}

Stata 13 reports the loop, but nothing happens to the missing values of variable innovation?

Thank you again very much!
Comment
Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#7

29 Jul 2015, 03:51

Are you running the code by pieces? If so, you need to run the levelsof command and the loop together, otherwise the local `regions' will not be recognized by foreach and nothing will happen.

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com
Comment
Mina Wu

Join Date: Jul 2015

Posts: 79
#8

29 Jul 2015, 04:31

Jorge Eduardo Perez Perez it worked now! This was more than helpful!! Many thanks!!
Comment

Mina Wu

Join Date: Jul 2015
Posts: 79

01 Sep 2016, 12:11

Hi Jorge Eduardo Perez Perez and other who could maybe help.

I had a freedom to ask a related question. I have a panel dataset. Below are three variables firm Id (1-3), year (2007-2012), and the variable X.

id	year	X
1	2007	0
1	2008	0
1	2009	0
1	2010	0
1	2011	0
1	2012	30
1	2013	30
1	2014	30
2	2007	0
2	2008	0
2	2009	50
2	2010	0
2	2011	50
2	2012	50
2	2013	50
2	2014	50
3	2007	66.66
3	2008	66.66
3	2009	66.66
3	2010	0
3	2011	0
3	2012	0
3	2013	0
3	2014	0

I would like to "correct" for the value of X in the year 2010 in the following manner: if the value of x in 2010 is zero, but the value of x in 2011 and 2009 are positive values-then i would like to replace that value zero in 2010 with the average of the value in 2009 and 2011. that would give me value 50 for firm 2. HOWEVER i do not want to make any changes to those zeros of variable X in 2010, if the value in 2007, 2008 and 2009 are zeros (firm 1) and if observations in 2011, 2012, 2013 and 2014 are 0 too(firm 3).

Similarly, i have firms with zero value of x for two/three consecutive years, but certain positive value of x after and prior these zeros, expressed in decimals (firm 4, table below). Is it possible to make a loop function/algorithm which would input the values prior and after the zero values into the cell with zero values, conditional that those value are the same and expressed in decimal. That means that I would NOT like to make any changes to the consecutive zero values if the years prior and after missing are the same, but expressed as round number (firm 5).

id	year	X
4	2007	25.31
4	2008	25.31
4	2009	25.31
4	2010	25.31
4	2011	0
4	2012	0
4	2013	25.31
4	2014	25.31
5	2007	25.31
5	2008	10
5	2009	10
5	2010	0
5	2011	0
5	2012	10
5	2013	10
5	2014	10

To sum it all up, the first loop/algorithm should refer to the year 2010. but the second, should refer to any two/three consecutive years having zeros, while prior/later values of, EXPRESSED IN DECIMALS are identical. Thank you so much in advance!

Announcement