Calculating average of variables from different years

Giuseppe Canzonieri

Join Date: Mar 2017

Posts: 14
#1

Calculating average of variables from different years

26 Nov 2020, 03:32

I would like to generate a variable equal to the simple average of some other variables. Problem is, I have an unbalanced dataset, like this one:

Code:

input str2 country int year str4 varName double value "AA" 2008 "varA" 4 "AA" 2009 "varA" . "AA" 2010 "varA" . "AA" 2008 "varB" 6.5 "AA" 2009 "varB" 8 "AA" 2010 "varB" 4.7 "AA" 2008 "varC" 3.2 "AA" 2009 "varC" 5.4 "AA" 2010 "varC" . "BB" 2008 "varA" 7 "BB" 2009 "varA" 8.2 "BB" 2010 "varA" . "BB" 2008 "varB" 7.3 "BB" 2009 "varB" 0 "BB" 2010 "varB" 9 "BB" 2008 "varC" 6 "BB" 2009 "varC" 5 "BB" 2010 "varC" . end

What I would like to achieve is the following:
1) calculate a simple average, by country, of varA, varB and varC taking the value from the latest year available for each variable. So it should take the average of the value of varA and varC in 2009 and the value of varB in 2010.

2) Moreover, note that country AA has its last observation for varA in 2008. However, the last observation
in the sample
for varA is 2009. Therefore, for country AA I would like to remove varA from the calculation of the average, so that for country AA we only take the average of varB in 2010 and varC in 2009.

In other words, the new variable should be calculated as:

for country AA: (4.7 + 5.4)/2 = 5.05
for country BB: (8.2 + 9 + 5)/3 = 7.4

In the dataset, it should
appear like this:

Code:

input str2 country int year str4 varName double value
"AA" 2010 "aver" 5.05 "BB" 2010 "aver" 7.4
end

i.e. it's enough to insert it just for the last year in the sample.

Point 1) is crucial to me. If implementing point 2) is too cumbersome, I could live with just 1).
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35720

26 Nov 2020, 05:01

Here is a way to do 1). How to do 2) is left as an exercise. A bigger deal than either is that this kind of layout -- which I've often seen in world development databases -- is, from Stata's point of view, perverse, and your desired outcome would make it worse. So, I suggest a reshape wide

Code:

clear 
input str2 country int year str4 varName double value
"AA" 2008 "varA" 4
"AA" 2009 "varA" .
"AA" 2010 "varA" .
"AA" 2008 "varB" 6.5
"AA" 2009 "varB" 8
"AA" 2010 "varB" 4.7
"AA" 2008 "varC" 3.2
"AA" 2009 "varC" 5.4
"AA" 2010 "varC" .
"BB" 2008 "varA" 7
"BB" 2009 "varA" 8.2
"BB" 2010 "varA" .
"BB" 2008 "varB" 7.3
"BB" 2009 "varB" 0
"BB" 2010 "varB" 9
"BB" 2008 "varC" 6
"BB" 2009 "varC" 5
"BB" 2010 "varC" .
end

reshape wide value, i(country year) j(varName) string 
rename value* * 

gen sum = 0
gen count = 0
gen work = .

quietly foreach v of var var* {
    replace work = `v'
    bysort country (year) : replace work = work[_n-1] if missing(work)
    bysort country (year) : replace sum = sum + work[_N] if work[_N] < .
    by country : replace count = count + !missing(work[_N])
}

gen mean = sum / count

list, sepby(country)

Comment

Giuseppe Canzonieri

Join Date: Mar 2017

Posts: 14
#3

26 Nov 2020, 07:03

Thank you very much Nick. I also have a "wide" version of this dataset where your code will be very useful. However I also need to have one version with this unusual structure, as I then need to export this data to Excel and work with pivot tables and the like in there.
Comment

Announcement

Calculating average of variables from different years

Comment

Comment