Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculating average of variables from different years


    I would like to generate a variable equal to the simple average of some other variables. Problem is, I have an unbalanced dataset, like this one:

    Code:
    input str2 country int year str4 varName double value
    "AA" 2008 "varA" 4
    "AA" 2009 "varA" .
    "AA" 2010 "varA" .
    "AA" 2008 "varB" 6.5
    "AA" 2009 "varB" 8
    "AA" 2010 "varB" 4.7
    "AA" 2008 "varC" 3.2
    "AA" 2009 "varC" 5.4
    "AA" 2010 "varC" .
    "BB" 2008 "varA" 7
    "BB" 2009 "varA" 8.2
    "BB" 2010 "varA" .
    "BB" 2008 "varB" 7.3
    "BB" 2009 "varB" 0
    "BB" 2010 "varB" 9
    "BB" 2008 "varC" 6
    "BB" 2009 "varC" 5
    "BB" 2010 "varC" .
    end

    What I would like to achieve is the following:
    1) calculate a simple average, by country, of varA, varB and varC taking the value from the latest year available for each variable. So it should take the average of the value of varA and varC in 2009 and the value of varB in 2010.

    2) Moreover, note that country AA has its last observation for varA in 2008. However, the last observation
    in the sample
    for varA is 2009. Therefore, for country AA I would like to remove varA from the calculation of the average, so that for country AA we only take the average of varB in 2010 and varC in 2009.

    In other words, the new variable should be calculated as:

    for country AA: (4.7 + 5.4)/2 = 5.05
    for country BB: (8.2 + 9 + 5)/3 = 7.4

    In the dataset, it should
    appear like this:
    Code:
    input str2 country int year str4 varName double value
    "AA" 2010 "aver" 5.05 "BB" 2010 "aver" 7.4
    end
    i.e. it's enough to insert it just for the last year in the sample.

    Point 1) is crucial to me. If implementing point 2) is too cumbersome, I could live with just 1).





  • #2
    Here is a way to do 1). How to do 2) is left as an exercise. A bigger deal than either is that this kind of layout -- which I've often seen in world development databases -- is, from Stata's point of view, perverse, and your desired outcome would make it worse. So, I suggest a reshape wide


    Code:
    clear 
    input str2 country int year str4 varName double value
    "AA" 2008 "varA" 4
    "AA" 2009 "varA" .
    "AA" 2010 "varA" .
    "AA" 2008 "varB" 6.5
    "AA" 2009 "varB" 8
    "AA" 2010 "varB" 4.7
    "AA" 2008 "varC" 3.2
    "AA" 2009 "varC" 5.4
    "AA" 2010 "varC" .
    "BB" 2008 "varA" 7
    "BB" 2009 "varA" 8.2
    "BB" 2010 "varA" .
    "BB" 2008 "varB" 7.3
    "BB" 2009 "varB" 0
    "BB" 2010 "varB" 9
    "BB" 2008 "varC" 6
    "BB" 2009 "varC" 5
    "BB" 2010 "varC" .
    end
    
    reshape wide value, i(country year) j(varName) string 
    rename value* * 
    
    gen sum = 0
    gen count = 0
    gen work = .
    
    quietly foreach v of var var* {
        replace work = `v'
        bysort country (year) : replace work = work[_n-1] if missing(work)
        bysort country (year) : replace sum = sum + work[_N] if work[_N] < .
        by country : replace count = count + !missing(work[_N])
    }
    
    gen mean = sum / count
    
    list, sepby(country)

    Comment


    • #3
      Thank you very much Nick. I also have a "wide" version of this dataset where your code will be very useful. However I also need to have one version with this unusual structure, as I then need to export this data to Excel and work with pivot tables and the like in there.

      Comment

      Working...
      X