Create a new variable that computes mean of corresponding values based on three closest observations of another variable by year and region

David Wang

Join Date: Jul 2018

Posts: 30
#1

Create a new variable that computes mean of corresponding values based on three closest observations of another variable by year and region

04 Jun 2019, 16:32

Hello All,

I have been getting stuck on computing the average of the variable score based on the three closest observations for the variable rating. For example, the variable rating for the first observation in the year 2016 and region 1 is 6.3507. The three closest observations for this observation based on the variable rating are 7.2844, 3.9329 and 3.3063. The new variable score1 that I have been trying to compute should be the average of their corresponding values for the variable score: 0.18, 0.27 and 0.15, which equals 0.2. It is extremely complex and time-consuming to solve it using Excel. I tried different ways using Stata. However, no success so far and really getting frustrated.

Any help is much appreciated!

Code:

*To install: ssc install dataex clear input int year int region float rating float score 2016 1 6.3507 0.22 2016 1 7.2844 0.18 2016 1 3.3063 0.15 2016 1 1.9716 0.39 2016 1 3.9329 0.27 2016 2 2.3123 0.51 2016 2 2.3670 0.42 2016 2 6.7519 0.41 2016 2 1.2646 0.36 2016 2 1.2286 0.58 2015 1 1.1943 0.98 2015 1 2.4667 0.67 2015 1 8.9587 0.17 2015 1 7.1866 0.33 2015 1 6.6857 0.42 2015 2 3.6258 0.72 2015 2 7.7948 0.66 2015 2 1.6079 0.6 2015 2 2.4596 0.55 2015 2 3.0391 0.43 2014 1 5.1069 0.59 2014 1 8.5398 0.52 2014 1 5.3647 0.36 2014 1 3.7527 0.39 2014 1 3.4095 0.09 2014 2 1.9004 0.65 2014 2 2.7130 0.60 2014 2 8.4072 0.50 2014 2 2.9089 0.52 2014 2 1.0119 0.53 end

Many thanks,
David
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#2

04 Jun 2019, 16:55

Code:

// PAIR UP EVERY OBSERVATION WITH ALL OBSERVATIONS THAT // AGREE ON REGION AND YEAR gen long obs_no = _n tempfile copy save `copy' rename (obs_no rating score) orig_= joinby region year using `copy' // FOREACH ORIGINAL OBSERVATION, ORDER ALL OF THE // PAIRED OBSERVATIONS BASED ON CLOSENESS IN RATING // AND KEEP THE THREE CLOSEEST gen delta = abs(rating - orig_rating) drop if obs_no == orig_obs_no // DON'T PAIR WITH SELF by orig_obs_no (delta), sort: keep if _n <= 3 // NOW CALCULATE THE MEAN SCORE FOR THOSE // AND REDUCE TO ORIGINAL OBSERVATIONS collapse (first) year region orig_rating orig_score /// (mean) mean_of_3_scores = score, by(orig_obs_no) rename orig_* *

Note: Your question is ill-posed in one respect. It is possible that there will be ties for third place in closeness on rating, but that those observations tied on rating differ on score. The above code will break these ties in a random and irreproducible way, meaning that you would then get different results each time you ran this code. If you are confident that there are no such ties, or if you don't care about this issue, then the above code is what you want. If, however, this is an issue for you, post back describing how you would systematically break those ties and I'll try to modify the above accordingly.
2 likes
Comment

David Wang

Join Date: Jul 2018
Posts: 30

04 Jun 2019, 20:03

Hi Clyde,

Thank you so much for the prompt response! Your code worked perfectly as usual! Yes, I do have a couple of follow-up questions. First, I forgot to mention that there should be an organization ID (orgid). The panel data is uniquely identified by year and orgid. How should the code be modified so the variable orgid is kept? Second, if I want to compute the two closest observations as a new variable, in addition to the three. How should the code be modified so that the two new variables are there after the code is run? With regard to the ties, the rule is that if there is a tie (e.g., two with the same values), then the two will be counted as two, rather than one; similarly, if the tie includes 3 same values, then the three closest observations will be counted as three rather than one.

Code:

*To install: 
ssc install dataex
clear 
input int year str6 orgid int region float rating float score
2016    A    1    6.3507    0.22
2016    B    1    7.2844    0.18
2016    C    1    3.3063    0.15
2016    D    1    1.9716    0.39
2016    E    1    3.9329    0.27
2016    AA    2    2.3123    0.51
2016    BB    2    2.367    0.42
2016    CC    2    6.7519    0.41
2016    DD    2    1.2646    0.36
2016    EE    2    1.2286    0.58
2015    A    1    1.1943    0.98
2015    B    1    2.4667    0.67
2015    C    1    8.9587    0.17
2015    D    1    7.1866    0.33
2015    E    1    6.6857    0.42
2015    AA    2    3.6258    0.72
2015    BB    2    7.7948    0.66
2015    CC    2    1.6079    0.6
2015    DD    2    2.4596    0.55
2015    EE    2    3.0391    0.43
2014    A    1    5.1069    0.59
2014    B    1    8.5398    0.52
2014    C    1    5.3647    0.36
2014    D    1    3.7527    0.39
2014    E    1    3.4095    0.09
2014    AA    2    1.9004    0.65
2014    BB    2    2.713    0.6
2014    CC    2    8.4072    0.5
2014    DD    2    2.9089    0.52
2014    EE    2    1.0119    0.53


end

Many thanks,
David

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30063

04 Jun 2019, 22:05

I believe this does it:

Code:

clear 
input int year str6 orgid int region float rating float score
2016    A    1    6.3507    0.22
2016    B    1    7.2844    0.18
2016    C    1    3.3063    0.15
2016    D    1    1.9716    0.39
2016    E    1    3.9329    0.27
2016    AA    2    2.3123    0.51
2016    BB    2    2.367    0.42
2016    CC    2    6.7519    0.41
2016    DD    2    1.2646    0.36
2016    EE    2    1.2286    0.58
2015    A    1    1.1943    0.98
2015    B    1    2.4667    0.67
2015    C    1    8.9587    0.17
2015    D    1    7.1866    0.33
2015    E    1    6.6857    0.42
2015    AA    2    3.6258    0.72
2015    BB    2    7.7948    0.66
2015    CC    2    1.6079    0.6
2015    DD    2    2.4596    0.55
2015    EE    2    3.0391    0.43
2014    A    1    5.1069    0.59
2014    B    1    8.5398    0.52
2014    C    1    5.3647    0.36
2014    D    1    3.7527    0.39
2014    E    1    3.4095    0.09
2014    AA    2    1.9004    0.65
2014    BB    2    2.713    0.6
2014    CC    2    8.4072    0.5
2014    DD    2    2.9089    0.52
2014    EE    2    1.0119    0.53
end


//    PAIR UP EVERY OBSERVATION WITH ALL OBSERVATIONS THAT
//    AGREE ON REGION AND YEAR
gen long obs_no = _n
tempfile copy
save `copy'
rename (obs_no  rating score orgid) orig_=
joinby region year using `copy'

//    FOREACH ORIGINAL OBSERVATION, ORDER ALL OF THE
//    PAIRED OBSERVATIONS BASED ON CLOSENESS IN RATING
//    AND KEEP THE THREE CLOSEEST
gen delta = abs(rating - orig_rating)
drop if obs_no == orig_obs_no // DON'T PAIR WITH SELF
by orig_obs_no, sort: egen rank = rank(delta), field
keep if rank <= 3

//    NOW CALCULATE THE MEAN SCORE FOR THOSE
by orig_obs_no, sort: egen mean_of_3 = mean(score)
by orig_obs_no: egen mean_of_2 = mean(cond(rank <= 2, score, .))
by orig_obs_no: keep if _n == 1
keep year region orig_* mean_of_2 mean_of_3
rename orig_* *

Comment

David Wang

Join Date: Jul 2018

Posts: 30
#5

05 Jun 2019, 13:41

Thank you very much, Clyde! The code was near perfect! I went through it a few times. I think the only change that needs to be made to the code is to change the field to track because the field keeps the most distant distances whereas track keeps the closest ones.

Many thanks,
David
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#6

05 Jun 2019, 14:27

Sorry, I don't use -egen, rank()- very often, so my memory of track and field is shaky. But using track will not do what you want, because it handles ties incorrectly for your purpose. So what you actually need to do is:

Code:

by orig_obs_no, sort: egen rank = rank(-delta), field

Now you will have the tie-handling of field (which you want) and the ranking of the closest matches at lowest-numbered ranks.
1 like
Comment
David Wang

Join Date: Jul 2018

Posts: 30
#7

05 Jun 2019, 15:30

Thank you, Clyde! It makes sense. This line of code worked perfectly!
Comment

Announcement