Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Create a new variable that computes mean of corresponding values based on three closest observations of another variable by year and region

    Hello All,

    I have been getting stuck on computing the average of the variable score based on the three closest observations for the variable rating. For example, the variable rating for the first observation in the year 2016 and region 1 is 6.3507. The three closest observations for this observation based on the variable rating are 7.2844, 3.9329 and 3.3063. The new variable score1 that I have been trying to compute should be the average of their corresponding values for the variable score: 0.18, 0.27 and 0.15, which equals 0.2. It is extremely complex and time-consuming to solve it using Excel. I tried different ways using Stata. However, no success so far and really getting frustrated.

    Any help is much appreciated!

    Code:
    *To install: 
    ssc install dataex
    clear 
    input int year int region float rating float score
    2016    1    6.3507    0.22
    2016    1    7.2844    0.18
    2016    1    3.3063    0.15
    2016    1    1.9716    0.39
    2016    1    3.9329    0.27
    2016    2    2.3123    0.51
    2016    2    2.3670    0.42
    2016    2    6.7519    0.41
    2016    2    1.2646    0.36
    2016    2    1.2286    0.58
    2015    1    1.1943    0.98
    2015    1    2.4667    0.67
    2015    1    8.9587    0.17
    2015    1    7.1866    0.33
    2015    1    6.6857    0.42
    2015    2    3.6258    0.72
    2015    2    7.7948    0.66
    2015    2    1.6079    0.6
    2015    2    2.4596    0.55
    2015    2    3.0391    0.43
    2014    1    5.1069    0.59
    2014    1    8.5398    0.52
    2014    1    5.3647    0.36
    2014    1    3.7527    0.39
    2014    1    3.4095    0.09
    2014    2    1.9004    0.65
    2014    2    2.7130    0.60
    2014    2    8.4072    0.50
    2014    2    2.9089    0.52
    2014    2    1.0119    0.53
    
    
    end
    Many thanks,
    David

  • #2
    Code:
    //    PAIR UP EVERY OBSERVATION WITH ALL OBSERVATIONS THAT
    //    AGREE ON REGION AND YEAR
    gen long obs_no = _n
    tempfile copy
    save `copy'
    rename (obs_no rating score) orig_=
    joinby region year using `copy'
    
    //    FOREACH ORIGINAL OBSERVATION, ORDER ALL OF THE
    //    PAIRED OBSERVATIONS BASED ON CLOSENESS IN RATING
    //    AND KEEP THE THREE CLOSEEST
    gen delta = abs(rating - orig_rating)
    drop if obs_no == orig_obs_no // DON'T PAIR WITH SELF
    by orig_obs_no (delta), sort: keep if _n <= 3
    
    //    NOW CALCULATE THE MEAN SCORE FOR THOSE
    //    AND REDUCE TO ORIGINAL OBSERVATIONS
    collapse (first) year region orig_rating orig_score ///
        (mean) mean_of_3_scores = score, by(orig_obs_no)
    rename orig_* *
    Note: Your question is ill-posed in one respect. It is possible that there will be ties for third place in closeness on rating, but that those observations tied on rating differ on score. The above code will break these ties in a random and irreproducible way, meaning that you would then get different results each time you ran this code. If you are confident that there are no such ties, or if you don't care about this issue, then the above code is what you want. If, however, this is an issue for you, post back describing how you would systematically break those ties and I'll try to modify the above accordingly.

    Comment


    • #3
      Hi Clyde,

      Thank you so much for the prompt response! Your code worked perfectly as usual! Yes, I do have a couple of follow-up questions. First, I forgot to mention that there should be an organization ID (orgid). The panel data is uniquely identified by year and orgid. How should the code be modified so the variable orgid is kept? Second, if I want to compute the two closest observations as a new variable, in addition to the three. How should the code be modified so that the two new variables are there after the code is run? With regard to the ties, the rule is that if there is a tie (e.g., two with the same values), then the two will be counted as two, rather than one; similarly, if the tie includes 3 same values, then the three closest observations will be counted as three rather than one.

      Code:
      *To install: 
      ssc install dataex
      clear 
      input int year str6 orgid int region float rating float score
      2016    A    1    6.3507    0.22
      2016    B    1    7.2844    0.18
      2016    C    1    3.3063    0.15
      2016    D    1    1.9716    0.39
      2016    E    1    3.9329    0.27
      2016    AA    2    2.3123    0.51
      2016    BB    2    2.367    0.42
      2016    CC    2    6.7519    0.41
      2016    DD    2    1.2646    0.36
      2016    EE    2    1.2286    0.58
      2015    A    1    1.1943    0.98
      2015    B    1    2.4667    0.67
      2015    C    1    8.9587    0.17
      2015    D    1    7.1866    0.33
      2015    E    1    6.6857    0.42
      2015    AA    2    3.6258    0.72
      2015    BB    2    7.7948    0.66
      2015    CC    2    1.6079    0.6
      2015    DD    2    2.4596    0.55
      2015    EE    2    3.0391    0.43
      2014    A    1    5.1069    0.59
      2014    B    1    8.5398    0.52
      2014    C    1    5.3647    0.36
      2014    D    1    3.7527    0.39
      2014    E    1    3.4095    0.09
      2014    AA    2    1.9004    0.65
      2014    BB    2    2.713    0.6
      2014    CC    2    8.4072    0.5
      2014    DD    2    2.9089    0.52
      2014    EE    2    1.0119    0.53
      
      
      end
      Many thanks,
      David

      Comment


      • #4
        I believe this does it:

        Code:
        clear 
        input int year str6 orgid int region float rating float score
        2016    A    1    6.3507    0.22
        2016    B    1    7.2844    0.18
        2016    C    1    3.3063    0.15
        2016    D    1    1.9716    0.39
        2016    E    1    3.9329    0.27
        2016    AA    2    2.3123    0.51
        2016    BB    2    2.367    0.42
        2016    CC    2    6.7519    0.41
        2016    DD    2    1.2646    0.36
        2016    EE    2    1.2286    0.58
        2015    A    1    1.1943    0.98
        2015    B    1    2.4667    0.67
        2015    C    1    8.9587    0.17
        2015    D    1    7.1866    0.33
        2015    E    1    6.6857    0.42
        2015    AA    2    3.6258    0.72
        2015    BB    2    7.7948    0.66
        2015    CC    2    1.6079    0.6
        2015    DD    2    2.4596    0.55
        2015    EE    2    3.0391    0.43
        2014    A    1    5.1069    0.59
        2014    B    1    8.5398    0.52
        2014    C    1    5.3647    0.36
        2014    D    1    3.7527    0.39
        2014    E    1    3.4095    0.09
        2014    AA    2    1.9004    0.65
        2014    BB    2    2.713    0.6
        2014    CC    2    8.4072    0.5
        2014    DD    2    2.9089    0.52
        2014    EE    2    1.0119    0.53
        end
        
        
        //    PAIR UP EVERY OBSERVATION WITH ALL OBSERVATIONS THAT
        //    AGREE ON REGION AND YEAR
        gen long obs_no = _n
        tempfile copy
        save `copy'
        rename (obs_no  rating score orgid) orig_=
        joinby region year using `copy'
        
        //    FOREACH ORIGINAL OBSERVATION, ORDER ALL OF THE
        //    PAIRED OBSERVATIONS BASED ON CLOSENESS IN RATING
        //    AND KEEP THE THREE CLOSEEST
        gen delta = abs(rating - orig_rating)
        drop if obs_no == orig_obs_no // DON'T PAIR WITH SELF
        by orig_obs_no, sort: egen rank = rank(delta), field
        keep if rank <= 3
        
        //    NOW CALCULATE THE MEAN SCORE FOR THOSE
        by orig_obs_no, sort: egen mean_of_3 = mean(score)
        by orig_obs_no: egen mean_of_2 = mean(cond(rank <= 2, score, .))
        by orig_obs_no: keep if _n == 1
        keep year region orig_* mean_of_2 mean_of_3
        rename orig_* *

        Comment


        • #5
          Thank you very much, Clyde! The code was near perfect! I went through it a few times. I think the only change that needs to be made to the code is to change the field to track because the field keeps the most distant distances whereas track keeps the closest ones.

          Many thanks,
          David

          Comment


          • #6
            Sorry, I don't use -egen, rank()- very often, so my memory of track and field is shaky. But using track will not do what you want, because it handles ties incorrectly for your purpose. So what you actually need to do is:

            Code:
            by orig_obs_no, sort: egen rank = rank(-delta), field
            Now you will have the tie-handling of field (which you want) and the ranking of the closest matches at lowest-numbered ranks.

            Comment


            • #7
              Thank you, Clyde! It makes sense. This line of code worked perfectly!

              Comment

              Working...
              X