Replace missing observations with the average of non-missing observations

Alberto Alvarez

Join Date: Jul 2016
Posts: 195

Replace missing observations with the average of non-missing observations

12 Jan 2019, 09:28

Dear Stata Users,

Please, help me to resolve the following issue with reduced sample of data below:

Below is the dataset, where “ID” – is company id, “project_id” – is an identifier of the project; “share” – is the percentage of shares of that company; “leader” – is equal to 1 for a leader of the purchase (the are might be more than one leaders); “participant_num” – is total number of participants. What I want is to impute missing “share” from the same “project id” structure that has non-missing “share”. For example: “project_id” = 19 has 5 participants with one leader. The same structure has “project_id” 16820 and 16000. So, on average the leader in 16820 and 16000 has shares of 35.7 ([50+21.4]/2) and all the rest participant have 16.0875 ([(21.5+14.3+28.6+14.3)/2] + [(10+10+10+20)/2]) / 2. This numers should be replaced for “project_id” = 19 (i.e. "ID" =
7827 should equal to 35.7 and all the rest of the missing "shares" of “project_id” = 19 should equal to 16.0875 . Please, help me with this issue.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(project_id ID share) float(participant_num leader)
   19  6003    . 5 .
   19  6039    . 5 .
   19  5849    . 5 .
   19  7827    . 5 1
   19  5952    . 5 .
   98  5893   33 3 .
   98  5862   50 3 1
   98  6835   17 3 .
   99  6835   17 3 .
   99  5893   33 3 .
   99  5862   50 3 1
  100  5862    . 3 1
  100  6835    . 3 .
  100  5893    . 3 .
  311  5831 66.7 2 .
  311  6162 33.3 2 1
  433  6140    . 2 1
  433  5950    . 2 .
  437  5862 85.7 2 1
  437  6168 14.3 2 .
  590  6123   30 4 .
  590  7878   30 4 .
  590  7839   20 4 .
  590  5914   20 4 .
  627  6003    . 4 .
  627  5891    . 4 .
  627  5893    . 4 .
  627  7910    . 4 .
  628  6003   25 4 .
  628  5893   25 4 .
  628  7910   25 4 .
  628  5891   25 4 .
  735  6036   15 7 .
  735  6139   10 7 .
  735  5952   10 7 .
  735  6033   25 7 1
  735  5978   15 7 .
  735  6075   15 7 .
  735  5905   10 7 .
  850  5889   25 7 1
  850  6075 12.5 7 .
  850  6135 12.5 7 .
  850  6014 12.5 7 .
  850  5952 12.5 7 .
  850  5893 12.5 7 .
  850  6123 12.5 7 .
  907  6059    . 7 .
  907  5891    . 7 1
  907  6033    . 7 .
  907  6082    . 7 .
  907  6036    . 7 .
  907  5982    . 7 .
  907  5831    . 7 .
 1116  6059   20 7 1
 1116  5980    8 7 .
 1116  6838    8 7 .
 1116  5914   24 7 1
 1116  2446   16 7 .
 1116  6442    8 7 .
 1116  6836   16 7 .
 1117  5980    . 7 .
 1117  2446    . 7 .
 1117  5914    . 7 1
 1117  6836    . 7 .
 1117  6442    . 7 .
 1117  6059    . 7 .
 1117  6838    . 7 .
16000  7847   10 5 .
16000  7852   10 5 .
16000 13497   20 5 .
16000  5862   10 5 .
16000  7855   50 5 1
16819  7847 28.6 5 1
16819 13497 21.5 5 .
16819  7852 14.3 5 .
16819  5862 21.4 5 1
16819  7855 14.3 5 .
16820  7852 14.3 5 .
16820  5862 21.4 5 1
16820  7855 14.3 5 .
16820  7847 28.6 5 .
16820 13497 21.5 5 .
end

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30179

12 Jan 2019, 11:18

Your explanation is a bit self-contradictory. The expression ([(21.5+14.3+28.6+14.3)/2] + [(10+10+10+20)/2]) / 2 does not make much sense. Why divide sums of four numbers by 2? Also the value of that expression is 32.175, is not the 16.0875 that you then refer to. The number 16.0875 is actually the value of ([(21.5+14.3+28.6+14.3)/4] + [(10+10+10+20)/4]) / 2, which I think makes more sense. So I'm going to assume you meant this latter expression instead.

Code:

//    CHANGE LEADER TO A 1/0, RATHER THAN 1/.
//    VARIABLE FOR BETTER COMPATIBILITY WITH STATA
replace leader = 0 if missing(leader)
by project_id, sort: egen leader_count = sum(leader)

//    CREATE A DATA SET OF STRUCTURE-BASED AVERAGES
//    STRUCTURE IS DEFINED BY NUMBER OF PARTICIPANTS
//    AND NUMBER OF LEADERS
preserve
//    CALCULATE AVERAGE SHARES FOR LEADERS AND NON-LEADERS
//    WITHIN EACH STRUCTURE
collapse (mean) share, by(participant_num leader_count leader)
tempfile structures
save `structures'

//    NOW UPDATE THE ORIGINAL DATA FROM THIS
restore
merge m:1 participant_num leader_count leader using `structures', update

Announcement

Replace missing observations with the average of non-missing observations

Comment