help with calculating the means for long form data

Karen Arulsamy

Join Date: Feb 2021
Posts: 78

help with calculating the means for long form data

11 Jun 2022, 02:48

Hi,

I'm working with long form data which contains information on financial outcomes (income, total wealth and financial wealth) for individuals when they are 75 to 80 years old. Most individuals are observed at least 2x during this period. I would like to calculate the means of these outcomes for each individual based on how many times they are observed in the data and use available data if there is missing data (so I don't lose all observations if at least one observation is missing per individual per age).

I tried this bysort idauniq: egen mean_inc = mean(totinc_main_z) but something odd is going and it might be because of missing observations although I'm not sure. For example, below, There is one observation for ID 100074 but the mean doesn't match but it matches for ID 100059.

Click image for larger version

Name: Capture.JPG
Views: 1
Size: 42.3 KB
ID: 1668763

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long idauniq byte timeperiod float totinc_main_z
100059 1 15060.684
100074 5 28076.156
103713 1 28268.996
103716 4 13396.002
103716 6     14446
103735 1 14148.883
103742 3 10623.372
103742 5 11547.483
103742 1 10288.215
103745 4  8303.543
103745 6      7695
103769 4 22106.467
103769 6 22694.467
103769 2   20616.8
103771 3     11388
103771 2     10972
103771 5      5356
103784 5     10916
103784 3      6606
103799 4      6604
103799 6      7072
103799 2      6084
103807 2      7942
103807 6  9824.485
103807 4     10647
103819 5 17073.887
103819 3 16348.684
103819 1  15326.37
103858 4      9932
103881 3     18458
103881 5  23039.67
103881 1  30421.67
103885 2  12687.31
103892 5  29782.16
103892 3 14868.387
103892 1  6767.619
103913 2  8163.268
103913 6     12485
103913 4     10930
103961 2 12854.076
103961 4 11408.326
103961 6 10978.244
103997 6 18112.082
103997 4  16577.62
104008 4 18715.166
104008 6  20535.08
104008 2 18058.775
104010 4 12726.148
104010 6 37818.293
104010 2  32405.73
104014 2  19559.33
104017 1 31547.953
104017 5  22600.83
104017 3  23206.26
104033 1 28838.895
104033 3   51693.4
104033 5 30657.824
104034 1 14854.358
104034 5     22074
104034 3 13889.146
104037 2 17483.955
104037 4 14044.006
104053 1 11025.682
104053 5  9298.152
104053 3 11221.432
104065 3      4316
104065 5      7450
104079 3  9143.234
104079 5  9956.428
104079 1  9969.695
104083 2 10075.413
104119 2  20639.88
104119 4  31236.86
104119 6   35228.1
104123 1  9811.217
104139 3      7803
104139 5 15812.103
104146 2  13284.16
104159 5 28792.264
104159 3  35081.67
104159 1   22786.6
104165 2     10401
104168 3     10727
104168 1      5305
104170 3 16082.013
104180 3  11583.87
104180 1  11401.75
104195 1  15733.76
104199 2 15317.443
104199 4      9360
104226 4      9091
104226 6  57623.44
104226 2  8908.379
104231 6  12545.57
104237 3  23340.32
104237 1  30862.78
104237 4  25030.24
104280 4  21309.22
104280 6 15175.072
104285 4     16757
end

Thanks in advance!

Many thanks
Karen

Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 36058

11 Jun 2022, 03:35

Your code looks fine to me and I can't reproduce a problem. I would expect small discrepancies if the result of averaging a double was put into a float and in some other cases. A discrepancy like that you report for 100074 should have some other explanation, like other values for 100074 that are out of sequence somehow.

You don't have any missing values in your data example and if you did they would just be ignored. If otherwise "missing data" means data that might have been in the dataset but aren't, they can't have any impact. (I encourage the term "absent data" for that circumstance.)

Code:

. bysort idauniq: egen mean_inc = mean(totinc_main_z) 

. by idauniq: egen count = count(totinc_main_z)

. 
. list if count == 1 

     +--------------------------------------------------+
     | idauniq   timepe~d   totinc~z   mean_inc   count |
     |--------------------------------------------------|
  1. |  100059          1   15060.68   15060.68       1 |
  2. |  100074          5   28076.16   28076.16       1 |
  3. |  103713          1      28269      28269       1 |
  6. |  103735          1   14148.88   14148.88       1 |
 29. |  103858          4       9932       9932       1 |
     |--------------------------------------------------|
 33. |  103885          2   12687.31   12687.31       1 |
 51. |  104014          2   19559.33   19559.33       1 |
 71. |  104083          2   10075.41   10075.41       1 |
 75. |  104123          1   9811.217   9811.217       1 |
 78. |  104146          2   13284.16   13284.16       1 |
     |--------------------------------------------------|
 82. |  104165          2      10401      10401       1 |
 85. |  104170          3   16082.01   16082.01       1 |
 88. |  104195          1   15733.76   15733.76       1 |
 94. |  104231          6   12545.57   12545.57       1 |
100. |  104285          4      16757      16757       1 |
     +--------------------------------------------------+

. assert mean_inc == totinc_main_z if count == 1

Note the check above:

1. The mean of singleton values should be just that single value.

and in your case

2.

Code:

sort idauniq timeperiod 

list if idauniq == 100074

Comment

Karen Arulsamy

Join Date: Feb 2021

Posts: 78
#3

16 Jun 2022, 06:13

Thank you for your help Nick, appreciate it.

Many thanks
Karen
Comment

Announcement

help with calculating the means for long form data

Comment

Comment