Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • help with calculating the means for long form data

    Hi,

    I'm working with long form data which contains information on financial outcomes (income, total wealth and financial wealth) for individuals when they are 75 to 80 years old. Most individuals are observed at least 2x during this period. I would like to calculate the means of these outcomes for each individual based on how many times they are observed in the data and use available data if there is missing data (so I don't lose all observations if at least one observation is missing per individual per age).

    I tried this bysort idauniq: egen mean_inc = mean(totinc_main_z) but something odd is going and it might be because of missing observations although I'm not sure. For example, below, There is one observation for ID 100074 but the mean doesn't match but it matches for ID 100059.

    Click image for larger version

Name:	Capture.JPG
Views:	1
Size:	42.3 KB
ID:	1668763


    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long idauniq byte timeperiod float totinc_main_z
    100059 1 15060.684
    100074 5 28076.156
    103713 1 28268.996
    103716 4 13396.002
    103716 6     14446
    103735 1 14148.883
    103742 3 10623.372
    103742 5 11547.483
    103742 1 10288.215
    103745 4  8303.543
    103745 6      7695
    103769 4 22106.467
    103769 6 22694.467
    103769 2   20616.8
    103771 3     11388
    103771 2     10972
    103771 5      5356
    103784 5     10916
    103784 3      6606
    103799 4      6604
    103799 6      7072
    103799 2      6084
    103807 2      7942
    103807 6  9824.485
    103807 4     10647
    103819 5 17073.887
    103819 3 16348.684
    103819 1  15326.37
    103858 4      9932
    103881 3     18458
    103881 5  23039.67
    103881 1  30421.67
    103885 2  12687.31
    103892 5  29782.16
    103892 3 14868.387
    103892 1  6767.619
    103913 2  8163.268
    103913 6     12485
    103913 4     10930
    103961 2 12854.076
    103961 4 11408.326
    103961 6 10978.244
    103997 6 18112.082
    103997 4  16577.62
    104008 4 18715.166
    104008 6  20535.08
    104008 2 18058.775
    104010 4 12726.148
    104010 6 37818.293
    104010 2  32405.73
    104014 2  19559.33
    104017 1 31547.953
    104017 5  22600.83
    104017 3  23206.26
    104033 1 28838.895
    104033 3   51693.4
    104033 5 30657.824
    104034 1 14854.358
    104034 5     22074
    104034 3 13889.146
    104037 2 17483.955
    104037 4 14044.006
    104053 1 11025.682
    104053 5  9298.152
    104053 3 11221.432
    104065 3      4316
    104065 5      7450
    104079 3  9143.234
    104079 5  9956.428
    104079 1  9969.695
    104083 2 10075.413
    104119 2  20639.88
    104119 4  31236.86
    104119 6   35228.1
    104123 1  9811.217
    104139 3      7803
    104139 5 15812.103
    104146 2  13284.16
    104159 5 28792.264
    104159 3  35081.67
    104159 1   22786.6
    104165 2     10401
    104168 3     10727
    104168 1      5305
    104170 3 16082.013
    104180 3  11583.87
    104180 1  11401.75
    104195 1  15733.76
    104199 2 15317.443
    104199 4      9360
    104226 4      9091
    104226 6  57623.44
    104226 2  8908.379
    104231 6  12545.57
    104237 3  23340.32
    104237 1  30862.78
    104237 4  25030.24
    104280 4  21309.22
    104280 6 15175.072
    104285 4     16757
    end

    Thanks in advance!

    Many thanks
    Karen

  • #2
    Your code looks fine to me and I can't reproduce a problem. I would expect small discrepancies if the result of averaging a double was put into a float and in some other cases. A discrepancy like that you report for 100074 should have some other explanation, like other values for 100074 that are out of sequence somehow.

    You don't have any missing values in your data example and if you did they would just be ignored. If otherwise "missing data" means data that might have been in the dataset but aren't, they can't have any impact. (I encourage the term "absent data" for that circumstance.)

    Code:
    . bysort idauniq: egen mean_inc = mean(totinc_main_z) 
    
    . by idauniq: egen count = count(totinc_main_z)
    
    . 
    . list if count == 1 
    
         +--------------------------------------------------+
         | idauniq   timepe~d   totinc~z   mean_inc   count |
         |--------------------------------------------------|
      1. |  100059          1   15060.68   15060.68       1 |
      2. |  100074          5   28076.16   28076.16       1 |
      3. |  103713          1      28269      28269       1 |
      6. |  103735          1   14148.88   14148.88       1 |
     29. |  103858          4       9932       9932       1 |
         |--------------------------------------------------|
     33. |  103885          2   12687.31   12687.31       1 |
     51. |  104014          2   19559.33   19559.33       1 |
     71. |  104083          2   10075.41   10075.41       1 |
     75. |  104123          1   9811.217   9811.217       1 |
     78. |  104146          2   13284.16   13284.16       1 |
         |--------------------------------------------------|
     82. |  104165          2      10401      10401       1 |
     85. |  104170          3   16082.01   16082.01       1 |
     88. |  104195          1   15733.76   15733.76       1 |
     94. |  104231          6   12545.57   12545.57       1 |
    100. |  104285          4      16757      16757       1 |
         +--------------------------------------------------+
    
    . assert mean_inc == totinc_main_z if count == 1
    Note the check above:

    1. The mean of singleton values should be just that single value.

    and in your case

    2.

    Code:
    sort idauniq timeperiod 
    
    list if idauniq == 100074

    Comment


    • #3
      Thank you for your help Nick, appreciate it.

      Many thanks
      Karen

      Comment

      Working...
      X