  • Calculating Median - Surprising Result

    Hi All,

    I have data that looks like the following:


    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(year Country GDP)
    1990 1 21
    1990 2  2
    1990 3 23
    1990 4  2
    1991 2  1
    1991 3  2
    1991 4  3
    1991 5  2
    1992 1 12
    1992 2  3
    1992 3 21
    1992 4  2
    1992 5  3
    end
    In the above, I have country GDP by year. What I wish to do is as follows: for the entire sample (1990-1992), calculate average GDP by country, then calculate the median of those country averages. Finally, I wish to create an indicator variable that takes the value 1 if country i's GDP in year t is greater than the median calculated as described. I do the following:

    Code:
    by Country, sort: egen avggdp=mean(GDP)
    egen medianGDP=median(avggdp)
    g median=0
    by Country (year), sort: replace median=1 if avggdp>medianGDP
    At first, the code seemed OK to me. However, upon tabulating the values of median, I realized that only around 20% were 0s. Given that I am splitting at the median, I expected close to half. I attributed this to the panel being largely unbalanced: data missingness is definitely correlated with GDP, so richer countries contribute more data points and hence more 1s.
    However, I then noticed another problem. In the above, I first generate the average GDP by country, which is constant within country over time. When I then generate the median, it is affected by how many data points each country has. Consider the USA: because it has data points for all years, it influences the calculation of the median in a way I do not want. As such, I chose to drop duplicates and recalculate the median.
    Even after dropping duplicates, the median was still the same! Shouldn't the median be different after the duplicate averages have been dropped?
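One way to count each country only once, without dropping any rows, might be to tag a single observation per country and take the median over the tagged rows. This is only a sketch on the example data above; the names onepercountry, med_country and above_median are made up for illustration, and the indicator compares the country average to the median, as in the code above.

```stata
* Sketch: median of country averages, each country counted once.
clear
input float(year Country GDP)
1990 1 21
1990 2  2
1990 3 23
1990 4  2
1991 2  1
1991 3  2
1991 4  3
1991 5  2
1992 1 12
1992 2  3
1992 3 21
1992 4  2
1992 5  3
end

* Country average, constant within country
bysort Country: egen avggdp = mean(GDP)

* Flag exactly one row per country (hypothetical variable name)
egen byte onepercountry = tag(Country)

* Median over the flagged rows only, so each country has weight 1
summarize avggdp if onepercountry, detail
scalar med_country = r(p50)

* Indicator on every row: country average above that median
generate byte above_median = avggdp > scalar(med_country) if !missing(avggdp)
```

Because the median is stored in a scalar, the full panel stays in memory and nothing has to be dropped or merged back.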

    Thanks!
    CS

  • #2
    Chinmay:
    I fail to get the final part of your statement and your concern about deleting duplicates.
    Both mean and median calculated via -egen- are constant over time for a given country:
    Code:
    . input float(year Country GDP)

              year    Country        GDP
      1. 1990 1 21
      2. 1990 2  2
      3. 1990 3 23
      4. 1990 4  2
      5. 1991 2  1
      6. 1991 3  2
      7. 1991 4  3
      8. 1991 5  2
      9. 1992 1 12
     10. 1992 2  3
     11. 1992 3 21
     12. 1992 4  2
     13. 1992 5  3
     14. end
    
    . bysort Country: egen mean_GDP=mean(GDP)
    
    . bysort Country: egen median_GDP=median(GDP)
    
    . g median=0
    
    . bysort Country year: replace median=1 if median_GDP>mean_GDP
    
    
    . list, sepby(Country)
    
         +-----------------------------------------------------+
         | year   Country   GDP   mean_GDP   median~P   median |
         |-----------------------------------------------------|
      1. | 1990         1    21       16.5       16.5        0 |
      2. | 1992         1    12       16.5       16.5        0 |
         |-----------------------------------------------------|
      3. | 1990         2     2          2          2        0 |
      4. | 1991         2     1          2          2        0 |
      5. | 1992         2     3          2          2        0 |
         |-----------------------------------------------------|
      6. | 1990         3    23   15.33333         21        1 |
      7. | 1991         3     2   15.33333         21        1 |
      8. | 1992         3    21   15.33333         21        1 |
         |-----------------------------------------------------|
      9. | 1990         4     2   2.333333          2        0 |
     10. | 1991         4     3   2.333333          2        0 |
     11. | 1992         4     2   2.333333          2        0 |
         |-----------------------------------------------------|
     12. | 1991         5     2        2.5        2.5        0 |
     13. | 1992         5     3        2.5        2.5        0 |
         +-----------------------------------------------------+
    
    .
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Hi Carlo,

      Thanks for your reply. I think I failed to convey the issue properly.
      In your example, you have calculated the Median GDP by country. That is not what I intended. Basically, what I wish to do is:
      1) Calculate average GDP by country over time.
      2) This results in the creation of an average GDP variable, one which is constant for countries over time due to its calculation.
      3) I wish to then calculate the median of the average GDPs, over countries. So, in the example you have constructed, it would amount to finding the median of:

      16.5, 2, 15.33, 2.33, 2.5 or, in sorted form: 2, 2.33, 2.5, 15.33, 16.5
      The median for this would be 2.5. This is what I want to obtain. However, if duplicate entries are considered, the median would be calculated based on:

      16.5, 16.5, 2, 2, 2, 15.33, 15.33, 15.33, 2.33, 2.33, 2.33, 2.5, 2.5

      As such, the median should not be invariant to the presence of duplicates. What I have found in my calculations, however, is that the values coincide, or at least the number of countries above the median of average GDP by country is the same, whether or not I drop such duplicates.
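To check the two calculations side by side on the example data, something like the following could be used. It is a sketch only, and the variable names first, med_allrows, med_tmp and med_countries are illustrative. On this particular example both medians come out at 2.5, since the 7th of the 13 sorted row values and the 3rd of the 5 sorted country values are both 2.5, so a coincidence like the one described is possible even though the two calculations differ in general.

```stata
* Sketch: median over all rows vs. median over one row per country.
clear
input float(year Country GDP)
1990 1 21
1990 2  2
1990 3 23
1990 4  2
1991 2  1
1991 3  2
1991 4  3
1991 5  2
1992 1 12
1992 2  3
1992 3 21
1992 4  2
1992 5  3
end

bysort Country: egen avggdp = mean(GDP)
egen byte first = tag(Country)              // one flagged row per country

* (a) every country-year row counts: countries weighted by number of rows
egen med_allrows = median(avggdp)

* (b) one row per country, then spread the value to all rows
egen med_tmp = median(avggdp) if first
egen med_countries = max(med_tmp)
drop med_tmp

display "all rows: " med_allrows[1] "   one per country: " med_countries[1]
```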


      Thanks,
      CS


      • #4
        Hi Carlo

        Apologies for wasting your time; the mistake was mine. The median is in fact not invariant to the presence of duplicates. I double-checked!
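For anyone who wants to see the two medians actually differ, here is a small made-up example (hypothetical data, not from my real dataset) in which one country has three rows and the other two have one row each. The names country_avg, med_rows and med_countries are illustrative.

```stata
* Sketch with hypothetical data: an unbalanced panel where the medians differ.
clear
input float(year Country GDP)
1990 1 10
1991 1 10
1992 1 10
1990 2  1
1990 3  2
end

* Median over all rows: country 1 is counted three times
bysort Country: egen avggdp = mean(GDP)
summarize avggdp, detail
scalar med_rows = r(p50)

* Median over one row per country, via a temporary collapse
preserve
collapse (mean) country_avg = GDP, by(Country)
summarize country_avg, detail
scalar med_countries = r(p50)
restore

display "all rows: " scalar(med_rows) "   one per country: " scalar(med_countries)
```

Here the row-level values are 1, 2, 10, 10, 10 (median 10), while the country-level values are 1, 2, 10 (median 2).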

        Many thanks,
        CS


        • #5
          Chinmay:
          Admittedly, I suspected that something went wrong during your checking process and decided to wait instead of reacting promptly.
          Happy to read that everything is consistent now.
          Kind regards,
          Carlo
          (Stata 19.0)
