Hi All,
I have data that looks like the -dataex- example in the first Code block below.
The data contain country GDP by year. What I wish to do is as follows: for the entire sample (1990-1992), I wish to calculate average GDP by country. Thereafter, I wish to calculate the median of these country averages. Finally, I wish to create an indicator variable that takes on a value of 1 if country i's GDP in year t is greater than the median calculated as described. The commands I use are in the second Code block below.
At first, the code seemed OK to me. However, on tabulating the indicator variable median, I found that only around 20% of its values were 0. Given that the cutoff is a median, I expected the share to be close to half. I attributed this to the fact that the panel is largely unbalanced: data missingness is definitely correlated with GDP (richer countries have more datapoints), so there are fewer 0s than 1s.
However, I then noticed another problem. The code first generates avggdp by country, and this variable is constant within country over time. As a result, the median computed in the next step is affected by how many datapoints each country has. Consider the USA: because it has datapoints for all years, it influences the calculation of the median in a way I do not want. I therefore chose to drop duplicates and recalculate the median.
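To make the de-duplication step concrete, here is a minimal sketch of one way to take the median over one row per country without dropping any observations. It assumes avggdp has already been created as in my code; the names onepercountry, med_country, and above are hypothetical and used only for illustration.

```stata
* Sketch only: compute the median of avggdp using one observation per country.
* Assumes avggdp already exists; onepercountry, med_country, and above are
* hypothetical names.

* Flag exactly one observation within each country (1 for that row, 0 otherwise)
egen byte onepercountry = tag(Country)

* Median of the country averages, each country counted once
summarize avggdp if onepercountry, detail
scalar med_country = r(p50)

* Indicator comparing each country's average with that median
generate byte above = avggdp > med_country if !missing(avggdp)
```

Keeping all rows and restricting the summary with the tag avoids having to merge the median back into the full panel afterwards.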
Even after dropping duplicates, the median was still the same! Shouldn't the median be different after the duplicates of the averages have been dropped?
Thanks!
CS
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(year Country GDP)
1990 1 21
1990 2  2
1990 3 23
1990 4  2
1991 2  1
1991 3  2
1991 4  3
1991 5  2
1992 1 12
1992 2  3
1992 3 21
1992 4  2
1992 5  3
end
Code:
by Country, sort: egen avggdp=mean(GDP)
egen medianGDP=median(avggdp)
g median=0
by Country (year), sort: replace median=1 if avggdp>medianGDP