Generate median ignoring duplicate values

Paul Pelzl

Join Date: May 2016

Posts: 24
#1

01 Jul 2016, 04:45

I know this is a rather simple question, but I could not find a solution to it. Suppose I have a panel dataset with X firms located in Y countries, where X > Y. The dataset contains a variable "GDP", which gives the GDP of the firm's home country in year t. I want to compute the median of GDP across countries in each year. For example, if Y = 5 and the GDP values in year t are {5, 10, 15, 20, 25}, I want Stata to return a median of 15. The obvious candidate is egen median=median(GDP), by(year), but that median is taken over firm-level observations, so it depends on how many firms each country hosts. For instance, with 100 firms in the country that has GDP = 25 and one firm in each of the other countries, the computed median is 25, whereas I want Stata to report 15 in this case too.

Of course, one solution is to use "duplicates drop country year, force", compute the median with the egen command, save the result under a different file name, and merge it back into the original file. But I'm sure there is a much easier way to do this, and I hope that someone has a good idea.
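For reference, the drop-and-merge workaround described above might look roughly like the sketch below. The toy data, the firm identifier, the variable name GDP_median, and the use of preserve/restore with a tempfile (instead of saving under a new file name) are assumptions for illustration only.

```stata
* Toy data (assumed): three countries in one year; country 3 hosts four firms
clear
input firm country year GDP
1 1 2015 5
2 2 2015 10
3 3 2015 15
4 3 2015 15
5 3 2015 15
6 3 2015 15
end

* Work on a copy reduced to one row per country-year
preserve
duplicates drop country year, force
egen GDP_median = median(GDP), by(year)

* Keep one row per year and park it in a temporary file
keep year GDP_median
duplicates drop
tempfile medians
save `medians'
restore

* Attach the country-level median to every firm-year observation
merge m:1 year using `medians', nogenerate
```

In this toy example the firm-level median of GDP would be 15, while the merged GDP_median is 10, the median over the three countries.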
Nick Cox

Join Date: Mar 2014

Posts: 35734
#2

01 Jul 2016, 04:53

Code:

egen tag = tag(country year)
egen GDP_median = median(GDP / tag), by(year)

See the help for egen on the tag() function and also http://www.stata-journal.com/sjpdf.h...iclenum=dm0055 especially but not only Section 10.
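A small worked example may make the mechanics clearer. tag() marks exactly one observation per country-year with 1 and all others with 0, so GDP / tag equals GDP for the tagged observation and is missing elsewhere (division by zero yields missing in Stata), and egen's median() ignores missing values. The toy data below mirror the numbers in the question; the variable name naive_median is an assumption added for comparison.

```stata
* Toy data (assumed): 5 countries in one year; countries 1-4 host one firm
* each, country 5 hosts 100 firms. GDP takes the values 5, 10, 15, 20, 25.
clear
set obs 104
gen year = 2015
gen country = cond(_n <= 4, _n, 5)
gen GDP = 5 * country

* Firm-level median: dominated by the 100 firms in the GDP = 25 country
egen naive_median = median(GDP), by(year)

* Country-level median: only one observation per country-year contributes
egen tag = tag(country year)
egen GDP_median = median(GDP / tag), by(year)

* Show one row with both results
list year naive_median GDP_median in 1
```

Here naive_median is 25 and GDP_median is 15, and both are filled in for every observation in the year, not just the tagged ones.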
Paul Pelzl

Join Date: May 2016

Posts: 24
#3

01 Jul 2016, 05:07

Thank you!
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#4

01 Jul 2016, 07:26

From the pedantry corner: the title of the original post does not quite accurately represent the question posed there. Paul does not actually want to quash duplicate values: he wants to only count each country once. If it happened that two different countries had the same GDP in a given year, he would want to include the GDP value twice in the calculation of the median. So it is a matter of avoiding multiple-counting, not ignoring duplicate values.
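To illustrate the distinction, here is a minimal sketch with made-up data in which two countries share the same GDP. Tagging on country and year counts the shared value twice; tagging on GDP and year would collapse it to one. All variable names other than country, year, and GDP are assumptions.

```stata
* Toy data (assumed): countries 1 and 2 both have GDP = 10; country 3 has two firms
clear
input firm country year GDP
1 1 2015 10
2 2 2015 10
3 3 2015 40
4 3 2015 40
end

* Count each country once: median of {10, 10, 40}
egen tag_country = tag(country year)
egen med_by_country = median(GDP / tag_country), by(year)

* Drop duplicate values instead: median of {10, 40}
egen tag_value = tag(GDP year)
egen med_by_value = median(GDP / tag_value), by(year)
```

The first median is 10 and the second is 25, so the two readings of "ignoring duplicates" give different answers whenever countries tie on GDP.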
Paul Pelzl

Join Date: May 2016

Posts: 24
#5

01 Jul 2016, 09:13

What I meant was of course duplicates in terms of GDP and country, rather than only in terms of GDP. Still, I agree with you that my title is misleading, thanks for pointing that out.