  • Collapse command

    I'm trying to figure out the proper code to collapse a dataset and get the means and medians of a set of variables based on the top 20% and bottom 20% of another variable.

    I've started with the following code:

    collapse adminratio fundratio (median) medadmin=adminratio medfund=fundratio, by(portfolio) cw

    But then how do I incorporate basing this on the top 20% and bottom 20% of another variable?

    Thank you!

  • #2
    I'm not sure whether -collapse- allows an -if- at the end, but you can achieve similar things with -egen-, since -egen- allows -if-.
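    For what it's worth, -collapse- does accept -if- and -in- qualifiers, so a subset can be summarized directly. A minimal sketch using the auto data (the variables here are just illustrative):

    Code:
    sysuse auto, clear
    * restrict the summary to foreign cars only, collapsing within rep78 groups
    collapse (mean) mpg (median) medprice=price if foreign == 1, by(rep78)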



    • #3
      Actually, let me be more specific, as I think this is going to be more complicated than an -if- statement.

      I have a batch of nonprofits, sorted into 5 portfolios based on asset size.

      Within each portfolio, I need to calculate the mean and median of the administrative ratio and fundraising ratio, but for the bottom 20% and top 20% of their revenue index.

      The results for each portfolio get aggregated, hence the collapse command.

      And then I need to do t-tests and Wilcoxon sign tests to determine if there is a significant difference between the bottom and top groups for each mean and median of the administrative ratio and fundraising ratio.

      I hope this makes sense!
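      As a sketch of that workflow -- the variable names revindex, adminratio, fundratio, and portfolio stand in for the actual ones, and -ranksum- (the two-sample Wilcoxon rank-sum test) may be what is meant by the Wilcoxon sign test here:

      Code:
      * tag the bottom (1) and top (2) 20% of the revenue index within each portfolio
      egen rev20 = pctile(revindex), p(20) by(portfolio)
      egen rev80 = pctile(revindex), p(80) by(portfolio)
      gen byte grp = cond(revindex <= rev20, 1, cond(revindex > rev80, 2, .)) if revindex < .

      * tests run on the original data, before collapsing, e.g. within portfolio 1:
      * ttest adminratio if portfolio == 1, by(grp)
      * ranksum adminratio if portfolio == 1, by(grp)

      * means and medians of the ratios, by portfolio and group
      collapse (mean) adminratio fundratio (median) medadmin=adminratio medfund=fundratio if grp < ., by(portfolio grp) cw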



      • #4
        First, I'd say that you're much more likely to get help if you post a data example using the -dataex- command, as described in item 12.2 of the FAQ. It's hard to tell what your data structure is, so re-posting an example would be in your interest.

        Second, your proposed analytic methods (dividing up observations, t-tests, sign test) don't sound to me like they are optimal for the kind of questions that I might guess lie behind them. For example, taking continuous data and then using sign tests on it is likely to discard most of the useful information.

        Perhaps the methods you propose are standard in your field or otherwise required of you, but if not, I think you'd get some useful ideas if you described what you'd like to know about your data in a more conceptual way (e.g., perhaps "Is an organization's fundraising ratio predictive of its revenue index?"). My guess -- perhaps wrong -- is that there are ways to answer your question that are not only more straightforward to do in Stata, but also might be more revealing in relation to your goals.



        • #5
          Mike Lacy gives excellent advice. I agree with it all. I would rather look to see how any response of interest varied with asset size, or a transform thereof, or even the percent rank thereof, rather than arbitrarily identify bins at either end of the asset size distribution.

          Nevertheless, I think you need some hints on technique to do what you want to do (or have to do).

          collapse is highly versatile, but not universally so. If you calculate the 20th and 80th percentiles -- which you can do for all your data, or for subsets -- then you can bin away to your heart's content:


          Code:
          . sysuse auto, clear
          (1978 Automobile Data)
          
          . egen mpg20 = pctile(mpg), p(20) by(foreign)
          
          . egen mpg80 = pctile(mpg), p(80) by(foreign)
          
          . gen wanted = cond(mpg <= mpg20, 1, cond(mpg <= mpg80, 2, 3)) if mpg < .
          
          . scatter wanted mpg , by(foreign) yla(1/3) ms(Oh) jitter(2)

          Always draw a graph to show what you did -- in particular realising that bins identified in this way are rarely homogeneous.






          • #6
            Thank you everyone for your responses! I agree that this is not the ideal method, but I'm doing a replication study and unfortunately the Stata code is not available.

            I got the code above to work for my dataset, which is great as I feel I'm a step closer. However, one concern I have is that this looks like it's for the 20th and 80th percentiles, whereas I'm looking for the top and bottom 20% of observations. Should the code be a little different for determining percentages rather than percentiles?



            • #7
              I actually think I figured it out...

              Code:
              bysort sizeportfolio: gen byte low = _n < (0.2*_N)
              bysort sizeportfolio: gen byte high = _n > (0.8*_N)



              • #8
                Well, actually, I think that Nick is right in #5, and what you show in #7 will, depending on the data, either produce the same thing or be wrong.

                The definition of a 20th percentile is that 20% of the observations are less than or equal to that value. So, in a general sense, if we don't go too deep into the weeds, a 20th percentile cutoff gives you the bottom 20%. But it's a bit more complicated.

                Suppose that there are several different organizations in the portfolio that have the same value of sizeportfolio, and that value is near the 20th percentile cutpoint. Then there is no reason to assign some of them to the bottom 20% but not others: they should all be in or all be out. Your code fails that test: it will separate them, some in, some out. Worse, it will do so in a random and irreproducible way: if you run the code a second time, you will get different results.

                Your code will also give incorrect results if there are observations with missing values of sizeportfolio, as those are counted in _N, but they should not be included when reckoning the top or bottom 20%.

                In short, if the data contain no ties and no missing values for sizeportfolio, your code and Nick's will produce the same results. But if those conditions fail, your code will produce incorrect results; Nick's will remain correct.



                • #9
                  Hi Clyde, thank you so much for your thoughts on this. I created a simple dataset with 100 observations to test this out (see attached).

                  I used Nick's commands and got the following:

                  Code:
                  egen HHI20 = pctile(HHI), p(20) by(Portfolios)
                  egen HHI80 = pctile(HHI), p(80) by(Portfolios)
                  gen indicator = cond(HHI <= HHI20, 1, cond(HHI <= HHI80, 2, 3)) if HHI < .
                  scatter indicator HHI, by(Portfolios) yla(1/3) ms(Oh) jitter(2)
                  tabulate indicator

                  When I do the above, it identifies 24 organizations that are below or equal to the 20th percentile and 24 organizations that are above the 80th percentile. However, if I'm trying to identify organizations in the bottom and top 20 based on a sample of 100, then I should have 20 organizations in each group not 24, correct?

                  I also tried the following:

                  Code:
                  bysort Portfolios: gen byte low = _n <= (0.2*_N)
                  tabulate low
                  bysort Portfolios: gen byte high = _n >= (0.8*_N)
                  tabulate high

                  When I tabulate, 19 organizations (19%) are low and 23 (23%) are high, which still doesn't come out to 20% each... so I'm not sure where I'm going wrong, as in the attached dataset there are no missing values.
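                  A count-based split that skips missing values can be sketched like this: sort by HHI within each portfolio (missing values sort last in Stata), count only the nonmissing observations, and compare _n against that count. Note that this still splits tied values of HHI arbitrarily, as discussed in #8:

                  Code:
                  bysort Portfolios (HHI): egen nonmiss = count(HHI)
                  by Portfolios: gen byte low  = _n <= 0.2 * nonmiss if !missing(HHI)
                  by Portfolios: gen byte high = _n >  0.8 * nonmiss if !missing(HHI)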

                  Percent vs perctiles test.xlsx



                  • #10
                    Hi again! I found two problems...1) I have missing data, which is causing my counts to drop. 2) In my calculation of percentages, I didn't include HHI.

                    But I still don’t understand why I’m not getting 20 observations for each of the calculations above.
                    Last edited by Jessica Berrett; 06 Sep 2019, 14:24.



                    • #11
                      So, missing data adds another wrinkle to the matter. If 5 of 100 observations are missing, 20% of the remaining 95 is 19, not 20.

                      On top of that there is the issue of ties. Run the following code and browse the data to see what's going on:

                      Code:
                      clear*
                      set obs 100
                      gen HHI = _n
                      replace HHI = 20 if inrange(HHI, 18, 22)
                      replace HHI = 80 if inrange(HHI, 78, 82)
                      
                      egen HHI20 = pctile(HHI), p(20)
                      egen HHI80 = pctile(HHI), p(80)
                      gen indicator = cond(HHI <= HHI20, 1, cond(HHI <= HHI80, 2, 3)) if HHI < .
                      tab indicator
                      summ HHI*

                      The "top 20%" or "bottom 20%" you are looking for is a unicorn. In this data (which exaggerates the issue compared to real-world data, in the name of clarity) you can see that if you simply pick observations 1 through 20 to be the "bottom 20%", you are excluding from the bottom 20% some observations that have the same value of HHI as others that were included. There is no justification for treating these observations differently.

                      And, indeed, by doing so, and doing so in an arbitrary manner, you will get irreproducible results down the line when you then look at how the "bottom 20%" differ from the other groups on other variables (such as subsequent performance). The point is that the whole notion of a "top 20%" or "bottom 20%" is illusory when there are tied values in the data. Any sensible subsetting of the data must treat observations with equal values the same way -- or, if it doesn't, it must have some principled basis for choosing which to include and which to exclude.



                      • #12
                        What do you mean by tied values?



                        • #13
                          Tied values means two observations having the same value for a variable. Run the code in #11 and then look at the data in the Browser (or -list- it). You'll see that HHI has the same value in observations 18 through 22: these are all tied. Similarly, observations 78 through 82 are tied values of HHI.
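                          For example, after running the code in #11, the tied runs can be inspected directly:

                          Code:
                          * observations around the 20th-percentile cutpoint all share HHI == 20
                          list HHI indicator in 16/24
                          * and observations around the 80th all share HHI == 80
                          list HHI indicator in 76/84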

