finding max value in a stratified column

Don Spady

Join Date: Mar 2014

Posts: 47
#1

finding max value in a stratified column

17 Jan 2015, 17:29

I have data (a dummy set provided here to make it simpler) that are stratified by id and by varC, as illustrated below.
I want to be able to find the maximum values of varA & varB stratified by id and varC and then using it in an equation (e.g. divide it by varB)
pseudocode by id varC:gen varD = max(varA)/max(varB)

I know I can do this using egen to find the max values of varA (stratified by id and varC) and dividing it by the maximum value of varB
QUESTION: Is there a way to do this without making a bunch of interim 'max' variables, or do I have to create all the max variables first and then create varD.
It is just not apparent to me how to do it and I have not been able to find a similar question to use as a place to start.

id varA varB varC

1 12 225 0

1 34 346 0

1 22 349 1

1 23 359 1

2 45 12 0

2 65 25 1

3 33 55 0

3 25 77 1

3 17 122 0

3 44 143 1

4 1 12 1

5 4 26 1

5 9 55 1

5 2 77 0

5 12 125 0

Many thanks
Don
Tags: None

Joseph Coveney

Join Date: Apr 2014
Posts: 4421

17 Jan 2015, 18:02

Don, does the following do what you want?

Code:

version 13.1

clear *
set more off

input byte(id varA) int varB byte varC
1 12 225 0
1 34 346 0
1 22 349 1
1 23 359 1
2 45  12 0
2 65  25 1
3 33  55 0
3 25  77 1
3 17 122 0
3 44 143 1
4  1  12 1
5  4  26 1
5  9  55 1
5  2  77 0
5 12 125 0
end

assert !missing(varA, varB) // From my post in the last thread on this

sort id varC
by id varC: generate double varD = varA[_N] / varB[_N]

format varD %05.3f
list, noobs sepby(id varC)

exit

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#3

17 Jan 2015, 18:11

Joseph's code rests on the data's being sorted in such a way that varA and varB increase in tandem and are already in sorted order in the input. But if those relationships don't hold up in the actual data, then varA[_N] and varB[_N] will not necessarily be the maximum values of varA and varB when the data are sorted by id and varC.

Off hand, I don't see a way to do this in Stata without separately finding the max of varA and the max of varB within strata first. But maybe somebody else will come up with something.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#4

17 Jan 2015, 19:27

Clyde's right. Not even the intended bysort id varC (varA varB): would work when the columns don't increase in tandem.
Comment
Don Spady

Join Date: Mar 2014

Posts: 47
#5

17 Jan 2015, 21:32

I think, in the specific instance that I want to apply this to, that the two variables do increase in tandem, so Joseph's plan should work. I will give it a try and see.
If it doesn't work, then I will go back to the egen max(X) approach.
Many thanks to both of you.
Don
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35726
#6

18 Jan 2015, 03:34

The inefficiency of using egen can be a little exaggerated here. At worst you need to generate one more variable than you need long-term:

Code:

bysort id varC: egen max = max(varA) by id varC: egen work = max(varB) replace max = max/work drop work

There is a lot of code hidden in the egen calls, but it is easy to spend minutes worrying about a cleaner way to do things when the code won't take that long. Only f you are doing this repeatedly and for enormous datasets is a faster solution needed. Still, it is always of interest to know what that would be.

If missings are possible this is longer code as typed, but should be faster than the above, given the extra code in egen

Code:

gen byte ismiss = missing(varA) bysort ismiss id varC (varA) : gen max = varA[_N] replace ismiss = missing(varB) bysort ismiss id varC (varB) : replace max = max/varB[_N] bysort id varC (max) : replace max = max[1] drop ismiss

Warning: nothing tested.
Comment
Don Spady

Join Date: Mar 2014

Posts: 47
#7

18 Jan 2015, 20:33

I initially used Joseph's suggested approach and it worked. However I realized that it placed the answer in only 1 row (which is sensible) but my analysis requires that all rows in the column be populated with the results, so I used a combination of Joseph's and Nick's suggestions and got what I needed.
Thank you for all your help. Besides getting the answer I needed, I learned a fair bit on the way.

Don
Comment

Announcement

finding max value in a stratified column

Comment

Comment

Comment

Comment

Comment

Comment