Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • finding max value in a stratified column

    I have data (a dummy set provided here to make it simpler) that are stratified by id and by varC, as illustrated below.
    I want to be able to find the maximum values of varA & varB stratified by id and varC and then using it in an equation (e.g. divide it by varB)
    pseudocode by id varC:gen varD = max(varA)/max(varB)

    I know I can do this using egen to find the max values of varA (stratified by id and varC) and dividing it by the maximum value of varB
    QUESTION: Is there a way to do this without making a bunch of interim 'max' variables, or do I have to create all the max variables first and then create varD.
    It is just not apparent to me how to do it and I have not been able to find a similar question to use as a place to start.

    id varA varB varC
    1 12 225 0
    1 34 346 0
    1 22 349 1
    1 23 359 1
    2 45 12 0
    2 65 25 1
    3 33 55 0
    3 25 77 1
    3 17 122 0
    3 44 143 1
    4 1 12 1
    5 4 26 1
    5 9 55 1
    5 2 77 0
    5 12 125 0
    Many thanks
    Don

  • #2
    Don, does the following do what you want?
    Code:
    version 13.1
    
    clear *
    set more off
    
    input byte(id varA) int varB byte varC
    1 12 225 0
    1 34 346 0
    1 22 349 1
    1 23 359 1
    2 45  12 0
    2 65  25 1
    3 33  55 0
    3 25  77 1
    3 17 122 0
    3 44 143 1
    4  1  12 1
    5  4  26 1
    5  9  55 1
    5  2  77 0
    5 12 125 0
    end
    
    assert !missing(varA, varB) // From my post in the last thread on this
    
    sort id varC
    by id varC: generate double varD = varA[_N] / varB[_N]
    
    format varD %05.3f
    list, noobs sepby(id varC)
    
    exit

    Comment


    • #3
      Joseph's code rests on the data's being sorted in such a way that varA and varB increase in tandem and are already in sorted order in the input. But if those relationships don't hold up in the actual data, then varA[_N] and varB[_N] will not necessarily be the maximum values of varA and varB when the data are sorted by id and varC.

      Off hand, I don't see a way to do this in Stata without separately finding the max of varA and the max of varB within strata first. But maybe somebody else will come up with something.

      Comment


      • #4
        Clyde's right. Not even the intended bysort id varC (varA varB): would work when the columns don't increase in tandem.

        Comment


        • #5
          I think, in the specific instance that I want to apply this to, that the two variables do increase in tandem, so Joseph's plan should work. I will give it a try and see.
          If it doesn't work, then I will go back to the egen max(X) approach.
          Many thanks to both of you.
          Don

          Comment


          • #6
            The inefficiency of using egen can be a little exaggerated here. At worst you need to generate one more variable than you need long-term:

            Code:
             
            bysort id varC: egen max = max(varA)
            by id varC: egen work = max(varB)
            replace max = max/work 
            drop work
            There is a lot of code hidden in the egen calls, but it is easy to spend minutes worrying about a cleaner way to do things when the code won't take that long. Only f you are doing this repeatedly and for enormous datasets is a faster solution needed. Still, it is always of interest to know what that would be.

            If missings are possible this is longer code as typed, but should be faster than the above, given the extra code in egen

            Code:
             
            gen byte ismiss = missing(varA) 
            bysort ismiss id varC (varA) : gen max = varA[_N] 
            replace ismiss = missing(varB) 
            bysort ismiss id varC (varB) : replace max = max/varB[_N] 
            bysort id varC (max) : replace max = max[1] 
            drop ismiss
            Warning: nothing tested.



            Comment


            • #7
              I initially used Joseph's suggested approach and it worked. However I realized that it placed the answer in only 1 row (which is sensible) but my analysis requires that all rows in the column be populated with the results, so I used a combination of Joseph's and Nick's suggestions and got what I needed.
              Thank you for all your help. Besides getting the answer I needed, I learned a fair bit on the way.

              Don

              Comment

              Working...
              X