Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sort Means of Multiple Variables

    Hi! I have what I hope is a pretty easy question to answer for the pros out there that I have been unable to figure out today. I am working with data on crop incomes and areas cultivated. I have generated a new variable for each crop which is basically income/area, let's call it xprof where x is the name of the crop. I am trying to find a way to generate the means each xprof variable and sort them. Basically, I am trying to generate a sorted list of which crops generate the most income based on the land they use.
    I have been playing around with this a bit and can't seem to figure out a way to sort them since the normal sort functionality only sorts within a single variable. I am working with over 100 crops/xprof variables so an efficient way of doing this would be much appreciated. Thanks!
    Last edited by Bryan Gensits; 20 Mar 2019, 04:59.
    Bryan Gensits
    Royal University of Bhutan: College of Natural Resources

  • #2
    Despite the lack of a data example, I have some guesses at what you mean. Here is some code you can run. The numbered comments *1 and so forth are echoed by footnotes below.

    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    * 1
    . egen mean = mean(mpg), by(rep78)
    
    * 2 and 3
    . egen wanted = group(mean rep78)
    (5 missing values generated)
    
    * 4
    . labmask wanted, values(rep78)
    
    * 5
    . tabdisp wanted , c(mean)
    
    ----------------------
    group(mea |
    n rep78)  |       mean
    ----------+-----------
            2 |     19.125
            3 |   19.43333
            1 |         21
            4 |   21.66667
            5 |   27.36364
            . |       21.4
    ----------------------
    
    * 6
    . gen negmean = -mean
    
    . egen wanted2 = group(negmean rep78)
    (5 missing values generated)
    
    . labmask wanted2, values(rep78)
    
    . tabdisp wanted2 , c(mean)
    
    ----------------------
    group(neg |
    mean      |
    rep78)    |       mean
    ----------+-----------
            5 |   27.36364
            4 |   21.66667
            1 |         21
            3 |   19.43333
            2 |     19.125
            . |       21.4
    ----------------------
    
    * 7
    . format mean %2.1f
    
    . tabdisp wanted2 , c(mean)
    
    ----------------------
    group(neg |
    mean      |
    rep78)    |       mean
    ----------+-----------
            5 |       27.4
            4 |       21.7
            1 |       21.0
            3 |       19.4
            2 |       19.1
            . |       21.4
    ----------------------
    1. Putting group means in a new variable is a task for egen, mean(). You can use two or more categorical or grouping variables in the by() option, not just one, and that is often done, e.g. summarizing something both by area and by time.

    2. We can sort those means into order and map the resulting groups to 1, 2, 3, etc. That is a job for egen, group(). So, the group with lowest mean 19.125 would be 1, and henceforth.

    3, A twist on this is that it's possible for two or more categories to have the same mean response, so that they should be shaken apart by splitting tied means on the original categories. That is why
    group() is fed not just the variable containing means but also the original categorical variable, here rep78.

    4. This is only part of the problem, as we really need to see the original category values (or value labels). That is what the community-contributed command
    labmask does. It should be downloaded from the Stata Journal website, and certainly must be installed before you can use it.

    5. So you can tabulate that then. There are various ways to do it. I mention
    tabdisp only because it's often neglected.

    6. If you want values running highest to lowest, negate the means to reverse the order.

    7. Lots of cleaning and tidying possible, such as sensible display format, intelligible variable labels, etc.

    This is all programmable. I won't suggest program code, not least because I suspect that one of the large and complicated reporting commands from SSC or the Stata Journal that I respect appropriately but never use can do it.

    Comment


    • #3
      Hi Nick,

      Thanks so much for your response, sorry for not being very clear in my initial post. Your response actually is a huge help on another project I'm working on but wasn't quite what I was looking for on this. I included a bit of the relevant code below and removed most of the variables from the local crop macro to make it easier to look at.

      Code:
      local crop app asp avo ....
      foreach z of local crop {
      gen `z'ipera = `z'inc / `z'a 
      }
      The -inc suffix denotes income and the -a denotes area. Thus, using the three variables I included in this example I have generated the income/area variable (`z'ipera) for apples, asparagus, and avocados for the farms in my study. What I am attempting to do is generate a table that simply is sorted by the mean of each of the `z'ipera variables to see which crops generate the highest income per area. Thanks again for your response and time. It's an immense help!
      Bryan Gensits
      Royal University of Bhutan: College of Natural Resources

      Comment


      • #4
        Here's one way. The example is fairly silly, but not the technique.

        Code:
        . sysuse citytemp
        (City Temperature Data)
        
        . collapse tempjan tempjul
        
        . l
        
             +---------------------+
             |  tempjan   tempjuly |
             |---------------------|
          1. | 35.74895   75.05377 |
             +---------------------+
        
        . gen id = 1
        
        . reshape long temp , i(id) j(month) string
        (note: j = jan july)
        
        Data                               wide   ->   long
        -----------------------------------------------------------------------------
        Number of obs.                        1   ->       2
        Number of variables                   3   ->       3
        j variable (2 values)                     ->   month
        xij variables:
                               tempjan tempjuly   ->   temp
        -----------------------------------------------------------------------------
        
        . sort temp
        
        . list
        
             +-----------------------+
             | id   month       temp |
             |-----------------------|
          1. |  1     jan   35.74895 |
          2. |  1    july   75.05377 |
             +-----------------------+
        
        . gsort -temp
        
        . list
        
             +-----------------------+
             | id   month       temp |
             |-----------------------|
          1. |  1    july   75.05377 |
          2. |  1     jan   35.74895 |
             +-----------------------+

        Comment


        • #5
          Aha, playing around with that basic premise did the trick for me! Thanks as always Nick - you're a wizard!
          Bryan Gensits
          Royal University of Bhutan: College of Natural Resources

          Comment

          Working...
          X