Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dividing values by other values within a group

    Dear all,

    I am currently trying to calculate a variable which is based on dividing values within a variable by other values of that variable. For example, this is an example of the data:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 country float year str1 variable float(value wanted)
    "USA" 2010 "a" 2 .
    "USA" 2010 "b" 6 .
    "USA" 2010 "c" 4 .
    "USA" 2011 "a" 9 .
    "USA" 2011 "b" 1 .
    "USA" 2011 "c" 3 .
    end
    and the variable wanted is what I want to compute. I would like to compute the value of a divided by b, per country-year observation, so that wanted would give me the value 1/3 for the first two observations, and 9 for observation 4 and 5. This is just an example, I have multiple countries and years in the dataset, hence why I need to do this by country and year.

    Any help on this is greatly appreciated.

    Best,

    Satya

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str3 country float year str1 variable float value
    "USA" 2010 "a" 2
    "USA" 2010 "b" 6
    "USA" 2010 "c" 4
    "USA" 2011 "a" 9
    "USA" 2011 "b" 1
    "USA" 2011 "c" 3
    end
    
    sort country year, stable
    by country year: gen wanted = value[1]/value[2] if _n <= 2
    Note: I would normally have done this somewhat differently relying on observing that in your example, variable appears sorted in alphabetical order within country and year. But I decided that might not be true in your actual data, so I did it in a way that will look at the 1st and 2nd observations in each country-year group regardless of the sort order in your data. It is incumbent on you to assure that the data are in the right order before running this code.

    Comment


    • #3
      Hi Clyde,

      Thanks for your quick response. This is a code that I also had in mind, but this requires the data to be sorted properly, is there not a way where I can tell Stata to divide values that belong to category a by values that belong to category b? I have no problem sorting the data correctly, but this is more for me to know whether this is possible in Stata. If I would generalize this example to one where a and b would be variable names, for example GDP and population, then I would want to be able to compute GDP per capita by simply writing a code that says gen gdpcap=value[gdp]/value[population], without having to worry about the sorting of the data. So rather than dividing numbers based on their number, I would be referring to the group they belong to. But ofcourse, if this is not possible then it is no trouble for me to check whether the data is properly sorted. Thanks for your help in any case

      Best,

      Satya

      Comment


      • #4
        If you do not like sorting, you can try this egen only solution:

        Code:
        . egen avalues = mean(value/(variable=="a")), by(country year)
        
        . egen bvalues = mean(value/(variable=="b")), by(country year)
        
        . gen ratio = avalues/bvalues

        Comment


        • #5
          See also https://www.stata-journal.com/articl...article=dm0055 for a general review of technique. @Joro Kolev's method in #4 is written up in Section 10 of that paper. Using subscripts after arranging a suitable sort order is a strong theme in earlier Sections.

          Understanding the technique in #4 hinges on knowing that division by zero produces missing values which are then typically ignored. It's an accident in one way that the division slash is a little reminiscent of conditional notation

          value | (variable=="b") that should be widely familiar from probability theory. Friends have intimated to me that they find the trick a little clever but more than a little cryptic. You may find

          Code:
           
           egen bvalues = mean(cond(variable=="b", value, .)), by(country year)
          more transparent.

          Comment


          • #6
            Hi Joro and Nick,

            Thanks for your help, it is clear. I am under the impression that in Stata, having variables in long rather than wide form (so a column titled variable, with values for GDP, population, etc) is rather cumbersome for performing calculations right? The reason I ask this is because if I transform the dataset to a wide form, then it is simply a matter of dividing one column by another column, or if you are interested in sums, then adding up the different columns, and this requires one line of code. Another example which I am thinking of now (this is unrelated to my original question, I hope that is okay) is that if you have a country-industry-year dataset, then adding up values from different industries becomes a bit more work compared to having a wide dataset where some columns represent the industries, such that again you can simply add up specific columns. So in other words, what I am saying is that when you are interested in computing means, sums, divisions, then maybe having the dataset in a long shape is not that handy? I wonder what your thoughts are on this?

            Best,

            Satya

            Comment


            • #7
              There are many ways to pet a cat, anything that can be done in a wide format can be done in a long format too.

              Most advanced users of Stata such as Nick and Clyde prefer to work in long format, and if you invest one day to learn about explicit subscripting by groups, you would see why they prefer so.

              I am agreed with you that for a less advanced user of Stata the wide format might be preferable, and -egen- solutions that do not involve explicit subscripting are definitely preferable.

              At the end it is all a matter of taste, if you are determined enough, you can find a solution for everything in any of the wide or long.

              Having the skill of switching between the two is crucial though.


              Originally posted by satya otil View Post
              Hi Joro and Nick,

              Thanks for your help, it is clear. I am under the impression that in Stata, having variables in long rather than wide form (so a column titled variable, with values for GDP, population, etc) is rather cumbersome for performing calculations right? The reason I ask this is because if I transform the dataset to a wide form, then it is simply a matter of dividing one column by another column, or if you are interested in sums, then adding up the different columns, and this requires one line of code. Another example which I am thinking of now (this is unrelated to my original question, I hope that is okay) is that if you have a country-industry-year dataset, then adding up values from different industries becomes a bit more work compared to having a wide dataset where some columns represent the industries, such that again you can simply add up specific columns. So in other words, what I am saying is that when you are interested in computing means, sums, divisions, then maybe having the dataset in a long shape is not that handy? I wonder what your thoughts are on this?

              Best,

              Satya

              Comment


              • #8
                Joro Kolev is right, that long-layout data is, in general, much easier to work with. But every principle has its limits. And when you get to the point where GDP and population are separate observations in a long layout, that is pushing it too far. So if you are going to be doing things like selecting out the GDP and population observations to form a ratio, then, no, the data shouldn't be that long and you would be wise to widen it to the extent that those things are separate variables. What you shouldn't do is widen it so that each country, or each industry is its own variable ("column"). Those you want to keep as single variables. The ideal layout for a country-industry-year data set looks has one variable for country, one variable for industry, one variable for year, and then one variable for each attribute of that country and industry in that year: GDP, growth rates, population, total exports, etc. With that layout, when you want to sum all the values for something for each industry, it's -by industry, sort: egen total_v = total(v)-. If you want to sum the values of a variable for each country, it's -by country, sort: egen total_v = total(v)-.

                I strongly endorse Joro Kolev's advice that you invest some time in learning the use of -by- and subscripting, and the -egen- functions as well. Your time will be amply and quickly repaid.

                Comment


                • #9
                  Hi Joro and Clyde, thank you for the very insightful answers. I agree completely, and will keep this in mind for my future research. Thanks a lot.

                  Best,

                  Satya

                  Comment

                  Working...
                  X