Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to sum specific variables over columns?

    Let's say I have three columns of data. Column 1 tells me the originating city, column 2 tells the destination city, and column 3 tells the price of the trip. Let's say I want to sum up all the prices for the trip from city X to city Y. How can I loop that?
    So, once I have a specific origin and destination pair, how can I search the rest of the columns to see if there's a match, and add the price of that row to the existing price?
    Last edited by Irina Z; 10 Apr 2015, 10:33.

  • #2
    So I interpret what you wrote to mean that you have three variables: origin, destination, and price in your data set. And apparently a given pair of origin and destination values can appear more than once, and you would like to get the total of the prices over all those pairs. You don't say whether you want to maintain the separate multiple observations, or reduce it to one observation per origin-destination pair.

    If you want to retain all of the observations:
    Code:
    by origin destination, sort: egen total_price = total(price)
    If you want to reduce the data set to one observation per origin-destination pair:
    Code:
    collapse (sum) price, by(origin destination)
    See [U] 11.5 for information on the -by varlist- construct. Also see the help for the -collapse- command. Both are extremely useful in data management and worth learning well.

    Comment


    • #3
      Professor Schechter,
      The first line is exactly what I was looking for, thank you!
      - Irina

      Comment


      • #4
        Thank you Schechter for those lines, it just saved my day.

        Comment


        • #5
          Hi!

          I tried both ways suggested by Schechter and they both worked as supposed to. Theres just something that I don't understand: Why do i get completely different means from resulting variables (About twice as big using egen vs. collapse)? Obviously you have a lot more observations with egen but the mean should be same or what am I not getting here?

          Thank you

          Br, Jussi

          Comment


          • #6
            Originally posted by Jussi Malila View Post
            Why do i get completely different means from resulting variables (About twice as big using egen vs. collapse)? . . .what am I not getting here?
            My guess is that you screwed up.

            But then, again, we're all just guessing here, right?

            .ÿ
            .ÿversionÿ16.1

            .ÿ
            .ÿclearÿ*

            .ÿ
            .ÿquietlyÿsysuseÿauto

            .ÿformatÿpriceÿ%8.0g

            .ÿ
            .ÿbysortÿforeignÿrep78:ÿegenÿfloatÿtotal_priceÿ=ÿtotal(price)

            .ÿbyÿforeignÿrep78:ÿgenerateÿbyteÿfirstÿ=ÿ_nÿ==ÿ1

            .ÿlistÿforeignÿrep78ÿtotal_priceÿifÿfirst,ÿnoobsÿsepby(foreign)ÿabbreviate(20)

            ÿÿ+--------------------------------+
            ÿÿ|ÿÿforeignÿÿÿrep78ÿÿÿtotal_priceÿ|
            ÿÿ|--------------------------------|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ1ÿÿÿÿÿÿÿÿÿÿ9129ÿ|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ2ÿÿÿÿÿÿÿÿÿ47741ÿ|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ3ÿÿÿÿÿÿÿÿ178391ÿ|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ4ÿÿÿÿÿÿÿÿÿ52934ÿ|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ5ÿÿÿÿÿÿÿÿÿÿ8409ÿ|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿ19162ÿ|
            ÿÿ|--------------------------------|
            ÿÿ|ÿÿForeignÿÿÿÿÿÿÿ3ÿÿÿÿÿÿÿÿÿ14486ÿ|
            ÿÿ|ÿÿForeignÿÿÿÿÿÿÿ4ÿÿÿÿÿÿÿÿÿ56353ÿ|
            ÿÿ|ÿÿForeignÿÿÿÿÿÿÿ5ÿÿÿÿÿÿÿÿÿ56634ÿ|
            ÿÿ|ÿÿForeignÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿ12990ÿ|
            ÿÿ+--------------------------------+

            .ÿ
            .ÿcollapseÿ(sum)ÿtotal_priceÿ=ÿprice,ÿby(foreignÿrep78)

            .ÿlistÿforeignÿrep78ÿtotal_price,ÿnoobsÿsepby(foreign)ÿabbreviate(20)

            ÿÿ+--------------------------------+
            ÿÿ|ÿÿforeignÿÿÿrep78ÿÿÿtotal_priceÿ|
            ÿÿ|--------------------------------|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ1ÿÿÿÿÿÿÿÿÿÿ9129ÿ|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ2ÿÿÿÿÿÿÿÿÿ47741ÿ|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ3ÿÿÿÿÿÿÿÿ178391ÿ|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ4ÿÿÿÿÿÿÿÿÿ52934ÿ|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ5ÿÿÿÿÿÿÿÿÿÿ8409ÿ|
            ÿÿ|ÿDomesticÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿ19162ÿ|
            ÿÿ|--------------------------------|
            ÿÿ|ÿÿForeignÿÿÿÿÿÿÿ3ÿÿÿÿÿÿÿÿÿ14486ÿ|
            ÿÿ|ÿÿForeignÿÿÿÿÿÿÿ4ÿÿÿÿÿÿÿÿÿ56353ÿ|
            ÿÿ|ÿÿForeignÿÿÿÿÿÿÿ5ÿÿÿÿÿÿÿÿÿ56634ÿ|
            ÿÿ|ÿÿForeignÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿ12990ÿ|
            ÿÿ+--------------------------------+

            .ÿ
            .ÿexit

            endÿofÿdo-file


            .

            Comment


            • #7
              Hi,

              So here is my code so you don't have to guess so much.. I'm using stata 16.1

              Code:
              bysort id: egen total_price = total(price)
              
              . sum total_price
              
              Variable | Obs Mean Std. Dev. Min Max
              -------------+---------------------------------------------------------
              total_price | 69,175 2578.87 2717.16 -3.5 42371.65
              
              
              collapse (sum) total_price = price, by(id)
              
              sum total_price
              
              Variable | Obs Mean Std. Dev. Min Max
              -------------+---------------------------------------------------------
              total_price | 20,357 1223.65 1539.305 -3.5 42371.65
              Last edited by Jussi Malila; 30 Oct 2020, 05:03.

              Comment


              • #8
                When you use -egen- you are getting the total_price recorded in every observation. If an id appears in 6 observations, that total appears 6 times and is counted 6 times when the mean is calculated. . If some other id appears in 15 observations, then that total price occurs 15 times and is counted 15 times in calculating the mean.

                When you use -collapse-, the data set is reduced to one observation per id. So each value of total_price is then counted once and only once in the mean when you run -summarize-.

                That's why the means are different.

                Comment


                • #9
                  As Clyde implies, insert a line of code to mark out one observation per id like shown in #6 above:
                  Code:
                  bysort id: egen float total_price = total(price)
                  by id: generate byte first = _n == 1
                  summarize total_price if first
                  local N = r(N)
                  
                  collapse (sum) total_price = price, by(id)
                  quietly count
                  assert r(N) == `N'

                  Comment


                  • #10
                    Thank you Clyde and Joseph - I got it sorted now and also understood where my thoughts went wrong.

                    Comment


                    • #11
                      Hello,

                      I want to ask a similar question in a different way.
                      I have three columns of data. Column 1 shows the originating city, column 2 shows the destination city, and column 3 tells the amount of trade.
                      I want to sum up and find the trade volumes between city pairs (e.g. the sum of the trade city X to city Y and city Y to city X)
                      How can I loop that?

                      Thanks
                      Gizem

                      Comment


                      • #12
                        Code:
                        egen double exports = total(trade), by(origin destination)
                        and so on. See also https://www.statalist.org/forums/help#stata on how to give data examples and https://www.stata-journal.com/articl...article=dm0043 which may help tpp.

                        Comment

                        Working...
                        X