Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sum of values of a column until a certain point

    Hello everybody!

    I would kindly need if you could help me with an issue.

    I have a cross-section dataset, in which I have an ID and a variable containing numerical values (let's say "Var_1"). I would like to generate a variable (let's say X) which provides the sum of the first three elements of Var_1, as depicted below:

    ID Var_1 X
    1 1 7
    2 4 7
    3 2 7
    4 5 7
    5 4 7
    ... ... 7

    I tried with egen X = sum(cond(ID<4,sum(Var_1),0)) but it doesn't work.. maybe it is trivial (and I apologize if this is the case), but I can't obtain the right code :/

    Many thanks in advance!

    Kodi

    Ps.: if I had to do the same thing but in a panel data, which would be the code? I guess it would be the same code but starting with "bys date:" (?)

  • #2
    With nothing else said, then this could be

    Code:
    sort id, stable
    by id : gen sumfirst3 = Var_1[1] + Var_1[2] + Var_1[3]
    In practice, don't you have a time or other identifier to define which observations you want to use?

    (No; that can't be quite right, because you are summing across identifiers. So, something is wrong here, perhaps even your example. Give real data, not fake, unless those really are real data.)
    Last edited by Nick Cox; 24 Nov 2017, 08:39.

    Comment


    • #3
      Thank you a lot for your reply Nick!

      I just figured out (now) that also this works: egen X = sum(Var_1) if ID<4.

      Cheers

      K

      Comment


      • #4
        I much doubt that that is any use. At most it puts a constant in a variable, except where it is missing, so it wouldn't even correspond to your example in #1.

        I can't give better advice without a realistic and plausible data example.

        Comment


        • #5
          The title of the post asks about 'summing values up to a certain point' but then the data example in the OP asks about summing up Var_1 in the first 3 obs to get a constant total value of 7 stored in X. Similar to Nick's reply, I find this confusing.
          I *think part the source of confusion for the OP lies in the difference between the sum() running sum/total function and the egen total() function which would produce the constant like in 'X' in the example. Regardless, in toying with this example, it raised a few other interesting questions/suggestions about how to nest the 'cond() function inside the sum() function ( or really any sort of condition inside a function ). I've provided some of this toy code that deconstructs the original question/example with some remarks. I think building the code in this way might be useful to the OP for understanding what these functions are doing both separately and together/ nested ( and some of the potential issues/snares of doing so).

          Code:
          clear
          input ID    Var_1    X
          1    1    7
          2    4    7
          3    2    7
          4    5    7
          5    4    7
          end
          
          
          sort ID, stable
          
          **First, note the difference in using total instead of sum with an if condition:
          g y = sum(Var_1) if ID<4
          egen y2 = total(Var_1) if ID<4
          list
          **both have missings where the if condition fails
          **sum is the running sum, total is the overall summed total.
          
          
          **using a condition inside sum()
              **you can do the following inside the parenthesis of a function, consider the differences in how teh sum function is either summing/adding the values contained in Var_1 versus summing the "1" value that the expression produces when the condition is true.
              
          g z1 = sum(Var_1<5)  //evaluates to 1 if true and sums that "1" across observations , so this is essentially a counter for number of times this is true 
          g z2 = Var_1[4] //value of Var1 in the 4th observation 
          g z3 = Var_1[_n<4] //evaluates to 1 if condition is true
          g z4 = sum(Var_1[ID<4]) //evaluates to 1 if condition is true and then sums across panel/dataset
          g z5 = sum(Var_1==4)  
          list ID V* z*
          
          
          
          **Now, as to why didnt your statement work? ( egen X = sum(cond(ID<4,sum(Var_1),0)) ), consider: 
          g cond1 = sum(cond(ID<4,sum(Var_1),0))
          
          **take the cond out of the sum() to see how it first sum's the entire column of Var_1 (inside the cond() function) and then it sum's across these values if the ID<4 condition is true. The zero never shows up in the values where ID is not less than 4 because you've wrapped the cond() in a sum() that would just keep adding up the column. So:
           
          g cond2a = cond(ID<4,sum(Var_1),0)
          list ID V* cond*
          
          
          *so if the ID<4 condition will evaluate to 1 and you only want to add across Var_1 if that condition is true, why not write the sum() function like 'cond3' below?  also, I suspect that the logic in your question (and how your variable 'X' is a constant means that you actually want the egen total() function, not sum()), so:
          g cond3 = sum(Var_1*(ID<4))
          egen cond4 = total(Var_1*(ID<4))
          list ID V* cond*
          
          
          **you also asked about expanding this across panels:
          
          keep ID Var_1
          expand 2
          bys ID : g date = _n
              replace Var_1 = 9 in 1 //just to mix it up
          bys date (ID): egen cond4 = total(Var_1*(ID<4))
          bys date (ID): gen cond4a = sum(Var_1*(ID<4)) //running sum version
          list
          
          **which is different from the following b/c of the missings
          bys date (ID): egen cond5 = total(Var_1) if ID<4
          bys date (ID): egen cond5a = sum(Var_1) if ID<4 //notice how the sum isnt a running total in this case in a bysort/panel context. This result always puzzles/suprises me and is one reason I tend to avoid if conditions with -bysort- because if I hadnt checked the output I might have assumed the outcome would look like 'cond4a' 
          list
          
          bys date (ID): egen true = total(Var_1<5) //how many times was this true in the panel?
          list

          ​​​​​​​
          Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

          Comment


          • #6
            Thank you Nick. Thank you Eric!

            I solved

            many thanks again!

            K

            Comment


            • #7
              I have respondents with two income variable a is for income 1 variable b is income 2 , i want to create a variable of total income of a and b together for every respondents how should i do it

              Comment


              • #8
                #7 The problem doesn't match the thread title, but consider

                Code:
                gen total = a + b

                Comment

                Working...
                X