Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Strange functionality for egen rowtotal?

    Hi
    I was using egen function rowtotal, when I discovered the following:
    Code:
    . clear
    
    . input x1 x2 x3
    
                x1         x2         x3
      1. 1 2 3
      2. . 2 3
      3. . . 3
      4. .a .b .c
      5. 1 2 .a
      6. .c 2 3
      7. end
    
    . 
    . /* Help egen: rowtotal creates the (row) sum of the variables in varlist, 
    > treating missing as 0.*/
    . egen test1 = rowtotal(x*)
    
    . /* Help egen: If missing is specified and all values in varlist are missing for 
    > an observation, newvar is set to missing
    > */
    . egen test2 = rowtotal(x*), missing
    (1 missing value generated)
    
    . list
    
         +------------------------------+
         | x1   x2   x3   test1   test2 |
         |------------------------------|
      1. |  1    2    3       6       6 |
      2. |  .    2    3       5       5 |
      3. |  .    .    3       3       3 |
      4. | .a   .b   .c       0       . |
      5. |  1    2   .a       3       3 |
         |------------------------------|
      6. | .c    2    3       5       5 |
         +------------------------------+
    
    . 
    end of do-file
    I can see in the documentation that rowtotal with option missing require all variables to be missing before the new variables becomes missing.
    Just as shown above.
    In my world however rowtotal with option missing would be a far more interesting tool if totals only were calculated when all variables were nonmissing ie that test2 above is nonmissing only at row 1.

    Is this "almost a bug"? Or is it something for the wishlist?
    An extra missingOne option could be a solution?
    Kind regards

    nhb

  • #2
    Code:
    gen test3=x1+x2+x3
    will give you what you want, so I do not see this as a bug in any sense; you can, of course, put it on a wishlist

    Comment


    • #3
      Hi Rich
      Thank you for the suggestion.
      I know that, but what I like from rowtotal is that the argument can be a varlist.
      Expanding "+" is cumbersome and error prone when you have many variables to add.
      Also it is quite annoying when changes are to be made in the list of variables to sum.
      Kind regards

      nhb

      Comment


      • #4
        Of course, one can easily remedy this, eg by:
        Code:
        program define rowsum
            syntax varlist(numeric), Generate(name)
            
            mata st_local("__add", invtokens(tokens(`"`varlist'"'), " + "))
            generate `generate' = `__add'
        end
        This can easily be made into a egen function.
        I my case I need to document externally what I do and so the above code is easier to insert in a do file
        Kind regards

        nhb

        Comment


        • #5
          Apart from writing your own egen function to work as if you wish -- but it has already been done -- see rsum2() from 2002 in egenmore (SSC) -- you can use rowtotal() and rowmiss() in conjunction, i.e. replace row totals with missing if any value was missing.
          Last edited by Nick Cox; 22 Oct 2016, 07:52.

          Comment


          • #6
            You could also just write:
            egen test2 = rowtotal(x*) if !missing(x1-x3)

            Kind regards,
            Henrik

            Comment


            • #7
              #6 Not a good idea. That tests whether the difference x1 - x3 is missing. It's not a way to feed a varlist to missing(). You need commas as separators.

              Comment


              • #8
                Hi Nick
                I agree. There are many ways to solve this.
                But my key point was that rowtotal would be a better tool if it worked the way I described.
                And then there is rsum2 with option anymiss.
                It does what I need.
                I'll use rsum2 in my code, thank you very much.

                @Henrik L. Lolle: thank you very much for your suggestion, but as Nick says you cannot use varlists in the function missing.

                I still think that the standard egen "row functions" would be better off with options anymiss and allmiss instead of just missing.
                It shouldn't be hard to do, and sometimes small changes like this leads to better Stata code at the users.

                Anyway thank you all for your help
                Kind regards

                nhb

                Comment


                • #9
                  Actually, I recall that many years back there was a discussion about how rowtotal(), which was then called rowsum(), should handle missing values. Niels' view is one way of doing it, but arguments can be made in support of the existing approach. What is most useful to one person may be less useful to others. rowtotal() has been working this way for a long time now without complaints, and given how often it is used, I'm guessing most people like it the way it is. I know I do.

                  Comment


                  • #10
                    @Clyde Schechter I guess it is not a matter of either/or as rsum2 suggests
                    Kind regards

                    nhb

                    Comment


                    • #11
                      Hi,

                      Okay, my suggestion (#6) wasn’t good. I can see that. I should hold back a little until I’m a more experienced Stata user. For years I have used Stata for special procedures and used SPSS for everything else. I’m now trying to convert myself completely to Stata, but it is not that easy. And in this matter SPSS is actually the smarter one. You can, inside the variable generation command, tell SPSS exactly how many missing that are allowed before calculating the row statistic. I still prefer Stata over SPSS, though.

                      Kind regards from a newbie,
                      Henrik

                      Comment


                      • #12
                        Hi Henrik
                        I'm glad you tried.
                        Sometimes newbies see things in a new way
                        Kind regards

                        nhb

                        Comment


                        • #13
                          One fundamental is that StataCorp will not change the behaviour of this function! Clearly the existing option missing (which as Clyde says was introduced a while back in response to user pressure) will continue to have its present meaning.

                          All parties might be satisfied if there was a tunable option with arguments

                          MISSING2(arg)

                          with possible arguments

                          all return missing if all variables missing

                          any return missing if any variables missing

                          # return missing if at least # variables missing

                          But no-one has mentioned the main reason for the original behaviour. In Stata it is standard that sums are returned as 0 if all arguments are missing to preserve the fundamental identity

                          sum of group 1 + sum of group 2 + ... = sum of (group 1 plus group 2 plus ...)

                          Thus consider

                          Code:
                          clear 
                          input id group y 
                          1 1    .
                          2 1    .
                          3 1    . 
                          4 1    .
                          5 2    1  
                          6 2    2
                          7 2    3 
                          end 
                          
                          bysort group (id) : gen sum1 = sum(y)
                          by group : replace sum1 = sum1[_N]
                          gen sum2 = sum(y)
                          replace sum2 = sum2[_N]
                          list , sepby(group)
                          
                               +------------------------------+
                               | id   group   y   sum1   sum2 |
                               |------------------------------|
                            1. |  1       1   .      0      6 |
                            2. |  2       1   .      0      6 |
                            3. |  3       1   .      0      6 |
                            4. |  4       1   .      0      6 |
                               |------------------------------|
                            5. |  5       2   1      6      6 |
                            6. |  6       2   2      6      6 |
                            7. |  7       2   3      6      6 |
                               +------------------------------+

                          Users are always entitled to say that they don't want this behaviour for specific problems, but it's Stata's starting point, long before anyone wrote egen functions for any kind of total, observation-wise or row-wise.

                          Comment


                          • #14
                            Hi Nick
                            Thank you. We cannot argue against standards.

                            However I must say I find this one quite irrational.
                            In almost all cases I can bring to mind (And regrettably my mind can be very limited ): If I do not have full information I do not want a derived variable to have a value other than missing.
                            Take for example bmi. I don't have both height and weight I do wish bmi to be missing (Bad example, I know).

                            In my case I have several values that all must be present before I can use their sum for further calculations.

                            Look at the code below:
                            Code:
                            cls
                            clear
                            input x1 x2 x3
                            x1         x2         x3
                            1 2 3
                            . 2 3
                            . . 3
                            .a .b .c
                            1 2 .a
                            .c 2 3
                            end
                            
                            egen rt = rowtotal(x*)
                            generate sum = x1 + x2 + x3
                            I find it inconsistent that the variables rt and sum aren't leading to the same result as is seen below
                            Code:
                            . list
                            
                                 +-------------------------+
                                 | x1   x2   x3   rt   sum |
                                 |-------------------------|
                              1. |  1    2    3    6     6 |
                              2. |  .    2    3    5     . |
                              3. |  .    .    3    3     . |
                              4. | .a   .b   .c    0     . |
                              5. |  1    2   .a    3     . |
                                 |-------------------------|
                              6. | .c    2    3    5     . |
                                 +-------------------------+
                            Nice to know we have functions like rowtotal.
                            One day I might even find a use for it
                            Kind regards

                            nhb

                            Comment


                            • #15
                              Just to provide an alternative point of view, in my experience the way rowtotal handles missings is the best thing about the command. In my line of work, I often have to figure out whether at least one variable is positive. Doing this with if-statements is cumbersome, because missing is considered positive. Doing this with rowtotal is straightforward and crucially, intuitive for the next person to read and use my code.

                              Fun fact, just yesterday someone asked for exactly this functionality: http://www.statalist.org/forums/foru...missing-values

                              Comment

                              Working...
                              X