  • Source code for the summarize command

    Hello Statalisters,

    I am trying to locate the file that stores the code for the summarize command.
    I looked into the folder ...\Stata18\ado\base\s but could not find it.

    Would you happen to know where it is located?
    Thank you,

    Jan

  • #2
    If you run -which summarize-, Stata will tell you that it is a built-in command. That means that it's part of the executable and is not found in any ado-file. In other words, its source code is known only to StataCorp and is written in, I think, C, not in Stata.
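
    As a quick illustration (a sketch added for clarity, not part of the original reply), -which- can be run on a built-in command and, for contrast, on a command implemented as an ado-file. egen is used here only as a familiar example of the latter:

    ```stata
    * built-in command: there is no ado-file to locate
    which summarize
    * ado-file command: -which- reports where the file lives
    which egen
    ```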

    That said, you probably have some reason beyond curiosity for asking this question. What would you do with the information if you had it? Perhaps if you explain that, somebody can help you otherwise accomplish your underlying purpose.



    • #3
      Thank you, Clyde!

      I was hoping to tweak the command so that the stored results of summarize, detail would include the four smallest and four largest values (that are already printed on the screen as part of the detailed output).

      I am aware of the workarounds that will yield the smallest and largest values (sorting the data / using egen functions), but I was hoping to use summarize because it's likely to be faster than the alternatives.
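
      For reference, the sorting workaround mentioned above might look like the following sketch. It reorders the whole dataset and assumes the variable has no missing values (missing values sort last and would show up among the "largest"):

      ```stata
      sysuse auto, clear
      * sorting the full dataset puts the extremes at both ends
      sort price
      * four smallest values
      list price in 1/4
      * four largest values (negative numbers count from the last observation)
      list price in -4/l
      ```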



      • #4
        Originally posted by Jan Kabatek
        . . . I was hoping to use summarize because it's likely to be faster than the alternatives.
        Mata might be a reasonable compromise.
        Code:
        sysuse auto
        summarize price, detail
        mata: st_matrix("r(smallest)", sort(st_data(., "price"), 1)[1::4]); st_matrix("r(largest)", sort(sort(st_data(., "price"), -1)[1::4], 1))
        return list
        matrix list r(smallest)
        matrix list r(largest)



        • #5
          I doubt that it would save much time, but it's a little bit cleaner to write:
          Code:
          mata: st_matrix("r(smallest)", sort(st_data(., "price"), 1)[1::4]); st_matrix("r(largest)", sort(st_data(., "price"), -1)[4::1])
          in that it obviates the explicit second sort().
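
          To see why the descending subscript works, here is a small sketch on a made-up vector (the values are purely illustrative). A range subscript such as 4::1 selects rows 4, 3, 2, 1 in that order, so applying it to a descending sort returns the four largest values in ascending order:

          ```stata
          mata:
          v = (30 \ 10 \ 50 \ 20 \ 40)
          // descending sort, then rows 4, 3, 2, 1: gives 20, 30, 40, 50
          sort(v, -1)[4::1]
          // the same values from an ascending sort, indexing from the bottom
          sort(v, 1)[rows(v)-3::rows(v)]
          end
          ```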



          • #6
            To spell things out, Stata's sort is as fast as or faster than Mata's sort(). The trick in #4 and #5 is sorting a single vector instead of the entire dataset, which might indeed be faster. The following picks up this idea* and generalizes it, allowing for sample restrictions and weights.

            Code:
            *! version 1.2.0  05may2025
            *  Statalist edition
            program summarize_details // , rclass
                
                version 16.1
                
                syntax varname(numeric)             ///
                [ if ] [ in ] [ aweight fweight ]   ///
                [ ,                                 ///
                    NSMALLest(integer 4)            ///
                    NLARGEst(integer 4)             ///
                    Detail                ///  stripped
                    *         /// options for summarize
                ]
                
                marksample touse
                
                summarize `varlist' if `touse' [`weight'`exp'] , detail `options'
                
                if ( !r(N) ) ///
                    exit
                
                local nsmallest = min(max(1,`nsmallest'),r(N))
                local nlargest  = min(max(1,`nlargest') ,r(N))
                
                tempname x
                
                mata {
                    
                    `x' = sort(st_data(.,"`varlist'","`touse'"),1)
                    
                    st_matrix("r(smallest)",`x'[1::`nsmallest'])
                    st_matrix("r(largest)", `x'[rows(`x')-`nlargest'+1::rows(`x')])
                    
                    
                }
                
            end
            
            /*  _________________________________________________________________________
                                                                          Version history
            
            1.2.0   05may2025   new options -nsmallest()- and -nlargest()-
                                posted to Statalist
            1.1.1   05may2025   avoid unnecessary sorting
                                posted to Statalist
            1.1.0   05may2025   use temporary name to preserve Mata objects
                                posted to Statalist
            1.0.0   05may2025   posted to Statalist
            
            _________________________________________________________________________  */
            Example
            Code:
            . sysuse auto
            (1978 automobile data)
            
            . summarize_details price if foreign == 1
            
                                        Price
            -------------------------------------------------------------
                  Percentiles      Smallest
             1%         3748           3748
             5%         3798           3798
            10%         3895           3895       Obs                  22
            25%         4499           3995       Sum of wgt.          22
            
            50%         5759                      Mean           6384.682
                                    Largest       Std. dev.      2621.915
            75%         7140           9690
            90%         9735           9735       Variance        6874439
            95%        11995          11995       Skewness       1.215236
            99%        12990          12990       Kurtosis       3.555178
            
            . matlist r(smallest)
            
                         |        c1
            -------------+----------
                      r1 |      3748
                      r2 |      3798
                      r3 |      3895
                      r4 |      3995
            
            . matlist r(largest)
            
                         |        c1
            -------------+----------
                      r1 |      9690
                      r2 |      9735
                      r3 |     11995
                      r4 |     12990
            
            .

            * Edit: The latest version steals Nick Cox's idea in #7 and allows setting the (maximum) number of smallest and largest values returned.
            Last edited by daniel klein; 05 May 2025, 03:35. Reason: version 1.1.0 uses a temporary name to preserve existing Mata objects; version 1.1.1 avoids the unnecessary sorting; version 1.2.0 adds new options
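
            The point made at the top of #6, that sorting a single vector may be cheaper than sorting the entire dataset, can be checked with a rough timing sketch such as the one below. The simulated data, the number of observations, and the number of extra variables are all assumptions chosen for illustration; actual timings will depend on the machine and on the width of the dataset.

            ```stata
            clear
            set obs 1000000
            set seed 12345
            generate double x = rnormal()
            * extra variables make the dataset wider, as real data usually are
            forvalues j = 1/10 {
                generate double y`j' = runiform()
            }

            timer clear
            * timer 1: copy one variable into Mata and sort only that vector
            timer on 1
            mata: v = sort(st_data(., "x"), 1)
            timer off 1
            * timer 2: sort the whole dataset by the same variable
            timer on 2
            sort x
            timer off 2
            timer list
            mata: mata drop v
            ```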



            • #7
              I recollect extremes from SSC, but it focuses on listing values, not returning them.



              Joseph Coveney showed very nicely how a little bit of Mata gets you quite a long way. It's not a criticism of his work to mention that the problem becomes messier if you want to add if or in qualifiers -- or, unlikely in practice but not impossible in principle, if you have only 1, 2, or 3 values to play with.
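
              A minimal sketch of what handling those two complications involves, assuming a hypothetical indicator variable named touse and a deliberately small subsample of the auto data: the indicator is passed to st_data() as its third argument, and the number of values requested is capped at the number of rows available. The sketch still assumes at least one observation is selected.

              ```stata
              sysuse auto, clear
              * hypothetical selection: a deliberately small subsample
              generate byte touse = foreign == 1 & _n <= 55
              mata:
              // only observations with touse != 0 are read
              x = sort(st_data(., "mpg", "touse"), 1)
              // never ask for more values than there are rows
              k = min((4, rows(x)))
              x[1::k]
              x[rows(x)-k+1::rows(x)]
              end
              ```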

              Why 4? Just because summarize uses 4?

              I wrote this, but feel lukewarm about it.

              Code:
              *! 1.0.0 NJC 5 May 2025
              program eachend, rclass
                      version 9
              
                      syntax varname(numeric) [if] [in] [, count(numlist >0 max=1)]
                      
                      if "`count'" == "" local count = 4
                      
                      marksample touse
                              
                      quietly {
                              count if `touse'
                              if r(N) == 0 error 2000
                              
                              local n = r(N)
                              
                              tempvar low high
                              gen double `low' = .
                              gen double `high' = .
                      
                              mata : work = st_data(., "`varlist'", "`touse'")
                              mata : _sort(work, 1)
                              mata : n = min((`count', rows(work)))
                              mata : st_store((1::n), "`low'", work[1::n])
                              mata : st_store((1::n), "`high'", work[rows(work)-n+1::rows(work)])
                      }
                      
                      local where = cond(`n' < `count', `n' + 1, `count' + 1)
                      forval i = 1/`count' {
                              return scalar low`i' = `low'[`i']
                              return scalar high`i' = `high'[`where' - `i']
                      }
              end
              Code:
              . sysuse auto, clear
              (1978 automobile data)
              
              . eachend mpg
              
              . ret li
              
              scalars:
                            r(high4) =  34
                             r(low4) =  14
                            r(high3) =  35
                             r(low3) =  14
                            r(high2) =  35
                             r(low2) =  12
                            r(high1) =  41
                             r(low1) =  12
              
              . eachend mpg if foreign
              
              . ret li
              
              scalars:
                            r(high4) =  31
                             r(low4) =  18
                            r(high3) =  35
                             r(low3) =  17
                            r(high2) =  35
                             r(low2) =  17
                            r(high1) =  41
                             r(low1) =  14
              
              . eachend mpg in 1/3
              
              . ret li
              
              scalars:
                            r(high4) =  .
                             r(low4) =  .
                            r(high3) =  17
                             r(low3) =  22
                            r(high2) =  22
                             r(low2) =  22
                            r(high1) =  22
                             r(low1) =  17
              EDIT: daniel klein posted his while I was writing mine. Clearly output as a matrix and as a set of scalars are different possibilities.
              Last edited by Nick Cox; 05 May 2025, 03:38.



              • #8
                #3 spells out that you want the 4 smallest and 4 largest values (beyond minimum and maximum) as extra saved results.

                That still raises the question of how you want to use them.

                https://journals.sagepub.com/doi/pdf...6867X221106436 may not be an answer to this question, but some of the small points of technique may still be of interest.
                Last edited by Nick Cox; 05 May 2025, 05:26.



                • #9
                  Thank you, everyone! These solutions are very good... FWIW, Nick's code won the race in terms of the relative performance.

                  I should also explain what motivated this line of questioning: I am working in a secure data environment with very strict output export criteria. One of these criteria applies to sample averages, and it stipulates that a sample average cannot be exported from the environment if the two most extreme (absolute) values of the respective variable constitute more than 67% of the sum of all its (absolute) values.


                  [Image: Screenshot 2025-05-06 18.40.28.png, showing the formula for the export rule]



                  The rule itself is rather idiosyncratic, but it is what it is. To comply with the rule, I need to capture the two most extreme values on both sides of the distribution (the third and fourth ones are redundant, and I only mentioned them to avoid getting into the weeds), and compute the statistic above.
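
                  Done by hand, the check described above might look like the following sketch. It uses mpg from the auto data purely as a stand-in variable, reorders the dataset, and assumes there are no missing values; the variable name absx is made up for the example.

                  ```stata
                  sysuse auto, clear
                  * absolute values of the variable of interest
                  generate double absx = abs(mpg)
                  * descending sort puts the two most extreme values first
                  gsort -absx
                  * -meanonly- is enough to get r(sum)
                  summarize absx, meanonly
                  display "share of the two largest |x|: " (absx[1] + absx[2]) / r(sum)
                  ```

                  The average would fail the export criterion whenever the displayed share exceeds 0.67.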

                  As you can imagine, this is a tedious exercise that is ripe for automation. The code that you produced is a good starting point.

                  Though I will note that it is too bad that summarize does not store all the output - that would be the ideal situation, since the command is already used by outreg2 for producing summary statistics tables, which would make the automation a breeze.
                  Last edited by Jan Kabatek; 06 May 2025, 02:41.



                  • #10
                    That's interesting but still leaves open the question of whether you are doing this for multiple groups of observations within one variable and/or for multiple variables. If either, even a summarize with more saved results would not help you much, as you would still need to call it repeatedly. The same applies to commands such as those posted in this thread.
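
                    To make the point concrete, here is a sketch of the kind of loop that would be needed in either case. The variables and the grouping by foreign are illustrative choices, and summarize stands in for whichever command is actually used:

                    ```stata
                    sysuse auto, clear
                    * one call per variable and per group, whatever the command stores
                    foreach v of varlist price mpg weight {
                        forvalues f = 0/1 {
                            quietly summarize `v' if foreign == `f', detail
                            display "`v', foreign == `f': min " r(min) ", max " r(max)
                        }
                    }
                    ```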



                    • #11
                      I have no intention of getting lost off-topic but ...
                      Originally posted by Jan Kabatek
                      FWIW, Nick's code won the race in terms of the relative performance.
                      Seriously? Are you trying to tell me that the program in #7 executed faster than the one-liner in #5? I don't think so. My own approach in #6 calls summarize because your initial query suggested you wanted something on top of that. Naturally, code that only extracts extreme values will be faster than code that computes means, standard deviations, and various other moments of the distribution.

                      Although I think the thread was quite instructive, I suggest you ask for what you want exactly next time.



                      • #12
                        Originally posted by Jan Kabatek
                        To comply with the rule, I need to capture the two most extreme values on both sides of the distribution
                        I don't think so. The way you explain it and the formula you're showing imply that you need the two most extreme absolute values, which is a total of two values, not four.

                        Here's what I think you really want. Should perform reasonably fast.
                        Code:
                        program export_rule , rclass
                            
                            version 16.1
                            
                            syntax varname(numeric) [ if ] [ in ]
                            
                            marksample touse
                            
                            tempname x
                            
                            mata {
                                
                                `x' = abs(st_data(.,"`varlist'","`touse'"))
                                
                                if (rows(`x') < 2) ///
                                    exit(error(2000+rows(`x')))
                                
                                _sort(`x',1)
                                
                                st_numscalar("r(xp)",`x'[rows(`x')])
                                st_numscalar("r(xq)",`x'[max((1,rows(`x')-1))])
                                st_numscalar("r(sum)",colsum(`x'))
                                
                            }
                            
                            return scalar xp         = r(xp)
                            return scalar xq         = r(xq)
                            return scalar sum        = r(sum)
                            return scalar proportion = (r(xp)+r(xq)) / r(sum)
                            
                        end
                        Here are examples:
                        Code:
                        . sysuse auto
                        (1978 automobile data)
                        
                        . export_rule mpg
                        
                        . return list
                        
                        scalars:
                                 r(proportion) =  .0482233502538071
                                        r(sum) =  1576
                                         r(xq) =  35
                                         r(xp) =  41
                        
                        . export_rule mpg if foreign
                        
                        . return list
                        
                        scalars:
                                 r(proportion) =  .1394495412844037
                                        r(sum) =  545
                                         r(xq) =  35
                                         r(xp) =  41
                        
                        . export_rule mpg in 1/3
                        
                        . return list
                        
                        scalars:
                                 r(proportion) =  .7213114754098361
                                        r(sum) =  61
                                         r(xq) =  22
                                         r(xp) =  22
                        Last edited by daniel klein; 06 May 2025, 04:46. Reason: A previously posted much simpler solution wasn't one
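
                        As a follow-up sketch, the three proportions from the examples above can be compared with the 67% threshold described in #9. The numbers are copied from the output shown, and the labels "blocked" and "exportable" are only illustrative wording:

                        ```stata
                        * proportions copied from the -return list- output above
                        foreach p in .0482233502538071 .1394495412844037 .7213114754098361 {
                            display %6.4f `p' "  " cond(`p' > 0.67, "above 67%: blocked", "at or below 67%: exportable")
                        }
                        ```

                        Only the third case, -mpg in 1/3-, exceeds the threshold.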



                        • #13
                          Originally posted by daniel klein
                          Seriously? Are you trying to tell me that the program in #7 executed faster than the one-liner in #5? I don't think so.
                          Apologies if I ruffled some feathers—I did not mean to. The one-liner sorts the data twice, which is why it is slower than the other commands.

                          Your summarize_details is very good, Daniel. I appreciate that it derives the other statistics as well. FWIW, the export_rule proves the fastest, and the restriction to two absolute extremes is a nice touch.



                          • #14
                            Originally posted by Jan Kabatek
                            The one-liner sorts the data twice, which is why it is slower than the other commands.
                            I see that you've got a satisfactory solution from others, but for what it's worth, that can be easily rectified:
                            Code:
                            mata: Y=sort(st_data(., "price"), 1); n=rows(Y); st_matrix("r(smallest)", Y[1::2]); st_matrix("r(largest)", Y[n-1::n])
                            although it might be a little easier for a reader to parse as:
                            Code:
                            sysuse auto
                            summarize price, detail
                            mata {
                                Y = sort(st_data(., "price"), 1)
                                n = rows(Y)
                                st_matrix("r(smallest)", Y[1::2])
                                st_matrix("r(largest)", Y[n-1::n])
                            }
                            return list
                            matrix list r(smallest)
                            matrix list r(largest)
                            (For context, this latter includes the rest of the code.)

                            . . . the restriction to two absolute extremes is a nice touch.
                            This is also readily implemented as shown here.
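
                            A minimal sketch of one way to do this in the same style follows. It is not necessarily the code the post points to, it assumes price has no missing values, and the result name r(proportion) is borrowed from #12 for convenience:

                            ```stata
                            sysuse auto, clear
                            * sort absolute values once, then take the last two rows
                            mata {
                                Y = sort(abs(st_data(., "price")), 1)
                                n = rows(Y)
                                st_numscalar("r(proportion)", (Y[n] + Y[n-1]) / colsum(Y))
                            }
                            return list
                            ```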



                            • #15
                              Originally posted by Jan Kabatek
                              Apologies if I ruffled some feathers—I did not mean to. The one-liner sorts the data twice, which is why it is slower than the other commands.
                              That's okay, I didn't want to appear overly sensitive, just surprised. I overlooked the second sort; you're right.


                              Originally posted by Jan Kabatek
                              FWIW, the export_rule proves the fastest, and the restriction to two absolute extremes is a nice touch.
                              Let's be clear: if the issue is as described in #9, then looking at absolute values is not just a "nice touch"; it's essential! In fact, looking at the smallest and largest raw values is misleading. Consider this dataset
                              Code:
                              clear
                              input x
                              0
                              0
                              1
                              1
                              end
                              The smallest and largest raw values account for 50 percent of the sum, suggesting you can safely export the sample average. The two largest absolute values are identical to the sum, indicating the average can't be exported at all.

                              So no, it's not just a cosmetic detail. It fundamentally affects the results.
                              Last edited by daniel klein; 07 May 2025, 01:03.
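
                              Both calculations can be reproduced on this dataset with a short sketch (the Mata names a and n are arbitrary):

                              ```stata
                              clear
                              input x
                              0
                              0
                              1
                              1
                              end
                              * smallest plus largest raw value, relative to the sum: .5
                              summarize x, meanonly
                              display "raw min + max share: " (r(min) + r(max)) / r(sum)
                              * two largest absolute values, relative to the sum of absolute values: 1
                              mata: a = sort(abs(st_data(., "x")), 1); n = rows(a); (a[n] + a[n-1]) / colsum(a)
                              ```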
