Source code for the summarize command

Jan Kabatek

Join Date: Sep 2022

Posts: 12
#1

04 May 2025, 20:11

Hello Statalisters,

I am trying to locate the file that stores the code for the summarize command.
I looked into the folder ...\Stata18\ado\base\s but could not find it.

Would you happen to know where it is located?
Thank you,

Jan
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

04 May 2025, 20:32

If you run -which summarize-, Stata will tell you that it is a built-in command. That means that it's part of the executable and is not found in any ado-file. In other words, its source code is known only to StataCorp and is written in, I think, C, not in Stata.
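
The same check works for any command. The lines below are only a sketch: egen is used as an example of a command that I assume ships as an ado-file (as it does in recent releases), and the exact text Stata prints may differ across versions.

```stata
* built-in commands have no ado-file to inspect
which summarize

* for an ado-file command, -which- reports where the file lives
* (egen is assumed here to be ado-based)
which egen

* open that ado-file in the Viewer to read its source
viewsource egen.ado
```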

That said, you probably have some reason beyond curiosity for asking this question. What would you do with the information if you had it? Perhaps if you explain that, somebody can help you otherwise accomplish your underlying purpose.
Jan Kabatek

Join Date: Sep 2022

Posts: 12
#3

04 May 2025, 23:44

Thank you, Clyde!

I was hoping to tweak the command so that the stored results of summarize, detail would include the four smallest and four largest values (that are already printed on the screen as part of the detailed output).

I am aware of the workarounds that will yield the smallest and largest values (sorting the data / using egen functions), but I was hoping to use summarize because it's likely to be faster than the alternatives.
Joseph Coveney

Join Date: Apr 2014

Posts: 4420
#4

05 May 2025, 00:50

Originally posted by Jan Kabatek

. . . I was hoping to use summarize because it's likely to be faster than the alternatives.

Mata might be a reasonable compromise.

Code:

sysuse auto
summarize price, detail
mata: st_matrix("r(smallest)", sort(st_data(., "price"), 1)[1::4]); st_matrix("r(largest)", sort(sort(st_data(., "price"), -1)[1::4], 1))
return list
matrix list r(smallest)
matrix list r(largest)
Joseph Coveney

Join Date: Apr 2014

Posts: 4420
#5

05 May 2025, 01:01

I doubt that it would save much time, but it's a little bit cleaner to write:

Code:

mata: st_matrix("r(smallest)", sort(st_data(., "price"), 1)[1::4]); st_matrix("r(largest)", sort(st_data(., "price"), -1)[4::1])

in that it obviates the explicit second sort().
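
To illustrate why the [4::1] subscript works, here is a small sketch with a made-up vector v (not part of the auto data). It relies on Mata's range operator counting down when the first value exceeds the second.

```stata
mata:
// hypothetical five-element column vector
v = (30 \ 10 \ 50 \ 20 \ 40)

// descending sort: 50, 40, 30, 20, 10
d = sort(v, -1)

// rows 4, 3, 2, 1 of d, i.e. the four largest values in ascending order
d[4::1]
end
```

The subscript picks the first four rows of the descending vector in reverse, so no second sort() call is needed to put them in ascending order.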

daniel klein

Join Date: Mar 2014
Posts: 3859
#6

05 May 2025, 02:53

To spell things out, Stata's sort is as fast as or faster than Mata's sort(). The trick in #4 and #5 is sorting a single vector instead of the entire dataset, which might indeed be faster. The following picks up this idea* and generalizes it, allowing for sample restrictions and weights.

Code:

*! version 1.2.0  05may2025
*  Statalist edition
program summarize_details // , rclass
    
    version 16.1
    
    syntax varname(numeric)             ///
    [ if ] [ in ] [ aweight fweight ]   ///
    [ ,                                 ///
        NSMALLest(integer 4)            ///
        NLARGEst(integer 4)             ///
        Detail                ///  stripped
        *         /// options for summarize
    ]
    
    marksample touse
    
    summarize `varlist' if `touse' [`weight'`exp'] , detail `options'
    
    if ( !r(N) ) ///
        exit
    
    local nsmallest = min(max(1,`nsmallest'),r(N))
    local nlargest  = min(max(1,`nlargest') ,r(N))
    
    tempname x
    
    mata {
        
        `x' = sort(st_data(.,"`varlist'","`touse'"),1)
        
        st_matrix("r(smallest)",`x'[1::`nsmallest'])
        st_matrix("r(largest)", `x'[rows(`x')-`nlargest'+1::rows(`x')])
        
        
    }
    
end

/*  _________________________________________________________________________
                                                              Version history

1.2.0   05may2025   new options -nsmallest()- and -nlargest()-
                    posted to Statalist
1.1.1   05may2025   avoid unnecessary sorting
                    posted to Statalist
1.1.0   05may2025   use temporary name to preserve Mata objects
                    posted to Statalist
1.0.0   05may2025   posted to Statalist

_________________________________________________________________________  */

Example

Code:

. sysuse auto
(1978 automobile data)

. summarize_details price if foreign == 1

                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3748           3748
 5%         3798           3798
10%         3895           3895       Obs                  22
25%         4499           3995       Sum of wgt.          22

50%         5759                      Mean           6384.682
                        Largest       Std. dev.      2621.915
75%         7140           9690
90%         9735           9735       Variance        6874439
95%        11995          11995       Skewness       1.215236
99%        12990          12990       Kurtosis       3.555178

. matlist r(smallest)

             |        c1
-------------+----------
          r1 |      3748
          r2 |      3798
          r3 |      3895
          r4 |      3995

. matlist r(largest)

             |        c1
-------------+----------
          r1 |      9690
          r2 |      9735
          r3 |     11995
          r4 |     12990

.

* Edit: The latest version steals Nick Cox's idea in #7 and allows setting the (maximum) number of smallest and largest values returned.

Last edited by daniel klein; 05 May 2025, 03:35. Reason: version 1.1.0 uses a temporary name to preserve existing Mata objects; version 1.1.1 avoids the unnecessary sorting; version 1.2.0 adds new options


Nick Cox

Join Date: Mar 2014
Posts: 35721
#7

05 May 2025, 02:58

I recall extremes from SSC, but it focuses on listing values, not returning them.

Joseph Coveney showed very nicely how a little bit of Mata gets you quite a long way. It's not a criticism of his work to mention that the problem becomes messier if you want to add if or in qualifiers -- or, unlikely in practice but not impossible in principle, if you have only 1, 2, or 3 values to play with.

Why 4? Just because summarize uses 4?

I wrote this, but feel lukewarm about it.

Code:

*! 1.0.0 NJC 5 May 2025
program eachend, rclass
        version 9

        syntax varname(numeric) [if] [in] [, count(numlist >0 max=1)]
        
        if "`count'" == "" local count = 4
        
        marksample touse
                
        quietly {
                count if `touse'
                if r(N) == 0 error 2000
                
                local n = r(N)
                
                tempvar low high
                gen double `low' = .
                gen double `high' = .
        
                mata : work = st_data(., "`varlist'", "`touse'")
                mata : _sort(work, 1)
                mata : n = min((`count', rows(work)))
                mata : st_store((1::n), "`low'", work[1::n])
                mata : st_store((1::n), "`high'", work[rows(work)-n+1::rows(work)])
        }
        
        local where = cond(`n' < `count', `n' + 1, `count' + 1)
        forval i = 1/`count' {
                return scalar low`i' = `low'[`i']
                return scalar high`i' = `high'[`where' - `i']
        }
end

Code:

. sysuse auto, clear
(1978 automobile data)

. eachend mpg

. ret li

scalars:
              r(high4) =  34
               r(low4) =  14
              r(high3) =  35
               r(low3) =  14
              r(high2) =  35
               r(low2) =  12
              r(high1) =  41
               r(low1) =  12

. eachend mpg if foreign

. ret li

scalars:
              r(high4) =  31
               r(low4) =  18
              r(high3) =  35
               r(low3) =  17
              r(high2) =  35
               r(low2) =  17
              r(high1) =  41
               r(low1) =  14

. eachend mpg in 1/3

. ret li

scalars:
              r(high4) =  .
               r(low4) =  .
              r(high3) =  17
               r(low3) =  22
              r(high2) =  22
               r(low2) =  22
              r(high1) =  22
               r(low1) =  17

EDIT: daniel klein posted his while I was writing mine. Clearly output as a matrix and as a set of scalars are different possibilities.

Last edited by Nick Cox; 05 May 2025, 03:38.


Nick Cox

Join Date: Mar 2014

Posts: 35721
#8

05 May 2025, 05:23

#3 spells out that you want the 4 smallest and 4 largest values (beyond minimum and maximum) as extra saved results.

That still raises the question of how you want to use them.

https://journals.sagepub.com/doi/pdf...6867X221106436 may not be an answer to this question, but some of the small points of technique may still be of interest.

Last edited by Nick Cox; 05 May 2025, 05:26.
Jan Kabatek

Join Date: Sep 2022

Posts: 12
#9

06 May 2025, 02:39

Thank you, everyone! These solutions are very good... FWIW, Nick's code won the race in terms of the relative performance.

I should also explain what motivated this line of questioning: I am working in a secure data environment with very strict output export criteria. One of these criteria applies to sample averages, and it stipulates that a sample average cannot be exported from the environment if the two most extreme (absolute) values of the respective variable constitute more than 67% of the sum of all its (absolute) values.

The rule itself is rather idiosyncratic, but it is what it is. To comply with the rule, I need to capture the two most extreme values on both sides of the distribution (the third and fourth ones are redundant, and I only mentioned them to avoid getting into the weeds), and compute the statistic above.

As you can imagine, this is a tedious exercise that is ripe for automation. The code that you produced is a good starting point.

Though I will note that it is too bad that summarize does not store all the output - that would be the ideal situation, since the command is already used by outreg2 for producing summary statistics tables, which would make the automation a breeze.

Last edited by Jan Kabatek; 06 May 2025, 02:41.
Nick Cox

Join Date: Mar 2014

Posts: 35721
#10

06 May 2025, 02:54

That's interesting but still leaves open the question of whether you are doing this for multiple groups of observations within one variable and/or for multiple variables. If either, even a summarize with more saved results would not help you much, as you would still need to call it repeatedly. The same applies to commands such as those posted in this thread.
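
For groups within one variable, one way to avoid repeated calls might be to work with by-groups directly. The following is only a sketch, assuming the auto data, a variable with no missing values, and the rule described in #9; the variable names absx, top2, total, and share are made up for illustration.

```stata
sysuse auto, clear

* absolute values of the variable of interest (assumed nonmissing)
generate double absx = abs(mpg)

* sort by absolute value within group; in the last observation of each
* group, add up the two largest absolute values
bysort foreign (absx): generate double top2 = absx[_N] + absx[_N-1] if _n == _N & _N >= 2

* group-wise sum of the absolute values
by foreign: egen double total = total(absx)

* share of the two largest absolute values, one row per group
generate double share = top2 / total
list foreign share if !missing(share), noobs
```

Several variables would still need a loop around these lines, but each pass handles all groups at once.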
daniel klein

Join Date: Mar 2014

Posts: 3859
#11

06 May 2025, 03:23

I have no intention of getting lost off-topic, but ...

Originally posted by Jan Kabatek

FWIW, Nick's code won the race in terms of the relative performance.

Seriously? Are you trying to tell me that the program in #7 executed faster than the one-liner in #5? I don't think so. My own approach in #6 calls summarize because your initial query suggested you wanted something on top of that. Naturally, code that only extracts extreme values will be faster than code that computes means, standard deviations, and various other moments of the distribution.
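
Anyone who wants to check timing claims like these on their own machine could start from a rough sketch such as the one below. The expansion factor and the timer numbers are arbitrary choices, and the results will depend on the machine and the data.

```stata
sysuse auto, clear
expand 10000                      // arbitrary: inflate to 740,000 observations

timer clear

* timer 1: full detailed summary (moments, percentiles, extremes)
timer on 1
quietly summarize price, detail
timer off 1

* timer 2: copy one variable to Mata and sort it
* (this leaves a global Mata object named x behind)
timer on 2
mata: x = sort(st_data(., "price"), 1)
timer off 2

timer list
```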

Although I think the thread was quite instructive, I suggest you ask for what you want exactly next time.

daniel klein

Join Date: Mar 2014
Posts: 3859

#12

06 May 2025, 03:48

Originally posted by Jan Kabatek

To comply with the rule, I need to capture the two most extreme values on both sides of the distribution

I don't think so. The way you explain it and the formula you're showing imply that you need the two most extreme absolute values, which is a total of two values, not four.

Here's what I think you really want. Should perform reasonably fast.

Code:

program export_rule , rclass
    
    version 16.1
    
    syntax varname(numeric) [ if ] [ in ]
    
    marksample touse
    
    tempname x
    
    mata {
        
        `x' = abs(st_data(.,"`varlist'","`touse'"))
        
        if (rows(`x') < 2) ///
            exit(error(2000+rows(`x')))
        
        _sort(`x',1)
        
        st_numscalar("r(xp)",`x'[rows(`x')])
        st_numscalar("r(xq)",`x'[max((1,rows(`x')-1))])
        st_numscalar("r(sum)",colsum(`x'))
        
    }
    
    return scalar xp         = r(xp)
    return scalar xq         = r(xq)
    return scalar sum        = r(sum)
    return scalar proportion = (r(xp)+r(xq)) / r(sum)
    
end

Here are examples:

Code:

. sysuse auto
(1978 automobile data)

. export_rule mpg

. return list

scalars:
         r(proportion) =  .0482233502538071
                r(sum) =  1576
                 r(xq) =  35
                 r(xp) =  41

. export_rule mpg if foreign

. return list

scalars:
         r(proportion) =  .1394495412844037
                r(sum) =  545
                 r(xq) =  35
                 r(xp) =  41

. export_rule mpg in 1/3

. return list

scalars:
         r(proportion) =  .7213114754098361
                r(sum) =  61
                 r(xq) =  22
                 r(xp) =  22
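
The 67% rule from #9 could then be checked against r(proportion). This is a minimal sketch that assumes export_rule has already been defined as above; the messages are made up for illustration.

```stata
sysuse auto, clear
export_rule mpg

* threshold taken from the rule described in #9
if r(proportion) > 0.67 {
    display as error "two largest absolute values exceed 67% of the sum: do not export the mean"
}
else {
    display as text "share = " %6.4f r(proportion) ": the mean may be exported under the rule"
}
```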

Last edited by daniel klein; 06 May 2025, 04:46. Reason: A previously posted much simpler solution wasn't one


Jan Kabatek

Join Date: Sep 2022

Posts: 12
#13

06 May 2025, 21:21

Originally posted by daniel klein

Seriously? Are you trying to tell me that the program in #7 executed faster than the one-liner in #5. I don't think so.

Apologies if I ruffled some feathers—I did not mean to. The one-liner sorts the data twice, which is why it is slower than the other commands.

Your summarize_details is very good, Daniel. I appreciate that it derives the other statistics as well. FWIW, the export_rule proves the fastest, and the restriction to two absolute extremes is a nice touch.
Joseph Coveney

Join Date: Apr 2014

Posts: 4420
#14

07 May 2025, 00:38

Originally posted by Jan Kabatek

The one-liner sorts the data twice, which is why it is slower than the other commands.

I see that you've got a satisfactory solution from others, but for what it's worth, that can be easily rectified:

Code:

mata: Y=sort(st_data(., "price"), 1); n=rows(Y); st_matrix("r(smallest)", Y[1::2]); st_matrix("r(largest)", Y[n-1::n])

although it might be a little easier for a reader to parse as:

Code:

sysuse auto
summarize price, detail
mata {
    Y = sort(st_data(., "price"), 1)
    n = rows(Y)
    st_matrix("r(smallest)", Y[1::2])
    st_matrix("r(largest)", Y[n-1::n])
}
return list
matrix list r(smallest)
matrix list r(largest)

(For context, this latter includes the rest of the code.)

. . . the restriction to two absolute extremes is a nice touch.

This is also readily implemented as shown here.
daniel klein

Join Date: Mar 2014

Posts: 3859
#15

07 May 2025, 00:50

Originally posted by Jan Kabatek

Apologies if I ruffled some feathers—I did not mean to. The one-liner sorts the data twice, which is why it is slower than the other commands.

That's okay, I didn't want to appear overly sensitive, just surprised. I overlooked the second sort; you're right.

Originally posted by Jan Kabatek

FWIW, the export_rule proves the fastest, and the restriction to two absolute extremes is a nice touch.

Let's be clear: if the issue is as described in #9, then looking at absolute values is not just a "nice touch"; it's essential! In fact, looking at the smallest and largest raw values is misleading. Consider this dataset:

Code:

clear
input x
0
0
1
1
end

The smallest and largest raw values account for 50 percent of the sum, suggesting you can safely export the sample average. The two largest absolute values are identical to the sum, indicating the average can't be exported at all.

So no, it's not just a cosmetic detail. It fundamentally affects the results.
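
For anyone who wants to verify the two figures, here is a small sketch in Mata using the four-observation dataset above. The object names x and a are arbitrary.

```stata
clear
input x
0
0
1
1
end

mata:
x = st_data(., "x")

// smallest plus largest raw value, as a share of the sum
(min(x) + max(x)) / sum(x)

// two largest absolute values, as a share of the sum of absolute values
a = sort(abs(x), 1)
(a[rows(a)] + a[rows(a) - 1]) / sum(a)
end
```

The first expression should evaluate to .5 and the second to 1, matching the 50 percent and 100 percent figures above.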

Last edited by daniel klein; 07 May 2025, 01:03.
