Wish List for Stata 20

Julian Reif

Join Date: Dec 2018

Posts: 49
#31

14 Jul 2025, 14:07

(1) Add native support for reading Parquet files.

(2) Add built-in support in ivregress for the Sanderson-Windmeijer (SW) first-stage test of weak identification when there are multiple endogenous regressors. (This test is currently available in ivreg2.)
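For reference, here is a minimal sketch of how the SW statistics can be obtained today with the community-contributed ivreg2. The model is purely illustrative: the choice of endogenous regressors and instruments in the auto data is an assumption made only to show the syntax. It also assumes ivreg2 and ranktest have been installed from SSC.

```stata
* Illustrative only: variables are chosen to show syntax, not a sensible model
* Assumes: ssc install ivreg2   and   ssc install ranktest
sysuse auto, clear

* Two endogenous regressors (mpg, weight) and three excluded instruments;
* the first option requests first-stage output and diagnostics
ivreg2 price (mpg weight = length turn displacement), first
```

With the first option, recent versions of ivreg2 should report an SW F statistic for each endogenous regressor among the first-stage diagnostics.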

Associate Professor of Finance and Economics
University of Illinois
www.julianreif.com
2 likes
Tommy Morgan

Join Date: Aug 2022

Posts: 4
#32

16 Jul 2025, 08:06

Allow/create a few fill patterns for graphs!
3 likes

ericmelse

Join Date: May 2014
Posts: 437

#33

20 Jul 2025, 09:32

The sum command with the , d option displays additional statistics in the Results window in this standard format:

Code:

. sysuse auto, clear
. sum mpg, d

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  74
25%           18             14       Sum of wgt.          74

50%           20                      Mean            21.2973
                        Largest       Std. dev.      5.785503
75%           25             34
90%           29             35       Variance       33.47205
95%           34             35       Skewness       .9487176
99%           41             41       Kurtosis       3.975005

However, the display above does not include the values of r(sum), r(min), and r(max), which are nonetheless stored:

Code:

. return list

scalars:
                  r(N) =  74
              r(sum_w) =  74
               r(mean) =  21.2972972972973
                r(Var) =  33.47204738985561
                 r(sd) =  5.785503209735141
           r(skewness) =  .9487175964588155
           r(kurtosis) =  3.97500459645325
                r(sum) =  1576
                r(min) =  12
                r(max) =  41
                 r(p1) =  12
                 r(p5) =  14
                r(p10) =  14
                r(p25) =  18
                r(p50) =  20
                r(p75) =  25
                r(p90) =  29
                r(p95) =  34
                r(p99) =  41

There is ample room in the display to include them.

My proposal to include them in the window report is:

Code:

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12       Obs                  74
 5%           14             12       Sum of wgt.          74
10%           14             14       Mean            21.2973
25%           18             14       Std. dev.      5.785503

50%           20                      Variance       33.47205
                        Largest       Skewness       .9487176
75%           25             34       Kurtosis       3.975005
90%           29             35       Sum                1576
95%           34             35       Min                  12
99%           41             41       Max                  41

I suppose the above does not meet the criterion of being the next rocket-science contribution to the field of (medical) statistics or econometrics. Still, running sum, d is a daily routine, and having all results available on the fly might be of use to many Stata users.
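Until then, the three values can be pulled from the stored results by hand. This is a minimal sketch that uses only the r() names shown in the return list output above:

```stata
sysuse auto, clear
quietly summarize mpg, detail

* These stored results are not part of the detail display
display "Sum = " r(sum) "   Min = " r(min) "   Max = " r(max)
```

Given the return list output above, this should show 1576, 12, and 41.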

http://publicationslist.org/eric.melse


Clyde Schechter

Join Date: Apr 2014

Posts: 30187
#34

20 Jul 2025, 11:23

I agree with adding the sum to the statistics reported in the Results window. But min and max are redundant: Stata already shows the four smallest and four largest values, so the first and last of those, respectively, are the values of the min and max.
6 likes
Fahad Mirza

Join Date: Sep 2018

Posts: 248
#35

21 Jul 2025, 10:32

Ben Jann wrote a module called moremata, which interestingly includes a routine to calculate percentiles. What differentiates it from the existing percentile calculation in Stata is the option to choose among multiple methods. Judging from the code for mm_quantile() within the module, it apparently allows 12 different definitions for computing percentiles. Of these 12 definitions, Stata uses definition 2 (the default) and, alternatively, definition 6. Python, R, and other programs use a different definition.

Just a thought, but it would be nice to include all the definitions to help people replicate results from other programs. These definitions also apply to the calculation of the median and the interquartile range (IQR).
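As a rough illustration, the two definitions already built into official Stata can be compared with _pctile and its altdef option. That they correspond to moremata's definitions 2 and 6 is the claim made above and is not verified by this sketch.

```stata
sysuse auto, clear

* Default definition
_pctile price, percentiles(10 25 50 75 90)
display "default: " r(r1) "  " r(r2) "  " r(r3) "  " r(r4) "  " r(r5)

* Alternative definition, requested with the altdef option
_pctile price, percentiles(10 25 50 75 90) altdef
display "altdef:  " r(r1) "  " r(r2) "  " r(r3) "  " r(r4) "  " r(r5)
```

Any percentile that differs between the two lines is one where results from another program may not match, depending on which definition that program uses.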

Link to moremata: https://ideas.repec.org/c/boc/bocode/s455001.html

Definitions listed below:
Fahad Mirza

Join Date: Sep 2018

Posts: 248
#36

21 Jul 2025, 10:44

For some time, I have been suggesting that the ability to read raster files in Stata would be useful. Many economists (at least in my circle) look for ways to build routines that run their whole analysis from a single program, especially for geospatial analysis. Recently, a team from Xiamen University and Hefei University released a package called readraster (link below) that uses Java integration to allow raster analysis in Stata.

Maybe, if it is worth the time and effort, the team at Stata could consider building on this? It would be useful, I believe.

GitHub - kerrydu/readraster: read and process raster data in Stata

https://github.com


Last edited by Fahad Mirza; 21 Jul 2025, 10:48.
1 like
Bruce Weaver

Join Date: May 2014

Posts: 1142
#37

21 Jul 2025, 14:07

I would (still) like to see the documentation for ttest updated to clarify that the welch option produces Welch's (1947) adjustment, whereas unequal produces the adjustment that was developed by Welch (1938) and independently (apparently) by Satterthwaite (1946). See this old thread for details:
https://www.statalist.org/forums/for...-documentation

I think this is important because I believe that when people talk about Welch's t-test, they usually mean the Welch (1938) test, a.k.a. the Welch-Satterthwaite method.
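To see the difference in practice, the two options can be run side by side and their degrees of freedom compared. This is a minimal sketch on the auto data; the choice of variables is arbitrary.

```stata
sysuse auto, clear

* unequal: Satterthwaite's approximation to the degrees of freedom
ttest mpg, by(foreign) unequal
display "df with unequal: " r(df_t)

* welch: Welch's (1947) formula (welch implies unequal)
ttest mpg, by(foreign) welch
display "df with welch:   " r(df_t)
```

The t statistic is the same in both runs. Only the approximate degrees of freedom, and therefore the p-values and confidence limits, differ.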

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
4 likes
Chris Martin

Join Date: Nov 2015

Posts: 100
#38

24 Jul 2025, 08:57

It would be great to merge gen and egen so that they're interchangeable. It's fairly easy to write a manual workaround (capture gen ... followed by capture egen ...), but that's inelegant.

Right now, we all have to remember which one to use, even though both are the equivalent of compute, and it taxes my memory. The one thing I miss about SPSS is that it just has compute. R has <-.

If there's an easy workaround, it would be great. If not, I understand.
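A minimal sketch of that capture-based workaround, wrapped in a small program. The name compute is hypothetical (borrowed from SPSS), and the wrapper does not handle by-groups:

```stata
capture program drop compute
program define compute
    * Hypothetical wrapper: try generate first, fall back to egen on error
    capture generate `0'
    if _rc {
        egen `0'
    }
end

sysuse auto, clear
compute lprice = log(price)      // handled by generate
compute avgmpg = mean(mpg)       // generate fails, so egen runs
compute runtot = sum(price)      // generate succeeds: a running sum
```

The last line shows the limit of the approach. When a function name exists in both commands, generate always wins, so the egen version is never reached.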
2 likes
Clyde Schechter

Join Date: Apr 2014

Posts: 30187
#39

24 Jul 2025, 09:58

I would like to agree with Chris Martin but I think the workaround would not be so easy.

Both -egen- and -gen- have associated functions named -max()- and -min()-, but they do different things. The same is (sort of) true of the function -sum()-. (The original -egen, sum()- was given an alias, -total()- which is now almost universally used instead, making them distinct, but -egen, sum()- still works and is equivalent to -egen, total()-, not to -gen ... sum()-)

Also, when used with -by-, the -gen- functions respect the use of _n and _N on the right-hand side, whereas -egen- functions may or may not do so: use at your own risk.

Also -egen- is extensible: you can write your own -egen- functions if you want to. -gen- is not.

I think these are sufficiently different that any attempt to combine them into a single command would produce chaos.
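The name clashes described above can be seen in a minimal sketch on the auto data. The new variable names are arbitrary.

```stata
sysuse auto, clear

* generate's max(): largest of its arguments within each observation
generate rowmax = max(price, weight)

* egen's max(): largest value of the expression across observations
egen colmax = max(price)

* generate's sum(): running sum down the observations
generate runsum = sum(price)

* egen's total(): the overall total, repeated in every observation
egen grandtot = total(price)

list price weight rowmax colmax runsum grandtot in 1/5
```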
3 likes
ericmelse

Join Date: May 2014

Posts: 437
#40

26 Jul 2025, 05:03

Maybe this time I can suggest taking a look at a possible next rocket-science contribution to the field of (medical) statistics or econometrics (e.g. DID analysis), considering this recently published paper (Open Access):

Korf, M. N., Van Geloven, N., Krijthe, J. H., & Labrecque, J. A. (2025). Causal clarity in statistical software. International Journal of Epidemiology, 54(4), dyaf136. https://doi.org/10.1093/ije/dyaf136

The authors argue that statistical software for causal inference often lacks transparency, failing to report the causal estimand, the assumptions required for causal interpretation, and diagnostics assessing whether these assumptions are plausible. This absence parallels how unhelpful it would be to report a regression coefficient without its standard error or confidence interval. To address this gap, the authors introduce the R package CarefullyCausal, which promotes transparency by reporting (i) the target causal estimand, (ii) estimates from multiple causal estimators that rely on distinct modeling assumptions (e.g., outcome regression, IPTW, standardization, TMLE), and (iii) explicit causal assumptions with supporting diagnostics.

Replicating the core functionality of CarefullyCausal in Stata could enhance transparency and pedagogical clarity in applied causal inference. A (wrapper) command built on Stata’s existing causal estimation tools (such as teffects, ipw, tmle) could replicate that functionality by returning a clearly labeled estimand, multiple effect estimates, and assumption diagnostics (e.g., covariate balance tables, PS overlap, S-values).

GitHub - mauricekorf/CarefullyCausal: Provides estimates, assumptions and diagnostics for fixed-exposure causal analyses

https://github.com


http://publicationslist.org/eric.melse
Chris Martin

Join Date: Nov 2015

Posts: 100
#41

28 Jul 2025, 13:21

Originally posted by Richard Williams

It turns out Chuck Huber wrote a blog post a few years ago about how to use chatgpt and Stata together. As is, it seems more complicated than I would like, but I bet Stata could come up with something better and simpler if it wanted to.

https://blog.stata.com/2023/07/25/a-...o-run-chatgpt/

Scott Cunningham wrote a long Substack post about using Claude 4.0 and ChatGPT with Stata:

https://causalinf.substack.com/p/an-...with-claude-35

You may need to ask for a 7-day free trial to read the full post.
1 like
Erik Ruzek

Join Date: Oct 2017

Posts: 443
#42

28 Jul 2025, 13:38

ARM64 compiled Stata for Windows. There is interest and demand.
1 like
Raymond Guiteras

Join Date: Sep 2022

Posts: 21
#43

28 Jul 2025, 17:14

The equivalent of R's ggrepel to prevent marker labels from overlapping, as suggested by David Flood in the Stata 18 wishlist:

Originally posted by David Flood

Ggplot "repel"-style labeling for overlapping labels in graphs, particularly for scatterplots

https://cran.r-project.org/web/packa...s/ggrepel.html
https://ggrepel.slowkow.com/articles/examples.html

(I know about mlabvpos but it never looks as nice as I want it to without a lot of manual tinkering. I usually end up using R.)
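For context, the manual approach with mlabvposition() looks roughly like the sketch below. The rule used to move labels is an arbitrary assumption; in practice each position is tuned by hand, which is the tinkering mentioned above.

```stata
sysuse auto, clear

* Clock position for each label: 3 o'clock by default,
* moved to 9 o'clock for some points (arbitrary rule, for illustration)
generate byte pos = 3
replace pos = 9 if mpg > 25

scatter mpg weight if foreign, mlabel(make) mlabvposition(pos)
```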
Nick Cox

Join Date: Mar 2014

Posts: 35804
#44

29 Jul 2025, 03:43

I have to agree with Clyde Schechter (#39) about egen.

Merging generate and egen is unfortunately a non-starter, for more reasons than those already given.

Here are two more reasons, and I doubt I've recalled everything that could be said.

Functions like log() can be used outside generate, as with display or in calculating local or global macros. In contrast, what bites is that egen functions can't be used outside egen.

Function calls outside egen can be, and often are, nested. Nesting isn't possible with egen functions.
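Both points can be shown in a minimal sketch using only official functions:

```stata
* Ordinary functions work wherever an expression is allowed
display log(10)

* They can be nested and used to define local macros
local x = sqrt(exp(2))
display `x'

* egen functions, by contrast, exist only inside egen.
* The next line would fail if uncommented, because generate
* knows no function called rowmean():
* generate avg = rowmean(price weight)
```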

egen is a series of wrapper functions (in its own sense) for generate. It is not an alternative or complement to generate as written. I know that users may think of it in that way when using it, and I often do too, but what bites for any merger scheme is how egen is implemented.

egen arouses mixed feelings. I've often seen posts elsewhere asking for the R equivalent of egen, which is striking both ways. Positively, in that people have evidently used egen in Stata, regard it favourably, and would like to learn of something equivalent. Negatively, in that egen has a history, but it doesn't have a distinct rationale otherwise. Very many commands call up generate within their code to generate new variables, and which new kinds of variables may be created with egen and which may be generated with other commands is a matter of caprice.

As Clyde points out, it is not hard for user-programmers to write their own egen functions. Many of the functions in official egen were originally community-contributed and have since been folded back into the official release. Many others exist outside official Stata.

But this kind of programming has long since passed its peak. Just as StataCorp are maintaining but not much extending egen, so also user-programmers are not often now writing new egen functions.

Why is that? I'll speak for myself only as someone who has written several egen functions in the past. Already the lists of official egen functions and of user-written functions are both quite long. If we keep adding more, such lists could easily seem too long for anybody to want to scan and too much of a rag-bag to be attractive or convenient.

I'd much rather write a distinct command that allows, or even has as its main aim, the generation of new variables.

I think there is a core question here: Which egen functions should be rewritten as official function code? Unfortunately, they will usually need new names!
1 like
Dave Airey

Join Date: Apr 2014

Posts: 407
#45

04 Aug 2025, 13:39

I would like to see some of the parallel package added to StataNow. I got a very nice speed-up in a for-loop context (see the Statalist thread "parallel usage").