This is a sticky topic.

  • #31
    (1) Add native support for reading Parquet files.

    (2) Add built-in support in ivregress for the Sanderson-Windmeijer (SW) first-stage test of weak identification when there are multiple endogenous regressors. (This test is currently available in ivreg2.)
    Associate Professor of Finance and Economics
    University of Illinois
    www.julianreif.com



    • #32
      Allow/create a few fill patterns for graphs!



      • #33
        The standard output that the sum command shows in the Stata Results window with the d option (which displays additional statistics) looks like this:

        Code:
        . sysuse auto, clear
        . sum mpg, d
        
                                Mileage (mpg)
        -------------------------------------------------------------
              Percentiles      Smallest
         1%           12             12
         5%           14             12
        10%           14             14       Obs                  74
        25%           18             14       Sum of wgt.          74
        
        50%           20                      Mean            21.2973
                                Largest       Std. dev.      5.785503
        75%           25             34
        90%           29             35       Variance       33.47205
        95%           34             35       Skewness       .9487176
        99%           41             41       Kurtosis       3.975005
        But the output above does not include the stored results r(sum), r(min) and r(max):
        Code:
        . return list
        
        scalars:
                          r(N) =  74
                      r(sum_w) =  74
                       r(mean) =  21.2972972972973
                        r(Var) =  33.47204738985561
                         r(sd) =  5.785503209735141
                   r(skewness) =  .9487175964588155
                   r(kurtosis) =  3.97500459645325
                        r(sum) =  1576
                        r(min) =  12
                        r(max) =  41
                         r(p1) =  12
                         r(p5) =  14
                        r(p10) =  14
                        r(p25) =  18
                        r(p50) =  20
                        r(p75) =  25
                        r(p90) =  29
                        r(p95) =  34
                        r(p99) =  41
        although there is ample room available to include them.

        My proposal to include them in the window report is:
        Code:
                                Mileage (mpg)
        -------------------------------------------------------------
              Percentiles      Smallest
         1%           12             12       Obs                  74
         5%           14             12       Sum of wgt.          74
        10%           14             14       Mean            21.2973
        25%           18             14       Std. dev.      5.785503
        
        50%           20                      Variance       33.47205
                                Largest       Skewness       .9487176
        75%           25             34       Kurtosis       3.975005
        90%           29             35       Sum                1576
        95%           34             35       Min                  12
        99%           41             41       Max                  41
        I suppose the above does not meet the criterion for the next rocket-science contribution to the field of (medical) statistics or econometrics, but using sum, d is a daily routine, and having all results available on the fly might be of use to many Stata users.
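
        In the meantime, the three values can be printed right after the detailed summary. A minimal sketch, assuming only the stored results that return list shows above:

        ```stata
        sysuse auto, clear
        quietly summarize mpg, detail
        * r(sum), r(min) and r(max) are stored but not shown in the table
        display "Sum = " r(sum)
        display "Min = " r(min)
        display "Max = " r(max)
        ```

        For mpg this should reproduce the 1576, 12 and 41 listed by return list.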
        http://publicationslist.org/eric.melse



        • #34
          I agree with adding the sum to the statistics reported in the Results window. But min and max are redundant: Stata already shows the four smallest and four largest values, so the first and last of those, respectively, are the values of the min and max.



          • #35
            Ben Jann wrote a module called moremata which, interestingly, includes a routine to calculate percentiles. What differentiates it from Stata's existing percentile calculation is the option to choose among multiple methods. Apparently, based on the code for mm_quantile(), the routine allows 12 different definitions for computing percentiles. Of these 12 definitions, Stata uses definition 2 (the default) and, alternatively, definition 6. Python, R, and other programs use a different definition.

            Just a thought here, but it would be nice to include all definitions to help folks replicate processes from other programs. These definitions also apply to the calculation of the median and the interquartile range (IQR).
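
            For reference, the two definitions Stata already offers can be compared directly with _pctile, whose altdef option switches to the alternative formula. A minimal sketch on the auto data; that these correspond to definitions 2 and 6 in the moremata numbering is taken from the description above and is not checked here:

            ```stata
            sysuse auto, clear
            * default percentile definition
            _pctile mpg, percentiles(25 50 75)
            display "default: " r(r1) "  " r(r2) "  " r(r3)
            * alternative definition
            _pctile mpg, percentiles(25 50 75) altdef
            display "altdef:  " r(r1) "  " r(r2) "  " r(r3)
            ```

            Any difference between the two lines is what makes results hard to replicate across programs that default to yet another definition.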

            Link to moremata: https://ideas.repec.org/c/boc/bocode/s455001.html

            Definitions listed below:
            [Image: table of the quantile definitions, image (6).png]



            • #36
              For some time, I have been requesting the ability to read raster files in Stata. It would be useful because many economists (at least in my circle) look for ways to write routines that run a whole analysis in a single program, especially for geospatial analysis. Recently, a team from Xiamen University and Hefei University released a package called readraster (link below) that uses Java integration to allow raster analysis in Stata.

              Maybe, if it is worth the time and effort, the team at Stata could consider building on this? It would be useful, I believe.
              readraster on GitHub (kerrydu/readraster): read and process raster data in Stata.
              Last edited by Fahad Mirza; 21 Jul 2025, 10:48.



              • #37
                I would (still) like to see the documentation for ttest updated to clarify that the welch option produces Welch's (1947) adjustment, whereas unequal produces the adjustment that was developed by Welch (1938) and (apparently independently) by Satterthwaite (1946). See this old thread for details. I think this is important because I believe that when people talk about Welch's t-test, they usually mean the Welch (1938) test, a.k.a. the Welch-Satterthwaite method.
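
                The practical difference shows up in the degrees of freedom. A minimal sketch on the auto data that compares the stored r(df_t) under each option; the grouping by foreign is only an illustration:

                ```stata
                sysuse auto, clear
                * unequal alone: the Welch (1938) / Satterthwaite (1946) adjustment
                quietly ttest mpg, by(foreign) unequal
                display "unequal:        df = " r(df_t)
                * adding welch: the Welch (1947) adjustment
                quietly ttest mpg, by(foreign) unequal welch
                display "unequal welch:  df = " r(df_t)
                ```

                The t statistic is the same in both runs; only the approximate degrees of freedom, and hence the p-value, change.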
                --
                Bruce Weaver
                Version: Stata/MP 19.5 (Windows)



                • #38
                  It would be great to merge gen and egen so that they're interchangeable. It's fairly easy to write a manual workaround: capture gen ... capture egen ... but that's inelegant.

                  Right now, we all have to remember which one to use, even though both are the equivalent of compute, and it taxes my memory. The one thing I miss about SPSS is that it just has compute. R has <-.

                  If there's an easy workaround, it would be great. If not, I understand.



                  • #39
                    I would like to agree with Chris Martin but I think the workaround would not be so easy.

                    Both -egen- and -gen- have associated functions named -max()- and -min()-, but they do different things. The same is (sort of) true of the function -sum()-. (The original -egen, sum()- was given an alias, -total()- which is now almost universally used instead, making them distinct, but -egen, sum()- still works and is equivalent to -egen, total()-, not to -gen ... sum()-)
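
                    A minimal sketch of these name clashes on the auto data (the new variable names are arbitrary):

                    ```stata
                    sysuse auto, clear
                    * generate's max(): largest of its arguments within each observation
                    generate bigger = max(price, weight)
                    * egen's max(): largest value of the expression across observations
                    egen highest = max(price)
                    * generate's sum(): running sum down the observations
                    generate runsum = sum(price)
                    * egen's total(): one overall total, repeated in every observation
                    egen overall = total(price)
                    list price bigger highest runsum overall in 1/5
                    ```

                    A merged command would have to guess which of the two meanings of max() or sum() the user intended.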

                    Also, when used with -by-, the -gen- functions respect the use of _n and _N on the right-hand side, whereas -egen- functions may or may not do so: use at your own risk.

                    Also -egen- is extensible: you can write your own -egen- functions if you want to. -gen- is not.

                    I think these are sufficiently different that any attempt to combine them into a single function would produce chaos.



                    • #40
                      Maybe this time I can suggest taking a look at a possible next rocket-science contribution to the field of (medical) statistics or econometrics (e.g. DID analysis), considering this recently published paper (Open Access):

                      Korf, M. N., Van Geloven, N., Krijthe, J. H., & Labrecque, J. A. (2025). Causal clarity in statistical software. International Journal of Epidemiology, 54(4), dyaf136. https://doi.org/10.1093/ije/dyaf136

                      The authors argue that statistical software for causal inference often lacks transparency, failing to report the causal estimand, the assumptions required for causal interpretation, and diagnostics assessing whether these assumptions are plausible. This absence parallels how unhelpful it would be to report a regression coefficient without its standard error or confidence interval. To address this gap, the authors introduce the R package CarefullyCausal, which promotes transparency by reporting (i) the target causal estimand, (ii) estimates from multiple causal estimators that rely on distinct modeling assumptions (e.g., outcome regression, IPTW, standardization, TMLE), and (iii) explicit causal assumptions with supporting diagnostics.

                      Replicating the core functionality of CarefullyCausal in Stata could enhance transparency and pedagogical clarity in applied causal inference. A (wrapper) command built on Stata's existing causal estimation tools (like teffects, ipw, tmle) could replicate that functionality, returning a clearly labeled estimand, multiple effect estimates, and assumption diagnostics (e.g., covariate balance tables, PS overlap, S-values).
                      CarefullyCausal on GitHub (mauricekorf/CarefullyCausal): provides estimates, assumptions and diagnostics for fixed-exposure causal analyses.
                      http://publicationslist.org/eric.melse



                      • #41
                        Originally posted by Richard Williams
                        It turns out Chuck Huber wrote a blog post a few years ago about how to use chatgpt and Stata together. As is, it seems more complicated than I would like, but I bet Stata could come up with something better and simpler if it wanted to.

                        https://blog.stata.com/2023/07/25/a-...o-run-chatgpt/
                        Scott Cunningham wrote a long Substack post about using Claude 4.0 and ChatGPT with Stata:

                        https://causalinf.substack.com/p/an-...with-claude-35

                        You may need to ask for a 7-day free trial to read the full post.





                        • #42
                          An ARM64-compiled Stata for Windows. There is interest and demand.



                          • #43
                            The equivalent of R's ggrepel to prevent marker labels from overlapping, as suggested by David Flood in the Stata 18 wishlist:

                            Originally posted by David Flood
                            Ggplot "repel"-style labeling for overlapping labels in graphs, particularly for scatterplots

                            https://cran.r-project.org/web/packa...s/ggrepel.html
                            https://ggrepel.slowkow.com/articles/examples.html

                            (I know about mlabvpos but it never looks as nice as I want it to without a lot of manual tinkering. I usually end up using R.)
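
                            For comparison, the manual route looks like this. A minimal sketch on the auto data; the rules that assign clock positions are arbitrary and are exactly the kind of tinkering mentioned above:

                            ```stata
                            sysuse auto, clear
                            * clock positions: 3 = right of the marker, 9 = left, 12 = above
                            generate byte pos = 3
                            replace pos = 9 if weight > 4000
                            replace pos = 12 if mpg > 35
                            scatter mpg weight in 1/20, mlabel(make) mlabvposition(pos)
                            ```

                            Every new overlap needs another hand-written rule, which is the work a repel-style algorithm would take over.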



                            • #44
                              I have to agree with Clyde Schechter (#39) about egen.

                              Merging generate and egen is unfortunately a non-starter, for more reasons than those already given.

                              Here are two more reasons, and I doubt I've recalled everything that could be said.

                              Functions like log() can be used outside generate, as with display or in calculating local or global macros. In contrast, what bites is that egen functions can't be used outside egen.

                              Function calls outside egen can be nested, and often are. Nesting isn't possible with egen functions.
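
                              A minimal sketch of both points on the auto data (the macro and variable names are arbitrary):

                              ```stata
                              * ordinary functions work anywhere an expression is allowed
                              display log(10)
                              local root = sqrt(16)
                              * and they nest freely
                              display exp(log(2)) + `root'
                              * egen functions work only inside egen, so combining one
                              * with an ordinary function takes two steps
                              sysuse auto, clear
                              egen double meanprice = mean(price)
                              generate double logmean = log(meanprice)
                              ```

                              A one-step form such as generate logmean = log(mean(price)) fails because generate knows no function mean().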

                              egen is a series of wrapper functions (in its own sense) for generate. It is not an alternative or complement to generate as written. I know that users may think of it in that way when using it, and I often do too, but what bites for any merger scheme is how egen is implemented.

                              egen arouses mixed feelings. I've often seen posts elsewhere asking for the R equivalent of egen, which is striking both ways. Positively, in that people have evidently used egen in Stata, regard it favourably, and would like to learn of something equivalent. Negatively, in that egen has a history, but it doesn't have a distinct rationale otherwise. Very many commands call up generate within their code to generate new variables, and which new kinds of variables may be created with egen and which may be generated with other commands is a matter of caprice.

                              As Clyde points out, it is not hard for user-programmers to write their own egen functions. Many of the functions in official egen were originally community-contributed and have since been folded back into the official release. Many others exist outside official Stata.

                              But this kind of programming has long since passed its peak. Just as StataCorp are maintaining but not much extending egen, so also user-programmers are not often now writing new egen functions.

                              Why is that? I'll speak for myself only as someone who has written several egen functions in the past. Already the lists of official egen functions and of user-written functions are both quite long. If we keep adding more, such lists could easily seem too long for anybody to want to scan and too much of a rag-bag to be attractive or convenient.

                              I'd much rather write a distinct command that allows, or even has as its main aim, the generation of new variables.

                              I think there is a core question here: Which egen functions should be rewritten as official function code? Unfortunately, they will usually need new names!

