Wishlist for Stata 17

daniel klein replied

10 Aug 2020, 00:52
While I do understand the rationale behind #312 and posts cited therein, I think this digs pretty deep into Stata's design. Stata programs are either e-class or r-class (or s-class, but that does not matter here). Stata programs might call other Stata programs, i.e., e-class programs might call r-class programs (and the other way round). Thinking beyond the specific r(table) and perhaps even beyond estimation commands, do you suggest that every command should document what exactly is left behind in e(), r(), and s() even if these things are just side products of helper routines? I think someone suggested pulling r(table) into e(), perhaps as e(table), which might be the better way to go about this.

Last edited by daniel klein; 10 Aug 2020, 00:58.
William Lisowski replied

08 Aug 2020, 11:06
There is no formal statement about the creation of r(table) following estimation commands in the Version 16 documentation. I have searched all the Version 16 PDFs and do not find a reference that a new user of Stata estimation commands is likely to stumble across, although it is mentioned implicitly in the reference for certain estimation commands that notably do not include regress. Additionally, the only expository references to _b and _se are in tutorial information in the Stata User's Guide PDF.

I find the emphasis on the e-class results in the help output and full reference documentation for most estimation commands, underneath the heading "Stored results", with no mention of these other stored results, to be not only incomplete but misleading.

The documentation would benefit from (a) a simple exposition of the non-e-class stored results as part of the output of help estimation and [U] 20 Estimation and postestimation commands and then (b) a linked reference to that exposition underneath every "Stored results" heading.

This is based on a post in a discussion earlier today at

https://www.statalist.org/forums/for...03#post1567603
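
For readers who have not met these results, here is a minimal sketch of what is left behind in r() after an estimation command. It assumes the auto dataset that ships with Stata; the matrix name T is only illustrative.

```stata
sysuse auto, clear
regress price mpg weight

* Copy r(table) right away, before another r-class command overwrites r()
matrix T = r(table)

* Rows are b, se, t, pvalue, ll, ul, df, crit, eform; one column per coefficient
matrix list T

* The first two rows agree with _b[] and _se[]
display "b[mpg]  = " T[1,1] "   _b[mpg]  = " _b[mpg]
display "se[mpg] = " T[2,1] "   _se[mpg] = " _se[mpg]
```

None of this appears under the "Stored results" heading of help regress, which lists e-class results only.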
3 likes
wbuchanan replied

08 Aug 2020, 06:50
Christopher Bratt
I listed a couple of exceptions, but the vast majority of the R language does not return data frame objects from function calls. The return object depends completely on the function being used, and sometimes the design of the class system in R makes the behavior of functions unpredictable (particularly when there are several different object types [which really just amount to an attribute containing strings of object type names] that get associated with the return object). So I would say that specific point you made is clearly and falsifiably invalid, while the broader point that you intended to make (e.g., all commands do not return data that can be subsequently used/stored for downstream use) is still a fair and reasonable issue that could be addressed. Rather than just stepping away from things and completely disengaging, I would encourage you to provide more detailed descriptions and examples of the issues that you want to be resolved. Aside from clarifying your intended meaning, it also gives others in the Stata community a better understanding of problems that other users face to which they may want to apply their time/effort to solve.
Nick Cox replied

08 Aug 2020, 06:12
Christopher Bratt

Let's copy your #299 here again so that people don't waste time looking for #229. (Sure, that was a typo.)

For me, there are still primarily three drawbacks in Stata compared to other programs: (1) not easy to develop reproducible research with flexible options (compared to R Markdown), (2) difficulties handling output from analyses (compared to R, where results are objects accessible as data frames and easy to manipulate), and (3) speed.

Speed seems to be a recurring problem. Are there some fundamental, old choices in Stata once made and now preventing Stata from running at the speed of other programs?

A recent Twitter post stated:
"Just ran a regression with about 100 million observations.
In Stata (MP, 16): 97 minutes
In Julia (using FixedEffectsModels.jl): 13 seconds."

https://twitter.com/clibassi/status/1289948339962646528

Is it possible to rewrite some of the fundamentals of Stata to improve speed?

I am very comfortable with the idea that official Stata can be improved. I've been watching it improve for nearly 30 years and have been helping as I can and also waging small campaigns directly with StataCorp. Some commands, such as reshape, are overdue for rewrites because they are sometimes painfully slow, and I am confident that StataCorp knows this.

But complaints like these don't make anything precise.

1. Reproducible research. StataCorp (as now is) has been wedded to reproducible research from the beginning, before the phrase even became standard. What specifically is weak?

2. Difficulties in handling output. Which results are inaccessible? Stata doesn't pretend to be a clone of R any more than the converse. It is understandable, but quite amusing, when people migrating one way or the other want X to behave like Y which is much more familiar to them. Some fair fraction of Twitter posts in this territory are on all fours with "Why isn't this cat more like a dog?"

3. Speed. We all prefer faster to slower. What slows things down in Stata includes use of memory -- where the user's machine and/or OS are sometimes part of the problem -- and also heavy use of interpreted code in many cases.

The Twitter post you cite certainly raises a hard question of what is going on to see such differences in timing, but it can't be discussed easily without more details.
1 like
Christopher Bratt replied

08 Aug 2020, 03:07
Ad #229 and #302
I believe wbuchanan's response to my comments obscured more than clarified. (Exceptions are not important in this context, and reproducible research with knitr is not well described in the response.) But since I no longer use Stata (for the three reasons given), I shouldn't follow up on this. I assume that StataCorp is familiar with alternative statistical packages and knows that all three points were valid.

A programmer in Stata knows me personally and is aware that I long ago offered to do some testing with speed comparisons. That offer remains. Because ... I still miss the Stata language!! It's second to none.
Leonardo Guizzetti replied

07 Aug 2020, 08:26
I have also recently learned about the undocumented -saving()- option for -margins-, which will turn your margins results into a separate dataset. This has been around since at least 2011, so at this point I don't know why either of these isn't part of the official documentation. It's not hard to see legitimate uses for either of these undocumented options.
3 likes
John Mullahy replied

07 Aug 2020, 07:47
Re #304

(e.g., what do I do if I want to generate a variable containing the slope and the rmse of some regression?), even if there is, what if I want to save slopes from poisson, probit, tobit, etc.

For this particular issue consider the undocumented generate option with the margins command. E.g.

Code:

poisson y x1 x2
margins, dydx(*) generate(me)

See

Code:

help margins generate

On numerous occasions I've argued, thus far unsuccessfully, for making this option part of the official documentation for margins.
3 likes
wbuchanan replied

07 Aug 2020, 06:11
Joro Kolev
I just wanted to clarify that -egen- is used to create new variables. There is a -statsby- command that would allow you to fit models and all that based on groups of cases identified by values of a variable: https://www.stata.com/manuals13/dstatsby.pdf
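
As a minimal sketch of the slope-and-rmse case raised in #304, assuming the auto dataset shipped with Stata (the names slope and rmse are only illustrative):

```stata
sysuse auto, clear

* Fit the regression separately within each level of foreign and keep
* the mpg slope and the root MSE; -clear- replaces the data in memory
* with one observation per group
statsby slope=_b[mpg] rmse=e(rmse), by(foreign) clear: regress price mpg

list
```

With the saving() option instead of clear, the results go to a separate file that can then be merged back onto the original data by the grouping variable.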
wbuchanan replied

07 Aug 2020, 06:07
Clyde Schechter
Apparently I have not looked through the Python API documentation enough. Zhao Xu (StataCorp) told me about the class in the API for handling date/datetime data. So looks like the request I had was already solved.
Joro Kolev replied

06 Aug 2020, 06:59
Hi Fernando,

I am saying in #291 that there is a major design flaw in Stata (in my view), and your answer is that "Yes, but there are patches to fix this."

1. The patch you are suggesting works if StataCorp or a kind person has thought about your problem before you, and has programmed an -egen- function to patch up the problem you have. What do we do if no kind person has thought to patch up your problem? I can give you a million examples, in no particular order: there is no egen function to save any statistics of -regress- by group (e.g., what do I do if I want to generate a variable containing the slope and the rmse of some regression?), even if there is, what if I want to save slopes from poisson, probit, tobit, etc. There are a million examples where you might want to generate a certain variable, or save some results from something by group, and there is no egen function to accommodate this.

2. For problems where no kind person has patched it up, it is not even a bit of programming as you admit, it is even simpler--we just write a loop and we are done... Or are we?
Here is the speed comparison between the egen function and the loop--I think you misunderstood the mean example I gave, it is not that I do not know that some things can be done with egen, I am saying that it would be a better design to rely on the -by- construct for everything.

Code:

clear
set obs 100000
gen x = rnormal()
egen group = seq(), block(10)
timer clear

* Timer 1: the egen solution (timeit is not an official Stata command)
timeit 1: egen meanx = mean(x), by(group)

* Timer 2: the explicit loop over groups
timer on 2
sort group
summ group, meanonly
gen mymeanx = .
qui forvalues i = 1/`r(max)' {
    summ x if group==`i', meanonly
    replace mymeanx = r(mean) if group==`i'
}
timer off 2

. timer list
   1:      0.13 /        1 =       0.1270
   2:    145.77 /        1 =     145.7700

. dis 145.77/0.13
1121.3077

So the loop solution is about 1,121 times slower than the egen solution in this case.

3. Clyde is right in #293 that there are user-contributed solutions. I have tested his solution on one example, and it was fast as lightning. However, can't StataCorp just pick the best of those solutions and make them part of the Stata core?

Originally posted by FernandoRios

Hi Joro,
that is true, -by- doesn't allow for multiple operations like that directly, but you can implement or use -egen- functions.
for example

Code:

by group:egen newvar=mean(x)

does exactly what you describe.
Granted, for more complicated problems it requires a bit of programming.
Fernando
1 like
wbuchanan replied

05 Aug 2020, 09:59
Allow the inlist() function to accept a longer list of values (string values seem to be where this is most problematic where I believe the number of values in the list is limited to 7).
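
Until then, the usual workaround is to split the list across several inlist() calls joined with |. A minimal sketch, assuming the auto dataset shipped with Stata (the variable name pick is only illustrative; help inlist() documents the exact argument limits for your version):

```stata
sysuse auto, clear

* Each inlist() call stays under the string-argument limit;
* the | operator combines the calls into one condition
gen byte pick = inlist(make, "AMC Concord", "AMC Pacer", "AMC Spirit", "Buick Century") ///
    | inlist(make, "Buick Electra", "Buick LeSabre", "Buick Opel")

count if pick
```

It works, but it gets unwieldy quickly, which is the reason for the request.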
4 likes
wbuchanan replied

05 Aug 2020, 09:55
Christopher Bratt
Regarding your second point in #299, not all results from R functions return dataframe objects; take for example anything involving the data.table, xts, or zoo libraries. But I think the broader point that you are trying to make is still valid/reasonable (e.g., non-data manipulation commands should return values in some way that they can be accessed later). Regarding the first statement, I think you may be referring more to the output format of the reproducible research than being able to reproduce the research in general (e.g., code is code, but formatting narrative, code, and output is a bit different).

From the several Stata Conferences I've attended, I know the team at StataCorp are always interested in identifying cases where the software does not perform as fast/well as other software and welcome examples that they can test/use to identify those performance bottlenecks. Have you attempted sending anyone from StataCorp a reproducible example of non-performant commands?
wbuchanan replied

05 Aug 2020, 09:48
Clyde Schechter
It is also an issue that comes up when using the Java or Python API that requires a lot of additional leg work to handle the datetime and date variables consistently. I'm not sure if the C plugin API has the same issues, but I would imagine the same thing would happen there as well. Keep in mind that I wasn't suggesting changing anything with the Excel import/export functionality, just setting the mask for a datetime value of 0 to represent 01jan1970 00:00:00 instead of 01jan1960 00:00:00 (again this would assume that the value does not include adjustments for leap seconds). I'm not sure how much existing code would be broken by implementing it since it is a fairly simple transformation, but would avoid the possibility that anyone using any of Stata's APIs would potentially incorrectly translate the date/datetime values.
Clyde Schechter replied

04 Aug 2020, 11:28
@ #298. I think this is a terrible idea. If the rest of the world used the POSIX epoch and Stata were the only outlier, then sure, it would be the thing to do. But anarchy still reigns on this. I suspect that among Stata users (or at least Stata users who are Forum members) there are many more people who work with Microsoft Excel than with various APIs that rely on the POSIX epoch. Excel doesn't even have a consistent epoch base across versions! -import excel- deals with this properly when importing date variables from Excel. It similarly deals correctly with importation of dates from SAS and SPSS. Perhaps over time Stata will extend the -import- command to work with some of the more popular APIs.

But at the moment, adopting the suggestion in #298 would provide convenience for what I believe is a small number of Stata users while breaking enormous amounts of existing code. And really, if you are a sophisticated enough programmer to be bringing in data from a range of APIs, how hard is it to just add a line of code to correct date variables for a different epoch convention?
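
A minimal sketch of that one-line correction, assuming the incoming values are seconds since 01jan1970 00:00:00 with no leap-second adjustment (the variable names and sample values are hypothetical):

```stata
clear
set obs 3

* Hypothetical input: seconds since the POSIX epoch
gen double unix_s = 0 in 1
replace unix_s = 86400 in 2
replace unix_s = 1596542880 in 3

* Stata %tc values are milliseconds since 01jan1960 00:00:00, so
* rescale to milliseconds and shift by the offset between the epochs
gen double stata_tc = unix_s*1000 + tc(01jan1970 00:00:00)
format stata_tc %tc

* The reverse correction, for sending values back out
gen double back_s = (stata_tc - tc(01jan1970 00:00:00))/1000

list
```

The variables need to be stored as double; float storage would round datetime values of this size.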
4 likes
Christopher Bratt replied

04 Aug 2020, 08:58
For me, there are still primarily three drawbacks in Stata compared to other programs: (1) not easy to develop reproducible research with flexible options (compared to R Markdown), (2) difficulties handling output from analyses (compared to R, where results are objects accessible as data frames and easy to manipulate), and (3) speed.

Speed seems to be a recurring problem. Are there some fundamental, old choices in Stata once made and now preventing Stata from running at the speed of other programs?

A recent Twitter post stated:
"Just ran a regression with about 100 million observations.
In Stata (MP, 16): 97 minutes
In Julia (using FixedEffectsModels.jl): 13 seconds."

https://twitter.com/clibassi/status/1289948339962646528

Is it possible to rewrite some of the fundamentals of Stata to improve speed?
1 like