
  • Leonardo Guizzetti
    replied
    Extend the existing -tabulate- and -table- commands with an option that looks at the value labels attached to the input variables and uses the levels defined in those labels to add them to the tabulation. In the one-dimensional case, this is closest to Ben Jann's -fre- command with its -i()- option, which lets the user include specific values that would otherwise have zero frequency. One related request of mine has since been implemented as the new -table, zerocounts- option; however, that option is limited to cases where zero counts are implied by the cross-tabulation.

    There is currently no support for this in any of the official commands, but it can be a useful feature when you specifically want a tabulation to show zero-frequency counts.

    As a quick example to demonstrate this, consider the following.

    Code:
    tabi 0 1 2 \ 0 3 4
    The output eliminates the first column because there are no observations in any of those cells.

    Code:
    . tabi 0 1 2 \ 0 3 4
    
               |          col
           row |         2          3 |     Total
    -----------+----------------------+----------
             1 |         1          2 |         3
             2 |         3          4 |         7
    -----------+----------------------+----------
         Total |         4          6 |        10
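
    For the one-dimensional case, the -fre- route mentioned above can force such values back in (a minimal sketch; -fre- is community-contributed, and the variable and included value here are arbitrary).

    Code:
    ssc install fre
    sysuse auto, clear
    fre rep78, include(0)    // lists value 0 even though its frequency is zero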



  • Jean-Michel Galarneau
    replied
    A slightly bigger arrow on the Replace All button in the Do-file Editor, so that Replace All in Selection can be accessed with more ease.



  • Mead Over
    replied
    Wish: To facilitate future replication of Stata results, a StataCorp utility to help users "freeze" a collection of user-contributed ADO, Mata and MLIB programs for publication/posting with the Stata DO files that call those programs

    More and more responsible journals require that referees and eventual readers be able to replicate the analytical results in submitted papers. Through its support of the Stata Journal and periodic Stata user conferences, StataCorp also encourages and helps users to produce and publish Stata programs that extend Stata's capabilities in small and large ways. But when a researcher later attempts to replicate the results in a published paper, the community-contributed programs originally used might not be available in the same version, or at all.

    Thus a user wishing to enable future replication of a set of interlocking DO files and community-contributed ADO/Mata/MLIB files must figure out how to assemble and "freeze" the community-contributed ADO files used in a given research project. This is doable, and many users are already doing it, each in his or her own way. But it would be great if there were a set of StataCorp-supported conventions and utilities to standardize the process. (Ideally some journals, starting with the Stata Journal, would even require that Stata users conform to such StataCorp-recommended conventions and use the recommended program-freezing utilities.)
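
    For instance, one ad hoc version of that process might look like the following (a sketch only; the path and package are illustrative, not a recommendation).

    Code:
    * Redirect installations into a project-local folder, then install each
    * community-contributed package the project's DO files call:
    sysdir set PLUS "C:/myproject/ado"
    ssc install fre
    * A future replicator prepends the frozen folder to the ado-path:
    adopath ++ "C:/myproject/ado"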

    diana gold's SSC program -dependencies- seems to me an excellent model for a Stata-supported way to "freeze" (her word) a set of user-contributed ADO/Mata/MLIB files in order to facilitate future replication of research results. I like the fact that it allows the future replicator to temporarily modify their -adopath- while replicating, and then undo this change and delete the replication-specific collection of programs at will. Other SSC programs that accomplish some of the same objectives include -zippkg-, -rqrs-, -which_version-, -copycode-, -adolist-, and -usepackage-.

    I take the point that results produced using the updated community-contributed ADO files may differ from those originally published exactly because the ADO file's bugs have been fixed. The new results might be "better". But I think this is an argument in favor of, rather than against, requiring authors to publish their frozen ADO files as part of a journal submission. I think that replicators need to start with a script that reproduces as exactly as possible the published result, before they experiment to discover the sensitivity of those results to different approaches and/or data. It is the replicator's responsibility to discover that the newer version of the community-contributed program produces a different result.

    diana gold, daniel klein, Nick Cox, Sergio Correia, and others have discussed these issues extensively in these threads:
    https://www.statalist.org/forums/for...lable-from-ssc
    https://www.statalist.org/forums/for...ge-require-ado
    https://www.statalist.org/forums/for...o-local-folder
    https://www.statalist.org/forums/for...os#post1523554
    https://www.statalist.org/forums/for...79#post1662079
    Last edited by Mead Over; 29 Apr 2022, 14:55.



  • Joro Kolev
    replied
    Can StataCorp please fix the documentation for -matrix accum- and add some examples that illustrate how these commands
    matrix glsaccum
    matrix opaccum
    matrix vecaccum
    are used?

    In particular, -matrix glsaccum- has been in its current state of documentation since about Stata 7. I understand almost nothing from the current abstract explanation of what -glsaccum- does, and there are no examples anywhere showing how the command is used.

    And -glsaccum- is useful: once upon a time I had a lucky day and managed to implement all the estimators in Wooldridge (2010), "Chapter 7: Estimating Systems of Equations by OLS and GLS", from scratch using just -glsaccum-, without even reaching the limits of the command (I used the same weighting matrix across groups, and it apparently allows the weighting matrix to vary).

    In short, the plain -matrix accum- is clear, but the more complicated versions listed above, and -glsaccum- in particular, are not clear at all in the manual.
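
    For contrast, the plain -accum- and -vecaccum- siblings can be illustrated in a few lines; this kind of worked example (a minimal sketch using the built-in auto data) is exactly what the -glsaccum- entry lacks.

    Code:
    sysuse auto, clear
    matrix accum XX = weight mpg            // X'X, constant added by default
    matrix vecaccum yX = price weight mpg   // y'X, first variable is y
    matrix b = yX * invsym(XX)              // OLS: b = (y'X)(X'X)^-1
    matrix list b                           // matches -regress price weight mpg-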



  • Jeremy Lim
    replied
    For latent class analysis (LCA) as conducted in gsem: to have latent transition analysis, plus Stata equivalents of Mplus's R3STEP for 3-step latent class regression accounting for classification error and of KNOWNCLASS for multiple-group LCA.

    gsem currently estimates multiple-group LCA using the group() and ginvariant() options, but ginvariant() does not offer a way to constrain coefficients but not variances (more details in this earlier post). Hence this has to be done manually, as advised by Stata Technical Support: "You could place those needed constraints into the constraint definition and then supply to the -constraints()- option after -gsem-", which can be very tedious with hundreds of constraints for a 4-class model with 7 groups.



  • alejoforero
    replied
    When using -merge- with a very large master dataset and a small using dataset, Stata preserves the master dataset in order to sort the using dataset first. This can be extremely slow when the master is very large, since Stata has to write it out to temporary files on disk. That is completely unnecessary if the using dataset fits easily in memory and can be sorted there before the merge.

    A nice trick I use is:

    Code:
    * Sort the small using dataset in a separate frame, avoiding a costly
    * preserve of the large master dataset:
    frame create sorter
    frame sorter: use usingdataset.dta, clear
    frame sorter: sort merging_variables
    frame sorter: save usingdataset.dta, replace
    frame drop sorter
    
    * The using dataset is now pre-sorted, so -merge- need not sort it:
    merge 1:1 merging_variables using "usingdataset.dta"
    Building this trick into -merge-'s own code could save a lot of time without any sacrifice whatsoever.



  • daniel klein
    replied
    Repeated requests (unlikely to happen, I know):

    1. Do not allow m:m merges. These are never useful outside of StataCorp. Even if they were useful in very rare situations, they surely produce more harm than good; a small illustration of the pitfall follows below this list. If required, keep m:m under version control.

    2. Remove mi's suggestion to use the force option. You never want that, yet we regularly see it used (blindly copied) in posts to Statalist.
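
    On the first point: m:m pairs observations by their order within each by-group rather than forming all combinations, which is almost never what anyone wants (a minimal sketch with made-up data; -joinby- would give the four-row Cartesian product instead).

    Code:
    clear
    input id y
    1 3
    1 4
    end
    tempfile two
    save `two'
    
    clear
    input id x
    1 1
    1 2
    end
    merge m:m id using `two'
    list
    * Result: only the pairs (x=1, y=3) and (x=2, y=4), matched by
    * observation order -- not the 4 combinations most users expect.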



  • Anthony Killeen
    replied
    Thank you!



  • Leonardo Guizzetti
    replied
    Originally posted by Anthony Killeen:
    Show the name of the frame in the Data Browser!
    Anthony, this is shown in the Properties pane of the Data Browser/Editor (View > Properties).



  • Luis Pecht
    replied
    A great addition to Stata's machine learning capabilities (e.g., lasso) would be automated feature engineering, just like Python's Featuretools (https://featuretools.alteryx.com/en/stable/).

    Once one establishes the relationships among the Level 0 and higher-level (Level 1, 2, 3) datasets and defines a cut-off time (any data after that point in time is filtered out before calculating features, to avoid "label leakage"), hundreds of features (variables) are "automagically" created using max, min, mean, count, sum, etc., and can be fed into ML algorithms.

    IMHO, translating that into Stata would involve frames, frlink, rangestat, and lots of egen.
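
    A hand-rolled version of one such aggregation step might look like this (a hypothetical sketch; the frame, file, and variable names are all illustrative).

    Code:
    * Build child-level features up to the parent (customer) level,
    * honoring a cut-off time:
    frame create trans
    frame trans: use transactions.dta, clear
    frame trans {
        keep if trans_date < td(01jan2020)       // drop data past the cut-off
        collapse (mean) mean_amt=amount (max) max_amt=amount ///
            (count) n_trans=amount, by(customer_id)
    }
    frlink m:1 customer_id, frame(trans)         // link master rows to features
    frget mean_amt max_amt n_trans, from(trans)  // copy the features across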



  • Anthony Killeen
    replied
    Show the name of the frame in the Data Browser!



  • Leonardo Guizzetti
    replied
    Originally posted by Jay Patel:
    More comprehensive "table" creation. For example, neither *tabdisp* nor *list* allow a *collect* command to readily gather output appropriate for a table to be output via *putdocx*.
    I agree with the spirit of the request to expand the collect system, and I expect those features are being developed. As a slight aside, putdocx already supports listing data directly.
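
    For example (a minimal sketch; the file and table names are arbitrary):

    Code:
    sysuse auto, clear
    putdocx begin
    putdocx table t1 = data(make price mpg) in 1/5, varnames
    putdocx save example.docx, replace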



  • Jay Patel
    replied
    More comprehensive "table" creation. For example, neither *tabdisp* nor *list* allow a *collect* command to readily gather output appropriate for a table to be output via *putdocx*.



  • Jay Patel
    replied
    Sophisticated tools to support investment/portfolio development and analysis. Examples: estimation of weights for optimal diversification (with constraints; robust; Bayesian; ...); bootstrap standard errors for estimates of "optimal" portfolio weights; portfolio performance analyses that relax the assumption of i.i.d. observation periods.



  • William Lisowski
    replied
    Nothing in #348 contradicts the point I made in #345 that

    ... SAS approached the problem back in the day by creating PROC SQL to understand SQL commands (and interface directly with SQL databases) rather than shoehorn SQL capabilities into existing DATA step commands.
    Statalist is not the forum in which to argue the strengths of SAS's particular implementation of SQL within PROC SQL, which was not the case I was making anyway. I will point out that my copy of the SQL in a Nutshell (2nd Edition, 2004) reference sets the stage for the book in its Preface:

    SQL in a Nutshell, Second Edition, describes the latest ANSI standard, SQL2003, version of each command and then documents each platform's implementation of that command.
    The platforms comprise six databases popular at the time, so SAS was not alone in adapting the standard (originally published in 1986) to the needs of its implementation. I would prefer that Stata do the same should it implement SQL functionality. In that, it would differ from what was done with Python: my preference is to apply SQL syntax to native Stata datasets rather than have to move the datasets into and out of an external database for what are largely data management tasks.

