Wishlist for Stata 18

Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#106

04 Sep 2021, 07:34

As someone who does quasi-experimental research, I would appreciate formal Stata implementation for procedures such as augmented synthetic-controls, robust synthetic controls, and similar estimators. In Stata 17, Stata formally implemented Difference-in-Differences estimators, so I'd like to see formal extensions in this area too, given recent methodological advances.
1 like
Comment
Dick Campbell

Join Date: Apr 2014

Posts: 279
#107

18 Sep 2021, 13:45

I frequently work on several do files at once. It would be nice to be able to apply the Find command to all of them at once rather than just the one that is open.

I should clarify this a bit. The do file editor allows multiple files to be active at once, each with its own tab. One file is open. I want to be able to do a search on all tabbed files at once.

Last edited by Dick Campbell; 18 Sep 2021, 14:04.

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
3 likes
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#108

21 Sep 2021, 13:57

This post concerns latent profile analysis. It's a repost of an earlier post I made on the Stata 17 wishlist. Basically, the default options will nudge less experienced users into making a very restrictive set of assumptions. Moreover, the manual example does not make clear how restrictive the assumptions are.

Stata's defaults are to A) assume that the error variances are equal across latent classes and B) assume a diagonal covariance structure for Gaussian indicators. I think assumption A is more problematic, so I'll deal with it first. Consider the graphic below from Masyn's chapter on mixture models, which is referenced in the SEM examples dealing with LCA and LPA.

Issue A

Panel a) in the diagram is a dataset with 2 indicators, Y1 and Y2. Panel b) illustrates a 3-class LPA model with equal error variances. Do you see how the circles are the same size? The center of each circle represents the means of the indicators, and the diameters along the x- and y-axes represent the error variance of each indicator (they are equal in this example, but they need not be). That's what equal error variances means.

That's just a sample dataset. Knowing nothing about what Y1 and Y2 are, maybe it's not absurd to suppose that the group of dots might stem from three separate sources. That's fine. The thing is that with real data, you might not be able to make this assumption. However, if you don't override Stata's default assumptions, you will be telling Stata to take your magic multidimensional cookie cutter and to cut out k cookies of equal size from your data. If you relax that default, you tell Stata that it can resize the magic cookie cutter as appropriate after each stamp.

For clarity, I show Stata's default behavior with code after I discuss issue B.

Issue B
Now, let's deal with issue B. Consider the diagram below, which is borrowed from the manual for the R package flexmix. This is an artificial (I think) dataset with two dimensions. You could simply think of them as physical x and y coordinates for this post. Both panels represent the results of latent profile models with 4 classes. Each color represents the observations assigned to one latent class.

The model for the left panel had a diagonal covariance structure for the Gaussian indicators, i.e. within each class, all the Gaussian indicators have 0 correlation. That's described starting on pg 14 of the manual. (NB: I believe flexmix's default is to assume unequal error variances across classes for Gaussian indicators. You can see that the size of each circle is different. So, there's precedent for not using equal error variance across classes as the default.) Note classes 1 and 4. There's a small swathe of points running diagonally. The left-panel model broke that group into two distinct classes.

On the right panel, the model has unstructured covariance, i.e. within each class, a correlation between the (errors of) each pair of indicator variables is explicitly modeled. It can turn out to be 0, as with class 3 on the right. However, note that the left panel's classes 1 and 4 have become one class, #2, on the right. See how the ellipse is slanted? That tilt represents the correlation. In magic multidimensional cookie cutter terms: Stata's default behavior is to cut (multidimensional) ellipses only at angles of 0 or 90 degrees. Relaxing Stata's default behavior lets it tilt the cookie cutter as appropriate with each cut.

My ask
Relax Stata's default behavior when fitting latent profile models. Clarify in the manual that you need to explore models with fewer constraints. SEM example 52 mentions only that the final model relaxed both constraints described above, not why you need to do this and what this does. I believe it would be better if the default were the least restrictive set of assumptions.

This is one recent example where a new poster fit LPMs with only Stata's default (and restrictive) assumptions. I link the post not to criticize the user. Again, they were nudged into this action by Stata's defaults.

Code example for issue A

Code:

use https://www.stata-press.com/data/r16/gsem_lca2
gsem (glucose insulin sspg <- _cons), lclass(C 2) byparm
...
var(e.glucose)|
             C |
            1  |   191.5596   23.83815   150.0992   244.4723
            2  |   191.5596   23.83815   150.0992   244.4723
var(e.insulin)|
             C |
            1  |   119.0542   14.00336   94.54204   149.9217
            2  |   119.0542   14.00336   94.54204   149.9217
 var(e.sspg)#C|
            1  |   55.91283   6.713667   44.18801    70.7487
            2  |   55.91283   6.713667   44.18801    70.7487
-------------------------------------------------------------------------------

The first number in each row is the error variance for that latent class. See how the two classes share the same value for each indicator? Now, with unequal error variances:

Code:

gsem (glucose insulin sspg <- _cons), lclass(C 2) lcinvariant(none) byparm
...
var(e.glucose)|
             C |
            1  |   22.62693    4.35593    15.5153   32.99827
            2  |   1263.401   223.8804   892.6978   1788.043
var(e.insulin)|
             C |
            1  |   26.36603   4.285562   19.17298   36.25767
            2  |   283.2775   50.93803    199.137   402.9697
 var(e.sspg)#C|
            1  |   25.26045   5.003334    17.1334   37.24247
            2  |   70.49358    12.7819    49.4094   100.5749
-------------------------------------------------------------------------------

I don't feel it's necessary to demonstrate with code, but the option covstructure(unstructured) will fit LPMs where the error terms for each indicator are allowed to correlate within each latent class. For each class, you'll see the covariance between each indicator at the end of the results table. Recall that you can convert covariance to correlation: rho = covariance(x1, x2) / sqrt[Var(x1) * Var(x2)].
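For anyone who wants to try the less restrictive specification anyway, here is a minimal sketch. It assumes the full option is written covstructure(e._OEn, unstructured), which is my understanding of the gsem syntax used in the SEM latent profile example; check help gsem before relying on it. The covariance-to-correlation step uses made-up numbers purely to show the arithmetic; they are not estimates from this model.

```stata
use https://www.stata-press.com/data/r16/gsem_lca2, clear

* Relax both defaults: class-specific error variances (lcinvariant(none))
* and within-class error covariances (covstructure(), syntax assumed)
gsem (glucose insulin sspg <- _cons), lclass(C 2) ///
    lcinvariant(none) covstructure(e._OEn, unstructured)

* Covariance -> correlation with hypothetical values (NOT model estimates)
scalar cov12 = 10      // hypothetical cov(e.x1, e.x2) within one class
scalar var1  = 25      // hypothetical var(e.x1)
scalar var2  = 16      // hypothetical var(e.x2)
display "rho = " cov12 / sqrt(var1 * var2)   // 10 / sqrt(400) = 10 / 20
```

To convert real estimates, replace the three scalars with the covariance and variances reported for a given class in the gsem results table.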


Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
3 likes
Comment
Bjarte Aagnes

Join Date: Apr 2014

Posts: 785
#109

22 Sep 2021, 09:14

Adding a wish for support of the OpenType font format (so the same OpenType fonts, and their features, can be used in Stata plots, LuaLaTeX, and Adobe InDesign).

Last edited by Bjarte Aagnes; 22 Sep 2021, 09:18.
4 likes
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#110

22 Sep 2021, 09:28

Originally posted by Giovanni Russo

An expansion of the SEM and GSEM suite, the 3-step approach for LCA and other estimation methods that are less sensitive to deviations from normality. Mplus and LatentGold offer a wider set of options

Can you clarify what you mean by other estimation methods that are less sensitive to deviations from normality in this context? If you have binary indicators, I don't see what normality means. You can treat continuous indicators as Gaussian. However, you can also treat them as Poisson or negative binomial, model them with any type of survival model, etc.

I would second the statement about 3-step approaches. For other readers: very often, after we fit an LCA, we want to know how other variables that weren't used in the LCA model are related to class membership. For example, say we fit an LCA on profiles of adolescent risk behavior, and say we found some subtypes with qualitatively different types of risk. How is, say, being raised by a single parent related to risk profile membership?

Many readers will go and do modal class assignment, i.e. predict latent class membership probabilities, then take the class with the highest probability and assume each person belongs to that class. Then you tabulate Y by K, so you have E(Y | K = k). This is theoretically erroneous: we don't know which class someone belongs to, only the vector of probabilities that they belong to each latent class. Now, after the LCA model, we may be relatively certain about which classes people belong to (i.e. high entropy), and this exercise would be slightly wrong but still useful. However, we aren't always certain. It's been shown by smarter people than I that this classification uncertainty will bias your estimates.
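To make the modal assignment step concrete, here is a minimal sketch using the latent profile dataset from my earlier post in this thread rather than a real LCA on risk behavior. The variable names cpost1, cpost2, modal, and maxpost are made up for illustration. It shows only the naive assignment and how to look at its uncertainty; it does not correct for that uncertainty.

```stata
use https://www.stata-press.com/data/r16/gsem_lca2, clear
quietly gsem (glucose insulin sspg <- _cons), lclass(C 2)

* Posterior probability of membership in each class: creates cpost1 and cpost2
predict cpost*, classposteriorpr

* Modal assignment: take the class with the larger posterior probability
generate byte modal = cond(cpost1 >= cpost2, 1, 2) if !missing(cpost1, cpost2)

* How certain is each assignment? Values near 0.5 flag ambiguous cases
egen double maxpost = rowmax(cpost1 cpost2)
summarize maxpost
tabulate modal
```

If many observations have a maximum posterior probability well below 1, tabulating other variables by modal treats a guess as if it were known.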

One way around this is to go and fit a latent class regression. Say K refers to latent class membership, Indicators is the vector of indicators of the latent class (e.g. going with my prior example, you might use sex, alcohol consumption, smoking, other drug use, etc. as indicators), and Y is a vector of covariates which might influence latent class membership but aren't indicators (e.g. single parent, income). In an LCA, you estimate E(Indicators | K = k). In a latent class regression, you simultaneously fit an LCA and estimate P(K = k | Y).

The problem with latent class regression is: what if you have a lot of indicators? Also, what if your latent class characteristics change substantially when you introduce Y? Three-step approaches will fit an LCA model and then tabulate Y by K while also correcting for classification uncertainty. A whole bunch of articles can be found if you Google this. Many are written by Jeroen Vermunt and colleagues. Quite frankly, it's taken me a long time to understand what they're talking about, and I still can't understand their algebra for how they correct for classification uncertainty. I can gather that it's not straightforward to implement in Stata (or at least it's beyond my math and programming skill), so I haven't tried.

And speaking of entropy, that calculation is fairly straightforward to implement, and many forum members have given code. However, I'd like to see this implemented in Stata 18. I made this request in a post on the wishlist for 17. As another part of that wishlist, when doing latent class with binary indicators, we will sometimes have situations where the class-specific proportion of an indicator is 0 or 1. That corresponds to logit intercepts of +/- infinity, and it will prevent convergence with Stata's default convergence criteria. Mplus (and possibly the R package poLCA plus the Penn State LCA plugin for Stata) will constrain the logit intercepts to +/- 15 as appropriate, and then declare convergence while providing a warning. I'd like to see this implemented as default behavior in Stata, with a warning made fairly prominently. I'd like the manual to outline this case, describe why it happens, and warn that too many such constraints are a sign that you're trying to extract too many latent classes (i.e. drop this model and go back to the one with k-1 latent classes).
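For reference, here is a minimal sketch of the entropy calculation. It assumes the usual normalized (relative) entropy, 1 - [sum over observations and classes of -p*ln(p)] / [N*ln(K)], where p is the posterior class probability. The variable and scalar names are made up, and the dataset is the latent profile example from earlier in the thread.

```stata
use https://www.stata-press.com/data/r16/gsem_lca2, clear
quietly gsem (glucose insulin sspg <- _cons), lclass(C 2)
predict cpost*, classposteriorpr

local K = 2                        // number of latent classes fitted above
generate double ent_i = 0          // per-observation entropy contribution
forvalues k = 1/`K' {
    * p*ln(p) tends to 0 as p tends to 0, so skip exact zeros to avoid ln(0)
    replace ent_i = ent_i - cpost`k' * ln(cpost`k') if cpost`k' > 0
}

quietly summarize ent_i
scalar rel_entropy = 1 - r(sum) / (r(N) * ln(`K'))
display "Normalized entropy = " %6.4f rel_entropy
```

Values near 1 mean the posterior probabilities are close to 0 or 1, i.e. classes are well separated; values near 0 mean the model can barely tell the classes apart.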

All 3 of the issues here (latent class vs distal outcomes, entropy, and no convergence due to logit intercepts wandering to infinity) are raised fairly frequently on the forum.

Ideally, I would also like to see the bootstrap likelihood ratio test for k vs k-1 latent classes implemented. That seems to be a well-accepted test, but it appears complex to implement and also very processor-intensive to execute.

1 like
Comment
Giovanni Russo

Join Date: Sep 2015

Posts: 14
#111

23 Sep 2021, 06:30

I was referring to alternatives to ML for estimating SEM models that are less sensitive to deviations from the normality assumption, for example Diagonally Weighted Least Squares (DWLS), also referred to as WLSM or WLSMV.
1 like
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#112

23 Sep 2021, 07:21

Originally posted by Giovanni Russo

I was referring to alternatives to ML for estimating SEM models that are less sensitive to deviations from the normality assumption, for example Diagonally Weighted Least Squares (DWLS), also referred to as WLSM or WLSMV.

That makes sense. Stata does have an asymptotic distribution free (ADF) estimator for traditional SEM. I am under the impression that it is a type of weighted least squares estimator. However, this isn't my specialty. Perhaps someone more knowledgeable can comment.

Comment
Jack Edmondson

Join Date: Sep 2021

Posts: 2
#113

23 Sep 2021, 13:25

I would like to suggest that StataCorp reconsider how it abstracts and utilizes different commands and functionalities from a computer science perspective. What I mean by this is that there are features added long ago that, if one were to redesign them today, one would likely never implement in the way they currently exist in Stata, and there are functionalities that other software/languages have which Stata lacks. To give some examples:

sort and gsort have no relative areas of strength over each other, as best I can tell; the only difference is that sort's functionality is merely a subset of gsort's. If I were to reconfigure the sorting functionality in Stata, there seems to be no reason that I would keep it as-is. Why should the sort command be restricted to sorting only in ascending order? If I want to sort in a different direction, it would be much simpler and a better abstraction, in principle, to have this feature in the same command that does ascending sorting, particularly when the command that currently does descending sorting can also do ascending sorting.
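A short illustration of the current split, using the auto dataset that ships with Stata (the choice of variables is arbitrary):

```stata
sysuse auto, clear

sort price                 // sort offers ascending order only
gsort -price               // gsort: a leading minus sign requests descending order
gsort foreign -price       // mixed: foreign ascending, price descending within it

list make foreign price in 1/5
```

The last gsort call is the kind of request that sort alone cannot express.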

Other examples include the distinction between gen and egen, or the fact that one cannot use the merge command to join two datasets on key variables that have different names, a feature present in many alternatives to Stata. These examples may not seem like a big concern, but improving them would nevertheless represent quality-of-life gains over the long run. Indeed, in my experience, it is little distinctions like these that frustrate new users of Stata the most. Refining the functionality of existing commands to better and more natural levels of abstraction would certainly improve Stata's long-term accessibility (even though I'm sure a few veterans might grumble about changes to core functionality on account of having to re-learn something they have known for years). I don't think StataCorp needs to throw everything away, of course, but some introspection on how core features might benefit from small changes or minor improvements would be very welcome.
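To show what the merge limitation means in practice, here is a minimal sketch of the usual workaround: rename the key in one dataset so both sides share a name. The key name car_name and the tempfile name prices are made up for illustration, and the block needs to be run in one go from a do-file because of the tempfile.

```stata
* Build a small lookup file whose key is called car_name rather than make
sysuse auto, clear
keep make price
rename make car_name
tempfile prices
save `prices'

* Workaround: give the key the same name in both datasets, then merge
use `prices', clear
rename car_name make
save `prices', replace

sysuse auto, clear
keep make mpg
merge 1:1 make using `prices'
```

The wish is for merge to accept the two differently named keys directly, which would remove the middle block.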

Originally posted by Jared Greathouse

As someone who does quasi-experimental research, I would appreciate formal Stata implementation for procedures such as augmented synthetic-controls, robust synthetic controls, and similar estimators. In Stata 17, Stata formally implemented Difference-in-Differences estimators, so I'd like to see formal extensions in this area too, given recent methodological advances.

I would also like to say I second this suggestion by Jared. I know that there exists an R package that can do this already. If you read Causal Inference: The Mixtape by Scott Cunningham and look at his example code for implementing these methods in Stata and in R, the Stata example is 3 pages long, whereas the R example is only a few lines when using the package. Something in Stata that incorporates these methodologies would be great.
2 likes
Comment
Christopher Bratt

Join Date: May 2019

Posts: 144
#114

24 Sep 2021, 15:34

In interactive use, Stata demands /// for multi-line commands, except when { } are required in loops or if-statements.

Other languages use indent (e.g. Python) or allow for added brackets to indicate where a multi-line command starts and where it ends (e.g. R).

It shouldn’t be too difficult to allow for such an option, even in interactive use, I hope? It would at least ease coding (and maybe make code prettier too).
1 like
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#115

24 Sep 2021, 17:42

the parts of the above that I understand are not correct ("///" can only be used in do files, not in interactive use) and the rest is confusing; please clarify
1 like
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#116

24 Sep 2021, 18:41

Originally posted by Christopher Bratt

In interactive use, Stata demands /// for multi-line commands, except when { } are required in loops or if-statements.

Other languages use indent (e.g. Python) or allow for added brackets to indicate where a multi-line command starts and where it ends (e.g. R).

It shouldn’t be too difficult to allow for such an option, even in interactive use, I hope? It would at least ease coding (and maybe make code prettier too).

If you really don't want the added /// (or /* */) for multi line commands, you have the option already to change the delimiter to a semicolon.

Code:

#delim ;
Your long multi line command;
#delim cr // to revert to carriage return
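A concrete version of the same pattern, as a sketch to be run from a do-file. The scatter command and its titles are just an example on the auto dataset.

```stata
sysuse auto, clear

#delimit ;
scatter mpg weight,
    title("Mileage against weight")
    ytitle("Miles per gallon")
    xtitle("Weight (lb)");
#delimit cr

describe, short
```

Between the two #delimit lines, a command ends at the semicolon rather than at the line break, so no /// is needed.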
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#117

24 Sep 2021, 19:12

It would be convenient if there were a single command that would copy a label from one frame to another.
2 likes
Comment

Christopher Bratt

Join Date: May 2019
Posts: 144

#118

25 Sep 2021, 01:57

Responding to #115 and #116:

("///" can only be used in do files, not in interactive use

You can use /// in interactive use: when running parts of code in a do-file.

I assume most people develop a do-file interactively. Or, one can use a do-file to code interactively without keeping the do-file later. In Stata's do-file editor, running parts of the code in a do-file requires that the user select the code in question (this is a bit cumbersome; running parts of a do-file is easier in external editors).

you have the option already to change the delimiter to a semicolon.

Not for interactive use; only when you run the whole do-file. (An earlier request at Statalist, not by me, was that Stata should be more consistent and allow for the semicolon in interactive use.)

the rest is confusing; please clarify

Take this code:

Code:

tabplot disagree_home workplace2,              ///
   title("Use of patients' home",              ///
          size(medlarge))                      ///
   xtitle("")                                  ///
   b1title("Nurses' workplace")                ///
   subtitle("") ytitle("")                     ///
   percent(workplace2)                         ///
   showval separate(disagree_home)             ///
   bar1(bfcolor(green) blcolor(green))         ///
   bar2(bfcolor(green*0.1) blcolor(green*0.3)) ///
   bar3(bfcolor(red*0.2) blcolor(red*0.3))     ///
   bar4(bfcolor(red*0.5) blcolor(red*0.6))     ///
   bar5(bfcolor(red*0.9) blcolor(red))         ///
   scheme(s1color) yreverse aspect(1)          ///
   name(tabplot1, replace) nodraw

I would prefer to be able to use some sort of brackets: ( ) [ ] { } to show where the command starts and where it ends, like I do when coding in R.

Or, see below. Even when the code makes clear where the command starts and where it ends, Stata needs its ///.
Brackets -- here, left parenthesis at the start, then right parenthesis at the end -- make it clear where the code starts and where it ends.

Code:

runmplus(                                               ///
    predage_r2 lkrspag_r2 trtbdag_r2 c_age c_agesq      ///
    country pspwght,                                    ///
    saveinputfile(mplusin) saveinputdatafile(mplusin)   ///
    savelogfile(e01_5a_MNLFA_all_MI)                    ///
    variable(                                           ///
        weight      = pspwght;                          ///
        categorical = predage_r2 lkrspag_r2 trtbdag_r2; ///
        constraint  = c_age c_agesq;                    ///
        cluster     = country;                          ///
    )                                                   ///
    analysis(                                           ///
        type      = complex;                            ///
        estimator = mlr;                                ///
        link      = logit;                              ///
    )                                                   ///
    model(                                              ///
        discrim BY predage_r2*2.53025;                  ///
        discrim BY lkrspag_r2*6.52504;                  ///
        discrim BY trtbdag_r2*4.80997;                  ///
                                                        ///
        discrim ON c_age*-0.05025;                      ///
        discrim ON c_agesq*0.04353;                     ///
                                                        ///
        [ discrim@0 ];                                  ///
                                                        ///
        [ predage_r2$1*1.33603 ];                       ///
        [ lkrspag_r2$1*2.48037 ];                       ///
        [ trtbdag_r2$1*3.12506 ];                       ///
                                                        ///
        discrim*999 (v_disc);                           ///
                                                        ///
    model constraint:                                   ///
        new(v_disc1*0.01080);                           ///
        new(v_disc2*-0.00116);                          ///
        v_disc = exp(v_disc1*c_age + v_disc2*c_agesq);  ///
    )                                                   ///
    output(svalues);                                    ///
    savedata(                                           ///
    save=fscores;                                       ///
    file=mnlfa0.dat;                                    ///
    )                                                   ///
)

I don't want to have to type all the ///, and I would prefer not having to look at them.

I have coding experience only in Stata and R. R uses brackets to indicate the start and end of a specific command (a semicolon can also be used, which is most convenient for separating two commands on one line).

Since I don't code in Python, I don't know much about it. But I think its use of indentation is elegant. An indented line means: "Code continues!"

Code:

a, b = 1, 2  # example values so the snippet runs on its own
if a==1:
    print(a)
    if b==2:
        print(b)
print('end')

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35699
#119

26 Sep 2021, 10:45

Braces without content are allowed. I just tried this in the do-file editor, which indented automatically. I might use it more. (tabplot is from the Stata Journal, and just an example here.)

Code:

sysuse auto, clear
{
    tabplot foreign rep78
}

I am not fond of the effect of lots of lines ending /// but sometimes it is the least unattractive choice. Without quite putting my finger on why, I dislike ; as a delimiter in Stata but am happy to use it sometimes in Mata. I think it's because needing to type #delimit ; and #delimit cr is ugly as well as irritating.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#120

26 Sep 2021, 11:20

As a note to #118 above:

You can use /// in interactive use: when running parts of code in a do-file.

In discussing Stata the term "interactive" is generally reserved for commands typed one-at-a-time in the Stata Command window, or generated from the menus. (This is especially true for Mata.) Submitting do-files, or portions of do-files, is typically not considered interactive. In section 16.1.2 of the Stata User's Guide PDF this distinction is reinforced with the implication that interactive use of /// is different than use in a do-file.

The /* */, //, and /// comment indicators can be used in do-files and ado-files only; you may not use them interactively.
4 likes
Comment
