Wishlist for Stata 17

Jeff Wooldridge

Join Date: Apr 2014

Posts: 2291
#466

04 Mar 2021, 15:40

I just noticed that -estat ovtest- always produces the traditional, nonrobust version of RESET even if the -vce(robust)- or vce(cluster id)- options are used in -regress-. This leads to logically inconsistent analysis where one uses robust or cluster-robust inference for everything except specification tests. Why should the original version of the test always be used when it's known how to easily make it robust? RESET is easily set up as a variable addition test, and therefore one can use the same robust statistics as when testing for significance of omitted variables use the -test- option. As far as calling RESET -ovtest-, I've already complained about that because it makes users think it's testing for omitted variables when it is a pure functional form test.

Admittedly, it's easy to carry out the fully robust RESET in a few lines, but many don't know that. So they rely on -estat ovtest-, never suspecting that they are using a flawed test.
6 likes
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#467

05 Mar 2021, 16:02

This post pertains to latent class analysis. There are innumerable posts where someone says that they are running a maximizer and it eventually stalls out at a certain log-likelihood that isn't concave. Here is one example, but again, this is a very frequent question.

This occurs in binary items when one latent class has an item intercept going to over +15 or under -15. In plain language, that means the you have one latent class that has a 100% or 0% chance of endorsing the item. The maximizer won't declare convergence because it thinks that the likelihood function is not concave. Other software packages, e.g. MPlus and the Penn State University LCA plugin, will automatically constrain the offending intercepts at +/- 15. I believe that MPlus will return an alert that some of the parameters were constrained.

I would like Stata a) to automatically constrain logit intercepts as I described, b) to report the number of parameters thus constrained and to warn the user that too many constrained parameters might be a sign that you're trying to fit more latent classes than the data can support, and c) to alert users about this in the manual. Alternatively, have the current failure mode be the default, but add an option to constrain the logit intercepts.

To be honest, part of my reasoning is selfish. I am spending a bunch of time responding to this sort of query on the forum. This would enable me to spend less time. Now, it's true that I could simply elect not to respond to these issues. But other packages do this, and it seems logical that Stata should as well.

A possible objection is that it enables users to get into some trouble. Maybe it does. However, a knowledgeable reviewer would be able to catch this. If you are reviewing an LCA paper with binary items and you see that there are multiple reported class proportions that are 0 or 1, ask them about it. Mention the Kathryn Masyn article cited in our SEM example 52, which says that too many intercepts constrained like this is a sign of that you may be trying to extract too many classes (or I phrase it in English a couple paragraphs above). Ask them how many parameters were constrained in the k-1st solution, and ask them how that solution differs substantively from the candidate solution. There, now you too know what to ask. I don't think that automatically constraining the parameters is necessarily fail deadly if you report and warn the user.

Moreover, some users are going to follow the advice that Stata's Jeff Pitblado posted in this thread and turn off the secondary convergence criterion (i.e. the nonrtolerance option) without quite realizing what they're doing. I think that doing this enables worse failures than constraining and warning.

Second, I'd like Stata to implement an automatic calculation for model entropy. Admittedly this is fairly easy to calculate manually, but people do keep asking.

Third, I would ideally like to see Stata implement the bootstrap likelihood ratio test for model selection. Normally, we would just compare BIC between models. However, MPlus, which is regarded as the best available software option for this type of modeling, does implement the BLRT. Some simulation work appears to show that this is a good method to choose the correct number of classes, and it does offer a backup to the BIC. However, this doesn't appear to be trivial to implement it. gsem doesn't work with the bootstrap prefix. Moreover, it's possible that some of the bootstrap samples may refuse to converge; Stata would have to decide how to adjudicate that type of event.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
2 likes
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#468

05 Mar 2021, 17:14

Speaking about fail safe vs fail deadly, I have a conceptual issue with the default settings for latent profile analysis. The short version is that the default options assume that all the indicators are uncorrelated within each latent class, and that the variances of (the errors of) all the indicators are equal across all latent classes. I believe that this is a highly restrictive assumption, and that it may not be realistic for most situations. Many people will use the default. The manual starts with this default, and doesn't really clarify that you need to examine alternatives to the default. It's not immediately obvious to people, but this can have you fitting this extremely restrictive model and not realizing that you should be at least testing something different.

The long version requires some setup to explain, so please bear with me. Latent profile analysis is basically latent class analysis with Gaussian indicators. You fit a mean and an error variance for each latent class. I assume that error variance, in this context (because it's not a regression, it's a simple model for a mean of an indicator), translates to the variance of the indicator within the class. Now, consider the diagram below, which I'm borrowing from the R package flexmix. This is an artificial (I think) dataset with two dimensions. You could simply think of them as physical x and y coordinates for this post. When you look at the dataset, you see a few obvious clusters of points, but our computers don't think like we do, so we have to use things like latent profile analysis to get them to put these points into clusters.

Now, the two panels have the same underlying data. The left panel was classified using a latent profile model. If you read the example starting on pg 14, you'll see that for the left panel, the author ran the wrong model. The covariance matrix for this model is diagonal. In English, the rows and the columns represent the within-class error covariance structure. The covariance of x and x is the variance of x. In a diagonal structure, you estimate the variances and fix the covariances at 0. E.g. for each latent class:
x y

x Var(e.x) 0

y 0 Var(e.y)

Basically, within each class, x and y are not allowed to be correlated. The alternative assumption is that the covariance structure isn't diagonal. So, you assume the structure below. I believe, from my reading of flexmix's manual that the non-diagonal covariance structure may be the default when you assume Gaussian indicators, and you have to actively specify the more restrictive diagonal structure. Stata's default is a diagonal covariance structure in each class, and you manually specify the unstructured covariance structure as depicted in the table below. (For the record, you add the option covstructure(e._OEn, unstructured) to fit an unstructured covariance structure)

x y

x Var(e.x) Cov(e.x, e.y)

y Cov(e.y, e.x) Var(e.y)

Imagine that you have a magic multidimensional cookie cutter. In LPA, you tell your cookie cutter that you're going to cut k elliptical cookies out of the data, and please cover as many of the points as you can. Your cookie cutter will decide (through magic) how long each axis of the ellipse will be. In a diagonal covariance structure, you're telling your cookie cutter that the ellipse has to be angled at 0 or 90 degrees, and nothing in between. If you do that, then the magic cookie cutter is going to treat that diagonal cluster of points as two circles. It would be more correct to let the cookie cutter choose the correct angle. If you do that, then the cookie cutter will treat that diagonal group as one group. Again, Stata's default is to tell the cookie cutter that it can only position itself at 0 and 90 degrees, so the picture on the left is our default result in Stata.

There is a separate issue. Stata's default is to also constrain the class error variances to be equal across latent classes. To demonstrate with Stata's stock dataset:

Code:

use https://www.stata-press.com/data/r16/gsem_lca2 gsem (glucose insulin sspg <- _cons), lclass(C 2) byparm ------------------------------------------------------------------------------- | Coef. Std. Err. z P>|z| [95% Conf. Interval] --------------+---------------------------------------------------------------- 1.C | (base outcome) --------------+---------------------------------------------------------------- 2.C | _cons | -1.541025 .2205682 -6.99 0.000 -1.973331 -1.10872 --------------+---------------------------------------------------------------- glucose | C | 1 | 41.22237 1.298051 31.76 0.000 38.67824 43.7665 2 | 115.7123 2.849914 40.60 0.000 110.1266 121.2981 --------------+---------------------------------------------------------------- insulin | C | 1 | 20.98005 1.000974 20.96 0.000 19.01817 22.94192 2 | 7.553144 2.160949 3.50 0.000 3.317761 11.78853 --------------+---------------------------------------------------------------- sspg | C | 1 | 14.96579 .6868081 21.79 0.000 13.61967 16.31191 2 | 34.5529 1.53117 22.57 0.000 31.55187 37.55394 --------------+---------------------------------------------------------------- var(e.glucose)| C | 1 | 191.5596 23.83815 150.0992 244.4723 2 | 191.5596 23.83815 150.0992 244.4723 var(e.insulin)| C | 1 | 119.0542 14.00336 94.54204 149.9217 2 | 119.0542 14.00336 94.54204 149.9217 var(e.sspg)#C| 1 | 55.91283 6.713667 44.18801 70.7487 2 | 55.91283 6.713667 44.18801 70.7487 -------------------------------------------------------------------------------

The bolded coefficients are the error variances for each of the two latent classes. This is akin to telling your magic cookie cutter that now, it also has to cut equal-sized ellipses out of the data. flexmix doesn't appear to have an equivalent option at all (or because my R knowledge is less advanced than my Stata knowledge, maybe I've missed the option). For the record, you turn that assumption off this way:

Code:

gsem (glucose insulin sspg <- _cons), lclass(C 2) lcinvariant(none) byparm ------------------------------------------------------------------------------- | Coef. Std. Err. z P>|z| [95% Conf. Interval] --------------+---------------------------------------------------------------- 1.C | (base outcome) --------------+---------------------------------------------------------------- 2.C | _cons | -.236545 .1805566 -1.31 0.190 -.5904294 .1173394 --------------+---------------------------------------------------------------- glucose | C | 1 | 35.98797 .5732451 62.78 0.000 34.86443 37.11151 2 | 77.638 4.657351 16.67 0.000 68.50976 86.76624 --------------+---------------------------------------------------------------- insulin | C | 1 | 16.5196 .5887541 28.06 0.000 15.36566 17.67354 2 | 21.26216 2.112797 10.06 0.000 17.12115 25.40317 --------------+---------------------------------------------------------------- sspg | C | 1 | 11.17919 .6359466 17.58 0.000 9.932759 12.42562 2 | 27.59469 1.092309 25.26 0.000 25.4538 29.73557 --------------+---------------------------------------------------------------- var(e.glucose)| C | 1 | 22.62693 4.35593 15.5153 32.99827 2 | 1263.401 223.8804 892.6978 1788.043 var(e.insulin)| C | 1 | 26.36603 4.285562 19.17298 36.25767 2 | 283.2775 50.93803 199.137 402.9697 var(e.sspg)#C| 1 | 25.26045 5.003334 17.1334 37.24247 2 | 70.49358 12.7819 49.4094 100.5749 -------------------------------------------------------------------------------

You see that the error variances now differ. I won't show syntax, but you can verify for yourselves that the option covstructure(e._OEn, unstructured) will estimate an unstructured covariance structure within each class, so you will now get error covariances in the output (remember, Stata's default is a diagonal structure within each class).

I know that this default behavior stems from gsem's default options. It may make sense in other contexts. In the context of LPA, it seems to me that the default behavior has you fitting an extremely restrictive model. By my reading of SEM example 52, which introduces LPA, the manual starts with the default behavior, then it mentions that the correct final model had unstructured within-class covariance structure and parameters not constrained equal across classes. It doesn't discuss how restrictive the default behavior is. It doesn't tell you that you should be investigating alternative model structures. The manual does cite a book chapter by Katherine Masyn, which does say you should investigate all alternative model structures, but how many of us go dig through the manual sources? Hence, my ask: change the default options to the least restrictive model structure. If you don't, clarify SEM example 52 to state that you really need to examine alternatives to the default.

Masyn (whose chapter I frequently cite here) appears to say that you need to investigate all 4 potential model structures: class-variant + unstructured indicator covariances, class-invariant + unstructured indicator covariances, and class-variant and -invariant + diagonal indicator covariances. Now, I would not normally care to question her depth of technical knowledge. However, I am still going to differ from her in one respect: I'm not sure how critical it is to examine model structures that seem potentially too constrained to be realistic. For example, consider that flexmix doesn't even appear to offer the option to restrict the class error variances to be equal. Even if I'm wrong on that score, I believe I've shown that Stata's default behavior has you fitting a very restrictive model and potentially going down the wrong path. I'd like to see gsem's default behavior changed for latent profile analysis, obviously retaining the option to constrain the model to be more restrictive if we decide it can be based on the results or some theoretical grounds.

This 2009 paper may be an example of analysts going with an overly constrained model without examining alternatives. The article is free on Pubmed. The model was fit in MPlus, using 3 indicators. I'm not familiar with MPlus' default behavior in latent profile models. However, the tables clearly do not mention any covariance between the indicators, nor do they mention the authors testing between models with diagonal vs unstructured covariance structures. Thus, I'd have to assume that either MPlus' default behavior in 2009 was diagonal covariance structure and the authors didn't consider examining the alternative, or the default was unstructured covariance but the authors chose not to report the results for unknown reasons - maybe they didn't realize what this was? As to the issue of class-invariant vs class-varying error variances, there's no way to tell what structure was examined from the reported information. Table 1 is the results table, and it only reports mean and SE of each indicator in each class. There's no information on indicator variance (which is substantively interesting in itself, IMO).

Regarding my immediately prior post, you'll note that even way back in 2009, this paper reported entropy and the p-values from bootstrap likelihood ratio tests. Hence, I would really like Stata to implement these features.

Last, note that this particular issue does not apply to latent class analysis. In latent class with binary indicators, you have one logit intercept per item. There's no error to be correlated within each class. The variance of a binary item is n*p*(1-p), so it's inherently dependent on the probability, and there's no way to constrain that to be equal across the latent classes. There is the separate issue of potential violations of conditional independence - the assumption in all latent variable models is that given the value of the latent variable (here, that's the latent class), the probability of endorsing each item is independent of the others. There was once a question on how to test for violations of that on the forum. Honestly, I'm not familiar with that issue, but my impression is that MPlus may have some tests. I wouldn't view this as a critical issue yet.

Last edited by Weiwen Ng; 05 Mar 2021, 17:26.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
4 likes
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#469

08 Mar 2021, 02:38

The built in command -_pctile- is unreasonably minimal in what it leaves in r() behind, and this creates some headache. Furthermore, it seems to me that -_pctile- has bugs in degenerate situations with too few observations.

Unlike any other command that I am aware of, -_pctile- does not leave the number of observation on the basis of which the calculation was made r(N), and it does not leave behind in r() an indication that it has failed to calculate what it was asked to calculate. Here is what happens with -_pctile- when everything is fine, and notice that it does not leave r(N):

Code:

. set obs 30 number of observations (_N) was 0, now 30 . gen x = rnormal() . _pctile x, nq(4) . return list scalars: r(r1) = -.7291995286941528 r(r2) = -.0959552293643355 r(r3) = .9089744687080383

And here is what happens with -_pctile- when everything is not fine:

Code:

. gen y = . (30 missing values generated) . _pctile y, nq(4) no observations . return list .

This message "no observations" is not left behind anywhere. Again there is no r(N)

Finally here is what happens in degenerate situations:

Code:

. replace y = rnormal() in 1/2 (2 real changes made) . _pctile y, nq(4) . return list scalars: r(r1) = -1.311697125434875 r(r2) = -.6397798098623753 r(r3) = .032137505710125 . summ y Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- y | 2 -.6397798 .9502346 -1.311697 .0321375 . drop y . gen y = . (30 missing values generated) . replace y = rnormal() in 1 (1 real change made) . _pctile y, nq(4) . return list scalars: r(r1) = -1.227839469909668 r(r2) = -1.227839469909668 r(r3) = -1.227839469909668 . summ y Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- y | 1 -1.227839 . -1.227839 -1.227839 .

How did -_pctile- manage to calculate 3 break points on the basis of 2 observations??? And why is it not producing any error message when on the basis of 1 observation it calculated 3 identical break points?

I would like that -_pctile- at the very minimum leave behind r(N). Also it seems to me that the behaviour in degenerate situations has to be reworked, and some error messages should be left behind.

Last edited by Joro Kolev; 08 Mar 2021, 02:41.
5 likes
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2534
#470

08 Mar 2021, 06:49

Something that I would like is to enhance the graph generating interface so it allows a more modern approach taken by other engines.
Basically, once we have created a plot, for example:

Code:

sysuse auto scatter price mpg,

It would be ideal if we could just add more features on top of this one. For example:

Code:

addplot:lfit price mpg,

Yes, in other words, I wish Stata could add officially add "addplot" (by Ben Jann) to the official Stata Graph Engine machinery.

And if this is added, could also be possible to add the option "legend" for graph combine? (so it is easy to add a single legend to figures)

Last edited by FernandoRios; 08 Mar 2021, 07:44.
5 likes
Comment
JanDitzen

Join Date: Jan 2015

Posts: 354
#471

16 Mar 2021, 03:45

I would appreciate a browser or at least a list for all objects, in particular mata matrices. The list could look like the variable list in Stata, but ultimately, having something click-able behaviour would be great. This might require some work, but I think users would really appreciate it. In addition, free of charge programs have such features.

In detail, I am thinking of R-Studio or Matlab which show all objects, i.e. variables (dataframes), scalars, matrices etc.I understand that Stata's approach is different and more focused on a certain dataset. However, I think having another window would not hurt. Inspecting data is so much easier and straightforward in R-Studio or Matlab. Especially when it comes to arrays or sparse matrices (how is it possible that there are no sparse matrices in mata - as a side note!?). Inspecting data in Stata is in comparison to those tools really painful.
1 like
Comment
Sebastian Kripfganz

Join Date: May 2014

Posts: 2624
#472

16 Mar 2021, 07:12

Originally posted by JanDitzen View Post

I would appreciate a browser or at least a list for all objects, in particular mata matrices. The list could look like the variable list in Stata, but ultimately, having something click-able behaviour would be great. This might require some work, but I think users would really appreciate it. In addition, free of charge programs have such features.

In detail, I am thinking of R-Studio or Matlab which show all objects, i.e. variables (dataframes), scalars, matrices etc.I understand that Stata's approach is different and more focused on a certain dataset. However, I think having another window would not hurt. Inspecting data is so much easier and straightforward in R-Studio or Matlab. Especially when it comes to arrays or sparse matrices (how is it possible that there are no sparse matrices in mata - as a side note!?). Inspecting data in Stata is in comparison to those tools really painful.

While it is not the most elegant solution, you can do the following to get a list of all Mata objects:

Code:

mata: mata describe

Regarding sparse matrices: I think there is a section in The Mata Book. You can use associative arrays in Mata to create sparse matrices.

Last edited by Sebastian Kripfganz; 16 Mar 2021, 07:25.

https://www.kripfganz.de/stata/
2 likes
Comment
Sam Faye

Join Date: Mar 2021

Posts: 1
#473

16 Mar 2021, 20:42

Panel Stochastic Frontier module is ancient! I have have waited more than enough; Will they ever update the module for Panel Stochastic Frontier? PLease? The literature has evolved quite significantly and STATA has not updated anything in the recent editions.
1 like
Comment
JanDitzen

Join Date: Jan 2015

Posts: 354
#474

18 Mar 2021, 02:49

Originally posted by Sebastian Kripfganz View Post

While it is not the most elegant solution, you can do the following to get a list of all Mata objects:

Code:

mata: mata describe

Regarding sparse matrices: I think there is a section in The Mata Book. You can use associative arrays in Mata to create sparse matrices.

Thanks for your reply.

Code:

mata: mata describe

is of course a solution, but something a bit more interactive would be desirable.

On the sparse matrices: storing them is not such a big problem, the difficulty starts when trying to combine those with mathematical functions. Maybe I am missing something here?!
Comment
Fahad Mirza

Join Date: Sep 2018

Posts: 263
#475

20 Mar 2021, 01:35

I have two requests and I am not sure how popular these requests may be but it will be really nice if StataCorp can look into it.

Is there a way to make dashboards using Stata? I am not referring to very complex dashboards but rather, simple ones where you have the option of selecting a subgroup to see how that affects descriptive statistics and regression results. I was thinking maybe the built in command by the name of Dyndoc can address this but honestly, I am not sure about it (nor do i have the relevant knowledge. Will be nice if someone can enlighten me.)

Secondly, will be nice to have Stata allow for basemaps (from open street maps etc.) in grmap and allow for placing city/district/village labels outside the choropleth maps. I saw this long guide by Richard Williams I believe on this and honestly found that to be quite technical, so thought maybe I can share it as a request to have a label option that takes labels outside the choropleth drawing area.
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#476

20 Mar 2021, 06:48

Fahad Mirza
regarding your first suggestion, you may be interested in the Stata conference talk that I gave in 2016: https://wbuchanan.github.io/stataConference2016/. I didn’t do anything with regression results, but there are a few examples of how I created interactive graphs in a similar vein to what you requested. With the Python API it would be possible to use existing Python libraries to do stuff that is a bit more complex (e.g., using Dash/Plotly, Bokeh, and/or Altair). The Python API should make it possible to create the different types of GIS related graphics that you mentioned as well.
1 like
Comment
Fahad Mirza

Join Date: Sep 2018

Posts: 263
#477

20 Mar 2021, 07:05

wbuchanan firstly, huge fan, I really admire the time and effort you have put into coding amazing stuff. Secondly, the Stata conference is definitely what I am looking for and I would love to learn how you made this happen. Is there a way/possibility you could share a resource or the code you used in developing these visuals?

Regarding the Python API, well API in itself are something I need to learn but not sure where to start from. I may possibly be sounding excessively like a leech seeking insight/knowledge but I would really appreciate some direction on these topics from where I can get rolling.
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#478

21 Mar 2021, 07:47

Fahad Mirza
Thank you for the kind words. Any time you see a url like:

[username].github.io/something

You can navigate to the underlying GitHub repository by changing the order of the elements in the URL:

github.com/[username]/something.

so the source code for the presentation above is available at:

https://github.com/wbuchanan/stataConference2016

I’m more than happy to offer any tips/advise I can provide, but in order to prevent hijacking this thread just send me a message on StataList or email me with any questions you have.
2 likes
Comment
Niels Henrik Bruun

Join Date: Aug 2014

Posts: 568
#479

26 Mar 2021, 04:35

If the idea is to move code from Stata to Mata in commands such that mainly input validation is done in Stata code and the rest in Mata, it would nice error handling like the try-except concept were implemented in Mata. Repeating a very old grump

Kind regards

nhb
1 like
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#480

26 Mar 2021, 08:15

I think it would also be nice to allow

Code:

post

, and it's related commands, to post strLs and for

Code:

statsby

to accept expressions that would store a string in the result (e.g., model=e(cmd) fails).
1 like
Comment

	x	y
x	Var(e.x)	0
y	0	Var(e.y)

	x	y
x	Var(e.x)	Cov(e.x, e.y)
y	Cov(e.y, e.x)	Var(e.y)

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment