  • I just noticed that -estat ovtest- always produces the traditional, nonrobust version of RESET even if the -vce(robust)- or -vce(cluster id)- options are used in -regress-. This leads to logically inconsistent analysis in which one uses robust or cluster-robust inference for everything except specification tests. Why should the original version of the test always be used when it is known how to make it robust easily? RESET is easily set up as a variable addition test, so one can use the same robust statistics as when testing the significance of omitted variables with the -test- command. As for calling RESET -ovtest-, I've already complained about that, because it makes users think it is testing for omitted variables when it is a pure functional form test.

    Admittedly, it's easy to carry out the fully robust RESET in a few lines (a sketch follows), but many don't know that. So they rely on -estat ovtest-, never suspecting that they are using a flawed test.
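
    For concreteness, here is a minimal sketch of RESET as a variable addition test with robust inference; the outcome y and regressors x1-x3 are hypothetical placeholder names:

    Code:
    * Fit the model with robust standard errors
    regress y x1 x2 x3, vce(robust)
    * Generate powers of the fitted values
    predict double yhat, xb
    generate double yhat2 = yhat^2
    generate double yhat3 = yhat^3
    * Add them to the model and test them jointly, still with robust inference
    regress y x1 x2 x3 yhat2 yhat3, vce(robust)
    test yhat2 yhat3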

  • This post pertains to latent class analysis. There are innumerable posts where someone says that they are running a maximization and it eventually stalls at a certain log-likelihood, with the iteration log flagged as not concave. Here is one example, but again, this is a very frequent question.

      This occurs with binary items when one latent class has an item intercept going over +15 or under -15. In plain language, that means that you have one latent class with a 100% or 0% chance of endorsing the item. The maximizer won't declare convergence because it thinks that the likelihood function is not concave. Other software packages, e.g. MPlus and the Penn State University LCA plugin, will automatically constrain the offending intercepts at +/- 15. I believe that MPlus will return an alert that some of the parameters were constrained.

      I would like Stata a) to automatically constrain logit intercepts as I described, b) to report the number of parameters thus constrained and to warn the user that too many constrained parameters might be a sign that you're trying to fit more latent classes than the data can support, and c) to alert users about this in the manual. Alternatively, have the current failure mode be the default, but add an option to constrain the logit intercepts.

      To be honest, part of my reasoning is selfish. I am spending a bunch of time responding to this sort of query on the forum. This would enable me to spend less time. Now, it's true that I could simply elect not to respond to these issues. But other packages do this, and it seems logical that Stata should as well.

      A possible objection is that this enables users to get into some trouble. Maybe it does. However, a knowledgeable reviewer would be able to catch it. If you are reviewing an LCA paper with binary items and you see multiple reported class proportions that are 0 or 1, ask the authors about it. Mention the Katherine Masyn chapter cited in Stata's SEM example 52, which says that too many intercepts constrained like this is a sign that you may be trying to extract too many classes (or as I phrased it in plain language a couple of paragraphs above). Ask them how many parameters were constrained in the (k-1)-class solution, and ask how that solution differs substantively from the candidate solution. There, now you too know what to ask. I don't think that automatically constraining the parameters is necessarily fail deadly if you report and warn the user.

      Moreover, some users are going to follow the advice that Stata's Jeff Pitblado posted in this thread and turn off the secondary convergence criterion (i.e. the nonrtolerance option) without quite realizing what they're doing. I think that doing this enables worse failures than constraining and warning.

      Second, I'd like Stata to implement an automatic calculation of model entropy. Admittedly this is fairly easy to calculate manually (a sketch follows), but people do keep asking.
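
      For example, here is a minimal sketch of normalized entropy after a two-class model, assuming a model has just been fit with -gsem, lclass(C 2)-; the variable names cpr1 and cpr2 are placeholders:

      Code:
      * Posterior class membership probabilities for each observation
      predict double cpr1 cpr2, classposteriorpr
      * Sum of p*ln(p) across classes, guarding against ln(0)
      generate double plogp = cond(cpr1>0, cpr1*ln(cpr1), 0) + cond(cpr2>0, cpr2*ln(cpr2), 0)
      quietly summarize plogp
      * Normalized entropy = 1 - (-sum of p*ln(p)) / (N*ln(K)), here K = 2 classes
      display "Entropy = " 1 + r(sum)/(r(N)*ln(2))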

      Third, I would ideally like to see Stata implement the bootstrap likelihood ratio test (BLRT) for model selection. Normally, we would just compare BIC between models. However, MPlus, which is regarded as the best available software for this type of modeling, does implement the BLRT. Some simulation work appears to show that this is a good method for choosing the correct number of classes, and it offers a backup to the BIC. However, it doesn't appear to be trivial to implement: gsem doesn't work with the bootstrap prefix, and some of the bootstrap samples may refuse to converge, so Stata would have to decide how to adjudicate that type of event.

      • Speaking of fail safe vs fail deadly, I have a conceptual issue with the default settings for latent profile analysis. The short version is that the default options assume that all the indicators are uncorrelated within each latent class, and that the variances of (the errors of) all the indicators are equal across all latent classes. This is a highly restrictive assumption, and it may not be realistic for most situations. Many people will use the default. The manual starts with this default and doesn't really clarify that you need to examine alternatives to it. It's not immediately obvious, but this can leave you fitting an extremely restrictive model without realizing that you should at least be testing something different.

        The long version requires some setup to explain, so please bear with me. Latent profile analysis is basically latent class analysis with Gaussian indicators. You fit a mean and an error variance for each latent class. I assume that the error variance, in this context (because it's not a regression, it's a simple model for the mean of an indicator), translates to the variance of the indicator within the class. Now, consider the diagram below, which I'm borrowing from the R package flexmix. This is an artificial (I think) dataset with two dimensions; for the purposes of this post, you could simply think of them as physical x and y coordinates. When you look at the dataset, you see a few obvious clusters of points, but our computers don't think like we do, so we have to use things like latent profile analysis to get them to put these points into clusters.

        [Figure: two panels showing the same flexmix example data, classified under two different within-class covariance structures.]

        Now, the two panels show the same underlying data. The left panel was classified using a latent profile model. If you read the example starting on pg 14, you'll see that for the left panel, the author ran the wrong model. The covariance matrix for this model is diagonal. In English, the rows and columns represent the within-class error covariance structure; the covariance of x with x is the variance of x. In a diagonal structure, you estimate the variances and fix the covariances at 0, i.e. for each latent class:

                      x              y
        x         Var(e.x)           0
        y             0          Var(e.y)
        Basically, within each class, x and y are not allowed to be correlated. The alternative assumption is that the covariance structure is not diagonal, i.e. you assume the structure below. From my reading of flexmix's manual, I believe the non-diagonal covariance structure may be the default when you assume Gaussian indicators, and you have to actively specify the more restrictive diagonal structure. Stata's default is a diagonal covariance structure in each class, and you manually specify the unstructured covariance structure depicted in the table below. (For the record, you add the option covstructure(e._OEn, unstructured) to fit an unstructured covariance structure.)

                      x                y
        x         Var(e.x)      Cov(e.x, e.y)
        y      Cov(e.y, e.x)      Var(e.y)
        Imagine that you have a magic multidimensional cookie cutter. In LPA, you tell your cookie cutter to cut k elliptical cookies out of the data, covering as many of the points as it can. The cookie cutter decides (through magic) how long each axis of each ellipse will be. With a diagonal covariance structure, you're telling your cookie cutter that each ellipse has to be angled at 0 or 90 degrees, and nothing in between. If you do that, the magic cookie cutter is going to treat that diagonal cluster of points as two circles. It would be more correct to let the cookie cutter choose the correct angle; if you do that, it will treat that diagonal group as one group. Again, Stata's default is to tell the cookie cutter that it can only position itself at 0 or 90 degrees, so the picture on the left is our default result in Stata.

        There is a separate issue. Stata's default is to also constrain the class error variances to be equal across latent classes. To demonstrate with Stata's stock dataset:

        Code:
        use https://www.stata-press.com/data/r16/gsem_lca2
        gsem (glucose insulin sspg <- _cons), lclass(C 2) byparm
        
        -------------------------------------------------------------------------------
                      |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        --------------+----------------------------------------------------------------
        1.C           |  (base outcome)
        --------------+----------------------------------------------------------------
        2.C           |
                _cons |  -1.541025   .2205682    -6.99   0.000    -1.973331    -1.10872
        --------------+----------------------------------------------------------------
        glucose       |
                    C |
                   1  |   41.22237   1.298051    31.76   0.000     38.67824     43.7665
                   2  |   115.7123   2.849914    40.60   0.000     110.1266    121.2981
        --------------+----------------------------------------------------------------
        insulin       |
                    C |
                   1  |   20.98005   1.000974    20.96   0.000     19.01817    22.94192
                   2  |   7.553144   2.160949     3.50   0.000     3.317761    11.78853
        --------------+----------------------------------------------------------------
        sspg          |
                    C |
                   1  |   14.96579   .6868081    21.79   0.000     13.61967    16.31191
                   2  |    34.5529    1.53117    22.57   0.000     31.55187    37.55394
        --------------+----------------------------------------------------------------
        var(e.glucose)|
                    C |
                   1  |   191.5596   23.83815                      150.0992    244.4723
                   2  |   191.5596   23.83815                      150.0992    244.4723
        var(e.insulin)|
                    C |
                   1  |   119.0542   14.00336                      94.54204    149.9217
                   2  |   119.0542   14.00336                      94.54204    149.9217
         var(e.sspg)#C|
                   1  |   55.91283   6.713667                      44.18801     70.7487
                   2  |   55.91283   6.713667                      44.18801     70.7487
        -------------------------------------------------------------------------------
        The var(e.glucose), var(e.insulin), and var(e.sspg) coefficients are the error variances, and you can see that they are constrained equal across the two latent classes. This is akin to telling your magic cookie cutter that it also has to cut equal-sized ellipses out of the data. flexmix doesn't appear to have an equivalent option at all (or, because my R knowledge is less advanced than my Stata knowledge, maybe I've missed it). For the record, you turn that assumption off this way:

        Code:
        gsem (glucose insulin sspg <- _cons), lclass(C 2) lcinvariant(none) byparm
        -------------------------------------------------------------------------------
                      |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        --------------+----------------------------------------------------------------
        1.C           |  (base outcome)
        --------------+----------------------------------------------------------------
        2.C           |
                _cons |   -.236545   .1805566    -1.31   0.190    -.5904294    .1173394
        --------------+----------------------------------------------------------------
        glucose       |
                    C |
                   1  |   35.98797   .5732451    62.78   0.000     34.86443    37.11151
                   2  |     77.638   4.657351    16.67   0.000     68.50976    86.76624
        --------------+----------------------------------------------------------------
        insulin       |
                    C |
                   1  |    16.5196   .5887541    28.06   0.000     15.36566    17.67354
                   2  |   21.26216   2.112797    10.06   0.000     17.12115    25.40317
        --------------+----------------------------------------------------------------
        sspg          |
                    C |
                   1  |   11.17919   .6359466    17.58   0.000     9.932759    12.42562
                   2  |   27.59469   1.092309    25.26   0.000      25.4538    29.73557
        --------------+----------------------------------------------------------------
        var(e.glucose)|
                    C |
                   1  |   22.62693    4.35593                       15.5153    32.99827
                   2  |   1263.401   223.8804                      892.6978    1788.043
        var(e.insulin)|
                    C |
                   1  |   26.36603   4.285562                      19.17298    36.25767
                   2  |   283.2775   50.93803                       199.137    402.9697
         var(e.sspg)#C|
                   1  |   25.26045   5.003334                       17.1334    37.24247
                   2  |   70.49358    12.7819                       49.4094    100.5749
        -------------------------------------------------------------------------------
        You see that the error variances now differ. I won't show the output, but you can verify for yourselves that the option covstructure(e._OEn, unstructured) will estimate an unstructured covariance structure within each class, so you will get error covariances in the output (remember, Stata's default is a diagonal structure within each class); a sketch of the full specification follows.
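
        For concreteness, a minimal sketch of that least restrictive specification, simply combining the two options already shown above:

        Code:
        * Least restrictive LPA: class-varying parameters, unstructured covariances
        gsem (glucose insulin sspg <- _cons), lclass(C 2) lcinvariant(none) ///
            covstructure(e._OEn, unstructured) byparm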

        I know that this default behavior stems from gsem's default options, and it may make sense in other contexts. In the context of LPA, though, the default behavior has you fitting an extremely restrictive model. By my reading of SEM example 52, which introduces LPA, the manual starts with the default behavior, then mentions that the correct final model had an unstructured within-class covariance structure and parameters not constrained equal across classes. It doesn't discuss how restrictive the default behavior is, and it doesn't tell you that you should be investigating alternative model structures. The manual does cite a book chapter by Katherine Masyn, which says you should investigate all the alternative model structures, but how many of us go dig through the manual's sources? Hence my ask: change the default options to the least restrictive model structure. Failing that, clarify SEM example 52 to state that you really need to examine alternatives to the default.

        Masyn (whose chapter I frequently cite here) appears to say that you need to investigate all 4 potential model structures: class-varying or class-invariant error variances, each crossed with unstructured or diagonal indicator covariances. Now, I would not normally care to question her depth of technical knowledge, but I am still going to differ from her in one respect: I'm not sure how critical it is to examine model structures that seem too constrained to be realistic. For example, consider that flexmix doesn't even appear to offer the option to restrict the class error variances to be equal. Even if I'm wrong on that score, I believe I've shown that Stata's default behavior has you fitting a very restrictive model and potentially going down the wrong path. I'd like to see gsem's default behavior changed for latent profile analysis, obviously retaining the option to constrain the model to be more restrictive if the results or theory justify it.

        This 2009 paper may be an example of analysts going with an overly constrained model without examining alternatives. The article is free on PubMed. The model was fit in MPlus, using 3 indicators. I'm not familiar with MPlus' default behavior in latent profile models. However, the tables clearly do not mention any covariance between the indicators, nor do they mention the authors testing models with diagonal vs unstructured covariance structures. Thus, I have to assume that either MPlus' default behavior in 2009 was a diagonal covariance structure and the authors didn't consider examining the alternative, or the default was an unstructured covariance and the authors chose not to report the results for unknown reasons - maybe they didn't realize what this was? As to the issue of class-invariant vs class-varying error variances, there's no way to tell from the reported information what structure was examined. Table 1 is the results table, and it only reports the mean and SE of each indicator in each class; there's no information on the indicator variances (which are substantively interesting in themselves, IMO).

        Regarding my immediately prior post, you'll note that even way back in 2009, this paper reported entropy and the p-values from bootstrap likelihood ratio tests. Hence, I would really like Stata to implement these features.

        Last, note that this particular issue does not apply to latent class analysis. In latent class analysis with binary indicators, you have one logit intercept per item; there's no error to be correlated within each class. The variance of a binary item is p*(1-p), so it's inherently dependent on the probability, and there's no way to constrain it to be equal across the latent classes. There is the separate issue of potential violations of conditional independence: the assumption in all latent variable models is that, given the value of the latent variable (here, the latent class), the probability of endorsing each item is independent of the others. There was once a question on the forum about how to test for violations of that. Honestly, I'm not familiar with that issue, but my impression is that MPlus may have some tests. I wouldn't view this as a critical issue yet.

          • The built-in command -_pctile- is unreasonably minimal in what it leaves behind in r(), and this creates headaches. Furthermore, it seems to me that -_pctile- has bugs in degenerate situations with too few observations.

          Unlike any other command that I am aware of, -_pctile- does not leave behind in r(N) the number of observations on which the calculation was based, nor does it leave behind in r() any indication that it has failed to calculate what it was asked to calculate. Here is what happens with -_pctile- when everything is fine; notice that it does not leave r(N):

          Code:
          . set obs 30
          number of observations (_N) was 0, now 30
          
          . gen x = rnormal()
          
          . _pctile x, nq(4)
          
          . return list
          
          scalars:
                           r(r1) =  -.7291995286941528
                           r(r2) =  -.0959552293643355
                           r(r3) =  .9089744687080383
          And here is what happens with -_pctile- when everything is not fine:
          Code:
          . gen y = .
          (30 missing values generated)
          
          . _pctile y, nq(4)
          no observations
          
          . return list
          
          .
          This message "no observations" is not left behind anywhere. Again there is no r(N)

          Finally, here is what happens in degenerate situations:

          Code:
          . replace y = rnormal() in 1/2
          (2 real changes made)
          
          . _pctile y, nq(4)
          
          . return list
          
          scalars:
                           r(r1) =  -1.311697125434875
                           r(r2) =  -.6397798098623753
                           r(r3) =  .032137505710125
          
          . summ y
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                     y |          2   -.6397798    .9502346  -1.311697   .0321375
          
          . drop y
          
          . gen y = .
          (30 missing values generated)
          
          . replace y = rnormal() in 1
          (1 real change made)
          
          . _pctile y, nq(4)
          
          . return list
          
          scalars:
                           r(r1) =  -1.227839469909668
                           r(r2) =  -1.227839469909668
                           r(r3) =  -1.227839469909668
          
          . summ y
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                     y |          1   -1.227839           .  -1.227839  -1.227839
          
          .
          How did -_pctile- manage to calculate 3 break points on the basis of 2 observations? And why does it not produce any error message when, on the basis of 1 observation, it calculates 3 identical break points?

          I would like -_pctile-, at the very minimum, to leave behind r(N). It also seems to me that the behaviour in degenerate situations has to be reworked, with appropriate error messages left behind; a workaround sketch follows.
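
          Until then, here is a minimal sketch of a guard one can put around -_pctile-; the variable name myvar is a hypothetical placeholder:

          Code:
          * Count the nonmissing observations before calling -_pctile-
          quietly count if !missing(myvar)
          if r(N) < 4 {
              display as error "too few observations (" r(N) ") for quartiles"
              exit 2001
          }
          _pctile myvar, nq(4)
          return list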

            • Something that I would like is an enhancement of the graph-generating interface so that it allows the more modern approach taken by other engines.
              Basically, once we have created a plot, for example:
              Code:
              sysuse auto
              scatter price mpg
              it would be ideal if we could just add more features on top of it. For example:
              Code:
              addplot: lfit price mpg
              In other words, I wish Stata would officially add "addplot" (by Ben Jann) to the official Stata graph engine machinery.
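
              For what it's worth, the user-written version already works today; a quick sketch (addplot is available from the SSC archive):

              Code:
              ssc install addplot
              sysuse auto, clear
              scatter price mpg
              addplot: lfit price mpg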

              And if this is added, could a "legend" option also be added to graph combine (so it is easy to add a single legend to combined figures)?

              • I would appreciate a browser, or at least a list, of all objects, in particular Mata matrices. The list could look like the variable list in Stata, but ultimately, having something click-able would be great. This might require some work, but I think users would really appreciate it. In addition, free-of-charge programs have such features.

                In detail, I am thinking of RStudio or Matlab, which show all objects, i.e. variables (dataframes), scalars, matrices, etc. I understand that Stata's approach is different and more focused on a single dataset. However, I think having another window would not hurt. Inspecting data is so much easier and more straightforward in RStudio or Matlab, especially when it comes to arrays or sparse matrices (how is it possible that there are no sparse matrices in Mata, as a side note!?). Inspecting data in Stata is, by comparison, really painful.

              • Originally posted by JanDitzen View Post
                I would appreciate a browser, or at least a list, of all objects, in particular Mata matrices. [...]
                While it is not the most elegant solution, you can do the following to get a list of all Mata objects:
                Code:
                mata: mata describe

                Regarding sparse matrices: I think there is a section in The Mata Book. You can use associative arrays in Mata to create sparse matrices.
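
                To illustrate the associative-array idea, here is a minimal sketch in which only nonzero entries are stored and the keys are (row, column) index pairs:

                Code:
                mata:
                // keys are 1 x 2 real vectors, i.e. (i, j) index pairs
                A = asarray_create("real", 2)
                // entries that were never set should read back as 0
                asarray_notfound(A, 0)
                asarray(A, (5, 7), 3.14)    // set A[5, 7] = 3.14
                asarray(A, (5, 7))          // retrieve: 3.14
                asarray(A, (1, 2))          // never set: returns 0
                end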

                • The panel stochastic frontier module is ancient! I have waited more than long enough. Will Stata ever update its panel stochastic frontier commands? Please? The literature has evolved quite significantly, and Stata has not updated anything in recent editions.

                  • Originally posted by Sebastian Kripfganz View Post
                    While it is not the most elegant solution, you can do the following to get a list of all Mata objects: mata: mata describe [...]
                    Thanks for your reply.

                    Code:
                    mata: mata describe
                    is of course a solution, but something a bit more interactive would be desirable.

                    On the sparse matrices: storing them is not such a big problem; the difficulty starts when trying to combine them with mathematical functions. Maybe I am missing something here?

                    • I have two requests. I am not sure how popular they may be, but it would be really nice if StataCorp could look into them.

                      Is there a way to make dashboards using Stata? I am not referring to very complex dashboards but rather simple ones, where you have the option of selecting a subgroup to see how that affects descriptive statistics and regression results. I was thinking that maybe the built-in command dyndoc could address this, but honestly, I am not sure about it (nor do I have the relevant knowledge; it would be nice if someone could enlighten me).

                      Secondly, it would be nice to have Stata allow basemaps (from OpenStreetMap etc.) in grmap and allow city/district/village labels to be placed outside choropleth maps. I saw a long guide on this, by Richard Williams I believe, and honestly found it quite technical, so I thought I would share this as a request: a label option that places labels outside the choropleth drawing area.

                      • Fahad Mirza
                        regarding your first suggestion, you may be interested in the Stata Conference talk that I gave in 2016: https://wbuchanan.github.io/stataConference2016/. I didn't do anything with regression results, but there are a few examples of how I created interactive graphs in a similar vein to what you requested. With the Python API, it would be possible to use existing Python libraries to do things that are a bit more complex (e.g., Dash/Plotly, Bokeh, and/or Altair). The Python API should also make it possible to create the different types of GIS-related graphics that you mentioned.

                        • wbuchanan firstly, huge fan; I really admire the time and effort you have put into coding amazing stuff. Secondly, the Stata Conference talk is definitely what I am looking for, and I would love to learn how you made this happen. Is there any way you could share a resource or the code you used in developing these visuals?

                          Regarding the Python API, APIs in themselves are something I need to learn, and I am not sure where to start. I may sound excessively like a leech seeking insight/knowledge, but I would really appreciate some direction on these topics so I can get rolling.

                          • Fahad Mirza
                            Thank you for the kind words. Any time you see a URL like:

                            [username].github.io/something

                            You can navigate to the underlying GitHub repository by changing the order of the elements in the URL:

                            github.com/[username]/something.

                            so the source code for the presentation above is available at:

                            https://github.com/wbuchanan/stataConference2016

                            I'm more than happy to offer any tips/advice I can provide, but to avoid hijacking this thread, just send me a message on Statalist or email me with any questions you have.

                            • If the idea is to move code from Stata to Mata in commands, such that mainly input validation is done in Stata code and the rest in Mata, it would be nice if error handling like the try/except concept were implemented in Mata. (Repeating a very old grump; a sketch of the current workaround is below.)
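
                              For now, the closest idiom I know of is to trap the error on the Stata side; a minimal sketch, where myfunc() is a hypothetical Mata function that may abort:

                              Code:
                              * Trap the Mata error and decide what to do with it in Stata
                              capture noisily mata: myfunc()
                              if _rc {
                                  display as error "Mata call failed with return code " _rc
                                  * clean up or fall back here instead of dying
                              }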
                              Kind regards

                              nhb

                              • I think it would also be nice to allow -post-, and its related commands, to post strLs, and for -statsby- to accept expressions that would store a string in the result (e.g., model=e(cmd) fails).
