  • Stata -- almost excellent

    In my opinion, Stata is an excellent statistics software -- almost. For many, the criteria for choosing a specific statistics software may be ease of use, expandability, cost, or compatibility with other programs. For me, the most impressive feature is that my statistical thinking (and the methods I use) have improved significantly since I started using Stata -- to which the Stata Forum and its friendly and knowledgeable members are contributing substantially.

    Stata also impresses with its consistency, integrity, and error tolerance: it largely protects against the use of statistical models that are unsuitable for the nature of the variables, and it is almost impossible to change variables unintentionally.

    However, the phrase “almost impossible” is both a decisive point and a downside for me.

    As far as I know, there are two situations in which the protection from unintended modification of data is not guaranteed:
    • Stata allows you to save data that contain temporary variables. The negative (and probably unnoticed) consequences of this became apparent in my last post, and I demonstrated them elsewhere (admittedly with complex examples). I am aware that changing save so that users must explicitly allow the saving of temporary variables carries the risk of breaking older syntax. Nevertheless, I think the advantages outweigh the disadvantages, and version control, which is recommended anyway, could prevent this scenario (it is also conceivable that such a change to save would only take effect if the current .do-file contains a version statement somewhere).
    • Another situation is markout. Here, it is easy to make the mistake of not specifying the markout variable as the first variable. The result is a (probably unnoticed) change to the variable specified here. One solution could be that when the markout variable is specified using mark, its characteristic is set accordingly and markout requires this characteristic to exist. However, changing mark and markout accordingly would break existing code. Another possibility would be to use two alternative commands, mark2 and markout2 (or perhaps better do_mark and do_markout for .do files) so that mark and markout can still be used for .ado files, and to explicitly point out the dangers and the alternative in the manual entry for mark and markout. A first, not (!) yet universally applicable example of this would be
    Code:
    program do_mark
        
       syntax newvarlist(max=1) 
        
       mark `varlist' 
       notes `varlist': markout variable 
    end 
     
    program do_markout, rclass 
     
       syntax varlist [, NUMeric] 
        
       local mov : word 1 of `varlist' 
       local note : char `mov'[note1] 
        
       if "`note'" != "markout variable" { 
          di as err "{bf:`mov'} is not a to-use variable for -do_markout-" 
          error 499 
          exit 
       } 
       else { 
          local varlist : list varlist - mov 
          local varlist : list varlist - mov // if _all has been used for varlist 
          if "`numeric'"=="" { 
             markout `mov' `varlist' 
             return local markvars `varlist' 
          } 
          else { 
             foreach v of varlist `varlist' { 
                cap confirm numeric variable `v' 
                if _rc local strvars "`strvars' `v'"  
             } 
             local nstr : word count `strvars' 
             local numvars : list varlist-strvars 
             markout `mov' `numvars' 
             di as txt "(`nstr' string variable(s) not used by -markout-)" 
             return local strvars `strvars' 
             return local markvars `numvars' 
          } 
          return local tousevar `mov' 
       } 
    end
    Example:
    Code:
    . sysuse auto, clear
    (1978 automobile data)
    
    . 
    . do_mark valid
    
    . do_markout valid rep78
    
    . sum _all if valid
    
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
            make |          0
           price |         69    6146.043     2912.44       3291      15906
             mpg |         69    21.28986    5.866408         12         41
           rep78 |         69    3.405797    .9899323          1          5
        headroom |         69           3    .8531947        1.5          5
    -------------+---------------------------------------------------------
           trunk |         69    13.92754    4.343077          5         23
          weight |         69    3032.029    792.8515       1760       4840
          length |         69    188.2899     22.7474        142        233
            turn |         69     39.7971    4.441051         31         51
    displacement |         69         198    93.14789         79        425
    -------------+---------------------------------------------------------
      gear_ratio |         69    2.999275    .4626818       2.19       3.89
         foreign |         69    .3043478    .4635016          0          1
           valid |         69           1           0          1          1
    
    . 
    . drop valid
    
    . mark valid
    
    . do_markout valid
    valid is not a to-use variable for -do_markout-
    r(499);
    
    end of do-file
    
    r(499);
    Perhaps someone would volunteer to take up the idea and write universally applicable alternatives for mark and markout?

    With such improvements, Stata would indeed be excellent and in many respects, at least for me, clearly superior to statistical programs such as R, SAS (and definitely SPSS).

  • #2
    Originally posted by Dirk Enzmann
    Another situation is markout. Here, it is easy to make the mistake of not specifying the markout variable as the first variable.
    Out of curiosity, when you don't specify the marker variable first, where do you specify it? Do you mean you forgot to specify it at all? Also, note that markout will issue an error message when the first variable is not a byte variable, thus reducing the odds of mistakenly changing a variable. An additional characteristic seems a bit overkill, especially since it would prevent markout from working with variables that did not come from mark.

    That said, what do you mean by "universally applicable"?

    • #3
      daniel klein : You are correct that mistakenly not specifying the marker variable (the to-use variable) first can only go undetected with variables of type byte -- but this may bite you seriously enough.
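
      A minimal sketch of how the slip can go unnoticed, assuming the auto data shipped with Stata, in which foreign is stored as byte. The variable name touse is only illustrative:

      ```stata
      sysuse auto, clear
      mark touse
      count if foreign == 1          // foreign cars before the slip
      * intended call: markout touse foreign rep78
      markout foreign rep78          // slip: marker variable omitted
      count if foreign == 1          // foreign was set to 0 where rep78 is missing
      ```

      Because foreign is a byte variable, markout should accept it as the marker variable without complaint, so the change would only be noticed by comparing the two counts.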

      I did not suggest modifying mark to add a characteristic to the marker variable and modifying markout to accept only marker variables with that characteristic (as this would break code), but rather having two additional commands such as do_mark and do_markout -- that way markout is not prevented from working with variables that did not come from mark (while markout will work with variables coming from do_mark). If you think of a better way to protect the user from the consequences of forgetting to specify the marker variable (first or at all), I am happy to know.

      As to "universally applicable": A closer look at the manual (current version, p. 4 or p. 333 when accessed via Stata) shows that the marker variable is set to 0 in certain situations, and I thought that the program suggested in #1 should take care of that as well. On second thoughts, however, I think that this is already done because do_markout is only a wrapper for the original markout. If I am correct, the suggested programs are already "universally applicable" and can be used in their current version.
      Last edited by Dirk Enzmann; 12 Aug 2025, 02:11.

      • #4
        Originally posted by Dirk Enzmann
        If you think of a better way to protect the user from the consequences of forgetting to specify the marker variable (first or at all), I am happy to know.
        Well, mark, markout, and especially marksample take positional arguments. Those always risk ambiguity. If you want a safer alternative, you might go for a syntax that takes either the markvar or the varlist as a named option. Obviously, this would be more cumbersome to type.
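
        A minimal sketch of such a named-option syntax, assuming a hypothetical wrapper called markout_named with a required markvar() option (neither exists in official Stata). Any further options are passed through to markout unchanged:

        ```stata
        program markout_named
            version 11.2
            // markvar() is required, so the marker variable cannot be
            // confused with the variables that are checked for missing values
            syntax varlist(ts) , MARKvar(varname) [ * ]
            markout `markvar' `varlist' , `options'
        end

        sysuse auto, clear
        mark touse
        markout_named rep78 , markvar(touse)
        count if touse
        ```

        Omitting markvar() makes syntax stop with an error instead of silently treating the first variable as the marker, at the price of more typing.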

        Originally posted by Dirk Enzmann
        As to "universally applicable": A closer look at the manual (current version, p. 4 or p. 333 when accessed via Stata) shows that the marker variable is set to 0 in certain situations, and I thought that the program suggested in #1 should take care of that as well. On second thoughts, however, I think that this is already done because do_markout is only a wrapper for the original markout. If I am correct, the suggested programs are already "universally applicable" and can be used in their current version.
        I'd probably go for an even simpler wrapper; perhaps combining both commands into one (but that might create ambiguity again). Here's a brief draft
        Code:
        program marktouse
            
            version 11.2
            
            capture syntax varlist(ts) [ , * ]
            if ( _rc ) {
                
                gettoken newmarkvar 0 : 0 , parse(" ,")
                
                nobreak {
                    
                    mark `newmarkvar' `macval(0)'
                    
                    char `newmarkvar'[modify] OK
                    
                }
                
                exit
                
            }
            
            gettoken markvar : varlist
            
            local modify : char `markvar'[modify]
            
            if ("`modify'" != "OK") {
                
                display as err "`markvar' not created by marktouse"
                exit 498
                
            }
            
            markout `macval(0)'
            
        end
        Last edited by daniel klein; 12 Aug 2025, 03:23. Reason: even less invasive wrapper

        • #5
          Using both commands at once does not allow calling markout (or do_markout) repeatedly to add additional variables to flag cases for use.

          To respond to concerns that do_markout would prevent the use of marker variables not created by do_mark (which would be the purpose of do_markout!), one could add the option "any" to do_markout that allows using a marker variable without the characteristic set by do_mark (and give a corresponding hint together with the error message "... is not a marker variable created by -do_mark-"). That way both programs could be used routinely instead of mark and markout.

          • #6
            Sorry, I did not read your suggestion for marktouse closely enough: If I am correct, it can be used repeatedly in the same way as markout. Using only one program is much better than using something like do_mark and do_markout.

            • #7
              Originally posted by Dirk Enzmann
              Using both commands at once does not allow calling markout (or do_markout) repeatedly to add additional variables to flag cases for use.
              Yes, it does:
              Code:
              . version 18
              
              . sysuse auto
              (1978 automobile data)
              
              . set seed 42
              
              . generate twenty_percent_missing = 42 if runiform() > .2
              (14 missing values generated)
              
              . marktouse touse2
              
              . marktouse touse2 rep78
              
              . tabulate touse2
              
                   touse2 |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        0 |          5        6.76        6.76
                        1 |         69       93.24      100.00
              ------------+-----------------------------------
                    Total |         74      100.00
              
              . marktouse touse2 twenty_percent_missing
              
              . tabulate touse2
              
                   touse2 |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        0 |         19       25.68       25.68
                        1 |         55       74.32      100.00
              ------------+-----------------------------------
                    Total |         74      100.00

              • #8
                Originally posted by Dirk Enzmann
                To respond to concerns that do_markout would prevent the use of marker variables not created by do_mark (which would be the purpose of do_markout!), one could add the option "any" to do_markout that allows using a marker variable without the characteristic set by do_mark (and give a corresponding hint together with the error message "... is not a marker variable created by -do_mark-"). That way both programs could be used routinely instead of mark and markout.
                I wouldn't do that. As you say, it defeats the purpose of the wrapper. I think it's better as a one-way route: you can use variables created by the wrapper with markout if you want to.

                • #9
                  daniel klein : Thanks for suggesting marktouse -- much better! I see that via syntax you allowed time-series operators. Why is this necessary (I have never used any time-series analysis)? And wouldn't it be necessary to allow factor variables as well?

                  • #10
                    Originally posted by Dirk Enzmann
                    daniel klein : Thanks for suggesting marktouse -- much better! I see that via syntax you allowed time-series operators. Why is this necessary (I have never used any time-series analysis)? And wouldn't it be necessary to allow factor variables as well?
                    I just followed the syntax diagram of markout.

                    • #11
                      Note that the name marktouse is already taken.

                      • #12
                        Would you (and, if your time permits, will you) make marktouse publicly available? It could protect (unfortunately only) those who are aware of the danger of forgetting to specify the marker variable with markout. I know that writing the corresponding help file would be much more work, but do you think it would be worthwhile?

                        What remains is the issue with saving temporary variables. A small improvement would be to add an option to drop temporary variables when using save. However, again it will only protect those who are aware of the issue (which I assume is a minority of users).

                        • #13
                          I think those who are aware of the potential problems are already at low(er) risk of making these mistakes, but here you go; markobs is now available from GitHub. I'll wait a while for any comments and then send the files to Kit Baum for upload to the SSC.

                          • #14
                            Excellent!

                            • #15
                              I think that one solution to the temporary variable problem when using save (see the first point in #1) that does not break code could be to automatically issue a warning if the data being saved contain a temporary variable and to add an option to the save command that removes temporary variables before saving.
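
                              A minimal sketch of this idea as a user-written wrapper, assuming a hypothetical program name save_notemp and assuming that temporary variables can be recognized by names starting with __0 (such as __000000). Both are assumptions for illustration, not features of the official save command:

                              ```stata
                              program save_notemp
                                  version 11.2
                                  syntax anything(name=filename) [ , replace ]
                                  // assumption: names matching __0* belong to temporary variables
                                  capture unab tmpvars : __0*
                                  if ( "`tmpvars'" == "" ) {
                                      save `filename' , `replace'
                                      exit
                                  }
                                  display as txt "(note: not saving temporary variable(s) `tmpvars')"
                                  // work on a copy so the variables remain available in memory
                                  preserve
                                  drop `tmpvars'
                                  save `filename' , `replace'
                                  restore
                              end
                              ```

                              The wrapper relies on preserve, so it would fail if the caller has already preserved the data, and a user variable whose name happens to start with __0 would be left out of the saved file as well.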
