  • Order of variables when expanding wildcards in a varlist

    I couldn't find anything on this in the help or manuals. It has been my experience, to the extent I've noticed, that if I issue a command using a varlist with a wildcard, e.g. -regress y x*-, Stata expands the varlist x* to a list of individual variables, and it preserves the order in which those variables appear in the data set.

    I was in the process of writing a short program that takes a list of variables as arguments, but the first two variables in the list are treated differently in the program from the rest. Would I be safe in calling that program with a sequence like -myprogram x*- when the variable order in the data set puts the intended first two variables first? Or do I have to be more explicit: -myprogram x_first x_second ...- even if x_first and x_second are the first two such variables in the data?

    I've experimented a bit with this, and so far * seems to expand reliably in the order in which the variables appear in the data set. But can I count on this in my programming?

  • #2
    It makes me nervous, but I think you are ok. Anyone doing this should keep in mind that "order in which those variables appear in the data set" is not necessarily the same as alphabetical or numeric order, e.g. there is nothing that says v17 can't be in the data set before v1 is. You also have to be careful about things like v1-v10; there can be lots of variables in that range, not just v1, v2, v3, etc.
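
    A minimal sketch of both caveats, using a made-up four-variable data set (all variable names here are hypothetical). unab is used only to show how Stata expands each varlist:

    Code:
    clear
    set obs 1
    * create the variables deliberately out of "natural" order
    generate v17 = 1
    generate v1 = 2
    generate other = 3
    generate v10 = 4
    * wildcard: expands in data set order, so v17 comes before v1
    unab wild : v*
    display "`wild'"
    * range: picks up everything stored between v1 and v10, including other
    unab range : v1-v10
    display "`range'"

    With this layout I would expect the first display to show v17 v1 v10 and the second to show v1 other v10.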
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      In general, explicit is better than implicit, so if I were using a command whose syntax was
      mycmd <var1> <var2> <other vars>
      I would (as a user) specify the first two variables explicitly, rather than using globbing (or a variable range) and relying on the order of variables in the dataset.

      From the documentation under [U] 11.4.1 Lists of existing variables, it would appear that you can rely on globbing (i.e., *) returning variables in the order in which they currently appear in the dataset (though I don't believe this is ever explicitly promised), and of course, that is how variable ranges (e.g., var1-var5) are defined. These rules for expansion are applied when (as a programmer) you write
      syntax varlist ...
      and in that case, if your command treats the first two variables differently from subsequent ones, then you are relying on the user to have specified the command properly. I have never made an exhaustive survey of how syntax is used in commands that treat variables differently depending on their position in the command, but I would presume that this usage is pretty common in such cases.
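
      As a rough sketch of that pattern (the command name showroles is made up for illustration), a program can let syntax expand whatever the user typed and then peel off the first two names:

      Code:
      capture program drop showroles
      program define showroles
          * syntax expands wildcards and ranges before we see the list
          syntax varlist(min=2)
          * split off the first two names; the remainder stays in rest
          gettoken first rest : varlist
          gettoken second rest : rest
          display "first:  `first'"
          display "second: `second'"
          display "others: `rest'"
      end
      sysuse auto, clear
      showroles price mpg weight-turn
      showroles m* weight

      In the second call the first two names are whatever m* expands to, so the program is relying entirely on the order of variables in the dataset.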

      Alternatively, you could write
      syntax anything ...
      and then parse the anything in a way that required that the first k variables were specified explicitly (i.e., without globbing), then use unab to expand (if necessary) the remaining variables specified. I suppose you might argue that this provides some protection of the user from him/herself, but IMO that is a bit excessive. Moreover, the fact that it would then be inconsistent with the behavior of other Stata commands (assuming my presumption above is correct) is probably the best argument against it.
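
      A minimal sketch of what that stricter parsing could look like, assuming a made-up command name (strictcmd) and a simple rule that any *, ?, ~ or - in the first two arguments counts as globbing or a range:

      Code:
      capture program drop strictcmd
      program define strictcmd
          syntax anything(name=args)
          * take the first two tokens exactly as typed
          gettoken var1 args : args
          gettoken var2 args : args
          if "`var2'" == "" {
              display as error "two explicitly named variables are required"
              exit 102
          }
          foreach v in var1 var2 {
              * refuse wildcard and range characters, then require an exact name
              if strpos("``v''", "*") | strpos("``v''", "?") | strpos("``v''", "~") | strpos("``v''", "-") {
                  display as error "``v'' must be a single, fully spelled variable name"
                  exit 198
              }
              confirm variable ``v'', exact
          }
          * the remaining arguments may use wildcards; unab expands them
          local others
          if trim("`args'") != "" {
              unab others : `args'
          }
          display "first two: `var1' `var2'"
          display "others:    `others'"
      end
      sysuse auto, clear
      strictcmd price mpg w* t*
      capture noisily strictcmd m* weight

      The last call should be rejected because m* is not an explicit name, whereas the earlier call passes its wildcards through unab only after the first two variables have been checked.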

      In sum, I would define this as a "user issue" (as compared to a "programmer issue"), and as a user, I would argue for being explicit whenever possible.
      Last edited by Phil Schumm; 17 Jul 2014, 16:20.



      • #4
        I was guessing that this was a program Clyde was writing for his own use. If the program might be used by others I'd be leery of anything that assumed the users wouldn't have reordered or renamed the variables. It depends on how much Clyde trusts the users of the program.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology


        • #5
          Originally posted by Richard Williams
          I was guessing that this was a program Clyde was writing for his own use.
          FWIW, I was using the term "user" merely to refer to an individual calling the program, and the term "programmer" to refer to the one who wrote it; this can be (and often is) the same person. And in general, when writing programs for your own use, it is a good idea to write them as you would for any user, since months (or even years) from now, that's (effectively) what you'll be. But I'm not arguing with anything you said here, Rich; I think we're on the same page WRT this issue.



          • #6
            Yes, when I say "It depends on how much Clyde trusts the users of the program" that includes Clyde himself six months from now. For the few programs I have written for SSC, I bet I have spent as much or more time on error checking and interface and ease of use as I have on the statistical parts of the programs.
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology


            • #7
              Thanks, Richard and Phil. This is, indeed, a program I am writing for my own use. But I quite agree that 6 months from now I may or may not remember enough details to use it safely this way. That's why I was wondering if my observation that varlists are expanded in the order of the existing data set is just a lucky coincidence, or is guaranteed. It would have made life easier with a guarantee, but given that that is not the case, I'm going to revise my code to not depend on this phenomenon.



              • #8
                I'm sure this could be written better, but maybe you could do something like this:

                Code:
                clear all
                program mycmd
                    syntax varlist(min = 2 max = 2), [others(varlist) ]
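                    local var1: word 1 of `varlist'
                    local var2: word 2 of `varlist'
                    * drop var1 and var2 from others; the word option matches whole names only,
                    * so removing v1 cannot clip the start of v10 or v11
                    local others: subinstr local others "`var1'" "", word all
                    local others: subinstr local others "`var2'" "", word all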
                    reg `varlist' `others'
                end
                sysuse auto
                drop make
                * renvars is a community-contributed command, not part of official Stata
                renvars * \ v1-v11
                mycmd v1 v2, others(v*)
                The program forces you to explicitly specify the first two variables. After that, you can use wildcards. Any redundant variables get dropped.
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology


                • #9
                  Clyde,
                  Stata wouldn't mind if you write this:
                  Code:
                  clear all
                  sysuse auto
                  drop make
                  rename * v*
                  regress v*
                  where the last line is equivalent to what you will likely end up with in your program (imho). It is correct according to Stata's syntax, but a pain to read. I, as a user, think of regress as having two distinct arguments, one the dependent variable and the other the independent variables, so I would be puzzled to see only one. On the other hand, the number of independent variables doesn't bother me much, so I am comfortable with regress price v*. This may be subjective, but anything positional requires you to remember what those positions are. I'd suggest you use options to make the roles of the variables explicit.

                  Best, Sergiy



                  • #10
                    Thanks again. I agree, and I changed my code along the lines that Richard suggests.



                    • #11
                      Be sure you test it! I had some ideas for a more conventional looking syntax. But, you may be better off with something a little unconventional, since those first two vars have some sort of special status.
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology


                      • #12
                        Here is some slightly tighter code that takes advantage of the macro list commands. I knew something like that existed but couldn't remember it when I wrote my earlier code.

                        Code:
                        clear all
                        * The local others: command will only allow vars to appear once.
                        program mycmd
                            syntax varlist(min = 2 max = 2), [others(varlist) ]
                            * remove the two leading variables from others (macro list difference)
                            local others: list others - varlist
                            reg `varlist' `others'
                        end
                        
                        * Renamings are done to test the code
                        sysuse auto
                        drop make
                        renvars * \ v1-v11
                        rename (v4 v10) (year gender)
                        mycmd v7 v2, others(v*)
                        -------------------------------------------
                        Richard Williams, Notre Dame Dept of Sociology


                        • #13
                          Richard wrote that the syntax

                          Code:
                          syntax varlist(min = 2 max = 2), [others(varlist) ] 
                          "forces you to explicitly specify the first two variables". Not so. It forces you to specify two variables, but Stata is happy that you do that through a wildcard. We can see this without a program. Know that syntax works on the contents of local macro 0, which is automatically created when you call up a program.

                          Code:
                          . sysuse auto
                          (1978 Automobile Data)  
                          
                          . local 0 m*  
                          
                          . syntax varlist(min=2 max=2)
                          
                          . di "`varlist'"
                          make mpg
                          To force the specification of two distinct names, you would need to use lower-level parsing commands.
                          Last edited by Nick Cox; 18 Jul 2014, 08:55.



                          • #14
                            Good point. But if the wildcard results in more than 2 variable names, you get the error "too many variables specified". If Clyde doesn't trust himself at all when he runs this 6 months from now, he could do something like

                            Code:
                            clear all
                            * The local others: command will only allow vars to appear once.
                            program mycmd
                                syntax varname, var2(varname) [others(varlist) ]
                                * remove the first variable and var2 from others (macro list difference)
                                local others: list others - varlist
                                local others: list others - var2
                                reg `varlist' `var2' `others'
                            end
                            
                            * Renamings are done to test the code
                            sysuse auto
                            drop make
                            renvars * \ v1-v11
                            rename (v4 v10) (year gender)
                            mycmd v7, var2(v5 ) others(v*)
                            You can still use wildcards but it will only work if exactly one variable name matches the wildcard.
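
                            A quick way to see that behaviour without writing a program, in the style of the check in #13. This is only a sketch using the auto data, where pr* matches just price while m* matches both make and mpg:

                            Code:
                            sysuse auto, clear
                            * exactly one match: the wildcard is accepted
                            local 0 pr*
                            syntax varname
                            display "`varlist'"
                            * two matches: syntax refuses, so _rc is nonzero
                            local 0 m*
                            capture syntax varname
                            display _rc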
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology


                            • #15
                              An alternative to using syntax to manage this problem is to use char; I often do this in ad hoc do-files for my own use. By assigning, e.g., char var[firstpos] true, you can give a variable an attribute that the program (or do-file, which is where syntax doesn't help) can then check before using the variable in the first position. As another example, in custom tables for publication one often wants to treat different variables in different ways (e.g., mean (sd), median (range), or categories with frequencies); by assigning characteristics to all the variables, the do-file that creates the table can be fed a list like var* and arrange/output the list appropriately.
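
                              A minimal sketch of the first idea, assuming the auto data and reusing the characteristic name firstpos from above (the choice of price and mpg as the flagged variables is arbitrary). The loop sorts a varlist into flagged and unflagged variables regardless of where they sit in the data set:

                              Code:
                              sysuse auto, clear
                              * flag the variables that must come first
                              char price[firstpos] true
                              char mpg[firstpos] true
                              local first
                              local rest
                              foreach v of varlist price-length {
                                  * read the characteristic back; it is empty if never set
                                  local flag : char `v'[firstpos]
                                  if "`flag'" == "true" {
                                      local first `first' `v'
                                  }
                                  else {
                                      local rest `rest' `v'
                                  }
                              }
                              display "first: `first'"
                              display "rest:  `rest'"

                              Because characteristics are saved with the data set, the flags survive a save and reload, so the do-file does not need to know the variable order in advance.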
