Order of variables when expanding wildcards in a varlist

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#1

17 Jul 2014, 14:55

I couldn't find anything on this in the help or manuals. It has been my experience, to the extent I've noticed, that if I issue a command using a varlist with a wildcard, e.g. -regress y x*-, Stata expands the varlist x* to a list of individual variables, and it preserves the order in which those variables appear in the data set.

I was in the process of writing a short program that takes a list of variables as arguments, but the first two variables in the list are treated differently in the program from the rest. Would I be safe in calling that program with a sequence like -myprogram x*- when the variable order in the data set puts the intended first two variables first? Or do I have to be more explicit: -myprogram x_first x_second ...- even if x_first and x_second are the first two such variables in the data?

I've experimented a bit with this, and so far * seems to expand variables reliably in the order in which they appear in the data set. But can I count on this in my programming?
Richard Williams

Join Date: Apr 2014

Posts: 5008
#2

17 Jul 2014, 15:45

It makes me nervous, but I think you are ok. Anyone doing this should keep in mind that "order in which those variables appear in the data set" is not necessarily the same as alphabetical or numeric order, e.g. there is nothing that says v17 can't be in the data set before v1 is. You also have to be careful about things like v1-v10; there can be lots of variables in that range, not just v1, v2, v3, etc.
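The point about data set order versus alphabetical or numeric order can be checked with a small experiment. The following is only a sketch: the toy variables v17, v1, and v3 are made up for illustration, and it reports observed behavior rather than a documented guarantee.

```stata
* Toy data: variables are created in an order that is neither
* alphabetical nor numeric (hypothetical names, for illustration only).
clear
set obs 1
generate v17 = 1
generate v1  = 2
generate v3  = 3

* unab expands a varlist and stores the result in a local macro.
unab globbed : v*
unab ranged  : v17-v3
display "v* expands to:     `globbed'"
display "v17-v3 expands to: `ranged'"

* Reordering the data set changes what the same wildcard returns.
order v1 v3 v17
unab reordered : v*
display "after -order-:     `reordered'"
```

Comparing the first and last lines of output shows why a program that treats the first two variables specially is sensitive to any reordering of the data.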

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Phil Schumm

Join Date: Mar 2014

Posts: 169
#3

17 Jul 2014, 16:17

In general, explicit is better than implicit, so if I were using a command whose syntax was
mycmd <var1> <var2> <other vars>
I would (as a user) specify the first two variables explicitly, rather than using globbing (or a variable range) and relying on the order of variables in the dataset.

From the documentation under [U] 11.4.1 Lists of existing variables, it would appear that you can rely on globbing (i.e., *) returning variables in the order in which they currently appear in the dataset (though I don't believe this is ever explicitly promised), and of course, that is how variable ranges (e.g., var1-var5) are defined. These rules for expansion are applied when (as a programmer) you write
syntax varlist ...
and in that case, if your command treats the first two variables differently from subsequent ones, then you are relying on the user to have specified the command properly. I have never made an exhaustive survey of how syntax is used in commands that treat variables differently depending on their position in the command, but I would presume that this usage is pretty common in such cases.

Alternatively, you could write
syntax anything ...
and then parse the anything in a way that required that the first k variables were specified explicitly (i.e., without globbing), then use unab to expand (if necessary) the remaining variables specified. I suppose you might argue that this provides some protection of the user from him/herself, but IMO that is a bit excessive. Moreover, the fact that it would then be inconsistent with the behavior of other Stata commands (assuming my presumption above is correct) is probably the best argument against it.
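A minimal sketch of that alternative might look like the following. The program name mycmd_strict, the macro names, and the error messages are hypothetical, and the wildcard check is one possible choice rather than an established convention.

```stata
capture program drop mycmd_strict
program define mycmd_strict, rclass
    version 13
    * Accept raw text so the first two tokens can be inspected unexpanded.
    syntax anything(name=spec id="variable list")
    gettoken first  spec : spec
    gettoken second spec : spec
    if "`second'" == "" {
        display as error "two explicitly named variables are required"
        exit 102
    }
    foreach v in `first' `second' {
        * Reject wildcard and range characters in the first two positions.
        if strpos("`v'", "*") | strpos("`v'", "?") | strpos("`v'", "~") | strpos("`v'", "-") {
            display as error "`v': wildcards and ranges are not allowed here"
            exit 198
        }
        * Require an exact, unabbreviated variable name.
        confirm variable `v', exact
    }
    * Expand whatever remains with the usual varlist rules.
    local rest
    if `"`spec'"' != "" {
        unab rest : `spec'
    }
    * Drop the two special variables if a wildcard picked them up again.
    local rest : list rest - first
    local rest : list rest - second
    display "first:  `first'"
    display "second: `second'"
    display "others: `rest'"
    return local others "`rest'"
end

sysuse auto, clear
mycmd_strict price mpg w* t*
capture noisily mycmd_strict m* w*
```

The last call is expected to fail because its first token contains a wildcard, which is exactly the protection (and the inconsistency with other commands) described above.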

In sum, I would define this as a "user issue" (as compared to a "programmer issue"), and as a user, I would argue for being explicit whenever possible.

Last edited by Phil Schumm; 17 Jul 2014, 16:20.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#4

17 Jul 2014, 16:32

I was guessing that this was a program Clyde was writing for his own use. If the program might be used by others I'd be leery of anything that assumed the users wouldn't have reordered or renamed the variables. It depends on how much Clyde trusts the users of the program.

Phil Schumm

Join Date: Mar 2014

Posts: 169
#5

17 Jul 2014, 16:47

Originally posted by Richard Williams

I was guessing that this was a program Clyde was writing for his own use.

FWIW, I was using the term "user" merely to refer to an individual calling the program, and the term "programmer" to refer to the one who wrote it; this can be (and often is) the same person. And in general, when writing programs for your own use, it is a good idea to write them as you would for any user, since months (or even years) from now, that's (effectively) what you'll be. But I'm not arguing with anything you said here, Rich; I think we're on the same page WRT this issue.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#6

17 Jul 2014, 16:54

Yes, when I say "It depends on how much Clyde trusts the users of the program" that includes Clyde himself six months from now. For the few programs I have written for SSC, I bet I have spent as much or more time on error checking and interface and ease of use as I have on the statistical parts of the programs.

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#7

17 Jul 2014, 17:41

Thanks, Richard and Phil. This is, indeed, a program I am writing for my own use. But I quite agree that 6 months from now I may or may not remember enough details to use it safely this way. That's why I was wondering if my observation that varlists are expanded in the order of the existing data set is just a lucky coincidence, or is guaranteed. It would have made life easier with a guarantee, but given that that is not the case, I'm going to revise my code to not depend on this phenomenon.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#8

17 Jul 2014, 17:57

I'm sure this could be written better, but maybe you could do something like this:

Code:

clear all
program mycmd
    syntax varlist(min = 2 max = 2), [others(varlist) ]
    local var1: word 1 of `varlist'
    local var2: word 2 of `varlist'
    local others: subinstr local others "`var1'" ""
    local others: subinstr local others "`var2'" ""
    reg `varlist' `others'
end
sysuse auto
drop make
renvars * \ v1-v11
mycmd v1 v2, others(v*)

The program forces you to explicitly specify the first two variables. After that, you can use wildcards. Any redundant variables get dropped.

Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#9

17 Jul 2014, 18:18

Clyde,
Stata wouldn't mind if you write this:

Code:

clear all
sysuse auto
drop make
rename * v*
regress v*

where the last line is equivalent to what you will likely end up with in your program (imho). It is correct as far as Stata's syntax goes, but a pain to read. I, as a user, think of regress as having two distinct arguments, a dependent variable and a list of independent variables, so I would be puzzled to see only one. On the other hand, the number of independent variables doesn't bother me much, so I am comfortable with regress price v*. This may be subjective, but anything positional requires you to remember what the positions mean. I'd suggest you use options to make the roles of the variables explicit.

Best, Sergiy
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#10

17 Jul 2014, 18:54

Thanks again. I agree, and I changed my code along the lines that Richard suggests.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#11

17 Jul 2014, 20:30

Be sure you test it! I had some ideas for a more conventional looking syntax. But, you may be better off with something a little unconventional, since those first two vars have some sort of special status.

Richard Williams

Join Date: Apr 2014

Posts: 5008
#12

18 Jul 2014, 08:27

Here is some slightly tighter code that takes advantage of the macro list commands. I knew something like that existed but couldn't remember it when I wrote my earlier code.

Code:

clear all
* The local others: command will only allow vars to appear once.
program mycmd
    syntax varlist(min = 2 max = 2), [others(varlist) ]
    local others: list local others - varlist
    reg `varlist' `others'
end
* Renamings are done to test the code
sysuse auto
drop make
renvars * \ v1-v11
rename (v4 v10) (year gender)
mycmd v7 v2, others(v*)

Nick Cox

Join Date: Mar 2014

Posts: 35724
#13

18 Jul 2014, 08:52

Richard wrote that the syntax

Code:

syntax varlist(min = 2 max = 2), [others(varlist) ]

"forces you to explicitly specify the first two variables". Not so. It forces you to specify two variables, but Stata is happy that you do that through a wildcard. We can see this without a program. Know that syntax works on the contents of local macro 0, which is automatically created when you call up a program.

Code:

. sysuse auto
(1978 Automobile Data)

. local 0 m*

. syntax varlist(min=2 max=2)

. di "`varlist'"
make mpg

To force the specification of two distinct names, you would need to use lower-level parsing commands.

Last edited by Nick Cox; 18 Jul 2014, 08:55.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#14

18 Jul 2014, 09:35

Good point. But if the wildcard results in more than 2 variable names, you get the error "too many variables specified". If Clyde doesn't trust himself at all when he runs this 6 months from now, he could do something like

Code:

clear all
* The local others: command will only allow vars to appear once.
program mycmd
    syntax varname, var2(varname) [others(varlist) ]
    local others: list local others - varlist
    local others: list local others - var2
    reg `varlist' `var2' `others'
end
* Renamings are done to test the code
sysuse auto
drop make
renvars * \ v1-v11
rename (v4 v10) (year gender)
mycmd v7, var2(v5 ) others(v*)

You can still use wildcards but it will only work if exactly one variable name matches the wildcard.

Jeph Herrin

Join Date: Apr 2014

Posts: 335
#15

18 Jul 2014, 09:55

An alternative to using syntax to manage this problem is to use char; I often do this in ad hoc do-files for my own use. By assigning, e.g., char var[firstpos] true, you can give a variable an attribute that the program (or do-file, where syntax doesn't help) can check before using the variable in the first position. As another example, in custom tables for publication one often wants to treat different variables in different ways (e.g., mean (SD), median (range), or categories with frequencies); by assigning characteristics to all the variables, the do-file that creates the table can be fed a list like var* and arrange and output the variables appropriately.
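As a rough sketch of the char idea, the snippet below tags two variables in the auto data and then sorts a range of variables by that tag. The characteristic name firstpos and the value true follow the example in the post; the macro names and the choice of variables are assumptions made for illustration.

```stata
sysuse auto, clear

* Tag the variables that should be treated as "first position" variables.
char price[firstpos] true
char mpg[firstpos] true

* Split a varlist into tagged and untagged variables,
* regardless of where they sit in the data set.
local first
local rest
foreach v of varlist price-weight {
    local flag : char `v'[firstpos]
    if "`flag'" == "true" {
        local first `first' `v'
    }
    else {
        local rest `rest' `v'
    }
}

display "first: `first'"
display "rest:  `rest'"
```

Because characteristics are saved with the data set, the tags survive reordering of the variables, which is what makes this approach independent of how a wildcard happens to expand.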