How to write an egen function that takes one variable as an input, but produces two variables using the input variable as a stub?

Joro Kolev

Join Date: Aug 2018
Posts: 3050

20 Mar 2021, 11:43

Good afternoon,

I am trying to write an egen function that takes one newvar as an input, but generates two variables having newvar as a stub. For concreteness say that I want to generate both the mean and the count of observations, so that I write:

Code:

egen pricestats = temp(price), by(rep)

and this function produces two new variables, pricestats_mean and pricestats_nobs.

This is my ado file:

Code:

cap prog drop _gtemp
program define _gtemp
        version 11, missing
        syntax newvarname =/exp [if] [in] [, BY(varlist)]

        tempvar touse 
        quietly {
                gen byte `touse'=1 `if' `in'
                sort `touse' `by'
                by `touse' `by': gen `typlist' `varlist'_mean = /*
                        */ sum(`exp')/sum((`exp')<.) if `touse'==1
                by `touse' `by': gen `typlist' `varlist'_nobs = /*
                        */ sum((`exp')<.) if `touse'==1        
                
                by `touse' `by': replace `varlist'_mean = `varlist'_mean[_N]
                by `touse' `by': replace `varlist'_nobs = `varlist'_nobs[_N]
                
        }
end

and these are the errors that it generates:

Code:

. set trace off

. sysuse auto
(1978 Automobile Data)

. keep price rep

egen pricestats = temp(price), by(rep)

*** OUTPUT OMITTED***

 ----------------------------------------------------------------------------------------- end _gtemp ---
- global EGEN_SVarname
- global EGEN_Varname
- if _rc { exit _rc }
- quietly count if missing(`dummy')
= quietly count if missing(__000001)
__000001 ambiguous abbreviation
--------------------------------------------------------------------------------------------- end egen ---
r(111);

. des

Contains data from C:\Program Files (x86)\Stata15\ado\base/a/auto.dta
  obs:            74                          1978 Automobile Data
 vars:             4                          13 Apr 2016 17:45
 size:           888                          (_dta has notes)
----------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------------
price           int     %8.0gc                Price
rep78           int     %8.0g                 Repair Record 1978
__000001_mean   float   %9.0g                 
__000001_nobs   float   %9.0g                 
------------------------------------------

First, my function seems to work, but something is complaining about I do not know what... Second, the stub of the newly generated variables is wrong.


William Lisowski

Join Date: Dec 2014
Posts: 10150

20 Mar 2021, 12:37

Joro Kolev -

I was interested in this problem, but I find that your example is not a complete reproducible example, in that when I run

Code:

cap prog drop _gtemp
program define _gtemp
        version 11, missing
        syntax newvarname =/exp [if] [in] [, BY(varlist)]

        tempvar touse 
        quietly {
                gen byte `touse'=1 `if' `in'
                sort `touse' `by'
                by `touse' `by': gen `typlist' `varlist'_mean = /*
                        */ sum(`exp')/sum((`exp')<.) if `touse'==1
                by `touse' `by': gen `typlist' `varlist'_nobs = /*
                        */ sum((`exp')<.) if `touse'==1        
                
                by `touse' `by': replace `varlist'_mean = `varlist'_mean[_N]
                by `touse' `by': replace `varlist'_nobs = `varlist'_nobs[_N]
                
        }
end

sysuse auto
keep price rep78
egen pricestats = temp(price), by(rep78)

I am told

Code:

. egen pricestats = temp(price), by(rep78)
unknown egen function temp()
r(133);

And I don't even know where the process for adding your own egen functions is documented in order to guess what is needed to get egen to recognize _gtemp.


Joro Kolev

Join Date: Aug 2018

Posts: 3050
#3

20 Mar 2021, 13:00

Hi William Lisowski, I have put the programme in an ado file, and I have placed it in

C:\ado\plus\_
(note the trailing underscore). I think this is how we are supposed to write egen functions: we place them in that directory, and their names always have to start with _g. In my case I called the function and the ado file _gtemp.

https://www.stata.com/support/faqs/p...DO%20directory.
Attached Files

_gtemp.ado (682 Bytes, 1 view)
daniel klein

Join Date: Mar 2014

Posts: 3861
#4

20 Mar 2021, 13:20

egen creates exactly one temporary variable to hold the results of your (egen) function, then renames that temporary variable to the name that the caller provided when calling egen. Thus, egen is not set up to produce more than one variable at a time. Moreover, because the variable name that the caller provides is never passed through to your (egen) function, there is no way that you can use it as a stub-name.* It is probably easier and cleaner to write a dedicated command for what you want.

Edit: * Technically, this is not the whole truth. egen does store some global macros that you could access to get the name for the (one) new variable that the caller has specified. I still recommend not doing this within the egen framework.
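
The "ambiguous abbreviation" error in #1 appears to follow from this design: once the function returns, egen refers to its single temporary variable (__000001 in the trace), which was never created, while two variables sharing that prefix were. With variable abbreviation switched on, which is Stata's default, the bare name then matches both. A minimal sketch of the same situation, using made-up variable names x_mean and x_nobs:

```stata
clear
set obs 1
set varabbrev on        // Stata's default; set here only to make the sketch self-contained
generate x_mean = 1
generate x_nobs = 2

* no variable is called x, and x abbreviates both x_mean and x_nobs
capture noisily count if missing(x)
display _rc
```

The count command should stop with the same kind of message as in the trace, and _rc holds the return code that egen passed on as r(111).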

Last edited by daniel klein; 20 Mar 2021, 13:24.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

20 Mar 2021, 13:28

You're going to need to clone egen to make your own variant that allows this. By the way, adding stubs to tempnames is a trick I've often used but there is a stern warning somewhere in the manuals that it may not be supported in the future.

Better yet is just to write a separate command to create what you want.
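
A separate command along these lines could look like the following. This is only a sketch under stated assumptions: the command name meannobs and its stub() option are hypothetical, not part of Stata or of any package mentioned in this thread, and the calculation simply mirrors the running-sum logic of _gtemp in #1.

```stata
capture program drop meannobs
program define meannobs, sortpreserve
    version 11
    // stub() is a hypothetical required option naming the prefix of the new variables
    syntax varname(numeric) [if] [in], STUB(name) [BY(varlist)]

    // refuse to overwrite existing variables
    confirm new variable `stub'_mean `stub'_nobs

    // novarlist: keep observations with a missing input value in the sample,
    // as _gtemp does; they are simply not counted
    marksample touse, novarlist

    quietly {
        // running count and running mean within each group
        bysort `touse' `by': gen long `stub'_nobs = sum(`varlist' < .) if `touse'
        by `touse' `by': gen double `stub'_mean = sum(`varlist') / `stub'_nobs if `touse'
        // spread the final (group-level) values to every observation in the group
        by `touse' `by': replace `stub'_nobs = `stub'_nobs[_N] if `touse'
        by `touse' `by': replace `stub'_mean = `stub'_mean[_N] if `touse'
    }
end

sysuse auto, clear
meannobs price, by(rep78) stub(pricestats)
```

Because the stub is passed explicitly, the program never needs to know anything about egen's internal temporary variable.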
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#6

20 Mar 2021, 14:02

Nick Cox and daniel klein, thank you for the clarification. That egen can handle only one variable, and the other key point Nick makes, that we should not be attaching stubs to tempvars, are both bad news. Why should we not attach stubs to tempvars and tempnames? These are convenient locals just like every other local...

I very much dislike the idea of writing a separate command for every little thing that I put for convenience in an ado file for fast use. In the last week I have written egen functions for: byable xtile, product/weighted product, weighted means and weighted percentiles, weighted geometric and harmonic means, and a few more that I cannot recall now. If I make a new command for every little thing like this, it becomes a zoo.

I also very strongly dislike official Stata implementations of functions through commands, -xtile- and -cumul- come to mind. Pain in the neck to use both of those, because they have arbitrary syntax, I think

xtile newvar = oldvar, options
vs
cumul oldvar, gen(newvar) options

Why like this? There is no logic and no systematic approach, and I forget all the time the syntax, and it slows me down because I need to open the help file every time when I use those, although I use them pretty often.

On the other hand egen is nice and systematic, holds everything in one place, and 99% of the time I can guess the syntax even for things that other people have written.

I also found it very easy to parse syntax through egen. I just open an example of an official egen function, and I do not even need to read the manual on how to parse syntax, I can guess how everything works from the example.
FernandoRios

Join Date: Apr 2014

Posts: 2471
#7

20 Mar 2021, 19:54

Hi Joro,
I think attaching stubs to tempvars is a problem because the resulting variables no longer behave like temporary variables.
I encountered this problem when I wanted to create multiple tempvars without declaring them explicitly.
For example:

Code:

tempvar newvar
gen `newvar' = 0
gen `newvar'1 = 1
gen `newvar'2 = 2

If you run this within a program or subprogram, the tempvar "newvar" will not stay in your data. However, `newvar'1 and `newvar'2, will remain there. It is then the programmer who has to figure out how to do the clean-up.
This came out as a bug, and took me some tries figuring out what was going on.

Regarding a new program: I get your point. It is easy to have a wrapper like "egen" and just write subprograms that do the rest, without having a potpourri of programs. However, it is just as easy to write an updated "egen_clone" that works the way you want.
For instance, setting aside the safeguards of -egen-, what it does is very simple:

Code:

** This is what you type with egen
egen something = function(something_else), options
** this is what egen executes:
_gfunction something = something_else, options

So you could just write up a small code that does everything "egen" does, but that fits your purposes. And, that works with previous "egen" programs.

For example:

Code:

program cegen, sortpreserve
    syntax anything(equalok) [if] [in] , [* weights(varname) by(varlist) ]
    ** parse: split "[type] newvar = fcn(args)" into its pieces
    gettoken y rest:0 , parse("=")
    gettoken rest rest:0 , parse("=")
    gettoken fnc rest:rest, parse("(")
    gettoken rest rest2:rest, parse(")")
    local fnc = strltrim(subinstr("`fnc'","=","",1))
    local rest = subinstr("`rest'","(","",.)
    local rest = subinstr("`rest'",")","",.)
    ** checks for vartype
    if `:word count `y''==2 {
        tokenize `y'
        local vartype `1'
        local y `2'
    }
    ** "else" so that an explicitly given type is not overwritten by the default
    else if `:word count `y''==1 {
        local vartype `=c(type)'
    }
    marksample touse, novarlist
    ** the option is called weights(), so its contents are in `weights'
    markout `touse' `weights' `by'
    ** call the function directly, passing the new variable name itself
    _g`fnc' `vartype' `y' = `rest' if `touse', `options' by(`by')
end

This program "cegen" (clone egen) is not exhaustive, (have only tried on limited cases), but it basically does the same as -egen-, but rather than creating a tempvar and then renaming it, it passes the newvarname directly to the program of interest. This way, it could create the two variables, as your program was trying to do.

As for syntax consistency, I cannot say much about it. I find myself forgetting the syntax of programs I wrote myself half of the time (for complex programs). With others, I struggle to make programs flexible yet easy enough for other people to use. And for Stata, probably some of that is "legacy" from early programmers who thought that was the best way of doing it, and now it needs to be kept as part of Stata for backward compatibility.

Anyway, perhaps this "cegen" program may be helpful for your purposes.

HTH
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#8

20 Mar 2021, 20:57

FernandoRios, many thanks for the code on cloning egen! I think the idea of cloning egen that Nick proposed, and that you showed how to implement, is great; I just did not think that I could pull it off myself. I clearly see the benefits of creating jegen (Joro's egen) and putting there everything I ever come up with in terms of generating variables. Yet another benefit of this approach, apart from what was mentioned so far, is that we can use names of existing egen functions. E.g., say the only thing that I dislike about -egen, mean- is that it does not generate a weighted mean if needed; I can write -jegen, mean-, which does what I want.

As for the tempvars, I guess the proper way for us to have the facility is for Stata Corp to create tempstub, which deletes everything including the stub once the program is done.

I personally found attaching stubs to tempvars very convenient, because oftentimes the tempvars that I need depend on what the user has specified in the options. Of course this could be done more properly with one extra step where I interpret what the user has typed, and map it to explicit names of tempvars, but this is more work and more typing on my side.

I never found it inconvenient to clean up my working space myself after I am done. The only useful application I ever had of tempvars and tempnames is that using them I do not step onto something that already exists. Given that -drop- and -keep- accept wildcards, there is no problem for me at all after I have created `tempvar'this and `tempvar'that, to write at the end of my programme
drop `tempvar'*.
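
The "one extra step" mentioned above, mapping what the user typed to explicitly declared tempvars, can be fairly short. The following is a sketch only: the program name sqsummary is made up for illustration, and squaring each input is just a stand-in for whatever the temporary variables are needed for.

```stata
capture program drop sqsummary
program define sqsummary
    version 11
    syntax varlist(numeric)

    local i 0
    foreach v of local varlist {
        local ++i
        // declares local t1, t2, ..., each holding its own temporary name
        tempvar t`i'
        quietly generate double `t`i'' = `v'^2
        // collect the temporary names so they can be used as a varlist later
        local squares `squares' `t`i''
    }
    summarize `squares'
end

sysuse auto, clear
sqsummary price mpg
describe
```

Because every name comes from tempvar itself, the variables are dropped automatically when the program ends, and no wildcard drop is needed.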
daniel klein

Join Date: Mar 2014

Posts: 3861
#9

21 Mar 2021, 03:55

Originally posted by Joro Kolev

The only useful application I ever had of tempvars and tempnames is that using them I do not step onto something that already exists.

Many users seem to think that but it is not true. Temporary (variable)names are not at all guaranteed to be new.

Code:

. clear
. generate __000000 = 42
. tempvar foo
. generate `foo' = 73
variable __000000 already defined
r(110);

People do not tend to name their objects __something, but you cannot really count on that. Therefore, I believe there might actually be a problem if you

Originally posted by Joro Kolev

write at the end of my programme drop `tempvar'*.

Last edited by daniel klein; 21 Mar 2021, 04:03. Reason: deleted misleading example
Nick Cox

Join Date: Mar 2014

Posts: 35724
#10

21 Mar 2021, 06:31

We could have a secondary school-type debate discussing contradictory stances.

egen is the Swiss Army knife of Stata, lots of small tools brought together conveniently, and every user should know it well.

egen is an arbitrary ragbag of functions without a consistent rationale.

and I would speak in support of both motions.

egen was I guess written negatively for calculations that for one reason or another StataCorp (as now is) didn't want to implement as Stata functions (strict sense) but positively were each very easy to explain -- like means, even if you want them groupwise. Even better if there are family resemblances among what is supported, then one place to look is better than whatever is the number of egen functions.

egen was not written originally, and has not been extended, to support weights. I have no idea of whether this was deliberate or an oversight, and if it was an oversight the company has had a long long time to fix it.

But so long as this is true, egen is not the place to support what xtile or cumul does as support for weights is vital for those commands.

There is some bemusement in posts I often see elsewhere from present or past Stata users asking for the equivalent of egen in R, or some other software. The answer is always that other software has other ways to do it, and they aren't necessarily bundled together.

Equally if egen did not exist I don't believe it would be necessary to re-invent it. But while it exists and is widely useful it will remain supported because anything different would break a very large fraction of user scripts.

The company's mixed attitude can be guessed from the fact that egen has been speeded up a bit recently, but we don't see many new functions being added in each new release. On a lower level, while many users, including myself, have written extra functions, that is not so popular now. The reasons certainly include Mata.

My own ideal, or suggestion, is that

1. every official egen function should just be supported as a standard Stata function that generate can call

2. users should be able to write functions that generate can call.

1 would be easy for the company but 2 raises many more questions, not least on whether it is really a good idea, given Mata and other ways of doing things. Also, I am reminded of something quite different where the company's very polite answer was wrapped around "Good idea, but that project would be about 1 programmer-year of work and about 3 users seem really interested, so we are not going to do it".

Otherwise I agree with almost everything said. Consistency of syntax is an excellent goal, as is not changing the syntax that is in long use. (There are compromises like allowing old syntax as well as new syntax to be supported.)

I've played with the idea of generalising egen but always drawn back myself. You add this, you add that, and where do you stop? There is a long slippery slope towards re-inventing many things already well supported otherwise. To be clear, my point was that you need to clone egen to do what you want; I wasn't proposing it!
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#11

21 Mar 2021, 06:43

daniel klein , what you are showing in #9 is what nightmares are made of. Thank you for letting me know.

You might want to write a Stata Tip about this, and send it to Stata Journal to see whether the editors would agree that this is very interesting as we have a bit of a problem here.

The common belief (that you show to be incorrect), is well justified by what the manual says. E.g., Stata 15, programming manual, p. 301 reads "The tempvar sumsq command creates a local macro called sumsq and stores in it a name that is different from any name currently in the data." Well, this is not what happened in your example, you had the name __000000 already in your data, and yet tempvar chose it...

Following my Stata Tip (Kolev, Gueorgui I. "Stata tip 31: Scalar or variable? The problem of ambiguous names." The Stata Journal 6, no. 2 (2006): 279-280.) I can modify your example to show even more wicked consequences.

Code:

. clear
. set obs 3
number of observations (_N) was 0, now 3
. generate __000000variable = 1000000
. tempname myscalar
. scalar `myscalar' = 7
. dis `myscalar'
1000000

1 million is very different from 7, so when I use myscalar in calculations I will be in deep trouble.

In your example our procedure crashed, so the worst thing that can happen here is that we spend a lot of time scratching our heads as to why the procedure crashed...

In my example Stata quietly and without alerting to any errors, calculated something that she should not have calculated...

Last edited by Joro Kolev; 21 Mar 2021, 06:46.
FernandoRios

Join Date: Apr 2014

Posts: 2471
#12

21 Mar 2021, 07:48

I agree with Nick Cox in #10.
As someone who writes some programs here and there, I sometimes stop myself from spending too much time on a program that I'm trying to make "public" because I start thinking: what else could I add to this that other people may find useful and that is still easy to code?
That often takes me down the rabbit hole: when I try adding options and consistency, I lose control of the program, at least in my case, and it ends up with more bugs than originally intended.
I must say, though, that the fact that StataCorp tends to "take" some of the user-written programs and provide support for them is encouraging, as it reduces the amount of syntax disparity and output disparity later on.

I actually like that things like "egen", "margins", "predict" and "estat" (core for the implementation of some estimators and post-estimation tools) are very standardized in terms of what they need as input to create a specific output.

In any case, questions like Joro's here sometimes just shake my programmer's hat, and I write small pieces of code (like the one I suggested), and sometimes share them publicly, in case anyone finds them useful.

I wonder, Nick Cox, if that is also what you experienced: that when you started working with Stata, you encountered a recurrent problem, wrote a program to solve it, shared it, and it is now "the preferred way" for most of us Stata users.

Fernando
Nick Cox

Join Date: Mar 2014

Posts: 35724
#13

21 Mar 2021, 09:18

I am not sure what kind of response FernandoRios is expecting but this is a volley that may sometimes hit the target.

I started with Stata in 1991, I think, after meandering my way through Fortran (exclusively for some while), various Basics, J (a lot), Awk (a lot), Perl (a little), C (a little) on the programming language side and SPSS (a very few times), MIDAS (a lot), PC-ISP (a lot), Minitab (a lot) on the statistical environment side. That isn't a complete list or necessarily in chronological order and there were overlaps, but no one need care about exactness. In case anyone is curious, I have more recently used R occasionally. And MATLAB ditto. SAS never.

For a while in the early 1990s (about a year or two in which I was using Stata most days) I only wrote do-files. Programs defining ado files just looked too complicated, but after a while I thought "It can't be really difficult" and it wasn't, although copying code that works without understanding it all was a major tactic and I can't claim that has ever stopped completely. The start of the Stata Technical Bulletin and later Statalist were strong incentives to share and even to showcase one's best work.

I am not sure that my answer will be especially typical or illuminating. I write programs mostly to scratch my own itches. I want to do something, and I can't think of a way to do it quickly but I can think of a way to do it in several steps, but I am bored with writing down the same stuff or reluctant to do it more than once. My most recent publicly released program was of the order of 3 hours' grind to write in general what particular cases take 3 minutes to think through. So in that case and several others I need to use the command a lot myself to justify the time investment, but whoever isn't programming partly because it is fun made the wrong choice.

I will also write commands because someone asked for something and it looks an interesting challenge within my reach. Most of the time, it ends up being something I later realise is useful for me too. What is now the duplicates command arose from repeated Statalist requests. Fittingly Thomas Steichen (long since retired) and I had very similar ideas -- after all, the same problems have the same solutions, as Feynman said somewhere in his physics lectures -- and we put them all together. And then a bit after that I took it all apart and rewrote it and the company folded it back into official Stata. At some point I started using it myself.

The company is folding less and less back from community contributions in each release. There is a strong positive reason for that. The main idea is that net and whatever is built on it (like ssc) make it really easy for people to install community stuff (sorry to those behind firewalls). There are negatives too: adopting a community-contributed command is much harder work than most users realise and it carries a burden in terms of documentation and support (and to be sure you'd never want the company to say, Yes, but that was originally a user-written command, and we don't understand it well). And a good community-contributed command can stand on its own two feet pretty well without company endorsement.

I have heard comments at users' meetings that the company should just concentrate for one release on folding back the best community commands into the official release. At that all hell breaks loose because the candidates for inclusion are completely different, and other people really want new things, not polished versions of what is now available. In any case, imagine the flak on social media.... It is not going to happen. The key point for the company, for marketing and other reasons, is to concentrate on projects that are way beyond users for whatever reason, naturally including writing or changing proprietary code.

While I am musing on that, a few miscellaneous reflections.

The help file for me is always quite as much effort as the code. A little idiosyncratically, a help file for me is often an early draft of a paper I hope to write, so some of the time and effort are personal choices.

There is very little relationship between what you think is your best work and what appears to be popular. I even regret making public some commands, especially the one for Winsorizing data.
Dirk Enzmann

Join Date: Apr 2014

Posts: 541
#14

21 Mar 2021, 09:20

I agree with daniel klein's comment at #9 that you can't always be sure that temporary variable names are new. In a somewhat different context, in a lengthy post some time ago, I demonstrated that

"Although it is argued that using tempname is safe because Stata will take care that the (internal) names actually used when employing tempname are unique, this is not quite true if the dataset contains variables created by Stata's tempvar facility; see [P] macro. The example below demonstrates that it is possible that you happen to use a dataset which contains a variable named __000000 which has previously been created by using tempvar. If this happens while you are creating a scalar using a name specified by tempname, Stata will not recognize if there is a conflict of names."

As a consequence, I recommended always using the pseudofunction scalar() in programs that use temporary scalars (contrary to the advice in the technical notes to the command scalar in the [P] manual, which recommends obtaining the names for your scalars from Stata's tempname facility instead of using the scalar() pseudofunction).
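
Applied to the example in #11, that recommendation might look like the sketch below. Whether tempname actually hands out a name that collides with the variable depends on the state of the session, so the collision itself is an assumption here; the point is only that scalar() forces the name to be read as a scalar.

```stata
clear
set obs 3
generate __000000variable = 1000000

tempname myscalar
scalar `myscalar' = 7

* scalar() makes Stata look up a scalar, never a variable or a variable abbreviation
display scalar(`myscalar')
```

With scalar() in place, the display command should show 7 whether or not a similarly named variable is present in the data.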
FernandoRios

Join Date: Apr 2014

Posts: 2471
#15

22 Mar 2021, 18:41

Originally posted by Nick Cox

I am not sure what kind of response FernandoRios is expecting but this is a volley that may sometimes hit the target.

Thank you, Nick Cox. I wasn't really expecting any specific answer. For a moment, I was just kind of engaging in a conversation with you. So thank you for your answer.
Best
Fernando