  • Creating a Stata program for recoding based on a table

    Hi,

    For a project, I would like to create a program resembling something like this:
    Code:
    name_command varlist file_to_recode recode_instructions matching_table
    For the variables listed in varlist and found in the dataset file_to_recode, the program would recode their values following the instructions in another file, recode_instructions, which would be defined at the value level (one observation per variable value). A third dataset, matching_table, would contain the old variable names to be recoded and the new ones.

    The reason I want to do this is that I have a vast number of datasets measuring the same things, but under different names, measurements, and categories. So, rather than doing a line-by-line recoding which can contain mistakes, I wanted to create this gigantic table called "recode_instructions" that would list, for every old variable value, its standardized value and name. It’s still done manually, but it seems safer and clearer to me to just call a command that does this based on an annex dataset.
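
    To make the idea concrete, here is a minimal sketch of a table-driven recode for a single variable, done with merge rather than a dedicated command. Everything in it is an assumption made for illustration: the variable q_sex, the table layout (oldname, old_value, newname, new_value), and the values themselves.

    ```stata
    * Hypothetical instructions table: one row per (old variable, old value).
    * All names and values here are made up for illustration.
    clear
    input str8 oldname old_value str8 newname new_value
    "q_sex" 1 "female" 0
    "q_sex" 2 "female" 1
    end
    tempfile instructions
    save "`instructions'"

    * Hypothetical dataset to be recoded
    clear
    input id q_sex
    1 1
    2 2
    3 2
    end

    * Build a two-column map (old value -> new value) for q_sex
    preserve
    use "`instructions'", clear
    keep if oldname == "q_sex"
    local new = newname[1]
    rename old_value q_sex
    keep q_sex new_value
    tempfile map
    save "`map'"
    restore

    * Attach the standardized value; values absent from the table become missing
    merge m:1 q_sex using "`map'", keep(master match) nogenerate
    rename new_value `new'
    list
    ```

    With this layout, merge m:1 stops with an error if the table lists the same old value twice for a variable, which is one of the mistakes a table-based approach could catch early.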

    I do not have the data I’m working on yet; however, I wanted to discuss this plan with you because, as experienced users, you might know potential (unavoidable?) obstacles that will arise in this job. First of all, before going any further, do you know if any community-contributed command does this kind of recoding task based on a table? If not, then I think there would be a need for this command for data cleaners. Does this seem like a hard job to do?

    I am just asking for broad insights and remarks, not necessarily for code, as this is the very first step of my project. Any comment would be greatly appreciated.
    Last edited by Yanis Rahmouni; 14 Aug 2023, 03:14.

  • #2
    I will try only what you asked for: broad remarks.

    One way to answer this is to ask rhetorically why this command doesn't exist already. The easy part is giving a list of variable names and a file name. The harder part is what is meant by recode instructions and matching table. I don't see any hint here of what that means concretely, and so even abstract advice is difficult if not pointless. You can't put much by way of recode instructions on a command line without creating your own more or less elaborate syntax. Either the matching table is in effect a do-file or equivalent, or you have, in general, a major job of parsing that table and translating it to Stata code. Or I misunderstand.

    Most recoding that I am aware of, including mine, is based on do-files. I can think of various pitfalls for the unwary and others can add or subtract advice according to experience and programming style:

    * The Catch-22 is that you have to start out by wiring in many specific details peculiar to a particular dataset -- but that is then a problem if you want to extend or apply the do-file to other similar but not identical datasets.

    * Inefficient code I see here, and elsewhere, often makes too little use of loops and Stata functions (and by that I do mean functions in Stata's sense).

    * Any language whatsoever looks weird in part if it's your first language and weird in part if it's not. (Thousands of silly or snarky posts elsewhere are based on the false premiss that a language that looks weird to the poster must be stupidly or badly designed.) In particular, writing your first Stata commands requires learning several things all at once, starting from designing and parsing syntax and effective use of local macros. Despite a history of programming in mainstream languages and other statistical software I didn't write anything but do-files for about two years after starting Stata.

    * Reinventing the wheel is a common flaw. Many of the commands in [D] exist to solve common specific problems.
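
    As a rough illustration of the points above about syntax parsing, loops and local macros, the following sketch wraps the renaming half of the task in a small program. The command name renamefromtable and the table columns oldname and newname are hypothetical, chosen only for this example; it is not an existing community-contributed command.

    ```stata
    * Hypothetical command: rename variables according to a matching table
    * that holds two string variables, oldname and newname (assumed layout).
    capture program drop renamefromtable
    program define renamefromtable
        version 14
        syntax using/

        * Read the table into numbered local macros
        preserve
        quietly use "`using'", clear
        local n = _N
        forvalues i = 1/`n' {
            local old`i' = oldname[`i']
            local new`i' = newname[`i']
        }
        restore

        * Rename only those variables that exist in the data in memory
        forvalues i = 1/`n' {
            capture confirm variable `old`i'', exact
            if _rc == 0 {
                rename `old`i'' `new`i''
            }
        }
    end

    * Usage with made-up data
    clear
    input str8 oldname str8 newname
    "q1" "age"
    "q2" "income"
    end
    tempfile matching
    save "`matching'"

    clear
    input q1 q2
    30 1000
    end
    renamefromtable using "`matching'"
    describe
    ```

    Names listed in the table but absent from the data are skipped silently here; whether that should instead be an error is a design decision that depends on the datasets.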
