
  • Find out what variables in .dta file are used in .do file

    I am reviewing an old project by someone who is no longer reachable.

    I have her .dta file with the input data and the .do file with her analysis. However, the .dta file contains a large number of variables/columns and is simply a mess overall. Many of these variables are poorly named, so it is hard to tell which variables are used in the .do file and which are never used. For instance, there are several variations of variables containing GDP data (e.g. GDP_11, GDP_12, GDP_101, etc.) when in the end only one of these variations is used in the analysis contained in the .do file. The same happens with other variables as well.

    Is there a way for me to find out which variables in the .dta file are actually used in the .do file? I would like to remove all variables from that .dta file that are not used in the code contained in the .do file.

    Thank you in advance.

  • #2
    I do not think that there is a general way to do this. If the do-file is simple enough (one do-file, load one dataset, no merging, no appending, etc.) and you desperately wanted this, you could probably implement a brute-force approach, dropping one variable at a time and seeing whether the do-file still runs without an error. Depending on the running time of the do-file, this will take quite a while. I do not really see the point of doing this in the first place, but that is your call to make.
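
    The brute-force idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the file names input.dta, trial.dta and analysis.do are hypothetical, and the do-file is assumed to have been edited so that it loads trial.dta rather than the original dataset. It only checks whether the do-file finishes without an error, not whether the results are unchanged.

    ```stata
    * Brute-force sketch. Hypothetical names: input.dta, trial.dta, analysis.do.
    * Assumes analysis.do has been edited to load trial.dta.
    use "input.dta", clear
    unab allvars : _all              // full list of variable names

    local candidates
    foreach v of local allvars {
        use "input.dta", clear
        drop `v'                     // remove exactly one variable
        save "trial.dta", replace
        capture do "analysis.do"     // run the analysis, trapping any error
        if _rc == 0 {
            * No error: `v' may be unused (or matched only by a wildcard or range)
            local candidates `candidates' `v'
        }
    }
    display "Do-file ran without error after dropping: `candidates'"
    ```

    The do-file is run once per variable, so the total running time grows with the number of variables in the dataset.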

    Edit:

    In #3, Nick explains in more detail what I have vaguely implied in my opening sentence. Parsing the do-file does not seem promising.

    Best
    Daniel
    Last edited by daniel klein; 14 Jan 2020, 11:51.



    • #3
      Some difficulties:

      1. The do-file might include abbreviated versions of variable names.

      2. Variable names might also be legal command names, and vice versa, so context is important too. Few rules apply absolutely, e.g. the first word on a line might be a variable name if that line is a continuation line; the second word in a command might be a subcommand.
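
      The first difficulty can be illustrated with a short sketch. It uses the auto dataset shipped with Stata, not the poster's data, so the variable names here are only stand-ins. It shows that an abbreviated name resolves to a full variable name, so a plain text search of the do-file for the full name can miss real uses.

      ```stata
      * Illustration with Stata's shipped auto dataset (not the poster's data).
      sysuse auto, clear

      * With variable abbreviation on (Stata's default), "pr" resolves to price,
      * so the string "price" need not appear anywhere in a do-file that uses it.
      summarize pr

      * unab expands an abbreviated varlist to full variable names.
      unab full : pr
      display "`full'"

      * With abbreviation switched off, the same reference stops with an error.
      set varabbrev off
      capture noisily summarize pr
      display _rc
      set varabbrev on
      ```

      Rerunning a do-file with set varabbrev off is one way to surface abbreviated references, assuming the do-file does not switch the setting back on itself.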



      • #4
        What Ernestina delPiero mentioned was one of my grumbles some 10 years ago at one of the user group meetings.

        Nick gives some good examples of why this may be difficult to do by looking at the code. Furthermore, the code itself will not answer this question in the absence of the data file (e.g. what falls into the varlist a - h?). Or something may be very, very hidden, like a hardcoded name in compiled Mata code or a C++ plugin, which you just can't see. The same applies to Java code. The introduction of Python didn't make this easier. In the end your program may be saving a portion of the data and calling an entirely external application for processing (say, a GIS system), and only a human being would know what it expects to find in that file.
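
        The point about ranges can be seen in a small sketch, again using Stata's shipped auto dataset as a stand-in. The same range expression expands to different variables depending on the order of variables in the data file, which the do-file alone does not reveal.

        ```stata
        * Illustration with Stata's shipped auto dataset (not the poster's data).
        sysuse auto, clear

        * In the shipped variable order, mpg sits between price and rep78.
        unab before : price-rep78
        display "`before'"

        * Move mpg to the end of the dataset: the same range no longer includes it.
        order mpg, last
        unab after : price-rep78
        display "`after'"
        ```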

        My idea in the grumble was to introduce a boolean variable property IsTouched, which could be reset to missing at the start of some command and examined after it completes. It would be true if the variable was touched during code execution (e.g. any value read from that variable, or written to it, etc.) or false (can be safely dropped). I remember there was no enthusiasm from the Stata developers at that time, and years later I feel this is correct: the behavior would have been too unpredictable in the general case. And for most commands it would probably mark all variables as 'touched' (e.g. any preserve statement would technically touch a variable under the above definition).

        But the question that Ernestina has posed is a valid and common one. Putting it into an applied context, "What is the minimal portion of the data file that I should share with an auditor for my paper review, so that my code still runs on that portion and the auditor is able to reproduce the same result?"

        Brute-forcing the variable drops may not work in the case of masks or ranges. Consider:
        Code:
        regress price factor*
        Dropping 1 or 2 factor variables will not stop the code from running, but it will change the results. You wouldn't want to drop those variables.
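
        A runnable version of this situation, using Stata's shipped auto dataset as a stand-in (the mask t* plays the role of factor*), might look like the sketch below. Both regressions run without error, so a check based only on return codes would wrongly flag the dropped variable as unused.

        ```stata
        * Illustration with Stata's shipped auto dataset (not the poster's data).
        sysuse auto, clear

        * t* matches trunk and turn in this dataset.
        regress price t*
        matrix b_full = e(b)

        * Drop one of the matched variables: the command still runs,
        * but t* now matches only trunk, so the model is different.
        drop turn
        regress price t*
        matrix b_reduced = e(b)

        * The number of estimated coefficients differs between the two fits.
        display colsof(b_full) " vs " colsof(b_reduced)
        ```

        Comparing stored results such as e(b) before and after a drop, rather than only checking for errors, is one possible way to catch this case, though it assumes the analysis leaves comparable results behind.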

        I don't know an answer to the question Ernestina has asked. The above is just an opinion.
