Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Listing all used/unused variables

    Assume you have a large dataset which you conduct some analysis of. There might be some variables which are not used in any of the models and from here-on-out (or for sharing purposes) are simply dead-weight.
    Naturally one would just drop the unneeded variables (or keep the needed ones). But what if the dataset is very large (hundreds of variables and much more) and/or there are many models with many different variables as well.

    Is there any way to "automate" this process? perhaps some command/macro which stores all variables "used" and thus allows the user to "keep" them.

    Illustrative example:
    Code:
    sysuse auto
    tokeep, start
    reg price mpg rep78
    reg trunk weight
    tokeep, end
    
    keep `tokeep'
    where "tokeep" will record price mpg rep78 trunk weight, and thus executing "keep `tokeep'" would keep these variables.

  • #2
    It is quite simple to do so by manually concatenating a list of variables (see help macrolists) that are used; extend the list after each model; something like this:
    Code:
    sysuse auto , clear
    local keepvars ""
    
    local newvars price length
    regress `newvars'
    local keepvars : list keepvars | newvars
    
    local newvars price mpg
    regress `newvars'
    local keepvars : list keepvars | newvars
    
    * ... and so on ... *
    
    keep `keepvars'
    You may choose global instead of local macros in case you're not willing to run your complete code at once.

    Regards
    Bela
    Last edited by Daniel Bela; 21 Mar 2017, 14:58. Reason: syntax typo corrected

    Comment


    • #3
      You may choose global instead of local macros in case you're not willing to run your complete code at once.
      Yes, you may, but you do so at your own risk, and you may mess things up in ways that are difficult to imagine by doing that. Global macros are inherently dangerous and should be used only when there is no alternative. The desire to run only part of the code is a very flimsy excuse for using them.

      If you only want to run part of the code, then the solution that allows you to stay with safe local macros is to comment out the parts you don't want to run.

      Stata version 14.2 makes this easier than ever. In the do-editor, highlight the part of the code you don't want to run. Selected Edit from the do-editor's main menu, then select Advanced, and then select Toggle Comment. This will comment out the code you highlighted. When you're finished and want to restore it to its uncommented form, just repeat the entire operation.

      Actually, if you're running on Windows it's even easier than that. Just highlight the part you want to comment out in the do-editor and then press Ctrl and / simultaneously. To undo this when you're done, highlight that section again, and press Ctrl and / simultaneously again. I imagine that there is a similar keyboard shortcut available for Mac or Unix as well, but I don't know what it is.

      Comment


      • #4
        Originally posted by Clyde Schechter View Post
        Yes, you may, but you do so at your own risk, and you may mess things up in ways that are difficult to imagine by doing that. Global macros are inherently dangerous and should be used only when there is no alternative. The desire to run only part of the code is a very flimsy excuse for using them.
        For the sake of clarification: I did not want to give advice to use global macros; I'm completely with Clyde Schechter here, I also think it is bad practice to use global macros for the named reasons.

        Originally posted by Clyde Schechter View Post
        Actually, if you're running on Windows it's even easier than that. Just highlight the part you want to comment out in the do-editor and then press Ctrl and / simultaneously. To undo this when you're done, highlight that section again, and press Ctrl and / simultaneously again. I imagine that there is a similar keyboard shortcut available for Mac or Unix as well, but I don't know what it is.
        This also works on Linux, with the same keyboard shortcut (<Ctrl>+</>). I suspect the same shortcut for Mac OS, but don't have a system at hand to try. I agree that this should be considered a better alternative to using global macros.

        Kind regards
        Bela

        Comment


        • #5
          On macOS, the command key takes the place of the Ctrl key in this specific shortcut. Thanks for the tip, Clyde!

          Comment


          • #6
            That is interesting. Thanks, Clyde.
            On my German keyboard, it is <Strg>+<#>.
            https://www.kripfganz.de/stata/

            Comment


            • #7
              Ariel has asked a very good question. I have asked the same in one of the "wishes and grumbles" sessions of Stata User Group Meetings a few years ago and the answer was a categorical no. Stata will not report in any user-accessible macro/function/etc what variables were involved in the analysis. (although internally Stata potentially may trace whether a certain element [i,j] of the data matrix was involved in the computations).

              If Ariel wants to simulate such analysis himself, one approach is to keep knocking out variables one by one, repeating the whole analysis and checking the analysis still works, which immediately presents the first challenge of what to do with "smart" commands, which continue working even if the data changes:

              Code:
              summarize *
              The above code will work whether your dataset contains price or price and weight variables.

              The other challenge is that other commands may be affected by the mere presence of some variables, even when not accessing their content.

              All-in-all, this is a very useful question, which often comes when one needs to supply a minimal dataset for reproducing the research, or ordering expensive data from a data vendor, but the only solution so far is to carefully inspect the analysis code and manually jot down the variables used.

              I wish Stata developers introduce a debug setting, which indicates how many values were accessed within each column of the dataset since last setting of the flag, something like:
              Code:
              .sysuse auto
              .set datatrace on
              . sum price
              
                  Variable |       Obs        Mean    Std. Dev.       Min        Max
              -------------+--------------------------------------------------------
                     price |        74    6165.257    2949.496       3291      15906
              
              
              
              . regress p w l
              
                    Source |       SS       df       MS              Number of obs =      74
              -------------+------------------------------           F(  2,    71) =   18.91
                     Model |   220725280     2   110362640           Prob > F      =  0.0000
                  Residual |   414340116    71  5835776.28           R-squared     =  0.3476
              -------------+------------------------------           Adj R-squared =  0.3292
                     Total |   635065396    73  8699525.97           Root MSE      =  2415.7
              
              ------------------------------------------------------------------------------
                     price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                    weight |   4.699065   1.122339     4.19   0.000     2.461184    6.936946
                    length |  -97.96031    39.1746    -2.50   0.015    -176.0722   -19.84838
                     _cons |   10386.54   4308.159     2.41   0.019     1796.316    18976.76
              ------------------------------------------------------------------------------
              
              .set datatrace off
              
              .datatrace report
              
              length 74
              price 148
              weight 74
              
              .
              Best, Sergiy

              Comment


              • #8
                I am considering writing a little program to tackle this issue, as I have a fairly long program that works with large datasets and I need to make the process more efficient. It would read in one or more do or ado files as text data, and then compare the resulting data with a variable list to see if any variables on the variable list are never used. I think this should be feasible as I already wrote something similar to automate backing up files based on the contents of do/ado files.

                Comment


                • #9
                  Code:
                  ssc describe isvar
                  focuses on a specific variant of this problem: to sort possible variable names into used and not used in a dataset.

                  Comment


                  • #10
                    I produced an initial version of the program which I have tentatively named "findunused". It takes as command-line inputs a datafile and a list of one or more do or ado files. It compares the variables in the datafile with the variables it finds in the code files, and reports a list of any variables in the datafile that are not found in the code files.

                    It is limited in that it won't be able to detect variables whose names are formed in code by local macros, or variables contained within .ster files - I have dealt with this latter case by putting the variable lists used by my models as comments within the code files, so that they will be taken into account. For variables whose names are spelled out in full it seems to work reasonably well.

                    The section that uses subinstr is a bit makeshift and won't necessarily deal with all cases - what it tries to do is put spaces between the variable names and any surrounding operators so that the variable names will be separated out when the text data is parsed with split.

                    Code:
                    * version register
                    * 202004171826 first version by Alex Mitrani
                    
                    capture program drop findunused
                    program findunused
                    version 13
                    syntax, datafile(string) codefiles(string) [minvarlength(integer 1)]
                    
                    timer clear
                    timer on 1
                    
                    * example of use:
                    * findunused, datafile("mydatafile.dta") codefiles("myfirstcodefile.ado mysecondcodefile.ado")
                    
                    tempfile mytempfile1 mytempfile2
                    
                    local mydatafile = subinstr("`datafile'",".dta","",.)
                    local mycodefiles = subinstr("`codefiles'",".ado","",.)
                    local mycodefiles = subinstr("`mycodefiles'",".do","",.)
                    
                    local intcounter = 1
                    
                    foreach codefile in `codefiles' {
                    
                        infile str80 var1 using `codefile', clear
                        replace var1=subinstr( var1,"!="," != ",.)
                        replace var1=subinstr( var1,"=="," == ",.)
                        replace var1=subinstr( var1,"="," = ",.)
                        replace var1=subinstr( var1,"*"," * ",.)
                        replace var1=subinstr( var1,"+"," + ",.)
                        replace var1=subinstr( var1,"-"," - ",.)
                        replace var1=subinstr( var1,"/"," / ",.)
                        replace var1=subinstr( var1,"`"," ` ",.)
                        replace var1=subinstr( var1,"'"," ' ",.)
                        replace var1=subinstr( var1,"("," ( ",.)
                        replace var1=subinstr( var1,")"," ) ",.)
                        replace var1=subinstr( var1,">="," >= ",.)
                        replace var1=subinstr( var1,"<="," <= ",.)
                        replace var1=subinstr( var1,">"," > ",.)
                        replace var1=subinstr( var1,"<"," < ",.)    
                    
                        split var1
                        drop var1
                        gen id=_n
                        reshape long var, i(id) j(bit)
                        drop id bit
                        drop if var==""
                        drop if length(var)<`minvarlength'
                        ren var varname
                        tempfile codefile`intcounter'
                        save `codefile`intcounter'', replace
                        local eraselist="`eraselist' `codefile`intcounter''"
                        
                        local ++intcounter
                        
                    }
                    
                    local nfiles = `intcounter'-1
                    
                    use `codefile1', clear
                    
                    if `nfiles'>1 {
                    
                        forvalues filecounter=2/`nfiles' {
                    
                            append using `codefile`filecounter''
                    
                        }
                        
                    }
                    
                    contract varname
                    drop _freq
                    
                    save `mytempfile1'
                    
                    use `datafile', clear
                    keep if _n==1
                    xpose, clear varname
                    drop v1
                    ren _varname varname
                    save `mytempfile2', replace
                    
                    use `mytempfile1', clear
                    joinby varname using `mytempfile2', unmatched(both)
                    tab _merge
                    keep if _merge==2
                    sum
                    local nvars = r(N)
                    sort varname
                    n display ""
                    n display ""
                    n display as yellow "the following `nvars' variables in `mydatafile' are unused in `mycodefiles'"
                    list varname, sep(0)
                    save findunused.dta, replace
                    n display as yellow "`results saved to "findunused.dta".'"
                    
                    foreach myfile in `eraselist' {
                    
                        erase `myfile'
                        
                    }
                    
                    timer off 1
                    
                    timer list
                    
                    end
                    *
                    Any feedback or suggestions for improvements would be greatly appreciated.

                    Thanks

                    Alex
                    Last edited by Alex Mitrani; 17 Apr 2020, 17:24.

                    Comment

                    Working...
                    X