Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping a large set of variables based on a list of names

    Hello all,

    I am trying to find the most efficient way to drop variables I do not need in an analysis. An example of the dataset is below, but the real dataset has 7596 of these seq.* variables.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int pt_id float(seq.19448.104 seq.5730.60 seq.18340.2 seq.7186.111)
    1  85 21.9 12.7  768
    2   11 37.4   16 116.3
    3  58.7 227.5 14.8  58.9
    4  98.7 43.3 37.1  87.6
    5 165.9 45.3 11.7 119.9
    6 121.2 56.6 63.5 105.9
    7 134.6 44.8 14.5 122.8
    8 505.7 28.3 14.2  88.1
    9 633.5 44.7 20.2 104.1
    10  81.6 23.2 10.4  62.4
    end
    I have a separate excel file with a list of the 4984 seq.* variables I do want to include in the analysis, which kind of looks like this:

    seq_name
    seq.19448.104
    seq.5730.60
    seq.18340.2

    So since this excel file does not contain the seq.7186.111 variable, I would want to drop that one.

    Appreciate your help and happy to clarify anything.







  • #2
    First of all, your data example gives

    Code:
    . input int pt_id float(seq.19448.104 seq.5730.60 seq.18340.2 seq.7186.111)
    seq.19448.104 invalid name
    r(198);
    which is expected as variable names in Stata cannot contain dots. If your data (in Stata) really has dots in the variable names, you will have to fix that.

    Assuming you have valid Stata variable names in both the dataset and the Excel file, you could

    Code:
    import excel ...
    the desired variable names from Excel, then put them into a local macro as

    Code:
    levelsof seq_name , local(wanted)
    Then, load your original dataset and simply keep the variable names from the Excel file

    Code:
    keep `wanted'

    Comment


    • #3
      Your -dataex- example data is not real--you have either faked it, or edited a real -dataex- example. Either way, is not usable No Stata data set can have variable names like seq.19448.104: Stata variable names can only contain digits, letters, and underscore (_) characters.

      Your first step is to import the spreadsheet with the names of the variables you want to keep into a Stata data set. The -import excel- command will do this for you. See -help import excel- for details if you are not familiar with it. Let's say you save that under the filename vars_to_keep.dta. And if your big data set is called big_data_set.dta, then the code to do what you want will look like this:
      Code:
      clear*
      use big_data_set
      quietly ds seq*
      local all_seqs `r(varlist)'
      frame create keepers
      frame keepers {
          use vars_to_keep
          levelsof seq_name, local(to_keep)
      }
      frame drop keepers
      
      local to_drop: list all_seqs - to_keep
      drop `to_drop'
      In this code, I have assumed that all of the variables we are concerned with begin with seq.

      Note: You will have to convert all of those inadmissible variable names to real Stata variable names, both in the big data set and in the data set which names the variables you want to keep.

      Added: Crossed with #2. Our solutions are largely similar, however they will produce different results. The code in #2 will remove the variable pt_id. The code here will not. I interpreted O.P.'s request as intending to only eliminate seq* variables. O.P. will have to decide which approach achieves the desired result.

      Comment


      • #4
        Thank you Clyde this solution worked perfectly, and I did want to keep the pt_id. And sorry I did mess with the real -dataex- example, I was trying to share data in a helpful format, but didn't want to share real data. (The stata variable names did not have the "." )

        Comment

        Working...
        X