Dropping a large set of variables based on a list of names

Na Sathe

Join Date: Nov 2023

Posts: 2
#1

Dropping a large set of variables based on a list of names

02 Nov 2023, 13:46

Hello all,

I am trying to find the most efficient way to drop variables I do not need in an analysis. An example of the dataset is below, but the real dataset has 7596 of these seq.* variables.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input int pt_id float(seq.19448.104 seq.5730.60 seq.18340.2 seq.7186.111) 1 85 21.9 12.7 768 2 11 37.4 16 116.3 3 58.7 227.5 14.8 58.9 4 98.7 43.3 37.1 87.6 5 165.9 45.3 11.7 119.9 6 121.2 56.6 63.5 105.9 7 134.6 44.8 14.5 122.8 8 505.7 28.3 14.2 88.1 9 633.5 44.7 20.2 104.1 10 81.6 23.2 10.4 62.4 end

I have a separate excel file with a list of the 4984 seq.* variables I do want to include in the analysis, which kind of looks like this:

seq_name
seq.19448.104
seq.5730.60
seq.18340.2

So since this excel file does not contain the seq.7186.111 variable, I would want to drop that one.

Appreciate your help and happy to clarify anything.
Tags: None
daniel klein

Join Date: Mar 2014

Posts: 3873
#2

02 Nov 2023, 14:02

First of all, your data example gives

Code:

. input int pt_id float(seq.19448.104 seq.5730.60 seq.18340.2 seq.7186.111) seq.19448.104 invalid name r(198);

which is expected as variable names in Stata cannot contain dots. If your data (in Stata) really has dots in the variable names, you will have to fix that.

Assuming you have valid Stata variable names in both the dataset and the Excel file, you could

Code:

import excel ...

the desired variable names from Excel, then put them into a local macro as

Code:

levelsof seq_name , local(wanted)

Then, load your original dataset and simply keep the variable names from the Excel file

Code:

keep `wanted'
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30150
#3

02 Nov 2023, 14:03

Your -dataex- example data is not real--you have either faked it, or edited a real -dataex- example. Either way, is not usable No Stata data set can have variable names like seq.19448.104: Stata variable names can only contain digits, letters, and underscore (_) characters.

Your first step is to import the spreadsheet with the names of the variables you want to keep into a Stata data set. The -import excel- command will do this for you. See -help import excel- for details if you are not familiar with it. Let's say you save that under the filename vars_to_keep.dta. And if your big data set is called big_data_set.dta, then the code to do what you want will look like this:

Code:

clear* use big_data_set quietly ds seq* local all_seqs `r(varlist)' frame create keepers frame keepers { use vars_to_keep levelsof seq_name, local(to_keep) } frame drop keepers local to_drop: list all_seqs - to_keep drop `to_drop'

In this code, I have assumed that all of the variables we are concerned with begin with seq.

Note: You will have to convert all of those inadmissible variable names to real Stata variable names, both in the big data set and in the data set which names the variables you want to keep.

Added: Crossed with #2. Our solutions are largely similar, however they will produce different results. The code in #2 will remove the variable pt_id. The code here will not. I interpreted O.P.'s request as intending to only eliminate seq* variables. O.P. will have to decide which approach achieves the desired result.
Comment
Na Sathe

Join Date: Nov 2023

Posts: 2
#4

02 Nov 2023, 14:44

Thank you Clyde this solution worked perfectly, and I did want to keep the pt_id. And sorry I did mess with the real -dataex- example, I was trying to share data in a helpful format, but didn't want to share real data. (The stata variable names did not have the "." )
Comment

Announcement

Dropping a large set of variables based on a list of names

Comment

Comment

Comment