Check datasets for variables

Oliver McManus

Join Date: Mar 2016

Posts: 22
#1

Check datasets for variables

09 Jun 2016, 13:00

Hi everyone,

I'm working on a project using 23 years of data, each with its own dataset. There have been differences in the data collected over the years, and this is reflected in the variables recorded. I've spent loads of time manually creating a spreadsheet to illustrate which variables are available in which years, as in the screenshot below.

I imagine that surely there must be a way to do this automatically, to have Stata produce a table like this (albeit without my fancy conditional formatting), but I have no idea how.

Any suggestions that might help me avoid this tedium in future?

Last edited by Oliver McManus; 09 Jun 2016, 13:02.
Tags: None

William Lisowski

Join Date: Dec 2014
Posts: 10150

09 Jun 2016, 18:50

Here is an example to start you along your way.

Code:

. tempfile varlist

. save `varlist', emptyok
(note: dataset contains 0 observations)
file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_02880.000003 saved

. forvalues i=75/76 {
  2.     use data`i', clear
  3.     describe, replace
  4.     keep name
  5.     gen int datayear = `i'
  6.     list, noobs
  7.     append using `varlist'
  8.     save `varlist', replace
  9.     }

  +-----------------+
  | name   datayear |
  |-----------------|
  | year         75 |
  |    x         75 |
  |    y         75 |
  +-----------------+
file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_02880.000003 saved

  +-----------------+
  | name   datayear |
  |-----------------|
  | year         76 |
  |    y         76 |
  |    z         76 |
  +-----------------+
file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_02880.000003 saved

. use `varlist', clear
(describe data76.dta)

. table datayear name

----------------------------------
          |     variable name     
 datayear |    x     y  year     z
----------+-----------------------
       75 |    1     1     1      
       76 |          1     1     1
----------------------------------


.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#3

09 Jun 2016, 18:58

This was also asked and answered at http://www.statalist.org/forums/foru...-for-variables. My answer there and William's here are minor variants of each other.
Comment
Oliver McManus

Join Date: Mar 2016

Posts: 22
#4

13 Jun 2016, 09:37

Thank you for your suggestions, I've tried both today and I'll reply here so you are both updated.

I've adapted mainly from Clyde's answer, because my datasets aren't consistently named and because I've got too many variables for Stata to display a table. This is what I've got so far:

Code:

local filelist: dir "/home/oliverm/windows/7/Desktop/fortable" files "*.dta" clear tempfile building save `building', emptyok scalar datayear = 1992 foreach f of local filelist { use `"`f'"', clear describe, replace gen year = `=datayear' append using `building' save `"`building'"', replace scalar datayear = `=datayear+1' } use `building', clear keep name year gen byte present_ = 1 reshape wide present, i(year) j(name) string rename present_* * export excel using variable_year_crosswalk.xlsx, clear firstrow(variables)

The problem I'm facing at the moment is in reshaping the data. Specifically, I get the error present_AverageLONGERMVPAMinutesPerValid invalid variable name. I've just read that variable names can only be up to 32 characters long, so I'm assuming that's the issue.

I don't know how many variables are currently 32 characters long (there are thousands), nor whether I could shorten them without them becoming ambiguous (not that I know how to shorten them anyway).

Any suggestions?
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

13 Jun 2016, 09:58

Consider replacing your reshape command from Clyde's suggestion with some version of table, replace following the lead in my suggestion.
Comment
Oliver McManus

Join Date: Mar 2016

Posts: 22
#6

13 Jun 2016, 10:05

Just tried, it gives an error, too many values (I've got > 40,000 "observations" in the combined dataset).
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#7

13 Jun 2016, 11:47

Well, that's a vexing problem. One starting point is, instead of -gen present = 1-, make it -gen _ = 1-, and then -reshape _, i(year) j(name) string- And then -rename _* *-.

That will get you by with any variable whose name is 31 characters or fewer.

But the one example you give already starts with 32 characters, so this won't help with that one or others that might also be that long. What I have done when confronted with this kind of situation before is to try to abbreviate names using tricks like:

Code:

rename *Average* *Avg* rename *Minutes* *Min* rename *Valid* *Ok*

etc. It often takes a few rounds of trial and error to find tricks like this that are appropriate to your data, but in the end, you can prevail this way if you persist.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

13 Jun 2016, 12:23

Before the reshape, you could try something like the following:

Code:

gen name31 = substr(name,1,31) isid year name31

which would see if the names are distinct to 31 characters within each year.

Did these 23 years of data originate as imported Excel spreadsheets with the first row used to supply the variable names? In that case it is possible that the variable names have already been truncated and otherwise altered by the import process, so that your year-to-year comparisons my not be correct.

I find it interesting that the example variable name you presented "AverageLONGERMVPAMinutesPerValid" contains the substring "LONGERMVPAM" followed by "MinutesPerValid" -- this suggests to me that "PAM" (or three words starting with P, A, and M) was truncated off.
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#9

13 Jun 2016, 12:59

Did these 23 years of data originate as imported Excel spreadsheets with the first row used to supply the variable names? In that case it is possible that the variable names have already been truncated and otherwise altered by the import process, so that your year-to-year comparisons my not be correct.

That's a really good point that I overlooked. When Stata imports from Excel with the -firstrow()- option specified, if it encounters would-be variable names that are over 23 characters, it truncates them to 32 characters. If that results in renaming, preserving the first 31 characters and suffixing a 1 on the end, and then a 2 for the next would-be duplicate etc. But if the variables in each spreadsheet you import are different, blablabla...1, etc. could be a different variable depending on which spreadsheet it's in.
Comment
Oliver McManus

Join Date: Mar 2016

Posts: 22
#10

13 Jun 2016, 13:52

I believe the data were originally coded in SPSS, but even that may have changed over the intervening years. I don't know if similar problems would arise there, but regardless, none of the variables I'm interested in have such long names anyway.

The solution I've currently settled on is just specifying the variables I want to know about:

Code:

local filelist: dir "/home/oliverm/windows/7/Desktop/fortable" files "*.dta" clear tempfile building save `building', emptyok scalar datayear = 1992 foreach f of local filelist { use `"`f'"', clear rename *, lower describe, replace clear gen year = `=datayear' append using `building' save `"`building'"', replace scalar datayear = `=datayear+1' } use `building', clear keep name year keep if name == "hsclass" | /// name == "hseg" | /// name == "hsoc" | /// name == "scall" | /// name == "scallx" | /// ... gen byte present_ = 1 reshape wide present, i(year) j(name) string rename present_* * export excel using "variables.xlsx", firstrow(variables) replace

One thing I was wondering about was whether it would be possible to use the resulting Excel file to specify which variables to keep from each year. E.g. if the 2010 data had the variables age, sex and height, I wouldn't have to write keep age sex height. Obviously it's simpler to do it manually in this example, but in reality there are near 200 variables that I'm interested in, and specifying manually which permutation of those are available to be kept for each year is inefficient and prone to error.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

13 Jun 2016, 14:06

One thing I was wondering about was whether it would be possible to use the resulting Excel file to specify which variables to keep from each year. E.g. if the 2010 data had the variables age, sex and height, I wouldn't have to write keep age sex height.

I have used Stata to read an Excel spreadsheet and from it generate Stata commands that I write to a temporary text file that I then run or do or include to execute the generated commands.

In general, it's often quicker, usually more reliable, and always more fun, to write a program to use reliable metadata to create a boring program, than it is to write the boring program directly.
1 like
Comment
Oliver McManus

Join Date: Mar 2016

Posts: 22
#12

14 Jun 2016, 08:57

Originally posted by William Lisowski View Post

I have used Stata to read an Excel spreadsheet and from it generate Stata commands that I write to a temporary text file that I then run or do or include to execute the generated commands.

In general, it's often quicker, usually more reliable, and always more fun, to write a program to use reliable metadata to create a boring program, than it is to write the boring program directly.

Yes I agree. What I want to do is more or less this:

Code:

use "table of variables.xlsx" local vars1991 = XXX use "1991.dta" keep `vars1991' save "1991 restricted.dta"

But I can't figure out how to define the XXX so it includes all the variables present in 1991 according to the spreadsheet. Additionally, I'm not sure if keep `vars1991' would work like that. Do you know how I'd go about this?
I'd also be happy to use the method you mentioned, writing commands to a text file, but I'm even less certain how to go about that.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#13

15 Jun 2016, 09:53

In this case, my approach would be to not use the Excel workbook, but rather the list of year/name pairs you created in post #10 before the reshape wide. Here is the technique I would use to create yearly variable lists from some made-up data.

Code:

. list, sepby(year) noobs

  +-------------+
  | year   name |
  |-------------|
  | 2001      x |
  | 2001      y |
  |-------------|
  | 2002      x |
  | 2002      y |
  | 2002      z |
  |-------------|
  | 2003      x |
  | 2003      z |
  +-------------+

. levelsof year, local(yearlist)
2001 2002 2003

. foreach y of local yearlist {
  2.     quietly levelsof name if year==`y', local(vars)
  3.     local varlist`y' : list clean vars
  4.     display `"varlist`y': `varlist`y''"'
  5. }
varlist2001: x y
varlist2002: x y z
varlist2003: x z

.

And then I would process your files with something like the following untested code.

Code:

foreach y of local yearlist {
    use `varlist'y'' using "`y'", clear
    save "`y' restricted data", replace
}

Comment

Oliver McManus

Join Date: Mar 2016
Posts: 22

#14

16 Jun 2016, 09:26

Thank you so much, that worked perfectly. I've now got everything I wanted, both the spreadsheet for my own reference, and load of datasets restricted to just the variables I need, without repetitive typing on my part.

Here's my code as it currently stands, should it be of interest.

Code:

cd "/home/oliverm/windows/7/Desktop/stata/data"

* Local with list of datasets
local filelist: dir "/home/oliverm/windows/7/Desktop/stata/data/original" files "*.dta"

clear
tempfile building
save `building', emptyok

* Saving each dataset with lowercase variables in stata/data/lower
capture mkdir lower
* For each dataset, make list of variables and put that into "building" along
* with the year
foreach f of local filelist {
    use "original/`f'", clear
    local fileyear = substr(`"`f'"',1,4)
    display `fileyear'
    rename *, lower
    save "lower/`fileyear' HSE lower.dta", replace
    describe, replace clear
    gen year = `fileyear'
    append using `building'
    save `"`building'"', replace
}

use `building', clear
keep name year
sort year


{ // Keeping variables I want
keep if name == "everhart" | ///
        name == "docheart" | ///
        // etc...
        name == "hrpsoc10b" | ///
        name == "sclass"
}
 



preserve
{
    levelsof year, local(yearlist)
    foreach y of local yearlist {
        quietly levelsof name if year==`y', local(vars)
        local varlist`y' : list clean vars
    }
    
    capture mkdir "restricted"
    foreach y of local yearlist {
        use `varlist`y'' using "lower/`y' HSE lower.dta", clear
        save "restricted/`y' HSE restricted.dta", replace
    }
}
restore

gen byte present_ = 1
reshape wide present, i(year) j(name) string
rename present_* *

export excel using "/home/oliverm/windows/7/Desktop/stata/whatvars.xlsx", firstrow(variables) replace

Announcement