Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Check datasets for variables

    Hi everyone,

    I'm working on a project using 23 years of data, each with its own dataset. There have been differences in the data collected over the years, and this is reflected in the variables recorded. I've spent loads of time manually creating a spreadsheet to illustrate which variables are available in which years, as in the screenshot below.

    I imagine that surely there must be a way to do this automatically, to have Stata produce a table like this (albeit without my fancy conditional formatting), but I have no idea how.

    Any suggestions that might help me avoid this tedium in future?
    Click image for larger version

Name:	image_5191.png
Views:	2
Size:	31.7 KB
ID:	1344687
    Last edited by Oliver McManus; 09 Jun 2016, 13:02.

  • #2
    Here is an example to start you along your way.
    Code:
    . tempfile varlist
    
    . save `varlist', emptyok
    (note: dataset contains 0 observations)
    file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_02880.000003 saved
    
    . forvalues i=75/76 {
      2.     use data`i', clear
      3.     describe, replace
      4.     keep name
      5.     gen int datayear = `i'
      6.     list, noobs
      7.     append using `varlist'
      8.     save `varlist', replace
      9.     }
    
      +-----------------+
      | name   datayear |
      |-----------------|
      | year         75 |
      |    x         75 |
      |    y         75 |
      +-----------------+
    file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_02880.000003 saved
    
      +-----------------+
      | name   datayear |
      |-----------------|
      | year         76 |
      |    y         76 |
      |    z         76 |
      +-----------------+
    file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_02880.000003 saved
    
    . use `varlist', clear
    (describe data76.dta)
    
    . table datayear name
    
    ----------------------------------
              |     variable name     
     datayear |    x     y  year     z
    ----------+-----------------------
           75 |    1     1     1      
           76 |          1     1     1
    ----------------------------------
    
    
    .

    Comment


    • #3
      This was also asked and answered at http://www.statalist.org/forums/foru...-for-variables. My answer there and William's here are minor variants of each other.

      Comment


      • #4
        Thank you for your suggestions, I've tried both today and I'll reply here so you are both updated.

        I've adapted mainly from Clyde's answer, because my datasets aren't consistently named and because I've got too many variables for Stata to display a table. This is what I've got so far:

        Code:
        local filelist: dir "/home/oliverm/windows/7/Desktop/fortable" files "*.dta"
        
        clear
        tempfile building
        save `building', emptyok
        
        scalar datayear = 1992
        foreach f of local filelist {
            use `"`f'"', clear
            describe, replace
            gen year = `=datayear'
            append using `building'
            save `"`building'"', replace
            scalar datayear = `=datayear+1'
        }
        
        use `building', clear
        keep name year
        gen byte present_ = 1
        reshape wide present, i(year) j(name) string
        rename present_* *
        
        export excel using variable_year_crosswalk.xlsx, clear firstrow(variables)
        The problem I'm facing at the moment is in reshaping the data. Specifically, I get the error present_AverageLONGERMVPAMinutesPerValid invalid variable name. I've just read that variable names can only be up to 32 characters long, so I'm assuming that's the issue.

        I don't know how many variables are currently 32 characters long (there are thousands), nor whether I could shorten them without them becoming ambiguous (not that I know how to shorten them anyway).

        Any suggestions?

        Comment


        • #5
          Consider replacing your reshape command from Clyde's suggestion with some version of table, replace following the lead in my suggestion.

          Comment


          • #6
            Just tried, it gives an error, too many values (I've got > 40,000 "observations" in the combined dataset).

            Comment


            • #7
              Well, that's a vexing problem. One starting point is, instead of -gen present = 1-, make it -gen _ = 1-, and then -reshape _, i(year) j(name) string- And then -rename _* *-.

              That will get you by with any variable whose name is 31 characters or fewer.

              But the one example you give already starts with 32 characters, so this won't help with that one or others that might also be that long. What I have done when confronted with this kind of situation before is to try to abbreviate names using tricks like:

              Code:
              rename *Average* *Avg*
              rename *Minutes* *Min*
              rename *Valid* *Ok*
              etc. It often takes a few rounds of trial and error to find tricks like this that are appropriate to your data, but in the end, you can prevail this way if you persist.

              Comment


              • #8
                Before the reshape, you could try something like the following:
                Code:
                gen name31 = substr(name,1,31)
                isid year name31
                which would see if the names are distinct to 31 characters within each year.

                Did these 23 years of data originate as imported Excel spreadsheets with the first row used to supply the variable names? In that case it is possible that the variable names have already been truncated and otherwise altered by the import process, so that your year-to-year comparisons my not be correct.

                I find it interesting that the example variable name you presented "AverageLONGERMVPAMinutesPerValid" contains the substring "LONGERMVPAM" followed by "MinutesPerValid" -- this suggests to me that "PAM" (or three words starting with P, A, and M) was truncated off.

                Comment


                • #9
                  Did these 23 years of data originate as imported Excel spreadsheets with the first row used to supply the variable names? In that case it is possible that the variable names have already been truncated and otherwise altered by the import process, so that your year-to-year comparisons my not be correct.
                  That's a really good point that I overlooked. When Stata imports from Excel with the -firstrow()- option specified, if it encounters would-be variable names that are over 23 characters, it truncates them to 32 characters. If that results in renaming, preserving the first 31 characters and suffixing a 1 on the end, and then a 2 for the next would-be duplicate etc. But if the variables in each spreadsheet you import are different, blablabla...1, etc. could be a different variable depending on which spreadsheet it's in.

                  Comment


                  • #10
                    I believe the data were originally coded in SPSS, but even that may have changed over the intervening years. I don't know if similar problems would arise there, but regardless, none of the variables I'm interested in have such long names anyway.

                    The solution I've currently settled on is just specifying the variables I want to know about:

                    Code:
                    local filelist: dir "/home/oliverm/windows/7/Desktop/fortable" files "*.dta"
                    
                    
                    clear
                    tempfile building
                    save `building', emptyok
                    
                    scalar datayear = 1992
                    foreach f of local filelist {
                        use `"`f'"', clear
                        rename *, lower
                        describe, replace clear
                        gen year = `=datayear'
                        append using `building'
                        save `"`building'"', replace
                        scalar datayear = `=datayear+1'
                    }
                    
                    use `building', clear
                    keep name year
                    
                    
                    keep if name == "hsclass" | ///
                            name == "hseg" | ///
                            name == "hsoc" | ///
                            name == "scall" | ///
                            name == "scallx" | ///
                            ...
                    
                    gen byte present_ = 1
                    reshape wide present, i(year) j(name) string
                    rename present_* *
                    
                    export excel using "variables.xlsx", firstrow(variables) replace
                    One thing I was wondering about was whether it would be possible to use the resulting Excel file to specify which variables to keep from each year. E.g. if the 2010 data had the variables age, sex and height, I wouldn't have to write keep age sex height. Obviously it's simpler to do it manually in this example, but in reality there are near 200 variables that I'm interested in, and specifying manually which permutation of those are available to be kept for each year is inefficient and prone to error.

                    Comment


                    • #11
                      One thing I was wondering about was whether it would be possible to use the resulting Excel file to specify which variables to keep from each year. E.g. if the 2010 data had the variables age, sex and height, I wouldn't have to write keep age sex height.
                      I have used Stata to read an Excel spreadsheet and from it generate Stata commands that I write to a temporary text file that I then run or do or include to execute the generated commands.

                      In general, it's often quicker, usually more reliable, and always more fun, to write a program to use reliable metadata to create a boring program, than it is to write the boring program directly.

                      Comment


                      • #12
                        Originally posted by William Lisowski View Post
                        I have used Stata to read an Excel spreadsheet and from it generate Stata commands that I write to a temporary text file that I then run or do or include to execute the generated commands.

                        In general, it's often quicker, usually more reliable, and always more fun, to write a program to use reliable metadata to create a boring program, than it is to write the boring program directly.
                        Yes I agree. What I want to do is more or less this:
                        Code:
                        use "table of variables.xlsx"
                        local vars1991 = XXX
                        use "1991.dta"
                        keep `vars1991'
                        save "1991 restricted.dta"
                        But I can't figure out how to define the XXX so it includes all the variables present in 1991 according to the spreadsheet. Additionally, I'm not sure if keep `vars1991' would work like that. Do you know how I'd go about this?
                        I'd also be happy to use the method you mentioned, writing commands to a text file, but I'm even less certain how to go about that.

                        Comment


                        • #13
                          In this case, my approach would be to not use the Excel workbook, but rather the list of year/name pairs you created in post #10 before the reshape wide. Here is the technique I would use to create yearly variable lists from some made-up data.
                          Code:
                          . list, sepby(year) noobs
                          
                            +-------------+
                            | year   name |
                            |-------------|
                            | 2001      x |
                            | 2001      y |
                            |-------------|
                            | 2002      x |
                            | 2002      y |
                            | 2002      z |
                            |-------------|
                            | 2003      x |
                            | 2003      z |
                            +-------------+
                          
                          . levelsof year, local(yearlist)
                          2001 2002 2003
                          
                          . foreach y of local yearlist {
                            2.     quietly levelsof name if year==`y', local(vars)
                            3.     local varlist`y' : list clean vars
                            4.     display `"varlist`y': `varlist`y''"'
                            5. }
                          varlist2001: x y
                          varlist2002: x y z
                          varlist2003: x z
                          
                          .
                          And then I would process your files with something like the following untested code.
                          Code:
                          foreach y of local yearlist {
                              use `varlist'y'' using "`y'", clear
                              save "`y' restricted data", replace
                          }

                          Comment


                          • #14
                            Thank you so much, that worked perfectly. I've now got everything I wanted, both the spreadsheet for my own reference, and load of datasets restricted to just the variables I need, without repetitive typing on my part.

                            Here's my code as it currently stands, should it be of interest.


                            Code:
                            cd "/home/oliverm/windows/7/Desktop/stata/data"
                            
                            * Local with list of datasets
                            local filelist: dir "/home/oliverm/windows/7/Desktop/stata/data/original" files "*.dta"
                            
                            clear
                            tempfile building
                            save `building', emptyok
                            
                            * Saving each dataset with lowercase variables in stata/data/lower
                            capture mkdir lower
                            * For each dataset, make list of variables and put that into "building" along
                            * with the year
                            foreach f of local filelist {
                                use "original/`f'", clear
                                local fileyear = substr(`"`f'"',1,4)
                                display `fileyear'
                                rename *, lower
                                save "lower/`fileyear' HSE lower.dta", replace
                                describe, replace clear
                                gen year = `fileyear'
                                append using `building'
                                save `"`building'"', replace
                            }
                            
                            use `building', clear
                            keep name year
                            sort year
                            
                            
                            { // Keeping variables I want
                            keep if name == "everhart" | ///
                                    name == "docheart" | ///
                                    // etc...
                                    name == "hrpsoc10b" | ///
                                    name == "sclass"
                            }
                             
                            
                            
                            
                            preserve
                            {
                                levelsof year, local(yearlist)
                                foreach y of local yearlist {
                                    quietly levelsof name if year==`y', local(vars)
                                    local varlist`y' : list clean vars
                                }
                                
                                capture mkdir "restricted"
                                foreach y of local yearlist {
                                    use `varlist`y'' using "lower/`y' HSE lower.dta", clear
                                    save "restricted/`y' HSE restricted.dta", replace
                                }
                            }
                            restore
                            
                            gen byte present_ = 1
                            reshape wide present, i(year) j(name) string
                            rename present_* *
                            
                            export excel using "/home/oliverm/windows/7/Desktop/stata/whatvars.xlsx", firstrow(variables) replace

                            Comment

                            Working...
                            X