Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Exporting variable lists from many datasets into a single excel sheet

    Hello,

    I am working with data from a recurring, large-scale survey (~50 biweekly datasets, each containing 200-400+ variables, and ~60-80,000 different participants in each dataset). The variables have changed over time - some become obsolete or are replaced, and some are modified to have their values either expanded (to either more or fewer categories), so I want to have them all in one place to track those changes.

    To start, I would like to list the variables in each dataset in one excel sheet, but so far have only been able to get them listed into one excel file on (50) individual sheets, using the following code:

    Code:
    * adding to individual sheets - the initial code pulls out the variable names, saves as text, then exports variable list to excel. 
    * Based on: https://atcoordinates.info/2022/02/22/creating-stata-variable-lists-in-excel-and-do-files-with-python/     
    
    
    forvalues y = 1(1)50 {
    
    cd "/Users/.../CFI_Data/3_Working_Data"
    use survey_wk_`y', clear
        
        describe, short
        translate @Results check.txt, replace
        
        describe, replace
        sort name
        
        export excel position name using survey_wk_1_50_variables.xlsx, firstrow(variables) sheet(Wk`y')
        clear
    }

    I've tried to modify the code to indicate specific cell placement, trying different variations of 'ul_cell', 'lr_cell', and specific cell values (though that didn't work because it wouldn't iteratively move the values over with each loop, and I'm not sure how to do a loop for cell placement).

    Code:
    ...
        export excel position name using survey_vars_test7.xlsx, firstrow(variables) sheet(Sheet1, modify) cell(ul_cell)
    but got the error message: _CELL invalid row number in cell() option

    I've tried to explore options with putexcel, but this seems to require a matrix, and I can't quite figure out how to output the variable list into a matrix (though there are examples of how to do this with regression results or similar, it doesn't seem to work for the variable list). Listtab/Listtab_vars also does not seem to work, but I'm not sure if these commands are not suitable for what I'm trying to do, or if I've just not coded correctly.

    If this cannot be done, I can just import the variable lists from the 50 excel sheets into a new stata dataset and export that into excel, or even do it manually (49 copy & pastes!), but it seems like something straightforward that should be possible to code (in anticipation of future additions of data).

    Thank you in advance!

  • #2
    My impression here is that you want to do something that probably could be done, but which I'd find undesirable. It sounds like you would want to do what in Stata terminology would be an -append-, in which you would use the same variable names for the different data sets that you are combining. However, if the variables are changing in the way you describe, then putting same-named but actually different variables into the same locations could cause confusion.

    As to your goal of having "them all in one place to track those changes," I would think that goal might better be accomplished in Stata, with a *clean* data set later output to Excel. My guess would be that the users of the Excel file are likely to be less aware and sensitive than you to the changes in the variable contents over time, so that creating some kind of nicer (harmonized) file for them would be a good idea. Now, I have to say I come from the "Friends don't let friends use Excel (except for data entry)" school of thought, so you should apply that filter to my reaction.

    Now, regarding what you are trying to do: Presuming you have the same (or mostly the same) variable names in each data set (ignoring the desirability of that for the moment), your loop could append all of them into one Stata filet by using -append- and marking each data set's observations with some variable to indicate their origin. Then you could export that entire file to Excel. That would look something like this:

    Code:
    clear
    save wholething.dta, emptyok
    forvalues y = 1/50 {
       use survey_wk_`y', clear
       gen str orgin = "wk_`y'"
       order origin // my personal pref.
       append using wholething.dta, replace
       save wholething.dta, replace
    }
    clear
    use wholething.dta
    export excel ... using wholething.xlsx, firstrow(variables) // no "sheet"
    Perhaps I'm misunderstanding what you want, given that I don't see the use of -putexcel- and some kind of matrix.

    Comment


    • #3
      Hi Mike,

      Thank you for the response! I should clarify - I only meant to get lists of what variables are in each dataset into excel, not any of the observations, so that I can check what new variables have been added, or if any were removed or renamed. There are separate data codebooks and references for what updates happen in each dataset, but manually getting those together is laborious and I had hoped Stata would be more efficient. Excel is only being used for organization and tracking, I promise.

      -Append- seemed like it should work, but unfortunately the code didn't run for me, and gave an error indicating there was an invalid 'replace'. I removed the first 'replace' at the end of the line starting with 'append' - but this built a 20 GB dataset with the observations from each week. I've tried a few variations of the code, but can't get it to even run otherwise.

      My goal is just to get a list of the variable names in each dataset, ideally all into one sheet with each week's variable names listed in a new column. Any way to specify that within the code you provided? (because it did work to merge the datasets well, along with the generated variable of the origin week, which is always a nice validation).

      Many thanks -

      Comment


      • #4
        Hi Mike -

        I managed to just-about get it using this merging of code:

        Code:
        clear
        save allvars8.dta, emptyok
        forvalues y = 1/5 {
            use survey_wk_`y', clear
            describe, short
            translate @Results check3.txt, replace
            describe, replace
            gen str origin = "wk_`y'"
            append using allvars8.dta
            save allvars8.dta, replace
        }
        
        clear
        use allvars8.dta
        export excel using survey_allvars8.xlsx, firstrow(variables)
        It puts all the variable names in one column, but with the designated Wk_y in a separate column alongside each variable it's easy to differentiate and separate.
        Thanks very much for your help!

        Comment

        Working...
        X