Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Approaches to reshaping data

    I have a wide dataset from a survey where each row is a respondent ID and each column is a question (or portion of a question, in a few cases). Eventually, I want to have a long dataset with columns for respondent ID, year, subject, activity, and survey question. There are a few ways to do this that I can think of but none of them are elegant so I would like to know
    a. How others would approach this problem and/or;
    b. How my solutions could be improved.
    Meaningless mini-example dataset:
    Code:
    input id q1 q2 q3 q4 q5
    1 56 3 2 10 43
    2 67 4 7 12 54
    3 61 7 2 56 67
    4 58 3 5 56 49
    end
    I also have a "key" excel file that connects each question to a relevant year, subject, and activity.



    Solution 1: One reshape, many generates, one merge

    Technically, I could do a reshape like this:
    Code:
    reshape long q, i(id) j(questionNumber)
    and then manually create the variables using commands like:
    Code:
    gen subject = ""
    replace subject = "math" if questionNumber == 1
    But I have over 200 questions so this method would take a long time, be very messy, and prone to errors (not to mention difficult to QC). I would then have to merge the question number in my dataset to those in the excel files to get the survey question.

    Solution 2: Many renames and reshapes

    Slightly more complicated than the above, I could rename each variable into a format designed for reshapes such as:
    Code:
    rename q1 q1MathTeach2015
    rename q2 q2MathTeach2016
    ...
    Then I would write a series of reshapes:
    Code:
    reshape long q1MathTeach, i(id) j(year)
    reshape long q1Math, i(id) j(activity)
    reshape long q1, i(id) j(subject)
    Using a loop:
    Code:
    foreach var of varlist `variableAbbreviation' {
    reshape long `var', i(id) j(year)
    } foreach var of varlist `variableAbbreviation2' {
    reshape long `var', i(id) j(activity)
    } ...
    Similar to solution 1, this would be messy, although after the macros containing all the variables are made it would be somewhat easier to read in my opinion.

    Solution 3: Make Excel write the code for solution 2

    A "way" around manually writing the code for either of these would be to use my key excel file to write each line of code. I have a column for the question number (e.g. q1, q2, q3...), subject, activity, and year so I could add a column that combines the names using concatenate. Then I could make another concatenate column that produces the reshape command. After that I copy and paste a column of reshape commands from Excel into a do file. I attached an example excel file to illustrate the method.

    These solutions all "work" to varying degrees but I am assuming there are a few better (and worse) methods. How would you approach this problem?

    edit: fixed an error in the excel file and reuploaded it.
    Attached Files
    Last edited by Ethan Adelman-Sil; 30 May 2019, 15:18.

  • #2
    This probably would involve a reshape long as you suggest, followed by a -merge- in which the descriptive information that you apparently have in some other file is put onto that long file. If you want help with that, show an example of your file with the descriptive data. You're unlikely to get help that involves people having to open an Excel file, as relatively few people here will do that (malware, etc.) Import your Excel file into Stata, then use -dataex- per the FAQ to show some example data from it.

    Comment


    • #3
      Hi Mike, good point--I hadn't thought of malware. As is, the excel file is just to show how I do the concatenates that I described in the third solution. My main interest is learning the steps others would take to address this type of a problem--when I find myself working in Stata and Excel at the same time I assume I've made a mistake somewhere.

      Comment

      Working...
      X