I have a wide dataset from a survey where each row is a respondent ID and each column is a question (or portion of a question, in a few cases). Eventually, I want to have a long dataset with columns for respondent ID, year, subject, activity, and survey question. There are a few ways to do this that I can think of but none of them are elegant so I would like to know
I also have a "key" excel file that connects each question to a relevant year, subject, and activity.
Solution 1: One reshape, many generates, one merge
Technically, I could do a reshape like this:
and then manually create the variables using commands like:
But I have over 200 questions so this method would take a long time, be very messy, and prone to errors (not to mention difficult to QC). I would then have to merge the question number in my dataset to those in the excel files to get the survey question.
Solution 2: Many renames and reshapes
Slightly more complicated than the above, I could rename each variable into a format designed for reshapes such as:
Then I would write a series of reshapes:
Using a loop:
Similar to solution 1, this would be messy, although after the macros containing all the variables are made it would be somewhat easier to read in my opinion.
Solution 3: Make Excel write the code for solution 2
A "way" around manually writing the code for either of these would be to use my key excel file to write each line of code. I have a column for the question number (e.g. q1, q2, q3...), subject, activity, and year so I could add a column that combines the names using concatenate. Then I could make another concatenate column that produces the reshape command. After that I copy and paste a column of reshape commands from Excel into a do file. I attached an example excel file to illustrate the method.
These solutions all "work" to varying degrees but I am assuming there are a few better (and worse) methods. How would you approach this problem?
edit: fixed an error in the excel file and reuploaded it.
a. How others would approach this problem and/or;Meaningless mini-example dataset:
b. How my solutions could be improved.
Code:
input id q1 q2 q3 q4 q5 1 56 3 2 10 43 2 67 4 7 12 54 3 61 7 2 56 67 4 58 3 5 56 49 end
Solution 1: One reshape, many generates, one merge
Technically, I could do a reshape like this:
Code:
reshape long q, i(id) j(questionNumber)
Code:
gen subject = "" replace subject = "math" if questionNumber == 1
Solution 2: Many renames and reshapes
Slightly more complicated than the above, I could rename each variable into a format designed for reshapes such as:
Code:
rename q1 q1MathTeach2015 rename q2 q2MathTeach2016 ...
Code:
reshape long q1MathTeach, i(id) j(year) reshape long q1Math, i(id) j(activity) reshape long q1, i(id) j(subject)
Code:
foreach var of varlist `variableAbbreviation' {reshape long `var', i(id) j(year)} foreach var of varlist `variableAbbreviation2' {reshape long `var', i(id) j(activity)} ...
Solution 3: Make Excel write the code for solution 2
A "way" around manually writing the code for either of these would be to use my key excel file to write each line of code. I have a column for the question number (e.g. q1, q2, q3...), subject, activity, and year so I could add a column that combines the names using concatenate. Then I could make another concatenate column that produces the reshape command. After that I copy and paste a column of reshape commands from Excel into a do file. I attached an example excel file to illustrate the method.
These solutions all "work" to varying degrees but I am assuming there are a few better (and worse) methods. How would you approach this problem?
edit: fixed an error in the excel file and reuploaded it.
Comment