Dear Statalist members,
I have a dataset that contains information about individuals. There are two ways to identify an individual: by using a variable id in the format "1", "2", etc. or by using a variable name. The mapping between the name and ID is done in a separate file called corresp.dta. I want to reshape the data to the long format, but my dataset is not consistently organized. For example, in Excel, I have a set of n columns indicating whether the individual (there are n of them) knows the individual displayed in the variable. Because the individual name may contain special characters and could be complicated to read in a variable name, I cannot use the firstrow option when importing the spreadsheet. So, I thought about importing the Excel initial first row that is organized into something like this: "Do you know this individual? [XXXXXX]" as a string observation to 1) isolate the XXXXXX and 2) change it with its corresponding ID based on corresp.dta. However, the problem is that I want to append different files before reshaping, and these files do not always follow the "Do you know this individual? [XXXXXX]" format. In some of them, only "[XXXXXX]" is displayed. Here's a dataex example to help understand my request:
FILE 1 :
FILE 2 :
The process of appending the files would not be successful as the variable names, as they are imported directly from Excel, would be different. My goal is to, for each file to be appended,
3. Replace the variable name with its first observation and adding "knows" as a prefix for the reshape process. Once the files are appended, I will have
It may be important to note that the order of the variables is crucial. For example, AAAA must always be before YYYY, who must always be before XXXX, etc. I am trying to use the substr function to remove the square brackets, but since the name of the individual is not the same across variables, I am having trouble figuring it out. I believe step two involves preserving and restoring the names and IDs from corresp.dta using the levelsof command, but for the third step, I have no idea what to do.
I understand that this is a long and tedious thread, so any suggestions or guidance for any part of the process would be greatly appreciated. Thank you for the help provided by this forum, from a very beginner user!
I have a dataset that contains information about individuals. There are two ways to identify an individual: by using a variable id in the format "1", "2", etc. or by using a variable name. The mapping between the name and ID is done in a separate file called corresp.dta. I want to reshape the data to the long format, but my dataset is not consistently organized. For example, in Excel, I have a set of n columns indicating whether the individual (there are n of them) knows the individual displayed in the variable. Because the individual name may contain special characters and could be complicated to read in a variable name, I cannot use the firstrow option when importing the spreadsheet. So, I thought about importing the Excel initial first row that is organized into something like this: "Do you know this individual? [XXXXXX]" as a string observation to 1) isolate the XXXXXX and 2) change it with its corresponding ID based on corresp.dta. However, the problem is that I want to append different files before reshaping, and these files do not always follow the "Do you know this individual? [XXXXXX]" format. In some of them, only "[XXXXXX]" is displayed. Here's a dataex example to help understand my request:
FILE 1 :
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str36(A B C D) "Do you know this indiv? [INDIV AAAA]" "Do you know this indiv? [INDIV YYYY]" "Do you know this indiv? [INDIV XXXX]" "Do you know this indiv? [INDIV ZZZZ]" "0" "1" "0" "1" "1" "0" "0" "0" "0" "1" "0" "0" "1" "0" "0" "0" end
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str12(E F G H) "[INDIV AAAA]" "[INDIV YYYY]" "[INDIV XXXX]" "[INDIV ZZZZ]" "0" "1" "0" "0" "1" "0" "1" "1" "0" "1" "0" "1" "1" "1" "1" "0" end
The process of appending the files would not be successful as the variable names, as they are imported directly from Excel, would be different. My goal is to, for each file to be appended,
- simplify the first observation from "Do you know this individual? [INDIV AAAA]" or "[INDIV AAAA]" to just "INDIV AAAA"
- replace the name of this first observation with its corresponding ID in the corresp.dta file. Here's a toy example of what it may contain:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str10 name str1 id "INDIV AAAA" "1" "INDIV YYYY" "2" "INDIV XXXX" "3" "INDIV ZZZZ" "4" end
Code:
* Example generated by -dataex-. For more info, type help dataex clear input byte(knows1 knows2 knows3 knows4) 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1 1 0 end
I understand that this is a long and tedious thread, so any suggestions or guidance for any part of the process would be greatly appreciated. Thank you for the help provided by this forum, from a very beginner user!
Comment