Mapping variable names to IDs to reshape a dataset with inconsistent variable names

Adam Sadi

Join Date: Jul 2022

Posts: 68
#1

Mapping variable names to IDs to reshape a dataset with inconsistent variable names

20 Jan 2023, 04:59

Dear Statalist members,

I have a dataset that contains information about individuals. There are two ways to identify an individual: by using a variable id in the format "1", "2", etc. or by using a variable name. The mapping between the name and ID is done in a separate file called corresp.dta. I want to reshape the data to the long format, but my dataset is not consistently organized. For example, in Excel, I have a set of n columns indicating whether the individual (there are n of them) knows the individual displayed in the variable. Because the individual name may contain special characters and could be complicated to read in a variable name, I cannot use the firstrow option when importing the spreadsheet. So, I thought about importing the Excel initial first row that is organized into something like this: "Do you know this individual? [XXXXXX]" as a string observation to 1) isolate the XXXXXX and 2) change it with its corresponding ID based on corresp.dta. However, the problem is that I want to append different files before reshaping, and these files do not always follow the "Do you know this individual? [XXXXXX]" format. In some of them, only "[XXXXXX]" is displayed. Here's a dataex example to help understand my request:

FILE 1 :

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str36(A B C D) "Do you know this indiv? [INDIV AAAA]" "Do you know this indiv? [INDIV YYYY]" "Do you know this indiv? [INDIV XXXX]" "Do you know this indiv? [INDIV ZZZZ]" "0" "1" "0" "1" "1" "0" "0" "0" "0" "1" "0" "0" "1" "0" "0" "0" end

FILE 2 :

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str12(E F G H) "[INDIV AAAA]" "[INDIV YYYY]" "[INDIV XXXX]" "[INDIV ZZZZ]" "0" "1" "0" "0" "1" "0" "1" "1" "0" "1" "0" "1" "1" "1" "1" "0" end

The process of appending the files would not be successful as the variable names, as they are imported directly from Excel, would be different. My goal is to, for each file to be appended,
simplify the first observation from "Do you know this individual? [INDIV AAAA]" or "[INDIV AAAA]" to just "INDIV AAAA"

replace the name of this first observation with its corresponding ID in the corresp.dta file. Here's a toy example of what it may contain:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str10 name str1 id "INDIV AAAA" "1" "INDIV YYYY" "2" "INDIV XXXX" "3" "INDIV ZZZZ" "4" end

3. Replace the variable name with its first observation and adding "knows" as a prefix for the reshape process. Once the files are appended, I will have

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte(knows1 knows2 knows3 knows4) 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 1 0 1 1 1 1 0 end

It may be important to note that the order of the variables is crucial. For example, AAAA must always be before YYYY, who must always be before XXXX, etc. I am trying to use the substr function to remove the square brackets, but since the name of the individual is not the same across variables, I am having trouble figuring it out. I believe step two involves preserving and restoring the names and IDs from corresp.dta using the levelsof command, but for the third step, I have no idea what to do.

I understand that this is a long and tedious thread, so any suggestions or guidance for any part of the process would be greatly appreciated. Thank you for the help provided by this forum, from a very beginner user!

Last edited by Adam Sadi; 20 Jan 2023, 05:02.
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30060

20 Jan 2023, 10:50

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str36(A B C D)
"Do you know this indiv? [INDIV AAAA]" "Do you know this indiv? [INDIV YYYY]" "Do you know this indiv? [INDIV XXXX]" "Do you know this indiv? [INDIV ZZZZ]"
"0"                                    "1"                                    "0"                                    "1"                                  
"1"                                    "0"                                    "0"                                    "0"                                  
"0"                                    "1"                                    "0"                                    "0"                                  
"1"                                    "0"                                    "0"                                    "0"                                  
end
tempfile dataset1
save `dataset1'

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 name str1 id
"INDIV AAAA" "1"
"INDIV YYYY" "2"
"INDIV XXXX" "3"
"INDIV ZZZZ" "4"
end
tempfile crosswalk_data
save `crosswalk_data'

clear
clear frames
use `dataset1'
ds _all
local vbles `r(varlist)'
rename (`r(varlist)') name#, addnumber
gen `c(obs_t)' obs_no = _n
reshape long name, i(obs_no) j(column)

gen begin = strpos(name, "[")
replace name = substr(name, begin+1, strlen(name)-begin-1) if obs_no == 1

frame create crosswalk
frame crosswalk: use `crosswalk_data'
frlink m:1 name, frame(crosswalk)
frget id, from(crosswalk)
assert missing(crosswalk) == (obs_no != 1)

replace name = "knows"+id if !missing(id)
drop begin crosswalk id
reshape wide
foreach v of varlist name* {
    rename `v' `=`v'[1]'
}
drop in 1
drop obs_no

Comment

Adam Sadi

Join Date: Jul 2022

Posts: 68
#3

24 Jan 2023, 01:57

Dear Clyde : Thank you for doing what I couldn't do in days. Unfortunately, the line

Code:

assert missing(crosswalk) == (obs_no != 1)

was very useful even if I wish it wasn't! It yielded the following message : "assertion is false". Is there any way to ask Stata for more details about the reason why the assertion is false ?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30060
#4

24 Jan 2023, 10:07

At the point where the assertion fails, run:

Code:

browse if missing(crosswalk) != (obs_no != 1)

and Stata will show you the problematic observations. Most likely this will traceable back to some column in your original data set where your corresp.dta file has no entry for the corresponding INDIV.
Comment
Adam Sadi

Join Date: Jul 2022

Posts: 68
#5

25 Jan 2023, 02:12

Thank you for your useful insights Clyde. Indeed there are some problems that I did not know about, namely there are first observations with two pairs of square brackets (for instance, the first pair of square brackets had the question label within it), so the code :

Code:

gen begin = strpos(name, "[") replace name = substr(name, begin+1, strlen(name)-begin-1) if obs_no == 1

replaced the content of name with what was inside the first pair of brackets, while I needed what was within the second one. I tried adding these lines of code:

Code:

gen begin2 = strpos(subinstr(name, "[", "(", 1), "[")

But then I came to realize that it would also remove the brackets of the observations that only have one pair of them, since I'm looping on several files. Is there a way to add a condition like "if the number of pairs of brackets is equal to two, replace the first pair by parentheses" ?
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10176
#6

25 Jan 2023, 02:42

Provide an example including a few cases using dataex.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30060

25 Jan 2023, 14:27

If it is always the case that the [INDIV AAAA] that you are interested in always appears at the end of the string, so that only the last [ and ] in the string are of interest and the ] of interest is the final character in the string, then the following modification to the code will work:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str44(A B C D)
"Do you know this indiv [qwert]? [INDIV AAAA]" "Do you know this indiv? [INDIV YYYY]" "Do you know this indiv? [INDIV XXXX]" "Do you know this indiv? [INDIV ZZZZ]"
"0"                                    "1"                                    "0"                                    "1"                                  
"1"                                    "0"                                    "0"                                    "0"                                  
"0"                                    "1"                                    "0"                                    "0"                                  
"1"                                    "0"                                    "0"                                    "0"                                  
end
tempfile dataset1
save `dataset1'

* Example generated by -dataex-. For more info, type help dataex
clear
input str10 name str1 id
"INDIV AAAA" "1"
"INDIV YYYY" "2"
"INDIV XXXX" "3"
"INDIV ZZZZ" "4"
end
tempfile crosswalk_data
save `crosswalk_data'

clear
clear frames
use `dataset1'
ds _all
local vbles `r(varlist)'
rename (`r(varlist)') name#, addnumber
gen `c(obs_t)' obs_no = _n
reshape long name, i(obs_no) j(column)

gen reversed_name = reverse(name)
replace name = reverse(substr(reversed_name, 2, strpos(reversed_name, "[")-2)) if obs_no == 1

frame create crosswalk
frame crosswalk: use `crosswalk_data'
frlink m:1 name, frame(crosswalk)
frget id, from(crosswalk)
assert missing(crosswalk) == (obs_no != 1)

replace name = "knows"+id if !missing(id)
drop reversed_name crosswalk id
reshape wide
foreach v of varlist name* {
    rename `v' `=`v'[1]'
}
drop in 1
drop obs_no

If, the position is not always at the very end of the string, but it is always the last set of brackets that encloses the target material, then a slightly different modification (post back) can be written. But if there can be multiple pairs of brackets and there is no clear rule as to which one is going to be the right target, well, then we may have a big problem on our hands. Definitely show new -dataex- output when posting back, and make it as representative of the difficulties in the data as possible.

Announcement