Creating new numeric variable according to a string variable

Alex Blanchet

Join Date: Oct 2015

Posts: 7
#1


16 Nov 2015, 20:27

Hi everyone,

I am dealing with survey data with open-ended questions that I need to clean. The questions I am interested in are factual knowledge questions about different people. Respondents were asked things like "Do you recall the name of the person who holds X position?". Respondents then had to type in the name of the person. Say the correct response was "John Doe"; respondents may have written JOHN DOE, John Doe, john doe, Doe, DOE, Bob Doe, Bill Doe, J Doe...

I now need to clean this and create binary variables where 1=correct and 0=incorrect response. I think that the only possible approach is to do this incrementally by progressively coding all the specific ways of writing "John Doe" used by the respondents.

So I would proceed like this :

gen dummy1=0

This creates a dummy variable for the first knowledge question. But then I need to recode the values of dummy1 according to the answer given to the question about John Doe (let's call that variable "qdoe"). So, first, I would need to recode as 1 in dummy1 all answers in qdoe that contain "doe" (and then all its possible variations: Doe, DOE...).

What I have in mind is something like this :

recode dummy1 0=1 if qdoe=="Doe"
recode dummy1 0=1 if qdoe=="doe"
recode dummy1 0=1 if qdoe=="DOE"

etc.

But, this does not work. Moreover, what I need is that dummy1 be recoded to 1 if qdoe contains "Doe", whether or not qdoe is perfectly equal to "Doe". Hence, respondents who answered "John Doe" and those who only wrote "Doe" without the first name would both be coded as 1 simultaneously.

Any help would be highly appreciated!
Alex Blanchet

Join Date: Oct 2015

Posts: 7
#2

16 Nov 2015, 20:59

I found the solution there : http://stackoverflow.com/questions/2...tring-variable

Sorry if you spent time on my problem.
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#3

16 Nov 2015, 21:10

Well, if it's literally true that any answer that contains doe (in any combination of cases) would be counted as correct, then you can just do:

Code:

replace dummy1 = 1 if strpos(upper(qdoe), "DOE")

But that probably isn't quite enough, because that would also give credit for somebody who wrote "Thomas Blesdoe" as the response. You might get a little bit further than this if you used:

Code:

forvalues j = 1/3 {
    replace dummy1 = 1 if upper(word(qdoe, `j')) == "DOE"
}

That will at least eliminate the "Thomas Blesdoe" type of false positive. But it could still be defeated by misspellings of "Doe" or spurious word-internal spaces.
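One possible variation on the whole-word idea, sketched here under assumptions: instead of checking only the first three words, pad the cleaned answer with blanks and search for " DOE ", so that DOE is matched as a complete word at any position. The toy responses and the helper variable qdoe_clean are invented for illustration; only qdoe and dummy1 come from the thread. This still does not handle misspellings or word-internal spaces.

```stata
* Toy data; these responses are invented for illustration
clear
input str30 qdoe
"JOHN DOE"
"Doe, John"
"Thomas Blesdoe"
"j.  doe"
"Jane Smith"
end

gen byte dummy1 = 0

* qdoe_clean is a hypothetical helper variable: upper case,
* commas and periods turned into blanks
gen str40 qdoe_clean = upper(qdoe)
replace qdoe_clean = subinstr(qdoe_clean, ",", " ", .)
replace qdoe_clean = subinstr(qdoe_clean, ".", " ", .)

* Collapse repeated blanks, then pad one blank on each side
replace qdoe_clean = " " + itrim(trim(qdoe_clean)) + " "

* " DOE " can only match when DOE stands alone as a word
replace dummy1 = 1 if strpos(qdoe_clean, " DOE ") > 0

list qdoe dummy1
```

In this toy data, "Thomas Blesdoe" stays at 0 because DOE never appears with a blank on both sides, while "Doe, John" and "j.  doe" are picked up once the punctuation is turned into blanks.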

When all is said and done, working with free-form response data is difficult. You can write some clever general code that will get most of the job done, but in the end, you will probably have to do some visual inspections and write a few additional lines of code to handle the bizarre cases that inevitably crop up.
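A minimal sketch of that inspect-and-patch step, assuming the variable names qdoe and dummy1 from this thread; the toy responses, and the decision to count "D oe" as correct, are invented for illustration.

```stata
* Toy data standing in for the survey; responses are invented
clear
input str30 qdoe
"John Doe"
"DOE"
"Jon Deo"
"D oe"
"don't know"
end

* First pass: the general rule
gen byte dummy1 = 0
replace dummy1 = 1 if strpos(upper(qdoe), "DOE") > 0

* Visual inspection: distinct answers not yet coded as correct,
* most frequent first
tabulate qdoe if dummy1 == 0, sort

* Extra line for an odd case spotted in the table
* (which cases deserve credit is a judgment call)
replace dummy1 = 1 if upper(trim(qdoe)) == "D OE"

* Check what is still left
list qdoe if dummy1 == 0
```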

The most important thing to learn from this is that when you are in charge of designing the data collection, avoid free-form response items as much as possible.
Alex Blanchet

Join Date: Oct 2015

Posts: 7
#4

16 Nov 2015, 21:27

Thanks, I will use your approach for the more complicated spellings.

My approach is basically to add different spellings as I judge necessary when looking at what remains after some screening. This is the only way to do it.
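A rough sketch of what that incremental loop might look like, assuming the variables qdoe and dummy1 from above; the toy responses and the list of accepted spellings are made up for illustration.

```stata
* Toy data; responses are invented
clear
input str30 qdoe
"John Doe"
"doe"
"J Doe"
"Jon Doe"
"Bob Smith"
end

gen byte dummy1 = 0

* Spellings accepted so far (illustrative); extend the list after each screening
foreach s in "DOE" "JOHN DOE" "J DOE" {
    replace dummy1 = 1 if upper(trim(qdoe)) == "`s'"
}

* What remains uncoded after this pass
list qdoe if dummy1 == 0
```

Because each spelling is matched exactly, an answer like "Bob Doe" is not credited unless it is added to the list on purpose.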

I am not in charge of this specific survey, and I was actually hoping to get some initial results tonight. But when I saw the raw data, I realized that this was not going to happen... after a few facepalms.

But thanks again!