Recoding a string variable into numeric (instead of destring) and subsequent mismatch of values

Natasha H

Join Date: Oct 2014

Posts: 5
#1

Recoding a string variable into numeric (instead of destring) and subsequent mismatch of values

17 Sep 2019, 19:55

I have some questions regarding an issue I have encountered when trying to change a string variable into a numeric (categorical) one. I am writing my question in a series of commands to show my steps.

My variable is called "confused" - with a 5-point Likert scale response for possible answers (1=Very unlikely, 2=Unlikely, 3=Neutral, 4=Likely, 5=Very likely). By chance, the category "Unlikely" appears not to have been selected by survey respondents for this particular variable, and there were 23 respondents who didn't complete the question.

Here is my variable:
. tab confused, missing

Consider a |
patient who |
you think |
might have |
delirium. How |
likely do the |
words li | Freq. Percent Cum.
--------------+-----------------------------------
| 23 10.13 10.13
Likely | 66 29.07 39.21
Neutral | 2 0.88 40.09
Very likely | 135 59.47 99.56
Very unlikely | 1 0.44 100.00
--------------+-----------------------------------
Total | 227 100.00

For analysis, I want to change the categorical Likert responses into numeric data with labels.

1) First, I tried destring:
. destring confused, replace

The error message was:
confused: contains nonnumeric characters; no replace

2) I wasn't sure why I got this message, so I tried encode:
encode confused, generate(confused2)

. d confused2

storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------------------------------------
confused2 long %13.0g confused2
Consider a patient who you think might have delirium. How likely do the words li

3) This looks good, but I would like to ensure that the correct categories have the correct numeric labels for future analysis. I prefer to have numbers as the primary data source so that if I need to do things quickly with future commands, I can write <if confused ==1> instead of <if confused =="Very unlikely">.

Initially I thought:
. label define likertlbl 1 "Very unlikely" 2 "Unlikely" 3 "Neutral" 4 "Likely" 5 "Very likely"

. label values confused2 likertlbl

. tab confused2, missing

Consider a |
patient who |
you think |
might have |
delirium. How |
likely do the |
words li | Freq. Percent Cum.
--------------+-----------------------------------
Very unlikely | 66 29.07 29.07
Unlikely | 2 0.88 29.96
Neutral | 135 59.47 89.43
Likely | 1 0.44 89.87
. | 23 10.13 100.00
--------------+-----------------------------------
Total | 227 100.00

But the output "confused2" here doesn't match the original "confused" categories - the response "likely" should have had n=66 from the original data, but here it has been made to match the "very unlikely" response.

I think this is because the responses were not in hierarchical order in the "confused" variable as they appear alphabetical in the list. Therefore when I used the <label define> command, it assumed that everything was already in hierarchical order in "confused2" when it assigned labels.

FIRST QUESTION: Is my interpretation of this mismatch correct?

4) So, to try again, I performed the following commands

. generate confused3 = 1 if confused =="Very unlikely"
(226 missing values generated)

. replace confused3 = 2 if confused =="Unlikely"
(0 real changes made)

. replace confused3 = 3 if confused =="Neutral"
(2 real changes made)

. replace confused3 = 4 if confused =="Likely"
(66 real changes made)

. replace confused3 = 5 if confused =="Very likely"
(135 real changes made)

. tab confused3, missing

confused3 | Freq. Percent Cum.
------------+-----------------------------------
1 | 1 0.44 0.44
3 | 2 0.88 1.32
4 | 66 29.07 30.40
5 | 135 59.47 89.87
. | 23 10.13 100.00
------------+-----------------------------------
Total | 227 100.00

Now I have the correct "match" - I know there were 66 responses that selected "likely" where 4=likely.
I can now attach my labels like this:

. label values confused3 likertlbl

. tab confused3, missing

confused3 | Freq. Percent Cum.
--------------+-----------------------------------
Very unlikely | 1 0.44 0.44
Neutral | 2 0.88 1.32
Likely | 66 29.07 30.40
Very likely | 135 59.47 89.87
. | 23 10.13 100.00
--------------+-----------------------------------
Total | 227 100.00

So in the end, I have achieved what I wanted. However this was very labor intensive!

SECOND QUESTION: Is there an easier way to achieve what I wanted than going through the code in section 4?

Previously I have made the changes for the numeric categories in Excel, so that when I import into Stata, they are already numeric. And then I simply assign the variable labels. I was hoping to do everything in Stata at once so that there was a clear log file for all commands with data cleaning and manipulation.

Apologies if these are simple questions and I have made it a long-winded scenario!

Any assistance appreciated.
Tags: None

Dave Airey

Join Date: Apr 2014
Posts: 398

17 Sep 2019, 21:12

Please enclose your code and output in CODE tags so the formatting is readable.

If you have a character variable and want to encode it as a numerical variable, one way is to create a label and then use it with an encode statement like below:

Code:

clear

input str20 likely int freq
"very unlikely" 10
"unlikely" 20
"neutral" 5
"likely" 5
"very likely" 2
end
compress likely
expand freq

tabulate likely

label define likely ///
    1 "very unlikely" ///
    2 "unlikely" ///
    3 "neutral" ///
    4 "likely" ///
    5 "very likely"
    
rename likely likely_str
encode likely_str, generate(likely) label(likely)

tabulate likely
tabulate likely, nolabel

Last edited by Dave Airey; 17 Sep 2019, 21:19.

Comment

Natasha H

Join Date: Oct 2014
Posts: 5

18 Sep 2019, 15:15

Thank you Dave. I thought I had copied the code using the "source" button so that it would appear formatted correctly (because when I created the post, it appeared in the correct format), but when it was posted it ended up all in the same text format. I didn't know about the CODE tags beforehand, so apologies. I will try this now and hope that I have used the buttons correctly.

So I think I followed what you did and generated "confused3" which is in long format with the new code. But when I try to interrogate one of the survey responses, eg. trying to list the occupation of the respondents who answered "Very unlikely" to this variable, no list is generated. I tried using the number 1 (which should represent the category Very unlikely) (output blank), then tried using the old text name (error as it is no longer string variable).

When I used your last command about "nolabel", the label values have somehow been switched to 6, 7, 8, 9, 10 rather than 1, 2, 3, 4, 5. (I had attempted to encode a different variable prior to re-trying the "confused" variable). Why has this happened?

If we had followed the format I had intended where 1=very unlikely, I would have thought 6=very unlikely by this system. But I have 66 responses attached to the "6" label" which is the number of "likely" responses. So it doesn't seem that my hierarchical structure has been preserved correctly. Not sure what I am doing wrong.

Code:

. tab confused, missing

   Consider a |
  patient who |
    you think |
   might have |
delirium. How |
likely do the |
     words li |      Freq.     Percent        Cum.
--------------+-----------------------------------
              |         23       10.13       10.13
       Likely |         66       29.07       39.21
      Neutral |          2        0.88       40.09
  Very likely |        135       59.47       99.56
Very unlikely |          1        0.44      100.00
--------------+-----------------------------------
        Total |        227      100.00

. label define likely 1 "very unlikely" 2 "unlikely" 
> 3 "neutral" 4 "likely" 5 "very likely"

. encode confused, generate(confused3) label(likely)

. d confused3

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------
confused3       long    %13.0g     likely     Consider a
                                        patient who you
                                        think might have
                                        delirium. How likely
                                        do the words li

. tab confused3, missing

   Consider a |
  patient who |
    you think |
   might have |
delirium. How |
likely do the |
     words li |      Freq.     Percent        Cum.
--------------+-----------------------------------
       Likely |         66       29.07       29.07
      Neutral |          2        0.88       29.96
  Very likely |        135       59.47       89.43
Very unlikely |          1        0.44       89.87
            . |         23       10.13      100.00
--------------+-----------------------------------
        Total |        227      100.00

. list occupation if confused3 ==1

. list occupation if confused3 =="Very unlikely"
type mismatch
r(109);

. list occupation if confused3 ==Very unlikely
Very not found
r(111);

. list occupation if confused3 =="1"
type mismatch
r(109);
. tab confused3, nolabel

 Consider a |
patient who |
  you think |
 might have |
  delirium. |
 How likely |
     do the |
   words li |      Freq.     Percent        Cum.
------------+-----------------------------------
          6 |         66       32.35       32.35
          7 |          2        0.98       33.33
          9 |        135       66.18       99.51
         10 |          1        0.49      100.00
------------+-----------------------------------
      Total |        204      100.00

. list occupation if confused3 ==10

     +---------------+
     |    occupation |
     |---------------|
 55. | Allied health |
     +---------------+

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

18 Sep 2019, 16:53

The reason this happened is because -encode- treats the value labels as case sensitive. The -label define- command in #3 has the value labels in lower case. But the data values actually begin with upper case levels. So Stata does not apply the lower case labels, but instead adds new labels starting where the original label leaves off. To use -encode- with a pre-specified label it is crucial that the text in the labels match the text in the string variable exactly. Variations in case, spacing, etc. will cause undesired problems like what you have observed.
1 like
Comment
Natasha H

Join Date: Oct 2014

Posts: 5
#5

19 Sep 2019, 02:56

Ah, thanks Clyde! I didn't spot that! Usually I'm good with my case sensitive labels but I must have been in a hurry. Greatly appreciated.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

19 Sep 2019, 09:19

No problem.

While I have your attention, let me remind you that the norm in this community is to promote professionalism and collegiality by using our real given and surnames as our username. It is not possible for you to edit your user profile to make this change. But you can click on "Contact Us" in the lower right corner of this page and message the forum administrator requesting him to do it for you. Thanks.
1 like
Comment

Announcement