Finding the values behind encode

Lorien Nair

Join Date: May 2019
Posts: 115


29 Oct 2019, 06:49

Hi All,

I am working with two rounds of survey data that interview individuals across different states (varname v024) in India. I want to append the datasets, but there are a few issues with the encoded state names that I need to sort out.

For example in data from 2015-16

Code:

tab v024

                      state |      Freq.     Percent        Cum.
----------------------------+-----------------------------------
andaman and nicobar islands |      2,811        0.40        0.40
             andhra pradesh |     10,428        1.49        1.89
          arunachal pradesh |     14,294        2.04        3.94
                      assam |     28,447        4.07        8.00
                      bihar |     45,812        6.55       14.55
                 chandigarh |        746        0.11       14.65
               chhattisgarh |     25,172        3.60       18.25
   --------------------------------------------------------------

Here for example the state andhra pradesh is encoded with value 2.

In the data from 2005-06, however, label names and values change:

Code:

 tab v024

                 state |      Freq.     Percent        Cum.
-----------------------+-----------------------------------
[jm] jammu and kashmir |      3,281        2.64        2.64
 [hp] himachal pradesh |      3,193        2.57        5.20
           [pj] punjab |      3,681        2.96        8.16
      [uc] uttaranchal |      2,953        2.37       10.54
          [hr] haryana |      2,790        2.24       12.78
            [dl] delhi |      3,349        2.69       15.47
        [rj] rajasthan |      3,892        3.13       18.60
    [up] uttar pradesh |     12,183        9.79       28.40
            [bh] bihar |      3,818        3.07       31.47
           [sk] sikkim |      2,127        1.71       33.18
[ar] arunachal pradesh |      1,647        1.32       34.50
         [na] nagaland |      3,896        3.13       37.63
          [mn] manipur |      4,512        3.63       41.26
          [mz] mizoram |      1,791        1.44       42.70
          [tr] tripura |      1,906        1.53       44.23
        [mg] meghalaya |      2,124        1.71       45.94
            [as] assam |      3,840        3.09       49.03
      [wb] west bengal |      6,794        5.46       54.49
        [jh] jharkhand |      2,983        2.40       56.89
           [or] orissa |      4,540        3.65       60.54
     [ch] chhattisgarh |      3,810        3.06       63.60
   [mp] madhya pradesh |      6,427        5.17       68.77
          [gj] gujarat |      3,729        3.00       71.77
      [mh] maharashtra |      9,034        7.26       79.03
   [ap] andhra pradesh |      7,128        5.73       84.76
        [ka] karnataka |      6,008        4.83       89.59
              [go] goa |      3,464        2.78       92.37
           [ke] kerala |      3,566        2.87       95.24
       [tn] tamil nadu |      5,919        4.76      100.00
-----------------------+-----------------------------------
                 Total |    124,385      100.00

And the same state andhra pradesh now has label [ap] andhra pradesh with value equal to 28.

I thought that to fix this I could generate a new variable called state in the 2005-06 data, replace its values and define labels to match 2015-16, and then append the two datasets after also creating a variable called state in 2015-16.

Code:

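gen state = .
replace state = 2 if v024 == 28
replace state = 3 if v024 == 12
replace state = 4 if v024 == 18

* label define needs a label name (statelbl is a placeholder name),
* and the label must then be attached to the variable
label define statelbl 2 "andhra pradesh" 3 "arunachal pradesh" 4 "assam"
label values state statelbl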

Otherwise, appending without these changes results in the wrong states being matched, because the match is based on the encoded value.

My question now is: given the rather large number of observations, how do I find the value behind each label without having to scroll through the data browser, i.e. 1 - andaman and nicobar islands, 2 - andhra pradesh, 3 - arunachal pradesh, etc.? Also, does the method above seem like the most efficient way to get a correct append?

Thanks a lot!

Best,
Lori


Nick Cox

Join Date: Mar 2014

Posts: 35651
#2

29 Oct 2019, 07:01

Taking this from the top,

1. If variables have been encoded using even slightly different label definitions, then merging or appending will produce a mess.

2. It's safer to append or merge on the equivalent string variables, but anomalies will still need cleaning up.

Some commands will show numbers and labels together, e.g. fre from SSC. Otherwise, hit the encoded variable with numlabel.
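
A minimal sketch of the numlabel route, assuming v024 already has a value label attached. It prefixes each label with its numeric code, so that a plain tabulate shows both, and then removes the prefixes again so that a later decode returns clean names:

```stata
* add the numeric code to every label of v024, e.g. "2. andhra pradesh"
numlabel `: value label v024', add
tabulate v024

* remove the prefixes again before decoding or appending
numlabel `: value label v024', remove
```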
daniel klein

Join Date: Mar 2014

Posts: 3842
#3

29 Oct 2019, 07:03

You could

Code:

tabulate v024 , nolabel

Or you could

Code:

codebook v024

You could also look up the label name (or use extended functions) and

Code:

label list `: value label v024'

There are community-contributed commands that could also be helpful. For example,

Code:

ssc install fre
fre v024

or

Code:

ssc install elabel
elabel list (v024)

I think I would follow Clyde's advice [and essentially what Nick says in #2] with small adjustments: decode, make the value labels the same in both datasets, then append, and encode again.

[Edit]
This is not tested. Assuming that the labels in the second dataset are all prefixed with [ab], something along these lines should work fine

Code:

tempfile tmp // temporary name for the first dataset

use dataset1 // dataset without [ab] prefixed labels
decode v024 , generate(s_v024)
drop v024
save "`tmp'" // save under temporary name

use dataset2 // dataset with [ab] prefixed labels
decode v024 , generate(s_v024)
drop v024 // otherwise encode below cannot generate v024
replace s_v024 = substr(s_v024, 6, .) // strip [ab]
append using "`tmp'" // append the first dataset
encode s_v024 , generate(v024)
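
One possible way to look for leftover anomalies, sketched here as a variation on the append step and equally untested: mark which dataset each observation came from, then list the state names that occur in only one of the two. The names src and one_round_only are placeholders, and the sketch assumes s_v024 and the tempfile tmp exist as above.

```stata
* src is a placeholder indicator: 0 = data in memory, 1 = appended file
append using "`tmp'", generate(src)

* within each state name, src[1] is the smallest and src[_N] the largest value;
* if they are equal, the name occurs in only one dataset
bysort s_v024 (src): generate byte one_round_only = src[1] == src[_N]
tabulate s_v024 src if one_round_only
```

Names listed here are either states surveyed in only one round or spellings that differ between rounds and still need harmonizing before encode.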

Best
Daniel

Last edited by daniel klein; 29 Oct 2019, 07:17.

Lorien Nair

Join Date: May 2019
Posts: 115

30 Oct 2019, 08:06

Thank you Daniel and Nick.

I used the code:

Code:

decode v024, gen(sv024)
br sv024
drop v024 v101

gen sv0242 = "andhra pradesh" if sv024 == "[ap] andhra pradesh"
replace sv0242 = "arunachal pradesh" if sv024 == "[ar] arunachal pradesh"
replace sv0242 = "assam"  if sv024 == "[as] assam"
replace sv0242 = "bihar" if sv024 == "[bh] bihar"
replace sv0242 = "chhattisgarh" if sv024 == "[ch] chhattisgarh"
replace sv0242 = "goa" if sv024 == "[go] goa"

drop sv024
ren sv0242 sv024

I then appended the datasets after making suitable varname changes in the master file. The append worked because the state names are now uniform across datasets.
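
A small check that could be added to code like the above, as a sketch rather than part of what was actually run: because gen and replace only fill in the names that are listed explicitly, any label missed in the list leaves sv0242 empty. Run between the last replace and drop sv024, the following shows whether that happened.

```stata
* run after the gen/replace lines and before "drop sv024"
* number of observations whose original label matched none of the conditions
count if missing(sv0242)

* the original labels that still need a replace line
tabulate sv024 if missing(sv0242)
```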

Thanks again,

Lori
