  • Finding the values behind encode

    Hi All,

    I am working with two rounds of survey data, which cover individuals across different states (varname v024) in India. I want to append the datasets, but there are a few issues with the encoded state names that I need to sort out.

    For example in data from 2015-16

    Code:
    tab v024
    
                          state |      Freq.     Percent        Cum.
    ----------------------------+-----------------------------------
    andaman and nicobar islands |      2,811        0.40        0.40
                 andhra pradesh |     10,428        1.49        1.89
              arunachal pradesh |     14,294        2.04        3.94
                          assam |     28,447        4.07        8.00
                          bihar |     45,812        6.55       14.55
                     chandigarh |        746        0.11       14.65
                   chhattisgarh |     25,172        3.60       18.25
       --------------------------------------------------------------
    Here for example the state andhra pradesh is encoded with value 2.

    In the data from 2005-06, however, label names and values change:

    Code:
     tab v024
    
                     state |      Freq.     Percent        Cum.
    -----------------------+-----------------------------------
    [jm] jammu and kashmir |      3,281        2.64        2.64
     [hp] himachal pradesh |      3,193        2.57        5.20
               [pj] punjab |      3,681        2.96        8.16
          [uc] uttaranchal |      2,953        2.37       10.54
              [hr] haryana |      2,790        2.24       12.78
                [dl] delhi |      3,349        2.69       15.47
            [rj] rajasthan |      3,892        3.13       18.60
        [up] uttar pradesh |     12,183        9.79       28.40
                [bh] bihar |      3,818        3.07       31.47
               [sk] sikkim |      2,127        1.71       33.18
    [ar] arunachal pradesh |      1,647        1.32       34.50
             [na] nagaland |      3,896        3.13       37.63
              [mn] manipur |      4,512        3.63       41.26
              [mz] mizoram |      1,791        1.44       42.70
              [tr] tripura |      1,906        1.53       44.23
            [mg] meghalaya |      2,124        1.71       45.94
                [as] assam |      3,840        3.09       49.03
          [wb] west bengal |      6,794        5.46       54.49
            [jh] jharkhand |      2,983        2.40       56.89
               [or] orissa |      4,540        3.65       60.54
         [ch] chhattisgarh |      3,810        3.06       63.60
       [mp] madhya pradesh |      6,427        5.17       68.77
              [gj] gujarat |      3,729        3.00       71.77
          [mh] maharashtra |      9,034        7.26       79.03
       [ap] andhra pradesh |      7,128        5.73       84.76
            [ka] karnataka |      6,008        4.83       89.59
                  [go] goa |      3,464        2.78       92.37
               [ke] kerala |      3,566        2.87       95.24
           [tn] tamil nadu |      5,919        4.76      100.00
    -----------------------+-----------------------------------
                     Total |    124,385      100.00
    And the same state andhra pradesh now has label [ap] andhra pradesh with value equal to 28.

    To fix this, I thought I could generate a new variable called state in the 2005-06 data, replace its values and define labels to match 2015-16, and then append the two datasets after also creating a variable called state in 2015-16.

    Code:
    gen state = .
    replace state = 2 if v024 == 28
    replace state = 3 if v024 == 12
    replace state = 4 if v024 == 18
    
    label define state 2 "andhra pradesh" 3 "arunachal pradesh" 4 "assam"
    label values state state
    Otherwise, appending without these changes results in the wrong states being matched on the encoded value.

    My question now is: given the rather large number of observations, how do I find the value behind each label without having to scroll through the data browser, i.e. 1 - andaman and nicobar islands, 2 - andhra pradesh, 3 - arunachal pradesh, etc.? Also, does the method above seem like the most efficient way to get a correct append?

    Thanks a lot!

    Best,
    Lori

  • #2
    Taking this from the top,

    1. If variables have been encoded using even slightly different label definitions, then merging or appending will produce a mess.

    2. It's safer to append or merge on the equivalent string variables, but anomalies will still need cleaning up.

    Some commands will show numbers and labels together, e.g. fre from SSC. Otherwise, hit the encoded variable with numlabel.
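
    As a minimal sketch of the numlabel route, the lines below build a three-state toy dataset (the data and the label name v024 are assumptions for illustration only), prefix each label with its numeric value, tabulate, and then restore the original labels.

    ```stata
    * Toy data standing in for the survey variable (illustration only)
    clear
    input byte v024
    1
    2
    3
    end
    label define v024 1 "andaman and nicobar islands" 2 "andhra pradesh" 3 "arunachal pradesh"
    label values v024 v024

    local lbl : value label v024   // name of the value label attached to v024
    numlabel `lbl', add            // prefix each label with its numeric value
    tabulate v024                  // rows now show value and label together
    numlabel `lbl', remove         // restore the original label text
    ```

    With real data, only the last four lines are needed; looking up the label name first avoids assuming it matches the variable name.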



    • #3
      You could

      Code:
      tabulate v024 , nolabel
      Or you could

      Code:
      codebook v024
      You could also look up the label name (or use extended functions) and

      Code:
      label list `: value label v024'
      There are community-contributed commands that could also be helpful. For example,

      Code:
      ssc install fre
      fre v024
      or

      Code:
      ssc install elabel
      elabel list (v024)

      I think I would follow Clyde's advice [and essentially what Nick says in #2] with small adjustments: decode, make the value labels the same in both datasets, then append, and encode again.


      [Edit]
      This is not tested. Assuming that the labels in the second dataset are all prefixed with [ab], something along these lines should work:

      Code:
      tempfile tmp                          // temporary name for the first dataset
      use dataset1                          // dataset without [ab] prefixed labels
      decode v024 , generate(s_v024)
      drop v024
      save "`tmp'"                          // save under temporary name
      
      use dataset2                          // dataset with [ab] prefixed labels
      decode v024 , generate(s_v024)
      drop v024                             // drop so that encode can re-create v024 below
      replace s_v024 = substr(s_v024, 6, .) // strip [ab]
      
      append using "`tmp'"                  // append the first dataset
      encode s_v024 , generate(v024)

      Best
      Daniel
      Last edited by daniel klein; 29 Oct 2019, 07:17.



      • #4
        Thank you Daniel and Nick.

        I used the code:

        Code:
        decode v024, gen(sv024)
        br sv024
        drop v024 v101
        
        gen sv0242 = "andhra pradesh" if sv024 == "[ap] andhra pradesh"
        replace sv0242 = "arunachal pradesh" if sv024 == "[ar] arunachal pradesh"
        replace sv0242 = "assam"  if sv024 == "[as] assam"
        replace sv0242 = "bihar" if sv024 == "[bh] bihar"
        replace sv0242 = "chhattisgarh" if sv024 == "[ch] chhattisgarh"
        replace sv0242 = "goa" if sv024 == "[go] goa"
        
        drop sv024
        ren sv0242 sv024
        I then appended the datasets after making suitable variable name changes in the master file. The append worked because the nomenclature is now uniform across datasets.
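
        For anyone with many states to recode, the one-line-per-state replacement could probably be collapsed into a single rule. The sketch below is not what I ran; it uses a small toy dataset (an assumption for illustration) and keeps everything after the first "] " whenever the string starts with "[".

        ```stata
        * Toy strings standing in for the decoded 2005-06 labels (illustration only)
        clear
        input str30 sv024
        "[ap] andhra pradesh"
        "[ar] arunachal pradesh"
        "assam"
        end

        * Strip a leading "[xx] " prefix; strings without one are left untouched
        replace sv024 = substr(sv024, strpos(sv024, "] ") + 2, .) ///
            if substr(sv024, 1, 1) == "[" & strpos(sv024, "] ") > 0
        replace sv024 = trim(sv024)    // remove any stray leading/trailing blanks
        list
        ```

        Unlike a fixed substr(sv024, 6, .), this does not assume the prefix is always exactly five characters long.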

        Thanks again,

        Lori
