Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replace values of string variables based on a fraction of my dataset

    Hello. I'm using a mapping table linking a question (varname) to its short label (label) in a survey. Whenever the question refers to a specific individual, the id of this individual is mentioned in its label. For instance, the question C2O for the individual A1 will be C2U_A1 on Stata. However, for a reason that has nothing to do with my problem here, I had to completely change the IDs of the individuals, and I would like to change the values of the label variable based on the new ID. Here's a fraction of my dataset :

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str15 label str3 new_id str5 old_id
    "C2O_O1_"  "E1"  "O1" 
    "C2O_O2_"  "E2"  "O2" 
    "C2O_O3_"  "E3"  "O3" 
    "C2O_O4_"  "E4"  "O4" 
    "C2O_O32_" "E0"  "O32"
    "C2O_O5_"  "E5"  "O5" 
    "C2O_O6_"  "E7"  "O6" 
    "C2O_O7_"  "E8"  "O7" 
    "C2O_O8_"  "E9"  "O8" 
    "C2O_O9_"  "E10" "O9" 
    "C3O_O1_SQ003_"  "" ""
    "C3O_O1_SQ001_"  "" ""
    "C3O_O1_SQ002_"  "" ""
    "C3O_O11_SQ003_" "" ""
    "C3O_O11_SQ001_" "" ""
    "C3O_O11_SQ002_" "" ""
    "C3O_O18_SQ003_" "" ""
    "C3O_O18_SQ001_" "" ""
    "C3O_O18_SQ002_" "" ""
    "C3O_O19_SQ003_" "" ""
    end
    As you can see the old_id denotes the old individual id. I would like to change the value of old_id to the new one in new_id. This is easy and I know how to do this, for example by concatenating strings with variable contents using replace label = "C2O" + id + "_". However my dataset is a weird mix of appended dataset, variables gotten from frames and it is really not consistent. So for another set of questions such as, C3O I have missing values. Now, I would like a code that would, no matter where I am in the dataset, replace the value of old_id by the one in id based on the question C2O. I'm not sure if it's clear so I'm ready to give further details If needed. Thank you everyone for your time !



  • #2
    I'm not certain I understand what you want, but it might be this:
    Code:
    frame put new_id old_id if substr(label, 1, 3) == "C2O", into(crosswalk)
    frame crosswalk {
        duplicates drop
        isid new_id, sort
    }
    
    gen break1 = strpos(label, "_")
    gen id_length = strpos(substr(label, break1+1, .), "_") - 1
    replace old_id = substr(label, break1+1, id_length)
    frlink m:1 old_id, frame(crosswalk)
    replace new_id = frval(crosswalk, new_id) if missing(new_id)
    replace new_id = frval(crosswalk, new_id) if missing(new_id)
    gen new_label = cond(!missing(new_id), subinstr(label, old_id, new_id, 1), label)
    I note, though, that in your example data there are a number of labels like "C30_O11*" and some others where there is no corresponding information among the C20 items, so for them new_id and old_id will remain unchanged.

    In the event the -isid new_id, sort- command breaks telling you that new_id does not uniquely identify the observations, that means that some of the C2O items have contradictory information about the correspondence between old_id and new_id. If that happens, you have to fix the inconsistencies in your data.

    I suspect there is a more elegant way to isolate the part of label that corresponds to old_id using regular expressions, and perhaps somebody who is facile with those will offer that approach later in the thread.

    Comment

    Working...
    X