  • Regular expressions to pair and compare variables by name

    I have a large dataset that contains information about pieces of equipment. I need to be able to aggregate up individual equipment characteristics to a group level.

    Variables are labeled in the following way: <Group ID>_<Subgroup ID>_<element>_<equipment #>. For example, "PP_cam_old_2" captures the old format of the second piece of equipment listed under "PP" and "cam."

    I would like to compare information about individual pieces of equipment and aggregate up within groups and subgroups. This requires matching variables on their Group and Subgroup IDs as well as the equipment # (to make sure I’m comparing information about the same piece of equipment). For example, to compare the old format of the second piece of equipment under "PP" and "cam" with its new format, I need to compare PP_cam_old_2 to PP_cam_new_2. This would tell me whether the equipment is being upgraded, and to what.

    I would prefer not to do this manually for every piece of equipment in the dataset. Some combination of a foreach loop with an embedded regular expression seems like the way to go. I am new to regular expressions, but my experience using them to relabel this dataset has been encouraging.

    In the example above, my initial thought was to start by generating a new variable that records the difference between <Group ID>_<Subgroup ID>_old_<equipment #> and <Group ID>_<Subgroup ID>_new_<equipment #>. I started playing around with the code below, but it is clearly not correct:

    foreach v of varlist PP_*_old* {
    local var: var `v'
    if regexm(`"`var'"', "^(PP_)(.+)_old_([0-9]+$"){
    gen (1)_(2)_1 = .
    if regexs(1)_(2)_old_regexs(3) = "hardware" & regexs(1)_(2)_new_regexs(3) = "hardware"
    replace (1)_(2)_1 = 1
    }
    }

    I get the following error:
    PP_scg_switch_old_1 not allowed
    r(101);

    Interestingly, this is not the first variable that matches the criteria listed in -if regexm(`"`var'"', "^(PP_)(.+)_old_([0-9]+$")-

    Any thoughts?

  • #2
    I think the syntax for your local should be
    Code:
    local var "var`v'"
    Also, equality tests need two equals signs (==), and the if qualifier should go at the end of the replace command
    Code:
     replace (1)_(2)_1 = 1 if regexs(1)_(2)_old_regexs(3) == "hardware" & regexs(1)_(2)_new_regexs(3) == "hardware"
    What does your data look like? Are you trying to parse the values of variables or the names of the variables themselves?
    Last edited by Kris Bitney; 14 Apr 2017, 15:29.
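
For readers following along, here is a minimal sketch of how the name-matching loop could be written so that it runs. It is not the original poster's code. It assumes the PP_*_old_* and PP_*_new_* variables are strings holding values such as "hardware", and the both_hw_ prefix is a hypothetical name chosen for illustration. The key point is that regexs() returns a string, so each captured piece has to be copied into a local macro before it can be spliced into a variable name.

```stata
* Sketch only: assumes string variables such as PP_cam_old_2 and PP_cam_new_2.
foreach v of varlist PP_*_old_* {
    * `v' already holds the variable name, so it can be matched directly
    if regexm("`v'", "^(PP)_(.+)_old_([0-9]+)$") {
        * copy the captured pieces into locals before using them in names
        local grp = regexs(1)
        local sub = regexs(2)
        local num = regexs(3)
        local new `grp'_`sub'_new_`num'
        * skip equipment that has no matching "new" variable
        capture confirm variable `new', exact
        if _rc continue
        * both_hw_* is a hypothetical name for the comparison flag
        gen byte both_hw_`sub'_`num' = (`v' == "hardware" & `new' == "hardware")
    }
}
```

Unlike the attempt in the question, the flag here is 0 rather than missing when the condition fails, and a long subgroup ID could push the generated name past Stata's 32-character limit for variable names.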



    • #3
      1) I find your description of your data hard to understand and your terminology unusual (you say "variable" where I think you mean "observation"), and I suspect others will have difficulty as well. As described in the FAQ, a data example (see -dataex-) is almost always a better way to help others help you.

      2) That aside: My experience is that 90% of questions raised about how to do something with regular expressions can be done quite well without them. You might look at -help string functions-.
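
As one illustration of the string-function route, the pairing of old and new variables can probably be done with subinstr() alone. This is a sketch under assumptions, not a tested solution for this dataset: it assumes every *_old_* variable has a *_new_* counterpart whose name is otherwise identical, that both are strings, and it uses a hypothetical chg_ prefix for the result.

```stata
* Sketch only: assumes paired string variables named like X_old_1 / X_new_1.
foreach v of varlist *_old_* {
    * swap the element tag in the name; no regular expression needed
    local new = subinstr("`v'", "_old_", "_new_", 1)
    * skip variables without a counterpart
    capture confirm variable `new', exact
    if _rc continue
    * chg_* is a hypothetical prefix for the "value changed" flag
    local stem = subinstr("`v'", "_old_", "_", 1)
    gen byte chg_`stem' = (`v' != `new') if !missing(`v', `new')
}
```

The flag is left missing whenever either value is empty, which may matter given how sparse such data can be.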



      • #4
        .



        • #5
          Ah, thank you for the == catch.

          And yes, I should be clearer in my terminology. I am trying to parse the values of the variables. That is, I am trying to compare observations within a single record. I attach an image of the data. As you can see from this screenshot, the dataset is incredibly sparse. Referencing the variables shown in the screenshot, I’d be trying to compare a single record’s observation for MC_auto_old_1 to that for MC_auto_new_1.
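
For the specific pair named above, the within-record comparison and a simple roll-up to the subgroup level might look like the sketch below. It assumes the MC_auto_* variables are strings and that an empty string means nothing was recorded. MC_auto_chg_1 and MC_auto_n_chg are hypothetical names introduced here for illustration.

```stata
* Sketch only: assumes MC_auto_old_1 and MC_auto_new_1 are string variables.
* Flag records where the format changed for equipment #1.
gen byte MC_auto_chg_1 = (MC_auto_old_1 != MC_auto_new_1) if !missing(MC_auto_old_1, MC_auto_new_1)

* Once one flag per piece of equipment exists, count the changes per record
* within the MC/auto subgroup; rowtotal() treats missing flags as zero.
egen MC_auto_n_chg = rowtotal(MC_auto_chg_*)
```

Because rowtotal() treats missing values as zero, a record with no recorded equipment gets a count of 0 rather than missing, which may or may not be the desired behavior in a sparse dataset.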


          [Image: Data capture.PNG]

