  • Rename variables depending on the existence of a digit and a special character

    I'm working with a data set resembling the extract below:

    Code:
    clear
    set obs 10
    generate var2000 = runiform()
    generate var_2001 = runiform()
    generate ind20032004 = runiform()
    generate ind_20032004 = runiform()
    generate ind65somthing_2005 = runiform()
    generate ind65something2006 = runiform()
    I would like to apply the following transformations to this data:
    1. Select the variables that do not have an underscore in their name, ignoring the variables whose name already contains one.
    2. Insert an underscore before the last set of digits; the resulting variable names should resemble the table below:
    Current variable name    New variable name     Change
    var2000                  var_2000              Underscore sign added
    var_2001                 var_2001              No change
    ind20032004              ind_20032004          Underscore sign added
    ind_20032004             ind_20032004          No change
    ind65somthing_2005       ind65somthing_2005    No change
    ind65something2006       ind65something2006    Underscore sign added

    I'm thinking that the pseudocode would look like this:
    PHP Code:
    IF the name does not contain "_" THEN
       find the last set of digits and insert "_" before it
    ELSE
       ignore the variable
    Last edited by Konrad Zdeb; 25 Feb 2015, 02:58. Reason: Tags.
    Kind regards,
    Konrad
    Version: Stata/IC 13.1

  • #2
    rename *20* *_20*
    rename *__2* *_2*

    The first command adds _ to all the matching variable names; the second gets rid of the double __ that this creates in names that already had one.
    Note that this adds the _ to any 20 in your variable names, so the result may still need polishing afterwards. See whether similar transformations work for your set of variable names.



    • #3
      Almost, but it returned an error:

      Code:
      . * Rename those variables.
      . foreach var of varlist `renvarsnoscore' {
        2.         rename `var' *_20*
        3. }
      
      too many wildcards in newname
          You requested SIMD2004 be renamed *_20*.  There are more wildcards in new than in old.  Wildcards in old and
          new correspond one-to-one from left-to-right unless you specify explicit subscripts in new.  Anyway, rename
          ran out of wildcards in old when matching the wildcards in new.  Perhaps you just made a mistake or perhaps
          you forgot an explicit subscript in new.  Or perhaps you forgot to specify option addnumber, which allows you
          to specify an extra #, (#), (##), ...
      I have to pass the variable list by hand, as there are other variables in this data set that I do not want to touch.
      Kind regards,
      Konrad
      Version: Stata/IC 13.1



      • #4
        Konrad,

        there are two problems with your table overview. First, the name ind_20032004 appears twice as a new variable name, which, as you know, is impossible. Second, you claim an underscore was added to ind65something2006, but there is none. Supposing ind_20032004 is actually something like ind_20042005, and ind65something2006 is to be renamed ind65something_2006, here is the code:

        Code:
        // remove the underscores
        ren (*_#) (*#)
        
        // now add the underscores
        ren (*#) (*_#)
        If you tell us more about the variables you do not want to touch, there is probably a way to do it without a loop.

        Best
        Daniel
        Last edited by daniel klein; 25 Feb 2015, 03:39.



        • #5
          Your code isn't what Jorrit suggested. Each rename within the loop tries to rename a single variable by a wildcard, which makes no sense; and Stata is telling you this. The key point about rename groups is that it loops for you.
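As a small illustration of that point, a single rename with parenthesized lists handles several variables in one call, with no foreach. This is only a sketch: the two names are taken from the sandbox in #1, and the new names are the ones the table appears to intend.

```stata
clear
set obs 10
generate var2000 = runiform()
generate ind65something2006 = runiform()

* one rename call maps each old name to the new name in the same position
rename (var2000 ind65something2006) (var_2000 ind65something_2006)

describe, simple
```

Listing the names explicitly also leaves every other variable in the data set untouched.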

          Variables without underscores in their names will be identified by

          Code:
          ds *_*, not
          Note that your sandbox dataset (which is helpful) would be problematic for your rules as two variables would be renamed to the same name, which can't happen.

          Judging by your examples, something like this may be sufficient:

          Code:
          clear
          set obs 10
          generate var2000 = runiform()
          generate var_2001 = runiform()
          generate ind20032004 = runiform()
          * generate ind_20032004 = runiform()
          generate ind65somthing_2005 = runiform()
          generate ind65something2006 = runiform()
          
          ds *_*, not
          
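          foreach v in `r(varlist)' {
              forval y = 2000/2014 {
                  * insert "_" before the first occurrence of the year, if any
                  local newv : subinstr local v "`y'" "_`y'"
                  * a longer name means the year was found
                  if length("`v'") != length("`newv'") {
                      rename `v' `newv'
                      continue, break
                  }
              }
          }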
          There'll be a solution with regular expressions.



          • #6
            Thanks very much for your interest and the useful suggestions.
            Kind regards,
            Konrad
            Version: Stata/IC 13.1



            • #7
              Focusing on how to insert an underscore character before the last sequence of digits, you can use

              Code:
              local vlist x x1y v1 var2000 ind20032004 ind65something2006
              foreach v in `vlist' {
                  if regexm("`v'", "(.*[^0-9])([0-9]+)$") {
                         local v_y = regexs(1) + "_" + regexs(2)
                         dis "rename `v' `v_y'"
                  }
              }
              The pattern contains 2 subexpressions bracketed by parentheses. The first subexpression is "(.*[^0-9])" and breaks down to
              1. "." match any character
              2. "*" modify #1 to match zero or more
              3. "[^0-9]" matches any character that is not a digit
              The second subexpression is "([0-9]+)" and breaks down to
              1. "[0-9]" match a digit
              2. "+" modify #1 to match one or more
              Finally, the pattern ends with "$", which indicates that a successful match must extend to the last character. The first subexpression match is placed in regexs(1) and the second in regexs(2).
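The pattern can be combined with the ds selection from #5 to perform the renames rather than just display them. The following is a minimal sketch, not a tested general solution: it assumes the sandbox data from #1 without the clashing ind_20032004, and it assumes every variable without an underscore is a candidate for renaming. The local names candidates and newv are introduced here for illustration only.

```stata
clear
set obs 10
generate var2000 = runiform()
generate var_2001 = runiform()
generate ind20032004 = runiform()
generate ind65somthing_2005 = runiform()
generate ind65something2006 = runiform()

* names without an underscore are left in r(varlist)
ds *_*, not
local candidates `r(varlist)'

foreach v of local candidates {
    * split the name into a stem ending in a non-digit and the trailing digits
    if regexm("`v'", "(.*[^0-9])([0-9]+)$") {
        local newv = regexs(1) + "_" + regexs(2)
        rename `v' `newv'
    }
}

describe, simple
```

If other variables in the data set must not be touched, the varlist given to ds would need to be narrowed accordingly.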

              Note that the following simpler pattern does not work because the "*" quantifier is greedy, so "(.*)" consumes every character except the final digit:

              Code:
              local vlist x x1y v1 var2000 ind20032004 ind65something2006
              foreach v in `vlist' {
                  if regexm("`v'", "(.*)([0-9]+)$") {
                         local v_y = regexs(1) + "_" + regexs(2)
                         dis "rename `v' `v_y'"
                  }
              }
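To see the difference on a single name, the two patterns can be run side by side. This is a sketch under the assumption that regexm() backtracks in the usual greedy way; on that assumption the first display should show var200_0 and the second var_2000.

```stata
* compare the two patterns on one name
local v var2000

* greedy "(.*)" leaves only the final digit for the second subexpression
if regexm("`v'", "(.*)([0-9]+)$") {
    display regexs(1) + "_" + regexs(2)
}

* forcing the stem to end in a non-digit keeps the trailing digits together
if regexm("`v'", "(.*[^0-9])([0-9]+)$") {
    display regexs(1) + "_" + regexs(2)
}
```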



              • #8
                Robert, thank you for your input, much appreciated.
                Kind regards,
                Konrad
                Version: Stata/IC 13.1
