Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Renaming and Numbering Variables Sharing Common First 32 Characters

    Hi all! I am importing a large dataset from Excel with many variable names that are over 32 characters. Obviously these are truncated to 32 characters upon import; however a large series of these share the first 32 characters in common. As such I am left with a number of variables whose names are the column number from excel, but whose label's contain the full name.

    Example:
    Name Label
    PayorGovernmentHealthInsura Payor - Government Health Insurance - A
    AJ Payor - Government Health Insurance - B
    AK Payor - Government Health Insurance - C
    AL Payor - Government Health Insurance - D
    AM Payor - Government Health Insurance - E
    AN Payor - Government Health Insurance - F
    AO Payor - Government Health Insurance - G
    AP Payor - Government Health Insurance - H
    AQ Payor - Government Health Insurance - I

    I've been trying to write code that selects variables with names greater than 32 characters, generates a variable name from the label, truncates that name to 30 characters then adds a sequential two digit number at the end to produce the following result:
    Name Label
    PayorGovernmentHealthInsu01 Payor - Government Health Insurance - A
    PayorGovernmentHealthInsu02 Payor - Government Health Insurance - B
    PayorGovernmentHealthInsu03 Payor - Government Health Insurance - C
    PayorGovernmentHealthInsu04 Payor - Government Health Insurance - D
    PayorGovernmentHealthInsu05 Payor - Government Health Insurance - E
    PayorGovernmentHealthInsu06 Payor - Government Health Insurance - F
    PayorGovernmentHealthInsu07 Payor - Government Health Insurance - G
    PayorGovernmentHealthInsu08 Payor - Government Health Insurance - H
    PayorGovernmentHealthInsu09 Payor - Government Health Insurance - I

    Ideally these numbers would restart at 01 for each common 30 character stem.

    I have limited coding experience in languages other than STATA so I've been struggling for several hours writing the following clunky code:

    gen k = 0
    foreach var of varlist _all {
    local label : variable label `var'
    local new_name = lower(substr(strtoname("`label'"), 1, 30))
    if strlen(`new_name') = 30 {
    replace k = k + 1
    tostring k, replace
    local num_name = `new_name' + k
    rename `var' `num_name'
    destring k, replace
    }
    else {
    rename `var' `new_name'
    replace k = 0
    }


    I also tried using addnumber but the way I've written it each variable gets a "1" added, instead of sequential numbers.

    foreach var of varlist _all {
    local label : variable label `var'
    local new_name = lower(substr(strtoname("`label'"), 1, 30))
    rename `var' `new_name'#, addnumber
    }


    I would love any advice on writing this code or tips on how others have handled a similar issue.

    Thanks so much!

    Mark

  • #2
    ***UPDATE***


    Fixed some errors in my previous code and now it works as I wrote it.



    local i = 0
    local k = 0
    foreach var of varlist _all {
    local i = `i' + 1
    local label : variable label `var'
    local new_name = lower(substr(strtoname("`label'"), 1, 30))
    if strlen("`new_name'") == 30 {
    local k = `k' + 1
    rename `var' `new_name'`k'
    }
    else {
    rename `var' `new_name'
    local k = 0
    }
    }


    However, I need to build in a way to check to see if the first 30 characters are the same as any of the others, as while the following code correctly returns:
    Name Label
    PayorGovernmentHealthInsu01 Payor - Government Health Insurance - A
    PayorGovernmentHealthInsu02 Payor - Government Health Insurance - B
    PayorGovernmentHealthInsu03 Payor - Government Health Insurance - C
    PayorGovernmentHealthInsu04 Payor - Government Health Insurance - D
    PayorGovernmentHealthInsu05 Payor - Government Health Insurance - E
    PayorGovernmentHealthInsu06 Payor - Government Health Insurance - F
    PayorGovernmentHealthInsu07 Payor - Government Health Insurance - G
    PayorGovernmentHealthInsu08 Payor - Government Health Insurance - H
    PayorGovernmentHealthInsu09 Payor - Government Health Insurance - I



    It also incorrectly returns:
    Name Label
    race___asian Race - Asian
    race___american_indian___alask1 Race - American Indian / Alaskan Native
    race___native_hawaiian___pacif2 Race - Native Hawaiian / Pacific Islander
    race___other Race - Other

    In this example the second and third variables are uniquely identified so I would ideally like to have them as their truncated names without a number.



    Thanks Statalist!



    Also Nick Cox I submitted a request to change my user name to my real name.

    Comment


    • #3
      All I did was to send a psychic pulse through the Statasphere. My powers are growing. Next, the hammer of Thor.

      Comment


      • #4
        I felt the shockwave all the way over in Philadelphia shortly after my post. Knew I had to act.

        Comment


        • #5
          I like your code without having an answer to #2. If this were my problem I would roll up my sleeves and do renaming in blocks, certainly all double underscores to underscores, cutting down prefixes, and so on. So I would write a specific do-file for the purpose based on taste and getting intelligible names,

          Comment

          Working...
          X