Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing Excel variables into Stata with shortened names

    Dear all,

    I am working with a dataset that contains information about individuals and their relationships with other individuals. Each observation in the dataset includes a series of binary variables, with each variable equal to 1 if the individual knows another individual (including themselves). There are a total of n individuals and therefore n variables.

    I have an Excel spreadsheet that I would like to import into Stata, but the variable names are too long for Stata to recognize them as variables. I only need a portion of the variable name, specifically the id number of the individual that the variable refers to.

    Is there a way to import the first line of the Excel file into Stata as just the id number (e.g. N45 if the name of the variable in Excel is "looooooooongname (N45) loooongname")? If not, is there a workaround to achieve this? I am new to data cleaning and open to any suggestions. Note that I am not sure if I am allowed to directly edit the Excel file, so that is not an option.

    I am new to data cleaning, so any advice would help! I wish I could give a -dataex- example, but I'm not even at the step where I have a Stata dataset at my disposal... So apologies for this
    Last edited by Adam Sadi; 08 Jan 2023, 16:46.

  • #2
    Using -import excel- without the -firstrow- option will cause these would-be variable names to be read in as the first observation of the data set, and the Stata variable names will be things like A, B, C, ...

    The next step is to edit the values in that first observation to extract what lies between the parentheses. I have no doubt that there is a simple way to do that with regular expressions, but I will defer to others who may respond later with that. Even without regex you can use the -strpos()- function to locate the ( and ) that surround what you actually want, and then -replace- all of those values by the substring that lies between the parentheses. Something like this:
    Code:
    foreach v of varlist * {
        local lparen = strpos(`v'[1], "(")
        local rparen = strpos(`v'[1], ")")
        if `lparen' > 0 & `rparen' > `lparen' {
            replace `v' = substr(`v', `lparen'+1, `rparen'-`lparen'-1) in 1
        }
        rename `v' `:=`v'[1]'
    }
    You may have to modify that code if some of the names contain multiple (...) sequences, or if, in error, there are unbalanced parentheses. Similarly if what lies between the parentheses turns out not to be, itself, a legal variable name, or if it duplicates an earlier variable name, the code will halt with an error message, and you will have to do some additional -replace- commands before the one in the loop to cope with those problems.

    Once that is done, you will then have to deal with variables that should have been numeric but instead came in as strings because of the reading of the first row. The -destring- command will basically handle that in bulk, but you will have to be attentive to problems pointed out by -destring- as it runs.

    Editing the names in Excel is not a good idea. In addition to being tedious, it is error prone, and you will have no audit trail of what you did, so that you cannot defend the integrity of your work if asked.
    Last edited by Clyde Schechter; 08 Jan 2023, 17:19.

    Comment


    • #3
      Dear Clyde,

      Thank you for your answer. You code was exactly what I needed. There are indeed cases where the names contain multiple parentheses sequences, but I should be able to edit the code you provided the right way. You have helped me a lot!

      Comment


      • #4
        I just realized there are some variables who do not have any id number at all in their first observation, therefore they won't have any parenthesis. It's basically information not on the target individual but the source one. Is there any way to get rid of all variables that do not contain the id number for the first observation?

        Comment


        • #5
          Well, the code shown in #2 can be modified to remove any variable that does not have a pair of parentheses:
          Code:
          foreach v of varlist * {
              local lparen = strpos(`v'[1], "(")
              local rparen = strpos(`v'[1], ")")
              if `lparen' > 0 & `rparen' > `lparen' {
                  replace `v' = substr(`v', `lparen'+1, `rparen'-`lparen'-1) in 1
                  rename `v' `:=`v'[1]'
              }
              else {
                  drop `v'
              }
          }

          Comment

          Working...
          X