Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • General advices on how to append datasets from Excel with different variable orders

    I am quite new to Stata (17) and I have to deal with many datasets originating from Excel. The problem with these datasets is that their column orders is different, although the content of each of them is the same. The cause of this problem is that some datasets have additional variables in the middle of the "normal" order (i.e. the variables common to every dataset) which shifts the order of the Excel file to a few columns throughout the spreadsheet. I can't show my actual data, so I took the time to create a fake example resembling the data salad I have right now. If var1, var2, var3... are the variables that are supposed to be common to every dataset, and btwn1, btwn2... etc. are supposed to be variables specific to one file.

    Now this is file1 after it just got imported from Excel -without the option firstrow- :

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str20(A B C D E F)
    "var1" "var2" "var3" "var4" "var5" "var6"
    "info" "info" "info" "info" "info" "info"
    "info" "info" "info" "info" "info" "info"
    end
    This is file2 :

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str20(A B C D E F)
    "var1" "btwn1" "var2" "btwn2" "var3" "var4"
    "info" "info"  "info" "info"  "info" "info"
    "info" "info"  "info" "info"  "info" "info"
    end
    I can't just append my datasets like this, because the contents of var2 in file1 would get mixed up with btwn1. And to make things worse, one variable may have several writings across Excel files ! for instance var1 could be "time", "B1. time", or just "B1" etc. This is why I can't use the -firstrow- option. Fortunately, I have a mapping table that tells me the id and the label of each variable.

    Now, I am clearly not expecting someone to give me a Stata code that would solve perfectly my problem, because let's be honest, data cleaning is boring... I came on this forum with the hope of finding users more familiar with data cleaning that would give me some guidance on how THEY would approach the problem. Hence the "general advices" in my title.

    I can give as many details regarding my data structure as one wants if you need, as long as it's fakeable.

    Thank you for your help !


  • #2
    You can run the highlighted code following each import. The -strtoname()- function will create a legal Stata name from the variable name in the Excel file and the label will be the exact variable name in case it becomes difficult to identify what variable is what. Stata variable names are 32 characters in length, so you will need to have variable names that are distinct once we take the first 32 characters including spaces, otherwise the code will fail. See

    Code:
    help [M-1] naming
    So here it is.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str20(A B C D E F)
    "var1" "var2" "var3" "var4" "var5" "var6"
    "info" "info" "info" "info" "info" "info"
    "info" "info" "info" "info" "info" "info"
    end
    
    foreach var of varlist *{
        label var `var' "`=strtoname(`var'[1])'"
        rename `var' `=strtoname(`var'[1])'
    }
    drop in 1
    Res.:

    Code:
    . desc
    
    Contains data
      obs:             2                          
     vars:             6                          
    -------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -------------------------------------------------------------------------------------------------------------------------------
    var1            str20   %20s                  var1
    var2            str20   %20s                  var2
    var3            str20   %20s                  var3
    var4            str20   %20s                  var4
    var5            str20   %20s                  var5
    var6            str20   %20s                  var6
    -------------------------------------------------------------------------------------------------------------------------------
    Sorted by:
         Note: Dataset has changed since last saved.

    Comment


    • #3
      And to make things worse, one variable may have several writings across Excel files ! for instance var1 could be "time", "B1. time", or just "B1" etc.
      This you will not be solved with #2. There is no obvious way to know that B1= time ex ante, and this will limit any automation. So

      1. Import
      2. Unify variable names across datasets
      3. Run #2

      Comment


      • #4
        Andrew : Thank you for your message.

        I thought about using -strtoname()-, however I knew that it wouldn't work because, as you already mentioned it, some of my variable names on Excel each have a common long introduction sentence (more than 32 characters) before actually changing. I thought about using -strtoname(substr())- but I'm sure there are better options as this one could involve mistakes resulting from a wrong counting of characters.

        I have the feeling I have no other choice but to do some manual coding here

        Comment

        Working...
        X