General advices on how to append datasets from Excel with different variable orders

Thomas Brot

Join Date: Feb 2023

Posts: 33
#1

General advices on how to append datasets from Excel with different variable orders

08 Feb 2023, 07:47

I am quite new to Stata (17) and I have to deal with many datasets originating from Excel. The problem with these datasets is that their column orders is different, although the content of each of them is the same. The cause of this problem is that some datasets have additional variables in the middle of the "normal" order (i.e. the variables common to every dataset) which shifts the order of the Excel file to a few columns throughout the spreadsheet. I can't show my actual data, so I took the time to create a fake example resembling the data salad I have right now. If var1, var2, var3... are the variables that are supposed to be common to every dataset, and btwn1, btwn2... etc. are supposed to be variables specific to one file.

Now this is file1 after it just got imported from Excel -without the option firstrow- :

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str20(A B C D E F) "var1" "var2" "var3" "var4" "var5" "var6" "info" "info" "info" "info" "info" "info" "info" "info" "info" "info" "info" "info" end

This is file2 :

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str20(A B C D E F) "var1" "btwn1" "var2" "btwn2" "var3" "var4" "info" "info" "info" "info" "info" "info" "info" "info" "info" "info" "info" "info" end

I can't just append my datasets like this, because the contents of var2 in file1 would get mixed up with btwn1. And to make things worse, one variable may have several writings across Excel files ! for instance var1 could be "time", "B1. time", or just "B1" etc. This is why I can't use the -firstrow- option. Fortunately, I have a mapping table that tells me the id and the label of each variable.

Now, I am clearly not expecting someone to give me a Stata code that would solve perfectly my problem, because let's be honest, data cleaning is boring... I came on this forum with the hope of finding users more familiar with data cleaning that would give me some guidance on how THEY would approach the problem. Hence the "general advices" in my title.

I can give as many details regarding my data structure as one wants if you need, as long as it's fakeable.

Thank you for your help !
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10222

08 Feb 2023, 08:03

You can run the highlighted code following each import. The -strtoname()- function will create a legal Stata name from the variable name in the Excel file and the label will be the exact variable name in case it becomes difficult to identify what variable is what. Stata variable names are 32 characters in length, so you will need to have variable names that are distinct once we take the first 32 characters including spaces, otherwise the code will fail. See

Code:

help [M-1] naming

So here it is.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str20(A B C D E F)
"var1" "var2" "var3" "var4" "var5" "var6"
"info" "info" "info" "info" "info" "info"
"info" "info" "info" "info" "info" "info"
end

foreach var of varlist *{
    label var `var' "`=strtoname(`var'[1])'"
    rename `var' `=strtoname(`var'[1])'
}
drop in 1

Res.:

Code:

. desc

Contains data
  obs:             2                          
 vars:             6                          
-------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------------------------------
var1            str20   %20s                  var1
var2            str20   %20s                  var2
var3            str20   %20s                  var3
var4            str20   %20s                  var4
var5            str20   %20s                  var5
var6            str20   %20s                  var6
-------------------------------------------------------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10222
#3

08 Feb 2023, 08:19

And to make things worse, one variable may have several writings across Excel files ! for instance var1 could be "time", "B1. time", or just "B1" etc.

This you will not be solved with #2. There is no obvious way to know that B1= time ex ante, and this will limit any automation. So

1. Import
2. Unify variable names across datasets
3. Run #2
Comment
Thomas Brot

Join Date: Feb 2023

Posts: 33
#4

08 Feb 2023, 08:21

Andrew : Thank you for your message.

I thought about using -strtoname()-, however I knew that it wouldn't work because, as you already mentioned it, some of my variable names on Excel each have a common long introduction sentence (more than 32 characters) before actually changing. I thought about using -strtoname(substr())- but I'm sure there are better options as this one could involve mistakes resulting from a wrong counting of characters.

I have the feeling I have no other choice but to do some manual coding here
Comment

Announcement