Combining large number of datasets with duplicates and different number of variables

Abhinav Gupta

Join Date: Jul 2018

Posts: 1
#1

07 Jul 2018, 17:06

I am trying to combine multiple datasets (around 40), each with a different number of variables. I cannot use merge because there are duplicates in the ID field in each dataset.
I am trying to use the append command with the code listed below.

However, each time Stata encounters a new variable, it seems to create a new version of that variable. For example, if the variable name was a and it was present in datasets 2 and 4, the final output has variables a_m2 and a_m4 instead of just a.
Is there a simple way to prevent this from happening?

I tried adding all variables to all datasets and filling them with zeros, but I cannot seem to get all the variable names from the different datasets.

clear all
cd C:\hrs2002\stata
! dir *.dct /a-d /b >C:\hrs2002\stata\filelistdct.txt
file open myfile using C:\hrs2002\stata\filelistdct.txt, read

/* read each dictionary file and save the result as a .dta file */
file read myfile line
local i = 1
while r(eof)==0 { /* while you're not at the end of the file */
display "`line'"
infile using "`line'"
save c:\hrs2002\data\H002_`i'.DTA
/* note: the next line only stores a string in a local macro; it does not run descsave */
local a " descsave, list(name, clean noobs noheader) "
local i = `i' + 1
file read myfile line
clear
}
file close myfile

/* after the loop, i is one more than the number of files saved */
local n = `i' - 1

cd C:\hrs2002\data
use H002_1, clear

forvalues j = 2/`n' {
append using H002_`j'
}
save wave_002, replace
Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#2

07 Jul 2018, 19:30

I have never known -append- to fabricate new variable names as you describe. I am of the belief that these distinctive names a_m2 and a_m4 are already present in the text files you are reading in in the first loop.
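
One way to check this is to list the variable names stored in each saved file without loading it. The following is only a sketch: it assumes the H002_* files from the first loop are in the current directory, and n_files is a placeholder that needs to be set to the actual number of files.

```stata
* Sketch only: n_files is a placeholder; set it to the actual number of files
local n_files 40
forvalues j = 1/`n_files' {
    * describe using reads the file header only;
    * the varlist option leaves the variable names in r(varlist)
    quietly describe using H002_`j', varlist
    display as text "H002_`j': " as result "`r(varlist)'"
}
```

If names such as a_m2 show up in this listing, they were created when the dictionary files were read in, not by -append-.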

I think that to get specific advice on how to manage your dilemma, you need to post examples of some of the files you are reading in. My suggestion is to use the -dataex- command to post examples from several of the H002_* data sets. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

It seems to me there are several possibilities. It may be that you really do need to -merge- rather than -append-, but that the merge key needs to consist of more than just the id variable so as to uniquely identify observations. Or perhaps this data is, indeed, most suited to -append-; in which case you will need to write some code to clean up the different data sets and harmonize their variable names, etc. That kind of code is typically messy, and it is highly idiosyncratic, depending on the most minute details of the data sets being harmonized. So, without seeing example data, it isn't possible to give concrete advice.
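
As a rough illustration of the two routes, here is a sketch using hypothetical variable names: id stands for the ID field and subid for a second variable that, together with id, might identify observations uniquely. Neither name comes from the posted code, and the real data sets may need a different key or further cleaning first.

```stata
* Hypothetical names: id and subid are placeholders, not taken from the actual files
use H002_1, clear

* How many observations share the same id?
duplicates report id

* Do id and subid together uniquely identify observations?
capture isid id subid
if _rc == 0 {
    * Route 1: merge on the composite key
    * (merge 1:1 also requires the key to be unique in H002_2)
    merge 1:1 id subid using H002_2
}
else {
    * Route 2: stack the files, recording which file each observation came from
    append using H002_2, generate(source)
}
```

With -append-, variables that share a name across files end up in a single column, and variables missing from one file are filled with missing values for that file's observations.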