  • merge long and wide dataset

    Hi all,

    I have multiple wide and long datasets stored in a single folder, and I am trying to write code to merge them all. Some datasets have one line per participant (unique id = identano), and others have multiple lines per participant (still identano for each participant, plus visitnum for the visit number: baseline, visit 1, etc.). I would like Stata to recognize automatically whether a dataset is wide or long and build my single working dataset.

    I wrote the code below, but it is still not working. Can someone here help me? I imagine there are also many ways to make this code simpler and shorter. Thanks so much to you all!

    Best,
    Pierre


    Code:
    local dir "xx"
    cd "`dir'"

    local files : dir "`dir'" files "*.dta"

    local base_datasets ""
    local visit_datasets ""

    foreach file of local files {
        use "`file'", clear

        capture confirm variable visitnum
        if !_rc {
            local visit_datasets "`visit_datasets' `file'"
        }
        else {
            local base_datasets "`base_datasets' `file'"
        }
    }

    foreach base_file of local base_datasets {
        if `base_merged' == 0 {
            use "`base_file'", clear
            local base_merged = 1
        }
        else {
            merge 1:1 identano using "`base_file'", assert(match using) nogenerate
        }
    }

    save base_temp.dta, replace

    local visit_merged = 0
    foreach visit_file of local visit_datasets {
        use "`visit_file'", clear

        reshape wide *, i(identano) j(visitnum)

        if `visit_merged' == 0 {
            save visit_temp.dta, replace
            local visit_merged = 1
        }
        else {
            merge 1:1 identano using visit_temp.dta, assert(match using) nogenerate
            save visit_temp.dta, replace
        }
    }

    use base_temp.dta, clear

    merge 1:1 identano using visit_temp.dta, assert(match using) nogenerate

    save dataset_final.dta, replace

    erase base_temp.dta
    erase visit_temp.dta

  • #2
    Here are some suggestions for a follow-up post that may help people answer your question:
    1) Describe in detail what you mean by "not working." There are many ways in which something can "not work," and answering without this knowledge is very difficult.
    2) To aid in #1, using -set trace on- would help. In fact, the information it gives might enable you to answer your own question.
    3) Explain how and where your local base_merged is assigned a value. (Perhaps you think base_merged has a default value of 0, but that is not true.) If any of your locals are assigned values outside the code shown here, the same advice applies to them.
    4) Show examples of some of your data files, using -dataex-. Among other things, this will help in understanding the features of your files that make it possible to distinguish your different types of files automatically.
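
    A minimal sketch of the kind of initialization point 3 is about (hypothetical; it assumes the intent was to -use- the first base file and -merge- the rest):

    Code:
    * locals have no default value in Stata, so initialize before testing
    local base_merged = 0
    foreach base_file of local base_datasets {
        if `base_merged' == 0 {
            use "`base_file'", clear      // first file becomes the master
            local base_merged = 1
        }
        else {
            merge 1:1 identano using "`base_file'", assert(match using) nogenerate
        }
    }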



    • #3
      Hi Mike,

      Thanks for your response. I want to merge multiple long and wide datasets. My wide datasets, like the one called infl.dta below, have one line per participant (identano).

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int identano float infl
      759 46.48893
      462 47.54221
       20 53.46077
      288 64.74852
      563 61.09239
      end
      Then I have long datasets like the one below, where each participant is still identified by identano, but each participant has two lines: one for visitnum 0 (the baseline visit) and one for visitnum 1 (the year 1 data); of course, there are multiple variables. I would like to merge my multiple long and multiple wide datasets automatically (they are all located in the same folder) to produce a single dataset, which would be my unique working dataset for all analyses.

      Example long dataset, called mmse here.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int identano byte visitnum str1(mmseres_1 mmseres_2)
      805 0 "1" "1"
      805 1 "1" "1"
      489 0 "1" "1"
      489 1 "1" "1"
       21 0 "1" "1"
      end
      I imagine this is quite simple and very basic: just merge two long datasets and two wide datasets, with a unique participant identifier (identano) and a unique visit identifier (visitnum).

      Any idea how I could do that? It would save me an enormous amount of time. Thanks so much to you all!

      Best
      Pierre



      • #4
        Well, the following code, begun in the working directory that contains the files, can do this for you:

        Code:
        clear*
        
        local files: dir "." files "*.dta"
        local first_file: word 1 of `files'
        local files: list files - first_file
        
        
        use `"`first_file'"'
        confirm var identano
        capture confirm var visitnum
        if c(rc) == 0 {
            local master_dimensions 2
        }
        else {
            local master_dimensions 1
        }
        
        foreach f of local files {
            capture isid identano using `"`f'"'
            if c(rc) == 0 {
                local using_dimensions 1
            }
            else {
                capture isid identano visitnum using `"`f'"'
                if c(rc) == 0 {
                    local using_dimensions 2
                }
                else {
                    display as error `"`f' not uniquely identifiable with variables identano and visitnum"'
                    exit 9
                }
            }
            
            display `"`f', `master_dimensions', `using_dimensions'"'
            if `master_dimensions' == 1 & `using_dimensions' == 1 {
                merge 1:1 identano using `"`f'"', nogenerate
            }
            else if `master_dimensions' == 1 & `using_dimensions' == 2 {
                merge 1:m identano using `"`f'"', nogenerate
                local master_dimensions = 2
            }
            else if `master_dimensions' == 2 & `using_dimensions' == 1 {
                merge m:1 identano using `"`f'"', nogenerate
            }
            else {
                merge 1:1 identano visitnum using `"`f'"', nogenerate
            }
        }
        That said, I advise against doing this. If there really are so many files that this will save you a lot of time, then it is very likely that there are incompatibilities among these files, or other problems combining them, that need your attention, not blitzing over. In my experience, large collections of files, even from reputable sources, are seldom compatible as received and usually need considerable cleaning to make them compatible. And, in my experience, clinical datasets are among the worst in this regard.

        By incompatible I mean files that contain what is supposed to be the same variable, but with slightly different names in the two files. Or, worse, a variable that has the same name in two files but actually refers to different things in each. Or a variable that is numeric in one file while the corresponding variable (same name) is string in another. Or two variables that are both numeric but represent different coding schemes (conflicting value labels, for example) for the same construct. Also, you are proposing to -merge- all of these files: if two of them contain the same variable, is one of them supposed to update or replace the other? If so, the order in which they are handled is important, and the -merge- command needs to be modified with the -update- or -update replace- options accordingly. But that same specification could be wrong as applied to other files. So unless you have already considered and resolved all these potential problems, automating this mass merge is really asking for trouble.
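
        A sketch of the -update- variants mentioned above (the file names here are hypothetical):

        Code:
        * update: using-file values only fill in missing values in the master
        merge 1:1 identano using "labs_2021.dta", update nogenerate

        * update replace: nonmissing using-file values overwrite conflicting
        * master values instead of only filling in missings
        merge 1:1 identano using "labs_2022.dta", update replace nogenerate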

        To help you identify some of the incompatibilities that may be present among these files, I recommend you install Mark Chatfield's -precombine- command from SSC. Running it on the list of files will alert you to potential incompatibilities so you can fix them first. (It will not, however, help you with the problem of update/replace issues.)
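
        For example, a minimal way to run -precombine- across every .dta file in the working directory might be (assuming the file names contain no spaces):

        Code:
        * one-time installation from SSC
        ssc install precombine

        * check all .dta files in the current directory for incompatibilities
        local files : dir "." files "*.dta"
        precombine `files'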



        • #5
          Hello Clyde,

          Thank you so much for all your help. It is much appreciated. I will really pay attention to what you said, especially since I have many datasets. The precombine command is very helpful too.

          Thank you again!

          Best,
          Pierre
