Combining duplicate obs across multiple vars in wide format

Simone Matecna

Join Date: Jul 2020
Posts: 7

Combining duplicate obs across multiple vars in wide format

15 Apr 2021, 13:08

Hello,

I have a dataset that is currently in wide format in the following form:

ID	name	dob	ID1_1	name1_1	dob1_1	ID2_1	name2_1	dob2_1	ID3_1	name3_1	dob3_1
i123	Hubert	1940
i234	Jenny	1970							i345	Sam	1963
i235	Paul	1962	i768	Katy	1950	i768	Katy	1950
i435	Angela	1980	i980	Megan	1942				i980	Megan	1942

As you can see, there are obs that are duplicated within rows that are associated with a given ID. You can see that there are also gaps in between the obs, i.e. the data is stretched longer across the rows than it could be. The dataset is unique at the level of ID, and I'd like to keep the associated relationships in wide format. What I'm trying to do is concatenate / combine the duplicate obs that occur across the rows (and the different var names), so that I can have a dataset that would look like this:

ID	name	dob	ID1_1	name1_1	dob1_1	ID2_1	name2_1	dob2_1	ID3_1	name3_1	dob3_1
i123	Hubert	1940
i234	Jenny	1970	i345	Sam	1963
i235	Paul	1962	i768	Katy	1950
i435	Angela	1980	i980	Megan	1942	i980	Megan	1942

Essentially I just want to move everything "up" towards the identifying ID for ease of inspection / viewing, and most importantly to remove the duplicates that occur across the different var names (which are associated by the var`i'_`num')

Any and all help is much appreciated!

Tags: categorical, foreach, loop, stata, wide

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

15 Apr 2021, 13:57

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str5 id str7 name int dob str5 id1_1 str6 name1_1 int dob1_1 str5(id2_1 name2_1) int dob2_1 str5 id3_1 str6 name3_1 int dob3_1 "i123 " "Hubert " 1940 "" "" . "" "" . "" "" . "i234 " "Jenny " 1970 "" "" . "" "" . "i345 " "Sam " 1963 "i235 " "Paul " 1962 "i768 " "Katy " 1950 "i768 " "Katy " 1950 "" "" . "i435 " "Angela " 1980 "i980 " "Megan " 1942 "" "" . "i980 " "Megan " 1942 end gen long obs_no = _n rename (id name dob) =0_1 reshape long id@_1 name@_1 dob@_1, i(obs_no) j(_j) drop if missing(id_1) & missing(name_1) & missing(dob_1) by obs_no id_1 name_1 dob_1 (_j), sort: keep if _n == 1 by obs_no (_j), sort: replace _j = _n-1 reshape wide rename *0_1 *

Frankly, unless you have a clear and persuasive reason for retaining the wide layout, you should skip the last two commands and stick with long layout. There aren't many things that can be done easily (or at all) with wide data in Stata. As you can see, this task of pruning the data, was quite simple and short once the data were -reshape-d to long. Doing it directly in wide layout would have required several loops and a lot of more or less unreadable code. Most things in Stata are like that.

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Simone Matecna

Join Date: Jul 2020

Posts: 7
#3

15 Apr 2021, 14:16

Thank you, Clyde, for your response. My data is sensitive and as such I cannot post the direct output from the dataex command, unfortunately.

I tried the code that you provided, but it seems to have gotten me back to exactly where I was before I merged the datasets (in wide format) onto the IDs and did not do anything to help. I think the issue here is that you have "0_1" and my data goes from name1_1 to name22_5 (similarly for dob and ID). Can you please explain how the
reshape long id@_1 name@_1 dob@_1, i(obs_no) j(_j) accounts for the numbers as they grow? I also thought of reshaping long, but thought this would need to be done in a loop to account for all the vars as they grow from 1- 22, and 1-5, respectively (i.e. name1_1 to name22_5)
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

15 Apr 2021, 15:51

Let me point out that the problem stems from the fact that your example data and your description in post #1 gave no indication at all that the variable names could end in anything other than "_1". Not until post #3 did you make that apparent.

You could help us help you by providing a data example - using dataex as Clyde recommended - that better represents the variable names that occur in your data.
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#5

15 Apr 2021, 16:25

Also, in light of the new information given in #3, you need to clarify how the results should look. Suppose in some observation everything between id1_1 name1_1 dob1_1 and id1_2 name1_2 dob1_2 is missing. Do you want id1_2 name1_2 dob1_2 to "move left" to become id2_1 name2_1 dob2_1? Or should the _2, _3, _4, and _5 series maintain their separate "identities" with data "moving left" only within, but not among, these series?

I am familiar with confidentiality issues, and understand that you cannot post your actual data. But just as you made up a table of fake data to show in #1, you can make a table of fake data in the Stata data editor having the structure of your actual data set and then use -dataex- to show that. Also, no need to show all of the id* dob* and name* variables. Just showing enough series that one can see the variation of both numerical indices in the names would suffice.
Comment

Announcement

Combining duplicate obs across multiple vars in wide format

Comment

Comment

Comment

Comment