Combine the values of several obserations to one observation

Sven Christens

Join Date: Mar 2017

Posts: 2
#1

Combine the values of several obserations to one observation

14 Mar 2017, 18:28

Dear stata-users,

I'm currently building a dataset. My problem is that in some cases I have several observations (same ID) that I want to reduce to one oberservation that contains the information of the original observations. For dummies this would mean that the dummy of the new observation equals 1 if any of the original observations takes a 1. For texts I would like to have a field that combines all texts of the orginal observations.

My data looks like this:
ID dum_1 text_1

1 0 abc

1 1 def

1 1 ghi

2 0 jkl

2 0 mno

3 1 pqr

4 0 stu

I want it to look like this:
ID dum_1 text_1

1 1 abc;def;ghi

2 0 jkl;mno

3 1 pqr

4 0 stu

I though about reshape and to create a new variable but I have many variables in the dataset and didn't know how to handle the different numbers of observations with the same ID.

Any suggestions are very welcome.
Sven
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#2

14 Mar 2017, 19:15

Code:

by ID, sort: egen new_dum_1 = max(dum_1) by ID: replace text_1 = text_1[_n-1] + ";" + text_1 if _n > 1 by ID: replace text_1 = text_1[_N] by ID: keep if _n == 1

Do read the User's Guide [U] and Getting Started [GS] sections of the PDF manuals that came with your Stata installation. They will acquaint you with the "bread and butter" commands that are used in Stata for basic data management and analysis. In particular, you will learn about -by-, and -egen-. The reading will take you some time, but it will be rapidly repaid with more effective and efficient use of Stata. You won't remember all the details, but you will likely retain enough to know what specific commands are likely to work for your problems, and you can then look up the details in the -help- files or manual sections.

Also, in the future, please post example data using the -dataex- command. HTML tables can be difficult to import to Stata, and, even then, often leave important details of the data set to the imagination. Run -ssc install dataex- to install the -dataex- command on your system. Then run -help dataex- to see the simple instructions for using it. Use it all the time, going forward, to show example data on this Forum.
Comment
Roger Chu

Join Date: Feb 2016

Posts: 84
#3

14 Mar 2017, 19:19

Hello Sven,

Your data is currently in "long" form -- that is multiple observations per ID. If you use reshape, you can get your data in "wide" form (1 observation per ID, but multiple variables). -reshape- will not give you what you say you want in the post, but it might give you something more useful, depending on your analysis. I'll go over how to do what you say you want, and then I'll give an example of -reshape-.

1) What you say you want: make 1 observation per ID, with dum_1 being the maximum of the dum_1 values, and text_1 being all the previous values concatenated with ";" as the delimiter.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(ID dum_1) str3 text_1 1 0 "abc" 1 1 "def" 1 1 "ghi" 2 0 "jkl" 2 0 "mno" 3 1 "pqr" 4 0 "stu" end bysort ID: egen max_dum = max(dum_1) gen concat_text = "" by ID: replace concat_text = cond(_n>1,concat_text[_n-1] + ";","") + text_1[_n] by ID: keep if _n==_N drop dum_1 text_1 rename max_dum dum_1 rename concat_text text_1

2) If you want to reshape instead, you need to create a variable that indicates the different observations within each group of ID numbers. You'll see in the code below where I create the variable obs. This is the indicator to identify different observations within groups of ID numbers.

Keep in mind, with the code below, the way in which you sort your observations prior to creating obs will dictate how your values of dum_1 and text_1 are associated with each other after your reshape. For example, in the code below, "abc" is the first value of text_1 for ID==1. "jkl" is the first value of text_1 for ID==2. So, when you reshape, "abc" and "jkl" will be the values for text_11 for ID==1 and ID==2 respectively.

But if instead your data is sorted so that "def" is the first value of text_1 for ID==1, while "jkl" is the first value of text_1 for ID==2, then "def" and "jkl" will be associated together as values for text_11 for ID==1 and ID==2 respectively.

This specific point might be confusing and hard to explain over a the internet. I would just suggest that you do the reshape below multiple times, but while sorting your data differently to see how the reshape ends up with different associations between different data points.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(ID dum_1) str3 text_1 1 0 "abc" 1 1 "def" 1 1 "ghi" 2 0 "jkl" 2 0 "mno" 3 1 "pqr" 4 0 "stu" end bysort ID: gen obs = _n reshape wide dum_1 text_1, i(ID) j(obs)

Last edited by Roger Chu; 14 Mar 2017, 19:20. Reason: Crossed with Clyde's post. His replace method for the text is simpler than mine. Not sure why I got fixated on creating a new variable!
Comment
Sven Christens

Join Date: Mar 2017

Posts: 2
#4

15 Mar 2017, 10:07

Thank you very much. This was exactly what I was looking for.
Comment
Lonneke Graas

Join Date: May 2019

Posts: 15
#5

13 Jun 2019, 03:03

Hi,

I have a similar question, but it's slightly different and I am hoping for some advice.
I am using Stata 14.2 on Windows.

Below is an example of my data. For some ID's there is only one observation, some ID's occur more than once.
I want to create a new variable, like 'totalfunction' that concatenates different functions so that I can see which ID has more than one function.
So for example, for ID 530 and ID 549, the 'totalfunction' variable should just be "3"
And for ID 587 the 'totalfunction' variable should be "3_12"

I tried to work with the suggestions posted here, but then I would get a 'totalfunction' of "3_3" for ID 549 for example.

I also tried to make dummies for every function and then create a new dummy like

Code:

'function3_12' = 1 if by ID function3 == 1 & function12 == 1

but I'm not sure what the exact syntax for such a command would be.

I would like to get totalfunction = 3_12 for every observation of ID 587, because later on I want to delete duplicates (based on another variable) but I want to keep track of the functions.

Hopefully the problem is clear and someone is willing to help me!

Kind regards, Lonneke

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long ID str2 function 530 "3" 549 "3" 549 "3" 553 "3" 554 "3" 556 "3" 568 "3" 568 "12" 587 "12" 587 "12" 587 "12" 587 "12" 587 "3" 587 "12" 600 "12" 600 "12" 600 "12" 600 "12" 600 "3" 600 "12" 600 "12" 600 "12" 600 "12" 608 "3" 608 "12" 610 "3" 610 "12" 610 "12" 613 "3" 617 "12" 617 "3" 623 "12" 623 "3" 627 "3" 627 "12" 634 "12" 634 "3" 639 "3" 639 "3" 655 "3" 669 "12" 669 "3" 681 "12" 681 "3" 687 "3" 688 "3" 688 "3" 692 "3" 692 "12" 693 "3" 694 "12" 694 "12" 694 "12" 694 "5" 694 "12" 694 "12" 697 "3" 713 "3" 716 "3" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" 717 "12" end

Last edited by Lonneke Graas; 13 Jun 2019, 03:06.
Comment

ID	dum_1	text_1
1	0	abc
1	1	def
1	1	ghi
2	0	jkl
2	0	mno
3	1	pqr
4	0	stu

ID	dum_1	text_1
1	1	abc;def;ghi
2	0	jkl;mno
3	1	pqr
4	0	stu

Announcement

Combine the values of several obserations to one observation

Comment

Comment

Comment

Comment