Aggregating duplicate observations comprising both string and numeric variables

cherry singhal

Join Date: Nov 2016
Posts: 47

Aggregating duplicate observations comprising both string and numeric variables

08 Nov 2016, 23:28

I have the following unbalanced panel data set.
This dataset contains both numeric and string variables. Variables in columns 2-6 are numeric and columns 7-8 are string.
I want to have unique observations under firmID-Year pair. But as can be seen, obs 2,3,4,5,7,8 are duplicates in terms of firmID-year pair.

ObsNo.	industryID	firmID	Year	Var1	var2	var3	var4
1	1641	1	2006	12	9.3	USD	QW
2	1641	1	2008	9	21	USD	DA
3	1641	1	2008	11	23	USD	CN
4	3486	2	2009	7	99	USD
5	3486	2	2009	5.7	45	USD
6	3486	2	2003	34	21.21	USD	YY
7	1641	3	2004	0.9	0	USD	RP
8	1641	3	2004	55		USD

I do not want to just drop the duplicate values. I want to aggregate the duplicate observations into a single observation. I tried to use collapse command, but it only considers numeric variables and not string variables. Hence my post here.

I would like to do the following:
1. For every numeric variable, I want to take a mean in case of numeric variables
2. In the case of string variables, I want to keep one of the two values (anyone of two works). If for a string variable, one value is missing and one is not, I want to keep the non-missing value.

So the aforementioned sample data should look like this-

ObsNo.	industryID	firmID	Year	Var1	var2	var3	var4
1	1641	1	2006	12	9.3	USD	DA
2	1641	1	2008	10	22	USD	DA
3	3486	2	2009	6.35	72	USD
4	3486	2	2003	34	21.21	USD	YY
5	1641	3	2004	27.95	0	USD	RP

Plase help. Thanks much!

Tags: aggregate, duplicate, numeric, panel, string

Robert Picard

Join Date: Mar 2014
Posts: 1536

09 Nov 2016, 08:12

I tried to use collapse command, but it only considers numeric variables and not string variables.

That's not correct. This appears to do what you want:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte obsno int industryid byte firmid int year float(var1 var2) str3 var3 str2 var4
1 1641 1 2006  12   9.3 "USD" "QW"
2 1641 1 2008   9    21 "USD" "DA"
3 1641 1 2008  11    23 "USD" "CN"
4 3486 2 2009   7    99 "USD" ""  
5 3486 2 2009 5.7    45 "USD" ""  
6 3486 2 2003  34 21.21 "USD" "YY"
7 1641 3 2004  .9     0 "USD" "RP"
8 1641 3 2004  55     . "USD" ""  
end

collapse (mean) var1 var2 (firstnm) industryid var3 var4, by(firmid year)

Comment

cherry singhal

Join Date: Nov 2016

Posts: 47
#3

09 Nov 2016, 13:48

Thank you! I had incomplete information on the collapse command. What is also great is that the collapse command does not alter data in non-duplicate observations.
I have one further question, though.
In my post, I gave a sample dataset to explain my problem. But in my real dataset, I have 621 variables in total. So I would need to specify all of them in the collapse command, which makes it error-prone and time-consuming. Is there a quicker/more efficient way to mention all variables in the collapse command without me having to create two lists, one for taking the mean of numeric variables and one for string variables? Thanks again.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35708
#4

10 Nov 2016, 08:29

Code:

help ds

lets you identify string and numeric variables separately.
Comment

Announcement

Aggregating duplicate observations comprising both string and numeric variables

Comment

Comment

Comment