Problems with merging data and multiple observations per firm per year

Jana Schue

Join Date: Oct 2021

Posts: 116
#1

Problems with merging data and multiple observations per firm per year

01 Feb 2023, 08:12

Hello all,

I am working with panel data and have a problem with one of my databases, namely the CDP database.
For example, a section of my database looks like this:

id fyear purpose
87 2009 0
138 2009 1
138 2009 1
138 2009 1
138 2009 1
252 2009 1
333 2009 1
333 2009 1
345 2009 0
462 2009 0
593 2009 0
629 2008 0
674 2009 0
689 2009 0
921 2009 1
921 2009 1
921 2009 1
921 2009 1
921 2009 1
921 2009 1
921 2009 1

There are several observations per year for a company (identified by id). For me it is only relevant if a company has purpose=1 at least once in a year. This means that they have voluntarily invested in carbon offsetting, which later becomes my independent variable.

I need to merge this database in a further step. For this I have an excel file that assigns a gvkey to the identification in the CDP database, i.e. id.
Is there a way that in this merge I not only assign the respective gvkey to the id, but also keep only one observation per company per year? However, it would still be important here that if company xy has, for example, in year 2009 once purpose=1 and once purpose=0, only the observation with purpose=1 remains. Otherwise, I would find out for the year that the company in question did not invest, even though it did so once.

I have already tried to merge, this is my code (where "CDP_gvkey.xlsx" is the database where the mapping of id and gvkey can be found, and "CDP.dta" is my database with id, fyear and purpose)

import excel "CDP_gvkey.xlsx", sheet("Matching North America") firstrow
keep GVKEY Account_number
duplicates drop
rename GVKEY gvkey
rename Account_number id
destring gvkey id, replace
merge 1:m id using "CDP.dta", nogenerate
drop if gvkey==.
drop if fyear==.
duplicates drop fyear gvkey, force
save "CDP_final.dta", replace
clear

Here it might be interesting to mention that after "drop if fyear==." I looked into the data and saw that a large number of the companies, if they had multiple observations in a specific year, then always had the same ones (i.e. always 0 or always 1). However, there are still a few that have purpose=1 several times in one year and then purpose=0 once in the same year. If I then execute the command "duplicates drop fyear gvkey, force", sometimes exactly the one observation with purpose=0 remains for the company in the year. Is there a way around this?

I appreciate any help very much!

Many greetings,
Jana

Last edited by Jana Schue; 01 Feb 2023, 08:54. Reason: I read through my question again and was able to formulate a few places more clearly
Tags: None

William Lisowski

Join Date: Dec 2014
Posts: 10150

01 Feb 2023, 10:30

Perhaps this example will demonstrate useful technique to reduce the data you show to one observation for each id and fyear.

Code:

. collapse (max) purpose, by(id fyear)

. list, clean noobs

     id   fyear   purpose  
     87    2009         0  
    138    2009         1  
    252    2009         1  
    333    2009         1  
    345    2009         0  
    462    2009         0  
    593    2009         0  
    629    2008         0  
    674    2009         0  
    689    2009         0  
    921    2009         1  

.

Comment

Jana Schue

Join Date: Oct 2021

Posts: 116
#3

01 Feb 2023, 13:02

Thank you very much William! It worked!
Comment

Announcement

Problems with merging data and multiple observations per firm per year

Comment

Comment