Hello all,
In a large dataset I'm currently working with there's a fair amount of duplicates of the variable 'bvd_id'. As I'm interested in merging this data on another dataset, using bvd_id, I need to drop duplicates.
For most observations there's missing data on one variable or another, hence in an attempt to retain the rows with the most data i have the following code:
bysort bvd_id: egen count = count if (country_iso != "." & bvd_sector != "." & incorporation_date != "." & employees0104 != "." & employees_last_avail != "." & revenue0104 != "." & revenue_last != ".")
sort count
duplicates drop bvd_id, force
drop count
When manually checking some of the duplicated rows, before dropping, i see that one row could have data on variable X where it is missing in the second row - but then second row has data on Y when first row doesn't. So what I'm looking to do is to combine the data to retain as much information as possible. In case of both rows is non-missing on X, i would want to prioritize the row with the least missing, i.e. the first one to be listed following my previously listed code.
I'm a bit lost as to how to do this. I would greaty appreciate any help!
Kind regards, Lorens
In a large dataset I'm currently working with there's a fair amount of duplicates of the variable 'bvd_id'. As I'm interested in merging this data on another dataset, using bvd_id, I need to drop duplicates.
For most observations there's missing data on one variable or another, hence in an attempt to retain the rows with the most data i have the following code:
bysort bvd_id: egen count = count if (country_iso != "." & bvd_sector != "." & incorporation_date != "." & employees0104 != "." & employees_last_avail != "." & revenue0104 != "." & revenue_last != ".")
sort count
duplicates drop bvd_id, force
drop count
When manually checking some of the duplicated rows, before dropping, i see that one row could have data on variable X where it is missing in the second row - but then second row has data on Y when first row doesn't. So what I'm looking to do is to combine the data to retain as much information as possible. In case of both rows is non-missing on X, i would want to prioritize the row with the least missing, i.e. the first one to be listed following my previously listed code.
I'm a bit lost as to how to do this. I would greaty appreciate any help!
Kind regards, Lorens
Comment