Dropping duplicates with most missing and retaining non-missing cells from dropped duplicate

Lorens Elvang

Join Date: Apr 2023

Posts: 22
#1

Dropping duplicates with most missing and retaining non-missing cells from dropped duplicate

01 May 2023, 03:01

Hello all,

In a large dataset I'm currently working with there's a fair amount of duplicates of the variable 'bvd_id'. As I'm interested in merging this data on another dataset, using bvd_id, I need to drop duplicates.

For most observations there's missing data on one variable or another, hence in an attempt to retain the rows with the most data i have the following code:

bysort bvd_id: egen count = count if (country_iso != "." & bvd_sector != "." & incorporation_date != "." & employees0104 != "." & employees_last_avail != "." & revenue0104 != "." & revenue_last != ".")
sort count
duplicates drop bvd_id, force
drop count

When manually checking some of the duplicated rows, before dropping, i see that one row could have data on variable X where it is missing in the second row - but then second row has data on Y when first row doesn't. So what I'm looking to do is to combine the data to retain as much information as possible. In case of both rows is non-missing on X, i would want to prioritize the row with the least missing, i.e. the first one to be listed following my previously listed code.

I'm a bit lost as to how to do this. I would greaty appreciate any help!

Kind regards, Lorens
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35790
#2

01 May 2023, 04:36

This kind of problem usually calls for collapse not duplicates. One strategy is to keep the minimum and maximum of numeric variables; if they are equal you have just one distinct non-missing value.
Comment
Lorens Elvang

Join Date: Apr 2023

Posts: 22
#3

01 May 2023, 05:58

Hi Nick,
Thanks for your reply!

I was unfamiliar with collapse, but looks like exactly what i want. I have a follow up question if you don't mind.

For numeric values I'm leaning towards collapsing using means, and for string variables using first non-missing. Is there a way to collaps numeric and string variables simultaneously?

I've attempted the following, but & is not allowed for collapse:
local vars employees0104 employees_last revenue0104 revenue_last last_avail_year
local vars1 country_iso bvd_sector incorporation_date
collapse (mean) `vars' & collapse (firstnm) `vars1', by (bvd_id)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35790
#4

02 May 2023, 03:39

Code:

collapse (mean) `vars' (firstnm) `vars1', by (bvd_id)
1 like
Comment
Lorens Elvang

Join Date: Apr 2023

Posts: 22
#5

02 May 2023, 09:30

Cheers Nick!

Sorting wasn't needed when collapsing. As I now need to apply duplicates drop, I'm once again interested in sorting my data in a way so that, for duplicated rows of 'company', the rows with the least missing cells appears first. This way i would drop the least informative rows when using duplicates drop. Alternatively, if there's a way for duplicates drop to not keep the first row.

Listed below is an example of duplicated rows of a company. Here I want to keep the second row.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str486 company float employees0104 double employees_last float revenue0104 double revenue_last float last_avail_year "AIRBUS DEFENCE AND SPACE GMBH" . . . . . "AIRBUS DEFENCE AND SPACE GMBH" 110662 70054 32075000 35925000 2019 "AIRBUS DEFENCE AND SPACE GMBH" . 1 . . 2017 "AIRBUS DEFENCE AND SPACE GMBH" . 3104 . 1694116 2015 end

Thank you very much for your help so far
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35790
#6

02 May 2023, 11:18

I don't understand #5 as nothing in your data example looks like duplicated observations. It seems that you have panel data/
Comment

Lorens Elvang

Join Date: Apr 2023
Posts: 22

02 May 2023, 12:57

It is not panel data although I understand it might look like it, specifically due to the variable 'last_avail_year'. I believe the rows represent different subsidiaries of a branch, and just so happens to have 'last_avail_year' sorted in a way that makes it look like panel.

Here's a different example from my data, including more variables, which should hopefully make it clear that it is not panel data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str486 company str17 bvd_id str56 bvd_sector str10 incorporation_date float employees0104 double employees_last float revenue0104 double revenue_last float last_avail_year
"AJINOMOTO OMNICHEM" "BE2142568563" "Chemicals, Petroleum, Rubber & Plastic" " 1/10/1978" .   . .         .    .
"AJINOMOTO OMNICHEM" "BE2036255276" "Chemicals, Petroleum, Rubber & Plastic" "."          .   . .         .    .
"AJINOMOTO OMNICHEM" "BE0403078352" "Chemicals, Petroleum, Rubber & Plastic" " 4/19/1946" . 846 . 451205.55 2021
"AJINOMOTO OMNICHEM" "BE2142568662" "Chemicals, Petroleum, Rubber & Plastic" " 7/16/1980" .   . .         .    .
end

Nonetheless, I'd still love your input on how to sort it so that the row with the least missing is listed first. I've played around with rmiss and rmiss2, but can't quite get it to work correctly.

// Lorens

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35790
#8

02 May 2023, 13:03

There are function in egen to count missings across rows, meaning observations.
Comment

Announcement