Drop duplicates across two variables

Clemens Oberhuemer

Join Date: May 2021

Posts: 4
#1

Drop duplicates across two variables

16 Jun 2021, 06:55

Hello,

I am working with monthly data where I compute the change in total industry employment between various industries from one month to the next.

Below shows two rows of my spreadsheet. L_industry is the industry in the month before and industry the one in the current month (e.g., 1681 individuals moved from industry 170 to 0 between two months and 1140 into the other directions - indicated by tot_trans and tot_trans_1; the (absolute) difference is then 541).
For each combination of industry and L_industry, there is an additional row in which transitions go into the other direction, in the first case from 0 to 170.

I am interested in the difference of industry transitons in both directions and managed to prepare the data in a way that I have total transitions into both directions in one row. However I want to drop all observations where it goes into the other direction. That means that I want to drop the cases where transitions go from 0 to 170, 170 to 6170 etc., as it contains the same information.

I thought about dropping duplicates of d_tot_trans but there are numerous observations with different industries where d_tot_trans is the same.

My current approach is to create two groups, group_1 is the industry group and group_2 the L_industry group. Group_1 and group_2 are equal to 1 both in the case of transitions from 170 to 0 for the former of transitions from 0 to 170 for the latter.

I thus want to drop observations where group_2 takes on a value that already exist in group_1, e.g., 1 for the example above. I am not sure however how to write the code for this and would appretiate any help with that or an alternative approach to achieve my goal. Thank you!

industry L_industry tot_trans tot_trans_1 d_tot_trans group_1 group_2

0 170 1681 1140 541 1 257

170 6170 16 14 2 363 14898
Tags: None

Ken Chui

Join Date: Aug 2014
Posts: 1058

16 Jun 2021, 07:41

Use -egen rowmin- and -egen rowmax- to extract the origin industry and destination industry to rearrange them into identical order. Then remove the duplicates. Here is an example:

Code:

clear
input industry L_industry
0 170
170 0
170 6170
6170 170
end

* Isolate the bigger and the smaller industry code:
egen ind_lower = rowmin(industry L_industry)
egen ind_upper = rowmax(industry L_industry)

* Drop duplicates:
duplicates drop ind_lower ind_upper, force

* Look at data:
list

We started with:

Code:

     +---------------------+
     | industry   L_indu~y |
     |---------------------|
  1. |        0        170 |
  2. |      170          0 |
  3. |      170       6170 |
  4. |     6170        170 |
     +---------------------+

The two -egen- commands added:

Code:

     +-------------------------------------------+
     | industry   L_indu~y   ind_lo~r   ind_up~r |
     |-------------------------------------------|
  1. |        0        170          0        170 |
  2. |      170          0          0        170 |
  3. |      170       6170        170       6170 |
  4. |     6170        170        170       6170 |
     +-------------------------------------------+

Results after removing duplicates defined by having the same ind_lower and ind_upper:

Code:

     +-------------------------------------------+
     | industry   L_indu~y   ind_lo~r   ind_up~r |
     |-------------------------------------------|
  1. |        0        170          0        170 |
  2. |      170       6170        170       6170 |
     +-------------------------------------------+

Last edited by Ken Chui; 16 Jun 2021, 07:44.

Comment

Clemens Oberhuemer

Join Date: May 2021

Posts: 4
#3

17 Jun 2021, 06:56

Thank you very much, Ken! That is exactly what I needed.
Comment

industry	L_industry	tot_trans	tot_trans_1	d_tot_trans	group_1	group_2
0	170	1681	1140	541	1	257
170	6170	16	14	2	363	14898

Announcement

Drop duplicates across two variables

Comment

Comment