Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Drop duplicates across two variables

    Hello,

    I am working with monthly data where I compute the change in total industry employment between various industries from one month to the next.

    Below shows two rows of my spreadsheet. L_industry is the industry in the month before and industry the one in the current month (e.g., 1681 individuals moved from industry 170 to 0 between two months and 1140 into the other directions - indicated by tot_trans and tot_trans_1; the (absolute) difference is then 541).
    For each combination of industry and L_industry, there is an additional row in which transitions go into the other direction, in the first case from 0 to 170.

    I am interested in the difference of industry transitons in both directions and managed to prepare the data in a way that I have total transitions into both directions in one row. However I want to drop all observations where it goes into the other direction. That means that I want to drop the cases where transitions go from 0 to 170, 170 to 6170 etc., as it contains the same information.

    I thought about dropping duplicates of d_tot_trans but there are numerous observations with different industries where d_tot_trans is the same.

    My current approach is to create two groups, group_1 is the industry group and group_2 the L_industry group. Group_1 and group_2 are equal to 1 both in the case of transitions from 170 to 0 for the former of transitions from 0 to 170 for the latter.

    I thus want to drop observations where group_2 takes on a value that already exist in group_1, e.g., 1 for the example above. I am not sure however how to write the code for this and would appretiate any help with that or an alternative approach to achieve my goal. Thank you!


    industry L_industry tot_trans tot_trans_1 d_tot_trans group_1 group_2
    0 170 1681 1140 541 1 257
    170 6170 16 14 2 363 14898

  • #2
    Use -egen rowmin- and -egen rowmax- to extract the origin industry and destination industry to rearrange them into identical order. Then remove the duplicates. Here is an example:

    Code:
    clear
    input industry L_industry
    0 170
    170 0
    170 6170
    6170 170
    end
    
    * Isolate the bigger and the smaller industry code:
    egen ind_lower = rowmin(industry L_industry)
    egen ind_upper = rowmax(industry L_industry)
    
    * Drop duplicates:
    duplicates drop ind_lower ind_upper, force
    
    * Look at data:
    list
    We started with:
    Code:
         +---------------------+
         | industry   L_indu~y |
         |---------------------|
      1. |        0        170 |
      2. |      170          0 |
      3. |      170       6170 |
      4. |     6170        170 |
         +---------------------+
    The two -egen- commands added:

    Code:
         +-------------------------------------------+
         | industry   L_indu~y   ind_lo~r   ind_up~r |
         |-------------------------------------------|
      1. |        0        170          0        170 |
      2. |      170          0          0        170 |
      3. |      170       6170        170       6170 |
      4. |     6170        170        170       6170 |
         +-------------------------------------------+
    Results after removing duplicates defined by having the same ind_lower and ind_upper:
    Code:
         +-------------------------------------------+
         | industry   L_indu~y   ind_lo~r   ind_up~r |
         |-------------------------------------------|
      1. |        0        170          0        170 |
      2. |      170       6170        170       6170 |
         +-------------------------------------------+
    Last edited by Ken Chui; 16 Jun 2021, 07:44.

    Comment


    • #3
      Thank you very much, Ken! That is exactly what I needed.

      Comment

      Working...
      X